Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping Broad IDs across different versions #13

Merged

Conversation

niranjchandrasekaran
Copy link
Member

@niranjchandrasekaran niranjchandrasekaran commented Apr 14, 2020

Aims to address #11

I have created a mapping table connecting four fields - broad_id, pert_iname, target, moa across each version of repurposing data using InChIKey14 as the common field. The map is a single table with 13 fields. Fields with multiple values are combined into a single string, separated by pipes. I have not run any tests to confirm whether the code generates meaningful results. I will do that as the next step.

A couple of points to note

  • Each row does not correspond to a single Broad ID.
    • This may not be desirable when data frames are merged by inbuilt join commands using Broad IDs.
  • When two fields have multiple values, it not clear how they map to one another. For e.g. if there are 2 pert_inames and 2 moas, the current mapping table does not say which pert_iname corresponds to which moa.

@gwaygenomics, @shntnu Do you think this mapping would work? Or should I take a different approach?

- The 2017, 2018 and 2020 versions are merged using the first 14 characters of the InChIKey
- broad_id, pert_iname, moa and target of all the version are included
- Fields with multiple values are pipe separated
@gwaybio
Copy link
Member

gwaybio commented Apr 14, 2020

Looking good so far @niranjchandrasekaran - a couple notes:

Each row does not correspond to a single Broad ID.

Is it true that each row corresponds to a single most recent version Broad ID?

if there are 2 pert_inames and 2 moas, the current mapping table does not say which pert_iname corresponds to which moa

Oh interesting! I didn't know there could be two pert_inames. For cases where there are multiple moas or targets, for the purposes of the map, I think it is ok to keep them pipe delimited.

Two more points

Please generate an nbconverted .py file

Run the following command and add the resulting file to this PR

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts/nbconverted *.ipynb

This will help with making inline comments in the pull request review.

What to do with deprecated_broad_id column?

Do you think we should handle this column in the map?

@shntnu
Copy link
Collaborator

shntnu commented Apr 14, 2020

Some checks (WIP)

x %>% count %>% knitr::kable()
n
6997
x %>% distinct(InChIKey14) %>% count %>% knitr::kable()
n
6997
x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")

x %>% 
  summarise_each(~sum(is.na(.))) %>% 
  select(matches("^broad_id")) %>% 
  pivot_longer(cols = everything(), values_to = "na_count") %>% 
  knitr::kable()
name na_count
broad_id_2017 1371
broad_id_2018 966
broad_id_2020 262

@shntnu
Copy link
Collaborator

shntnu commented Apr 14, 2020

x %>% 
  select(matches("^pert_iname")) %>% 
  rowwise() %>% 
  mutate_each(function(x) str_detect(x, "\\|")) %>% 
  ungroup() %>% 
  summarise_each(~sum(., na.rm = T)) %>% 
  pivot_longer(cols = everything(), values_to = "multiple_inames") %>% 
  knitr::kable()  
name multiple_inames
pert_iname_2017 111
pert_iname_2018 113
pert_iname_2020 127

@shntnu
Copy link
Collaborator

shntnu commented Apr 14, 2020

Is it true that each row corresponds to a single most recent version Broad ID?

IIUC, no: each row corresponds to a sample that at some point in history had a broad_id assigned to it.

Some NAs in each version

x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")

x %>% 
  summarise_each(~sum(is.na(.))) %>% 
  select(matches("^broad_id")) %>% 
  pivot_longer(cols = everything(), values_to = "na_count") %>% 
  knitr::kable()
name na_count
broad_id_2017 1371
broad_id_2018 966
broad_id_2020 262

No InChIKey14 has all broad_ids as NA

x %>% 
  select(InChIKey14, matches("^broad_id")) %>% 
  mutate_at(vars(matches(("^broad_id"))), is.na) %>% 
  pivot_longer(cols = matches(("^broad_id"))) %>%
  group_by(InChIKey14) %>%
  summarize(all_broad_id_are_na = all(value)) %>%
  filter(all_broad_id_are_na) %>%
  count %>%
  knitr::kable()
n
0

@niranjchandrasekaran
Copy link
Member Author

Is it true that each row corresponds to a single most recent version Broad ID?

There are 262 rows where the broad_id_2020 is NA while there are 133 rows with multiple broad_id_2020s

print('Number of NA: %d' % (len(broad_id_2020)-broad_id_2020.count()))
broad_id = broad_id_2020.fillna('')
print('Rows with multiple broad_id_2020: %d' % len(broad_id.loc[broad_id.broad_id_2020.str.contains('\|')]))

generates

Number of NA: 262
Rows with multiple broad_id_2020: 133

Please generate an nbconverted .py file

Done

What to do with deprecated_broad_id column?
Do you think we should handle this column in the map?

I will check if the InChIKey14 of every deprecated_broad_id maps to the same InChIKey14 when the ID was still in use. If they do, then there won't be any additional information in the deprecated_broad_id field.

@niranjchandrasekaran
Copy link
Member Author

I found the InChIKey14 of most of the deprecated_broad_ids from the 2018 version to be present in the 2017 version, except for the following

InChIKey14 deprecated_broad_id_2018
0 CAIDFDPKOGBHAT BRD-A09283158
1 CLKOFPXJLQSYAH BRD-A21002158
2 DUUFLTUTPZFOAH BRD-A39845846
3 DVUJFSXMTBOIKW BRD-U66818247
4 IRPYFWIZKIOHQN BRD-U68276368
5 JFAHBAVHONXXFG BRD-U22164625
6 KNLORLKZMHLRBH BRD-U77299140
7 MEOTUCBFWIQPRL BRD-M52026403
8 NOSAJPUYIASWEH BRD-A52946717,BRD-K10549975
9 NVNLLIYOARQCIX BRD-U59029435
10 ONCSSPOCYKWCTO BRD-A26049141
11 PQYCRDPLPKGSME BRD-K94313941
12 RMRCNWBMXRMIRW BRD-A59561452,BRD-M14495474
13 VPZCKRKZFRCZMX BRD-A09838294,BRD-K35776642
14 XTLSVXWJVRKFIP BRD-K79233274
15 ZNRGQMMCGHDTEI BRD-A79226577,BRD-M19407551

This would mean that there are deprecated_broad_ids that do not correspond to broad_ids that were in use. I am not sure what to make of this.

Similarly, apart from the following, I found the InChIKey of most of the deprecated_broad_ids from the 2020 version to be present in the 2018 version.

InChIKey14 deprecated_broad_id_2020
0 AUZDMWWRYCUANI BRD-K34751753
1 BKWHZSFINQBEQH BRD-A25576662
2 DAIGSUHCWOZEQG BRD-K98733875,BRD-K99455969
3 JBIWCJUYHHGXTC BRD-M28128163
4 KTGRHKOEFSJQNS BRD-A47598013
5 KTUXFHGUHYRDNF BRD-A56621826,BRD-A78694248
6 OPMVDWXPLVKNQN BRD-K05218031
7 OSZMNJRKIPAVOS BRD-A40504327
8 PTQGBRPUALXNCZ BRD-K01666924
9 SZINUGQCTHLQAZ BRD-K93666001
10 TWHPPRXFYWSOQF BRD-K83146339,BRD-K88014802,BRD-M65265990
11 ZCHZSBUNJOCWCW BRD-M37437038,BRD-M92812175
12 ZIUSSTSXXLLKKK BRD-K69690935
13 ZQHDVMCIDXEICF BRD-A99182808

@shntnu
Copy link
Collaborator

shntnu commented Apr 15, 2020

This would mean that there are deprecated_broad_ids that do not correspond to broad_ids that were in use. I am not sure what to make of this.

Looks like the deprecated_broad_id column was added in 2018; there was no such column in 2017. So it is possible those correspond to ids that were already deprecated in 2017.

@shntnu
Copy link
Collaborator

shntnu commented Apr 15, 2020

@niranjchandrasekaran I just noticed that there are two 2018 files on https://clue.io/repurposing#download-data

I don't recollect seeing this before. Do you read both?

Latest version: 3/24/2020
Archived versions: 9/7/2018 | 5/16/2018 | 3/27/2017

@shntnu
Copy link
Collaborator

shntnu commented Apr 15, 2020

My notebook dump is below. Sorry, ran out of time to document what I am doing here, but hope it helps

Click to expand

R Notebook

library(glue)
library(magrittr)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::collapse()  masks glue::collapse()
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
clue_2017 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20170327.txt", comment = "!", guess_max = 20000) 
## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double()
## )
clue_2018b <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180907.txt", comment = "!", guess_max = 20000) 
## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_character(),
##   deprecated_broad_id = col_character()
## )
clue_2018a <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180516.txt", comment = "!", guess_max = 20000) 
## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double(),
##   deprecated_broad_id = col_character()
## )
clue_2020 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt", comment = "!", guess_max = 20000)
## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double(),
##   deprecated_broad_id = col_character()
## )

Some rows have InChIKey as NA! Not sure how this should be handled.
For now, I propose we drop it unless they are present in the Cell
Painting dataset

clue_2017 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
broad_id pert_iname qc_incompatible purity vendor catalog_no vendor_name expected_mass smiles InChIKey pubchem_cid
BRD-U08520523-000-01-0 colesevalam 1 0 MicroSource 1505905 COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml) NA NA NA NA
BRD-U59029435-000-01-7 nisin 0 NA Cayman 16532 NA NA NA NA NA
BRD-U51753822-000-01-1 phosphatidylcholine 1 0 Selleck S4411 Phosphatidylcholine NA NA NA NA
BRD-U45393375-000-01-6 sevelamer 1 0 Selleck S4129 Sevelamer HCl NA NA NA NA
BRD-U59360347-000-01-3 teicoplanin-a2-3 0 0 Selleck S1399 Teicoplanin 2615.3 NA NA NA
BRD-U25960968-000-01-9 tyloxapol 1 0 Prestwick Prestw-954 Tyloxapol NA NA NA NA
BRD-U48018661-000-01-9 tyloxapol 1 0 MicroSource 1506053 TYLOXAPOL NA NA NA NA
clue_2018a %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
broad_id pert_iname qc_incompatible purity vendor catalog_no vendor_name expected_mass smiles InChIKey pubchem_cid deprecated_broad_id
BRD-U08520523-000-01-0 colesevalam 1 0 MicroSource 1505905 COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml) NA NA NA NA NA
BRD-U51753822-000-01-1 phosphatidylcholine 1 0 Selleck S4411 Phosphatidylcholine NA NA NA NA NA
BRD-U45393375-000-01-6 sevelamer 1 0 Selleck S4129 Sevelamer HCl NA NA NA NA BRD-M22581557-003-01-7
BRD-U59360347-000-01-3 teicoplanin-a2-3 0 0 Selleck S1399 Teicoplanin 2615.3 NA NA NA NA
BRD-U25960968-000-01-9 tyloxapol 1 0 Prestwick Prestw-954 Tyloxapol NA NA NA NA NA
BRD-U48018661-000-01-9 tyloxapol 1 0 MicroSource 1506053 TYLOXAPOL NA NA NA NA NA
clue_2018b %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
broad_id pert_iname qc_incompatible purity vendor catalog_no vendor_name expected_mass smiles InChIKey pubchem_cid deprecated_broad_id
BRD-U08520523-000-01-0 colesevalam 1 0 MicroSource 1505905 COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml) NA NA NA #N/A BRD-M08475060-001-01-4
BRD-U51753822-000-01-1 phosphatidylcholine 1 0 Selleck S4411 Phosphatidylcholine NA NA NA #N/A BRD-M44205332-065-01-8
BRD-U45393375-000-01-6 sevelamer 1 0 Selleck S4129 Sevelamer HCl NA NA NA #N/A BRD-M22581557-003-01-7
BRD-U59360347-000-01-3 teicoplanin-a2-3 0 0 Selleck S1399 Teicoplanin 2615.3 NA NA #N/A NA
BRD-U25960968-000-01-9 tyloxapol 1 0 Prestwick Prestw-954 Tyloxapol NA NA NA #N/A BRD-K46156294-001-05-5
BRD-U48018661-000-01-9 tyloxapol 1 0 MicroSource 1506053 TYLOXAPOL NA NA NA #N/A BRD-M13714491-001-01-8
clue_2020 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
broad_id pert_iname qc_incompatible purity vendor catalog_no vendor_name expected_mass smiles InChIKey pubchem_cid deprecated_broad_id
clue_2017 %<>% dplyr::filter(!is.na(InChIKey))

clue_2018a %<>% dplyr::filter(!is.na(InChIKey))

clue_2018b %<>% dplyr::filter(!is.na(InChIKey))

clue_2020 %<>% dplyr::filter(!is.na(InChIKey))
f_deprecated_id <-  function(df)
  tibble (
    deprecated_id = 
      df %>%
      pull(deprecated_broad_id) %>%  
      paste(collapse = "|") %>% 
      str_split("\\|") %>%
      extract2(1)
  )
deprecated_broad_id_f <- function(df) {
  df %>% 
    filter(!is.na(deprecated_broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>% 
    select(InChIKey, deprecated_broad_id) %>%
    group_by(InChIKey) %>% 
    nest() %>% 
    mutate(deprecated_broad_id = 
             map(data, 
                 function(df) {
                   deprecated_broad_id = 
                     paste(df$deprecated_broad_id, collapse = "|") %>% 
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(deprecated_broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, deprecated_broad_id) %>%
    arrange(InChIKey, deprecated_broad_id)
}
  

broad_id_f <- function(df) {
  df %>% 
    filter(!is.na(broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>% 
    select(InChIKey, broad_id) %>%
    group_by(InChIKey) %>% 
    nest() %>% 
    mutate(broad_id = 
             map(data, 
                 function(df) {
                   broad_id = 
                     paste(df$broad_id, collapse = "|") %>% 
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, broad_id) %>%
    arrange(InChIKey, broad_id)
}
deprecated_broad_id_2018a <- deprecated_broad_id_f(clue_2018a)

deprecated_broad_id_2018b <- deprecated_broad_id_f(clue_2018b)

deprecated_broad_id_2020 <- deprecated_broad_id_f(clue_2020)

broad_id_2017 <- broad_id_f(clue_2017)

broad_id_2018a <- broad_id_f(clue_2018a)

broad_id_2018b <- broad_id_f(clue_2018b)

broad_id_2020 <- broad_id_f(clue_2020)
all_inchi <- 
  bind_rows(
    deprecated_broad_id_2018a  %>% distinct(InChIKey),
    deprecated_broad_id_2018b  %>% distinct(InChIKey),
    deprecated_broad_id_2020  %>% distinct(InChIKey),
    broad_id_2017  %>% distinct(InChIKey),
    broad_id_2018a  %>% distinct(InChIKey),
    broad_id_2018b  %>% distinct(InChIKey),
    broad_id_2020  %>% distinct(InChIKey)
  ) %>%
  distinct(InChIKey) %>%
  arrange(InChIKey) %>%
  ungroup()

all_inchi %>%
  count() %>%
  knitr::kable()
n
7012

This master table that has the full mapping across all pairs of ids. It
has 759229 rows because of the many-to-many relationship between
broad_id (and deprecated_broad_id) and InChiKey. This is the
“tidy” way to do it

master_table <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020,  by = "InChIKey") %>% rename(deprecated_broad_id_2020  = deprecated_broad_id) %>%
  left_join(broad_id_2017,  by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020,  by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup()

master_table %>%
  count() %>%
  knitr::kable()
n
759228

Here is the non-tidy way to do it: we |-separate broad_ids and
deprecated_broad_id per InChiKey

pipe_separate <- function(df, column) {
  column <- sym(column)
  df %>% group_by(InChIKey) %>% mutate(!!column := paste(sort(!!column), collapse = "|"))
}

broad_id_2017 %<>% pipe_separate("broad_id")

broad_id_2018a %<>% pipe_separate("broad_id")

broad_id_2018b %<>% pipe_separate("broad_id")

broad_id_2020 %<>% pipe_separate("broad_id")

deprecated_broad_id_2018a %<>% pipe_separate("deprecated_broad_id")

deprecated_broad_id_2018b %<>% pipe_separate("deprecated_broad_id")

deprecated_broad_id_2020 %<>% pipe_separate("deprecated_broad_id")
master_table_pipe_sep <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020,  by = "InChIKey") %>% rename(deprecated_broad_id_2020  = deprecated_broad_id) %>%
  left_join(broad_id_2017,  by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020,  by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup

master_table_pipe_sep %>%
  count() %>%
  knitr::kable()
n
7012

@niranjchandrasekaran
Copy link
Member Author

I don't recollect seeing this before. Do you read both?

I had included only the September-2018 version and not the May-2018 version. Now

My notebook dump is below. Sorry, ran out of time to document what I am doing here, but hope it helps

I believe you are treating the broad_id and deprecated_broad_id separately and then merging them in the end, is that right?

I tried doing the same and I too end up with 7012 rows (I guess that's good :))

Some rows have InChIKey as NA! Not sure how this should be handled.
For now, I propose we drop it unless they are present in the Cell Painting dataset

I also found that in the 2020 version a bunch of rows may not have been formatted correctly as I end up extracting InChI instead of InChIKey. Is that the case for you as well @shntnu? If so, these have to be fixed.

InChIKey14 pert_iname_2020
6368 InChI=1S/C10H7 CHC
6369 InChI=1S/C10H8 mirin
6370 InChI=1S/C11H1 BU-239
6371 InChI=1S/C12H6 bithionol
6372 InChI=1S/C13H1 pomalidomide
6373 InChI=1S/C14H1 PD-158780
6374 InChI=1S/C14H2 GSK-LSD-1
6375 InChI=1S/C15H1 JNJ-7706621
6376 InChI=1S/C15H2 LMK-235
6377 InChI=1S/C16H1 oncrasin-1
6378 InChI=1S/C17H1 GYKI-52466
6379 InChI=1S/C18H2 AZD5438
6380 InChI=1S/C18H3 phytosphingosine
6381 InChI=1S/C19H1 PF-04217903
6382 InChI=1S/C19H2 AT-9283
6383 InChI=1S/C19H3 perhexiline
6384 InChI=1S/C20H1 A-1120
6385 InChI=1S/C20H2 TC-H-106
6386 InChI=1S/C20H3 epoprostenol
6387 InChI=1S/C21H1 regorafenib
6388 InChI=1S/C21H2 dinaciclib
6389 InChI=1S/C22H1 idelalisib
6390 InChI=1S/C22H2 I-BET-762
6391 InChI=1S/C22H3 chlorhexidine
6392 InChI=1S/C23H2 GDC-0941
6393 InChI=1S/C23H3 TCS-OX2-29
6394 InChI=1S/C24H1 BRD7389
6395 InChI=1S/C24H2 canertinib
6396 InChI=1S/C24H3 romidepsin
6397 InChI=1S/C25H1 NU-7441
6398 InChI=1S/C25H2 GF109203X
6399 InChI=1S/C25H3 EHop-016
6400 InChI=1S/C26H2 KX2-391
6401 InChI=1S/C26H3 AZD4547
6402 InChI=1S/C27H2 ML228
6403 InChI=1S/C27H3 MK-1775
6404 InChI=1S/C28H2 GNF-5837
6405 InChI=1S/C28H3 BI-2536
6406 InChI=1S/C30H2 ETP-46464
6407 InChI=1S/C30H3 CEP-37440
6408 InChI=1S/C31H2 idasanutlin
6409 InChI=1S/C31H4 bardoxolone
6410 InChI=1S/C32H4 bardoxolone-methyl
6411 InChI=1S/C34H3 foretinib
6412 InChI=1S/C34H4 UK-356618
6413 InChI=1S/C34H5 volasertib
6414 InChI=1S/C36H4 indinavir
6415 InChI=1S/C38H3 tariquidar
6416 InChI=1S/C42H4 ABT-737
6417 InChI=1S/C42H5 birinapant
6418 InChI=1S/C8H9N paracetamol
6419 InChI=1S/C9H11 floxuridine
6420 InChI=1S/C9H13 cytidine

@gwaybio
Copy link
Member

gwaybio commented Apr 15, 2020

Progress here is looking great - 2 minor notes:

  1. @niranjchandrasekaran in addition to adding the 20180516 version in d570fae, can you also make sure to add md5sum details to md5sum.txt?
  2. @shntnu - note my edit adding a collapsible section tag in Mapping Broad IDs across different versions #13 (comment). While these details are super helpful for troubleshooting (and essential for double-checking integrity!) adding a collapsible tag here reduces bloat

@niranjchandrasekaran
Copy link
Member Author

@niranjchandrasekaran in addition to adding the 20180516 version in d570fae, can you also make sure to add md5sum details to md5sum.txt?

Done

- For rows where InChI was being extracted instead of InChIKey, InChI was converted to InChIKey using the rdkit package
@niranjchandrasekaran
Copy link
Member Author

I fixed the InChI problem by converting InChI to InChIKey using the rdkit cheminformatics package. Now, there are 6959 unique rows in the mapping table.

@shntnu
Copy link
Collaborator

shntnu commented Apr 15, 2020

I fixed the InChI problem by converting InChI to InChIKey using the rdkit cheminformatics package. Now, there are 6959 unique rows in the mapping table.

We should ping Josh to fix this. Not tagging him right now to avoid too much traffic for him.

@shntnu
Copy link
Collaborator

shntnu commented Apr 15, 2020

I believe you are treating the broad_id and deprecated_broad_id separately and then merging them in the end, is that right?

Correct

- Also removed error in pipe delimiting
- Also merged functions and removed those that are no longer in use
- Added documentation
@niranjchandrasekaran niranjchandrasekaran changed the title [WIP] Mapping Broad IDs across different versions Mapping Broad IDs across different versions Apr 16, 2020
@niranjchandrasekaran
Copy link
Member Author

@gwaygenomics I have removed the target and moa fields. This PR is ready for your review.

@gwaybio gwaybio self-requested a review April 16, 2020 20:45
@gwaybio gwaybio mentioned this pull request Apr 16, 2020
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments really - we're close to a merge!

Oh, also important to remember to make any changes in the .ipynb file and not the .py file since the .py is autogenerated with nbconvert

metadata/moa/clue/md5sum.txt Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
metadata/moa/scripts/nbconverted/2.map-broad_id.py Outdated Show resolved Hide resolved
@gwaybio
Copy link
Member

gwaybio commented Apr 17, 2020

A general comment before I proceed with the review. I think this should be an optional todo, but I will start doing it for (likely) all of my jupyter notebooks.

I've been using black on all of my .py files (.py files that are not autogenerated) for over a year. I love it! With black, I never have to worry about formatting, so it removes a super important but sometimes annoying style requirement.

Around the time I started using black, I was hopeful that there would be a solution for .ipynb files. Either there wasn't one or I didn't look hard enough, but I haven't used a black tool for my .ipynb files... until now! I spent my morning researching some available tools, and it looks like there are a couple!

First, I tried jupyter black, but the install was clunky and it looks like there is a manual step required to autoformat.

Next, I tried nb_black and I love it! I've only used it for the past 10 minutes, but I think I will keep trying it out. Note that I installed it using the instructions here. After enabling the magic %load_ext nb_black in the first executed code cell, the tool autoformats upon code block execution automatically 🎉 .

My plan is to keep testing out nb_black to see if I run into any critical limitations. Given that it is a fairly new tool, it is likely there are some things to work out.

Summary

  1. nb_black seems to be a neat tool that I will use
  2. I wanted to point you to this option
  3. feel free to add to this PR if you also want to try out, but otherwise (b/c it is a new tool with likely some bugs) feel free to ignore.

@niranjchandrasekaran
Copy link
Member Author

I usually work with jupyter notebooks within PyCharm (their support for jupyter notebooks is not great), so I looked for ways to make nb_black work with PyCharm without any luck. For this exercise I opened the notebook in a browser and then reformatted the code using nb_black. In the future, I will look into other packages that play well with PyCharm.

Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - merge at will!

@gwaybio
Copy link
Member

gwaybio commented Apr 21, 2020

@niranjchandrasekaran - were you planning on adding/modifying anything else to this PR? If not, lets execute the merge 🚧

@niranjchandrasekaran
Copy link
Member Author

@gwaygenomics It looks like I don't have write access to this repository and I don't see the merge button.

@gwaybio
Copy link
Member

gwaybio commented Apr 21, 2020

Ah, well let's change that!

@gwaybio
Copy link
Member

gwaybio commented Apr 21, 2020

done. should see it now

@niranjchandrasekaran niranjchandrasekaran merged commit f5ba5f2 into broadinstitute:master Apr 21, 2020
@niranjchandrasekaran niranjchandrasekaran deleted the mapping_broad_id branch April 22, 2020 20:12
@gwaybio gwaybio mentioned this pull request Apr 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants