Mapping Broad IDs across different versions #13

niranjchandrasekaran · 2020-04-14T00:11:01Z

Aims to address #11

I have created a mapping table connecting four fields - broad_id, pert_iname, target, moa across each version of repurposing data using InChIKey14 as the common field. The map is a single table with 13 fields. Fields with multiple values are combined into a single string, separated by pipes. I have not run any tests to confirm whether the code generates meaningful results. I will do that as the next step.

A couple of points to note

Each row does not correspond to a single Broad ID.
- This may not be desirable when data frames are merged by inbuilt join commands using Broad IDs.
When two fields have multiple values, it not clear how they map to one another. For e.g. if there are 2 pert_inames and 2 moas, the current mapping table does not say which pert_iname corresponds to which moa.

@gwaygenomics, @shntnu Do you think this mapping would work? Or should I take a different approach?

- The 2017, 2018 and 2020 versions are merged using the first 14 characters of the InChIKey - broad_id, pert_iname, moa and target of all the version are included - Fields with multiple values are pipe separated

gwaybio · 2020-04-14T19:44:18Z

Looking good so far @niranjchandrasekaran - a couple notes:

Each row does not correspond to a single Broad ID.

Is it true that each row corresponds to a single most recent version Broad ID?

if there are 2 pert_inames and 2 moas, the current mapping table does not say which pert_iname corresponds to which moa

Oh interesting! I didn't know there could be two pert_inames. For cases where there are multiple moas or targets, for the purposes of the map, I think it is ok to keep them pipe delimited.

Two more points

Please generate an `nbconverted` `.py` file

Run the following command and add the resulting file to this PR

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts/nbconverted *.ipynb

This will help with making inline comments in the pull request review.

What to do with `deprecated_broad_id` column?

Do you think we should handle this column in the map?

shntnu · 2020-04-14T20:46:44Z

Some checks (WIP)

x %>% count %>% knitr::kable()

n
6997

x %>% distinct(InChIKey14) %>% count %>% knitr::kable()

n
6997

x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")

x %>% 
  summarise_each(~sum(is.na(.))) %>% 
  select(matches("^broad_id")) %>% 
  pivot_longer(cols = everything(), values_to = "na_count") %>% 
  knitr::kable()

name	na_count
broad_id_2017	1371
broad_id_2018	966
broad_id_2020	262

shntnu · 2020-04-14T20:56:21Z

x %>% 
  select(matches("^pert_iname")) %>% 
  rowwise() %>% 
  mutate_each(function(x) str_detect(x, "\\|")) %>% 
  ungroup() %>% 
  summarise_each(~sum(., na.rm = T)) %>% 
  pivot_longer(cols = everything(), values_to = "multiple_inames") %>% 
  knitr::kable()

name	multiple_inames
pert_iname_2017	111
pert_iname_2018	113
pert_iname_2020	127

shntnu · 2020-04-14T21:07:38Z

Is it true that each row corresponds to a single most recent version Broad ID?

IIUC, no: each row corresponds to a sample that at some point in history had a broad_id assigned to it.

Some NAs in each version

x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")

x %>% 
  summarise_each(~sum(is.na(.))) %>% 
  select(matches("^broad_id")) %>% 
  pivot_longer(cols = everything(), values_to = "na_count") %>% 
  knitr::kable()

name	na_count
broad_id_2017	1371
broad_id_2018	966
broad_id_2020	262

No InChIKey14 has all broad_ids as NA

x %>% 
  select(InChIKey14, matches("^broad_id")) %>% 
  mutate_at(vars(matches(("^broad_id"))), is.na) %>% 
  pivot_longer(cols = matches(("^broad_id"))) %>%
  group_by(InChIKey14) %>%
  summarize(all_broad_id_are_na = all(value)) %>%
  filter(all_broad_id_are_na) %>%
  count %>%
  knitr::kable()

n
0

niranjchandrasekaran · 2020-04-14T21:25:45Z

Is it true that each row corresponds to a single most recent version Broad ID?

There are 262 rows where the broad_id_2020 is NA while there are 133 rows with multiple broad_id_2020s

print('Number of NA: %d' % (len(broad_id_2020)-broad_id_2020.count()))
broad_id = broad_id_2020.fillna('')
print('Rows with multiple broad_id_2020: %d' % len(broad_id.loc[broad_id.broad_id_2020.str.contains('\|')]))

generates

Number of NA: 262
Rows with multiple broad_id_2020: 133

Please generate an nbconverted .py file

Done

What to do with deprecated_broad_id column?
Do you think we should handle this column in the map?

I will check if the InChIKey14 of every deprecated_broad_id maps to the same InChIKey14 when the ID was still in use. If they do, then there won't be any additional information in the deprecated_broad_id field.

niranjchandrasekaran · 2020-04-15T01:40:08Z

I found the InChIKey14 of most of the deprecated_broad_ids from the 2018 version to be present in the 2017 version, except for the following

	InChIKey14	deprecated_broad_id_2018
0	CAIDFDPKOGBHAT	BRD-A09283158
1	CLKOFPXJLQSYAH	BRD-A21002158
2	DUUFLTUTPZFOAH	BRD-A39845846
3	DVUJFSXMTBOIKW	BRD-U66818247
4	IRPYFWIZKIOHQN	BRD-U68276368
5	JFAHBAVHONXXFG	BRD-U22164625
6	KNLORLKZMHLRBH	BRD-U77299140
7	MEOTUCBFWIQPRL	BRD-M52026403
8	NOSAJPUYIASWEH	BRD-A52946717,BRD-K10549975
9	NVNLLIYOARQCIX	BRD-U59029435
10	ONCSSPOCYKWCTO	BRD-A26049141
11	PQYCRDPLPKGSME	BRD-K94313941
12	RMRCNWBMXRMIRW	BRD-A59561452,BRD-M14495474
13	VPZCKRKZFRCZMX	BRD-A09838294,BRD-K35776642
14	XTLSVXWJVRKFIP	BRD-K79233274
15	ZNRGQMMCGHDTEI	BRD-A79226577,BRD-M19407551

This would mean that there are deprecated_broad_ids that do not correspond to broad_ids that were in use. I am not sure what to make of this.

Similarly, apart from the following, I found the InChIKey of most of the deprecated_broad_ids from the 2020 version to be present in the 2018 version.

	InChIKey14	deprecated_broad_id_2020
0	AUZDMWWRYCUANI	BRD-K34751753
1	BKWHZSFINQBEQH	BRD-A25576662
2	DAIGSUHCWOZEQG	BRD-K98733875,BRD-K99455969
3	JBIWCJUYHHGXTC	BRD-M28128163
4	KTGRHKOEFSJQNS	BRD-A47598013
5	KTUXFHGUHYRDNF	BRD-A56621826,BRD-A78694248
6	OPMVDWXPLVKNQN	BRD-K05218031
7	OSZMNJRKIPAVOS	BRD-A40504327
8	PTQGBRPUALXNCZ	BRD-K01666924
9	SZINUGQCTHLQAZ	BRD-K93666001
10	TWHPPRXFYWSOQF	BRD-K83146339,BRD-K88014802,BRD-M65265990
11	ZCHZSBUNJOCWCW	BRD-M37437038,BRD-M92812175
12	ZIUSSTSXXLLKKK	BRD-K69690935
13	ZQHDVMCIDXEICF	BRD-A99182808

shntnu · 2020-04-15T11:12:17Z

This would mean that there are deprecated_broad_ids that do not correspond to broad_ids that were in use. I am not sure what to make of this.

Looks like the deprecated_broad_id column was added in 2018; there was no such column in 2017. So it is possible those correspond to ids that were already deprecated in 2017.

shntnu · 2020-04-15T11:25:52Z

@niranjchandrasekaran I just noticed that there are two 2018 files on https://clue.io/repurposing#download-data

I don't recollect seeing this before. Do you read both?

Latest version: 3/24/2020
Archived versions: 9/7/2018 | 5/16/2018 | 3/27/2017

shntnu · 2020-04-15T13:20:41Z

My notebook dump is below. Sorry, ran out of time to document what I am doing here, but hope it helps

Click to expand

R Notebook

library(glue)
library(magrittr)
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::collapse()  masks glue::collapse()
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()

clue_2017 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20170327.txt", comment = "!", guess_max = 20000)

## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double()
## )

clue_2018b <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180907.txt", comment = "!", guess_max = 20000)

## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_character(),
##   deprecated_broad_id = col_character()
## )

clue_2018a <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180516.txt", comment = "!", guess_max = 20000)

## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double(),
##   deprecated_broad_id = col_character()
## )

clue_2020 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt", comment = "!", guess_max = 20000)

## Parsed with column specification:
## cols(
##   broad_id = col_character(),
##   pert_iname = col_character(),
##   qc_incompatible = col_double(),
##   purity = col_double(),
##   vendor = col_character(),
##   catalog_no = col_character(),
##   vendor_name = col_character(),
##   expected_mass = col_number(),
##   smiles = col_character(),
##   InChIKey = col_character(),
##   pubchem_cid = col_double(),
##   deprecated_broad_id = col_character()
## )

Some rows have InChIKey as NA! Not sure how this should be handled.
For now, I propose we drop it unless they are present in the Cell
Painting dataset

clue_2017 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()

broad_id	pert_iname	qc_incompatible	purity	vendor	catalog_no	vendor_name	expected_mass	smiles	InChIKey	pubchem_cid
BRD-U08520523-000-01-0	colesevalam	1	0	MicroSource	1505905	COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml)	NA	NA	NA	NA
BRD-U59029435-000-01-7	nisin	0	NA	Cayman	16532	NA	NA	NA	NA	NA
BRD-U51753822-000-01-1	phosphatidylcholine	1	0	Selleck	S4411	Phosphatidylcholine	NA	NA	NA	NA
BRD-U45393375-000-01-6	sevelamer	1	0	Selleck	S4129	Sevelamer HCl	NA	NA	NA	NA
BRD-U59360347-000-01-3	teicoplanin-a2-3	0	0	Selleck	S1399	Teicoplanin	2615.3	NA	NA	NA
BRD-U25960968-000-01-9	tyloxapol	1	0	Prestwick	Prestw-954	Tyloxapol	NA	NA	NA	NA
BRD-U48018661-000-01-9	tyloxapol	1	0	MicroSource	1506053	TYLOXAPOL	NA	NA	NA	NA

clue_2018a %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()

broad_id	pert_iname	qc_incompatible	vendor	catalog_no	vendor_name	expected_mass	smiles	InChIKey	pubchem_cid	deprecated_broad_id
BRD-U08520523-000-01-0	colesevalam	1	MicroSource	1505905	COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml)	NA	NA	NA	NA	NA
BRD-U51753822-000-01-1	phosphatidylcholine	1	Selleck	S4411	Phosphatidylcholine	NA	NA	NA	NA	NA
BRD-U45393375-000-01-6	sevelamer	1	Selleck	S4129	Sevelamer HCl	NA	NA	NA	NA	BRD-M22581557-003-01-7
BRD-U59360347-000-01-3	teicoplanin-a2-3	0	Selleck	S1399	Teicoplanin	2615.3	NA	NA	NA	NA
BRD-U25960968-000-01-9	tyloxapol	1	Prestwick	Prestw-954	Tyloxapol	NA	NA	NA	NA	NA
BRD-U48018661-000-01-9	tyloxapol	1	MicroSource	1506053	TYLOXAPOL	NA	NA	NA	NA	NA

clue_2018b %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()

broad_id	pert_iname	qc_incompatible	vendor	catalog_no	vendor_name	expected_mass	smiles	InChIKey	pubchem_cid	deprecated_broad_id
BRD-U08520523-000-01-0	colesevalam	1	MicroSource	1505905	COLESEVALAM HYDROCHLORIDE (high mol wt copolymer @ 5mg/ml)	NA	NA	NA	#N/A	BRD-M08475060-001-01-4
BRD-U51753822-000-01-1	phosphatidylcholine	1	Selleck	S4411	Phosphatidylcholine	NA	NA	NA	#N/A	BRD-M44205332-065-01-8
BRD-U45393375-000-01-6	sevelamer	1	Selleck	S4129	Sevelamer HCl	NA	NA	NA	#N/A	BRD-M22581557-003-01-7
BRD-U59360347-000-01-3	teicoplanin-a2-3	0	Selleck	S1399	Teicoplanin	2615.3	NA	NA	#N/A	NA
BRD-U25960968-000-01-9	tyloxapol	1	Prestwick	Prestw-954	Tyloxapol	NA	NA	NA	#N/A	BRD-K46156294-001-05-5
BRD-U48018661-000-01-9	tyloxapol	1	MicroSource	1506053	TYLOXAPOL	NA	NA	NA	#N/A	BRD-M13714491-001-01-8

clue_2020 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()

broad_id	pert_iname	qc_incompatible	purity	vendor	catalog_no	vendor_name	expected_mass	smiles	InChIKey	pubchem_cid	deprecated_broad_id

clue_2017 %<>% dplyr::filter(!is.na(InChIKey))

clue_2018a %<>% dplyr::filter(!is.na(InChIKey))

clue_2018b %<>% dplyr::filter(!is.na(InChIKey))

clue_2020 %<>% dplyr::filter(!is.na(InChIKey))

f_deprecated_id <-  function(df)
  tibble (
    deprecated_id = 
      df %>%
      pull(deprecated_broad_id) %>%  
      paste(collapse = "|") %>% 
      str_split("\\|") %>%
      extract2(1)
  )

deprecated_broad_id_f <- function(df) {
  df %>% 
    filter(!is.na(deprecated_broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>% 
    select(InChIKey, deprecated_broad_id) %>%
    group_by(InChIKey) %>% 
    nest() %>% 
    mutate(deprecated_broad_id = 
             map(data, 
                 function(df) {
                   deprecated_broad_id = 
                     paste(df$deprecated_broad_id, collapse = "|") %>% 
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(deprecated_broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, deprecated_broad_id) %>%
    arrange(InChIKey, deprecated_broad_id)
}
  

broad_id_f <- function(df) {
  df %>% 
    filter(!is.na(broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>% 
    select(InChIKey, broad_id) %>%
    group_by(InChIKey) %>% 
    nest() %>% 
    mutate(broad_id = 
             map(data, 
                 function(df) {
                   broad_id = 
                     paste(df$broad_id, collapse = "|") %>% 
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, broad_id) %>%
    arrange(InChIKey, broad_id)
}

deprecated_broad_id_2018a <- deprecated_broad_id_f(clue_2018a)

deprecated_broad_id_2018b <- deprecated_broad_id_f(clue_2018b)

deprecated_broad_id_2020 <- deprecated_broad_id_f(clue_2020)

broad_id_2017 <- broad_id_f(clue_2017)

broad_id_2018a <- broad_id_f(clue_2018a)

broad_id_2018b <- broad_id_f(clue_2018b)

broad_id_2020 <- broad_id_f(clue_2020)

all_inchi <- 
  bind_rows(
    deprecated_broad_id_2018a  %>% distinct(InChIKey),
    deprecated_broad_id_2018b  %>% distinct(InChIKey),
    deprecated_broad_id_2020  %>% distinct(InChIKey),
    broad_id_2017  %>% distinct(InChIKey),
    broad_id_2018a  %>% distinct(InChIKey),
    broad_id_2018b  %>% distinct(InChIKey),
    broad_id_2020  %>% distinct(InChIKey)
  ) %>%
  distinct(InChIKey) %>%
  arrange(InChIKey) %>%
  ungroup()

all_inchi %>%
  count() %>%
  knitr::kable()

n
7012

This master table that has the full mapping across all pairs of ids. It
has 759229 rows because of the many-to-many relationship between
broad_id (and deprecated_broad_id) and InChiKey. This is the
“tidy” way to do it

master_table <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020,  by = "InChIKey") %>% rename(deprecated_broad_id_2020  = deprecated_broad_id) %>%
  left_join(broad_id_2017,  by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020,  by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup()

master_table %>%
  count() %>%
  knitr::kable()

n
759228

Here is the non-tidy way to do it: we |-separate broad_ids and
deprecated_broad_id per InChiKey

pipe_separate <- function(df, column) {
  column <- sym(column)
  df %>% group_by(InChIKey) %>% mutate(!!column := paste(sort(!!column), collapse = "|"))
}

broad_id_2017 %<>% pipe_separate("broad_id")

broad_id_2018a %<>% pipe_separate("broad_id")

broad_id_2018b %<>% pipe_separate("broad_id")

broad_id_2020 %<>% pipe_separate("broad_id")

deprecated_broad_id_2018a %<>% pipe_separate("deprecated_broad_id")

deprecated_broad_id_2018b %<>% pipe_separate("deprecated_broad_id")

deprecated_broad_id_2020 %<>% pipe_separate("deprecated_broad_id")

master_table_pipe_sep <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020,  by = "InChIKey") %>% rename(deprecated_broad_id_2020  = deprecated_broad_id) %>%
  left_join(broad_id_2017,  by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020,  by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup

master_table_pipe_sep %>%
  count() %>%
  knitr::kable()

n
7012

niranjchandrasekaran · 2020-04-15T15:53:41Z

I don't recollect seeing this before. Do you read both?

I had included only the September-2018 version and not the May-2018 version. Now

My notebook dump is below. Sorry, ran out of time to document what I am doing here, but hope it helps

I believe you are treating the broad_id and deprecated_broad_id separately and then merging them in the end, is that right?

I tried doing the same and I too end up with 7012 rows (I guess that's good :))

Some rows have InChIKey as NA! Not sure how this should be handled.
For now, I propose we drop it unless they are present in the Cell Painting dataset

I also found that in the 2020 version a bunch of rows may not have been formatted correctly as I end up extracting InChI instead of InChIKey. Is that the case for you as well @shntnu? If so, these have to be fixed.

	InChIKey14	pert_iname_2020
6368	InChI=1S/C10H7	CHC
6369	InChI=1S/C10H8	mirin
6370	InChI=1S/C11H1	BU-239
6371	InChI=1S/C12H6	bithionol
6372	InChI=1S/C13H1	pomalidomide
6373	InChI=1S/C14H1	PD-158780
6374	InChI=1S/C14H2	GSK-LSD-1
6375	InChI=1S/C15H1	JNJ-7706621
6376	InChI=1S/C15H2	LMK-235
6377	InChI=1S/C16H1	oncrasin-1
6378	InChI=1S/C17H1	GYKI-52466
6379	InChI=1S/C18H2	AZD5438
6380	InChI=1S/C18H3	phytosphingosine
6381	InChI=1S/C19H1	PF-04217903
6382	InChI=1S/C19H2	AT-9283
6383	InChI=1S/C19H3	perhexiline
6384	InChI=1S/C20H1	A-1120
6385	InChI=1S/C20H2	TC-H-106
6386	InChI=1S/C20H3	epoprostenol
6387	InChI=1S/C21H1	regorafenib
6388	InChI=1S/C21H2	dinaciclib
6389	InChI=1S/C22H1	idelalisib
6390	InChI=1S/C22H2	I-BET-762
6391	InChI=1S/C22H3	chlorhexidine
6392	InChI=1S/C23H2	GDC-0941
6393	InChI=1S/C23H3	TCS-OX2-29
6394	InChI=1S/C24H1	BRD7389
6395	InChI=1S/C24H2	canertinib
6396	InChI=1S/C24H3	romidepsin
6397	InChI=1S/C25H1	NU-7441
6398	InChI=1S/C25H2	GF109203X
6399	InChI=1S/C25H3	EHop-016
6400	InChI=1S/C26H2	KX2-391
6401	InChI=1S/C26H3	AZD4547
6402	InChI=1S/C27H2	ML228
6403	InChI=1S/C27H3	MK-1775
6404	InChI=1S/C28H2	GNF-5837
6405	InChI=1S/C28H3	BI-2536
6406	InChI=1S/C30H2	ETP-46464
6407	InChI=1S/C30H3	CEP-37440
6408	InChI=1S/C31H2	idasanutlin
6409	InChI=1S/C31H4	bardoxolone
6410	InChI=1S/C32H4	bardoxolone-methyl
6411	InChI=1S/C34H3	foretinib
6412	InChI=1S/C34H4	UK-356618
6413	InChI=1S/C34H5	volasertib
6414	InChI=1S/C36H4	indinavir
6415	InChI=1S/C38H3	tariquidar
6416	InChI=1S/C42H4	ABT-737
6417	InChI=1S/C42H5	birinapant
6418	InChI=1S/C8H9N	paracetamol
6419	InChI=1S/C9H11	floxuridine
6420	InChI=1S/C9H13	cytidine

gwaybio · 2020-04-15T17:44:35Z

Progress here is looking great - 2 minor notes:

@niranjchandrasekaran in addition to adding the 20180516 version in d570fae, can you also make sure to add md5sum details to md5sum.txt?
@shntnu - note my edit adding a collapsible section tag in Mapping Broad IDs across different versions #13 (comment). While these details are super helpful for troubleshooting (and essential for double-checking integrity!) adding a collapsible tag here reduces bloat

niranjchandrasekaran · 2020-04-15T17:54:51Z

@niranjchandrasekaran in addition to adding the 20180516 version in d570fae, can you also make sure to add md5sum details to md5sum.txt?

Done

- For rows where InChI was being extracted instead of InChIKey, InChI was converted to InChIKey using the rdkit package

niranjchandrasekaran · 2020-04-15T19:17:14Z

I fixed the InChI problem by converting InChI to InChIKey using the rdkit cheminformatics package. Now, there are 6959 unique rows in the mapping table.

shntnu · 2020-04-15T20:38:50Z

I fixed the InChI problem by converting InChI to InChIKey using the rdkit cheminformatics package. Now, there are 6959 unique rows in the mapping table.

We should ping Josh to fix this. Not tagging him right now to avoid too much traffic for him.

shntnu · 2020-04-15T20:39:17Z

I believe you are treating the broad_id and deprecated_broad_id separately and then merging them in the end, is that right?

Correct

- Also removed error in pipe delimiting

- Also merged functions and removed those that are no longer in use - Added documentation

niranjchandrasekaran · 2020-04-16T18:03:46Z

@gwaygenomics I have removed the target and moa fields. This PR is ready for your review.

gwaybio

Minor comments really - we're close to a merge!

Oh, also important to remember to make any changes in the .ipynb file and not the .py file since the .py is autogenerated with nbconvert

metadata/moa/clue/md5sum.txt

metadata/moa/scripts/nbconverted/2.map-broad_id.py

- Updated the keyword within functions to specify version to "version" instead of "year" - Field names use version names based on the release date of the dataset (for e.g. 20180907 instead 2018b)

-Also changed variable name

gwaybio · 2020-04-17T14:07:13Z

A general comment before I proceed with the review. I think this should be an optional todo, but I will start doing it for (likely) all of my jupyter notebooks.

I've been using black on all of my .py files (.py files that are not autogenerated) for over a year. I love it! With black, I never have to worry about formatting, so it removes a super important but sometimes annoying style requirement.

Around the time I started using black, I was hopeful that there would be a solution for .ipynb files. Either there wasn't one or I didn't look hard enough, but I haven't used a black tool for my .ipynb files... until now! I spent my morning researching some available tools, and it looks like there are a couple!

First, I tried jupyter black, but the install was clunky and it looks like there is a manual step required to autoformat.

Next, I tried nb_black and I love it! I've only used it for the past 10 minutes, but I think I will keep trying it out. Note that I installed it using the instructions here. After enabling the magic %load_ext nb_black in the first executed code cell, the tool autoformats upon code block execution automatically 🎉 .

My plan is to keep testing out nb_black to see if I run into any critical limitations. Given that it is a fairly new tool, it is likely there are some things to work out.

Summary

nb_black seems to be a neat tool that I will use
I wanted to point you to this option
feel free to add to this PR if you also want to try out, but otherwise (b/c it is a new tool with likely some bugs) feel free to ignore.

niranjchandrasekaran · 2020-04-17T17:14:16Z

I usually work with jupyter notebooks within PyCharm (their support for jupyter notebooks is not great), so I looked for ways to make nb_black work with PyCharm without any luck. For this exercise I opened the notebook in a browser and then reformatted the code using nb_black. In the future, I will look into other packages that play well with PyCharm.

gwaybio

LGTM - merge at will!

metadata/moa/scripts/nbconverted/2.map-broad_id.py

gwaybio · 2020-04-21T13:37:41Z

@niranjchandrasekaran - were you planning on adding/modifying anything else to this PR? If not, lets execute the merge 🚧

niranjchandrasekaran · 2020-04-21T14:38:49Z

@gwaygenomics It looks like I don't have write access to this repository and I don't see the merge button.

gwaybio · 2020-04-21T14:39:39Z

Ah, well let's change that!

gwaybio · 2020-04-21T14:40:42Z

done. should see it now

niranjchandrasekaran added 2 commits April 13, 2020 19:28

added repurposing hub files (2017 version)

cf8c6dd

Preliminary mapping

e23a823

- The 2017, 2018 and 2020 versions are merged using the first 14 characters of the InChIKey - broad_id, pert_iname, moa and target of all the version are included - Fields with multiple values are pipe separated

niranjchandrasekaran added 2 commits April 14, 2020 16:30

Additional documentation

69aa201

generate nbconverted .py file

a787f7b

Compare deprecated_broad_id with previous versions

8f1eda6

added repurposing hub files (2018-may version)

d570fae

niranjchandrasekaran added 2 commits April 15, 2020 13:48

Added deprecated broad ids

2e89496

included md5sum for 2017 and 2018-may versions

678b53f

Fixed column formatting issues

ffa174e

- For rows where InChI was being extracted instead of InChIKey, InChI was converted to InChIKey using the rdkit package

niranjchandrasekaran added 2 commits April 15, 2020 17:57

Separate column for deprecated_broad_id

71ea591

- Also removed error in pipe delimiting

Removed target and moa fields

1228b0f

- Also merged functions and removed those that are no longer in use - Added documentation

niranjchandrasekaran changed the title ~~[WIP] Mapping Broad IDs across different versions~~ Mapping Broad IDs across different versions Apr 16, 2020

gwaybio self-requested a review April 16, 2020 20:45

gwaybio mentioned this pull request Apr 16, 2020

Add conda environment #15

Closed

gwaybio requested changes Apr 16, 2020

View reviewed changes

niranjchandrasekaran added 11 commits April 16, 2020 18:02

reordered md5sum.txt

5ce8d93

assign() instead of direct assignment

40e9589

.shape() instead to data frame length

6201514

avoiding hardcoding paths

b7c1365

changed function name

1b7d3dc

Cosmetic Changes

6e2961f

Cosmetic Changes

857ddbd

Using f-strings to format strings

f2ce183

- Updated the keyword within functions to specify version to "version" instead of "year" - Field names use version names based on the release date of the dataset (for e.g. 20180907 instead 2018b)

Column names defined outside read_csv

e98ecce

Use chaining to merge multiple data frames

e7bc9ff

-Also changed variable name

use of methods chaining

7eeea81

niranjchandrasekaran requested a review from gwaybio April 17, 2020 13:27

code reformatting using nb_black

cf96d2c

gwaybio approved these changes Apr 17, 2020

View reviewed changes

metadata/moa/scripts/nbconverted/2.map-broad_id.py Show resolved Hide resolved

niranjchandrasekaran merged commit f5ba5f2 into broadinstitute:master Apr 21, 2020

niranjchandrasekaran deleted the mapping_broad_id branch April 22, 2020 20:12

gwaybio mentioned this pull request Apr 22, 2020

Old/Updated Broad IDs #11

Closed

niranjchandrasekaran mentioned this pull request Apr 23, 2020

Using InChIKey as the common field for mapping #19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping Broad IDs across different versions #13

Mapping Broad IDs across different versions #13

niranjchandrasekaran commented Apr 14, 2020 •

edited

Loading

gwaybio commented Apr 14, 2020

shntnu commented Apr 14, 2020 •

edited

Loading

shntnu commented Apr 14, 2020

shntnu commented Apr 14, 2020

niranjchandrasekaran commented Apr 14, 2020

niranjchandrasekaran commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020 •

edited by gwaybio

Loading

R Notebook

niranjchandrasekaran commented Apr 15, 2020

gwaybio commented Apr 15, 2020 •

edited

Loading

niranjchandrasekaran commented Apr 15, 2020

niranjchandrasekaran commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020 •

edited

Loading

niranjchandrasekaran commented Apr 16, 2020

gwaybio left a comment •

edited

Loading

gwaybio commented Apr 17, 2020 •

edited

Loading

niranjchandrasekaran commented Apr 17, 2020

gwaybio left a comment

gwaybio commented Apr 21, 2020

niranjchandrasekaran commented Apr 21, 2020

gwaybio commented Apr 21, 2020

gwaybio commented Apr 21, 2020

Mapping Broad IDs across different versions #13

Mapping Broad IDs across different versions #13

Conversation

niranjchandrasekaran commented Apr 14, 2020 • edited Loading

gwaybio commented Apr 14, 2020

Two more points

Please generate an nbconverted .py file

What to do with deprecated_broad_id column?

shntnu commented Apr 14, 2020 • edited Loading

shntnu commented Apr 14, 2020

shntnu commented Apr 14, 2020

niranjchandrasekaran commented Apr 14, 2020

niranjchandrasekaran commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020 • edited by gwaybio Loading

R Notebook

niranjchandrasekaran commented Apr 15, 2020

gwaybio commented Apr 15, 2020 • edited Loading

niranjchandrasekaran commented Apr 15, 2020

niranjchandrasekaran commented Apr 15, 2020

shntnu commented Apr 15, 2020

shntnu commented Apr 15, 2020 • edited Loading

niranjchandrasekaran commented Apr 16, 2020

gwaybio left a comment • edited Loading

Choose a reason for hiding this comment

gwaybio commented Apr 17, 2020 • edited Loading

Summary

niranjchandrasekaran commented Apr 17, 2020

gwaybio left a comment

Choose a reason for hiding this comment

gwaybio commented Apr 21, 2020

niranjchandrasekaran commented Apr 21, 2020

gwaybio commented Apr 21, 2020

gwaybio commented Apr 21, 2020

niranjchandrasekaran commented Apr 14, 2020 •

edited

Loading

Please generate an `nbconverted` `.py` file

What to do with `deprecated_broad_id` column?

shntnu commented Apr 14, 2020 •

edited

Loading

shntnu commented Apr 15, 2020 •

edited by gwaybio

Loading

gwaybio commented Apr 15, 2020 •

edited

Loading

shntnu commented Apr 15, 2020 •

edited

Loading

gwaybio left a comment •

edited

Loading

gwaybio commented Apr 17, 2020 •

edited

Loading