Mapping Broad IDs across different versions #13
- The 2017, 2018 and 2020 versions are merged using the first 14 characters of the InChIKey
- broad_id, pert_iname, moa and target of all the versions are included
- Fields with multiple values are pipe separated
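The merge described in this commit can be sketched roughly as follows. This is only an illustration in plain Python, not the code from the PR: the function names, the `broad_id_<version>` column naming and the example records are all made up for the sketch.

```python
from collections import defaultdict


def inchikey14(inchikey):
    """Return the first 14 characters of an InChIKey."""
    return inchikey[:14]


def merge_versions(versions):
    """Merge per-version records into one row per InChIKey14.

    `versions` maps a version label to a list of (InChIKey, broad_id) pairs.
    Several broad_ids for the same key and version are deduplicated, sorted
    and joined with "|"; a version with no match for a key yields None.
    """
    collected = defaultdict(lambda: defaultdict(set))
    for version, records in versions.items():
        for inchikey, broad_id in records:
            collected[inchikey14(inchikey)][version].add(broad_id)

    rows = []
    for key in sorted(collected):
        row = {"InChIKey14": key}
        for version in versions:
            ids = collected[key].get(version)
            row[f"broad_id_{version}"] = "|".join(sorted(ids)) if ids else None
        rows.append(row)
    return rows


# Made-up example data (hypothetical, not taken from the CLUE files).
example = {
    "2017": [("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "BRD-K00000001-001-01-1")],
    "2020": [
        ("AAAAAAAAAAAAAA-CCCCCCCCSA-N", "BRD-K00000001-001-02-9"),
        ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "BRD-K00000001-001-01-1"),
        ("ZZZZZZZZZZZZZZ-UHFFFAOYSA-N", "BRD-K00000002-001-01-1"),
    ],
}
merged = merge_versions(example)
```

In this sketch, two full InChIKeys that share their first 14 characters collapse into one row, which is why a single row can carry more than one Broad ID per version.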
Looking good so far @niranjchandrasekaran - a couple notes:
Is it true that each row corresponds to a single most recent version Broad ID?
Oh interesting! I didn't know there could be two
Two more points
Please generate an
|
Some checks (WIP)
x %>% count %>% knitr::kable()
x %>% distinct(InChIKey14) %>% count %>% knitr::kable()
x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")
x %>%
  summarise_each(~sum(is.na(.))) %>%
  select(matches("^broad_id")) %>%
  pivot_longer(cols = everything(), values_to = "na_count") %>%
  knitr::kable()
|
|
IIUC, no: each row corresponds to a sample that at some point in history had a
Some
x <- read_csv("https://raw.githubusercontent.com/niranjchandrasekaran/lincs-cell-painting/mapping_broad_id/metadata/moa/clue/broad_id_map.csv")
x %>%
  summarise_each(~sum(is.na(.))) %>%
  select(matches("^broad_id")) %>%
  pivot_longer(cols = everything(), values_to = "na_count") %>%
  knitr::kable()
No
x %>%
  select(InChIKey14, matches("^broad_id")) %>%
  mutate_at(vars(matches("^broad_id")), is.na) %>%
  pivot_longer(cols = matches("^broad_id")) %>%
  group_by(InChIKey14) %>%
  summarize(all_broad_id_are_na = all(value)) %>%
  filter(all_broad_id_are_na) %>%
  count %>%
  knitr::kable()
|
There are 262 rows where the
generates
Done
I will check if the |
I found the
This would mean that there are Similarly, apart from the following, I found the
|
Looks like the |
@niranjchandrasekaran I just noticed that there are two 2018 files on https://clue.io/repurposing#download-data. I don't recollect seeing this before. Do you read both?
|
My notebook dump is below. Sorry, ran out of time to document what I am doing here, but hope it helps.
R Notebook
library(glue)
library(magrittr)
library(tidyverse)
clue_2017 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20170327.txt", comment = "!", guess_max = 20000)
clue_2018b <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180907.txt", comment = "!", guess_max = 20000)
clue_2018a <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180516.txt", comment = "!", guess_max = 20000)
clue_2020 <- read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt", comment = "!", guess_max = 20000)
Some rows have
clue_2017 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
clue_2018a %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
clue_2018b %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()
clue_2020 %>% dplyr::filter(is.na(InChIKey)) %>% knitr::kable()

# Drop the rows that have no InChIKey
clue_2017 %<>% dplyr::filter(!is.na(InChIKey))
clue_2018a %<>% dplyr::filter(!is.na(InChIKey))
clue_2018b %<>% dplyr::filter(!is.na(InChIKey))
clue_2020 %<>% dplyr::filter(!is.na(InChIKey))

# Flatten all pipe-separated deprecated ids of a table into one column
f_deprecated_id <- function(df)
  tibble(
    deprecated_id =
      df %>%
      pull(deprecated_broad_id) %>%
      paste(collapse = "|") %>%
      str_split("\\|") %>%
      extract2(1)
  )

# One row per distinct (InChIKey14, deprecated_broad_id) pair
deprecated_broad_id_f <- function(df) {
  df %>%
    filter(!is.na(deprecated_broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>%
    select(InChIKey, deprecated_broad_id) %>%
    group_by(InChIKey) %>%
    nest() %>%
    mutate(deprecated_broad_id =
             map(data,
                 function(df) {
                   # Join, then split on "|" to get one id per element
                   paste(df$deprecated_broad_id, collapse = "|") %>%
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(deprecated_broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, deprecated_broad_id) %>%
    arrange(InChIKey, deprecated_broad_id)
}

# One row per distinct (InChIKey14, broad_id) pair
broad_id_f <- function(df) {
  df %>%
    filter(!is.na(broad_id)) %>%
    mutate(InChIKey = str_sub(InChIKey, 1, 14)) %>%
    select(InChIKey, broad_id) %>%
    group_by(InChIKey) %>%
    nest() %>%
    mutate(broad_id =
             map(data,
                 function(df) {
                   # Join, then split on "|" to get one id per element
                   paste(df$broad_id, collapse = "|") %>%
                     str_split("\\|") %>%
                     extract2(1)
                 }
             )
    ) %>%
    unnest(broad_id) %>%
    select(-data) %>%
    distinct(InChIKey, broad_id) %>%
    arrange(InChIKey, broad_id)
}

deprecated_broad_id_2018a <- deprecated_broad_id_f(clue_2018a)
deprecated_broad_id_2018b <- deprecated_broad_id_f(clue_2018b)
deprecated_broad_id_2020 <- deprecated_broad_id_f(clue_2020)
broad_id_2017 <- broad_id_f(clue_2017)
broad_id_2018a <- broad_id_f(clue_2018a)
broad_id_2018b <- broad_id_f(clue_2018b)
broad_id_2020 <- broad_id_f(clue_2020)

# Union of all 14-character InChIKeys seen in any table
all_inchi <-
  bind_rows(
    deprecated_broad_id_2018a %>% distinct(InChIKey),
    deprecated_broad_id_2018b %>% distinct(InChIKey),
    deprecated_broad_id_2020 %>% distinct(InChIKey),
    broad_id_2017 %>% distinct(InChIKey),
    broad_id_2018a %>% distinct(InChIKey),
    broad_id_2018b %>% distinct(InChIKey),
    broad_id_2020 %>% distinct(InChIKey)
  ) %>%
  distinct(InChIKey) %>%
  arrange(InChIKey) %>%
  ungroup()
all_inchi %>%
  count() %>%
  knitr::kable()
This master table has the full mapping across all pairs of ids. It
master_table <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020, by = "InChIKey") %>% rename(deprecated_broad_id_2020 = deprecated_broad_id) %>%
  left_join(broad_id_2017, by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020, by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup()
master_table %>%
  count() %>%
  knitr::kable()
Here is the non-tidy way to do it: we
# Replace each id by the sorted, pipe-separated list of all ids sharing its InChIKey
pipe_separate <- function(df, column) {
  column <- sym(column)
  df %>% group_by(InChIKey) %>% mutate(!!column := paste(sort(!!column), collapse = "|"))
}
broad_id_2017 %<>% pipe_separate("broad_id")
broad_id_2018a %<>% pipe_separate("broad_id")
broad_id_2018b %<>% pipe_separate("broad_id")
broad_id_2020 %<>% pipe_separate("broad_id")
deprecated_broad_id_2018a %<>% pipe_separate("deprecated_broad_id")
deprecated_broad_id_2018b %<>% pipe_separate("deprecated_broad_id")
deprecated_broad_id_2020 %<>% pipe_separate("deprecated_broad_id")

master_table_pipe_sep <-
  all_inchi %>%
  left_join(deprecated_broad_id_2018a, by = "InChIKey") %>% rename(deprecated_broad_id_2018a = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2018b, by = "InChIKey") %>% rename(deprecated_broad_id_2018b = deprecated_broad_id) %>%
  left_join(deprecated_broad_id_2020, by = "InChIKey") %>% rename(deprecated_broad_id_2020 = deprecated_broad_id) %>%
  left_join(broad_id_2017, by = "InChIKey") %>% rename(broad_id_2017 = broad_id) %>%
  left_join(broad_id_2018a, by = "InChIKey") %>% rename(broad_id_2018a = broad_id) %>%
  left_join(broad_id_2018b, by = "InChIKey") %>% rename(broad_id_2018b = broad_id) %>%
  left_join(broad_id_2020, by = "InChIKey") %>% rename(broad_id_2020 = broad_id) %>%
  distinct() %>%
  ungroup()
master_table_pipe_sep %>%
  count() %>%
  knitr::kable()
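For readers working in the Python notebook of the PR, the splitting step that `broad_id_f` and `deprecated_broad_id_f` perform above might look roughly like this. It is a sketch under assumptions: `unique_ids_per_key` is a hypothetical helper name, the input is a plain list of tuples rather than a data frame, and the example ids are invented.

```python
def unique_ids_per_key(records):
    """Return sorted, distinct (InChIKey14, id) pairs.

    `records` is an iterable of (InChIKey, id_field) tuples, where id_field
    may hold several ids joined by "|" or be None. Rows with a missing
    InChIKey or a missing id field are skipped.
    """
    pairs = set()
    for inchikey, id_field in records:
        if inchikey is None or id_field is None:
            continue
        for single_id in id_field.split("|"):
            # Truncate the key to its first 14 characters, as in the notebook.
            pairs.add((inchikey[:14], single_id))
    return sorted(pairs)


# Made-up records (hypothetical ids, for illustration only).
records = [
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "BRD-A2|BRD-A1"),
    ("AAAAAAAAAAAAAA-CCCCCCCCSA-N", "BRD-A1"),
    ("ZZZZZZZZZZZZZZ-UHFFFAOYSA-N", None),
]
pairs = unique_ids_per_key(records)
```

Using a set keeps the sketch short; it plays the role of `distinct()` in the R version, and the final `sorted` call corresponds to `arrange()`.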
|
I had included only the September-2018 version and not the May-2018 version. Now
I believe you are treating the
I tried doing the same and I too end up with 7012 rows (I guess that's good :))
I also found that in the 2020 version a bunch of rows may not have been formatted correctly as I end up extracting
|
Progress here is looking great - 2 minor notes:
|
Done |
- For rows where InChI was being extracted instead of InChIKey, InChI was converted to InChIKey using the rdkit package
I fixed the InChI problem by converting InChI to InChIKey using the rdkit cheminformatics package. Now, there are 6959 unique rows in the mapping table. |
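A rough sketch of how such rows could be detected and repaired is below. It is not the code from the PR: `normalise_to_inchikey` and `lookup_converter` are hypothetical names, and the converter is passed in as a parameter so the sketch runs without RDKit. With RDKit installed, a function such as `rdkit.Chem.inchi.InchiToInchiKey` could be supplied instead; whether the PR uses that exact call is not stated in this thread.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def normalise_to_inchikey(value, inchi_to_inchikey):
    """Return an InChIKey for `value`, converting it if it is an InChI string.

    `inchi_to_inchikey` is any callable mapping an InChI string to an InChIKey.
    """
    value = value.strip()
    if INCHIKEY_PATTERN.match(value):
        return value
    if value.startswith("InChI="):
        return inchi_to_inchikey(value)
    raise ValueError(f"Neither an InChIKey nor an InChI: {value!r}")


# Stand-in converter for the example (a lookup table, NOT a real conversion).
KNOWN = {"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}


def lookup_converter(inchi):
    return KNOWN[inchi]


column = ["LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"]
fixed = [normalise_to_inchikey(v, lookup_converter) for v in column]
```

Raising on anything that is neither form makes other formatting problems in the source file visible instead of silently passing them through.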
We should ping Josh to fix this. Not tagging him right now to avoid too much traffic for him. |
Correct |
- Also removed error in pipe delimiting
- Also merged functions and removed those that are no longer in use
- Added documentation
@gwaygenomics I have removed the target and moa fields. This PR is ready for your review. |
Minor comments really - we're close to a merge!
Oh, also important to remember to make any changes in the .ipynb file and not the .py file since the .py is autogenerated with nbconvert
- Updated the keyword within functions to specify version to "version" instead of "year"
- Field names use version names based on the release date of the dataset (e.g. 20180907 instead of 2018b)
- Also changed variable name
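One way to derive such release-date version names is to read them from the download file names used earlier in this thread (for example `repurposing_samples_20180907.txt`). The sketch below is an assumption about how that could be done, not the PR's implementation; both function names and the `<field>_<version>` naming are hypothetical.

```python
import re


def version_from_filename(filename):
    """Extract the 8-digit release date from a repurposing samples file name."""
    match = re.search(r"repurposing_samples_(\d{8})\.txt$", filename)
    if match is None:
        raise ValueError(f"No release date found in {filename!r}")
    return match.group(1)


def versioned_field(field, filename):
    """Build a field name such as 'broad_id_20180907' (naming is an assumption)."""
    return f"{field}_{version_from_filename(filename)}"


url = "https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180907.txt"
field_name = versioned_field("broad_id", url)
```

Keying on the release date avoids ad hoc labels such as 2018a and 2018b when two files are published in the same year.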
A general comment before I proceed with the review. I think this should be an optional todo, but I will start doing it for (likely) all of my jupyter notebooks.
I've been using
Around the time I started using black, I was hopeful that there would be a solution for
First, I tried
Next, I tried
My plan is to keep testing out
Summary
|
I usually work with jupyter notebooks within PyCharm (their support for jupyter notebooks is not great), so I looked for ways to make |
LGTM - merge at will!
@niranjchandrasekaran - were you planning on adding/modifying anything else to this PR? If not, let's execute the merge 🚧 |
@gwaygenomics It looks like I don't have write access to this repository and I don't see the merge button. |
Ah, well let's change that! |
done. should see it now |
Aims to address #11
I have created a mapping table connecting four fields - broad_id, pert_iname, target, moa - across each version of repurposing data using InChIKey14 as the common field. The map is a single table with 13 fields. Fields with multiple values are combined into a single string, separated by pipes. I have not run any tests to confirm whether the code generates meaningful results. I will do that as the next step.
A couple of points to note
pert_inames and 2 moas, the current mapping table does not say which pert_iname corresponds to which moa.
@gwaygenomics, @shntnu Do you think this mapping would work? Or should I take a different approach?