Adding batch 2 consensus profiles #61

Merged
merged 5 commits into from
Mar 22, 2021

Conversation

gwaybio
Member

@gwaybio gwaybio commented Mar 21, 2021

Here, I add consensus profiles for batch 2. I also add Metadata_cell_id to the aggregation columns for both batches (batch 2 has three cell lines). I make some minor changes throughout the notebook.

We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

@shntnu
Collaborator

shntnu commented Mar 22, 2021

> We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

This must be a metadata issue or a missing grouping column

Batch 2 has 3 cell lines x 3 dose points x 3 time points x 360 compounds = ~9720 (not exact because some compounds might be missing all doses)

@shntnu
Collaborator

shntnu commented Mar 22, 2021

Here is the exact number of consensus profiles for batch 2

library(tidyverse)

# The three unique batch 2 platemaps (the same layouts are reused across cell lines and time points)
platemaps <-
  c("https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt")

n_cell_lines <- 3
n_time_points <- 3

# Count distinct compound-dose pairs, then scale by the number of cell lines and time points
platemaps %>%
  map_df(read_tsv) %>%
  distinct(broad_sample, mmoles_per_liter) %>%
  tally(name = "n_consensus") %>%
  mutate(n_consensus = n_consensus * n_cell_lines * n_time_points) %>%
  knitr::kable()
| n_consensus|
|-----------:|
|        9396|

@gwaybio
Member Author

gwaybio commented Mar 22, 2021

I was missing time as a grouping column, thanks!
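For the record, a minimal pandas sketch of why a missing grouping column shrinks the consensus count. The data are made up, and `Metadata_time_point` is a hypothetical column name used only for illustration (the notebook's actual time column is not shown in this thread).

```python
import pandas as pd

# Toy profiles: 2 compounds x 3 time points, 2 replicate rows each (made-up values).
df = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A", "BRD-B"] * 6,
        # Hypothetical column name for the time point.
        "Metadata_time_point": ["6H", "6H", "24H", "24H", "48H", "48H"] * 2,
        "feature_x": range(12),
    }
)

# Grouping without the time column pools all time points into one profile per compound.
without_time = df.groupby(["Metadata_broad_sample"]).median(numeric_only=True)

# Adding the time column yields one consensus profile per compound and time point.
with_time = df.groupby(
    ["Metadata_broad_sample", "Metadata_time_point"]
).median(numeric_only=True)

print(len(without_time), len(with_time))
```

In this toy case the first grouping gives 2 profiles and the second gives 6, the same kind of collapse seen in the batch 2 count.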

@gwaybio
Member Author

gwaybio commented Mar 22, 2021

This turned out to be an even larger problem. The aggregate function will drop samples if one of their aggregating columns (strata) has missing values. Eek! I opened cytomining/pycytominer#133 to resolve this globally, but for this PR, my solution is to recode missing values as "unknown". This only impacts the MOA and target columns.
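A small sketch of the behavior described above, using plain pandas rather than the pycytominer aggregate function itself. It assumes the drop comes from the default `groupby` handling of missing keys; the column name `Metadata_moa` and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy profiles: BRD-B has no MOA annotation (illustrative data, hypothetical column name).
profiles = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-B", "BRD-B"],
        "Metadata_moa": ["inhibitor", "inhibitor", np.nan, np.nan],
        "feature_x": [1.0, 3.0, 5.0, 7.0],
    }
)
strata = ["Metadata_broad_sample", "Metadata_moa"]

# By default, groupby silently drops rows whose grouping keys contain NaN.
dropped = profiles.groupby(strata).median().reset_index()

# Workaround used in this PR: recode missing values before aggregating.
recoded = profiles.fillna({"Metadata_moa": "unknown"})
kept = recoded.groupby(strata).median().reset_index()

print(len(dropped), len(kept))
```

Here `dropped` keeps only BRD-A, while `kept` retains both compounds. Newer pandas versions (1.1 and later) also accept `dropna=False` in `groupby`, which may be one way to address this at the library level.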

This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles. Note that your example above does not include platemaps from multiple time points.

Also note that I do update MOAs in the profiling step for both batches:

anno_df = annotate(
    profiles=out_file,
    platemap=platemap_file,
    join_on=["Metadata_well_position", well_col],
    cell_id=cell_id,
    format_broad_cmap=True,
    perturbation_mode="chemical",
    external_metadata=moa_df,
    external_join_left=["Metadata_broad_sample"],
    external_join_right=["Metadata_broad_sample"],
)

But I wonder if I need to update the external MOA file first with the new batch's Broad IDs...

import pathlib

import pandas as pd

# Load and check MOA information
moa_file = pathlib.PurePath(
    "../metadata/moa/repurposing_info_external_moa_map_resolved.tsv"
)
moa_df = pd.read_csv(moa_file, sep="\t")
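One way to answer the question above might be to list the Broad IDs in the new batch that have no match in the MOA file. The sketch below uses small stand-in data frames in place of the real platemap and MOA tables; `platemap_df`, the `Metadata_moa` column, and the IDs are hypothetical.

```python
import pandas as pd

# Stand-in for the external MOA map (hypothetical rows).
moa_df = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A", "BRD-B"],
        "Metadata_moa": ["inhibitor", "agonist"],
    }
)

# Stand-in for the Broad IDs found in the new batch's platemaps (hypothetical rows).
platemap_df = pd.DataFrame({"Metadata_broad_sample": ["BRD-A", "BRD-B", "BRD-C"]})

# A left merge with indicator=True flags IDs that are absent from the MOA map.
merged = platemap_df.drop_duplicates().merge(
    moa_df, on="Metadata_broad_sample", how="left", indicator=True
)
unmatched = merged.loc[
    merged["_merge"] == "left_only", "Metadata_broad_sample"
].tolist()

print(unmatched)
```

If the list is empty for the real data, the existing MOA file would already cover the new batch.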

@gwaybio
Member Author

gwaybio commented Mar 22, 2021

In other words, if I have to do this, then I'll need to rerun the profiling pipeline for at least the batch 2 data.

@gwaybio gwaybio requested a review from shntnu March 22, 2021 15:34
@gwaybio
Member Author

gwaybio commented Mar 22, 2021

@shntnu - this PR is ready for review. Let's discuss a potential full reprocessing in #62. We need not decide to reprocess in full before merging this PR.

@shntnu
Collaborator

shntnu commented Mar 22, 2021

> the aggregate function will drop samples if one of their aggregating columns (strata) has missing values.

Wow, glad you found it! Bad 🐼 !

> This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles.

Ah that's because we are using pert_well in grouping (as per plan 👍 ). 3 x 3 x 3 x 384 = 10,368

> Note that your example above does not include platemaps from multiple time points.

For our notes: It does actually – there are only 3 unique platemaps (containing 3 doses x ~360 compounds), so I read 3 of them then multiplied that by 3x3. But that example is useless given that we are computing consensus by including the pert_well column :D
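For our notes, the three counts in this thread can be reproduced with simple arithmetic. This is only a sketch: 360 compounds is the approximate figure quoted earlier, and reading the last two factors of 3 x 3 x 3 x 384 as 3 platemaps x 384 wells is an assumption.

```python
n_cell_lines = 3
n_time_points = 3
n_doses = 3
n_compounds = 360  # approximate number of compounds quoted in this thread
n_platemaps = 3    # assumption: the three unique platemap layouts
n_wells = 384      # wells per plate

# Rough estimate when grouping by compound and dose.
estimate = n_cell_lines * n_doses * n_time_points * n_compounds

# The exact platemap-based count (9396) implies this many distinct compound-dose pairs.
distinct_compound_doses = 9396 // (n_cell_lines * n_time_points)

# Count when pert_well is part of the grouping: every well is its own group.
by_well = n_cell_lines * n_time_points * n_platemaps * n_wells

print(estimate, distinct_compound_doses, by_well)
```

The well-based count is larger because wells that share a compound and dose are no longer pooled into a single group.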

@shntnu
Collaborator

shntnu commented Mar 22, 2021

> this PR is ready for review.

lgtm

@gwaybio gwaybio merged commit f865c79 into broadinstitute:master Mar 22, 2021
@gwaybio gwaybio deleted the batch2-consensus branch March 22, 2021 17:23

2 participants