Adding batch 2 consensus profiles #61

gwaybio · 2021-03-21T22:03:43Z

Here, I add consensus profiles for batch 2. I also add Metadata_cell_id to the aggregation columns for both batches (batch 2 has three cell lines) and make some minor changes throughout the notebook.

We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

also adding Metadata_cell_id column

shntnu · 2021-03-22T00:17:48Z

> We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

This must be a metadata issue or a missing grouping column

Batch 2 has 3 cell lines x 3 dose points x 3 time points x 360 compounds = ~9720 (not exact because some compounds might be missing all doses)
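As a quick check of the estimate above, here is a minimal Python sketch of the arithmetic. The 360 compound count is the approximate figure quoted in this comment, and the variable names are only illustrative.

```python
# Design factors quoted in the comment above (compound count is approximate).
n_cell_lines = 3
n_doses = 3
n_time_points = 3
n_compounds = 360

expected = n_cell_lines * n_doses * n_time_points * n_compounds
observed = 1620  # consensus profiles reported for batch 2 in this PR

print(expected)                      # 9720
print(f"{observed / expected:.1%}")  # 16.7%
```

The observed count is roughly a sixth of the expected one, which is why a missing grouping column is a plausible suspect.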

shntnu · 2021-03-22T00:25:18Z

Here is the exact number of consensus profiles for batch 2:

library(tidyverse)

# The three unique batch 2 platemaps
platemaps <- 
  c("https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt")

n_cell_lines <- 3
n_time_points <- 3

# Count distinct (compound, dose) pairs across the platemaps, then scale by
# the number of cell lines and time points
platemaps %>%
  map_df(read_tsv) %>% 
  distinct(broad_sample, mmoles_per_liter) %>% 
  tally(name = "n_consensus") %>%
  mutate(n_consensus = n_consensus * n_cell_lines * n_time_points) %>%
  knitr::kable()

n_consensus
9396

gwaybio · 2021-03-22T11:35:38Z

I was missing time point as a grouping column, thanks!

gwaybio · 2021-03-22T15:08:04Z

This turned out to be an even larger problem. The aggregate function drops samples if any of their aggregating columns (strata) has missing values. Eek! I opened cytomining/pycytominer#133 to resolve this globally, but for this PR, my solution is to recode missing values as "unknown". This only impacts the MOA and target columns.
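A minimal sketch of the failure mode and the workaround, assuming the dropping comes from pandas' default `groupby` behavior of excluding rows whose group key is missing. This is plain pandas, not pycytominer's aggregate function, and the sample IDs, MOA values, and feature column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy profiles: column names mirror the thread, values are hypothetical.
profiles = pd.DataFrame(
    {
        "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-B", "BRD-B"],
        "Metadata_moa": ["kinase inhibitor", "kinase inhibitor", np.nan, np.nan],
        "Cells_feature": [1.0, 3.0, 5.0, 7.0],
    }
)
strata = ["Metadata_broad_sample", "Metadata_moa"]

# By default, groupby excludes rows whose key contains NaN, so BRD-B vanishes.
dropped = profiles.groupby(strata)["Cells_feature"].median().reset_index()

# Recoding the missing annotation first keeps every sample in the result.
recoded = (
    profiles.fillna({"Metadata_moa": "unknown"})
    .groupby(strata)["Cells_feature"]
    .median()
    .reset_index()
)

print(len(dropped), len(recoded))  # 1 2
```

pandas 1.1 and later also accept `groupby(..., dropna=False)`, which keeps missing keys as their own group without recoding them.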

This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles. Note that your example above does not include platemaps from multiple time points.

Also note that I do update MOAs in the profiling step for both batches:

lincs-cell-painting/profiles/profile_cells.py

Lines 73 to 83 in d471bbd

    
anno_df = annotate(
    profiles=out_file,
    platemap=platemap_file,
    join_on=["Metadata_well_position", well_col],
    cell_id=cell_id,
    format_broad_cmap=True,
    perturbation_mode="chemical",
    external_metadata=moa_df,
    external_join_left=["Metadata_broad_sample"],
    external_join_right=["Metadata_broad_sample"],
)

But I wonder whether I first need to update the external MOA file with the new batch's Broad IDs...

lincs-cell-painting/profiles/profiling_pipeline.py

Lines 46 to 50 in d471bbd

    
# Load and check MOA information
moa_file = pathlib.PurePath(
    "../metadata/moa/repurposing_info_external_moa_map_resolved.tsv"
)
moa_df = pd.read_csv(moa_file, sep="\t")
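One way to answer the question about the MOA file is to list the batch 2 sample IDs that have no match in it. The sketch below is only an illustration: the IDs are hypothetical placeholders, and it assumes an exact match on the sample identifier, as in the `external_join_left` / `external_join_right` arguments above.

```python
# Hypothetical IDs: in practice these would be the Metadata_broad_sample
# values from the batch 2 platemaps and from the external MOA file.
batch2_samples = {"BRD-A0001", "BRD-A0002", "BRD-A0003"}
moa_samples = {"BRD-A0001", "BRD-A0003", "BRD-A0009"}

# Samples that would receive no MOA or target annotation after the join.
missing_from_moa = sorted(batch2_samples - moa_samples)
coverage = 1 - len(missing_from_moa) / len(batch2_samples)

print(missing_from_moa)   # ['BRD-A0002']
print(f"{coverage:.0%}")  # 67%
```

An empty `missing_from_moa` would suggest that the MOA file already covers the new batch under this matching rule.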

gwaybio · 2021-03-22T15:10:27Z

In other words, if I do have to update the MOA file, I'll need to rerun the profiling pipeline for at least the batch 2 data.

gwaybio · 2021-03-22T15:35:14Z

@shntnu - this PR is ready for review. Let's discuss a potential full reprocessing in #62. We need not decide to reprocess in full before merging this PR.

shntnu · 2021-03-22T15:49:50Z

> the aggregate function will drop samples if one of their aggregating columns (strata) has missing values.

Wow, glad you found it! Bad 🐼 !

> This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles.

Ah that's because we are using pert_well in grouping (as per plan 👍 ). 3 x 3 x 3 x 384 = 10,368

> Note that your example above does not include platemaps from multiple time points.

For our notes: It does actually – there are only 3 unique platemaps (containing 3 doses x ~360 compounds), so I read 3 of them then multiplied that by 3x3. But that example is useless given that we are computing consensus by including the pert_well column :D
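For the record, a short Python sketch reconciling the two counts. It reads the three factors of 3 as cell lines, time points, and unique platemaps (the comment above does not label them) and takes 384 as the number of wells per platemap.

```python
n_cell_lines = 3
n_time_points = 3
n_platemaps = 3
wells_per_platemap = 384

# Grouping by pert_well gives one consensus profile per well per condition.
per_well = n_cell_lines * n_time_points * n_platemaps * wells_per_platemap
print(per_well)  # 10368

# The earlier tally counted distinct (broad_sample, dose) pairs instead.
per_treatment = 9396
distinct_pairs = per_treatment // (n_cell_lines * n_time_points)
total_wells = n_platemaps * wells_per_platemap

print(distinct_pairs)                # 1044
print(total_wells - distinct_pairs)  # 108
```

Under this reading, 108 wells across the three platemaps share a (sample, dose) pair with another well, which accounts for the 972-profile gap between 10,368 and 9,396.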

shntnu · 2021-03-22T15:50:18Z

> this PR is ready for review.

lgtm

gwaybio added 3 commits March 21, 2021 17:57

add batch 2 logic to consensus notebook

058c007

add batch 2 consensus profiles

ac2a1f0

reprocess batch 1 data

d4272a4

also adding Metadata_cell_id column

gwaybio mentioned this pull request Mar 21, 2021

Second batch of lincs data #57

Closed

gwaybio mentioned this pull request Mar 22, 2021

Should we reprocess all profiles before frozen data release? #62

Closed

2 tasks

gwaybio added 2 commits March 22, 2021 11:28

make missing values unknown in target and moa columns

3afbf49

rewrite consensus files

f19c91b

gwaybio requested a review from shntnu March 22, 2021 15:34

gwaybio merged commit f865c79 into broadinstitute:master Mar 22, 2021

gwaybio deleted the batch2-consensus branch March 22, 2021 17:23
