Process new batch of data 2019_06_10_Batch3 #14
Comments
@bethac07 Two questions for you
|
I think I am having CP3-related issues. This step, where CP is used to print out groups given a batch file, no longer works. I get
but then when I specify a pipeline via
|
Try specifying the batch file in the |
RE: your 2 questions- yes and yes |
I get a similar error. Maybe this is a problem upstream, when creating the batch file?
I ran this
|
Strangely, although it errors, the config files get created without any trouble. I'll continue processing, and you can ignore this error for now @bethac07 |
Wow, thanks a lot for digging into this. Good news is that things seem to be ok in spite of that error. I'll link to this issue in the handbook so that we have some record of it. |
@gwaygenomics Did we decide whether to split profiles by colony / non-colony going forward? |
Yes, we did decide to split. However, I think it would be good to process complete aggregate profiles too. |
@shntnu the split filters are described in #9 (comment) |
I am processing only 1 plate for now, and for some reason 2 images in this set are not getting processed. Beth helped diagnose, and her best guess was that the machines on which they were running had crashed, so she did various things in AWS to restart the process. The same two images are stuck again (I purged the queue), so it is very likely something to do with the images. In the interest of time, I'm going to skip these two images (out of 384 * 9) and proceed
|
Did you check the logs for those two sites?
--
Beth Cimini, PhD
Computational Biologist, Imaging Platform
Broad Institute
415 Main St Room 5011
Cambridge, MA 02142
(617) 714-8189
|
oooh I should have. Here it is for the first one
and here is the second one
@bethac07 Don't worry about diagnosing right now; it isn't blocking |
The total size of the CSV files is ~70Gb, which means that the SQLite will be ~35Gb, which is ~3x the size in the pilots. |
It's actually not good, because it means we're back to square one figuring out why this is happening... |
I can pull out images for that site and file an issue if that’s helpful
|
Admittedly what’s more helpful is to fix the bug :)
|
Ah, I was not clear- I was not referring to MeasureTexture (I will try to
get into that next week), but the "order of plating is not correlated with
cell count"; you said it was "good news" they didn't correlate, but I feel
like it's bad news because it means we now don't understand why it's
happening. So it's good from a "scientific method" point of view, but bad
from a "now we need a new hypothesis" point of view ;)
|
I'm not clear as to how the graph represents this? Isn't Based on the figure, I would think cell count is associated with My interpretation is that cell count is somewhat associated with |
This is an interesting figure (generated here). It's not a particularly strong correlation, but it alludes to plate density impacting replicate correlation. Is this true for isolated, colony, and aggregated profiles? |
It is also the order in which the cells were plated |
Turns out that the isolated + colony profiles did not get computed after all because of memory errors. I checked https://github.com/broadinstitute/cytominer_scripts/blob/8369881bc0a7cb22a58bd376b4240abd5709e5da/aggregate.R and there's not much we can do there to avoid this. I'll get a larger instance to run this step. |
A larger instance doesn't seem to have helped either – the instance ran for a couple of days (!) and then died. I'll try on an even larger VM, and if that fails, try modifying the code. |
The process gets killed after aggregating cytoplasm
Nothing in stderr, but this is the parallel log:
The process ended at 1562599514.964 + 1316.246 = 15:47 UTC, so it looks like aggregating cytoplasm ran for 17 minutes, whereas aggregating cells ran for 5 minutes |
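As a quick check of the arithmetic above, here is a minimal sketch in Python. It assumes the two numbers are the job start time in Unix epoch seconds and the job runtime in seconds; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the comment above. Assumption: the first is the start
# time (Unix epoch seconds) and the second is the runtime (seconds).
start_epoch = 1562599514.964
runtime_seconds = 1316.246

start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
end = start + timedelta(seconds=runtime_seconds)

# Print start and end as UTC clock times, plus the total runtime in minutes.
print("start:", start.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print("end:  ", end.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print(f"total runtime: {runtime_seconds / 60:.1f} min")
```

The total of roughly 22 minutes is consistent with about 5 minutes for cells plus about 17 minutes for cytoplasm.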
Ok so this is definitely a memory issue (takes up >30Gb) but for some reason |
Finally this worked! We needed a 64Gb instance (r5.2xlarge) for this to complete successfully |
wow, that is some serious compute! Do we know what step this is required for? Is it only one step, or is it all? |
Only the colony aggregation, strangely. I see why isolated was ok – there are fewer cells. But I am surprised that the "regular" aggregation (no split) did not need as much memory. It is possible that the modified code is in some way more memory intensive, but I haven't delved into that yet. |
Reusing the variable selection from the "regular" (all cells) profiles did not work so well because of NAs in normalized data. This is likely because there are more features that have a near zero variance once you filter out cells. I am rerunning the variable selection for |
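To illustrate why near-zero-variance features cause trouble after filtering, here is a rough sketch in Python that flags features whose variance across profiles is effectively zero. It uses a plain variance cutoff as a stand-in and is not the variable selection actually used in this thread; the function name, the input layout, and the threshold are all assumptions for illustration.

```python
from statistics import pvariance


def near_zero_variance_features(rows, threshold=1e-8):
    """Return names of features whose population variance is <= threshold.

    rows: list of dicts mapping feature name -> numeric value, one dict per
    profile (hypothetical layout, chosen only for this sketch).
    """
    if not rows:
        return []
    flagged = []
    for feature in rows[0]:
        values = [row[feature] for row in rows]
        # A feature with (near) zero spread cannot be scaled meaningfully:
        # dividing by its spread during normalization is undefined.
        if pvariance(values) <= threshold:
            flagged.append(feature)
    return flagged
```

Running a check like this separately on each subset (all, colony, isolated) would show whether the subsets lose different features.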
Meanwhile @apnathan @TiffanyAmariuta – please go ahead and use the single cell data uploaded here to do the analysis at your end |
As of bed396c, we have results from all 3 sets of profiles (all, colony, isolated). |
@apnathan asked:
@apnathan the data is in SQLite format. I tried to convert it to TSV using @gwaygenomics 's code here, but the files are too large to fit in memory for that approach to work. Are you able to work with SQLite files? |
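One way around the memory limit might be to stream each table out in chunks rather than loading it whole. The following is a minimal sketch using only the Python standard library; `export_table_to_tsv` and its parameters are hypothetical names, and it has not been run against the files in this thread.

```python
import csv
import sqlite3


def export_table_to_tsv(db_path, table, tsv_path, chunk_size=10_000):
    """Stream one SQLite table to a TSV file, chunk_size rows at a time.

    Returns the number of data rows written. NULLs are written as empty fields.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [col[0] for col in cursor.description]
        rows_written = 0
        with open(tsv_path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(header)
            while True:
                # Only one chunk of rows is held in memory at any time.
                chunk = cursor.fetchmany(chunk_size)
                if not chunk:
                    break
                writer.writerows(chunk)
                rows_written += len(chunk)
        return rows_written
    finally:
        conn.close()
```

The resulting TSV will still be large on disk, so working directly from the SQLite file may remain the simpler option.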
@gwaygenomics @bethac07 @AnneCarpenter We had previously observed that the similarity between profiles and the (dis)similarity of their cell counts (i.e. absolute difference between counts) are related (r = -0.329), so we wondered whether this goes away when we consider only colonies and only isolated cells. We found that the similarities are still related (colony r = -0.291; isolated r = -0.289). So we then tested whether this was driven by cell lines that have too many (>4000 measured) or too few cells (<1000 measured). The relationship between similarities is in fact a bit stronger when we filter these out (all cells r = -0.396; colony r = -0.395; isolated r = -0.335). In summary, profile similarity and cell count similarity are certainly related, but I haven't thought through to what extent we should be concerned about this and whether it should affect our protocol. In this notebook, search for "Report relationship between profiles similarity and cell count similarity", which is about three-quarters of the way down, to see the plots I've described above. |
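For readers who want to see the shape of this comparison, here is a rough sketch in Python: for every pair of profiles, compute the Pearson correlation between the two profiles and the absolute difference in their cell counts, then correlate those two lists. This is an assumed reading of the analysis, not the notebook's code; the function names and the dict-based inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def similarity_vs_count_difference(profiles, cell_counts):
    """Correlate pairwise profile similarity with pairwise |cell count difference|.

    profiles: dict of sample name -> list of feature values (same length each).
    cell_counts: dict of sample name -> number of cells measured.
    """
    similarities, count_differences = [], []
    for a, b in combinations(sorted(profiles), 2):
        similarities.append(pearson(profiles[a], profiles[b]))
        count_differences.append(abs(cell_counts[a] - cell_counts[b]))
    return pearson(similarities, count_differences)
```

A negative value means that pairs with more similar cell counts tend to have more similar profiles, which is the direction reported above.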
@gwaygenomics said It may also be helpful to ask Matt for phenotype metadata at this time. Maybe all those highly correlated samples with similar cell counts are the same disease state? |
@AnneCarpenter said |
@shntnu @TiffanyAmariuta I don't think we've ever used SQLite files before, but they should be importable into R, so we'll see what changes need to be made to our previous code to run the same analyses. I imagine the final data coming out of the experiment will also be on this order of magnitude, so it's probably worth it to start thinking about best approaches for analyzing it. |
Aha! Diagnosed, it is due to CellProfiler/CellProfiler#3831 |
(That same error hit 8 images in Batch4; the new versions of the pipeline I just added should bypass the issue for Batch5 onward). |
Profiles for one plate of |
For isolated / colony profiles for
|
This does reproduce:
library(glue)
library(tidyverse)
#plate_id <- "cmqtlpl1.5-31-2019-mt"
plate_id <- "cmqtlpl261-2019-mt"
batch_id <- "2019_06_10_Batch3"
#sc_type <- "isolated_"
sc_type <- "colony"
#sc_type <- ""
file_name <- glue("../../../backend/{batch_id}/{plate_id}/{plate_id}_{sc_type}.csv")
read_csv(
file_name
) %>%
select(
Cells_Neighbors_AngleBetweenNeighbors_10,
Cells_Neighbors_AngleBetweenNeighbors_Adjacent,
Cells_Neighbors_FirstClosestDistance_10,
Cells_Neighbors_FirstClosestDistance_Adjacent,
Cells_Neighbors_SecondClosestDistance_10,
Cells_Neighbors_SecondClosestDistance_Adjacent,
Nuclei_Neighbors_AngleBetweenNeighbors_2,
Nuclei_Neighbors_FirstClosestDistance_2,
Nuclei_Neighbors_SecondClosestDistance_2) %>%
distinct() %>%
head() %>%
glimpse()
|
@gwaygenomics any clue why this might be happening? The features listed above are all NA. I think it might come down to these lines, but I don't see anything offending here: https://github.com/broadinstitute/cytominer_scripts/blob/issues/29/aggregate.R#L50-L54 |
@gwaygenomics hold off on investigating – I think I know the issue |
I still couldn't narrow it down, but what I found was that even the regular version of the profiles have this problem for that plate alone.
Here's how to diagnose. These are regular profiles, not colony/isolated.

issues/29:
$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_recomputed.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
NA

master:
$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
20
27.37795303974804
92.29368296583094
98.18367406208154
77.57808586074458
84.75963089557231
84.09949372076152
89.88335916308186 |
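The same check can be scripted so it is easy to repeat across many columns and files. Below is a minimal sketch in Python using the standard `csv` module; `column_is_all_na` is a hypothetical helper, and treating both `NA` and the empty string as missing is an assumption.

```python
import csv


def column_is_all_na(csv_path, column, na_values=("NA", "")):
    """Return True if every value in `column` is one of `na_values`.

    Note: a file with a header but no data rows also returns True.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not found in {csv_path}")
        # Rows are read one at a time, so large profile files are fine.
        return all(row[column] in na_values for row in reader)
```

Applied to the two files above, one would expect `True` for the `_recomputed.csv` file and `False` for the master one, given the `csvcut` output shown.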
I tried reproducing as below, but wasn't able to do so when I probe just a couple of features at a time:
db <- "/home/ubuntu/ebs_tmp/2018_06_05_cmQTL/workspace/backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.sqlite"
suppressWarnings(suppressMessages(library(docopt)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr)))
suppressWarnings(suppressMessages(library(stringr)))
db <- src_sqlite(path = db)
image <- tbl(src = db, "image") %>%
select(TableNumber, ImageNumber, Metadata_Plate, Metadata_Well)
compartment <- "cells"
object <- tbl(src = db, compartment)
object %<>% inner_join(image, by = c("TableNumber", "ImageNumber"))
compartment_tag <-
str_c("^", str_sub(compartment, 1, 1) %>% str_to_upper(), str_sub(compartment, 2), "_")
variables <- c("Cells_Neighbors_AngleBetweenNeighbors_10", "Cells_Neighbors_SecondClosestDistance_Adjacent")
cells_profiles <- cytominer::aggregate(
population = object,
variables = variables,
strata = c("Metadata_Plate", "Metadata_Well"),
operation = "mean"
) %>% collect()
|
Given the time constraints, I'm making a decision here to skip fixing the issue in the colony and isolated versions of this plate. Specifically, the features listed above will be all NA in the files listed below:
To make |
Line counts look ok
|