Process new batch of data 2019_06_10_Batch3 #14

Closed
shntnu opened this issue Jun 25, 2019 · 49 comments
Labels
Experiments Tracking experimental questions, results, or analysis

Comments

@shntnu
Collaborator

shntnu commented Jun 25, 2019

No description provided.

@shntnu shntnu self-assigned this Jun 25, 2019
@shntnu shntnu added the Experiments Tracking experimental questions, results, or analysis label Jun 25, 2019
@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

@bethac07 Two questions for you

  1. Is this the pipeline set you used for the previous batch? https://github.com/broadinstitute/imaging-platform-pipelines/tree/master/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3

  2. I've created DCP config files here. I modified my previous configs based on your most recent config. I just wanted to double-check: is it indeed ok to change the ECS cluster name to default?

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

I think I am having CP3-related issues.

This step, where CP is used to print out groups given a batch file, no longer works.

I get

  File "/usr/local/bin/cellprofiler", line 11, in <module>
    load_entry_point('CellProfiler', 'console_scripts', 'cellprofiler')()
  File "/usr/local/src/CellProfiler/cellprofiler/__main__.py", line 122, in main
    raise ValueError("You must specify a pipeline filename to run")
ValueError: You must specify a pipeline filename to run

but then when I specify a pipeline via --pipeline=/pipeline_dir/illum.cppipe, I get

Traceback (most recent call last):
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2097, in prepare_run
    if ((not module.prepare_run(workspace)) or
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 841, in prepare_run
    fd = self.open_csv()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 742, in open_csv
    return open(self.csv_path, 'rb')
IOError: [Errno 2] No such file or directory: u'/root/Desktop\\load_data_csv\\2019_05_13_Batch2\\BR00103267/load_data.csv'

@bethac07
Contributor

Try specifying the batch file in the --pipeline

@bethac07
Contributor

Re: your two questions, yes and yes.

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

Try specifying the batch file in the --pipeline

I get a similar error. Maybe this is a problem upstream, when creating the batch file?

Error detected during run of module LoadData
Traceback (most recent call last):
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
    self.run_module(module, workspace)
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
    module.run(workspace)
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 1194, in run
    objects_names = self.get_object_names()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 801, in get_object_names
    header = self.get_header(do_not_cache=do_not_cache)
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 775, in get_header
    entry = self.get_cache_info()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 705, in get_cache_info
    ctime = os.stat(self.csv_path).st_ctime
OSError: [Errno 2] No such file or directory: '/root/Desktop/load_data_csv/2019_05_13_Batch2/BR00103267/load_data.csv'

I ran this

ubuntu@ip-10-0-4-54:~/efs/2018_06_05_cmQTL/workspace/software/cellpainting_scripts$ docker run -e S6_LOGGING=1 --rm \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3:/pipeline_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/filelist/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt:/filelist_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt:/datafile_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/batchfiles/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/illum:/batchfile_dir \
  --volume=/tmp:/tmp_dir \
  --volume=/home/ubuntu/bucket/:/home/ubuntu/bucket/ \
  cellprofiler/cellprofiler:3.1.8 \
  --pipeline=/batchfile_dir/Batch_data.h5 --print-groups=/batchfile_dir/Batch_data.h5

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

Strangely, although it errors, the config files get created without any trouble. I'll continue processing, and you can ignore this error for now @bethac07

@bethac07
Contributor

bethac07 commented Jun 25, 2019

I'm really not sure, sorry; my best guess is to try to change both of the "location" settings from things containing "Default Input Folder" to "Elsewhere" and see if that helps. I don't use that method for making config and job files, so I'd need to delve into it more to diagnose.

(screenshot attached)

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

I'm really not sure, sorry; my best guess is to try to change both of the "location" settings from things containing "Default Input Folder" to "Elsewhere" and see if that helps. I don't use that method for making config and job files, so I'd need to delve into it more to diagnose.

Wow, thanks a lot for digging into this. Good news is that things seem to be ok in spite of that error. I'll link to this issue in the handbook so that we have some record of it.

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

@gwaygenomics Did we decide whether to split profiles by colony / non-colony going forward?

@gwaybio
Member

gwaybio commented Jun 26, 2019

Did we decide whether to split profiles by colony / non-colony going forward?

Yes, we did decide to split. However, I think it would be good to process complete aggregate profiles too.

@gwaybio
Member

gwaybio commented Jun 26, 2019

@shntnu the split filters are described in #9 (comment)
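
For context, the split boils down to labeling each cell by a neighbors-type feature before aggregating; a minimal sketch in R (the feature name, threshold, and toy data below are illustrative only; the actual filters are the ones defined in #9):

library(dplyr)

# Toy per-cell table; the feature and threshold are placeholders, not the #9 filters
single_cells <- tibble(
  ObjectNumber = 1:4,
  Cells_Neighbors_NumberOfNeighbors_Adjacent = c(0, 3, 1, 0)
)

single_cells %>%
  mutate(sc_type = if_else(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0,
                           "isolated", "colony"))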

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

I am processing only 1 plate for now, and for some reason 2 images in this set are not getting processed. Beth helped diagnose, and her best guess was that the machines they were running on had crashed, so she did various things in AWS to restart the process. The same two images are stuck again (I purged the queue), so it is very likely something to do with the images themselves.

In the interest of time, I'm going to skip these two images (out of 384 * 9) and proceed

2019-06-26 18:46:49.918626 In process: 0 Pending 3456
2019-06-26 18:47:49.990986 In process: 2 Pending 3379
2019-06-26 18:48:50.063506 In process: 2 Pending 3224
2019-06-26 18:49:50.137452 In process: 2 Pending 3069
2019-06-26 18:50:50.317943 In process: 4 Pending 2919
2019-06-26 18:51:50.413163 In process: 4 Pending 2768
2019-06-26 18:52:50.486628 In process: 4 Pending 2624
2019-06-26 18:53:50.558921 In process: 6 Pending 2481
2019-06-26 18:54:50.632221 In process: 6 Pending 2345
2019-06-26 18:55:50.706765 In process: 6 Pending 2210
2019-06-26 18:56:50.912252 In process: 8 Pending 2071
2019-06-26 18:57:50.986726 In process: 8 Pending 1939
2019-06-26 18:58:51.059642 In process: 8 Pending 1791
2019-06-26 18:59:51.247832 In process: 8 Pending 1659
2019-06-26 19:00:51.323514 In process: 8 Pending 1513
2019-06-26 19:01:51.420210 In process: 8 Pending 1373
2019-06-26 19:02:51.492810 In process: 8 Pending 1234
2019-06-26 19:03:51.565966 In process: 8 Pending 1092
2019-06-26 19:04:51.638489 In process: 8 Pending 958
2019-06-26 19:05:51.709394 In process: 8 Pending 830
2019-06-26 19:06:51.807689 In process: 8 Pending 703
2019-06-26 19:07:51.881262 In process: 8 Pending 563
2019-06-26 19:08:51.956042 In process: 8 Pending 416
2019-06-26 19:09:52.040915 In process: 8 Pending 283
2019-06-26 19:10:52.121361 In process: 8 Pending 162
2019-06-26 19:11:52.227774 In process: 8 Pending 47
2019-06-26 19:12:52.306046 In process: 2 Pending 0
2019-06-26 19:42:55.310149 In process: 1 Pending 1
2019-06-26 19:43:55.387417 In process: 2 Pending 0
2019-06-26 19:47:55.725168 In process: 1 Pending 1

@bethac07
Contributor

bethac07 commented Jun 26, 2019 via email

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

Did you check the logs for those two sites?

oooh I should have. Here it is for the first one
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=2018_06_05_cmQTL_Analysis;stream=cmqtlpl1.5-31-2019-mt-P05-3;start=2019-06-25T19:52:36Z

cellprofiler -c -r -p /home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3/analysis_without_batchfile.cppipe -i /home/ubuntu/bucket/dummy -o /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3 -d /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3/cp.is.done --data-file=/home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/load_data_with_illum.csv -g Metadata_Plate=cmqtlpl1.5-31-2019-mt,Metadata_Well=P05,Metadata_Site=3
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Times reported are CPU and Wall-clock times for each module
Wed Jun 26 19:11:51 2019: Image # 3279, module LoadData # 1: CPU_time = 10.56 secs, Wall_time = 16.52 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module CorrectIlluminationApply # 2: CPU_time = 0.14 secs, Wall_time = 0.14 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module MeasureImageQuality # 3: CPU_time = 85.38 secs, Wall_time = 88.39 secs
Wed Jun 26 19:13:36 2019: Image # 3279, module MeasureImageQuality # 4: CPU_time = 1.71 secs, Wall_time = 1.71 secs
Wed Jun 26 19:13:38 2019: Image # 3279, module EnhanceOrSuppressFeatures # 6: CPU_time = 24.01 secs, Wall_time = 24.00 secs
Wed Jun 26 19:14:02 2019: Image # 3279, module IdentifyPrimaryObjects # 7: CPU_time = 6.03 secs, Wall_time = 6.02 secs
Wed Jun 26 19:14:08 2019: Image # 3279, module IdentifySecondaryObjects # 8: CPU_time = 6.43 secs, Wall_time = 6.43 secs
Wed Jun 26 19:14:14 2019: Image # 3279, module IdentifyTertiaryObjects # 9: CPU_time = 1.23 secs, Wall_time = 1.24 secs
Wed Jun 26 19:14:16 2019: Image # 3279, module MeasureColocalization # 10: CPU_time = 595.47 secs, Wall_time = 595.08 secs
Wed Jun 26 19:24:11 2019: Image # 3279, module MeasureGranularity # 11: CPU_time = 93.04 secs, Wall_time = 92.98 secs
Wed Jun 26 19:25:44 2019: Image # 3279, module MeasureObjectIntensity # 12: CPU_time = 35.56 secs, Wall_time = 35.54 secs
Wed Jun 26 19:26:19 2019: Image # 3279, module MeasureObjectNeighbors # 13: CPU_time = 5.36 secs, Wall_time = 5.36 secs
Wed Jun 26 19:26:25 2019: Image # 3279, module MeasureObjectNeighbors # 14: CPU_time = 2.09 secs, Wall_time = 2.08 secs
Wed Jun 26 19:26:27 2019: Image # 3279, module MeasureObjectNeighbors # 15: CPU_time = 2.25 secs, Wall_time = 2.24 secs
Wed Jun 26 19:26:29 2019: Image # 3279, module MeasureObjectIntensityDistribution # 16: CPU_time = 22.41 secs, Wall_time = 22.40 secs
Wed Jun 26 19:26:51 2019: Image # 3279, module MeasureObjectSizeShape # 17: CPU_time = 48.30 secs, Wall_time = 48.27 secs
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:141: UserWarning: Possible precision loss when converting from float32 to uint8
.format(dtypeobj_in, dtypeobj_out))
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure

and here is the second one
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=2018_06_05_cmQTL_Analysis;stream=cmqtlpl1.5-31-2019-mt-P05-3;start=2019-06-25T19:52:36Z

cellprofiler -c -r -p /home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3/analysis_without_batchfile.cppipe -i /home/ubuntu/bucket/dummy -o /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3 -d /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3/cp.is.done --data-file=/home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/load_data_with_illum.csv -g Metadata_Plate=cmqtlpl1.5-31-2019-mt,Metadata_Well=P05,Metadata_Site=3
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Times reported are CPU and Wall-clock times for each module
Wed Jun 26 19:11:51 2019: Image # 3279, module LoadData # 1: CPU_time = 10.56 secs, Wall_time = 16.52 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module CorrectIlluminationApply # 2: CPU_time = 0.14 secs, Wall_time = 0.14 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module MeasureImageQuality # 3: CPU_time = 85.38 secs, Wall_time = 88.39 secs
Wed Jun 26 19:13:36 2019: Image # 3279, module MeasureImageQuality # 4: CPU_time = 1.71 secs, Wall_time = 1.71 secs
Wed Jun 26 19:13:38 2019: Image # 3279, module EnhanceOrSuppressFeatures # 6: CPU_time = 24.01 secs, Wall_time = 24.00 secs
Wed Jun 26 19:14:02 2019: Image # 3279, module IdentifyPrimaryObjects # 7: CPU_time = 6.03 secs, Wall_time = 6.02 secs
Wed Jun 26 19:14:08 2019: Image # 3279, module IdentifySecondaryObjects # 8: CPU_time = 6.43 secs, Wall_time = 6.43 secs
Wed Jun 26 19:14:14 2019: Image # 3279, module IdentifyTertiaryObjects # 9: CPU_time = 1.23 secs, Wall_time = 1.24 secs
Wed Jun 26 19:14:16 2019: Image # 3279, module MeasureColocalization # 10: CPU_time = 595.47 secs, Wall_time = 595.08 secs
Wed Jun 26 19:24:11 2019: Image # 3279, module MeasureGranularity # 11: CPU_time = 93.04 secs, Wall_time = 92.98 secs
Wed Jun 26 19:25:44 2019: Image # 3279, module MeasureObjectIntensity # 12: CPU_time = 35.56 secs, Wall_time = 35.54 secs
Wed Jun 26 19:26:19 2019: Image # 3279, module MeasureObjectNeighbors # 13: CPU_time = 5.36 secs, Wall_time = 5.36 secs
Wed Jun 26 19:26:25 2019: Image # 3279, module MeasureObjectNeighbors # 14: CPU_time = 2.09 secs, Wall_time = 2.08 secs
Wed Jun 26 19:26:27 2019: Image # 3279, module MeasureObjectNeighbors # 15: CPU_time = 2.25 secs, Wall_time = 2.24 secs
Wed Jun 26 19:26:29 2019: Image # 3279, module MeasureObjectIntensityDistribution # 16: CPU_time = 22.41 secs, Wall_time = 22.40 secs
Wed Jun 26 19:26:51 2019: Image # 3279, module MeasureObjectSizeShape # 17: CPU_time = 48.30 secs, Wall_time = 48.27 secs
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:141: UserWarning: Possible precision loss when converting from float32 to uint8
.format(dtypeobj_in, dtypeobj_out))
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure

@bethac07 Don't worry about diagnosing right now; it isn't blocking

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

The total size of the CSV files is ~70 GB, which means the SQLite file will be ~35 GB, i.e. ~3x the size in the pilots.

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

A sneak peek into the data suggests that there is no evidence that cell ID (and thus settling time) is correlated with cell count, which is good!

(two plots attached)

@bethac07
Contributor

It's actually not good, because it means we're back to square one figuring out why this is happening...

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019 via email

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019 via email

@bethac07
Contributor

bethac07 commented Jun 27, 2019 via email

@gwaybio
Member

gwaybio commented Jun 27, 2019

there is no evidence that cell id (and thus settling time) is correlated with cell count

I'm not clear on how the graph represents this? Isn't cell ID a nominal variable?

Based on the figure, I would think cell count is associated with cell ID: cell IDs with low cell counts have low cell counts across replicates, while cell IDs with high cell counts tend to have high variance across replicates.

My interpretation is that cell count is somewhat associated with cell ID, but potentially not entirely explained by it, given the high variance across replicates (assuming that all cells were plated at approximately the same density).

@gwaybio
Member

gwaybio commented Jun 27, 2019

This is an interesting figure (generated here)

(figure attached)

It's not a particularly strong correlation, but it hints at plate density impacting replicate correlation.

Is this true for isolated, colony, and aggregated profiles?

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

Isn't cell ID a nominal variable?

It is also the order in which the cells were plated.

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

It turns out that the isolated + colony profiles did not get computed after all because of memory errors. I checked https://github.com/broadinstitute/cytominer_scripts/blob/8369881bc0a7cb22a58bd376b4240abd5709e5da/aggregate.R and there's not much we can do there to avoid this. I'll get a larger instance to run this step.

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

A larger instance doesn't seem to have helped either – the instance ran for a couple of days (!) and then died. I'll try on an even larger VM, and if that fails, try modifying the code.

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

The process gets killed while aggregating cytoplasm

INFO [2019-07-08 15:25:16] Started aggregating cells
INFO [2019-07-08 15:30:30] Started aggregating cytoplasm

Nothing in stderr, but this is the parallel log:

Seq     Host    Starttime       Runtime Send    Receive Exitval Signal  Command
1       :       1562599514.964  1316.246        0       0       0       9       ./aggregate.R --sqlite_file /home/ubuntu/ebs_tmp/2018_06_05_cmQTL/workspace/backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt.sqlite --output ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated.csv --sc_type isolated

The process ended at 1562599514.964 + 1316.246 = 15:47 UTC, so it looks like the cytoplasm aggregation ran for ~17 minutes, whereas the cells aggregation ran for ~5 minutes.
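
(For reference, converting that epoch arithmetic from the parallel log back to a timestamp in R:)

as.POSIXct(1562599514.964 + 1316.246, origin = "1970-01-01", tz = "UTC")
# [1] "2019-07-08 15:47:11 UTC"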

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

OK, so this is definitely a memory issue (it takes up >30 GB), but for some reason colony works while isolated gets killed

@shntnu
Collaborator Author

shntnu commented Jul 9, 2019

Finally, this worked! We needed a 64 GB instance (r5.2xlarge) for this to complete successfully

@gwaybio
Member

gwaybio commented Jul 9, 2019

Wow, that is some serious compute! Do we know which step this is required for? Is it only one step, or is it all of them?

@shntnu
Collaborator Author

shntnu commented Jul 9, 2019

Only the colony aggregation, strangely. I see why isolated was OK – there are fewer cells. But I am surprised that the "regular" aggregation (no split) did not need as much memory. It is possible that the modified code is in some way more memory intensive, but I haven't delved into that yet.

@shntnu
Collaborator Author

shntnu commented Jul 11, 2019

Reusing the variable selection from the "regular" (all cells) profiles did not work so well because of NAs in the normalized data. This is likely because more features have near-zero variance once you filter out cells.

I am rerunning the variable selection for colony and isolated
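
To illustrate the mechanism with a toy example (this is not the actual normalization code): a feature that is effectively constant within the filtered subset has zero spread, so the scaling step of normalization divides by zero and the whole column comes out NaN/NA.

library(dplyr)

# Toy data: after filtering to one subset, this feature happens to be constant
filtered <- tibble(Cells_SomeFeature = c(0.2, 0.2, 0.2, 0.2))

filtered %>%
  mutate(Cells_SomeFeature_scaled =
           (Cells_SomeFeature - mean(Cells_SomeFeature)) / sd(Cells_SomeFeature))
# sd() is 0 here, so the scaled column is NaN for every row; variable selection
# then sees an all-missing feature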

@shntnu
Collaborator Author

shntnu commented Jul 11, 2019

Meanwhile @apnathan @TiffanyAmariuta – please go ahead and use the single cell data uploaded here to do the analysis at your end

@shntnu
Collaborator Author

shntnu commented Jul 15, 2019

As of bed396c, we have results from all 3 sets of profiles (all, colony, isolated).

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

Meanwhile @apnathan @TiffanyAmariuta – please go ahead and use the single cell data uploaded here to do the analysis at your end

@apnathan asked:

What format is the data in at that AWS link? Last time we used the matrices uploaded by Greg here, which I think are similar to what you've sent me in the past: https://github.com/broadinstitute/cmQTL/tree/master/0.pilot-determine-conditions/data Are these files also available for the new data?

@apnathan the data is in SQLite format. I tried to convert it to TSV using @gwaygenomics's code here, but the files are too large to fit in memory for that approach to work.

Are you able to work with SQLite files?
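
(In case it helps, the backend SQLite can be queried lazily from R so the full tables never have to fit in memory; a minimal sketch, with an illustrative file path and well filter:)

library(DBI)
library(RSQLite)
library(dplyr)  # dbplyr also needs to be installed for tbl() on a database connection

# Point this at a plate's backend .sqlite file (path is illustrative)
db <- dbConnect(SQLite(), "cmqtlpl1.5-31-2019-mt.sqlite")

image <- tbl(db, "image") %>%
  select(TableNumber, ImageNumber, Metadata_Plate, Metadata_Well)

cells <- tbl(db, "cells")  # other compartments (nuclei, cytoplasm) are separate tables

# dplyr translates this to SQL; rows only come into R's memory at collect()
a01_cells <- cells %>%
  inner_join(image, by = c("TableNumber", "ImageNumber")) %>%
  filter(Metadata_Well == "A01") %>%
  collect()

dbDisconnect(db)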

shntnu added a commit that referenced this issue Jul 16, 2019
Process new batch of data
@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@gwaygenomics @bethac07 @AnneCarpenter

We had previously observed that the similarity between profiles and the (dis)similarity of their cell counts (i.e., the absolute difference between counts) are related (r = -0.329), so we wondered whether this goes away when we consider only colonies and only isolated cells. We found that the similarities are still related (colony r = -0.291; isolated r = -0.289). We then tested whether this was driven by cell lines that have too many (>4000 measured) or too few (<1000 measured) cells. The relationship between the similarities is in fact a bit stronger when we filter these out (all cells r = -0.396; colony r = -0.395; isolated r = -0.335).

In summary, profile similarity and cell count similarity are certainly related, but I haven't thought through to what extent we should be concerned about this and whether it should affect our protocol.


In this notebook, search for "Report relationship between profiles similarity and cell count similarity" (about 3/4 of the way down) to see the plots described above.
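
For the record, the comparison reduces to correlating two pairwise quantities, roughly as follows (a sketch with toy inputs; the notebook linked above is the source of truth):

# Toy inputs: 20 wells x 5 features, plus per-well cell counts
set.seed(1)
profiles   <- matrix(rnorm(20 * 5), nrow = 20)
cell_count <- rpois(20, lambda = 2000)

profile_sim  <- cor(t(profiles), method = "pearson")     # well-by-well profile similarity
count_dissim <- abs(outer(cell_count, cell_count, "-"))  # absolute difference in cell counts

pairs <- upper.tri(profile_sim)  # keep each unordered well pair once
cor(profile_sim[pairs], count_dissim[pairs])
# with the real profiles this relationship is around r = -0.33 for all cells (see above)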

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@gwaygenomics said
This is great! I agree we do not yet know how concerned we should be. One thing we could do (and I think Anne mentioned this once before) is a subsampling experiment where we randomly select cells to more closely match cell counts and ask whether the pairwise Pearson correlation with subsampling differs from the one without subsampling. This would at least rule out technical bias in median collapsing, but it does not rule out technical bias in plating density.

It may also be helpful to ask Matt for phenotype metadata at this time. Maybe all those highly correlated samples with similar cell counts are the same disease state?

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@AnneCarpenter said
I don't think subsampling will yield anything different, so I'd skip that. The more I think about it, the more I think that our goal is not to differentiate one cell line from another. Our goal is to identify individual features that are affected more by cell line identity than by cell density. So while we wish ALL features were unaffected by cell density, we know it won't be the case and we will be ignoring them later in the pipeline. So I think we should just carry on and not treat this as a failed QC step.

@apnathan
Collaborator

@shntnu @TiffanyAmariuta I don't think we've ever used SQLite files before, but they should be importable into R, so we'll see what changes need to be made to our previous code to run the same analyses. I imagine the final data coming out of the experiment will also be on this order of magnitude, so it's probably worth it to start thinking about best approaches for analyzing it.

@bethac07
Contributor

bethac07 commented Sep 4, 2019

Did you check the logs for those two sites?

oooh I should have. Here it is for the first one
[...]
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure


@bethac07 Don't worry about diagnosing right now; it isn't blocking

Aha! Diagnosed: it is due to CellProfiler/CellProfiler#3831

@bethac07
Contributor

bethac07 commented Sep 4, 2019

(That same error hit 8 images in Batch4; the new versions of the pipeline I just added should bypass the issue for Batch5 onward).

@shntnu
Collaborator Author

shntnu commented Sep 10, 2019

Profiles for one plate of 2019_06_10_Batch3 (cmqtlpl1.5-31-2019-mt) are now available in the repo at https://github.com/broadinstitute/cmQTL/tree/master/1.profile-cell-lines/profiles, copied over from the backend. I'll do the same for the others.

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

For the isolated / colony profiles of cmqtlpl261-2019-mt, the following features are all NA (the regular profiles don't have this issue). We don't observe this with any other plate in any batch. I am going to re-run aggregate for cmqtlpl261-2019-mt and see if it reproduces.

Cells_Neighbors_AngleBetweenNeighbors_10
Cells_Neighbors_AngleBetweenNeighbors_Adjacent
Cells_Neighbors_FirstClosestDistance_10
Cells_Neighbors_FirstClosestDistance_Adjacent
Cells_Neighbors_SecondClosestDistance_10
Cells_Neighbors_SecondClosestDistance_Adjacent
Nuclei_Neighbors_AngleBetweenNeighbors_2
Nuclei_Neighbors_FirstClosestDistance_2
Nuclei_Neighbors_SecondClosestDistance_2

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

I am going to re-run aggregate for cmqtlpl261-2019-mt and see if it reproduces

This does reproduce

library(glue)
library(tidyverse)

#plate_id <- "cmqtlpl1.5-31-2019-mt"
plate_id <- "cmqtlpl261-2019-mt"

batch_id <- "2019_06_10_Batch3"

#sc_type <- "isolated_"
sc_type <- "colony"
#sc_type <- ""

file_name <- glue("../../../backend/{batch_id}/{plate_id}/{plate_id}_{sc_type}.csv")

read_csv(
    file_name
  ) %>% 
  select(
  Cells_Neighbors_AngleBetweenNeighbors_10,
  Cells_Neighbors_AngleBetweenNeighbors_Adjacent,
  Cells_Neighbors_FirstClosestDistance_10,
  Cells_Neighbors_FirstClosestDistance_Adjacent,
  Cells_Neighbors_SecondClosestDistance_10,
  Cells_Neighbors_SecondClosestDistance_Adjacent,
  Nuclei_Neighbors_AngleBetweenNeighbors_2,
  Nuclei_Neighbors_FirstClosestDistance_2,
  Nuclei_Neighbors_SecondClosestDistance_2) %>% 
  distinct() %>% 
  head() %>%
  glimpse()
Observations: 1
Variables: 9
$ Cells_Neighbors_AngleBetweenNeighbors_10       <lgl> NA
$ Cells_Neighbors_AngleBetweenNeighbors_Adjacent <lgl> NA
$ Cells_Neighbors_FirstClosestDistance_10        <lgl> NA
$ Cells_Neighbors_FirstClosestDistance_Adjacent  <lgl> NA
$ Cells_Neighbors_SecondClosestDistance_10       <lgl> NA
$ Cells_Neighbors_SecondClosestDistance_Adjacent <lgl> NA
$ Nuclei_Neighbors_AngleBetweenNeighbors_2       <lgl> NA
$ Nuclei_Neighbors_FirstClosestDistance_2        <lgl> NA
$ Nuclei_Neighbors_SecondClosestDistance_2       <lgl> NA

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

@gwaygenomics any clue why this might be happening:

The features listed above are all NA in the colony and isolated versions of the profiles, but not in the regular version. This happens only for one plate.

I think it might come down to these lines, but I don't see anything offending here:

https://github.com/broadinstitute/cytominer_scripts/blob/issues/29/aggregate.R#L50-L54

https://github.com/broadinstitute/cytominer_scripts/pull/30/files#diff-74eeb9ecf93aa8a5256641e5c7605ee0

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

@gwaygenomics hold off on investigating – I think I know the issue

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

I still couldn't narrow it down, but what I found is that even the regular version of the profiles has this problem, for that plate alone.

aggregate.R on the current master branch works fine, but the version on the current issues/29 branch does not.

Here's how to diagnose (these are regular profiles, not colony/isolated):

issues/29:

$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_recomputed.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
NA

master:

$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
20
27.37795303974804
92.29368296583094
98.18367406208154
77.57808586074458
84.75963089557231
84.09949372076152
89.88335916308186

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

I tried reproducing as below, but wasn't able to do so when probing just a couple of features at a time.

db <- "/home/ubuntu/ebs_tmp/2018_06_05_cmQTL/workspace/backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.sqlite"

suppressWarnings(suppressMessages(library(docopt)))

suppressWarnings(suppressMessages(library(dplyr)))

suppressWarnings(suppressMessages(library(magrittr)))

suppressWarnings(suppressMessages(library(stringr)))

db <- src_sqlite(path = db)

image <- tbl(src = db, "image") %>%
  select(TableNumber, ImageNumber, Metadata_Plate, Metadata_Well)

compartment <- "cells"

object <- tbl(src = db, compartment)

object %<>% inner_join(image, by = c("TableNumber", "ImageNumber"))

compartment_tag <-
    str_c("^", str_sub(compartment, 1, 1) %>% str_to_upper(), str_sub(compartment, 2), "_")

variables <- c("Cells_Neighbors_AngleBetweenNeighbors_10", "Cells_Neighbors_SecondClosestDistance_Adjacent")

cells_profiles <- cytominer::aggregate(
    population = object,
    variables = variables,
    strata = c("Metadata_Plate", "Metadata_Well"),
    operation = "mean"
  ) %>% collect()
Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
> cells_profiles
# A tibble: 384 x 4
   Metadata_Plate     Metadata_Well Cells_Neighbors_Angl… Cells_Neighbors_Seco…
   <chr>              <chr>                         <dbl>                 <dbl>
 1 cmqtlpl261-2019-mt A01                            20                    604.
 2 cmqtlpl261-2019-mt A02                            27.4                  339.
 3 cmqtlpl261-2019-mt A03                            92.3                  130.
 4 cmqtlpl261-2019-mt A04                            98.2                  109.
 5 cmqtlpl261-2019-mt A05                            77.6                  411.
 6 cmqtlpl261-2019-mt A06                            84.8                  333.
 7 cmqtlpl261-2019-mt A07                            84.1                  205.
 8 cmqtlpl261-2019-mt A08                            89.9                  167.
 9 cmqtlpl261-2019-mt A09                            76.8                  252.
10 cmqtlpl261-2019-mt A10                            79.0                  199.

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

Given the time constraints, I'm making the call to skip fixing this issue in the colony and isolated versions of this plate.

Specifically, the features listed above will be all NA in the files listed below:

cmqtlpl261-2019-mt_colony_normalized_variable_selected_augmented.csv
cmqtlpl261-2019-mt_colony_normalized_variable_selected.csv
cmqtlpl261-2019-mt_colony_normalized.csv

To make aggregate.R work with all-NA columns, I made this minor edit

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

Line counts look ok

385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_colony_normalized.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated_normalized.csv

385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_colony_normalized_variable_selected.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated_normalized_variable_selected.csv

373 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_colony_normalized.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_isolated_normalized.csv

373 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_colony_normalized_variable_selected.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_isolated_normalized_variable_selected.csv
