Process new batch of data 2019_06_10_Batch3 #14

Closed
shntnu opened this issue Jun 25, 2019 · 49 comments
Labels
Experiments Tracking experimental questions, results, or analysis

Comments

@shntnu
Collaborator

shntnu commented Jun 25, 2019

No description provided.

@shntnu shntnu self-assigned this Jun 25, 2019
@shntnu shntnu added the Experiments Tracking experimental questions, results, or analysis label Jun 25, 2019
@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

@bethac07 Two questions for you

  1. Is this the pipeline set you used for the previous batch? https://github.com/broadinstitute/imaging-platform-pipelines/tree/master/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3

  2. I've created DCP config files here. I modified my previous configs based on your most recent config. I just wanted to double-check: is it indeed ok to change the ECS cluster name to default?

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

I think I am having CP3-related issues.

This step, where CP is used to print out groups given a batch file, no longer works.

I get

  File "/usr/local/bin/cellprofiler", line 11, in <module>
    load_entry_point('CellProfiler', 'console_scripts', 'cellprofiler')()
  File "/usr/local/src/CellProfiler/cellprofiler/__main__.py", line 122, in main
    raise ValueError("You must specify a pipeline filename to run")
ValueError: You must specify a pipeline filename to run

but then when I specify a pipeline via --pipeline=/pipeline_dir/illum.cppipe, I get

Traceback (most recent call last):
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2097, in prepare_run
    if ((not module.prepare_run(workspace)) or
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 841, in prepare_run
    fd = self.open_csv()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 742, in open_csv
    return open(self.csv_path, 'rb')
IOError: [Errno 2] No such file or directory: u'/root/Desktop\\load_data_csv\\2019_05_13_Batch2\\BR00103267/load_data.csv'

@bethac07
Contributor

Try specifying the batch file in the --pipeline

@bethac07
Contributor

Re: your two questions, yes and yes.

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

Try specifying the batch file in the --pipeline

I get a similar error. Maybe this is a problem upstream, when creating the batch file?

Error detected during run of module LoadData
Traceback (most recent call last):
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
    self.run_module(module, workspace)
  File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
    module.run(workspace)
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 1194, in run
    objects_names = self.get_object_names()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 801, in get_object_names
    header = self.get_header(do_not_cache=do_not_cache)
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 775, in get_header
    entry = self.get_cache_info()
  File "/usr/local/src/CellProfiler/cellprofiler/modules/loaddata.py", line 705, in get_cache_info
    ctime = os.stat(self.csv_path).st_ctime
OSError: [Errno 2] No such file or directory: '/root/Desktop/load_data_csv/2019_05_13_Batch2/BR00103267/load_data.csv'

I ran this

ubuntu@ip-10-0-4-54:~/efs/2018_06_05_cmQTL/workspace/software/cellpainting_scripts$ docker run -e S6_LOGGING=1 --rm \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3:/pipeline_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/filelist/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt:/filelist_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt:/datafile_dir \
  --volume=/home/ubuntu/efs/2018_06_05_cmQTL/workspace/batchfiles/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/illum:/batchfile_dir \
  --volume=/tmp:/tmp_dir \
  --volume=/home/ubuntu/bucket/:/home/ubuntu/bucket/ \
  cellprofiler/cellprofiler:3.1.8 \
  --pipeline=/batchfile_dir/Batch_data.h5 --print-groups=/batchfile_dir/Batch_data.h5

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

Strangely, although it errors, the config files get created without any trouble. I'll continue processing, and you can ignore this error for now @bethac07

@bethac07
Contributor

bethac07 commented Jun 25, 2019

I'm really not sure, sorry; my best guess is to try to change both of the "location" settings from things containing "Default Input Folder" to "Elsewhere" and see if that helps. I don't use that method for making config and job files, so I'd need to delve into it more to diagnose.

(screenshot attached)

@shntnu
Collaborator Author

shntnu commented Jun 25, 2019

I'm really not sure, sorry; my best guess is to try to change both of the "location" settings from things containing "Default Input Folder" to "Elsewhere" and see if that helps. I don't use that method for making config and job files, so I'd need to delve into it more to diagnose.

Wow, thanks a lot for digging into this. Good news is that things seem to be ok in spite of that error. I'll link to this issue in the handbook so that we have some record of it.

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

@gwaygenomics Did we decide whether to split profiles by colony / non-colony going forward?

@gwaybio
Member

gwaybio commented Jun 26, 2019

Did we decide whether to split profiles by colony / non-colony going forward?

Yes, we did decide to split. However, I think it would be good to process complete aggregate profiles too.

@gwaybio
Member

gwaybio commented Jun 26, 2019

@shntnu the split filters are described in #9 (comment)
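
For context, the split boils down to labeling each cell by a neighbors-type feature before aggregating; a minimal sketch in R (the feature name, threshold, and toy data below are illustrative only; the actual filters are the ones defined in #9):

library(dplyr)

# Toy per-cell table; the feature and threshold are placeholders, not the #9 filters
single_cells <- tibble(
  ObjectNumber = 1:4,
  Cells_Neighbors_NumberOfNeighbors_Adjacent = c(0, 3, 1, 0)
)

single_cells %>%
  mutate(sc_type = if_else(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0,
                           "isolated", "colony"))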

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

I am processing only 1 plate for now, and for some reason 2 images in this set are not getting processed. Beth helped diagnose, and her best guess was that the machines they were running on had crashed, so she did various things in AWS to restart the process. The same two images are stuck again (I purged the queue), so it is very likely something to do with the images themselves.

In the interest of time, I'm going to skip these two images (out of 384 * 9) and proceed

2019-06-26 18:46:49.918626 In process: 0 Pending 3456
2019-06-26 18:47:49.990986 In process: 2 Pending 3379
2019-06-26 18:48:50.063506 In process: 2 Pending 3224
2019-06-26 18:49:50.137452 In process: 2 Pending 3069
2019-06-26 18:50:50.317943 In process: 4 Pending 2919
2019-06-26 18:51:50.413163 In process: 4 Pending 2768
2019-06-26 18:52:50.486628 In process: 4 Pending 2624
2019-06-26 18:53:50.558921 In process: 6 Pending 2481
2019-06-26 18:54:50.632221 In process: 6 Pending 2345
2019-06-26 18:55:50.706765 In process: 6 Pending 2210
2019-06-26 18:56:50.912252 In process: 8 Pending 2071
2019-06-26 18:57:50.986726 In process: 8 Pending 1939
2019-06-26 18:58:51.059642 In process: 8 Pending 1791
2019-06-26 18:59:51.247832 In process: 8 Pending 1659
2019-06-26 19:00:51.323514 In process: 8 Pending 1513
2019-06-26 19:01:51.420210 In process: 8 Pending 1373
2019-06-26 19:02:51.492810 In process: 8 Pending 1234
2019-06-26 19:03:51.565966 In process: 8 Pending 1092
2019-06-26 19:04:51.638489 In process: 8 Pending 958
2019-06-26 19:05:51.709394 In process: 8 Pending 830
2019-06-26 19:06:51.807689 In process: 8 Pending 703
2019-06-26 19:07:51.881262 In process: 8 Pending 563
2019-06-26 19:08:51.956042 In process: 8 Pending 416
2019-06-26 19:09:52.040915 In process: 8 Pending 283
2019-06-26 19:10:52.121361 In process: 8 Pending 162
2019-06-26 19:11:52.227774 In process: 8 Pending 47
2019-06-26 19:12:52.306046 In process: 2 Pending 0
2019-06-26 19:42:55.310149 In process: 1 Pending 1
2019-06-26 19:43:55.387417 In process: 2 Pending 0
2019-06-26 19:47:55.725168 In process: 1 Pending 1

@bethac07
Contributor

bethac07 commented Jun 26, 2019 via email

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

Did you check the logs for those two sites?

oooh I should have. Here it is for the first one
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=2018_06_05_cmQTL_Analysis;stream=cmqtlpl1.5-31-2019-mt-P05-3;start=2019-06-25T19:52:36Z

cellprofiler -c -r -p /home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3/analysis_without_batchfile.cppipe -i /home/ubuntu/bucket/dummy -o /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3 -d /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3/cp.is.done --data-file=/home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/load_data_with_illum.csv -g Metadata_Plate=cmqtlpl1.5-31-2019-mt,Metadata_Well=P05,Metadata_Site=3
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Times reported are CPU and Wall-clock times for each module
Wed Jun 26 19:11:51 2019: Image # 3279, module LoadData # 1: CPU_time = 10.56 secs, Wall_time = 16.52 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module CorrectIlluminationApply # 2: CPU_time = 0.14 secs, Wall_time = 0.14 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module MeasureImageQuality # 3: CPU_time = 85.38 secs, Wall_time = 88.39 secs
Wed Jun 26 19:13:36 2019: Image # 3279, module MeasureImageQuality # 4: CPU_time = 1.71 secs, Wall_time = 1.71 secs
Wed Jun 26 19:13:38 2019: Image # 3279, module EnhanceOrSuppressFeatures # 6: CPU_time = 24.01 secs, Wall_time = 24.00 secs
Wed Jun 26 19:14:02 2019: Image # 3279, module IdentifyPrimaryObjects # 7: CPU_time = 6.03 secs, Wall_time = 6.02 secs
Wed Jun 26 19:14:08 2019: Image # 3279, module IdentifySecondaryObjects # 8: CPU_time = 6.43 secs, Wall_time = 6.43 secs
Wed Jun 26 19:14:14 2019: Image # 3279, module IdentifyTertiaryObjects # 9: CPU_time = 1.23 secs, Wall_time = 1.24 secs
Wed Jun 26 19:14:16 2019: Image # 3279, module MeasureColocalization # 10: CPU_time = 595.47 secs, Wall_time = 595.08 secs
Wed Jun 26 19:24:11 2019: Image # 3279, module MeasureGranularity # 11: CPU_time = 93.04 secs, Wall_time = 92.98 secs
Wed Jun 26 19:25:44 2019: Image # 3279, module MeasureObjectIntensity # 12: CPU_time = 35.56 secs, Wall_time = 35.54 secs
Wed Jun 26 19:26:19 2019: Image # 3279, module MeasureObjectNeighbors # 13: CPU_time = 5.36 secs, Wall_time = 5.36 secs
Wed Jun 26 19:26:25 2019: Image # 3279, module MeasureObjectNeighbors # 14: CPU_time = 2.09 secs, Wall_time = 2.08 secs
Wed Jun 26 19:26:27 2019: Image # 3279, module MeasureObjectNeighbors # 15: CPU_time = 2.25 secs, Wall_time = 2.24 secs
Wed Jun 26 19:26:29 2019: Image # 3279, module MeasureObjectIntensityDistribution # 16: CPU_time = 22.41 secs, Wall_time = 22.40 secs
Wed Jun 26 19:26:51 2019: Image # 3279, module MeasureObjectSizeShape # 17: CPU_time = 48.30 secs, Wall_time = 48.27 secs
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:141: UserWarning: Possible precision loss when converting from float32 to uint8
.format(dtypeobj_in, dtypeobj_out))
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure

and here is the second one
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=2018_06_05_cmQTL_Analysis;stream=cmqtlpl1.5-31-2019-mt-P05-3;start=2019-06-25T19:52:36Z

cellprofiler -c -r -p /home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/github/imaging-platform-pipelines/cellpainting_ipsc_20x_phenix_with_bf_bin1_cp3/analysis_without_batchfile.cppipe -i /home/ubuntu/bucket/dummy -o /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3 -d /home/ubuntu/local_output/cmqtlpl1.5-31-2019-mt-P05-3/cp.is.done --data-file=/home/ubuntu/bucket/projects/2018_06_05_cmQTL/workspace/load_data_csv/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/load_data_with_illum.csv -g Metadata_Plate=cmqtlpl1.5-31-2019-mt,Metadata_Well=P05,Metadata_Site=3
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Experiment-wide values for mean threshold, etc calculated by MeasureImageQuality may be incorrect if the run is split into subsets of images.
Times reported are CPU and Wall-clock times for each module
Wed Jun 26 19:11:51 2019: Image # 3279, module LoadData # 1: CPU_time = 10.56 secs, Wall_time = 16.52 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module CorrectIlluminationApply # 2: CPU_time = 0.14 secs, Wall_time = 0.14 secs
Wed Jun 26 19:12:08 2019: Image # 3279, module MeasureImageQuality # 3: CPU_time = 85.38 secs, Wall_time = 88.39 secs
Wed Jun 26 19:13:36 2019: Image # 3279, module MeasureImageQuality # 4: CPU_time = 1.71 secs, Wall_time = 1.71 secs
Wed Jun 26 19:13:38 2019: Image # 3279, module EnhanceOrSuppressFeatures # 6: CPU_time = 24.01 secs, Wall_time = 24.00 secs
Wed Jun 26 19:14:02 2019: Image # 3279, module IdentifyPrimaryObjects # 7: CPU_time = 6.03 secs, Wall_time = 6.02 secs
Wed Jun 26 19:14:08 2019: Image # 3279, module IdentifySecondaryObjects # 8: CPU_time = 6.43 secs, Wall_time = 6.43 secs
Wed Jun 26 19:14:14 2019: Image # 3279, module IdentifyTertiaryObjects # 9: CPU_time = 1.23 secs, Wall_time = 1.24 secs
Wed Jun 26 19:14:16 2019: Image # 3279, module MeasureColocalization # 10: CPU_time = 595.47 secs, Wall_time = 595.08 secs
Wed Jun 26 19:24:11 2019: Image # 3279, module MeasureGranularity # 11: CPU_time = 93.04 secs, Wall_time = 92.98 secs
Wed Jun 26 19:25:44 2019: Image # 3279, module MeasureObjectIntensity # 12: CPU_time = 35.56 secs, Wall_time = 35.54 secs
Wed Jun 26 19:26:19 2019: Image # 3279, module MeasureObjectNeighbors # 13: CPU_time = 5.36 secs, Wall_time = 5.36 secs
Wed Jun 26 19:26:25 2019: Image # 3279, module MeasureObjectNeighbors # 14: CPU_time = 2.09 secs, Wall_time = 2.08 secs
Wed Jun 26 19:26:27 2019: Image # 3279, module MeasureObjectNeighbors # 15: CPU_time = 2.25 secs, Wall_time = 2.24 secs
Wed Jun 26 19:26:29 2019: Image # 3279, module MeasureObjectIntensityDistribution # 16: CPU_time = 22.41 secs, Wall_time = 22.40 secs
Wed Jun 26 19:26:51 2019: Image # 3279, module MeasureObjectSizeShape # 17: CPU_time = 48.30 secs, Wall_time = 48.27 secs
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:141: UserWarning: Possible precision loss when converting from float32 to uint8
.format(dtypeobj_in, dtypeobj_out))
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure

@bethac07 Don't worry about diagnosing right now; it isn't blocking

@shntnu
Collaborator Author

shntnu commented Jun 26, 2019

The total size of the CSV files is ~70 GB, which means the SQLite file will be ~35 GB, i.e. ~3x the size in the pilots.

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

A sneak peek into the data suggests that there is no evidence that cell ID (and thus settling time) is correlated with cell count, which is good!

(two plots attached)

@bethac07
Contributor

It's actually not good, because it means we're back to square one figuring out why this is happening...

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019 via email

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019 via email

@bethac07
Contributor

bethac07 commented Jun 27, 2019 via email

@gwaybio
Member

gwaybio commented Jun 27, 2019

there is no evidence that cell id (and thus settling time) is correlated with cell count

I'm not clear on how the graph represents this? Isn't cell ID a nominal variable?

Based on the figure, I would think cell count is associated with cell ID: cell IDs with low cell counts have low cell counts across replicates, while cell IDs with high cell counts tend to have high variance across replicates.

My interpretation is that cell count is somewhat associated with cell ID, but potentially not entirely explained by it, given the high variance across replicates (assuming that all cells were plated at approximately the same density).

@gwaybio
Member

gwaybio commented Jun 27, 2019

This is an interesting figure (generated here)

(figure attached)

It's not a particularly strong correlation, but it hints at plate density impacting replicate correlation.

Is this true for isolated, colony, and aggregated profiles?

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

Isn't cell ID a nominal variable?

It is also the order in which the cells were plated.

@shntnu
Collaborator Author

shntnu commented Jun 27, 2019

It turns out that the isolated + colony profiles did not get computed after all because of memory errors. I checked https://github.com/broadinstitute/cytominer_scripts/blob/8369881bc0a7cb22a58bd376b4240abd5709e5da/aggregate.R and there's not much we can do there to avoid this. I'll get a larger instance to run this step.

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

A larger instance doesn't seem to have helped either – the instance ran for a couple of days (!) and then died. I'll try on an even larger VM, and if that fails, try modifying the code.

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

The process gets killed while aggregating cytoplasm

INFO [2019-07-08 15:25:16] Started aggregating cells
INFO [2019-07-08 15:30:30] Started aggregating cytoplasm

Nothing in stderr, but this is the parallel log:

Seq     Host    Starttime       Runtime Send    Receive Exitval Signal  Command
1       :       1562599514.964  1316.246        0       0       0       9       ./aggregate.R --sqlite_file /home/ubuntu/ebs_tmp/2018_06_05_cmQTL/workspace/backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt.sqlite --output ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated.csv --sc_type isolated

The process ended at 1562599514.964 + 1316.246 = 15:47 UTC, so it looks like the cytoplasm aggregation ran for ~17 minutes, whereas the cells aggregation ran for ~5 minutes.
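
(For reference, converting that epoch arithmetic from the parallel log back to a timestamp in R:)

as.POSIXct(1562599514.964 + 1316.246, origin = "1970-01-01", tz = "UTC")
# [1] "2019-07-08 15:47:11 UTC"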

@shntnu
Collaborator Author

shntnu commented Jul 8, 2019

OK, so this is definitely a memory issue (it takes up >30 GB), but for some reason colony works while isolated gets killed

@shntnu
Collaborator Author

shntnu commented Jul 9, 2019

Finally, this worked! We needed a 64 GB instance (r5.2xlarge) for this to complete successfully

@gwaybio
Member

gwaybio commented Jul 9, 2019

Wow, that is some serious compute! Do we know which step this is required for? Is it only one step, or is it all of them?

@shntnu
Collaborator Author

shntnu commented Jul 9, 2019

Only the colony aggregation, strangely. I see why isolated was OK – there are fewer cells. But I am surprised that the "regular" aggregation (no split) did not need as much memory. It is possible that the modified code is in some way more memory intensive, but I haven't delved into that yet.

@shntnu
Collaborator Author

shntnu commented Jul 11, 2019

Reusing the variable selection from the "regular" (all cells) profiles did not work so well because of NAs in the normalized data. This is likely because more features have near-zero variance once you filter out cells.

I am rerunning the variable selection for colony and isolated
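
To illustrate the mechanism with a toy example (this is not the actual normalization code): a feature that is effectively constant within the filtered subset has zero spread, so the scaling step of normalization divides by zero and the whole column comes out NaN/NA.

library(dplyr)

# Toy data: after filtering to one subset, this feature happens to be constant
filtered <- tibble(Cells_SomeFeature = c(0.2, 0.2, 0.2, 0.2))

filtered %>%
  mutate(Cells_SomeFeature_scaled =
           (Cells_SomeFeature - mean(Cells_SomeFeature)) / sd(Cells_SomeFeature))
# sd() is 0 here, so the scaled column is NaN for every row; variable selection
# then sees an all-missing feature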

@shntnu
Collaborator Author

shntnu commented Jul 11, 2019

Meanwhile @apnathan @TiffanyAmariuta – please go ahead and use the single cell data uploaded here to do the analysis at your end

@shntnu
Collaborator Author

shntnu commented Jul 15, 2019

As of bed396c, we have results from all 3 sets of profiles (all, colony, isolated).

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

Meanwhile @apnathan @TiffanyAmariuta – please go ahead and use the single cell data uploaded here to do the analysis at your end

@apnathan asked:

What format is the data in at that AWS link? Last time we used the matrices uploaded by Greg here, which I think are similar to what you've sent me in the past: https://github.com/broadinstitute/cmQTL/tree/master/0.pilot-determine-conditions/data Are these files also available for the new data?

@apnathan the data is in SQLite format. I tried to convert it to TSV using @gwaygenomics's code here, but the files are too large to fit in memory for that approach to work.

Are you able to work with SQLite files?
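
(In case it helps, the backend SQLite can be queried lazily from R so the full tables never have to fit in memory; a minimal sketch, with an illustrative file path and well filter:)

library(DBI)
library(RSQLite)
library(dplyr)  # dbplyr also needs to be installed for tbl() on a database connection

# Point this at a plate's backend .sqlite file (path is illustrative)
db <- dbConnect(SQLite(), "cmqtlpl1.5-31-2019-mt.sqlite")

image <- tbl(db, "image") %>%
  select(TableNumber, ImageNumber, Metadata_Plate, Metadata_Well)

cells <- tbl(db, "cells")  # other compartments (nuclei, cytoplasm) are separate tables

# dplyr translates this to SQL; rows only come into R's memory at collect()
a01_cells <- cells %>%
  inner_join(image, by = c("TableNumber", "ImageNumber")) %>%
  filter(Metadata_Well == "A01") %>%
  collect()

dbDisconnect(db)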

shntnu added a commit that referenced this issue Jul 16, 2019
Process new batch of data
@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@gwaygenomics @bethac07 @AnneCarpenter

We had previously observed that the similarity between profiles and the (dis)similarity of their cell counts (i.e., the absolute difference between counts) are related (r = -0.329), so we wondered whether this goes away when we consider only colonies and only isolated cells. We found that the similarities are still related (colony r = -0.291; isolated r = -0.289). We then tested whether this was driven by cell lines that have too many (>4000 measured) or too few (<1000 measured) cells. The relationship between the similarities is in fact a bit stronger when we filter these out (all cells r = -0.396; colony r = -0.395; isolated r = -0.335).

In summary, profile similarity and cell count similarity are certainly related, but I haven't thought through to what extent we should be concerned about this and whether it should affect our protocol.


In this notebook, search for "Report relationship between profiles similarity and cell count similarity" (about 3/4 of the way down) to see the plots described above.
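
For the record, the comparison reduces to correlating two pairwise quantities, roughly as follows (a sketch with toy inputs; the notebook linked above is the source of truth):

# Toy inputs: 20 wells x 5 features, plus per-well cell counts
set.seed(1)
profiles   <- matrix(rnorm(20 * 5), nrow = 20)
cell_count <- rpois(20, lambda = 2000)

profile_sim  <- cor(t(profiles), method = "pearson")     # well-by-well profile similarity
count_dissim <- abs(outer(cell_count, cell_count, "-"))  # absolute difference in cell counts

pairs <- upper.tri(profile_sim)  # keep each unordered well pair once
cor(profile_sim[pairs], count_dissim[pairs])
# with the real profiles this relationship is around r = -0.33 for all cells (see above)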

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@gwaygenomics said
This is great! I agree we do not yet know how concerned we should be. One thing we could do (and I think Anne mentioned this once before) is a subsampling experiment where we randomly select cells to more closely match cell counts and ask whether the pairwise Pearson correlation with subsampling differs from the one without subsampling. This would at least rule out technical bias in median collapsing, but it does not rule out technical bias in plating density.

It may also be helpful to ask Matt for phenotype metadata at this time. Maybe all those highly correlated samples with similar cell counts are the same disease state?

@shntnu
Collaborator Author

shntnu commented Jul 16, 2019

@AnneCarpenter said
I don't think subsampling will yield anything different, so I'd skip that. The more I think about it, the more I think that our goal is not to differentiate one cell line from another. Our goal is to identify individual features that are affected more by cell line identity than by cell density. So while we wish ALL features were unaffected by cell density, we know it won't be the case and we will be ignoring them later in the pipeline. So I think we should just carry on and not treat this as a failed QC step.

@apnathan
Collaborator

@shntnu @TiffanyAmariuta I don't think we've ever used SQLite files before, but they should be importable into R, so we'll see what changes need to be made to our previous code to run the same analyses. I imagine the final data coming out of the experiment will also be on this order of magnitude, so it's probably worth it to start thinking about best approaches for analyzing it.

@bethac07
Contributor

bethac07 commented Sep 4, 2019

Did you check the logs for those two sites?

oooh I should have. Here it is for the first one
[...]
Error detected during run of module MeasureTexture
Traceback (most recent call last):
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 1782, in run_with_yield
self.run_module(module, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/pipeline.py", line 2034, in run_module
module.run(workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 544, in run
statistics += self.run_image(image_name, scale, workspace)
File "/usr/local/src/CellProfiler/cellprofiler/modules/measuretexture.py", line 640, in run_image
pixel_data = skimage.util.img_as_ubyte(image.pixel_data)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 492, in img_as_ubyte
return convert(image, np.uint8, force_copy)
File "/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py", line 261, in convert
raise ValueError("Images of type float must be between -1 and 1.")
ValueError: Images of type float must be between -1 and 1.
Wed Jun 26 19:27:40 2019: Image # 3279, module MeasureTexture # 18: CPU_time = 1095.07 secs, Wall_time = 1094.46 secs
CP PROBLEM: Done file reports failure


@bethac07 Don't worry about diagnosing right now; it isn't blocking

Aha! Diagnosed: it is due to CellProfiler/CellProfiler#3831

@bethac07
Contributor

bethac07 commented Sep 4, 2019

(That same error hit 8 images in Batch4; the new versions of the pipeline I just added should bypass the issue for Batch5 onward).

@shntnu
Collaborator Author

shntnu commented Sep 10, 2019

Profiles for one plate of 2019_06_10_Batch3 (cmqtlpl1.5-31-2019-mt) are now available in the repo at https://github.com/broadinstitute/cmQTL/tree/master/1.profile-cell-lines/profiles, copied over from the backend. I'll do the same for the others.

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

For the isolated / colony profiles of cmqtlpl261-2019-mt, the following features are all NA (the regular profiles don't have this issue). We don't observe this with any other plate in any batch. I am going to re-run aggregate for cmqtlpl261-2019-mt and see if it reproduces.

Cells_Neighbors_AngleBetweenNeighbors_10
Cells_Neighbors_AngleBetweenNeighbors_Adjacent
Cells_Neighbors_FirstClosestDistance_10
Cells_Neighbors_FirstClosestDistance_Adjacent
Cells_Neighbors_SecondClosestDistance_10
Cells_Neighbors_SecondClosestDistance_Adjacent
Nuclei_Neighbors_AngleBetweenNeighbors_2
Nuclei_Neighbors_FirstClosestDistance_2
Nuclei_Neighbors_SecondClosestDistance_2

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

I am going to re-run aggregate for cmqtlpl261-2019-mt and see if it reproduces

This does reproduce

library(glue)
library(tidyverse)

#plate_id <- "cmqtlpl1.5-31-2019-mt"
plate_id <- "cmqtlpl261-2019-mt"

batch_id <- "2019_06_10_Batch3"

#sc_type <- "isolated_"
sc_type <- "colony"
#sc_type <- ""

file_name <- glue("../../../backend/{batch_id}/{plate_id}/{plate_id}_{sc_type}.csv")

read_csv(
    file_name
  ) %>% 
  select(
  Cells_Neighbors_AngleBetweenNeighbors_10,
  Cells_Neighbors_AngleBetweenNeighbors_Adjacent,
  Cells_Neighbors_FirstClosestDistance_10,
  Cells_Neighbors_FirstClosestDistance_Adjacent,
  Cells_Neighbors_SecondClosestDistance_10,
  Cells_Neighbors_SecondClosestDistance_Adjacent,
  Nuclei_Neighbors_AngleBetweenNeighbors_2,
  Nuclei_Neighbors_FirstClosestDistance_2,
  Nuclei_Neighbors_SecondClosestDistance_2) %>% 
  distinct() %>% 
  head() %>%
  glimpse()
Observations: 1
Variables: 9
$ Cells_Neighbors_AngleBetweenNeighbors_10       <lgl> NA
$ Cells_Neighbors_AngleBetweenNeighbors_Adjacent <lgl> NA
$ Cells_Neighbors_FirstClosestDistance_10        <lgl> NA
$ Cells_Neighbors_FirstClosestDistance_Adjacent  <lgl> NA
$ Cells_Neighbors_SecondClosestDistance_10       <lgl> NA
$ Cells_Neighbors_SecondClosestDistance_Adjacent <lgl> NA
$ Nuclei_Neighbors_AngleBetweenNeighbors_2       <lgl> NA
$ Nuclei_Neighbors_FirstClosestDistance_2        <lgl> NA
$ Nuclei_Neighbors_SecondClosestDistance_2       <lgl> NA

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

@gwaygenomics any clue why this might be happening:

The features listed above are all NA in the colony and isolated versions of the profiles, but not in the regular version. This happens only for one plate.

I think it might come down to these lines, but I don't see anything offending here:

https://github.com/broadinstitute/cytominer_scripts/blob/issues/29/aggregate.R#L50-L54

https://github.com/broadinstitute/cytominer_scripts/pull/30/files#diff-74eeb9ecf93aa8a5256641e5c7605ee0

@shntnu
Collaborator Author

shntnu commented Oct 17, 2019

@gwaygenomics hold off on investigating – I think I know the issue

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

I still couldn't narrow it down, but what I found is that even the regular version of the profiles has this problem, for that plate alone.

aggregate.R on the current master branch works fine, but the version on the current issues/29 branch does not.

Here's how to diagnose (these are regular profiles, not colony/isolated):

issues/29:

$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_recomputed.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
NA

master:

$ csvcut -c Cells_Neighbors_AngleBetweenNeighbors_10 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.csv|uniq|head
Cells_Neighbors_AngleBetweenNeighbors_10
20
27.37795303974804
92.29368296583094
98.18367406208154
77.57808586074458
84.75963089557231
84.09949372076152
89.88335916308186

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

I tried reproducing as below, but wasn't able to do so when probing just a couple of features at a time.

db <- "/home/ubuntu/ebs_tmp/2018_06_05_cmQTL/workspace/backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt.sqlite"

suppressWarnings(suppressMessages(library(docopt)))

suppressWarnings(suppressMessages(library(dplyr)))

suppressWarnings(suppressMessages(library(magrittr)))

suppressWarnings(suppressMessages(library(stringr)))

db <- src_sqlite(path = db)

image <- tbl(src = db, "image") %>%
  select(TableNumber, ImageNumber, Metadata_Plate, Metadata_Well)

compartment <- "cells"

object <- tbl(src = db, compartment)

object %<>% inner_join(image, by = c("TableNumber", "ImageNumber"))

compartment_tag <-
    str_c("^", str_sub(compartment, 1, 1) %>% str_to_upper(), str_sub(compartment, 2), "_")

variables <- c("Cells_Neighbors_AngleBetweenNeighbors_10", "Cells_Neighbors_SecondClosestDistance_Adjacent")

cells_profiles <- cytominer::aggregate(
    population = object,
    variables = variables,
    strata = c("Metadata_Plate", "Metadata_Well"),
    operation = "mean"
  ) %>% collect()
Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
> cells_profiles
# A tibble: 384 x 4
   Metadata_Plate     Metadata_Well Cells_Neighbors_Angl… Cells_Neighbors_Seco…
   <chr>              <chr>                         <dbl>                 <dbl>
 1 cmqtlpl261-2019-mt A01                            20                    604.
 2 cmqtlpl261-2019-mt A02                            27.4                  339.
 3 cmqtlpl261-2019-mt A03                            92.3                  130.
 4 cmqtlpl261-2019-mt A04                            98.2                  109.
 5 cmqtlpl261-2019-mt A05                            77.6                  411.
 6 cmqtlpl261-2019-mt A06                            84.8                  333.
 7 cmqtlpl261-2019-mt A07                            84.1                  205.
 8 cmqtlpl261-2019-mt A08                            89.9                  167.
 9 cmqtlpl261-2019-mt A09                            76.8                  252.
10 cmqtlpl261-2019-mt A10                            79.0                  199.

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

Given the time constraints, I'm making the call to skip fixing this issue in the colony and isolated versions of this plate.

Specifically, the features listed above will be all NA in the files listed below:

cmqtlpl261-2019-mt_colony_normalized_variable_selected_augmented.csv
cmqtlpl261-2019-mt_colony_normalized_variable_selected.csv
cmqtlpl261-2019-mt_colony_normalized.csv

To make aggregate.R work with all-NA columns, I made this minor edit

@shntnu
Collaborator Author

shntnu commented Oct 18, 2019

Line counts look ok

385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_colony_normalized.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated_normalized.csv

385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_colony_normalized_variable_selected.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl1.5-31-2019-mt/cmqtlpl1.5-31-2019-mt_isolated_normalized_variable_selected.csv

373 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_colony_normalized.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_isolated_normalized.csv

373 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_colony_normalized_variable_selected.csv
385 ../../backend/2019_06_10_Batch3/cmqtlpl261-2019-mt/cmqtlpl261-2019-mt_isolated_normalized_variable_selected.csv
