
CellProfiler, Image-based Profiling Pipelines, and Missing Values #79

Open
gwaybio opened this issue May 14, 2020 · 13 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@gwaybio
Member

gwaybio commented May 14, 2020

A common issue that keeps surfacing involves how missing values are handled in CellProfiler, and, subsequently, in downstream image-based profiling pipelines.

There appear to be many somewhat independent issues around this problem. I was not sure where to file this issue, since it does seem to permeate into many other codebases. I will attempt to outline the issue here.

  1. Different CellProfiler versions output missing values with different notations
    • Specifically, CellProfiler >= 3 outputs na while older versions (< 3) output NaN (a read-time normalization sketch follows this list)
    • This should be specified and deliberately coded. It may already be deliberately coded in CP3, and, if so, this component should be considered solved. (cc @DavidStirling)
  2. The first step after CellProfiler in a traditional image-based profiling pipeline is to call cytominer-database to ingest all compartment .csvs into a single .sqlite database.
  3. There are additional differences in how pandas (Python) and dplyr (R) extract missing values from the .sqlite backend.
  4. The pandas 1.0 release also updated how missing values are handled
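To make the notation differences in point 1 concrete, here is a minimal sketch of normalizing the different tokens at read time with pandas. The file name and the token list are assumptions for illustration (covering the na / NaN notations mentioned above plus other common spellings), not an exhaustive CellProfiler specification:

```python
import pandas as pd

# Hypothetical compartment .csv; the token list is an assumption covering the
# notations discussed above, not an official CellProfiler list.
MISSING_TOKENS = ["na", "NA", "nan", "NaN", "null", "NULL", "None"]

cells = pd.read_csv("Cells.csv", na_values=MISSING_TOKENS, keep_default_na=True)

# Every recognized token is now a float NaN, so downstream code only has to
# handle a single missing value representation.
print(cells.isnull().sum())
```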

As @shntnu noted, at the aggregation step the missing value problem is effectively solved by ignoring missing values, which amounts to mean imputation. The problem still matters for single cell profiles, however.

@diskontinuum

Great points Greg! I'm glad you addressed this explicitly!

Regarding subpoints of Point 2: Yes, especially since missing columns and NaN values are not part of the pytests yet.

For the parquet option, I added an entirely new chunk of code to deal with this scenario (which is one of the reasons why it took a while). The functions have been tested locally in the Colaboratory notebook. More precisely, if the missing values manifest as missing columns, then the columns are simply added as NaN-valued columns while the dataframes are aligned, before they are written to Parquet. For this we need to make sure that the writer schema encompasses all possible columns; that is, the Parquet writer needs to be opened with a .csv that is certain to contain all columns. I built two options for this, following the suggestions of Beth and Greg: one is to put these reference .csvs into a folder and save the relative path to the folder in the config file; another is to set the option 'sample', which chooses the .csv file with the most columns from a subset of .csv files sampled uniformly at random (more info: readme on my branch). A rough sketch of this flow follows the list below.
What I have in mind to tackle this is:

  • check how NaN vs. na values are read from .parquet or .csv back into the pandas dataframe, compare with the original value and type --> include pytests for NaN values
  • include pytests for missing columns (easy for the .parquet option)
  • solve the issue for the .sqlite backend (easy if we use the same pandas.DataFrame.align() method and the same reference-table principle with the functions I already built for Parquet)
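To make the reference-table flow above easier to follow, here is a rough sketch assuming pyarrow is available; the file names are hypothetical and this is not the actual cytominer-database code:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical reference .csv that is known to contain every possible column
# (e.g. chosen via the 'sample' option described above).
reference = pd.read_csv("reference_Cells.csv")
schema = pa.Schema.from_pandas(reference, preserve_index=False)

# Open the writer once with the full schema, then append each table after
# aligning it to the reference; missing columns are added as NaN-valued columns.
with pq.ParquetWriter("Cells.parquet", schema) as writer:
    for path in ["Cells_site1.csv", "Cells_site2.csv"]:
        table = pd.read_csv(path)
        _, aligned = reference.align(table, join="left", axis=1)
        writer.write_table(
            pa.Table.from_pandas(aligned, schema=schema, preserve_index=False)
        )
```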

@gwaybio
Member Author

gwaybio commented Sep 24, 2020

This should be specified and deliberately coded. It is possible it is now deliberately coded in CP3, and, if so, this component should be considered solved. (cc @DavidStirling)

@DavidStirling - do you know offhand how CP3 (and also CP4) handle missing values? (Congrats on the CP4 release btw, super exciting)

@DavidStirling

@gwaygenomics If you're running ExportToSpreadsheet there is a setting to choose whether invalid numerical values (Nan/inf) are represented with null or nan, but this has been there for a long time. However, I'm not sure what happens with missing values of other types or those that aren't recorded at all. I think it'll largely depend on what the module which created the particular measurement is set up to do.

@gwaybio
Member Author

gwaybio commented Sep 24, 2020

Thanks David

I think it'll largely depend on what the module which created the particular measurement is set up to do.

This is important info - it sounds like we need to handle cases where there are multiple different kinds of missing values (nan, na, NA, NaN, null, etc.) since there is no standard.

From a software perspective, CellProfiler should manage this. However, we do need to buffer against this b/c legacy data exists and people use older CellProfiler versions.

@DavidStirling

For the most part it looks like modules use numpy.NaN when a measurement is invalid, rather than specifying a particular string. Nonetheless with so many modules that I didn't write myself it's difficult to be sure!

@gwaybio
Member Author

gwaybio commented Sep 24, 2020

got it - and it looks like the ExportToSpreadsheet CellProfiler module uses csv.writer() for output https://github.com/CellProfiler/CellProfiler/blob/master/cellprofiler/modules/exporttospreadsheet.py#L1080
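For reference, a quick standard-library check of what csv.writer emits for a numpy NaN (this is generic Python behavior, not CellProfiler-specific code; the column names are made up):

```python
import csv
import io

import numpy as np

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ObjectNumber", "Intensity_MeanIntensity_DNA"])
writer.writerow([1, np.nan])

# csv.writer stringifies non-string values with str(), so numpy.nan becomes "nan".
print(buffer.getvalue())
# ObjectNumber,Intensity_MeanIntensity_DNA
# 1,nan
```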

David, is it the case that the single cell .csv files that are ingested into the per-plate .sqlite file are originally written using ExportToSpreadsheet?

@DavidStirling

I believe so, ExportToDatabase can generate a .sqlite file natively but for DCP we tend to use ExportToSpreadsheet to ease combining the data from different cluster machines.

@gwaybio gwaybio added bug Something isn't working help wanted Extra attention is needed labels Sep 25, 2020
@AdeboyeML
Contributor

  • @gwaygenomics I think the best way to check how the sqlite db backend encodes missing values is to check the number of null values before and after ingestion of the csv files into the database/parquet

  • I tried ingesting a pandas df with null/nan values into sqlite on my local machine, and when I read the pandas df back from the sqlite db, the number of null values before and after ingestion remained the same.

  • The only difference is that before ingestion the pandas df null values are represented as NaN/nan, which is a float datatype, BUT after ingestion and reading the df back from the sqlite db, the null values are represented as None, which is a NoneType datatype...but this doesn't affect the way pandas checks for null/None/NaN values

  • For example (a minimal version of this check follows the screenshot below):
    data - before ingestion, data_sql - after ingestion

[screenshot: identical isnull() counts for data (before ingestion) and data_sql (after ingestion)]
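A minimal version of that round-trip check, assuming an in-memory SQLite database via sqlalchemy; the table and column names are made up:

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

data = pd.DataFrame(
    {
        "ObjectNumber": [1, 2, 3],
        "Intensity_MeanIntensity_DNA": [0.5, np.nan, 0.7],
    }
)

engine = create_engine("sqlite:///:memory:")
data.to_sql("Cells", engine, index=False)
data_sql = pd.read_sql("SELECT * FROM Cells", engine)

# Before ingestion the missing value is a float NaN; after reading back it may
# come through as None, but isnull() counts both, so the totals match.
assert data.isnull().sum().equals(data_sql.isnull().sum())
```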

@gwaybio
Member Author

gwaybio commented Sep 25, 2020

The reason to dig into this deeply is to ensure consistency between pycytominer and cytominer processing.

This originally came up in the context of the analysis in the lincs-cell-painting repository, where we compared all 136 Drug Repurposing Hub plates processed with pycytominer (Python) vs. cytominer (R) (see here). Briefly, we saw very small floating point differences between the two processing pipelines, and we hypothesized that the reason was how Python and R handle missing values.

But.... ready or not, I am now going to throw in an additional layer of complexity! 🔧

The current method of processing sqlite files uses sqlalchemy, whereas the old method of processing sqlite files used odo. Here is the change that introduced this switch. It is possible that odo causes the sqlite missing value handling issue.

Decision

We do not need to dig into how odo handles missing values 😁 . odo is deprecated and we will never use it again.

The way forward is to make sure the pre-ingest missing values are not incorrectly converted post-ingest. @diskontinuum - what do you think about this strategy? Would it be quick to add a test to both sqlite and parquet options?
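For the Parquet option, such a test could look roughly like the sketch below (a hedged example assuming pyarrow and pytest are available, not the actual cytominer-database test suite); an analogous test using the sqlite round-trip shown earlier would cover the .sqlite backend:

```python
import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def single_cell_df():
    # Small stand-in for a compartment table containing a missing value.
    return pd.DataFrame(
        {
            "ObjectNumber": [1, 2],
            "Intensity_MeanIntensity_DNA": [0.5, np.nan],
        }
    )


def test_parquet_roundtrip_preserves_missing_values(single_cell_df, tmp_path):
    path = tmp_path / "Cells.parquet"
    single_cell_df.to_parquet(path, index=False)
    roundtrip = pd.read_parquet(path)
    # Pre-ingest and post-ingest null counts should be identical.
    assert single_cell_df.isnull().sum().equals(roundtrip.isnull().sum())
```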

@diskontinuum

diskontinuum commented Sep 28, 2020

  • Regarding value-based tests: Yes, absolutely! ETA: This week :)

  • Regarding how the current cytominer-database code deals with missing values, NaN, etc.:

  1. Missing columns: The code uses a reference table (either from a fixed path or via sampling of the largest table, as specified in the config.ini file). When tables with missing columns are appended to the opened writer-table, the missing columns are added with value None.

  2. NaN values and other type incompatibilities: The 'schema' section of the config.ini file allows setting type_conversion = all2string, in which case all values are converted to strings.

Note that this is currently only the case for the --parquet backend, but the code can easily be adapted to allow sampling and reference frames for the --sqlite backend as well [also on my TODO list, I can add it as a new issue].

@gwaybio
Member Author

gwaybio commented Sep 28, 2020

Thanks @diskontinuum - a couple followup questions/comments:

When tables with missing columns are appended to the opened writer-table, the missing columns are added as value None.

Cool, yeah I remember this functionality being important. However, should we be using a value other than None so that we're consistent with other missing value types coming from CellProfiler ExportToSpreadsheet? (see #79 (comment))

The 'schema' section of the config.ini file allows to set type_conversion = all2string

It looks like the default behavior in the config.ini file is set to int2float. The reason we wanted all2string was because there are multiple ways CellProfiler encodes missing values. However, as @AdeboyeML demonstrated in #79 (comment), "this doesn't affect the way pandas checks for null/None/NaN values". I now believe that the int2float default will actually solve this issue for free, in pandas, under the hood.

can easily be adapted to allow sampling and reference frames also for the --sqlite backend [also on my TODO list, I can add it as a new issue].

Adding an issue would be a great first step (in the cytominer-database repo). I want to make sure all of the knowledge you worked hard on acquiring isn't lost once you start at Google!

@diskontinuum

diskontinuum commented Sep 29, 2020

Cool, yeah I remember this functionality being important. However, should we be using a value other than None so that we're consistent with other missing value types coming from CellProfiler ExportToSpreadsheet? (see #79 (comment))

Yes, this can easily be done by changing the parameters in the pandas.DataFrame.align() method that we're using to concatenate tables with a potentially different number or order of columns. By adding the fill_value parameter we can choose the value with which missing columns are padded (instead of the default None).
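A minimal illustration of the fill_value parameter (the column names are illustrative):

```python
import numpy as np
import pandas as pd

reference = pd.DataFrame(
    columns=["ObjectNumber", "AreaShape_Area", "Intensity_MeanIntensity_DNA"]
)
partial = pd.DataFrame({"ObjectNumber": [1], "AreaShape_Area": [250.0]})

# join="left" keeps the reference column set; fill_value controls how the
# newly added missing column is padded (np.nan here instead of the default).
_, padded = reference.align(partial, join="left", axis=1, fill_value=np.nan)
print(padded)
#    ObjectNumber  AreaShape_Area  Intensity_MeanIntensity_DNA
# 0             1           250.0                          NaN
```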

It looks like the default behavior in the config.ini file is set to int2float. The reason we wanted all2string was because there are multiple ways CellProfiler encoded missing values. However, as @AdeboyeML demonstrated in #79 (comment), "but this doesn't affect the way pandas check for null/None/Nan values". I now believe that the int2float default will actually solve this issue for free, in pandas, under the hood.

Great! Should I remove the string option in the next PR then?

@diskontinuum

Adding an issue would be a great first step (in the cytominer-database repo).

I opened issues #127, #128, #129.
