
Image-based profiling

Image-based profiling is a series of data processing steps that turn image-based readouts into more manageable data matrices for downstream analyses (Caicedo et al. 2017). Typically, you derive image-based readouts using software, like CellProfiler (McQuin et al. 2018), that segments cells and extracts so-called hand-engineered single-cell morphology measurements. In this folder, we process the CellProfiler-derived morphology features for the LINCS Cell Painting dataset using pycytominer, a tool enabling reproducible image-based profiling.

Specifically, we include:

Data processing scripts to perform the full unified, image-based profiling pipeline
Processed data for each Cell Painting plate (for several "data levels")
Instructions on how to reproduce the profiling pipeline

Workflow

Pycytominer

Pycytominer is a code base built by @gwaygenomics et al. It allows for easy processing of CellProfiler data and contains all functions that were used to create the data in this repository. Below, we describe the different steps of the pipeline. Please check the pycytominer repo for more details.

The steps from Level 3 to Level 4b can be found in the profile_cells script, the steps for spherizing can be found in this script, and the final aggregation to the consensus data is found in this notebook.

Note here that we do not include the intermediate step of generating .sqlite files per plate using a tool called cytominer-database. This repository and workflow begins after we applied cytominer-database.

Aggregation

We use the aggregation method twice in the workflow: first at this point, and later to create the consensus profiles. Here, we take the median of all cells within a well to produce one profile per well. The aggregation method doesn't persist the metadata, which is why this step is followed by an annotation step that adds the MOA data and other metadata.
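As a rough illustration of the per-well median aggregation described above, here is a minimal standard-library sketch. It is not pycytominer's implementation; the function name `aggregate_wells`, the `Metadata_Well` key, and the two feature names are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import median


def aggregate_wells(cells, well_key="Metadata_Well"):
    """Collapse single-cell rows into one median profile per well.

    `cells` is a list of dicts; `well_key` (a hypothetical column name)
    identifies the well each cell belongs to.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for cell in cells:
        well = cell[well_key]
        for name, value in cell.items():
            if name != well_key:
                grouped[well][name].append(value)
    # One profile per well: the median of every feature across its cells.
    return {
        well: {name: median(values) for name, values in features.items()}
        for well, features in grouped.items()
    }


# Toy single-cell measurements (feature names are illustrative only).
cells = [
    {"Metadata_Well": "A01", "Cells_AreaShape_Area": 100.0, "Nuclei_Intensity_Mean": 0.2},
    {"Metadata_Well": "A01", "Cells_AreaShape_Area": 120.0, "Nuclei_Intensity_Mean": 0.4},
    {"Metadata_Well": "A01", "Cells_AreaShape_Area": 500.0, "Nuclei_Intensity_Mean": 0.3},
    {"Metadata_Well": "B02", "Cells_AreaShape_Area": 90.0, "Nuclei_Intensity_Mean": 0.5},
    {"Metadata_Well": "B02", "Cells_AreaShape_Area": 110.0, "Nuclei_Intensity_Mean": 0.7},
]
well_profiles = aggregate_wells(cells)
```

Note that the output keeps only the well identifier and the features, which mirrors why a separate annotation step is needed afterwards.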

Normalization

Normalization can be done via different methods, and over either all wells in a plate or only the negative controls (DMSO wells). In this case, we used the mad_robustize method, and we save the output of both the whole-plate and the DMSO normalization in this repository. It is important to note that we normalize within each plate, not across the full batch.
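The difference between whole-plate and DMSO-only normalization can be sketched for a single feature on a single plate. This is a simplified illustration, not pycytominer's code: the 1.4826 scaling constant and the small `epsilon` are assumptions about how a robust MAD transform is commonly defined, and `mad_robustize` here is a hypothetical helper that shares only its name with the method mentioned above.

```python
from statistics import median

# Assumed scaling so the MAD is comparable to a standard deviation
# for normally distributed data.
MAD_SCALE = 1.4826


def mad_robustize(values, reference=None, epsilon=1e-18):
    """Center by the reference median and scale by the reference MAD.

    With reference=None the whole plate is its own reference;
    passing only the DMSO wells gives DMSO-specific normalization.
    """
    reference = values if reference is None else reference
    center = median(reference)
    mad = median(abs(v - center) for v in reference) * MAD_SCALE
    return [(v - center) / (mad + epsilon) for v in values]


# One feature measured in five wells of one plate; the first three are DMSO.
plate_values = [10.0, 12.0, 11.0, 30.0, 20.0]
dmso_values = plate_values[:3]

whole_plate_normalized = mad_robustize(plate_values)
dmso_normalized = mad_robustize(plate_values, reference=dmso_values)
```

In this toy plate the treated wells look far more extreme under DMSO normalization, because the reference spread no longer includes the treated wells themselves.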

Feature selection

The feature_select method incorporates ["variance_threshold", "correlation_threshold", "drop_na_columns", "drop_outliers"]. We developed these functions to drop redundant and invariant features and to improve post processing.
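To make the intent of three of these operations concrete, the sketch below drops columns with missing values, drops invariant columns, and then greedily drops one feature from each highly correlated pair. It is a deliberately simplified stand-in, not pycytominer's logic: the cutoffs, the greedy keep-first strategy, the helper names, and the feature names are all assumptions, and outlier dropping is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def select_features(columns, corr_cutoff=0.9):
    """Return the names of the features that survive all three filters.

    `columns` maps a feature name to its list of per-well values.
    """
    # Drop columns containing any missing value (None or NaN).
    kept = {
        name: values
        for name, values in columns.items()
        if not any(v is None or math.isnan(v) for v in values)
    }
    # Drop invariant columns (a crude stand-in for a variance threshold).
    kept = {name: values for name, values in kept.items() if len(set(values)) > 1}
    # Keep a feature only if it is not highly correlated with one already kept.
    selected = []
    for name, values in kept.items():
        if all(abs(pearson(values, kept[other])) <= corr_cutoff for other in selected):
            selected.append(name)
    return selected


columns = {
    "feature_area": [1.0, 2.0, 3.0, 4.0],
    "feature_area_copy": [2.0, 4.0, 6.0, 8.1],  # nearly redundant with feature_area
    "feature_constant": [5.0, 5.0, 5.0, 5.0],   # invariant
    "feature_missing": [1.0, float("nan"), 2.0, 3.0],
    "feature_texture": [4.0, 1.0, 3.0, 2.0],
}
selected = select_features(columns)
```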

Spherizing

Spherizing (aka whitening) is a transformation of the data that tries to correct batch effects. Within Pycytominer, Spherizing can be found in the normalization function. Unlike the other normalizations, we perform spherizing on the full batch (all plates). Check the code for details and relevant papers.

Consensus

The consensus function is another aggregation step and completes the pipeline. We aggregate the normalized and feature-selected data (level 4b) into consensus data via the median or MODZ functions. This means that the five replicates are combined into one profile representing one perturbation at a given dose.
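For intuition about how a MODZ-style consensus differs from a plain median, the sketch below weights each replicate by its mean Spearman correlation with the other replicates, so that a discordant replicate contributes less. This is a simplified illustration under stated assumptions (mean rather than sum of correlations, a minimum weight of 0.01, hypothetical helper names); the exact weighting used by pycytominer may differ.

```python
from statistics import mean


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def modz_consensus(replicates, min_weight=0.01):
    """Weighted average of replicate profiles (lists of feature values)."""
    n = len(replicates)
    raw_weights = []
    for i in range(n):
        others = [spearman(replicates[i], replicates[j]) for j in range(n) if j != i]
        # Clip so that poorly correlated replicates keep a small positive weight.
        raw_weights.append(max(mean(others), min_weight))
    total = sum(raw_weights)
    weights = [w / total for w in raw_weights]
    consensus = [
        sum(weights[i] * replicates[i][f] for i in range(n))
        for f in range(len(replicates[0]))
    ]
    return consensus, weights


# Three toy replicates of one perturbation at one dose; the third disagrees.
replicates = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 1.0, 3.5, 0.5],
]
consensus_profile, replicate_weights = modz_consensus(replicates)
```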

Normalization vs Spherizing

While having the same overall goal, these two processes target different artifacts. The normalization step simply aligns all plates by scaling each plate's values to a "unit height". The specific transformation used is RobustMad.

On the other hand, we use a spherize (a.k.a. whiten) transformation to adjust for plate and plate position effects (e.g. some plates may influence the data while the position within a plate could also create artifacts). The spherize transformation adjusts for plate position effects by transforming the profile data such that the DMSO profiles are left with an identity covariance matrix. Note that this operation is done on the full dataset, not per plate. See spherize-batch-effects.ipynb for implementation details.

Since CellProfiler data is not naturally normalized to zero mean, all data is 'MadRobustized' before spherizing to maximize the usefulness of the spherize transformation.
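The idea of leaving the DMSO profiles with an identity covariance matrix can be sketched with NumPy. The example below uses ZCA whitening, which is one common choice and an assumption here; the function names, the `epsilon` regularizer, and the simulated data are hypothetical, and the notebook named above remains the reference for what was actually run.

```python
import numpy as np


def fit_spherize(dmso_profiles, epsilon=1e-6):
    """Learn a ZCA whitening transform from the DMSO (control) profiles."""
    center = dmso_profiles.mean(axis=0)
    covariance = np.cov(dmso_profiles - center, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # W = E diag(1 / sqrt(lambda + epsilon)) E^T; epsilon guards tiny eigenvalues.
    whitening = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues + epsilon)) @ eigenvectors.T
    return center, whitening


def spherize(profiles, center, whitening):
    """Apply the DMSO-derived transform to any set of profiles."""
    return (profiles - center) @ whitening


# Simulated, correlated features standing in for a full batch of profiles.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
dmso_profiles = rng.normal(size=(200, 3)) @ mixing
treated_profiles = rng.normal(loc=1.0, size=(50, 3)) @ mixing

center, whitening = fit_spherize(dmso_profiles)
dmso_spherized = spherize(dmso_profiles, center, whitening)
treated_spherized = spherize(treated_profiles, center, whitening)
```

The transform is fit on the controls only and then applied to every profile, so treated wells are expressed relative to the variation seen in DMSO.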

Data levels

We include two batches of Cell Painting data in this repository: 2016_04_01_a549_48hr_batch1 and 2017_12_05_Batch2.

CellProfiler-derived profiles

For each batch, we include:

Data level	Description	File format	Included in this repo?	Version control
Level 1	Cell images	`.tif`	No^	NA
Level 2	Single cell profiles	`.sqlite`	No^	NA
Level 3	Aggregated profiles with metadata	`.csv.gz`	Yes	dvc
Level 4a	Normalized profiles with metadata	`.csv.gz`	Yes	dvc
Level 4b	Normalized and feature selected profiles with metadata	`.csv.gz`	Yes	dvc
Level 5	Consensus perturbation profiles	`.csv.gz`	Yes	git lfs

Importantly, we include files for two different types of normalization: Whole-plate normalization, and DMSO-specific normalization. See profile_cells.py for more details.

Note: If you use normalized profiles without feature selection, you may need to remove additional outlier features. See issues/65 for more details and a list of features.
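Once downloaded, the level 3 to level 5 files are gzip-compressed CSVs, so they can be read with the standard library alone. The sketch below assumes that metadata columns carry a `Metadata_` prefix (as `Metadata_dose_recode` does) and that every other column is a feature; the function name, the demo file, and its column names are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def read_profiles(path):
    """Read a .csv.gz profile file and split its columns by prefix."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    metadata_cols = [c for c in columns if c.startswith("Metadata_")]
    feature_cols = [c for c in columns if not c.startswith("Metadata_")]
    return rows, metadata_cols, feature_cols


# Write a tiny demo file so the example runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_profiles.csv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Metadata_Plate", "Metadata_Well", "Cells_AreaShape_Area"])
    writer.writerow(["PLATE_1", "A01", "0.25"])
    writer.writerow(["PLATE_1", "A02", "-1.10"])

rows, metadata_cols, feature_cols = read_profiles(demo_path)
# Values come back as strings; convert feature columns before analysis.
areas = [float(row["Cells_AreaShape_Area"]) for row in rows]
```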

Batch corrected profiles

Across the two batches we include four different spherized profiles (two per batch, one for each input normalization). These data include all level 4b profiles for every batch.

Batch	Input data	Spherized output file	Version control
2016_04_01_a549_48hr_batch1	DMSO normalized	2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso.csv.gz	git lfs
2016_04_01_a549_48hr_batch1	Whole plate normalized	2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz	git lfs
2017_12_05_Batch2	DMSO normalized	2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_dmso.csv.gz	git lfs
2017_12_05_Batch2	Whole plate normalized	2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz	git lfs

Data access

We use a combination of git lfs and dvc to version all the data in this repository (see tables above for a breakdown). To access the data locally, you must first install both tools.

# Download consensus profiles and spherized data
git lfs pull

# Download individual plate data
dvc pull

Note: The DVC remote is an AWS S3 bucket. You must have aws cli installed and configured (see details).

To access the files stored via DVC, you will need to create an AWS IAM user who can, at a minimum, Get objects and List buckets. One way of achieving this is to attach the AmazonS3ReadOnlyAccess policy, which is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3-object-lambda:Get*",
                "s3-object-lambda:List*"
            ],
            "Resource": "*"
        }
    ]
}

(Note that the s3-object-lambda:Get* and s3-object-lambda:List* are not required but they don't hurt)

DeepProfiler-derived profiles

TBD

Reproduce pipeline

After activating the conda environment with:

conda activate lincs

you can reproduce the pipeline by simply executing the following:

# Make sure you are in the profiles/ directory
./run.sh

Recoding dose information

The Drug Repurposing Hub collected data on 6 to 7 dose points per compound. In general, most doses are very near the following 7 dose points (mmoles per liter):

[0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]

Therefore, to make it easier to filter by dose when comparing compounds, we first align each dose collected in the dataset to the nearest of the dose points listed above. We then recode the dose points into ascending numerical levels and add a new metadata annotation, Metadata_dose_recode, to all profiles.

Dose	Dose Recode
0 (DMSO)	0
~0.04	1
~0.12	2
~0.37	3
~1.11	4
~3.33	5
~10	6
~20	7
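The recoding in the table above can be sketched as a nearest-point lookup. This is an illustration rather than the repository's code: the function name `recode_dose` is hypothetical, and measuring "nearest" by absolute difference (rather than, say, on a log scale) is an assumption.

```python
# The seven reference dose points listed above.
DOSE_POINTS = [0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]


def recode_dose(dose):
    """Map a measured dose to its recoded level (0 for DMSO, 1 to 7 otherwise)."""
    if dose == 0:
        return 0  # DMSO controls have no compound dose
    # Index of the reference point with the smallest absolute difference.
    nearest = min(range(len(DOSE_POINTS)), key=lambda i: abs(DOSE_POINTS[i] - dose))
    return nearest + 1


measured_doses = [0, 0.041, 0.35, 3.2, 9.8, 19.5]
recoded = [recode_dose(dose) for dose in measured_doses]
```

Filtering on the recoded level then compares compounds at matching dose points even when their measured doses differ slightly.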

Critical details

There are several critical details that are important for understanding data generation and processing. See profile_cells.py for more details about the specific processing steps and decisions.
