
5. Validate Model

In this module, we validate the final ML model.

Validation Method 1

We use the models from 2.train_model to classify nuclei images from the Cell Health Dataset. In cell_health_correlations.ipynb, the classification probabilities for each CRISPR guide/cell line combination are then correlated with the Cell Health labels for the respective CRISPR perturbation/cell line.

The Cell Health dataset has cell painting images across 119 CRISPR guide perturbations (~2 per gene perturbation) and 3 cell lines. More information regarding the generation of this dataset can be found at https://github.com/broadinstitute/cell-health.

In Cell-Health-Data/4.classify-features, we use the trained models to determine phenotypic class probabilities for each of the Cell Health cells. We average these probabilities across CRISPR guide/cell line to create 357 classification profiles (119 CRISPR guides x 3 cell lines).
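The averaging step described above can be sketched with a pandas groupby. This is a minimal illustration on made-up data, not the repository's code: the column names (Metadata_pert_name, Metadata_cell_line, phenotype_a, phenotype_b) and all values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical per-cell class probabilities. Column names and values are
# illustrative assumptions, not the repository's actual schema.
single_cell_probas = pd.DataFrame(
    {
        "Metadata_pert_name": ["guide_1", "guide_1", "guide_1", "guide_2"],
        "Metadata_cell_line": ["line_1", "line_1", "line_2", "line_1"],
        "phenotype_a": [0.2, 0.4, 0.1, 0.9],
        "phenotype_b": [0.8, 0.6, 0.9, 0.1],
    }
)

# Average the probabilities within each CRISPR guide/cell line pair.
# Each resulting row is one classification profile.
classification_profiles = single_cell_probas.groupby(
    ["Metadata_pert_name", "Metadata_cell_line"], as_index=False
).mean()

print(classification_profiles)
```

With the full dataset, the same grouping would yield one row per guide/cell line pair, which is where the 357 profiles come from.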

Way et al. derived cell health indicators as part of "Predicting cell health phenotypes using image-based morphology profiling". These indicators consist of 70 specific cell health phenotypes including proliferation, apoptosis, reactive oxygen species, DNA damage, and cell cycle stage. Way et al. averaged these indicators across CRISPR guide/cell line to create 357 Cell Health label profiles.

We use pandas.DataFrame.corr to find the Pearson correlation coefficient between the classification profiles and the Cell Health label profiles. The Pearson correlation coefficient measures the linear relationship between two datasets, with correlations of -1 and +1 implying exact inverse and direct linear relationships, respectively.
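As a rough sketch of this correlation step, the two sets of profiles can be placed side by side and passed to pandas.DataFrame.corr, keeping only the block that pairs phenotype probabilities with Cell Health labels. The column names and values below are invented for illustration, and the notebook may organize the computation differently.

```python
import pandas as pd

# Toy profiles with one row per guide/cell line pair. Names and values
# are illustrative assumptions, not data from the repository.
classification_profiles = pd.DataFrame(
    {"phenotype_a": [0.1, 0.2, 0.3, 0.4], "phenotype_b": [0.9, 0.7, 0.6, 0.2]}
)
cell_health_labels = pd.DataFrame(
    {"label_1": [1.0, 2.0, 3.0, 4.0], "label_2": [4.0, 1.0, 3.0, 2.0]}
)

# Correlate every column with every other column, then keep only the
# phenotype-by-label block of the resulting matrix.
combined = pd.concat([classification_profiles, cell_health_labels], axis=1)
pearson = combined.corr(method="pearson").loc[
    classification_profiles.columns, cell_health_labels.columns
]

print(pearson)
```

Here phenotype_a is an exact linear function of label_1, so their coefficient is +1, while phenotype_b decreases as label_1 increases and therefore has a negative coefficient.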

We also derive the Clustermatch Correlation Coefficient (CCC) introduced in Pividori et al., 2022. Unlike the Pearson coefficient, the CCC is a not-only-linear coefficient based on machine learning models, so it can also capture nonlinear relationships. It ranges from 0 (no relationship) to 1 (a perfect relationship).

These correlations are briefly interpreted in preview_CH_correlations.ipynb and preview_CH_correlation_differences.ipynb, which use seaborn.clustermap to display the hierarchically clustered correlation values. Seaborn's clustermap reorders the rows and columns of the correlation matrix so that those with similar correlation patterns sit next to each other.
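The clustering display can be sketched as follows, assuming seaborn, matplotlib, and scipy are installed. The correlation matrix is made up for the example, and the colormap and value limits are illustrative choices rather than the notebooks' settings.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import pandas as pd
import seaborn as sns

# Hypothetical phenotype-by-label correlation matrix (illustrative values).
correlations = pd.DataFrame(
    [[0.90, 0.80, -0.70], [0.85, 0.75, -0.60], [-0.80, -0.70, 0.90]],
    index=["phenotype_a", "phenotype_b", "phenotype_c"],
    columns=["label_1", "label_2", "label_3"],
)

# clustermap hierarchically clusters rows and columns, then draws the
# reordered matrix as a heatmap with dendrograms along the margins.
grid = sns.clustermap(correlations, cmap="RdBu_r", vmin=-1, vmax=1)

# Recover the row order chosen by the clustering.
row_order = [correlations.index[i] for i in grid.dendrogram_row.reordered_ind]
print(row_order)
```

Rows with similar correlation patterns (phenotype_a and phenotype_b here) end up adjacent in the reordered heatmap.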

Step 1: Define Folder Paths

Inside the notebook cell_health_correlations.ipynb, the variable classification_profiles_save_dir needs to be set to specify where the classification profiles are saved. We used an external hard drive and therefore needed to use specific paths. The classification profiles are the output of cell-health-data/4.classify-single-cell-phenotypes.
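One way to set this variable is with pathlib, checking that the directory exists before the rest of the notebook runs. The path below is a placeholder and check_profiles_dir is a hypothetical helper written for this sketch; neither comes from the notebook.

```python
import pathlib

# Placeholder path (an assumption): replace it with the location where
# your classification profiles were saved.
classification_profiles_save_dir = pathlib.Path("/path/to/classification_profiles")


def check_profiles_dir(directory: pathlib.Path) -> pathlib.Path:
    """Return the directory if it exists, otherwise raise a clear error."""
    if not directory.is_dir():
        raise FileNotFoundError(
            f"Classification profiles directory not found: {directory}"
        )
    return directory
```

Calling check_profiles_dir(classification_profiles_save_dir) at the top of the notebook fails early with a readable message if, for example, the external drive is not mounted.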

Step 2: Validate Model

Use the commands below to validate the final ML model:

# Make sure you are located in 5.validate_model
cd 5.validate_model

# Activate phenotypic_profiling conda environment
conda activate phenotypic_profiling

# Validate model
bash validate_model.sh

Notes:

Intermediate .tsv data are stored in tidy format, a standardized data structure (see Tidy Data by Hadley Wickham for more details).
SCM stands for "single cell model(s)" and is used as an abbreviation for the binary, single-class models throughout this module.
