# Cell Painting data collection and preprocessing

The first step of the project was to gather and preprocess the compound morphological profiles derived from the Cell Painting assays conducted by Bray et al. (doi: 10.1093/gigascience/giw014). These profiles are later used to train the models.

In [None]:
from src.utils import *

## Loading of Cell Painting data
We load the well-level morphological profiles from all plates obtained through the Cell Painting assay. 

In [None]:
# Load the well-level morphological profiles
raw_df = load_cellpainting_data('data/profiles.dir', drop_nan_cols=True)
raw_df

## Preprocessing of Cell Painting data

Next, we preprocess the Cell Painting data. The pipeline consists of the following steps:

* Plate-layout effects inspection. We check each plate for the edge effects and gradient artifacts that occur in most plate types.

* Well-level data preprocessing: Batch-effect correction. To correct for batch effects, we apply the Typical Variation Normalization (TVN) transform to the morphological features. TVN yields a feature space in which the technical variation sampled from the controls (nuisance variation) is neutralized, which enhances the biological signal from the treatments.

* Treatment-level profiles preprocessing: Well-level data aggregation. We aggregate the well-level profiles per treatment by mean aggregation: each feature is averaged over all wells that share the same compound and dose combination.

* Feature distribution inspection. We examine the density distribution of some features before and after their transformation.

* Compound concentrations analysis. We plot summary statistics and the distribution of the compound concentrations in the Cell Painting dataset.
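
As a rough illustration of the batch-effect correction and aggregation steps listed above, the sketch below whitens toy well-level features using statistics estimated from control wells and then averages them per compound and dose. It is not the code behind `data_preprocessing`: the functions `tvn_whiten` and `aggregate_by_treatment`, the column names, and the toy data are all hypothetical. Only the control-based whitening part of TVN is shown, and any further per-batch alignment that a full implementation may apply is omitted.

```python
import numpy as np
import pandas as pd


def tvn_whiten(features, is_control, eps=1e-6):
    """Whiten all wells with the mean and covariance of the control wells.

    Hypothetical helper: a simplified stand-in for the TVN step.
    """
    X = np.asarray(features, dtype=float)
    controls = X[np.asarray(is_control, dtype=bool)]
    mu = controls.mean(axis=0)
    cov = np.cov(controls - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    # Rotate onto the principal axes of the controls and scale each axis
    # to unit variance, so that control variation becomes isotropic.
    W = eigvecs / np.sqrt(eigvals + eps)
    return (X - mu) @ W


def aggregate_by_treatment(df, group_cols, feature_cols):
    """Average every feature over wells sharing the same treatment."""
    return df.groupby(group_cols, as_index=False)[feature_cols].mean()


# Toy data (assumption): 100 control wells followed by 100 treated wells.
rng = np.random.default_rng(0)
n_wells, n_features = 200, 4
feature_cols = [f"feat_{i}" for i in range(n_features)]
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(n_wells, n_features)) @ mixing + 5.0
is_control = np.arange(n_wells) < 100

whitened = tvn_whiten(X, is_control)

wells = pd.DataFrame(whitened, columns=feature_cols)
wells["compound"] = np.where(is_control, "DMSO", "cmpd_A")
wells["dose"] = np.where(np.arange(n_wells) % 2 == 0, 1.0, 10.0)
wells.loc[is_control, "dose"] = 0.0

treatment_profiles = aggregate_by_treatment(
    wells, ["compound", "dose"], feature_cols
)
```

After whitening, the control wells have approximately zero mean and identity covariance, so a treatment's displacement is measured relative to the typical variation among controls. The aggregated table holds one row per compound and dose combination.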

In [None]:
# Perform preprocessing on well-level Cell Painting data
processed_df = data_preprocessing(raw_df, 
                                  well_level_preprocessing='TVN', treatment_aggregation='mean',
                                  check_plate_layout_effects=True, check_feature_distribution=True, check_compound_concentrations=True)

## Morphological feature selection 

Next, we perform feature selection on the morphological features resulting from the Cell Painting assay. Two procedures are applied to obtain the final set of features:

* Highly correlated features. We reduce redundancy in the feature set by ensuring that no pair of features has a Pearson correlation coefficient above 0.9. For each pair of features identified as highly correlated, we remove the feature with the larger mean absolute correlation with the remaining features.

* Invariant features. We remove features with zero or near-zero variance, defined as those whose variance falls below the 1st percentile.
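
The two filters above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind `feature_selection`: the helpers `find_correlated_features` and `find_invariant_features` and the toy dataframe are hypothetical, and the 1st percentile is assumed to be taken over the per-feature variances.

```python
import numpy as np
import pandas as pd


def find_correlated_features(df, threshold=0.9):
    """Return features to drop so no remaining pair exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    remaining = list(corr.columns)
    to_remove = set()
    while True:
        # NaN correlations (e.g. constant columns) are treated as 0.
        sub = np.nan_to_num(corr.loc[remaining, remaining].to_numpy())
        np.fill_diagonal(sub, 0.0)
        if sub.max() <= threshold:
            break
        # Most correlated remaining pair.
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # Mean absolute correlation of each feature with the others.
        mean_corr = sub.sum(axis=1) / (len(remaining) - 1)
        drop = remaining[i] if mean_corr[i] >= mean_corr[j] else remaining[j]
        to_remove.add(drop)
        remaining.remove(drop)
    return to_remove


def find_invariant_features(df, percentile=1.0):
    """Return features whose variance is zero or below the given percentile
    of the per-feature variances (assumed interpretation)."""
    variances = df.var(axis=0)
    cutoff = np.percentile(variances, percentile)
    mask = (variances < cutoff) | (variances == 0)
    return set(variances.index[mask])


# Toy data (assumption): "b" nearly duplicates "a", "e" is constant.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy_df = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=n),
    "c": rng.normal(size=n),
    "d": rng.normal(size=n),
    "e": np.ones(n),
})

correlated = find_correlated_features(toy_df, threshold=0.9)
invariant = find_invariant_features(toy_df, percentile=1.0)
selected_df = toy_df.drop(columns=sorted(correlated | invariant))
```

On this toy data, one of the near-duplicate columns `a` and `b` is flagged as correlated and the constant column `e` is flagged as invariant, leaving three features.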

In [None]:
# Perform feature selection on preprocessed treatment-level data
features_to_remove = feature_selection(processed_df, 
                                       correlated_features=True, invariant_features=True)

# Remove the selected features from the preprocessed dataframe
feature_selection_df = processed_df.drop(columns=list(features_to_remove))
feature_selection_df.shape