# Process EBImage-based Features

We extracted single-cell features from all images using [EBImage](https://github.com/aoles/EBImage).
These features were measured from the DNA and actin channels and describe a range of morphological phenotypes.

Please view the [EBImage reference manual](https://bioconductor.org/packages/release/bioc/manuals/EBImage/man/EBImage.pdf) for a description of all features.

Many of these features differ widely in their distributions, are redundant (highly correlated with each other), have low variance, or have a large proportion of missing values.

In this notebook, we use [pycytominer](https://github.com/cytomining/pycytominer) to select features and normalize training and test data used in downstream analyses.

## Processing Steps

1. Remove features with high missingness
   * Remove features in which more than 1% of values are missing
2. Remove redundant (highly correlated) features
   * Remove features whose Pearson correlation with another feature is greater than 0.95
   * Retain the feature with the lowest correlation in each block of highly correlated features
3. Remove low variance features
   * Remove features where the ratio of the second most common value's count to the most common value's count is less than 1%
     * Removes features that are dominated by a single value
   * Remove features where the ratio of the number of unique values to the number of samples is less than 0.1%
     * Removes features that have a very high number of repeated values
4. Apply robust normalization
   * Subtract the median and divide by the interquartile range (IQR)
   * Robust to outliers
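
The missingness and low-variance filters in steps 1 and 3 can be sketched in plain pandas on toy data. This is an illustration only, not pycytominer's implementation: the helper names `high_missingness` and `low_variance` are hypothetical, and the uniqueness ratio is assumed here to be the number of distinct values divided by the number of rows.

```python
import numpy as np
import pandas as pd


def high_missingness(df, na_cutoff=0.01):
    """Return columns whose fraction of missing values exceeds na_cutoff."""
    na_fraction = df.isna().mean()
    return na_fraction[na_fraction > na_cutoff].index.tolist()


def low_variance(df, freq_cut=0.01, unique_cut=0.001):
    """Return columns that fail the frequency check or the uniqueness check."""
    flagged = []
    for col in df.columns:
        counts = df[col].value_counts()  # sorted by count, NaN excluded
        # Count of the second most common value over count of the most common one.
        freq_ratio = counts.iloc[1] / counts.iloc[0] if len(counts) > 1 else 0.0
        # Assumption: uniqueness = distinct values / number of rows.
        unique_ratio = len(counts) / len(df)
        if freq_ratio < freq_cut or unique_ratio < unique_cut:
            flagged.append(col)
    return flagged


n = 2000
toy = pd.DataFrame({
    "informative": np.arange(n, dtype=float),
    # every 20th value is missing (5% missingness)
    "gappy": np.where(np.arange(n) % 20 == 0, np.nan, np.arange(n, dtype=float)),
    "constant": np.ones(n),
    # all zeros except a single outlying value
    "spiky": np.r_[np.zeros(n - 1), 5.0],
})

print(high_missingness(toy))  # ['gappy']
print(low_variance(toy))      # ['constant', 'spiky']
```

With the cutoffs used in this notebook, `gappy` fails the 1% missingness cutoff, while `constant` and `spiky` fail the frequency check because one value accounts for nearly every row.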

**Note:** Features are selected using the training set only, and the same features are then kept in the test set. The training and test sets are normalized separately.

In [1]:
import os
import pandas as pd

from pycytominer.feature_select import feature_select
from pycytominer.normalize import normalize

In [2]:
file = os.path.join("data", "train.tsv.gz")
train_df = pd.read_csv(file, sep='\t')

print(train_df.shape)
train_df.head()

(46140, 123)


Unnamed: 0,cell_code,cell_id,plate,replicate,well,field,actin.s.area,actin.s.perimeter,actin.s.radius.mean,actin.s.radius.sd,...,DNA.h.ent.s3,DNA.h.dva.s3,DNA.h.den.s3,DNA.h.f12.s3,DNA.h.f13.s3,dist.10.nn,dist.20.nn,dist.30.nn,nuclear.displacement,target
0,Vw1NiaicT5,21,P1,2,L14,3,1276,141,21.0106,6.4898,...,0.0,0.0,0.0,0.0,0.0,64.4211,91.6678,115.0315,0.6564,DNA_intercalation
1,3OfiCVsytl,191,P1,1,B20,3,2530,232,29.6325,7.3113,...,0.0,0.0,0.0,0.0,0.0,68.6514,107.6736,134.6765,5.3418,dopaminereceptor
2,w5MtesHiAp,46,P2,1,H22,2,354,59,10.187,0.8613,...,0.0,0.0,0.0,0.0,0.0,327.8106,548.1329,817.8908,0.2296,DNAMetabolism
3,MqfxNtnHBu,126,P1,2,B11,4,198,53,7.8055,1.5897,...,0.0,0.0,0.0,0.0,0.0,180.4236,228.7135,274.3472,1.7604,dopaminereceptor
4,5Iaw5OkYAu,50,P1,1,M23,4,639,98,14.2977,3.339,...,0.0,0.0,0.0,0.0,0.0,260.5921,315.0347,339.9882,3.1661,MEK


## Perform Feature Selection

In [3]:
# EBImage feature columns start with one of these prefixes; all other columns are metadata
eb = ("actin", "DNA", "dist", "nuclear")
features = [x for x in train_df.columns if x.startswith(eb)]

In [4]:
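train_feature_select_df = feature_select(
    profiles=train_df,
    features=features,
    operation=["drop_na_columns", "variance_threshold", "correlation_threshold"],
    na_cutoff=0.01,  # drop features with more than 1% missing values
    corr_threshold=0.95,  # drop features correlated above 0.95 with another feature
    corr_method="pearson",
    freq_cut=0.01,  # variance_threshold: frequency ratio cutoff (1%)
    unique_cut=0.001,  # variance_threshold: uniqueness ratio cutoff (0.1%)
)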

In [5]:
selected_features = [x for x in train_feature_select_df.columns if x.startswith(eb)]

print(train_feature_select_df.shape)
train_feature_select_df.head()

(46140, 34)


Unnamed: 0,cell_code,cell_id,plate,replicate,well,field,actin.s.area,actin.s.radius.mean,actin.s.radius.sd,actin.s.radius.min,...,DNA.b.q05,DNA.m.cx,DNA.m.cy,DNA.m.majoraxis,DNA.m.eccentricity,DNA.m.theta,dist.10.nn,dist.30.nn,nuclear.displacement,target
0,Vw1NiaicT5,21,P1,2,L14,3,1276,21.0106,6.4898,9.219,...,0.0056,277.6916,1298.9789,38.5191,0.9106,-1.3222,64.4211,115.0315,0.6564,DNA_intercalation
1,3OfiCVsytl,191,P1,1,B20,3,2530,29.6325,7.3113,16.1273,...,0.0115,698.7473,961.2215,51.9982,0.9122,-1.4732,68.6514,134.6765,5.3418,dopaminereceptor
2,w5MtesHiAp,46,P2,1,H22,2,354,10.187,0.8613,8.2477,...,0.0042,765.7384,1097.8634,21.1253,0.7536,-0.5,327.8106,817.8908,0.2296,DNAMetabolism
3,MqfxNtnHBu,126,P1,2,B11,4,198,7.8055,1.5897,4.8874,...,0.0061,21.1868,1397.569,14.0177,0.7353,-0.173,180.4236,274.3472,1.7604,dopaminereceptor
4,5Iaw5OkYAu,50,P1,1,M23,4,639,14.2977,3.339,6.7164,...,0.0071,514.5402,73.7514,22.7591,0.6506,-0.3738,260.5921,339.9882,3.1661,MEK
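
The correlation filter can be pictured with a small pandas sketch. It assumes that, for each pair of features correlated above the threshold, the feature with the larger total absolute correlation to all features is the one removed. The function name `correlated_features_to_drop` and the toy data are hypothetical, and pycytominer's own tie-breaking may differ.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(df, threshold=0.95, method="pearson"):
    """Pick one feature to drop from every pair correlated above threshold."""
    corr = df.corr(method=method).abs()
    # Total absolute correlation of each feature with all features.
    corr_sum = corr.sum()
    to_drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                # Assumption: drop the feature that is more correlated overall.
                to_drop.add(a if corr_sum[a] > corr_sum[b] else b)
    return sorted(to_drop)


rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "a_noisy": base + 0.1 * rng.normal(size=500),  # near-duplicate of "a"
    "b": rng.normal(size=500),                     # independent feature
})

dropped = correlated_features_to_drop(toy)
print(dropped)
print(toy.drop(columns=dropped).columns.tolist())
```

Exactly one of `a` and `a_noisy` is removed, and the independent feature `b` is always kept.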


### Subset Test Set with the Selected Features

In [6]:
# Keep only the columns retained in the training set, in the same order
file = os.path.join("data", "test.tsv.gz")
test_df = pd.read_csv(file, sep='\t').reindex(train_feature_select_df.columns, axis='columns')

print(test_df.shape)
test_df.head()

(5127, 34)


Unnamed: 0,cell_code,cell_id,plate,replicate,well,field,actin.s.area,actin.s.radius.mean,actin.s.radius.sd,actin.s.radius.min,...,DNA.b.q05,DNA.m.cx,DNA.m.cy,DNA.m.majoraxis,DNA.m.eccentricity,DNA.m.theta,dist.10.nn,dist.30.nn,nuclear.displacement,target
0,ugqGCLfUnm,169,P1,1,K20,4,1033,19.2003,5.1297,10.6852,...,0.0062,245.574,310.0449,21.9464,0.5785,-1.2701,88.8333,144.5696,1.6818,AMPA
1,lqqb7taVnW,57,P2,1,C22,1,4436,39.6913,10.6677,13.9411,...,0.0051,1020.8605,556.2421,38.3966,0.7325,0.3813,132.5608,233.8309,4.0684,TopoII
2,BqdoFQiUTC,103,P4,1,K20,3,864,16.2811,3.9804,8.4662,...,0.012,708.5731,753.4066,40.8012,0.9236,0.78,120.1797,252.2629,0.9579,Ca2
3,1tbsBJxC48,50,P1,1,M11,3,1389,23.4096,7.6317,7.1806,...,0.0063,469.0087,804.736,22.7599,0.6785,-0.3604,68.8056,215.7705,5.6222,cellcycle
4,N5WcuwqolV,104,P4,1,B15,2,627,14.3132,3.7944,8.0233,...,0.0066,454.1615,63.6302,18.4396,0.5912,1.2913,80.2385,196.2588,5.2795,EGFR
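
One property of `reindex` is worth keeping in mind when aligning the test set this way: a column that is requested but absent from the test data does not raise an error and is instead created and filled with `NaN`. The sketch below shows this on a toy frame. The column names are borrowed from the tables above, but the values and the missing column are made up for illustration.

```python
import pandas as pd

# Columns kept after feature selection on the training set (toy subset)
train_cols = pd.Index(["cell_code", "actin.s.area", "dist.10.nn", "target"])

# Toy test frame: has an extra column and deliberately lacks "dist.10.nn"
test = pd.DataFrame({
    "cell_code": ["a1", "b2"],
    "actin.s.area": [1033, 4436],
    "actin.s.perimeter": [120, 260],
    "target": ["AMPA", "TopoII"],
})

# Checking for absent columns before reindexing makes the gap explicit
absent = sorted(set(train_cols) - set(test.columns))
print(absent)  # ['dist.10.nn']

aligned = test.reindex(train_cols, axis="columns")

# The extra column is dropped and the absent one is filled with NaN
print(aligned.columns.tolist())
print(aligned["dist.10.nn"].isna().all())  # True
```

If a silent `NaN` column would be a problem downstream, a check like `absent` above can be turned into an assertion before normalization.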


## Perform Normalization

In [7]:
train_normalize_df = normalize(
    profiles=train_feature_select_df,
    features=selected_features,
    method="robustize"
)

print(train_normalize_df.shape)
train_normalize_df.head()

(46140, 34)


Unnamed: 0,cell_code,cell_id,plate,replicate,well,field,target,actin.s.area,actin.s.radius.mean,actin.s.radius.sd,...,DNA.b.q005,DNA.b.q05,DNA.m.cx,DNA.m.cy,DNA.m.majoraxis,DNA.m.eccentricity,DNA.m.theta,dist.10.nn,dist.30.nn,nuclear.displacement
0,Vw1NiaicT5,21,P1,2,L14,3,DNA_intercalation,0.226792,0.264616,0.930388,...,0.3,-0.454545,-0.576922,0.829042,1.21882,0.962112,-0.815669,-0.375466,-0.405006,-0.735221
1,3OfiCVsytl,191,P1,1,B20,3,dopaminereceptor,1.700353,1.411672,1.254113,...,0.7,2.227273,0.00448,0.359209,2.296359,0.968561,-0.91069,-0.28764,-0.218276,1.120566
2,w5MtesHiAp,46,P2,1,H22,2,DNAMetabolism,-0.856639,-1.175353,-1.287609,...,-1.1,-1.090909,0.096983,0.549283,-0.171666,0.329303,-0.298277,5.092832,6.275812,-0.904268
3,MqfxNtnHBu,126,P1,2,B11,4,dopaminereceptor,-1.039953,-1.492187,-1.000571,...,0.1,-0.227273,-0.931109,0.966185,-0.739858,0.255542,-0.092504,2.032892,1.109322,-0.29795
4,5Iaw5OkYAu,50,P1,1,M23,4,MEK,-0.521739,-0.628467,-0.311233,...,-0.6,0.227273,-0.249876,-0.875296,-0.041058,-0.085852,-0.218863,3.697291,1.733253,0.258818


In [8]:
# Normalize the test set using its own statistics, independently of the training set
test_normalize_df = normalize(
    profiles=test_df,
    features=selected_features,
    method="robustize"
)

print(test_normalize_df.shape)
test_normalize_df.head()

(5127, 34)


Unnamed: 0,cell_code,cell_id,plate,replicate,well,field,target,actin.s.area,actin.s.radius.mean,actin.s.radius.sd,...,DNA.b.q005,DNA.b.q05,DNA.m.cx,DNA.m.cy,DNA.m.majoraxis,DNA.m.eccentricity,DNA.m.theta,dist.10.nn,dist.30.nn,nuclear.displacement
0,ugqGCLfUnm,169,P1,1,K20,4,AMPA,-0.065612,0.020167,0.370674,...,-0.2,-0.173913,-0.569706,-0.570882,-0.113877,-0.382723,-0.781895,0.143072,-0.108655,-0.324762
1,lqqb7taVnW,57,P2,1,C22,1,TopoII,3.9215,2.744182,2.534928,...,-0.5,-0.652174,0.492667,-0.229372,1.164652,0.233153,0.261347,1.077373,0.766622,0.602253
2,BqdoFQiUTC,103,P4,1,K20,3,Ca2,-0.26362,-0.367904,-0.078473,...,0.0,2.347826,0.064741,0.044123,1.351541,0.997401,0.513219,0.812833,0.947363,-0.605943
3,1tbsBJxC48,50,P1,1,M11,3,cellcycle,0.351494,0.579739,1.348457,...,0.0,-0.130435,-0.263534,0.115324,-0.050651,0.017197,-0.207208,-0.284849,0.589526,1.205788
4,N5WcuwqolV,104,P4,1,B15,2,EGFR,-0.541301,-0.629511,-0.151162,...,-1.1,0.0,-0.283879,-0.912693,-0.38643,-0.331934,0.836224,-0.040568,0.398198,1.072674
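
The robust normalization described in the processing steps (subtract the median, divide by the IQR) can be written out directly in pandas. This is a minimal sketch on toy data rather than pycytominer's code. The helper name `robust_scale` is hypothetical, and it assumes quartiles computed with pandas' default linear interpolation.

```python
import pandas as pd


def robust_scale(df, features):
    """Subtract each feature's median and divide by its interquartile range."""
    scaled = df.copy()
    median = df[features].median()
    iqr = df[features].quantile(0.75) - df[features].quantile(0.25)
    # Metadata columns are left untouched; a zero IQR would produce inf/NaN here.
    scaled[features] = (df[features] - median) / iqr
    return scaled


toy = pd.DataFrame({
    "cell_code": list("abcde"),
    "area": [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 is a deliberate outlier
})

scaled = robust_scale(toy, ["area"])
print(scaled["area"].tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The median (3.0) and IQR (2.0) are unaffected by the outlier, so the four typical values land on an evenly spaced scale while the outlier stays clearly separated.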


In [9]:
# Assert that the final dataframes are aligned
pd.testing.assert_index_equal(train_normalize_df.columns, test_normalize_df.columns)

## Output Processed Files

In [10]:
file = os.path.join("data", "train_processed.tsv.gz")
train_normalize_df.to_csv(file, sep='\t', float_format="%.4f", index=False)

file = os.path.join("data", "test_processed.tsv.gz")
test_normalize_df.to_csv(file, sep='\t', float_format="%.4f", index=False)
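
Because the files are written with `float_format="%.4f"`, every floating point value is rounded to four decimal places on disk. The toy example below (made-up values, written to an in-memory string instead of a file) shows the effect of the same `to_csv` arguments.

```python
import io

import pandas as pd

toy = pd.DataFrame({"cell_code": ["a1", "b2"], "area": [0.123456, -1.5]})

# With no path given, to_csv returns the text it would have written
text = toy.to_csv(sep="\t", float_format="%.4f", index=False)
rows = text.splitlines()
print(rows)  # ['cell_code\tarea', 'a1\t0.1235', 'b2\t-1.5000']

# Reading the text back recovers the rounded values, not the originals
reloaded = pd.read_csv(io.StringIO(text), sep="\t")
print(reloaded["area"].tolist())
```

Downstream analyses that reload the processed files therefore see values at four-decimal precision.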