# Feature selection
Morphological profile features often exhibit strong correlation structures. To
remove redundant features, `scmorph` integrates methods that detect correlated
features and remove them.

In [1]:
import scmorph as sm

adata = sm.datasets.rohban2017_minimal()
adata.shape

(12352, 1687)

This example dataset has 1687 features, many of which will be at least partly
redundant. `scmorph` makes removing redundant features easy:

In [2]:
adata_filtered_pearson = sm.pp.select_features(adata, method="pearson", copy=True)
adata_filtered_pearson.shape

(12352, 1455)

Behind the scenes, this is what happens:
1. Correlate all features with each other
2. For any feature pair with correlation coefficient > threshold (0.9 by
   default), remove one of the features. To decide which one, check which of the
   two features has the higher correlation with all other features.
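
The two steps above can be sketched in plain NumPy. This is a minimal illustration, not `scmorph`'s actual implementation: the helper name `drop_correlated`, the use of absolute correlations, the most-correlated-first visiting order, and the choice to drop the feature with the higher mean correlation to all others are assumptions made for this example.

```python
import numpy as np


def drop_correlated(X, names, cutoff=0.9):
    """Return the names of features kept after greedy correlation filtering."""
    # Step 1: absolute Pearson correlation between all feature columns.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    # Mean absolute correlation of each feature with all other features.
    mean_corr = corr.sum(axis=0) / (corr.shape[0] - 1)

    # Step 2: visit pairs above the cutoff, most correlated first.
    rows, cols = np.where(np.triu(corr, k=1) > cutoff)
    order = np.argsort(-corr[rows, cols])
    keep = np.ones(corr.shape[0], dtype=bool)
    for i, j in zip(rows[order], cols[order]):
        if keep[i] and keep[j]:
            # Assumption: drop the feature that is more correlated with the rest.
            keep[i if mean_corr[i] >= mean_corr[j] else j] = False
    return [name for name, kept in zip(names, keep) if kept]


rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), rng.normal(size=500)])
kept = drop_correlated(X, ["a", "a_noisy", "independent"])
print(kept)
```

Only one of the two near-duplicate columns survives, while the independent column is always kept.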
   
By varying the threshold, we can be more or less stringent in our filtering.

In [None]:
adata_filtered_pearson = sm.pp.select_features(adata, method="pearson", cor_cutoff=0.8, copy=True)
adata_filtered_pearson.shape

Likewise, we can use other correlation coefficients that may be more suitable
for morphological features, which do not always follow normal distributions.

In [None]:
adata_filtered_spearman = sm.pp.select_features(adata, method="spearman", cor_cutoff=0.8, copy=True)
adata_filtered_spearman.shape

(12352, 1295)
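
To see why a rank-based coefficient can suit skewed features better, the sketch below compares Pearson and Spearman correlation on synthetic data. It relies on the fact that Spearman correlation is Pearson correlation computed on ranks. The helper `ranks` and the data are illustrative and assume there are no ties.

```python
import numpy as np


def ranks(v):
    """Ranks 0..n-1 of a 1D array (assumes no ties, as for continuous data)."""
    return np.argsort(np.argsort(v))


rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=1000)
y = np.exp(x)  # strictly increasing in x, but heavily right-skewed

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

Because `y` is a strictly increasing function of `x`, the ranks agree exactly and the Spearman coefficient is 1, whereas the Pearson coefficient is pulled down by the skewed tail.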

We can also subset the data before performing this correlation filtering, which
can speed up processing for large datasets. For example, if we only want to use
3000 cells while estimating correlations, we can use `n_obs` as below. Note
that, because we are not using the full data to compute correlation
coefficients, this can change the number of features retained.

In [None]:
adata_filtered_spearman = sm.pp.select_features(adata, method="spearman", cor_cutoff=0.8, copy=True, n_obs=3000)
adata_filtered_spearman.shape

(12352, 1293)
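
The small change in retained features comes from sampling noise in the correlation estimates. The sketch below illustrates this on two synthetic features whose true correlation sits right at the cutoff. Random sampling without replacement is an assumption here, and `scmorph` may choose cells differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_sub = 12352, 3000

# Two synthetic features with a true correlation of 0.8, right at the cutoff.
f1 = rng.normal(size=n_cells)
f2 = 0.8 * f1 + 0.6 * rng.normal(size=n_cells)

# Draw a random subset of cells without replacement.
idx = rng.choice(n_cells, size=n_sub, replace=False)

r_full = np.corrcoef(f1, f2)[0, 1]
r_sub = np.corrcoef(f1[idx], f2[idx])[0, 1]
print(f"full data: {r_full:.3f}, subsample: {r_sub:.3f}")
```

The two estimates differ only slightly, but for a feature pair close to the cutoff that difference can decide whether one of the features is removed.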

`scmorph` also integrates an adapted version of the Chatterjee correlation
coefficient, based on [work by Lin and Han (2021)](https://doi.org/10/grdrs2).
While it is slower to compute than the other correlation coefficients, it makes
fewer assumptions and can find correlations that the other coefficients may
miss.

In [None]:
adata_filtered_chatterjee = sm.pp.select_features(adata, method="chatterjee", cor_cutoff=0.7, copy=True, n_obs=1000)
adata_filtered_chatterjee.shape

(12352, 1477)
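
For intuition about what this coefficient can detect, here is a minimal sketch of the original Chatterjee coefficient for data without ties. It is not the Lin and Han adaptation that `scmorph` uses, and the helper name `chatterjee_xi` is made up for this example.

```python
import numpy as np


def chatterjee_xi(x, y):
    """Original Chatterjee xi coefficient, assuming no ties in x or y."""
    n = len(x)
    y_by_x = np.asarray(y)[np.argsort(x)]   # reorder y by increasing x
    r = np.argsort(np.argsort(y_by_x)) + 1  # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x**2  # strong but non-monotonic dependence

xi = chatterjee_xi(x, y)
pearson = np.corrcoef(x, y)[0, 1]
xi_indep = chatterjee_xi(x, rng.normal(size=2000))
print(f"xi={xi:.2f}, pearson={pearson:.2f}, xi (independent)={xi_indep:.2f}")
```

The coefficient is close to 1 for the quadratic relationship, which the Pearson coefficient misses almost entirely, and close to 0 for independent data. Note that it is not symmetric: it measures how well `y` can be written as a function of `x`.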

Note that `select_features` also does some additional filtering behind the scenes.
Specifically, it removes features with very low variance. Features affected by
this filter are usually not informative and can be safely removed. You can see
which features are affected by this filter after running the function:

In [7]:
adata.var["qc_pass_var"].value_counts()

True     1585
False     102
Name: qc_pass_var, dtype: int64

In [8]:
adata.var.query("qc_pass_var == False").sample(5)

| | Object | Module | feature_1 | feature_2 | feature_3 | feature_4 | qc_pass_var |
|---|---|---|---|---|---|---|---|
| Cells_Correlation_Costes_RNA_Mito | Cells | Correlation | Costes | RNA | Mito | | False |
| Cells_Correlation_Costes_ER_RNA | Cells | Correlation | Costes | ER | RNA | | False |
| Cytoplasm_Intensity_MeanIntensityEdge_DNA | Cytoplasm | Intensity | MeanIntensityEdge | DNA | | | False |
| Nuclei_AreaShape_Zernike_9_7 | Nuclei | AreaShape | Zernike | 9 | 7 | | False |
| Cells_Intensity_MADIntensity_DNA | Cells | Intensity | MADIntensity | DNA | | | False |
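
A low-variance filter of this kind can be sketched as follows. The threshold `min_var` is a hypothetical value chosen for this example and is not taken from `scmorph`. The variable name `qc_pass_var` simply mirrors the column shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),               # informative feature
    np.full(200, 3.0),                  # constant feature
    3.0 + 1e-6 * rng.normal(size=200),  # almost constant feature
])

min_var = 1e-5  # hypothetical threshold, not taken from scmorph
qc_pass_var = X.var(axis=0) > min_var  # True where a feature varies enough
print(qc_pass_var)
```

Only the first column passes, since the other two carry essentially no variation across cells.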
