# Aggregate, Annotate, Normalize, and Feature Select

This notebook runs all of the operations above on a single-cell file obtained from DeepProfiler 1_Process_Outputs.

In [1]:
import pandas as pd
import numpy as np
import pycytominer
import easygui as eg
import os
# from generate_profiles import *
%load_ext autoreload
%autoreload 2

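def stringToBool(correlation_input):
    """Convert a 'yes'/'no' answer to a boolean."""
    if correlation_input == 'yes':
        return True
    elif correlation_input == 'no':
        return False
    else:
        raise ValueError("Expected 'yes' or 'no', got: " + repr(correlation_input))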

## 0) Inputs

In [2]:
profile = eg.fileopenbox(msg="Choose a file with samples and their features", default=r"D:")
print('Filename', profile)

project_name = input('Provide the name of this project:')
print('Project name:', project_name)

metadata_question = input(r"If you need to annotate your dataset with an external file, write yes and press enter. If already annotated, answer no and press enter.")
metadata_answer = stringToBool(metadata_question)

if metadata_answer:
    platemap = eg.fileopenbox(msg="Choose a map (csv file) with plates names and metadata filenames", default=r"G:")
    platemap_path = os.path.split(platemap)[0]
    print('Platemap file selected', platemap)
    barcode_df = pd.read_csv(platemap)

Filename D:\2022_10_04_AgNPCellRecovery_fossa_Cimini\workspace\deepprofiler\2023_04_25_CNN_CellPainting_GFPRNA\profiles\2023_04_25_CNN_CellPainting_GFPRNAsingle_cells.csv
Project name: 2023_04_25_CNN_CellPainting_GFPRNA
Platemap file selected G:\My Drive\2022_10_04_AgNPCellRecovery_fossa_Cimini\metadata\platemaps\2022_05_28_CellPainting\barcode_platemap.csv


## 1) Import extracted features file (single cell or well-aggregated)

In [11]:
df = pd.read_csv(profile)
df.head()

cells_that_run = []
cells_that_run.append(project_name)

### 1.a) Select features names

In [12]:
# extract metadata prior to normalization
metadata_cols = pycytominer.cyto_utils.infer_cp_features(df, metadata=True)
# locations are not automatically inferred with cp features
metadata_cols.append("Location_Center_X")
metadata_cols.append("Location_Center_Y")
derived_features = [
    x for x in df.columns.tolist() if x not in metadata_cols
]
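
The split between metadata and feature columns can be pictured with plain pandas. This is a minimal sketch on a made-up table, assuming that metadata columns are the ones prefixed with `Metadata_`; the names `toy`, `toy_metadata_cols` and `toy_features` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy single-cell table; column values are invented for illustration.
toy = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1"],
        "Metadata_Well": ["B2", "B3"],
        "Location_Center_X": [10.5, 220.0],
        "Location_Center_Y": [31.0, 18.2],
        "efficientnet_0": [0.52, 0.54],
        "efficientnet_1": [2.81, 2.78],
    }
)

# Assumption: metadata columns are identified by the "Metadata_" prefix.
toy_metadata_cols = [c for c in toy.columns if c.startswith("Metadata_")]
# The cell-center coordinates lack that prefix, so they are added by hand.
toy_metadata_cols += ["Location_Center_X", "Location_Center_Y"]

# Every remaining column is treated as a feature.
toy_features = [c for c in toy.columns if c not in toy_metadata_cols]
print(toy_features)
```

Leaving the location columns out of the metadata list would let them be aggregated and normalized as if they were learned features.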

## 2) Generate profile

### 2A) Aggregate

- List the metadata columns (`metadata_cols` from 1.a) and choose which of them to group the rows by (`strata`).

- If the data are **already aggregated by plate and well**, skip to 2B.

- By default, rows are grouped by **Metadata_Plate and Metadata_Well** and combined with the **'median'** operation.

The code below is adapted from pycytominer.

Reference: https://github.com/cytomining/pycytominer/blob/a5ae6c81a275b692ef5d4c85cfeb37696bf69242/pycytominer/cyto_utils/DeepProfiler_processing.py#L437-L444
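
As a rough picture of what median aggregation over strata does, the sketch below uses a plain pandas `groupby` on invented data. It is an illustration of the idea under that assumption, not pycytominer's implementation; `cells`, `feature_cols` and `wells` are hypothetical names.

```python
import pandas as pd

# Five toy cells spread over two wells of one plate (values are invented).
cells = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1", "P1", "P1", "P1"],
        "Metadata_Well": ["B2", "B2", "B2", "B3", "B3"],
        "efficientnet_0": [1.0, 2.0, 9.0, 4.0, 6.0],
        "efficientnet_1": [0.0, 0.5, 1.0, 2.0, 2.0],
    }
)
strata = ["Metadata_Plate", "Metadata_Well"]
feature_cols = ["efficientnet_0", "efficientnet_1"]

# One row per (plate, well); each feature becomes the median over that well's cells.
wells = cells.groupby(strata)[feature_cols].median().reset_index()
print(wells)
```

The median is less sensitive than the mean to a few extreme cells, such as the value 9.0 in well B2 above.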


In [13]:
df = pycytominer.aggregate(df,
                           strata=['Metadata_Plate', 'Metadata_Well'],
                           operation='median',
                           features=derived_features)
df.head()

Unnamed: 0,Metadata_Plate,Metadata_Well,efficientnet_0,efficientnet_1,efficientnet_2,efficientnet_3,efficientnet_4,efficientnet_5,efficientnet_6,efficientnet_7,...,efficientnet_662,efficientnet_663,efficientnet_664,efficientnet_665,efficientnet_666,efficientnet_667,efficientnet_668,efficientnet_669,efficientnet_670,efficientnet_671
0,220528_102915_Plate_1,B10,0.518708,2.812436,3.218751,-0.131132,1.261852,0.519183,2.810942,0.530586,...,0.84188,0.827625,-0.180425,1.264115,1.49039,0.327663,3.036213,1.433404,0.538578,0.788551
1,220528_102915_Plate_1,B11,0.541497,2.77869,3.300595,-0.116309,1.234116,0.458571,2.823341,0.952075,...,0.910153,0.840212,-0.189845,1.23117,1.441773,0.352475,2.910588,1.445061,0.488202,0.716878
2,220528_102915_Plate_1,B2,0.544787,2.847978,3.176057,-0.119559,1.27001,0.545712,2.828436,0.286739,...,0.867258,0.761964,-0.173966,1.301321,1.526623,0.376966,3.095318,1.372507,0.488918,0.532168
3,220528_102915_Plate_1,B3,0.51826,2.842462,3.196894,-0.124644,1.264384,0.535548,2.816453,0.340362,...,0.8572,0.74278,-0.16478,1.170745,1.47074,0.355507,3.166224,1.287173,0.473407,0.44213
4,220528_102915_Plate_1,B4,0.570887,2.828761,3.1989,-0.096245,1.227968,0.53474,2.826558,0.319614,...,0.858262,0.780122,-0.168132,1.298958,1.543481,0.350412,3.091416,1.47859,0.4743,0.581399


### 2B) Annotate 

- Run the following cell to build the list of plates from the `Metadata_Plate` column.

In [6]:
plate_list = df['Metadata_Plate'].unique().tolist()
print(plate_list)

['220528_102915_Plate_1', '220609_145227_Plate_1']


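- All metadata must be organized as follows. The annotation cell reads each platemap as a tab-separated `.txt` file from the `platemap` folder located next to `barcode_platemap.csv`:
    ```
    |- metadata
    |   |- <barcode_platemap.csv>
    |   |- platemap
    |   |   |- <platemap_1.txt>
    |   |   |- <platemap_2.txt>
    ```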

- Run the next cell to annotate the profiles. 

In [14]:
df_temp_list = []
for pl in plate_list:
    df_plate = df.loc[df['Metadata_Plate'] == pl]
    # look up the platemap file assigned to this plate in the barcode map
    barcode_map = barcode_df[barcode_df['Assay_Plate_Barcode'] == pl]
    metadata_filename = barcode_map['Plate_Map_Name'].iloc[0]
    metadata = pd.read_csv(os.path.join(platemap_path, 'platemap', metadata_filename + '.txt'), sep='\t')
    # annotate: join the platemap well positions to the profile wells
    df_temp = pycytominer.annotate(profiles=df_plate, platemap=metadata, join_on=["Metadata_well_position", "Metadata_Well"])
    df_temp_list.append(df_temp)
    print('Shape of each plate ', df_temp.shape)
df = pd.concat(df_temp_list, axis=0)

Shape of each plate  (60, 682)
Shape of each plate  (60, 682)
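
The annotation step above can be pictured as a table join between the platemap and the profiles. The sketch below is a simplified pandas version under two assumptions: platemap columns receive a `Metadata_` prefix before joining, and the join is an inner merge on the well columns. The platemap contents (`well_position`, `control_type`, the value `trt`) are invented for illustration.

```python
import pandas as pd

# Toy well-level profiles (values are invented).
profiles = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1"],
        "Metadata_Well": ["B2", "B3"],
        "efficientnet_0": [0.54, 0.52],
    }
)
# Hypothetical platemap; in the notebook this comes from a tab-separated .txt file.
platemap = pd.DataFrame(
    {
        "well_position": ["B2", "B3"],
        "control_type": ["negcon", "trt"],
    }
)

# Assumption: platemap columns are prefixed so they are recognized as metadata.
platemap = platemap.add_prefix("Metadata_")

# Join platemap rows to profile rows that share the same well.
annotated = platemap.merge(
    profiles,
    left_on="Metadata_well_position",
    right_on="Metadata_Well",
    how="inner",
).drop(columns="Metadata_well_position")
print(annotated)
```

With an inner join, wells that are missing from the platemap drop out of the result, so comparing the printed shape per plate against the expected well count is a quick sanity check.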


### 2C) Normalize

- Normalize the dataset plate by plate, using either all wells (**samples = all**) or only the negative controls (**samples = negcon**) as the reference.
- Choose one of the two options below and run only that cell.
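
To show how the two reference choices differ, here is a minimal sketch of robust (median and MAD) scaling in plain pandas. It assumes the usual formulation, (x - median) / (1.4826 * MAD + epsilon), with the median and MAD estimated on the reference samples only; the helper `robust_mad` and the toy plate are hypothetical and are not pycytominer's code.

```python
import pandas as pd

# Toy plate: three negative-control wells and two treated wells (values are invented).
plate = pd.DataFrame(
    {
        "Metadata_control_type": ["negcon", "negcon", "negcon", "trt", "trt"],
        "efficientnet_0": [1.0, 2.0, 3.0, 6.0, 8.0],
    }
)
feature_cols = ["efficientnet_0"]


def robust_mad(df, features, reference=None, epsilon=0.0):
    """Scale features by the median and MAD of the reference rows (hypothetical helper)."""
    # Reference rows: the whole plate, or the subset selected by a query string.
    ref = df if reference is None else df.query(reference)
    median = ref[features].median()
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = (ref[features] - median).abs().median() * 1.4826
    out = df.copy()
    out[features] = (df[features] - median) / (mad + epsilon)
    return out


to_all = robust_mad(plate, feature_cols)
to_negcon = robust_mad(plate, feature_cols, "Metadata_control_type == 'negcon'")
print(to_all)
print(to_negcon)
```

With `epsilon` set to 0, a feature whose MAD is zero in the reference rows divides by zero and yields infinite or missing values, which is worth checking for before feature selection.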

#### Normalize TO NEGCON => run next cell

In [8]:
df_temp_list = []
for pl in plate_list:
    df_temp = df.loc[df['Metadata_Plate'] == pl]
    print(df_temp.shape)
    df_norm_temp = pycytominer.normalize(df_temp, features=derived_features, method='mad_robustize', mad_robustize_epsilon=0, samples="Metadata_control_type == 'negcon'")
    df_temp_list.append(df_norm_temp)
df_norm2 = pd.concat(df_temp_list, axis=0)
cells_that_run.append('normalized_negcon')

(60, 682)
(60, 682)


#### Normalize TO ALL => run next cell

In [15]:
df_temp_list = []
for pl in plate_list:
    df_temp = df.loc[df['Metadata_Plate'] == pl]
    print(df_temp.shape)
    df_norm_temp = pycytominer.normalize(df_temp, features=derived_features, method='mad_robustize', mad_robustize_epsilon=0)
    df_temp_list.append(df_norm_temp)
df_norm2 = pd.concat(df_temp_list, axis=0)
cells_that_run.append('normalized')

(60, 682)
(60, 682)


### Export only the normalized profiles

In [16]:
# output_path is chosen in the folder-selection cell of the Export section; run that cell first
output_name = '_'.join(cells_that_run)
output_file = output_path + r'/' + output_name + '.csv'
df_norm2.to_csv(output_file)
print('Successfully exported to:', output_file)

Successfully exported to: D:\2022_10_04_AgNPCellRecovery_fossa_Cimini\workspace\deepprofiler\2023_04_25_CNN_CellPainting_GFPRNA\profiles/2023_04_25_CNN_CellPainting_GFPRNA_normalized.csv


### 2D) Feature selection

In [60]:
df_selected = pycytominer.feature_select(
    df_norm2,
    features=derived_features,
    operation=['correlation_threshold', 'variance_threshold', 'drop_na_columns', 'drop_outliers'],
    outlier_cutoff=500,
)
print('Number of columns removed:', df_norm2.shape[1] - df_selected.shape[1])
print('Percentage of columns removed:',100 - ((df_selected.shape[1]*100)/df_norm2.shape[1]))
cells_that_run.append('feature_select')

Number of columns removed: 17
Percentage of columns removed: 2.489019033674964
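
For intuition about why columns get removed, the sketch below applies two simplified filters in plain pandas: dropping zero-variance features, then dropping one feature from each highly correlated pair. It is a stand-in under stated assumptions (an illustrative threshold of 0.9, and always dropping the later column of a pair), not the rule pycytominer applies; all names and values are invented.

```python
import numpy as np
import pandas as pd

# Toy normalized features (values are invented).
toy = pd.DataFrame(
    {
        "f0": [1.0, 2.0, 3.0, 4.0, 5.0],
        "f1": [2.0, 4.1, 6.0, 8.2, 10.0],  # almost a multiple of f0
        "f2": [5.0, 1.0, 4.0, 2.0, 3.0],
        "f3": [7.0, 7.0, 7.0, 7.0, 7.0],   # constant, carries no information
    }
)

# Simplified variance filter: keep only features that vary at all.
kept = toy.loc[:, toy.var() > 0]

# Simplified correlation filter: look at each pair once (upper triangle)
# and drop the later column when |r| exceeds the illustrative threshold.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]

selected = kept.drop(columns=to_drop)
print(list(selected.columns))
```

Here `f3` is removed for having no variance and `f1` for being nearly redundant with `f0`.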


## Export

In [9]:
output_path = eg.diropenbox(msg="Choose an output folder", default=r"D:")
print('Path to save the profile', output_path)

Path to save the profile D:\2022_10_04_AgNPCellRecovery_fossa_Cimini\workspace\deepprofiler\2023_04_25_CNN_CellPainting_GFPRNA\profiles


In [61]:
output_name = '_'.join(cells_that_run)
output_file = output_path + r'/' + output_name + '.csv'
df_selected.to_csv(output_file)
print('Successfully exported to:', output_file)

Successfully exported to: D:\2022_10_04_AgNPCellRecovery_fossa_Cimini\workspace\deepprofiler\2023_04_25_CNN_CellPainting_GFPRNA\profiles/2023_04_25_CNN_CellPainting_GFPRNA_normalized_feature_select.csv
