# Processing data before Morpheus exploration

- In this notebook, we import, annotate, normalize and perform feature selection on data extracted from the cpg0004-lincs dataset, available in the [cellpainting-gallery](https://github.com/broadinstitute/cellpainting-gallery). The comma-separated values (CSV) files we import here correspond to plates "SQ00015195", "SQ00015218", "SQ00015219", "SQ00015220" and "SQ00015221", which share [this platemap](https://github.com/broadinstitute/lincs-cell-painting/tree/master/metadata/platemaps/2016_04_01_a549_48hr_batch1/platemap).

- These data were generated by object segmentation and feature extraction performed with [CellProfiler](https://cellprofiler.org/), which exports a CSV file with single cells in the rows and features in the columns.

- The single-cell data were then pre-processed with a [profiling recipe](https://github.com/cytomining/profiling-recipe) that takes the single-cell tables and aggregates them into one table containing the mean of each feature per well.

## Input data structure

- You need one comma-separated values (CSV) file per plate, named with the plate name followed by .csv (e.g., SQ00015195.csv). Each file has:
    - Rows: mean-aggregated information for each well;
    - Columns: features measured from the objects Nuclei, Cytoplasm or Cells. The object name must be present as a prefix of the feature name (for example, Nuclei_AreaShape_Area).
- You also need a platemap.csv file containing a well position column (Metadata_Well) plus any other information you'd like to add to your data (such as Compound, Concentration, etc.).
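
A quick check of these requirements could look like the sketch below. The helper name `check_inputs` and the toy values are hypothetical and only illustrate the expected structure; they are not part of the original workflow.

```python
import pandas as pd

# Compartment prefixes expected at the start of every feature name.
COMPARTMENTS = ("Nuclei_", "Cytoplasm_", "Cells_")


def check_inputs(profiles: pd.DataFrame, platemap: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the inputs look fine.

    Hypothetical helper, written only to illustrate the input structure.
    """
    problems = []
    if "Metadata_Well" not in platemap.columns:
        problems.append("platemap is missing the Metadata_Well column")
    feature_cols = [c for c in profiles.columns if c.startswith(COMPARTMENTS)]
    if not feature_cols:
        problems.append("no feature columns with a Nuclei_/Cytoplasm_/Cells_ prefix")
    return problems


# Toy example with one well and one feature (made-up values).
toy_profiles = pd.DataFrame(
    {"Image_Metadata_Well": ["A01"], "Nuclei_AreaShape_Area": [1234.5]}
)
toy_platemap = pd.DataFrame({"Metadata_Well": ["A01"], "Compound": ["DMSO"]})
problems = check_inputs(toy_profiles, toy_platemap)
print(problems)
```

Running the same check on your own plate file and platemap before step 4 can catch a missing well column early.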

## Steps

0. Import libraries
1. Inputs: define a list with the files/plate names, and the directory and filename for platemap files. Import platemap.
2. Import data from public S3 bucket
3. Provide the path to the files (locally)
4. Open, annotate and normalize plate files
5. Feature selection
6. Remove duplicated columns and export to CSV


# 0. Import libraries

In [3]:
import pandas as pd
import numpy as np
import pycytominer
import boto3

# 1. Inputs

In [2]:
plates = ["SQ00015195", "SQ00015218", "SQ00015219", "SQ00015220", "SQ00015221"]

In [10]:
platemap_dir = r"G:\My Drive\GitHub\2022_Garcia-Fossa_submitted\basic_protocol_1\notebooks\data_processing"
platemap_filename = r"\platemap.csv"

## a. Import Platemap as a dataframe

In [11]:
platemap = pd.read_csv(platemap_dir + platemap_filename)

# 2. Import data from public S3 bucket (skip this if you have the files locally)

- We are using data that are publicly available in the cellpainting-gallery on AWS. We can therefore download each table using bucket_name, bucket_dir (the location of the files inside the bucket) and the name of each file, which is built from the plate names in the plates list above.

- This is easier because the bucket is publicly available. If you're retrieving these files from a private bucket, you should provide your AWS credentials as well:
    - Instead of s3 = boto3.resource('s3'), use s3 = boto3.resource('s3', aws_access_key_id=... , aws_secret_access_key=...), so that the s3.meta.client call in the download loop still works;
    - The next steps are the same.
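
The object key for each plate follows the pattern bucket_dir + plate + "/" + plate + ".csv". The sketch below spells this out with a hypothetical helper, `build_s3_key`. It also shows one possible way to read the public bucket with unsigned requests, which is an assumption on our side for machines where no AWS credentials are configured; the download loop in this notebook does not use it.

```python
def build_s3_key(bucket_dir: str, plate: str) -> str:
    """Return the object key for one plate: <bucket_dir><plate>/<plate>.csv."""
    return f"{bucket_dir}{plate}/{plate}.csv"


def download_plates_unsigned(plates, bucket_name, bucket_dir):
    """Download each plate CSV into the working directory without credentials.

    Assumption: the bucket allows anonymous reads, so requests can be unsigned.
    """
    # Imported here so build_s3_key can be used even without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.resource("s3", config=Config(signature_version=UNSIGNED))
    for plate in plates:
        s3.meta.client.download_file(
            Bucket=bucket_name,
            Key=build_s3_key(bucket_dir, plate),
            Filename=f"{plate}.csv",
        )


example_key = build_s3_key(
    "cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/",
    "SQ00015195",
)
print(example_key)
```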

In [6]:
bucket_name = "cellpainting-gallery"
bucket_dir = "cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/"

In [4]:
s3 = boto3.resource('s3')

In [7]:
for pl in plates:
    folder = pl + "/"
    filename = pl + '.csv'
    s3.meta.client.download_file(Filename=filename,Bucket=bucket_name,Key=bucket_dir + folder + filename)

# 3. Provide the path to the files

- Replace the path below with the folder where the files are stored on your machine.

In [12]:
path = r"G:\My Drive\GitHub\2022_Garcia-Fossa_submitted\basic_protocol_1\notebooks\data_processing"
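
The path above is Windows-specific. If you work on macOS or Linux, or want the notebook to run on any system, one option is to build file paths with pathlib instead of joining strings with "\\". This is a minimal sketch; `plate_csv_path` and the folder name `data_processing` are hypothetical placeholders.

```python
from pathlib import Path


def plate_csv_path(data_dir, plate):
    """Build the path to one plate file using the separator of the current OS."""
    return Path(data_dir) / f"{plate}.csv"


# "data_processing" is a placeholder; use the folder that holds your files.
example = plate_csv_path("data_processing", "SQ00015195")
print(example)
```

The resulting Path object can be passed directly to pd.read_csv.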

# 4. Open, annotate and normalize plate files

Here, we loop over the plate names in the plates list, perform all the operations below on each table, and join the results at the end using pd.concat:

- We import and open each table separately;
- We print the shape of each dataframe to confirm that they all have the same number of rows (384) and columns (1785);
- We copy the Image_Metadata_Plate and Image_Metadata_Well columns into new columns without the Image_ prefix (this helps to keep these columns after normalization, since pycytominer recognizes metadata columns by the Metadata_ prefix);
- We add annotations using pycytominer. We use the platemap imported above and annotate based on the Metadata_Well column contained in both tables (platemap and df_temp);
- We normalize the features to a common scale with the mad_robustize method, using the DMSO wells as the negative control.
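
To make the normalization step more concrete: as we understand it, mad_robustize subtracts the median of the control wells from each feature and divides by the scaled median absolute deviation (MAD) of those controls plus epsilon. The sketch below reproduces that idea with pandas on made-up values. It is a simplified illustration under that assumption, not pycytominer's implementation, and `mad_robustize_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def mad_robustize_sketch(df, feature_cols, control_mask, epsilon=0.0):
    """Center on the control median and scale by the control MAD (x 1.4826)."""
    controls = df.loc[control_mask, feature_cols]
    median = controls.median()
    # The 1.4826 factor makes the MAD comparable to a standard deviation
    # when the data are normally distributed.
    mad = (controls - median).abs().median() * 1.4826
    out = df.copy()
    out[feature_cols] = (df[feature_cols] - median) / (mad + epsilon)
    return out


# Toy plate: five DMSO wells and one treated well (made-up values).
toy = pd.DataFrame(
    {
        "Metadata_Compound": ["DMSO"] * 5 + ["compound_x"],
        "Nuclei_AreaShape_Area": [1.0, 2.0, 3.0, 4.0, 5.0, 4.4826],
    }
)
is_control = toy["Metadata_Compound"] == "DMSO"
normalized = mad_robustize_sketch(toy, ["Nuclei_AreaShape_Area"], is_control)
print(normalized)
```

With epsilon set to 0, a feature whose control MAD is exactly zero would produce infinite or missing values, which is worth keeping in mind when reading the feature selection step.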

In [13]:
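df_lst = []
for plt in plates:
    # Read the per-well profiles for this plate
    df_temp = pd.read_csv(path + "\\" + plt + ".csv", low_memory=False)
    print(df_temp.shape)
    # Copy the plate and well columns to names with the Metadata_ prefix
    df_temp['Metadata_Plate'] = df_temp['Image_Metadata_Plate']
    df_temp['Metadata_Well'] = df_temp['Image_Metadata_Well']
    # Add the platemap annotations, matching wells on Metadata_Well in both tables
    df_temp = pycytominer.annotate(df_temp, platemap, join_on = ['Metadata_Well', 'Metadata_Well'])
    # Normalize each feature against the DMSO (negative control) wells of this plate
    df_norm = pycytominer.normalize(df_temp, method = 'mad_robustize', mad_robustize_epsilon = 0, samples = "Metadata_Compound == 'DMSO'")
    df_lst.append(df_norm)
# Stack the normalized plates into a single dataframe
df = pd.concat(df_lst)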


(384, 1785)
(384, 1785)
(384, 1785)
(384, 1785)
(384, 1785)


In [16]:
df.head()

Unnamed: 0,Metadata_Concentration,Metadata_moa,Metadata_Compound,Metadata_Plate,Metadata_Plate.1,Metadata_Plate.2,Metadata_Plate.3,Metadata_Plate.4,Metadata_Plate.5,Metadata_Plate.6,...,Nuclei_Texture_Variance_DNA_5_0,Nuclei_Texture_Variance_ER_10_0,Nuclei_Texture_Variance_ER_20_0,Nuclei_Texture_Variance_ER_5_0,Nuclei_Texture_Variance_Mito_10_0,Nuclei_Texture_Variance_Mito_20_0,Nuclei_Texture_Variance_Mito_5_0,Nuclei_Texture_Variance_RNA_10_0,Nuclei_Texture_Variance_RNA_20_0,Nuclei_Texture_Variance_RNA_5_0
0,0.0,DMSO,DMSO,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,...,1.368746,0.955682,1.411891,0.733342,-1.708083,-1.606106,-2.14179,0.045162,1.526507,-0.204441
1,0.0,DMSO,DMSO,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,...,-0.407413,0.645711,0.682114,0.448937,-1.507224,-1.409072,-2.572561,-1.329326,0.381323,-3.880608
2,0.0,DMSO,DMSO,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,...,0.178761,-0.485885,-1.030875,-0.148046,-0.827435,-0.837187,-1.457859,0.111732,0.071883,-0.197552
3,0.0,DMSO,DMSO,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,...,1.224848,1.724854,2.560806,1.153979,0.965171,1.071148,0.663331,0.262732,0.771792,0.537351
4,0.0,DMSO,DMSO,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,SQ00015195,...,1.257855,1.194356,1.142756,0.707998,0.172388,0.055924,0.381461,-0.614506,-1.148015,-1.03978


# 5. Feature selection using pycytominer methods

- We select features with the following pycytominer operations: 'correlation_threshold', 'variance_threshold', 'drop_na_columns', 'blocklist' and 'drop_outliers'. We keep the default parameters of each operation and pass outlier_cutoff = 500 explicitly.

- We then print the number of removed columns.

- For more details, check the [pycytominer documentation](https://pycytominer.readthedocs.io/en/latest/).
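
To give an intuition for two of these operations, the sketch below drops features with too many missing values and features whose largest absolute value exceeds a cutoff. It is a simplified pandas illustration on made-up values, not pycytominer's code; the function name and the na_cutoff value of 0.05 are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def drop_na_and_outlier_features(df, feature_cols, na_cutoff=0.05, outlier_cutoff=500):
    """Drop features with a high share of NaN or with extreme absolute values."""
    features = df[feature_cols]
    # Share of missing values per feature.
    too_many_na = features.isna().mean() > na_cutoff
    # Largest absolute value per feature (NaN are ignored by max).
    is_outlier = features.abs().max() > outlier_cutoff
    flagged = too_many_na | is_outlier
    to_drop = flagged[flagged].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy normalized profiles (made-up values).
toy = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02", "A03"],
        "Nuclei_ok": [0.1, -0.5, 1.2],
        "Nuclei_na": [0.3, np.nan, 0.2],
        "Nuclei_extreme": [0.0, 900.0, -1.0],
    }
)
feature_cols = ["Nuclei_ok", "Nuclei_na", "Nuclei_extreme"]
selected, dropped = drop_na_and_outlier_features(toy, feature_cols)
print(dropped)
```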

In [14]:
df_selected = pycytominer.feature_select(df, operation = ['correlation_threshold', 'variance_threshold', 'drop_na_columns', 'blocklist','drop_outliers'], outlier_cutoff = 500)

In [15]:
print('How many columns were dropped?',df.shape[1] - df_selected.shape[1])

How many columns were dropped? 1188


# 6. Remove duplicated columns and export to a CSV file

In [17]:
df_final = df_selected.loc[:,~df_selected.columns.duplicated()].copy()
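
The line above keeps only the first occurrence of each column name. A toy example with made-up values shows what columns.duplicated() returns and how the boolean mask is used:

```python
import pandas as pd

# A dataframe in which the Metadata_Plate column appears twice.
toy = pd.DataFrame(
    [["SQ00015195", "SQ00015195", 0.5]],
    columns=["Metadata_Plate", "Metadata_Plate", "Nuclei_AreaShape_Area"],
)

# duplicated() marks every repeat of a name after its first occurrence;
# the ~ operator inverts the mask so that only first occurrences are kept.
keep = ~toy.columns.duplicated()
deduplicated = toy.loc[:, keep].copy()
print(list(deduplicated.columns))
```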

In [18]:
df_final.to_csv(path+ "\\" + "BP1_Processed.csv")