# Data pre-processing

Data Source: https://www.kaggle.com/tunguz/rxrx19a

Code Author: Raquel Aoki (raoki@sfu.ca)

This Jupyter Notebook reads the dataset and creates a smaller subset that is easier to use during Week 2.

In [6]:
import pandas as pd 
import numpy as np 

# Forward slashes in paths work on Windows, macOS, and Linux
full_embeddings = pd.read_csv('data/embeddings.csv')
full_metadata = pd.read_csv('data/metadata.csv')


In [7]:
print('Shape Full Embeddings: ',full_embeddings.shape)
print('Shape Full Metadata: ',full_metadata.shape)
print('Head Full Metadata:\n', full_metadata.head())

Shape Full Embeddings:  (305520, 1025)
Shape Full Metadata:  (305520, 10)
Head Full Metadata:
           site_id       well_id cell_type experiment  plate well  site  \
0  HRCE-1_1_T32_1  HRCE-1_1_T32      HRCE     HRCE-1      1  T32     1   
1  HRCE-1_1_S24_1  HRCE-1_1_S24      HRCE     HRCE-1      1  S24     1   
2  HRCE-1_1_R08_1  HRCE-1_1_R08      HRCE     HRCE-1      1  R08     1   
3  HRCE-1_1_Q40_1  HRCE-1_1_Q40      HRCE     HRCE-1      1  Q40     1   
4  HRCE-1_1_T40_1  HRCE-1_1_T40      HRCE     HRCE-1      1  T40     1   

  disease_condition treatment  treatment_conc  
0              Mock       NaN             NaN  
1              Mock       NaN             NaN  
2              Mock       NaN             NaN  
3              Mock       NaN             NaN  
4              Mock       NaN             NaN  


In [8]:
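# Renaming disease_condition: "Mock" and "UV Inactivated SARS-CoV-2" are both grouped as "inactive".
# Assigning the result back avoids calling replace(..., inplace=True) on a column selection.
full_metadata["disease_condition"] = full_metadata["disease_condition"].replace({"Active SARS-CoV-2": "active", "Mock": "inactive", "UV Inactivated SARS-CoV-2": "inactive"})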
print(full_metadata['disease_condition'].value_counts()) 


active      280376
inactive     18240
Name: disease_condition, dtype: int64
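
Note: the two counts above add up to 298,616, fewer than the 305,520 rows in the metadata. `value_counts()` leaves out missing values by default, so the remaining rows most likely have no `disease_condition` label. The sketch below shows one way to check this; it uses a small toy DataFrame with made-up values rather than the real metadata.

```python
import numpy as np
import pandas as pd

# Toy stand-in for full_metadata (hypothetical values, not the real data)
toy_metadata = pd.DataFrame({
    "site_id": ["a_1", "b_1", "c_1", "d_1"],
    "disease_condition": ["active", "inactive", np.nan, "active"],
})

# dropna=False keeps missing labels as their own entry in the tally
counts_with_missing = toy_metadata["disease_condition"].value_counts(dropna=False)
print(counts_with_missing)

# Number of rows without a label
n_missing = toy_metadata["disease_condition"].isna().sum()
print("Rows without a label:", n_missing)
```

Rows without a label are dropped automatically by the `=='active'` and `=='inactive'` filters used later, so they do not end up in the subset.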


## Challenge

The full embeddings file is 3.3 GB, and some students might have trouble loading it.

Solution: Make a subset and save it in a more manageable size.
Bonus: We make the subset balanced (the same number of active and inactive samples), which will help us during the classification part.
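
If the full CSV does not fit in memory at all, one possible workaround (not the approach used in this notebook) is to read it in chunks and keep only the rows of interest from each chunk. The sketch below assumes the wanted `site_id` values are already known, for example from the much smaller metadata file. The toy CSV, the chunk size, and the ids are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy CSV standing in for data/embeddings.csv (hypothetical values)
toy = pd.DataFrame({
    "site_id": [f"s{i}" for i in range(10)],
    "feature_0": range(10),
})
csv_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.csv")
toy.to_csv(csv_path, index=False)

wanted_ids = {"s1", "s4", "s7"}  # site_ids to keep (hypothetical)

# Read 4 rows at a time and keep only the matching rows of each chunk,
# so the whole file is never held in memory at once
kept = []
for chunk in pd.read_csv(csv_path, chunksize=4):
    kept.append(chunk[chunk["site_id"].isin(wanted_ids)])

subset = pd.concat(kept, ignore_index=True)
print(subset)
```

For the real file, a much larger `chunksize` (tens of thousands of rows) would be a more sensible starting point.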

In [12]:
full_metadata_active = full_metadata[full_metadata['disease_condition']=='active']
full_metadata_inactive = full_metadata[full_metadata['disease_condition']=='inactive']

new_metadata_a = full_metadata_active.sample(n=8000, random_state=25)
new_metadata_i = full_metadata_inactive.sample(n=8000, random_state=25)

In [13]:
new_metadata = pd.concat([new_metadata_a,new_metadata_i], axis = 0)
print('Shape new Metadata:' , new_metadata.shape)

Shape new Metadata: (16000, 10)
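
The same kind of balanced subset can also be drawn in a single step with `groupby(...).sample(...)`, available in pandas 1.1 and later. This is an alternative sketch on toy data, not the code used above; the column values and sample size are made up.

```python
import pandas as pd

# Toy stand-in for full_metadata with unbalanced classes (hypothetical values)
toy_metadata = pd.DataFrame({
    "site_id": [f"s{i}" for i in range(12)],
    "disease_condition": ["active"] * 9 + ["inactive"] * 3,
})

# Draw the same number of rows from each class;
# n must not exceed the size of the smallest class
balanced = toy_metadata.groupby("disease_condition").sample(n=3, random_state=25)

class_counts = balanced["disease_condition"].value_counts()
print(class_counts)
```

Rows with a missing `disease_condition` are left out by `groupby` by default, which matches the behaviour of the two explicit filters.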


Next: We need to select from the embeddings only the images present in the 'new_metadata' subset.
The column 'site_id' is used to make the match.

In [14]:
new_embeddings =  full_embeddings[full_embeddings['site_id'].isin(new_metadata['site_id'])]


In [15]:
print('Shape new metadata: ', new_metadata.shape)


Shape new metadata:  (16000, 10)
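
Before saving, it may be worth checking that the two subsets really describe the same images. Filtering with `isin` keeps the row order of the embeddings, while `sample` shuffles the metadata, so the rows of the two tables are generally not in the same order. The sketch below, on toy data with made-up values, compares the sets of ids and then reorders the embeddings to follow the metadata.

```python
import pandas as pd

# Toy stand-ins for new_metadata and full_embeddings (hypothetical values)
toy_metadata = pd.DataFrame({
    "site_id": ["s3", "s1", "s2"],
    "disease_condition": ["active", "inactive", "active"],
})
toy_embeddings = pd.DataFrame({
    "site_id": ["s1", "s2", "s3", "s4"],
    "feature_0": [0.1, 0.2, 0.3, 0.4],
})

toy_subset = toy_embeddings[toy_embeddings["site_id"].isin(toy_metadata["site_id"])]

# Do both tables contain exactly the same ids?
same_ids = set(toy_subset["site_id"]) == set(toy_metadata["site_id"])
print("Same ids:", same_ids)

# Reorder the embeddings so that row i matches row i of the metadata
aligned = (
    toy_subset.set_index("site_id")
    .loc[toy_metadata["site_id"].tolist()]
    .reset_index()
)
print(aligned)
```

This reordering assumes every `site_id` is unique and present in both tables; otherwise `.loc` raises a `KeyError` or returns duplicated rows.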


To reduce the size of the files, we save them in pickle format with xz compression.
Reference: https://docs.python.org/3/library/pickle.html

In [16]:
new_embeddings.to_pickle("embeddings.pkl", compression = 'xz')
new_metadata.to_pickle("metadata.pkl", compression = 'xz')
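
To load the files again, the compression has to be passed explicitly: the file names end in `.pkl`, so pandas cannot infer xz compression from the extension. A minimal round-trip sketch on a toy DataFrame (the values and the temporary path are made up):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for new_embeddings (hypothetical values)
toy = pd.DataFrame({"site_id": ["s1", "s2"], "feature_0": [0.1, 0.2]})

path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.pkl")
toy.to_pickle(path, compression="xz")

# The extension is .pkl, not .xz, so compression="xz" is required when reading
restored = pd.read_pickle(path, compression="xz")
print(restored.equals(toy))
```

As with any pickle file, only load files from a source you trust.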
