# Data pre-processing

Data Source: https://www.kaggle.com/tunguz/rxrx19a

Code Author: Raquel Aoki (raoki@sfu.ca)

This Jupyter Notebook reads the dataset and creates a smaller subset that is easier to work with during Week 2.

In [1]:
import pandas as pd
import numpy as np

# Forward slashes in paths work on Windows, macOS, and Linux
full_embeddings = pd.read_csv('data/embeddings.csv')
full_meta = pd.read_csv('data/metadata.csv')


In [2]:
print('Shape Full Embeddings: ',full_embeddings.shape)
print('Shape Full Metadata: ',full_meta.shape)
print('Head Full Metadata:\n', full_meta.head())

Shape Full Embeddings:  (305520, 1025)
Shape Full Metadata:  (305520, 10)
Head Full Metadata:
           site_id       well_id cell_type experiment  plate well  site  \
0  HRCE-1_1_T32_1  HRCE-1_1_T32      HRCE     HRCE-1      1  T32     1   
1  HRCE-1_1_S24_1  HRCE-1_1_S24      HRCE     HRCE-1      1  S24     1   
2  HRCE-1_1_R08_1  HRCE-1_1_R08      HRCE     HRCE-1      1  R08     1   
3  HRCE-1_1_Q40_1  HRCE-1_1_Q40      HRCE     HRCE-1      1  Q40     1   
4  HRCE-1_1_T40_1  HRCE-1_1_T40      HRCE     HRCE-1      1  T40     1   

  disease_condition treatment  treatment_conc  
0              Mock       NaN             NaN  
1              Mock       NaN             NaN  
2              Mock       NaN             NaN  
3              Mock       NaN             NaN  
4              Mock       NaN             NaN  


## Challenge

The full embeddings dataset is 3.3 GB, and some students might have trouble loading it.

Solution: Take a random subset of the rows and save it as a smaller, more manageable file.

In [3]:
# Draw 15,000 random rows; a fixed random_state makes the sample reproducible
new_embeddings = full_embeddings.sample(n=15000, random_state=25)

In [4]:
print('Shape new Embeddings:' , new_embeddings.shape)

Shape new Embeddings: (15000, 1025)
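
The cell above assumes the full CSV fits in memory. On a machine where it does not, one possible alternative (not what this notebook does) is to read the file in chunks and sample a fraction of each chunk. The sketch below uses a small toy CSV written to a temporary folder; the file name, the chunk size, and the 5% fraction are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for embeddings.csv (hypothetical data: 1,000 rows, 4 features)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(1000, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)
toy.insert(0, "site_id", [f"site_{i}" for i in range(1000)])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_embeddings.csv")
    toy.to_csv(csv_path, index=False)

    sampled_chunks = []
    # Only 100 rows are held in memory at a time
    with pd.read_csv(csv_path, chunksize=100) as reader:
        for i, chunk in enumerate(reader):
            # Keep about 5% of each chunk; vary the seed so that each
            # chunk does not select the same row positions
            sampled_chunks.append(chunk.sample(frac=0.05, random_state=25 + i))

chunked_sample = pd.concat(sampled_chunks, ignore_index=True)
print("Shape chunked sample:", chunked_sample.shape)
```

Sampling a fixed fraction per chunk gives a subset spread evenly over the file, which is not identical to a single call to `sample(n=...)` on the full table, but is usually close enough for a teaching subset.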


Next, we keep only the metadata rows whose embeddings are present in the 'new_embeddings' subset.
The column 'site_id' is used to match the two tables.

In [5]:
new_metadata =  full_meta[full_meta['site_id'].isin(new_embeddings['site_id'])]


In [6]:
print('Shape new metadata: ', new_metadata.shape)


Shape new metadata:  (15000, 10)
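
Note that filtering with `isin` keeps the metadata rows in their original order, while `sample` shuffles the embeddings, so row i of one table does not necessarily describe row i of the other. If positional alignment matters later, the metadata can be reordered by 'site_id'. A minimal sketch on toy tables (all values below are hypothetical, and 'site_id' is assumed to be unique):

```python
import pandas as pd

# Toy stand-ins for the sampled embeddings and the full metadata
toy_embeddings = pd.DataFrame(
    {"site_id": ["s3", "s0", "s2"], "feature_0": [0.3, 0.0, 0.2]}
)
toy_meta = pd.DataFrame(
    {
        "site_id": ["s0", "s1", "s2", "s3"],
        "cell_type": ["HRCE", "HRCE", "VERO", "VERO"],
    }
)

# isin keeps the metadata order (s0, s2, s3), not the sampled order
subset_meta = toy_meta[toy_meta["site_id"].isin(toy_embeddings["site_id"])]

# Reorder the metadata so that row i describes embedding row i
aligned_meta = (
    subset_meta.set_index("site_id")
    .loc[toy_embeddings["site_id"]]
    .reset_index()
)

print(aligned_meta)
```

An alternative with the same effect is to merge the two tables on 'site_id' whenever both are needed together.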


To reduce the size of the files, we save them as xz-compressed pickle files.
Reference: https://docs.python.org/3/library/pickle.html

In [7]:
new_embeddings.to_pickle("embeddings.pkl", compression='xz')
new_metadata.to_pickle("metadata.pkl", compression='xz')
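
To load the files again, keep in mind that pandas infers compression from the file extension by default, and these file names end in `.pkl` rather than `.xz`. Passing `compression='xz'` explicitly avoids a read error. A small round-trip sketch with a toy table in a temporary folder (the toy data and file name are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the embeddings subset (hypothetical values)
toy = pd.DataFrame({"site_id": ["s0", "s1"], "feature_0": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_embeddings.pkl")
    toy.to_pickle(pkl_path, compression="xz")

    # The name has no .xz extension, so state the compression explicitly
    restored = pd.read_pickle(pkl_path, compression="xz")

# True when the restored table matches the original exactly
print(restored.equals(toy))
```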
