# Day 3: Classification Models

On Day 1 we learned about embeddings, and yesterday we used factor models to reduce the dimensionality of the dataset. Today, our goal is to build a classification model to classify the images in X x X.

## 1. Loading and Exploring Data
As we did yesterday, we are going to clone the repo to get the data. We added three new files containing the latent features from the three factor models we explored on Day 2.


In [1]:
! git clone https://github.com/ai4all-sfu/comp-biology-2020.git

Cloning into 'comp-biology-2020'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 32 (delta 1), reused 11 (delta 1), pack-reused 20
Unpacking objects: 100% (32/32), done.


In [2]:
#Loading the libraries 

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

# Load the files we cloned from GitHub (xz-compressed pickles)
embeddings = pd.read_pickle('comp-biology-2020/embeddings.pkl', compression='xz')
metadata = pd.read_pickle('comp-biology-2020/metadata.pkl', compression='xz')

# Use site_id as the index of the embeddings table
embeddings.set_index('site_id', inplace=True)

In [3]:
# Loading latent features from yesterday 

pca = np.load('comp-biology-2020/latent_features/pca50.npz')['arr_0']
svd = np.load('comp-biology-2020/latent_features/svd60.npz')['arr_0']
aut = np.load('comp-biology-2020/latent_features/a32.npz')['arr_0']

print('PCA Dimensions: ',pca.shape, '\nSVD Dimensions: ',svd.shape, '\nAutoencoder Dimensions: ',aut.shape)

PCA Dimensions:  (15000, 50) 
SVD Dimensions:  (15000, 60) 
Autoencoder Dimensions:  (15000, 32)
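
The `['arr_0']` key used above comes from how NumPy names arrays inside an `.npz` archive: when an array is passed to `np.savez` as a positional argument, it is stored under the default name `arr_0`. The sketch below illustrates this with a toy matrix; the data and the file name `toy_features.npz` are made up for the example and are not part of the course repo.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a latent-feature matrix (hypothetical data, not the course files)
features = np.arange(12, dtype=float).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_features.npz')  # hypothetical file name

    # A positional argument is saved under the default key 'arr_0'
    np.savez(path, features)

    # Reopen the archive and read the array back by that key
    with np.load(path) as archive:
        keys = archive.files
        loaded = archive['arr_0']

print(keys)
print(loaded.shape)
```

If you save an array with a keyword argument instead, for example `np.savez(path, pca=features)`, it is stored under that keyword (`'pca'`) rather than `'arr_0'`.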


On the left side, click 'Files' and then 'Upload to Session Storage', as shown in the image below:
![alt text](https://i.imgur.com/oLk88Mu.jpg)

Upload the files from yesterday's exercise: 'mypca.npz' and 'mysvd.npz' 

In [4]:
mypca = np.load('mypca.npz')['arr_0']
mysvd = np.load('mysvd.npz')['arr_0']

print('My PCA Dimensions: ',mypca.shape, '\nMy SVD Dimensions: ',mysvd.shape) 

My PCA Dimensions:  (15000, 50) 
My SVD Dimensions:  (15000, 50)


Now let's check the target, which is the disease condition. Based on the image, we want to identify whether it is active or inactive for SARS-CoV-2.


In [None]:
print(metadata['disease_condition'].value_counts()) 


To use this information in our classification models, we are going to transform this column into a binary variable: it will take the value 1 if the disease condition is active and 0 if it is inactive.

In [8]:
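# Assign the result back to the column instead of calling replace(..., inplace=True)
# on a column selection, which newer pandas versions may not apply to the DataFrame
metadata['disease_condition'] = metadata['disease_condition'].replace({'active': 1, 'inactive': 0})
print(metadata['disease_condition'].value_counts())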



1    8000
0    8000
Name: disease_condition, dtype: int64
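
As a small self-contained illustration of the encoding step above, the sketch below applies the same idea to a toy table. The four labels and the `target` column name are hypothetical and only stand in for `metadata['disease_condition']`. It uses `Series.map`, one possible alternative to `replace`: any label missing from the dictionary becomes `NaN`, so unexpected categories are easy to spot.

```python
import pandas as pd

# Toy labels standing in for metadata['disease_condition'] (hypothetical values)
toy = pd.DataFrame({'disease_condition': ['active', 'inactive', 'inactive', 'active']})

# Map each label to an integer and store it in a new column
toy['target'] = toy['disease_condition'].map({'active': 1, 'inactive': 0})

# Labels not present in the dictionary would show up as NaN here
n_unmapped = int(toy['target'].isna().sum())

# Count how many rows fall in each class
counts = toy['target'].value_counts()
print(counts)
print('Unmapped labels:', n_unmapped)
```

Checking the class counts after encoding, as the notebook does, confirms that both classes are present before any model is trained.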


## 2. Classification Models

We are going