# Day 3: Classification Models

On the first day, we got familiar with the dataset and feature extraction methods. On the second day, we learned about factor models to reduce the dataset's dimensionality. Our goal today is to build a model that classifies the images as 'active' or 'inactive'.

## 1. Loading and Exploring Data
Similar to what we did yesterday, we are going to clone the repository to get the data. We added three new files that contain the latent features generated using the factor models from Day 2.


In [None]:
! git clone https://github.com/ai4all-sfu/comp-biology-2020.git

In [None]:
#Loading the libraries 

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

#Loading files we cloned from github 
embeddings = pd.read_pickle('comp-biology-2020/embeddings.pkl', compression = 'xz')
metadata = pd.read_pickle('comp-biology-2020/metadata.pkl', compression = 'xz')

#Selecting only the columns 'site_id' and 'disease_condition'
metadata = metadata.iloc[:,[0,7]]

#Adding the target 'disease_condition' to the embeddings dataset by joining on 'site_id'
embeddings01 = embeddings.set_index('site_id').join(metadata.set_index('site_id'))
print(embeddings01.head())

In [None]:
# Loading our latent features from yesterday (they are sorted in the same order of the embeddings dataset)
pca = np.load('comp-biology-2020/latent_features/pca50.npz')['arr_0']
nmf = np.load('comp-biology-2020/latent_features/mf60.npz')['arr_0']
aut = np.load('comp-biology-2020/latent_features/a32.npz')['arr_0']

print('PCA Dimensions: ',pca.shape, '\nMF Dimensions: ',nmf.shape, '\nAutoencoder Dimensions: ',aut.shape)

On the left side, click 'Files' and then 'Upload to Session Storage', as shown in the image below:  
![alt text](https://i.imgur.com/oLk88Mu.jpg)

Upload the files from yesterday's exercise: 'mypca.npz' and 'mymf.npz'

In [None]:
#load the latent features for PCA and MF into variables 'mypca' and 'mymf'
mypca = np.load('mypca.npz')['arr_0']
mymf = np.load('mymf.npz')['arr_0']

#printing the shape/dimensions of the loaded latent features
print('My PCA Dimensions: ',mypca.shape, '\nMy MF Dimensions: ',mymf.shape) 
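
The `['arr_0']` lookup used when loading the `.npz` files can look mysterious. As far as NumPy is concerned, an array passed to `np.savez` without a keyword name is stored under the default key `arr_0`. The sketch below illustrates this with a toy array and a temporary file; the file name `toy_features.npz` and the array contents are made up for the example.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a latent-feature matrix (4 images, 3 latent features).
latent = np.arange(12, dtype=float).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_features.npz")  # hypothetical file name
    np.savez(path, latent)             # positional argument, so it is stored as 'arr_0'
    with np.load(path) as archive:
        keys = list(archive.files)     # names of the arrays inside the archive
        restored = archive["arr_0"]    # same lookup the notebook uses

print(keys)
print(restored.shape)
```

If your own file was saved with a keyword (for example `np.savez(path, features=latent)`), the key would be `features` instead, and `archive.files` is a quick way to check.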


Recall that if the cell is infected with the virus 'SARS-CoV-2', it is 'active'. Otherwise, it is 'inactive'. 

Now, let's look at the target column, 'disease_condition'. Based on the cell image, we want to identify whether the cell is active or inactive.

In [None]:
print(embeddings01['disease_condition'].value_counts()) 

To use this information in our classification models, we will transform the values in the column 'disease_condition' into binary values. If the disease condition is active, it will receive the value 1, and if it is inactive, it will receive the value 0.

In [None]:
target = embeddings01['disease_condition'].replace({'active':1, 'inactive':0}).tolist() 

print(embeddings01['disease_condition'].value_counts()) 
print("Before transforming:" ,embeddings01['disease_condition'][0:5].tolist())
print("After transforming:",target[0:5])
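
The same label-to-number idea can be tried on a tiny example. The sketch below uses a made-up `toy_condition` series (not our metadata) and `Series.map`, which is one alternative to `replace` for this kind of dictionary lookup. It also computes the fraction of 'active' samples, which is a quick way to judge how balanced the two classes are.

```python
import pandas as pd

# Toy labels standing in for embeddings01['disease_condition'].
toy_condition = pd.Series(["active", "inactive", "inactive", "active"])

# Series.map looks each value up in the dictionary.
label_to_int = {"active": 1, "inactive": 0}
toy_target = toy_condition.map(label_to_int).tolist()

# Fraction of 'active' samples, useful for judging class balance.
active_fraction = sum(toy_target) / len(toy_target)

print(toy_target)
print(active_fraction)
```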

During our analysis, we will also check one specific image to get an idea of how the models work. You can see the image below: ![alt text](https://i.imgur.com/0CnrNXp.png)

And according to our metadata, this is an 'active' cell. 

In [None]:
metadata[metadata['site_id']=='VERO-2_2_Z45_1']

In [None]:
#Building a dictionary with all the features 
features = {'PCA': pca, 'MF': nmf, 'Autoencoder':aut, 'MyPCa': mypca, 'MyMF': mymf}

indice = embeddings01.index.get_loc('VERO-2_2_Z45_1') 
features_example = {'PCA': pca[indice], 'MF': nmf[indice], 
                    'Autoencoder':aut[indice], 'MyPCa': mypca[indice], 'MyMF': mymf[indice]}

## 2. Classification Models

Our task is to classify if a cell image is infected (class 'active') or not (class 'inactive').


We will use sklearn to import our classifiers. The three models we are going to explore are Logistic Regression, KNN Classifier, and Random Forest.

In [None]:
import warnings
#from matplotlib import pyplot
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

warnings.filterwarnings('ignore')
seed = 9

# variables to hold the results and models' names
results = pd.DataFrame(columns=['model', 'features','accuracy_train', 'accuracy_test', 'precision_test', 'recall_test'])
active_image_example = pd.DataFrame(columns=['model', 'features','predicted_target'])

The first model we are going to explore is Logistic Regression. 



In [None]:
lr = LogisticRegression(random_state=seed)
model_name = 'Logistic Regression'

for f in features:
  #1) Split the features into training and testing set 
  X_train, X_test, y_train, y_test = train_test_split(features[f],target , test_size=0.3, random_state=seed) 

  #2) Fit the model (fitting a model means that the algorithm learns the relationship between the features and the outcome, so that it can predict the outcome for new samples)
  model = lr.fit(X_train,y_train)

  #3) Predict the labels for the training set and testing set
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)

  #4) Save the results and calculate the accuracy 
  output = {'model':model_name, 
            'features':f, 
            'accuracy_train':accuracy_score(y_train,y_train_pred), 
            'accuracy_test':accuracy_score(y_test,y_test_pred),
            'precision_test':precision_score(y_test,y_test_pred) ,
            'recall_test':recall_score(y_test,y_test_pred) }
  #append and add to dataframe 'results'          
  results = results.append(output,ignore_index=True)
  active_image_example = active_image_example.append({'model':model_name, 'features':f, 'predicted_target':model.predict(features_example[f].reshape(1,-1))}, 
                              ignore_index=True)

print(results)
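
To make the three scores in the table concrete, here is a small sketch that computes accuracy, precision and recall by hand on made-up labels (not our data). It is meant to mirror what `accuracy_score`, `precision_score` and `recall_score` report for binary labels when 1 is the positive class, which is the scikit-learn default.

```python
# Toy ground truth and predictions (1 = 'active', 0 = 'inactive').
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # active, predicted active
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # inactive, predicted inactive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inactive, predicted active
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # active, predicted inactive

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # of the predicted 'active', how many really are
recall = tp / (tp + fn)             # of the real 'active', how many were found

print(accuracy, precision, recall)
```

Precision and recall can disagree with accuracy, which is why the results table keeps all three.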

Now we are going to explore Random Forest. 

In [None]:
rf = RandomForestClassifier(n_estimators=30, random_state=seed)
model_name = 'Random Forest'

for f in features:
  #1) Split the features into training and testing set 
  X_train, X_test, y_train, y_test = train_test_split(features[f],target , test_size=0.3, random_state=seed) 

  #2) Fit the model
  model = rf.fit(X_train,y_train)

  #3) Predict the labels for the training set and testing set
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)
  
  #4) Save the results and calculate the accuracy 
  output = {'model':model_name, 
            'features':f, 
            'accuracy_train':accuracy_score(y_train,y_train_pred), 
            'accuracy_test':accuracy_score(y_test,y_test_pred),
            'precision_test':precision_score(y_test,y_test_pred) ,
            'recall_test':recall_score(y_test,y_test_pred) }
  results = results.append(output,ignore_index=True)
  active_image_example = active_image_example.append({'model':model_name, 'features':f, 'predicted_target':model.predict(features_example[f].reshape(1,-1))}, 
                              ignore_index=True)

print(results)

### Activity 1: 

Looking at the results printed, which combination of features + model produces the best results?

### Activity 2: 

Can you now fit the KNN by yourself? Explore different 'n_neighbors' to see if it improves the results. 

HINT: If you run a cell that appends outputs to 'results' several times, your plot will have multiple entries for the same feature + model combination. To avoid that and keep only the final models in 'results', comment out the following lines with # while you are exploring: 
```
#results = results.append(output,ignore_index=True) 
#active_image_example = active_image_example.append({'model':model_name, 'features':f, 'predicted_target':model.predict(features_example[f].reshape(1,-1))},                              ignore_index=True)
```
and add: 

```
print(output)
```
After exploring the question and deciding the number of neighbors, add this code again: 
```
results = results.append(output,ignore_index=True) 
active_image_example = active_image_example.append({'model':model_name, 'features':f, 'predicted_target':model.predict(features_example[f].reshape(1,-1))},                              ignore_index=True)
```
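
A note on pandas versions: `DataFrame.append`, used in the lines above, was removed in pandas 2.0. If your environment runs a newer pandas, `pd.concat` can play the same role. The sketch below is a minimal illustration with made-up `results_demo` and `output_demo` values; the accuracy numbers are placeholders, not real results.

```python
import pandas as pd

# A results table that already holds one (made-up) row.
results_demo = pd.DataFrame(
    [{"model": "Random Forest", "features": "PCA", "accuracy_test": 0.5}]
)

# Hypothetical output dictionary, shaped like the 'output' built in the loops.
output_demo = {"model": "Logistic Regression", "features": "PCA", "accuracy_test": 0.5}

# Wrap the dictionary in a one-row DataFrame and concatenate it to the table.
results_demo = pd.concat(
    [results_demo, pd.DataFrame([output_demo])], ignore_index=True
)

print(results_demo)
```

The same pattern should carry over to `active_image_example`: build a one-row DataFrame from the dictionary and concatenate it.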


In [None]:
knn = KNeighborsClassifier(n_neighbors=ADD A NUMBER HERE)
model_name = 'KNN'

#ADD YOUR CODE HERE
#HINT: EXPLORE DIFFERENT VALUES OF N_NEIGHBORS TO CHECK IF IT IMPROVES THE RESULTS
print(results)

Let us use a scatter plot to compare the performance of all the three machine learning algorithms.

In [None]:
#Plot the comparison using seaborn which is a data visualization library
import seaborn as sns; sns.set()

#We had appended the model names and the corresponding features in the dataframe 'results', so we will use that for the plot.

#Converting the column "model" in dataframe 'results', to datatype 'category'
results["model"] = results["model"].astype('category')
results["features"] = results["features"].astype('category')

fig = plt.figure()
fig.suptitle('\nComparison of performance between the three ML algorithms') #title of the figure
#Building the scatterplot: x and y set the axes, 'hue' colors the points by model, and 'style' gives each feature set its own marker
ax = sns.scatterplot(x="accuracy_train", y="accuracy_test", hue="model", style="features", data=results, s = 80)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) #setting a legend for the plot
plt.xlabel('Accuracy - Training Set')
plt.ylabel('Accuracy - Testing Set')
plt.show()

The autoencoder features seem to give worse results. Let's remove them and make our plot again.

In [None]:
#remove the autoencoder features from the results dataframe (copy() gives us an independent dataframe to modify)
results_without_autoencoder = results[results['features']!='Autoencoder'].copy()
#again convert the columns 'model' and 'features' in the new dataframe 'results_without_autoencoder' to categorical datatype to generate the plot
results_without_autoencoder["model"] = results_without_autoencoder["model"].astype('category')
#also drop the now-unused 'Autoencoder' category from the 'features' column
results_without_autoencoder["features"] = results_without_autoencoder["features"].astype('category').cat.remove_unused_categories()

fig = plt.figure()
fig.suptitle('\nComparison of performance between the three ML algorithms')
ax = sns.scatterplot(x="accuracy_train", y="accuracy_test", hue="model", style="features", s = 80, data=results_without_autoencoder)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Accuracy - Training Set')
plt.ylabel('Accuracy - Testing Set')
plt.show()

### Activity 3

Based on the plot, which combination of model and features do you think has the best results?

### Activity 4 (Advanced Level)

Now it's your turn to explore a classification model. 

There is a classification model called [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). After reading about this type of model, try to implement it yourself! Before adding the outputs to the 'results' dataframe, explore some of the available parameters as we did with KNN. Then, you can add your best SVM model to the 'results' dataframe.

Part of this activity is to check the documentation, learn how to import the model from the library, and use it based on that documentation. :) 


In [None]:
#INSERT YOUR CODE HERE


In [None]:
#INSERT YOUR CODE HERE TO COMPARE ALL THE MODELS

## 3. Convolutional Neural Network

In the previous section, we explored classification models with the features we built yesterday.

Now, we will explore one last model, called Convolutional Neural Network (CNN).
We will not be using our feature set here. Instead, we will be using the original feature embeddings file given in the dataset.

While the other models work for all types of inputs (images, tabular data, others), this CNN model receives only the image embeddings. To use the CNN model, we will slightly modify the format of the input data (the embeddings file) to fit the format the CNN expects.

In [None]:
#We will use the library 'keras' for most of our CNN operations. Keras is an open-source deep learning/neural network library for Python.
from keras.utils import to_categorical #helper that converts integer labels to one-hot vectors
from sklearn.model_selection import train_test_split

#Dropping the column 'disease_condition' from embeddings01 so that only the embeddings remain
temp = embeddings01.drop('disease_condition',axis=1)
#Adding a trailing axis so the data goes from 2D (samples, features) to 3D (samples, features, 1), the input shape Conv1D expects
trainData = np.expand_dims(temp.to_numpy(), axis=2)

#Splitting into training data and testing data
X_train, X_test, y_train, y_test = train_test_split(trainData, target, test_size=0.3, random_state=7)

# number of samples in training data
print("Number of training samples:", X_train.shape[0])
# number of samples in testing data
print("Number of testing samples:", X_test.shape[0])
# number of features used (the length of each embedding vector)
print("Number of features:", X_train.shape[1])

#convert the train and test labels to np arrays
y_train01=np.array(y_train)
y_test01=np.array(y_test)

print(y_train[0:5]) #print the first five training labels

#Convert the 0/1 labels to one-hot vectors, i.e. a binary class matrix with one column per class
y_train = to_categorical(y_train01)
y_test = to_categorical(y_test01)

print(X_train.shape) #shape of training data
print(y_train.shape) #shape of training labels
print(X_test.shape) #shape of test data
print(y_test.shape) #shape of test labels
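
The two reshaping steps in the cell above can be tried on a toy example. This sketch uses made-up data (5 samples with 8 features, far narrower than our embeddings) and plain NumPy. The `np.eye` lookup is only a stand-in to show what a one-hot encoding looks like, not the Keras helper itself.

```python
import numpy as np

# Toy embeddings: 5 samples with 8 features each.
toy_embeddings = np.random.default_rng(0).normal(size=(5, 8))
toy_labels = np.array([1, 0, 0, 1, 1])

# Add a trailing channel axis: (samples, features) -> (samples, features, 1).
toy_3d = np.expand_dims(toy_embeddings, axis=2)

# One-hot encode with an identity-matrix lookup:
# row k of eye(2) is the encoding of class k.
toy_one_hot = np.eye(2)[toy_labels]

print(toy_3d.shape)
print(toy_one_hot)
```

Each row of `toy_one_hot` has a single 1, in column 0 for 'inactive' and in column 1 for 'active'.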


We use the keras library to build a basic sequential CNN model consisting of 4 convolutional layers and one fully connected layer at the end, with a softmax activation function for classification. Each convolutional layer is followed by a max-pooling operation, and the first three are also followed by a dropout layer.

The code below is going to take a couple of minutes to run. 

In [None]:
#Loading libraries (only the ones this model uses)
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Flatten, Dropout, Dense
from keras.layers import Conv1D, MaxPooling1D

model = Sequential() #calling sequential before defining the layers helps to linearly stack the layers

model.add(Conv1D(32,3,padding='same',activation='relu',input_shape=(1024,1))) #convolution layer
model.add(MaxPooling1D(pool_size=2)) #max pool layer
model.add(Dropout(0.1)) #dropout layer

model.add(Conv1D(32,3,padding='same',activation='relu')) #convolution layer
model.add(MaxPooling1D(pool_size=2)) #max pool layer
model.add(Dropout(0.1)) #dropout layer

model.add(Conv1D(64,3,padding='same',activation='relu')) #convolution layer
model.add(MaxPooling1D(pool_size=2)) #max pool layer
model.add(Dropout(0.1)) #dropout layer

model.add(Conv1D(64,3,padding='same',activation='relu')) #convolution layer
model.add(MaxPooling1D(pool_size=2)) #maxpool layer
model.add(Flatten()) #Flatten is the function that converts the pooled feature map from maxpool to a single column that is passed to the fully connected layer.
model.add(Dense(units=2,activation = 'softmax')) #Dense is the function that creates the fully connected layer

model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy']) #compile function configures a few parameters for training

#Fit the model, train for 10 epochs, with batch size = 32
cnnhistory=model.fit(X_train, np.array(y_train), batch_size=32, epochs=10, validation_data=(X_test, np.array(y_test)))

#Print the training accuracy obtained in the final epoch.
print("\nAccuracy: {:.2f}%".format(cnnhistory.history['accuracy'][-1]*100))

#Print the validation accuracy obtained in the final epoch.
print("Validation Accuracy: {:.2f}%".format(cnnhistory.history['val_accuracy'][-1]*100))
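
To see how the data shrinks as it moves through the model above, the sketch below redoes the shape arithmetic by hand. It assumes the settings used in the cell (input length 1024 with 1 channel, kernel size 3, `padding='same'`, pool size 2, and a 2-unit dense layer) and the usual parameter-count formulas for convolution and dense layers. It does not build or train anything.

```python
# (filters, kernel_size) for each of the four convolution blocks.
conv_blocks = [(32, 3), (32, 3), (64, 3), (64, 3)]

length, channels = 1024, 1   # input_shape=(1024, 1)
total_params = 0

for filters, kernel_size in conv_blocks:
    # Conv1D weights: kernel_size * in_channels * filters, plus one bias per filter.
    total_params += kernel_size * channels * filters + filters
    channels = filters       # padding='same' keeps the length unchanged
    length //= 2             # MaxPooling1D(pool_size=2) halves the length

# Flatten turns the (length, channels) feature map into one long vector.
flattened = length * channels

# Dense layer with 2 units: one weight per (input, unit) pair plus one bias per unit.
total_params += flattened * 2 + 2

print(length, channels, flattened, total_params)
```

Calling `model.summary()` on the Keras model is a way to check these numbers against the real layers. Dropout and pooling layers have no trainable parameters, so they do not appear in the count.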

In [None]:
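#model.predict returns one probability per class; np.argmax picks the most likely class for each sample
#(predict_classes is not available in recent Keras versions)
y_train_pred = np.argmax(model.predict(X_train), axis=1)
y_test_pred = np.argmax(model.predict(X_test), axis=1)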

output = {'model':'CNN', 
            'features':'embeddings', 
            'accuracy_train':accuracy_score(y_train01,y_train_pred), 
            'accuracy_test':accuracy_score(y_test01,y_test_pred), 
            'precision_test':precision_score(y_test01,y_test_pred) ,
            'recall_test':recall_score(y_test01,y_test_pred) }

#Let us make a prediction for the sample image that we saw in the beginning.
active_image_example = active_image_example.append({'model':'CNN', 'features':'embeddings', 'predicted_target':np.argmax(model.predict(trainData[indice].reshape(1,1024,1)), axis=1)}, 
                              ignore_index=True)
results = results.append(output,ignore_index=True)
print(results)
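
A softmax output layer returns one probability per class for every sample, and the predicted class is the position of the largest probability. The sketch below shows that step on made-up probabilities (not outputs of our model), with column 0 standing for 'inactive' and column 1 for 'active'.

```python
import numpy as np

# Toy softmax outputs for 4 samples; columns are P(inactive), P(active).
toy_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.7, 0.3],
])

# Index of the largest probability in each row = predicted class.
toy_pred = np.argmax(toy_probs, axis=1)

# Map the class indices back to readable labels.
class_names = np.array(["inactive", "active"])
toy_pred_names = class_names[toy_pred].tolist()

print(toy_pred)
print(toy_pred_names)
```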

Let's check how each model predicted the results for the image we used above.

In [None]:
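#Convert numeric predictions to labels; entries that are already strings are kept, so re-running this cell is safe
active_image_example['predicted_target'] = [item if isinstance(item, str) else ('inactive' if item==0 else 'active') for item in active_image_example['predicted_target']]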

active_image_example

It looks like all our models predicted this image correctly! :D 

### Activity 5:

Let us see if we can achieve the same accurate prediction when the CNN model has fewer layers, and is trained for fewer epochs.

Create a CNN model consisting of three convolutional layers and one fully connected layer by following the instructions given in the comments. Simply reuse the matching lines of code from the previous model for each layer. After building the model, train it for **5 epochs**. Then, run the code (already given) to print the training and validation accuracy, and to predict the class for the input data.

In [None]:
model = Sequential() 

#Insert a convolution layer, and a maxpool layer one after the other.


#Insert a convolution layer, a maxpool layer, and a dropout layer one after the other.


#Insert a convolution layer, a maxpool layer, a flatten layer, and a dense layer (fully connected layer) one after the other.


#setting all the other parameters
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy']) 

#Insert code to fit the model, train for 5 epochs, with batch size = 32 (store the result in 'cnnhistory', which the prints below use)


#Training accuracy obtained in the final epoch.
print("Accuracy: {:.2f}%".format(cnnhistory.history['accuracy'][-1]*100))

#Validation accuracy obtained in the final epoch.
print("Validation Accuracy: {:.2f}%".format(cnnhistory.history['val_accuracy'][-1]*100))

Now that you have built and trained your model and printed the training and validation accuracy, run the code below to compute the overall training and test accuracy and to make a prediction for our sample image.

In [None]:
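#model.predict returns one probability per class; np.argmax picks the most likely class for each sample
#(predict_classes is not available in recent Keras versions)
y_train_pred = np.argmax(model.predict(X_train), axis=1)
y_test_pred = np.argmax(model.predict(X_test), axis=1)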

output = {'model':'myCNN', 
            'features':'embeddings', 
            'accuracy_train':accuracy_score(y_train01,y_train_pred), 
            'accuracy_test':accuracy_score(y_test01,y_test_pred), 
            'precision_test':precision_score(y_test01,y_test_pred) ,
            'recall_test':recall_score(y_test01,y_test_pred)  }

active_image_example = active_image_example.append({'model':'myCNN', 'features':'embeddings', 'predicted_target':np.argmax(model.predict(trainData[indice].reshape(1,1024,1)), axis=1)}, 
                              ignore_index=True)
results = results.append(output,ignore_index=True)
print(results)

Now, let us view the prediction made by the model you just built, along with the predictions made by all the other models we used till now!

In [None]:
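#Convert numeric predictions to labels; entries already converted by the earlier cell are strings and are kept as they are
active_image_example['predicted_target'] = [item if isinstance(item, str) else ('inactive' if item==0 else 'active') for item in active_image_example['predicted_target']]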

active_image_example

### Activity 6:
 
Create two plots to compare the previously explored models (CNN, KNN, LR, RF, others).

1. Accuracy on training data *versus* testing data; 
2. Precision *versus* Recall on testing data. 

Given a set of unlabelled cell images, if you had to choose only one model to classify whether they are infected by SARS-CoV-2 (label ‘active’) or not (label ‘inactive’), which one would it be? 



In [None]:
#INSERT YOUR CODE HERE