# Day 2: Factor Models

Yesterday we explored the embedding file and metadata.  

Today, we are going to work with factor models!  



## 1. Loading the datasets 

The first step is to clone the repo with the data. The original dataset, available on [Kaggle](https://www.kaggle.com/tunguz/rxrx19a?select=embeddings.csv), is larger than 3 GB and contains 305520 images, so we created a subset with only 15000 images. 

If you are curious about how we created this subset, you can check our code in the [GitHub repository](https://github.com/ai4all-sfu/comp-biology-2020/blob/master/day0-data-preprocessing.ipynb). 

In [None]:
! git clone https://github.com/ai4all-sfu/comp-biology-2020.git

To check that the files are available, we use the code below. We should see two folders: 'sample_data' and 'comp-biology-2020'. 

In [None]:
! ls

## 2. Analysis

The data is now ready to be used, so we can start our analysis! 


In [1]:
#Loading the libraries 
#Hint: You only need to do this once in your code
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler


In [None]:
#Loading Files we cloned from github 
embeddings = pd.read_pickle('comp-biology-2020/embeddings.pkl', compression = 'xz')
metadata = pd.read_pickle('comp-biology-2020/metadata.pkl', compression = 'xz')

#changing the index
embeddings.set_index('site_id', inplace=True)


With these libraries and files loaded, the data is ready for our analysis. 


In [None]:
#Checking how big the datasets are: 
print('Dimensions embeddings data: ',embeddings.shape)
print('Dimensions metadata data  : ',metadata.shape)

We can also check the format of the metadata file:

In [None]:
metadata.head()

The last pre-processing step is to standardize the dataset. This is crucial: without it, some features would have a larger influence on the final results only because their scale is larger than that of other features.


In [None]:
from sklearn.preprocessing import StandardScaler
#standardize the dataset’s features onto unit scale (mean = 0 and variance = 1)
x = StandardScaler().fit_transform(embeddings)
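
To make the idea of standardization concrete, here is a minimal sketch on a made-up toy array (the values and the names `toy`, `toy_mean`, and `toy_std` are assumptions for illustration only, not part of the dataset). It applies the same formula by hand with NumPy: subtract each column's mean and divide by its standard deviation.

```python
import numpy as np

# Toy data (made up for illustration): two features on very different scales
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 6000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
# (np.std uses the population formula by default, as StandardScaler does)
toy_mean = toy.mean(axis=0)
toy_std = toy.std(axis=0)
toy_standardized = (toy - toy_mean) / toy_std

# After scaling, every column has mean 0 and standard deviation 1
print('Column means after scaling:', toy_standardized.mean(axis=0).round(6))
print('Column std after scaling  :', toy_standardized.std(axis=0).round(6))
```

After this step both columns live on the same scale, so neither dominates just because its raw numbers are bigger.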

The metadata has the label of interest in the column 'disease_condition', which we will explore further on Day 3. 

Today, we are going to work on a very common challenge in Machine Learning: dimensionality reduction. 

The embeddings file has 1025 columns. Such a large number of columns increases the complexity of classification models, can cause overfitting, and makes training very time-consuming. 

In the following analysis, we will explore three different methods to reduce the dimensionality of these datasets, known as *factor models*. 

To see more details about each method, check the slides. 

# Principal Components Analysis (PCA)

In [None]:
#Loading the PCA library  
from sklearn.decomposition import PCA 
#available SVD solvers 
solver = ['auto', 'full', 'arpack', 'randomized']


In [None]:
### Model Definition
#Number of latent features (principal components)
k1 = 50 
#Model definition
pca = PCA(n_components=k1, svd_solver =solver[0]) 
#Model Fitting
pca.fit(x)

How good are the latent features?

When performing a PCA, a simple way to verify whether the latent features are good is to check the variance explained by the principal components.
Below, we print the cumulative value and then plot the value explained by each component individually.

In [None]:
#Cumulative value: 
print('Explained Variance:', sum(pca.explained_variance_ratio_)) 


In [None]:

fig, ax = plt.subplots()
ax.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax.set_ylabel('PCA explained variance ratio')
ax.set_xlabel('Number of Components')
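
The cumulative explained variance can also be used to pick the number of components. The sketch below is one possible helper, not part of the original notebook: the function name `components_for_variance`, the `target` threshold, and the example ratios are all assumptions made up for illustration.

```python
import numpy as np

def components_for_variance(ratios, target=0.90):
    """Return (cumulative, k): the running total of explained variance and the
    smallest number of components whose total reaches `target` (None if never)."""
    cumulative = np.cumsum(ratios)
    reached = np.nonzero(cumulative >= target)[0]
    k = int(reached[0]) + 1 if reached.size else None
    return cumulative, k

# Made-up ratios for illustration; in the notebook use pca.explained_variance_ratio_
example_ratios = np.array([0.5, 0.25, 0.125, 0.0625])
cumulative, k = components_for_variance(example_ratios, target=0.90)
print('Cumulative explained variance:', cumulative)
print('Components needed for 90%:', k)
```

In the notebook, you could pass `pca.explained_variance_ratio_` instead of `example_ratios` and plot the returned `cumulative` array with `ax.plot` to see the curve.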


For k1=50 and solver = 'auto', the PCA results seem good, so we will move forward with these latent features. 




In [None]:
#principalComponents has a lower dimension and holds our latent variables
principalComponents = pca.transform(x)

The last step is to transform the latent features to a scale between 0 and 1. By doing this, we can improve the performance of the classification models. 

In [None]:
#Creating the scaler object
scale01 = MinMaxScaler()
#Scaling the principal components 
principalComponents = scale01.fit_transform(principalComponents)
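
As a rough sketch of what scaling to the range 0 to 1 means, the snippet below applies the min-max formula by hand to a made-up array (the name `latent` and its values are assumptions for illustration). It assumes no column is constant, since a constant column would lead to a division by zero in this hand-written version.

```python
import numpy as np

# Made-up latent features for illustration (3 samples, 2 components)
latent = np.array([[-2.0, 10.0],
                   [ 0.0, 20.0],
                   [ 2.0, 40.0]])

# Min-max scaling per column: the smallest value maps to 0, the largest to 1
col_min = latent.min(axis=0)
col_max = latent.max(axis=0)
latent01 = (latent - col_min) / (col_max - col_min)

print(latent01)
```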

### Activity 1: 

Explore other combinations of k and svd_solver. Can you find other promising sets of latent features?

Save your best result to be used in the classification model tomorrow.

In [None]:
#Model definition and fit: 
k = 

In [None]:
#Evaluation: Plot and percentage of variance explained 


In [None]:
mypca = pca.fit_transform(x) #your new pca
mypca = scale01.fit_transform(mypca) #scale between 0 and 1
print('My k is ', mypca.shape[1])
np.savez_compressed('mypca.npz',mypca)

After saving, click on the folders on the left side, and you should see your 'mypca.npz' file. If not, click on 'refresh'.

Download your file by clicking on the three dots on the right side of your file's name. 

# Matrix Factorization

This factor model decomposes the original dataset into two smaller matrices. 

For more information, check out the slides! 
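
To picture the two smaller matrices, here is a minimal sketch with made-up sizes and random non-negative values (the names `W`, `H`, and the toy dimensions are assumptions for illustration). In the cells below, `nmf_features` plays the role of `W` and `nmf.components_` plays the role of `H`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 6, 8, 2   # toy sizes, chosen only for illustration

# Two small non-negative matrices
W = rng.random((n_samples, k))       # one row of k latent features per sample
H = rng.random((k, n_features))      # how each latent feature maps to the columns

# Their product has the same shape as the original dataset
X_approx = W @ H

print('W shape:', W.shape)
print('H shape:', H.shape)
print('W @ H shape:', X_approx.shape)
print('Values stored:', W.size + H.size, 'instead of', X_approx.size)
```

The smaller k is, the fewer numbers are needed to describe the data, at the cost of a rougher approximation.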

In [None]:
from sklearn.decomposition import NMF

#NMF requires non-negative input, so we rescale the data to be between 0 and 1
x01 = scale01.fit_transform(x)

#Size of the latent variables
k2 = 60
#model definition 
nmf = NMF(n_components=k2, random_state=0, init = 'nndsvda') 
#model fitting
nmf.fit(x01)
#nmf_features has the latent variables with dimension k2
nmf_features = nmf.transform(x01)

To evaluate the matrix factorization, we measure how well it reconstructs the original data. 

In [None]:
print('Original Variables:\n', x01[0:4, 0:4])
reconstruction = np.dot(nmf_features,nmf.components_) 
print('New  Variables:\n', reconstruction[0:4, 0:4])

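#reconstruction_err_ is the Frobenius norm of (x01 - reconstruction); here it is divided by the number of entries
print('Average Error: ', nmf.reconstruction_err_/(x01.shape[0]*x01.shape[1]))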
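
There is more than one way to summarize a reconstruction error. The sketch below shows two common options on made-up matrices: the mean squared error per entry and the Frobenius norm of the difference relative to the norm of the original. The helper name `reconstruction_errors` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def reconstruction_errors(original, reconstructed):
    """Return (mse, relative): mean squared error per entry and the
    Frobenius norm of the difference divided by the norm of the original."""
    diff = original - reconstructed
    mse = np.mean(diff ** 2)
    relative = np.linalg.norm(diff) / np.linalg.norm(original)
    return mse, relative

# Made-up matrices for illustration: only one entry differs
original = np.array([[1.0, 2.0], [3.0, 4.0]])
reconstructed = np.array([[1.0, 2.0], [3.0, 2.0]])

mse, relative = reconstruction_errors(original, reconstructed)
print('Mean squared error:', mse)
print('Relative error    :', relative)
```

In the notebook, you could call `reconstruction_errors(x01, reconstruction)` to compare different values of k2 on the same scale.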

The reconstructed matrix looks very close to the original variables, and the average error is very low. So we will keep the latent features of this factor model. Before saving, we are going to transform them to be between 0 and 1. 

In [None]:
nmf_features = scale01.fit_transform(nmf_features) 

### Activity 2: 

Choose a new value of k2. Run the Matrix Factorization again and save your output for tomorrow's lesson. 

In [None]:
#model definition and fitting


In [None]:
#evaluation


Why do you think this Matrix Factorization will improve our results?

In [None]:
mymf = nmf.transform(x01)
mymf = scale01.fit_transform(mymf) 
print('My k is ', mymf.shape[1])
np.savez_compressed('mymf.npz',mymf)

#Download the file as previously explained.

# Autoencoder 

This is the last dimensionality reduction model we are going to explore. 

In [None]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior() 

#Don't worry about this class object, we are only going to use it :) 
#Reference: https://rubikscode.net/2018/11/26/3-ways-to-implement-autoencoders-with-tensorflow-and-python/
class Autoencoder(object):
    def __init__(self, inout_dim, encoded_dim):
        learning_rate = 0.1 
        
        # Weights and biases
        hidden_layer_weights = tf.Variable(tf.random_normal([inout_dim, encoded_dim]))
        hidden_layer_biases = tf.Variable(tf.random_normal([encoded_dim]))
        output_layer_weights = tf.Variable(tf.random_normal([encoded_dim, inout_dim]))
        output_layer_biases = tf.Variable(tf.random_normal([inout_dim]))
        
        # Neural network
        self._input_layer = tf.placeholder('float', [None, inout_dim])
        self._hidden_layer = tf.nn.relu(tf.add(tf.matmul(self._input_layer, hidden_layer_weights), hidden_layer_biases))
        self._output_layer = tf.matmul(self._hidden_layer, output_layer_weights) + output_layer_biases
        self._real_output = tf.placeholder('float', [None, inout_dim])
        
        self._meansq = tf.reduce_mean(tf.square(self._output_layer - self._real_output))
        self._optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(self._meansq)
        self._training = tf.global_variables_initializer()
        self._session = tf.Session()
        
    def train(self, input_train, input_test, batch_size, epochs):
        self._session.run(self._training)
        
        for epoch in range(epochs):
            epoch_loss = 0
            for i in range(int(input_train.shape[0]/batch_size)):
                epoch_input = input_train[ i * batch_size : (i + 1) * batch_size ]
                _, c = self._session.run([self._optimizer, self._meansq], feed_dict={self._input_layer: epoch_input, self._real_output: epoch_input})
                epoch_loss += c
            # Report the accumulated loss once per epoch, after all batches
            print('Epoch', epoch + 1, '/', epochs, 'loss:', epoch_loss)
        
    def getEncoded(self, item):
        encoded_ = self._session.run(self._hidden_layer, feed_dict={self._input_layer:[item]})
        return encoded_
    
    def getDecoded(self, item):
        decoded_ = self._session.run(self._output_layer, feed_dict={self._input_layer:[item]})
        return decoded_
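
If you are curious about what the class above computes, here is a rough NumPy sketch of a single forward pass through the same kind of network: a ReLU hidden layer (the encoder) followed by a linear output layer (the decoder). The toy sizes, the random weights, and the names `W1`, `b1`, `W2`, `b2`, and `batch` are assumptions for illustration; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
inout_dim, encoded_dim = 8, 3   # toy sizes; the notebook uses x.shape[1] and 32

# Randomly initialized weights and biases, as in the class before training
W1 = rng.normal(size=(inout_dim, encoded_dim))
b1 = rng.normal(size=encoded_dim)
W2 = rng.normal(size=(encoded_dim, inout_dim))
b2 = rng.normal(size=inout_dim)

# A made-up batch of 5 samples
batch = rng.normal(size=(5, inout_dim))

# Encoder: linear map followed by ReLU gives the latent variables
encoded = np.maximum(0, batch @ W1 + b1)
# Decoder: linear map back to the original dimension
decoded = encoded @ W2 + b2
# Mean squared error between the reconstruction and the input
loss = np.mean((decoded - batch) ** 2)

print('Encoded shape:', encoded.shape)
print('Decoded shape:', decoded.shape)
print('Loss before any training:', loss)
```

Training adjusts the weights and biases so that this loss gets smaller, which is what the `train` method does with the optimizer.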


For this factor model, we need to split the dataset into two parts: a training set and a testing set. (This part takes a couple of minutes to run.)

In [None]:

from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(x,test_size=0.33, random_state=42)

#Model definition
autoencodertf = Autoencoder(x.shape[1], 32)

#Model Fitting
autoencodertf.train(x_train, x_test, 100, 50)


To evaluate the autoencoder, we are going to check how well it reconstructs the original dataset. 

In [None]:
from sklearn.metrics import mean_squared_error
testing_error = []
training_error = []

for i in range(len(x_test)):
  testing_error.append(mean_squared_error(x_test[i], autoencodertf.getDecoded(x_test[i]).reshape(-1,1)))

for i in range(len(x_train)):
  training_error.append(mean_squared_error(x_train[i],autoencodertf.getDecoded(x_train[i]).reshape(-1,1))) 


In [None]:
print('Mean Squared Error on Testing set',np.mean(testing_error))
print('Mean Squared Error on Training set',np.mean(training_error))

In [None]:
#Blue: testing error, red: training error
histogram_data = plt.hist([testing_error, training_error], bins=50, color=['b','r'], label=['testing', 'training'])
plt.legend()

The errors on the training and testing sets look similar and low, so there is evidence that our autoencoder is doing a good job. 

In [None]:
#Creating the Latent Variables
autoenconderlv = []

for i in range(len(x)):
  autoenconderlv.append( autoencodertf.getEncoded(x[i])[0])

autoenconderlv = np.array(autoenconderlv)
print('Autoencoder latent variables shape: ', autoenconderlv.shape)


### Activity 3 (advanced level) 

Can you think of ways to improve the autoencoder? 

The Mean Squared Error was low in the previous example, but could we reduce it even more? 

Play around with the parameters and see if you can find a better set of latent variables. Start by exploring the encoded_dim and learning_rate parameters. Then, you can also try different [optimizers](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train) and [activation functions](https://www.tensorflow.org/api_docs/python/tf/nn).  

Save your best result for tomorrow's class! 




In [None]:
#ADD CODE HERE 

#np.savez_compressed('MyAmazingAutoencoder.npz',MyAmazingAutoencoder)
