***
<h2> <u>Goals of this notebook</u> </h2>

* So far, we looked at linear models and non-linear models for regression and classification on toy data.
* In this notebook, we consider **real-world datasets**!

*** 
<h2> <u>What am I supposed to do?</u> </h2>

* **The code in all the cells in this notebook is already written!**
* So sit back and relax! Simply go through the notebook, execute the cells and try to understand what is going on. Feel free to insert new code cells in between and print stuff in order to better understand what is going on.

***
***
<h2> <u>Import required modules</u> </h2>


In [None]:
import numpy as np
import matplotlib.pyplot as plt

<h2> <u>Mount Google drive folder</u> </h2>

In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/My Drive/ML_workshop

In [None]:
ls

***
<h2>High-dimensional data</h2>

Imaging data often yields high-dimensional features. Most of the time the number of dimensions is much higher than the number of samples. In these cases, a few difficulties arise: 
- it is difficult to visualize classification or regression models
- when the number of samples is lower than the number of dimensions, a model can fit the training data perfectly without learning anything that generalizes (a trivial solution)
- models need to use some sort of feature selection or regularization to overcome this challenge
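
The second point above can be illustrated with a minimal sketch on synthetic data (all names and sizes below are assumptions chosen for illustration, not part of the workshop data). Features and targets are pure random noise, yet an unregularized linear model with more dimensions than samples reproduces the training targets almost exactly while failing on new samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes: fewer samples than dimensions.
n_train, n_test, n_dims = 20, 20, 100

# Features and targets are independent random noise: there is nothing to learn.
X_train = rng.standard_normal((n_train, n_dims))
y_train = rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, n_dims))
y_test = rng.standard_normal(n_test)

# Minimum-norm least-squares fit of a linear model without regularization.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_rmse = np.sqrt(np.mean((X_train @ w - y_train) ** 2))
test_rmse = np.sqrt(np.mean((X_test @ w - y_test) ** 2))

print("Training RMSE: {:.2e}".format(train_rmse))  # essentially zero
print("Test RMSE: {:.2f}".format(test_rmse))       # far from zero
```

A near-zero training error therefore says little about a model in this regime, which is why the tasks below rely on regularized models and cross-validation.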

In this section, we will focus on two high-dimensional problems in neuroimaging. We will use measurements extracted from brain MRI: volumes of different anatomical structures and thickness of the cortical mantle at different locations.
***
***
<h3>Alzheimer's Disease classification</h3>

The first task is to classify subjects with Alzheimer's Disease (AD) and healthy elderly controls (CN) using cortical thickness maps. We will work directly with the cortical thickness values extracted from 290 individuals and aligned on the same reference frame.

In [None]:
# features are saved in a matrix
features = np.loadtxt('machine_learning/data/features_ad_classification.txt')
# labels are saved as a vector
labels = np.loadtxt('machine_learning/data/labels_ad_classification.txt')
# printing information on the dataset
print("Number of subjects: {}".format(features.shape[0]))
print("Number of features (thickness values): {}".format(features.shape[1]))
print("Number of AD cases: {}".format(np.sum(labels)))
print("Number of CN cases: {}".format(np.sum(1-labels)))

<h3> For this, we use a linear <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machine</a> model.</h3>

This model builds a classifier for automatically discriminating subjects with Alzheimer's disease from healthy elderly using the data we just loaded.

After training the model, we compute the prediction accuracy on the training set (using the entire dataset) and estimate the generalization accuracy using a technique known as 'cross validation'.

We use classification accuracy and the accuracy_score function as in the previous notebook. 

In [None]:
# import the required function to compute classification accuracy
from sklearn.metrics import accuracy_score
from sklearn import svm
svml = svm.SVC(kernel='linear')

# Computing prediction accuracy on the training set.
svml.fit(features, labels) # train on all the data
preds = svml.predict(features)
print("Prediction accuracy on the training set: {}\n".format(accuracy_score(labels,preds)))

# import the required function to perform 5-fold stratified cross-validation
# in stratified K-fold cross validation in each fold the ratio of the 
# number of different classes is the same as the entire dataset. 
from sklearn.model_selection import StratifiedKFold

# creating an object to create partitions for the 5 fold cross validation
numFolds = 5
skf = StratifiedKFold(n_splits=numFolds)

# creating a vector to hold accuracies of different folds: 
acc_vec = np.zeros(numFolds)
# in this for loop we go over different partitions. 
n = 0
for trainind, testind in skf.split(features, labels):
    # training the linear SVM using the training partition of the dataset.
    svml.fit(features[trainind,:], labels[trainind])
    
    # predictions in the test partition of each fold
    preds_cv = svml.predict(features[testind,:])
    
    # computing accuracy for the test partitions
    acc_vec[n] = accuracy_score(labels[testind], preds_cv)
    n += 1

print("Accuracies at different folds:")
print("=============================")
print("Linear SVM: {}".format(acc_vec))
print("\n")
print("Generalization accuracy estimates:")
print("=============================")
print("Linear SVM: {}".format(np.mean(acc_vec)))
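
The stratification described in the comments above can be checked on toy labels. This is a minimal sketch: the labels, their class ratio, and the `toy_` names are assumptions made up for illustration and do not touch the dataset loaded earlier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 30 positives (1) and 70 negatives (0).
toy_labels = np.array([1] * 30 + [0] * 70)
# The split only needs one row per sample, so a dummy feature matrix suffices.
toy_features = np.zeros((toy_labels.size, 1))

skf_toy = StratifiedKFold(n_splits=5)
test_fractions = []
for train_idx, test_idx in skf_toy.split(toy_features, toy_labels):
    # fraction of positives in the held-out partition of this fold
    test_fractions.append(np.mean(toy_labels[test_idx]))

print("Overall fraction of positives: {}".format(np.mean(toy_labels)))
print("Fraction of positives per test fold: {}".format(test_fractions))
```

Each held-out partition should show the same fraction of positives as the full label vector, which keeps the per-fold accuracies comparable when the classes are imbalanced.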


***
***
<h3>Age regression</h3>

In the second task, we will perform age regression using volumes of different anatomical structures. The underlying idea is that the brain changes as humans age: certain structures get larger and others get smaller.

Let us first read the dataset:

In [None]:
# features are saved in a matrix
# note that here we read a csv file with np.loadtxt - this is another alternative to reading csv files
features = np.loadtxt('machine_learning/data/features_age_regression.csv', delimiter=',').T
# labels are saved as a vector
labels = np.loadtxt('machine_learning/data/labels_age_regression.csv', delimiter=',')
# printing information on the dataset
print("Number of subjects: {}".format(features.shape[0]))
print("Number of features: {}".format(features.shape[1]))
print("Mean age in the dataset: {}".format(np.mean(labels)))
print("Min / Max age in the dataset: {}/{}".format(np.min(labels), np.max(labels)))

<h3> Here, we use a <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">LASSO</a> model.</h3>

This model builds a regressor for automatically predicting subjects' age from the volumes of anatomical structures, which we have read from file in the previous step. 
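
To give a rough idea of why LASSO suits high-dimensional data, the sketch below fits it to synthetic data in which only two of many features matter. The data, the `toy_` names, and the value `alpha=0.1` are assumptions chosen for illustration; the L1 penalty tends to set the coefficients of uninformative features exactly to zero, which acts as a built-in feature selection.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes: fewer samples than features.
n_samples, n_feats = 50, 200
toy_X = rng.standard_normal((n_samples, n_feats))
# Assumed ground truth: only the first two features influence the target.
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + 0.1 * rng.standard_normal(n_samples)

# alpha controls the strength of the L1 penalty (0.1 is an arbitrary choice here).
toy_lasso = linear_model.Lasso(alpha=0.1)
toy_lasso.fit(toy_X, toy_y)

n_nonzero = np.count_nonzero(toy_lasso.coef_)
print("Non-zero coefficients: {} out of {}".format(n_nonzero, n_feats))
print("Coefficients of the two informative features: {}".format(toy_lasso.coef_[:2]))
```

Larger values of `alpha` generally zero out more coefficients, at the cost of shrinking the remaining ones further toward zero.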

After training the model, we compute the prediction error on the training set (using the entire dataset) and estimate the generalization error using a technique known as 'cross validation'.

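We use the root-mean-square error (RMSE) to measure the prediction error, and additionally report Pearson's correlation coefficient between the true and predicted ages.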
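
As a small worked example of RMSE, and of the Pearson correlation that the next cell also reports, consider three hypothetical subjects (the ages and predictions below are made up for illustration). RMSE is the square root of the mean squared difference between true and predicted values, so it is expressed in the same unit as the target, here years.

```python
import numpy as np

# Hypothetical ages (in years) and model predictions for three subjects.
toy_ages = np.array([60.0, 70.0, 80.0])
toy_preds = np.array([62.0, 69.0, 77.0])

# RMSE: square the errors, average them, then take the square root.
toy_rmse = np.sqrt(np.mean((toy_ages - toy_preds) ** 2))

# Pearson's r: off-diagonal entry of the 2x2 correlation matrix.
toy_r = np.corrcoef(toy_ages, toy_preds)[0, 1]

print("RMSE: {:.3f} years".format(toy_rmse))
print("Pearson's r: {:.4f}".format(toy_r))
```

The two measures answer different questions: RMSE penalizes any offset between predictions and true ages, whereas the correlation only reflects how well the predictions follow the ordering and linear trend of the true ages.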

In [None]:
# import the required function to compute the mean squared error and the LASSO model
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
lasso = linear_model.Lasso()

# Computing prediction error on the training set. 
lasso.fit(features, labels) # train on all the data
preds = lasso.predict(features)
print("RMSE on the training set: {}\n".format(np.sqrt(mean_squared_error(labels,preds))))
print("Pearson's correlation coefficient on the training set: {}\n".format(np.corrcoef(labels,preds)[0,1]))

# import the required function to perform 5-fold cross-validation
# the labels are continuous ages rather than classes, so we use plain
# (non-stratified) K-fold cross validation here.
from sklearn.model_selection import KFold

# creating an object to create partitions for the 5 fold cross validation
numFolds = 5
skf = KFold(n_splits=numFolds)

# creating vectors to hold the RMSE and correlation of different folds:
rmse_vec = np.zeros(numFolds)
r_vec = np.zeros(numFolds)
# in this for loop we go over different partitions. 
n = 0
for trainind, testind in skf.split(features, labels):
    # training the LASSO model using the training partition of the dataset.
    lasso.fit(features[trainind,:], labels[trainind])
    
    # predictions in the test partition of each fold
    preds_cv = lasso.predict(features[testind,:])
    
    # computing RMSE and Pearson's r for the test partitions
    rmse_vec[n] = np.sqrt(mean_squared_error(labels[testind], preds_cv))
    r_vec[n] = np.corrcoef(labels[testind],preds_cv)[0,1]
                          
    n += 1

print("Performance at different folds:")
print("=============================")
print("LASSO - RMSE: {}".format(rmse_vec))
print("LASSO - r: {}".format(r_vec))
print("\n")
print("Generalization performance estimates:")
print("=============================")
print("LASSO - RMSE: {}".format(np.mean(rmse_vec)))
print("LASSO - r: {}".format(np.mean(r_vec)))