# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Dementia prediction using SVM

## Problem Statement

Prediction of Dementia using an SVM model on brain MRI features

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform data exploration, preprocessing and visualization
* implement SVM Classifier on the data
* explore various parameters of SVM classifier and implement OneVsOne classifier
* calculate the metrics and plot the roc_curve

## Information

**About Dementia**

Dementia is a general term for loss of memory and other mental abilities severe enough to interfere with daily life. It is caused by physical changes in the brain. Alzheimer's is the most common type of dementia, but there are many kinds.

**Brain Imaging via magnetic resonance imaging (MRI) and Machine Learning**

* MRI is used for the evaluation of patients with suspected Alzheimer's disease
* MRI detects both local and generalized shrinkage of brain tissue
* MRI features predict the rate of decline of AD and may guide therapy in the future
* Applying machine learning to MRI features could help predict, automatically and accurately, a patient's progression from mild cognitive impairment to dementia

To understand the basics of the MRI technique, refer to [this page](https://case.edu/med/neurology/NR/MRI%20Basics.htm).

## Dataset

The dataset chosen for this mini-project is [OASIS - Longitudinal brain MRI Dataset](https://www.oasis-brains.org/). This dataset consists of a longitudinal MRI collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.

**Dataset fields:**

* Subject ID - Subject Identification
* MRI ID - MRI Exam Identification
* Group - Target variable with 3 labels ('Nondemented', 'Demented', 'Converted')
* Visit - Visit order
* MR Delay - MR Delay Time (Contrast)
* M/F - Male or Female
* Hand - Handedness; the only value is 'R' because all subjects are right-handed
* MMSE - Mini-Mental State Examination score (range is from 0 = worst to 30 = best)
* CDR - Clinical Dementia Rating (0 = no dementia, 0.5 = very mild AD, 1 = mild AD, 2 = moderate AD)
* Derived anatomic volumes:
    * eTIV - Estimated total intracranial volume, mm³
    * nWBV - Normalized whole-brain volume, expressed as a percent of all voxels in the atlas-masked image that are labeled as gray or white matter by the automated tissue segmentation process
    * ASF - Atlas scaling factor (unitless). A computed scaling factor that transforms native-space brain and skull to the atlas target (i.e., the determinant of the transform matrix)

To learn more about building a machine learning model to predict dementia using SVM, refer to [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7408873/).

## Grading = 10 Points

### Download the dataset

In [None]:
!wget https://cdn.iisc.talentsprint.com/CDS/MiniProjects/oasis_longitudinal.csv

### Import required packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

### Load the dataset

In [None]:
data = pd.read_csv("oasis_longitudinal.csv")
data.head(2)

In [None]:
data.shape

In [None]:
data.dtypes

### Pre-processing and Data Engineering

#### Remove unwanted columns

In [None]:
data.drop(['MRI ID','Hand','Subject ID'],axis=1, inplace=True)

In [None]:
data.head(2)

#### Encode columns into numeric

In [None]:
le = LabelEncoder()
data['M/F'] = le.fit_transform(data['M/F'])
data.head(2)

#### Handle the null values by removing or replacing

In [None]:
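# numeric_only=True skips the string-valued 'Group' column when computing column means
data.fillna(data.mean(numeric_only=True), inplace=True)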
data.isna().sum()
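
Mean imputation is one option among several. As a minimal sketch, assuming a toy frame with made-up values (the column names `MMSE` and `nWBV` are borrowed from the field list, but the numbers are not from the dataset), the median can be used instead, which is less sensitive to outliers:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are illustrative only, not taken from OASIS.
toy = pd.DataFrame({
    "Group": ["Demented", "Nondemented", "Demented", "Nondemented"],
    "MMSE": [27.0, 30.0, np.nan, 29.0],
    "nWBV": [0.70, np.nan, 0.74, 0.72],
})

# Count missing values per column before imputing.
missing_before = toy.isna().sum()

# Fill numeric gaps with the column median; the string column is left untouched.
filled = toy.fillna(toy.median(numeric_only=True))
missing_after = filled.isna().sum()
```

Whether mean or median is preferable depends on how skewed each column is, so it is worth checking the histograms before deciding.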

#### Identify feature and target and split it into train test

In [None]:
X = data.drop('Group',axis=1)
y = data['Group']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape
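
With a 10% test split and an uncommon class such as `Converted`, a purely random split can leave very few (or no) samples of that class in the test set. The sketch below, on synthetic labels whose proportions are assumed for illustration, shows how the `stratify` argument of `train_test_split` preserves class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels; the 50/40/10 class mix is an assumption.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["Nondemented"] * 50 + ["Demented"] * 40 + ["Converted"] * 10)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(k): int(v) for k, v in zip(labels, counts)}
```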

### EDA & Visualization

#### Plot the distribution of all the variables using histogram

In [None]:
data.hist(bins=30, figsize=(20,15))
plt.show()

#### Visualize the frequency of Age

In [None]:
ax = sns.countplot(x='Age', data=data)
ax.figure.set_size_inches(18.5, 5)

#### How many people have Alzheimer's? Visualize with an appropriate plot

Each person visits two or more times, so counting every row would count the same subject repeatedly. Extract the data for a single visit per subject and plot it.

**Hint**: Visit = 1

In [None]:
sns.set_style("whitegrid")
ex_df = data.loc[data['Visit'] == 1]
sns.countplot(x='Group', data=ex_df)
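
The same first-visit filter can be checked numerically. This is a small sketch on a made-up visit table (the subject IDs and groups are hypothetical), showing that filtering on `Visit == 1` yields one row per subject:

```python
import pandas as pd

# Hypothetical visit log: four subjects, each scanned two or three times.
visits = pd.DataFrame({
    "Subject ID": ["S1", "S1", "S2", "S2", "S2", "S3", "S3", "S4", "S4"],
    "Visit":      [1, 2, 1, 2, 3, 1, 2, 1, 2],
    "Group": ["Nondemented", "Nondemented", "Demented", "Demented", "Demented",
              "Nondemented", "Nondemented", "Converted", "Converted"],
})

# Keep only the first visit so each subject is counted exactly once.
first_visit = visits.loc[visits["Visit"] == 1]
group_counts = first_visit["Group"].value_counts()
```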

#### Calculate the correlation of features and plot the heatmap

In [None]:
# numeric_only=True excludes the string-valued 'Group' column from the correlation
corr = data.corr(numeric_only=True)
_, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(240, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap, square=True, cbar_kws={'shrink': .9}, ax=ax, annot=True, annot_kws={'fontsize': 12})

### Model training and evaluation

**Hint:** SVM model from sklearn

In [None]:
# SVC model
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

test_acc = svm.score(X_test, y_test)
train_acc = svm.score(X_train, y_train)
print("Train accuracy is: {} \nTest accuracy is: {}".format(train_acc, test_acc))
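
SVMs are sensitive to feature scale, and the columns here span very different ranges (for example, eTIV is measured in mm³ while nWBV is a fraction). A possible refinement, sketched on synthetic data with one deliberately inflated feature (the data and the factor of 1000 are assumptions), is to standardize inside a pipeline so the scaler is fitted on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the MRI features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
X_syn[:, 0] *= 1000.0  # mimic a column measured in much larger units

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=42
)

# The scaler learns its mean and variance from the training data only.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(X_tr, y_tr)
scaled_acc = scaled_svm.score(X_te, y_te)
```

Wrapping the scaler in the pipeline also keeps test data out of the scaling statistics, which matters once cross-validation is used.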

In [None]:
# Checking the misclassifications
model1_predictions = svm.predict(data.drop('Group',axis=1))
model1_misclassified  = data[data['Group']!=model1_predictions]
len(model1_misclassified)

#### Support vectors of the model

* Find the samples of the dataset which are the support vectors of the model 

In [None]:
support_vectors = pd.DataFrame(svm.support_vectors_,columns=X_train.columns)
support_vectors
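
Besides `support_vectors_`, a fitted `SVC` exposes `support_` (row indices of the support vectors in the training data) and `n_support_` (count per class). The sketch below, on synthetic two-class blobs, illustrates the usual characterization: support vectors are the training points that lie on or inside the margin, that is, with y·f(x) ≤ 1 up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters standing in for a binary problem.
X_b, y_b = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X_b, y_b)

# Indices of the training rows that became support vectors.
sv_idx = clf.support_

# Functional margin y * f(x), with labels mapped from {0, 1} to {-1, +1}.
signed_labels = 2 * y_b - 1
margins = signed_labels * clf.decision_function(X_b)

# Support vectors sit on or inside the margin (value approximately <= 1).
sv_margins = margins[sv_idx]
```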

#### Confusion matrix for multi-class classification

* Predict the test and plot the confusion matrix

In [None]:
classNames = ['Nondemented','Demented','Converted']
# Pass labels explicitly so rows/columns match classNames (the default order is alphabetical)
multi_cm = metrics.confusion_matrix(y_test, svm.predict(X_test), labels=classNames)
plt.imshow(multi_cm, cmap=plt.cm.Wistia_r)
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames)
plt.yticks(tick_marks, classNames)
# Annotate each cell with its count
for i in range(len(classNames)):
    for j in range(len(classNames)):
        plt.text(j, i, str(multi_cm[i][j]))
plt.show()
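
The row and column order of `confusion_matrix` is easy to get wrong with string labels, because the default order is sorted (alphabetical). A small sketch with hand-written labels (the predictions are made up) shows the difference between the default order and an explicit `labels` argument:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for five samples.
y_true = ["Nondemented", "Demented", "Converted", "Demented", "Nondemented"]
y_pred = ["Nondemented", "Demented", "Demented", "Demented", "Converted"]

# Default: rows/columns follow sorted order (Converted, Demented, Nondemented).
default_cm = confusion_matrix(y_true, y_pred)

# Explicit order: rows/columns follow the list passed as labels.
order = ["Nondemented", "Demented", "Converted"]
ordered_cm = confusion_matrix(y_true, y_pred, labels=order)
```

Tick labels on the plot should always use the same order as the matrix they annotate.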

#### One VS Rest Classifier

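Also known as one-vs-all, this strategy fits one classifier per class; each classifier separates its class from all the others. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability: since each class is represented by exactly one classifier, inspecting that classifier reveals what the model has learned about the class. OneVsRestClassifier can also be used for multilabel classification.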

* Fit the OneVsRestClassifier on the data and find the accuracy

Hint: [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)

In [None]:
from sklearn.multiclass import OneVsRestClassifier
clf_OneVsRest = OneVsRestClassifier(SVC(kernel='linear')).fit(X_train, y_train)
clf_OneVsRest.score(X_test, y_test)

#### One VS One Classifier

This strategy consists of fitting one classifier per pair of classes. At prediction time, the class that receives the most votes is selected.

* Fit the OneVsOneClassifier on the data and find the accuracy

Hint: [OneVsOneClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html)

In [None]:
from sklearn.multiclass import OneVsOneClassifier
clf_OneVsOne = OneVsOneClassifier(SVC(kernel='linear',random_state=0)).fit(X_train, y_train)
clf_OneVsOne.score(X_test, y_test)
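
The two strategies differ in how many binary classifiers they train: one-vs-rest needs n_classes, while one-vs-one needs n_classes × (n_classes − 1) / 2, each fitted on only the samples of its two classes. A sketch on synthetic four-class data (the dataset is an assumption, chosen so the two counts differ) makes this visible through the `estimators_` attribute:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 4-class problem; with 3 classes both strategies would train 3 models.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X_syn, y_syn)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_syn, y_syn)

n_ovr = len(ovr.estimators_)  # one per class
n_ovo = len(ovo.estimators_)  # one per pair of classes
```

For the three-class dementia problem both counts equal 3, so any difference in cost comes from the size of each training subset rather than the number of models.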

#### Make it binary classification

As stated in dataset description, 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit. Change `Converted` label into `Demented`.

**Note:** For two-class classification, encode the labels as numbers so that the roc_curve can be computed from the predictions.

In [None]:
# Merge the 'Converted' label into 'Demented'
data['Group'] = data['Group'].replace(['Converted'], ['Demented'])
# LabelEncoder sorts labels alphabetically: 'Demented' -> 0, 'Nondemented' -> 1
data['Group'] = LabelEncoder().fit_transform(data['Group'])
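
Because `LabelEncoder` assigns codes in sorted order, it helps to inspect the mapping before interpreting any metric that depends on which class is "positive". A minimal sketch on a hand-written label list:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["Nondemented", "Demented", "Demented", "Nondemented"])

# classes_ holds the original labels; the position of each label is its code.
mapping = {str(label): int(code)
           for label, code in zip(enc.classes_, enc.transform(enc.classes_))}
```

Here `Demented` becomes 0 and `Nondemented` becomes 1, so scikit-learn's default positive class (label 1) is the non-demented group.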

In [None]:
# Split the data, which now has 2 labels
X1 = data.drop('Group',axis=1)
y1 = data['Group']
xtrain, xtest, ytrain, ytest = train_test_split(X1, y1, test_size=0.10, random_state=42)

In [None]:
ytrain.value_counts()

In [None]:
svm_binary_model = SVC(kernel="linear")
svm_binary_model.fit(xtrain, ytrain)
predicted = svm_binary_model.predict(xtest)
svm_binary_model.score(xtest, ytest), svm_binary_model.score(xtrain, ytrain)

In [None]:
# predictions
model2_predictions = svm_binary_model.predict(xtest)
model2_predictions

In [None]:
# comparing 3-class with 2-class predictions
# ('Converted' was merged into 'Demented', encoded as 0; 'Nondemented' is encoded as 1)
np.array([1 if i=='Nondemented' else 0 for i in svm.predict(X_test)]) == model2_predictions

### Classification report and metrics

#### Confusion matrix

A confusion matrix describes the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

In [None]:
cm = metrics.confusion_matrix(ytest, predicted)

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
# Order follows the encoded labels: 0 = 'Demented', 1 = 'Nondemented'
classNames = ['Demented','Nondemented']
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames)
plt.yticks(tick_marks, classNames)
# TN/FP/FN/TP treat label 1 ('Nondemented') as the positive class
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
plt.show()
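
Precision, recall and F1 can be read off the confusion matrix, or computed directly. The sketch below uses made-up label vectors encoded as in this notebook (0 = Demented, 1 = Nondemented) and passes `pos_label=0` to treat the demented group as the positive class, which is a choice made here for illustration:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Made-up labels: 0 = Demented, 1 = Nondemented.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Treat 'Demented' (0) as the positive class.
precision_dem = precision_score(y_true, y_pred, pos_label=0)
recall_dem = recall_score(y_true, y_pred, pos_label=0)

# Per-class precision, recall, F1 and support in one text table.
report = classification_report(
    y_true, y_pred, target_names=["Demented", "Nondemented"]
)
print(report)
```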

#### Plot the ROC Curve

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.

In [None]:
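# roc_curve expects (y_true, y_score): use the binary test labels and continuous decision scores
scores = svm_binary_model.decision_function(xtest)
fpr, tpr, thresholds = metrics.roc_curve(ytest, scores)
roc_auc = metrics.auc(fpr, tpr)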

plt.figure()
plt.plot(fpr, tpr,  lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
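
An ROC curve is traced by sweeping a threshold over continuous scores. If hard 0/1 predictions are passed instead, there is only one usable threshold and the "curve" collapses to at most three points. The following sketch on synthetic data (an assumption, standing in for the MRI features) contrasts the two inputs:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)
clf = SVC(kernel="linear").fit(X_tr, y_tr)

# Continuous scores: signed distance to the separating hyperplane.
scores = clf.decision_function(X_te)
fpr_s, tpr_s, _ = roc_curve(y_te, scores)
auc_from_scores = auc(fpr_s, tpr_s)

# Hard labels: only one operating point between (0, 0) and (1, 1).
fpr_h, tpr_h, _ = roc_curve(y_te, clf.predict(X_te))
auc_from_labels = auc(fpr_h, tpr_h)
```

`roc_auc_score(y_te, scores)` gives the same area as `auc(fpr_s, tpr_s)` in a single call.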

### Choice of C for SVM

Experiment with the given C values and plot the ROC curve for each.

In [None]:
import math

c_val = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
num_cols = 3
num_rows = math.ceil(len(c_val) / num_cols)

# create a single figure with one subplot per C value
fig, axes = plt.subplots(num_rows, num_cols, sharey=True)
fig.set_size_inches(num_cols*5, num_rows*5)

for i, c in enumerate(c_val):
    # train on the binary split, since roc_curve expects two-class labels
    svm_model = SVC(kernel='linear', C=c).fit(xtrain, ytrain)
    # continuous decision scores give a full ROC curve rather than a single point
    y_scores = svm_model.decision_function(xtest)
    fpr, tpr, _ = metrics.roc_curve(ytest, y_scores)
    auc_score = metrics.auc(fpr, tpr)
    ax = axes[i // num_cols, i % num_cols]
    ax.plot(fpr, tpr, label='AUC = {:.3f}'.format(auc_score))
    ax.legend(loc='lower right')
    ax.plot([0,1],[0,1],'r--')
    ax.set_title('C = {}'.format(c))
    ax.set_xlim([-0.1,1.1])
    ax.set_ylim([-0.1,1.1])
    ax.set_ylabel('True Positive Rate')
    ax.set_xlabel('False Positive Rate')

# hide the unused subplot(s) in the last row
for ax in axes.flat[len(c_val):]:
    ax.set_visible(False)

plt.show()
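
Comparing C values on a single small test split can be noisy. One alternative, sketched here on synthetic data with an assumed grid of C values, is to score each C by cross-validated AUC using `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

# Pipeline step names are the lowercased class names, hence "svc__C".
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
c_grid = {"svc__C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation, scored by area under the ROC curve.
grid = GridSearchCV(pipe, c_grid, scoring="roc_auc", cv=5)
grid.fit(X_syn, y_syn)

best_c = grid.best_params_["svc__C"]
best_auc = grid.best_score_
```

Small C values allow more margin violations (stronger regularization), while large C values penalize them heavily and can overfit.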

### Report Analysis

* Compare the performance of the model with different kernels and kernel parameters.
* Discuss the impact of the parameters C and gamma on performance.
* Comment on the computational cost of using one-vs-one and one-vs-all (one-vs-rest) to solve multi-class classification with binary classifiers.
* When is a sample/record in the data called a support vector?
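
As a starting point for the kernel and gamma comparison, the loop below is a rough sketch on a synthetic, non-linearly separable dataset (`make_moons`, an assumption made so that the kernels behave differently); the kernel list and gamma values are illustrative choices, not prescribed ones:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=0)

# Mean 5-fold accuracy for each kernel at default settings.
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    kernel_scores[kernel] = cross_val_score(model, X_moons, y_moons, cv=5).mean()

# Effect of gamma on the RBF kernel: small = smoother, large = more local.
gamma_scores = {}
for gamma in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma))
    gamma_scores[gamma] = cross_val_score(model, X_moons, y_moons, cv=5).mean()
```

The same loops can be pointed at `xtrain` and `ytrain` from the binary split to produce the numbers for the report.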