# Alzheimer’s Diagnostic with OASIS
A dataset of MRI brain scans of Alzheimer's patients and healthy controls.

### Content
<!-- Create content from the below markdowns heading  -->
- [Abstract](#Abstract)
- [Introduction](#Introduction)
- [Materials & Methods](#Materials-&-Methods)
- [Importing](#Importing)
- [Load Data](#Load-Data)
- [Processing Image Data](#Processing-Image-Data)
- [Visualizing MRI Data](#Visualizing-MRI-Data)
- [Data Preprocessing](#Data-Preprocessing)
- [Data Analysis](#Data-Analysis)
- [Metrics](#Metrics)
- [Post Analysis](#Post-Analysis)
- [Results](#results)
- [Discussion](#discussion)
- [Conclusions](#conclusions)
- [References](#References)


### Abstract
`Alzheimer's disease` is a progressive disease of the nervous system that impairs memory and thinking. It is not currently considered curable, but if it is detected at an early stage its progression can be slowed.

The `Open Access Series of Imaging Studies (OASIS)` provides brain data that can be used for Alzheimer's disease detection. It includes `MRI (Magnetic Resonance Imaging)` scans of the brain, which can help detect structural changes in the brain of a person diagnosed with Alzheimer's disease.

The aim of this project is to detect Alzheimer's disease at an early stage using the OASIS brain data set. The project implements `machine learning techniques` and explores different algorithms and methods to accurately `detect the disease` from the given data. The model is intended to detect structural changes in specific parts of the `brain and abnormalities` that are associated with Alzheimer's disease.

The results of the project have the potential to improve the development of diagnostic tools for the disease and to deepen the understanding of it.

### Introduction
An estimated 40 million people, mostly older than 60 years, have `dementia` worldwide, and this figure is projected to double every 20 years, until at least 2050 \cite{SCHELTENS2016505}. `Dementia of Alzheimer's Type (DAT)` is the most common form of dementia, affecting 1 in 9 people over the age of 65 years and as many as 1 in 3 people over the age of 85 \cite{popuri2020using}. Thus, it is a major concern among modern health issues.

OASIS provides brain MRI scans together with `neuroimaging and related clinical data`, which are publicly available for research and analysis. The data help in understanding the brain and in developing treatment approaches for various brain-related diseases, including Alzheimer's disease.

Currently, the diagnosis of Alzheimer's disease relies on a combination of `clinical evaluations, cognitive assessments, and neuroimaging` techniques. However, the accuracy and reliability of existing diagnostic methods can be limited, especially in the early stages of the disease.

Diagnosing the disease with machine learning models trained on the OASIS data set could be a more effective approach, as the model classifies whether a person's brain is normal or shows `some patterns` that reflect the presence of Alzheimer's disease.

### Materials & Methods
The data set consists of `MRI scans of 150 individuals` aged 60 to 96 years, all scanned in a similar environment. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions \cite{10.1162/jocn.2009.21407}. The data set contains brain images and demographic data for each subject scanned.

 

We implemented a `Convolutional Neural Network (CNN)` for image recognition and applied transfer learning to the pre-trained `VGG16 (Visual Geometry Group 16)` CNN model. VGG16 has 16 weight layers; its convolutional base, which is the part used here, has 14,714,688 parameters. The model was pre-trained on ImageNet, which contains more than a million colour images in 1000 categories. The layers of the VGG16 base are frozen so that their weights are not updated during further training; the original fully connected top is removed, and the input shape is set to match the current input data. The output of the VGG16 base is flattened and passed to a fully connected layer with 64 neurons and Leaky ReLU (alpha = 0.1) as the activation function. The final output layer consists of three neurons (one for each class) with softmax as the activation function.
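
The parameter figures in this description can be checked by hand. The following is a minimal sketch, assuming the standard VGG16 convolutional layout (thirteen 3x3 convolutions and five 2x2 max-poolings) and the 256 x 256 x 3 input used later in this notebook; the helper names are illustrative and not part of any library.

```python
# (in_channels, out_channels) of the thirteen 3x3 convolutions in VGG16
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(c_in, c_out, kernel=3):
    # Kernel weights plus one bias per output channel
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out


base_params = sum(conv_params(c_in, c_out) for c_in, c_out in VGG16_CONVS)

# Five 2x2 max-poolings halve each spatial side five times: 256 -> 8
side = 256 // 2 ** 5
flat_features = side * side * 512

# New head: Dense(64) followed by Dense(3)
head_params = dense_params(flat_features, 64) + dense_params(64, 3)

print(f"Frozen base parameters:    {base_params:,}")
print(f"Flattened feature length:  {flat_features:,}")
print(f"Trainable head parameters: {head_params:,}")
```

Under these assumptions only the small head is trained, while the convolutional base keeps its ImageNet weights.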


### Importing
#### Importing Libraries

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from IPython.display import HTML
import seaborn as sns
from matplotlib.animation import FuncAnimation
import psutil
import gc
import pandas as pd
from tqdm import tqdm
import re
import nibabel as nib
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# function to print bold text
def bold(text):
    return f"\033[1m{text}\033[0m"
    
# It will fix the file path to the correct path
# def clearFilePath(path):
#     file_name = re.sub(".*" + "/oasis2/", '', path)
#     file_name = re.sub('/', '_', file_name)
#     file_name = re.sub('mpr-', '', file_name)
#     file_name = re.sub('.nifti.img', '', file_name)
#     return file_name
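# Build a scan name such as "OAS2_0001_MR1_1" from the last two path components
# (visit folder and "mpr-<n>.nifti.img"); works with both "/" and "\" separators
def clearFilePath(path):
    file_name = os.path.normpath(path).split(os.sep)[-2:]
    return file_name[0]+'_'+file_name[1].split('-')[-1][0]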
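
As an illustration of the naming scheme, here is a small standalone sketch. It assumes a layout in which each visit folder (for example `OAS2_0001_MR1`) directly contains files named `mpr-<n>.nifti.img`; the function name `scan_name_from_path` and the example paths are hypothetical.

```python
import re


def scan_name_from_path(path):
    # Split on either separator so Windows and POSIX paths behave the same
    parts = re.split(r"[\\/]", path)
    visit_dir, file_name = parts[-2], parts[-1]
    # "mpr-1.nifti.img" -> "1.nifti.img" -> "1"
    scan_no = file_name.split("-")[-1][0]
    return f"{visit_dir}_{scan_no}"


posix_name = scan_name_from_path(
    "kaggle/input/oasis2/OAS2_0001_MR1/mpr-1.nifti.img")
windows_name = scan_name_from_path(
    r"C:\data\oasis2\OAS2_0002_MR2\mpr-3.nifti.img")
print(posix_name, windows_name)
```

Note that taking only the first character of the scan number assumes fewer than ten scans per visit.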

### Load Data
#### Importing excel data

In [None]:
df_demographics_input = pd.read_excel('oasis_longitudinal_demographics.xlsx') 
df_demographics_input


### Processing Image Data
This code loads the MRI "nifti.img" files from a specified directory, extracts the image data, and removes the scans whose demographics rows contain missing values.
It uses the `nibabel` library to load the MRI files and extract the relevant information.

In [None]:

# rootdir = '../input/oasis2'
rootdir = 'kaggle/input/oasis2'


mri_patients_scans_name = []
mri_ignored_file_names = []
mri_images = []
mri_images_data = []

# Count the total number of files in the root directory
file_count = sum(len(files) for _, _, files in os.walk(rootdir))

print(f'Found {file_count} files in "{rootdir}" subdirectories\n')
print('Loading MRI "nifti.img" files:')

# Display a progress bar using tqdm
with tqdm(total=file_count) as pbar:
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            filepath = os.path.join(subdir, file)

            if filepath.endswith("nifti.img"):
                try:
                    # Load the MRI image using nibabel
                    img = nib.load(filepath)
                    mri_images.append(img)
                    mri_patients_scans_name.append(clearFilePath(filepath))
                    mri_images_data.append(img.get_fdata())
                except nib.filebasedimages.ImageFileError as e:
                    # If the file type is not recognized, ignore the file and print a message
                    mri_ignored_file_names.append(filepath)
                    print(
                        f'File type not recognized - ignoring "{filepath}" file')

            pbar.update(1)

print(
    f'\nFound and loaded {len(mri_patients_scans_name)} MRI "nifti.img" files from "{rootdir}" subdirectories')
print(f'Ignored:')
print(*mri_ignored_file_names, sep=' ')

# Get unique patient visit names from MRI file names
mri_patients_visits_names = np.unique(
    [re.sub(r'_\d\Z', '', i) for i in mri_patients_scans_name])

# Filter the demographics dataframe based on the MRI IDs
df_demographics = df_demographics_input[df_demographics_input['MRI ID'].isin(
    mri_patients_visits_names)]

print("Before cleaning data from none values from demographics:")
print(
    f'Number of patients visits: {len(np.unique(mri_patients_visits_names))}')
print(f'Number of patients scans: {len(mri_patients_scans_name)}')
print(
    f'Number of slices of MRI scans (unique values): {np.unique([arr.shape[2] for arr in mri_images_data])}')

# Drop rows in demographics dataframe with None values
df_drop = df_demographics[df_demographics.isna().any(axis=1)]
list_drop = df_drop['MRI ID'].tolist()
print(
    f'None value rows in demographics data to drop: {df_demographics.isnull().any(axis=1).sum()}')
df_demographics = df_demographics.dropna().reset_index(drop=True)

# Remove corresponding MRI files and names based on dropped rows
fname_drop = [fname for fname in mri_patients_scans_name if re.sub(
    r'_\d\Z', '', fname) in list_drop]
fname_drop_id = [i for i in range(
    len(mri_patients_scans_name)) if mri_patients_scans_name[i] in fname_drop]

for i in reversed(fname_drop_id):
    del mri_images[i]
    del mri_patients_scans_name[i]
    del mri_images_data[i]

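# Recompute the visit names so that the counts reflect the cleaned data
mri_patients_visits_names = np.unique(
    [re.sub(r'_\d\Z', '', i) for i in mri_patients_scans_name])

print("After cleaning data from none values from demographics:")
print(f"Number of removed MRI files: {len(fname_drop)}")
print(f"Number of patients visits: {len(mri_patients_visits_names)}")
print(f"Number of patients scans: {len(mri_patients_scans_name)}")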
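
The cleaning step pairs each scan with its visit by stripping the trailing scan number from the scan name. A minimal standalone sketch of that matching logic follows; the scan names and the set of visits with missing demographics are made-up values for illustration only.

```python
import re

# Hypothetical scan names in the "<visit id>_<scan number>" format
scan_names = [
    "OAS2_0001_MR1_1", "OAS2_0001_MR1_2",
    "OAS2_0002_MR1_1", "OAS2_0002_MR2_1",
]
# Hypothetical visits whose demographics row has a missing value
visits_with_missing = {"OAS2_0002_MR1"}


def visit_id(scan_name):
    # Drop the final "_<digit>" to get the visit (MRI ID)
    return re.sub(r"_\d\Z", "", scan_name)


kept = [name for name in scan_names
        if visit_id(name) not in visits_with_missing]
remaining_visits = sorted({visit_id(name) for name in kept})

print(kept)
print(remaining_visits)
```

Recomputing the visit list from the kept scans, as in the last step, is what makes the "after cleaning" counts differ from the "before" counts.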


#### Visualizing MRI Data
This code segment performs visualizations of MRI slices using the matplotlib.pyplot library. It demonstrates the visualization of different slices from an MRI image.

In [None]:
i = np.transpose(mri_images_data[0], (1, 0, 2, 3))
# Transpose the MRI image data to rearrange the dimensions for visualization

plt.figure(figsize=(15, 5))
# Create a figure with a size of 15x5 inches

plt.subplot(131)
# Define the first subplot in a 1x3 grid

plt.imshow(np.transpose(
    mri_images_data[0][:, :, 75], (1, 0, 2)), origin='lower')
# Display the transposed MRI image slice at index 75
# Adjust the dimensions for proper display using np.transpose()
# Set the origin of the image to the lower-left corner

plt.subplot(132)
# Define the second subplot in the grid

plt.imshow(i[120, :, :], origin='lower')
# Display a different slice of the transposed MRI image (at index 120)

plt.subplot(133)
# Define the third subplot in the grid

plt.imshow(i[:, 70, :], origin='lower')
# Display another slice of the transposed MRI image (at index 70)

plt.show()
# Show the figure with the three subplots and MRI slice visualizations


In [None]:
image, s = 40, 75
# Define the image and slice indices to visualize

img_data = mri_images_data[image]
# Retrieve the MRI image data for the specified image index

img_data = np.transpose(img_data, (1, 0, 2, 3))
# Transpose the MRI image data to rearrange the dimensions for visualization

mid_slice_x = img_data[:, :, s]
# Extract the slice at index 's' from the transposed MRI image data

mri_file_name = mri_patients_scans_name[image]
# Retrieve the MRI file name corresponding to the specified image index

patient_id = list(filter(None, re.split('_|MR', mri_file_name)))[1:]
# Extract the patient ID from the MRI file name using regex

patient_data = df_demographics[df_demographics['MRI ID'] == re.sub(
    r'_\d\Z', '', mri_file_name)]
# Retrieve the demographics data for the visit obtained from the MRI file name

plt.title(
    f"MRI ID: {patient_id[0]} - visit: {patient_id[1]} - scan: {patient_id[2]}\nGroup: {patient_data.iloc[0]['Group']}\nSlice {s}")
# Set the title of the plot with relevant information, such as MRI ID, visit, scan, group, and slice index

plt.imshow(mid_slice_x, cmap='gray', origin='lower')
# Display the extracted MRI slice using a grayscale colormap and with the origin set to the lower-left corner

plt.colorbar(label='Signal intensity')
# Add a colorbar to the plot with the label 'Signal intensity'

plt.show()
# Show the plot with the MRI slice visualization

patient_data
# Display the demographics data for the corresponding patient ID


In [None]:
slice_no = [120, 70, 70]
# Define the slice numbers to visualize in each dimension

fig, ax = plt.subplots(ncols=3, figsize=(15, 5))
# Create a figure with three subplots arranged in a single row

# Axis names below refer to the axes of the transposed `img_data` array;
# imshow puts the rows of a slice on the y-axis and its columns on the x-axis
ax[0].imshow(img_data[:, :, slice_no[2]], origin='lower', cmap='gray')
ax[0].set_xlabel('Second dim voxel coords.', fontsize=12)
ax[0].set_ylabel('First dim voxel coords', fontsize=12)
ax[0].set_title('Slice along third dimension', fontsize=15)
# Display the slice taken at a fixed third-dimension index in the first subplot
# Set the x-axis and y-axis labels
# Set the title for the subplot indicating the sliced dimension

ax[1].imshow(img_data[slice_no[0], :, :], origin='lower', cmap='gray')
ax[1].set_xlabel('Third dim voxel coords.', fontsize=12)
ax[1].set_ylabel('Second dim voxel coords', fontsize=12)
ax[1].set_title('Slice along first dimension', fontsize=15)
# Display the slice taken at a fixed first-dimension index in the second subplot
# Set the x-axis and y-axis labels
# Set the title for the subplot indicating the sliced dimension

ax[2].imshow(img_data[:, slice_no[1], :], origin='lower', cmap='gray')
ax[2].set_xlabel('Third dim voxel coords.', fontsize=12)
ax[2].set_ylabel('First dim voxel coords', fontsize=12)
ax[2].set_title('Slice along second dimension', fontsize=15)
# Display the slice taken at a fixed second-dimension index in the third subplot
# Set the x-axis and y-axis labels
# Set the title for the subplot indicating the sliced dimension

fig.suptitle(
    f"MRI ID: {patient_id[0]} - visit: {patient_id[1]} - scan: {patient_id[2]}\nGroup: {patient_data.iloc[0]['Group']}\nSlices: {slice_no}", fontsize=15)
# Set the super-title of the figure with relevant information, such as MRI ID, visit, scan, group, and the selected slice numbers

fig.tight_layout()
# Adjust the subplots layout to prevent overlapping


In [None]:

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_demographics_input[df_demographics_input['Group']
                != 'Converted'], y='eTIV', x='Age', hue='Group')
plt.title("eTIV vs Age")


In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_demographics_input[df_demographics_input['Group']
                != 'Converted'], y='nWBV', x='Age', hue='Group')
plt.title("nWBV vs Age")


In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_demographics_input['Group'])


In [None]:
mri_images_data[0].shape


In [None]:
# Retrieve the patient details from the demographics data
def get_patient_detail(index):
    mri_file_name = mri_patients_scans_name[index]
    patient_data = df_demographics_input[df_demographics_input['MRI ID'] == re.sub(
        r'_\d\Z', '', mri_file_name)]
    return patient_data


### Data Preprocessing
This section encodes the categorical labels using the `LabelEncoder` class.

In addition, only three slices are selected from each 3D image to reduce the computational time of model training. The three slices are stacked as channels because the pre-trained VGG16 weights used for transfer learning expect 3-channel images.


In [None]:
from tensorflow.keras.utils import to_categorical

cnn_y = []
# Initialize an empty list to store the labels for CNN classification

for i in range(len(mri_images_data)):
    # Iterate through each MRI image data

    # Retrieve the patient group information using the 'get_patient_detail' function
    patient_group = get_patient_detail(i)['Group'].values[0]

    # Append the patient group to the 'cnn_y' list
    cnn_y.append(patient_group)

# Encode the patient groups as integers with the label encoder created earlier,
# then convert the integers to one-hot vectors
cnn_y = to_categorical(label_encoder.fit_transform(cnn_y))
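
To make the encoding concrete, the sketch below mimics in plain Python what `LabelEncoder` followed by `to_categorical` produces: classes are sorted alphabetically, mapped to integers, and expanded to one-hot rows. The function name `encode_labels` and the sample labels are illustrative assumptions.

```python
def encode_labels(labels):
    # Sorted unique labels define the class order, as LabelEncoder does
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    codes = [index[name] for name in labels]
    # One row per sample with a single 1.0 at the class position
    one_hot = [[1.0 if j == code else 0.0 for j in range(len(classes))]
               for code in codes]
    return classes, codes, one_hot


groups = ["Nondemented", "Demented", "Converted", "Demented"]
classes, codes, one_hot = encode_labels(groups)

print(classes)
print(codes)
print(one_hot)
```

With the three OASIS groups this ordering means that column 0 corresponds to Converted, column 1 to Demented and column 2 to Nondemented.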


In [None]:
# Select three slices (indices 70, 100 and 125) along the third axis of each MRI volume.
# Each selection has shape (256, 256, 3, 1) and is reshaped to (256, 256, 3) so that the
# three slices act as the channels of an RGB-like image.
# The result is stored in the mri_images_data_p numpy array for further processing.
mri_images_data_p = np.array(
    [i[:, :, [70, 100, 125]].reshape(256, 256, 3) for i in mri_images_data])
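
The effect of this slice selection can be verified on a small synthetic volume. This is a sketch only: the tiny (4, 4, 8, 1) array and the slice indices stand in for a real (256, 256, 128, 1) scan and the indices used above.

```python
import numpy as np

# Synthetic stand-in for one MRI volume with a trailing singleton axis
volume = np.arange(4 * 4 * 8, dtype=float).reshape(4, 4, 8, 1)
slice_ids = [1, 3, 6]

# Fancy indexing on the third axis keeps the trailing axis
picked = volume[:, :, slice_ids]
# Dropping the singleton axis turns the slices into channels
stacked = picked.reshape(4, 4, 3)

print(picked.shape)
print(stacked.shape)
# Channel k of the result is exactly slice slice_ids[k] of the volume
print(np.array_equal(stacked[:, :, 1], volume[:, :, 3, 0]))
```

Because the removed axis has length one, the reshape only drops that axis and does not reorder any voxels.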


In [None]:
print(f"{bold('Length of image data: ')}{len(mri_images_data)}")
print(f"{bold('Length of y: ')}{len(cnn_y)}")
print(f"{bold('Shape of an image: ')}{mri_images_data_p[0].shape}")


### Data Analysis

#### Importing pre-trained model

VGG16 is a popular convolutional neural network architecture widely used for various computer vision tasks, including image classification. ImageNet is a large-scale dataset with millions of labeled images used for training deep learning models.

In [None]:
from tensorflow.keras import applications
# VGG16 pre-trained model without fully connected layers and with different input dimensions
input_shape = (256, 256, 3)
model = applications.VGG16(
    weights="imagenet", include_top=False, input_shape=input_shape)
model.summary()


In [None]:
# freeze the models layers
for layer in model.layers:
    layer.trainable = False


In [None]:
from tensorflow.keras.layers import LeakyReLU, Flatten, Dense
from tensorflow.keras.models import Model

# Flatten the output of the base model
flatten = Flatten()(model.output)

# Add a fully connected layer with 64 units
fc1 = Dense(64)(flatten)

# Apply LeakyReLU activation to introduce non-linearity
leaky_relu = LeakyReLU(alpha=0.1)(fc1)

# Add the final output layer with 3 units and softmax activation
output = Dense(3, activation='softmax')(leaky_relu)

# Create a new model by specifying the input and output layers
new_model = Model(inputs=model.input, outputs=output)

# Compile the new model with optimizer, loss function, and metrics
new_model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])

# Print the summary of the new model
new_model.summary()


#### Splitting data into training and test

In [None]:
from sklearn.model_selection import train_test_split

#  split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(
    mri_images_data_p, cnn_y, test_size=.25)

print(f"{bold('Train shape: ')}{train_X.shape}")
print(f"{bold('Test shape: ')}{test_X.shape}")


#### Training the model

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Define an early stopping callback with a patience of 6 epochs
early_stopping = EarlyStopping(patience=6)

# Train the new model with early stopping
hist = new_model.fit(train_X, train_y, validation_split=.2,
                     epochs=20, callbacks=[early_stopping])
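
The role of `patience` can be illustrated without Keras. The sketch below is a simplified model of early stopping on validation loss, assuming that any decrease counts as an improvement; the real callback has further options (for example `min_delta` and `restore_best_weights`) that are not modelled here, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never ran out of patience: all epochs were used
    return len(val_losses)


# Hypothetical validation losses; the best value is reached at epoch 3
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.74, 0.73, 0.76, 0.70]
stop_patience_6 = stopping_epoch(losses, patience=6)
stop_patience_3 = stopping_epoch(losses, patience=3)
print(stop_patience_6, stop_patience_3)
```

A larger patience tolerates longer plateaus at the cost of more epochs after the best one.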


#### Visualization of Accuracy and Validation loss of the model while training

In [None]:
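history_df = pd.DataFrame(hist.history)
# Select the curves by name so the plots do not depend on the column order;
# the first three epochs are skipped
history_df.iloc[3:][['loss', 'val_loss']].plot(
    title="Loss vs Validation Loss")
history_df.iloc[3:][['accuracy', 'val_accuracy']].plot(
    title="Accuracy vs Validation Accuracy")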


#### Prediction

In [None]:
# Predicted class index for each sample in test_X
y_pred = np.argmax(new_model.predict(test_X), axis=1)


### Metrics
- Accuracy Score
- F1 Score
- Confusion Matrix
- Classification Report

In [None]:

print(f"{bold('Accuracy Score: ')}{accuracy_score(np.argmax(test_y,axis=1),y_pred)}\n")
print(f"{bold('Classification report')}{classification_report(np.argmax(test_y,axis=1),y_pred)}\n")
print(f"{bold('F1 Score: ')}{f1_score(np.argmax(test_y,axis=1),y_pred, average=None)}\n")


In [None]:
# CONFUSION MATRIX PLOT
plt.figure(figsize=(8, 6))

# Compute the confusion matrix
c_m = confusion_matrix(np.argmax(test_y, axis=1), y_pred)

# Plot the confusion matrix as a heatmap
sns.heatmap(c_m, annot=True, fmt='d', cmap='Set2')
plt.title("Confusion Matrix")

In [None]:
# Plot 4 images from the test dataset with their actual and predicted labels

images = [6, 10, 14, 26]
plt.figure(figsize=(16, 12))
true_labels = np.argmax(test_y, axis=1)

for index, image in enumerate(images):
    img_data = test_X[image]

    # The indices refer to positions in the shuffled test split,
    # not to positions in the original list of scan names
    actual = label_encoder.classes_[true_labels[image]]
    predicted = label_encoder.classes_[y_pred[image]]

    plt.subplot(2, 2, index+1)
    plt.title(f'\nGroup: {actual}\nPredicted: {predicted}')
    plt.imshow(img_data[:, :, -1], cmap='gray', origin='lower')
    plt.xlabel('First axis')
    plt.ylabel('Second axis')
    plt.colorbar(label='Signal intensity')
plt.show()


### Post Analysis 
Saving the trained model (architecture and weights) to an HDF5 file.

In [None]:
new_model.save("OASIS_MODEL_1.h5")

### Results 
The data set consists of `3-dimensional greyscale MRI images with shape (256, 256, 128)`. However, the pre-trained model was originally trained on colour images of `shape (224, 224, 3)`. Hence each scan was reduced to three channels by selecting `specific slices (indices 70, 100 and 125)`, giving inputs of shape (256, 256, 3) in the format required by the model. The labels (i.e., demented, non-demented and converted) were extracted from each `subject's Demographic (DM) data` for the respective images. These labels, together with the MRI images, were used for training the model.

### Discussion 

A `Convolutional Neural Network (CNN)` is a popular choice for `image recognition` because it can detect patterns and objects in an image. We specifically decided to use VGG16, a pre-trained CNN model, because we had limited data, which would not have been enough to train a CNN from scratch given its complex structure.

 

Further, it is interesting to note that in the `OASIS2 brain dataset`, 72 subjects were characterized as nondemented throughout the study, and 64 subjects were characterized as demented at the initial visit and throughout the study, including 51 subjects with mild to moderate Alzheimer's. The remaining 14 subjects were determined to be non-demented at the initial visit and were marked as demented at a later visit (converted). `The data from these 14 converted subjects are crucial for the problem statement, i.e., early detection, as their MRI scans are likely to contain patterns that appear in the early stages of the disease.`
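
The group sizes quoted in this paragraph show how unevenly the classes are represented. The short sketch below turns them into shares; note that these are subject counts, so the per-scan proportions seen by the model may differ.

```python
# Subject counts per group as quoted in the text
group_sizes = {"Nondemented": 72, "Demented": 64, "Converted": 14}

total = sum(group_sizes.values())
shares = {group: count / total for group, count in group_sizes.items()}

for group, share in shares.items():
    print(f"{group:<12} {group_sizes[group]:>3} subjects  {share:6.1%}")

# Accuracy of a trivial classifier that always predicts the largest group
majority_baseline = max(shares.values())
print(f"Majority-class baseline: {majority_baseline:.1%}")
```

The converted group is the smallest by a wide margin, which is worth keeping in mind when reading an overall accuracy figure.
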
#### Evaluation 

The model classifies the images as `demented, non-demented and converted with an accuracy of 94 percent`. Still, accuracy alone cannot give the whole picture, as it only considers the correctly predicted outputs (True Positive + True Negative) relative to the total number of outputs. Hence, it can give a false impression when the data are unbalanced.

To overcome the limitation of the accuracy score, and as a requirement of the problem statement, we want to emphasize reducing False Negative results, i.e. cases where a subject suffers from Alzheimer's but the model detects otherwise. Therefore, the metrics we focus on to evaluate the model are `Recall and F1 Score`. Recall is the fraction of the items of interest to the user that are retrieved by the system \cite{alvarez2002exact}. The F1 score, the harmonic mean of precision and recall, is widely used to measure the success of a classifier when one class is rare \cite{lipton2014thresholding}.

`The Recall and F1 score for the given model are 95 percent and 92 percent, respectively.`
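
To make the definitions of recall and F1 concrete, the sketch below computes them per class from a confusion matrix whose rows are true classes and whose columns are predicted classes. The matrix values are made up for illustration and are not the results of this model; the function names are illustrative.

```python
def per_class_metrics(cm):
    """Precision, recall and F1 for each class of a square confusion matrix."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted other
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, true other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def accuracy(cm):
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted)
cm = [
    [2, 1, 0],
    [0, 8, 2],
    [1, 0, 9],
]
metrics = per_class_metrics(cm)
overall_accuracy = accuracy(cm)

for k, m in enumerate(metrics):
    print(f"class {k}: recall={m['recall']:.3f} f1={m['f1']:.3f}")
print(f"accuracy={overall_accuracy:.3f}")
```

In this made-up example the rare class 0 has a recall of about 0.67 even though the overall accuracy is about 0.83, which is the kind of gap that accuracy alone hides.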

 

#### Limitation

The OASIS2 brain dataset consists of only 150 subjects and 373 MR sessions, all the subjects are aged between 60 and 96, and all the subjects are right-handed. We can see bias in the data as we do not have samples from young and left-handed subjects.

Moreover, due to the data's complexity and large size (3D images), we could only train the model on limited subjects, as training the model on the whole data set would require greater computational power.

#### Future Scope

The future scope of the study would be to work with the latest `OASIS4 dataset`, which has more than `600 subjects and MR sessions`. But then, to process extensive data, we require robust systems that can process and transform the data. Further, we would like to collaborate with subject matter experts as they can assist us better in decision-making and understanding the problem.

### Conclusions 

In conclusion, the `study focuses on Alzheimer's disease detection at an early stage using image recognition with a CNN.` This work can be used in the development of robust diagnostic tools. Additionally, it can help improve Alzheimer's disease detection, and it highlights the potential of machine learning and neuroimaging techniques for advancing the understanding of structural change in the brain.

### References
- Data were provided by OASIS [Longitudinal: Principal Investigators: D. Marcus, R. Buckner, J. Csernansky, J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, U24 RR021382](https://doi.org/10.1162/jocn.2009.21407)
- Philip Scheltens, Kaj Blennow, Monique M B Breteler, Bart de Strooper, Giovanni B Frisoni, Stephen Salloway, and Wiesje Maria Van der Flier. Alzheimer’s disease. The Lancet, 388(10043):505–517, 2016.
- Karteek Popuri, Da Ma, Lei Wang, and Mirza Faisal Beg. Using machine learning to quantify structural MRI neurodegeneration patterns of Alzheimer’s disease into dementia score: Independent validation on 8,834 images from ADNI, AIBL, OASIS, and MIRIAD databases. Human Brain Mapping, 41(14):4127–4147, 2020.
- Daniel S. Marcus, Anthony F. Fotenos, John G. Csernansky, John C. Morris, and Randy L. Buckner. Open Access Series of Imaging Studies: Longitudinal MRI Data in Nondemented and Demented Older Adults. Journal of Cognitive Neuroscience, 22(12):2677–2684, 12 2010.
- Sergio A Alvarez. An exact analytical relation among recall, precision, and classification accuracy in information retrieval. Boston College, Boston, Technical Report BCCS-02-01, pages 1–22, 2002.
- Zachary Chase Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding classifiers to maximize F1 score. arXiv preprint arXiv:1402.1892, 2014.