#  <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109/209A Introduction to Data Science Final Project
## Alzheimer's Disease Neuroimaging Initiative
#### Group 16: Daniel Graziano, Daniel Molina-Hurtado, Esmail Fadae, Paxton Maeder-York


In [2]:
####################
#   IMPORTS CELL   #
####################

# ignore warnings for aesthetic reasons
import warnings
warnings.filterwarnings('ignore')

#Basic Imports
import numpy as np
import pandas as pd
import math
import requests

#Preprocessing Imports
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing.imputation import Imputer
from sklearn.preprocessing import StandardScaler

#Model Imports
from sklearn import svm
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC

#Plotting Imports
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')
pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)

#Keras Imports
import keras 
from keras.models import Sequential
from keras import regularizers
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Conv2D, MaxPooling2D, Dense, Input, Flatten, Dropout, UpSampling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import Adam, SGD
from keras.utils import np_utils

#Other Imports
from scipy.special import gamma
from IPython.display import display, HTML
from typing import List, Tuple, Dict
from io import BytesIO
from sklearn.pipeline import Pipeline
import platform


#install and import extra packages not included in base environment
!pip install numba;
!pip install umap-learn;
!pip install graphviz;

if platform.system() == 'Linux':
  !apt-get install graphviz;
elif platform.system() == 'Darwin':
  !brew install graphviz;
else:
  # the category argument must be a Warning subclass, not a string
  warnings.warn('Please install graphviz binaries to generate decision tree graphs properly', UserWarning)

import graphviz 
from umap import UMAP

Using TensorFlow backend.


Reading package lists... Done
Building dependency tree       
Reading state information... Done
graphviz is already the newest version (2.40.1-2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.


In [0]:
#Import data from our shared google drive folder

#file_id comes from the shareable link of the file on google drive
#it is the string of alphanumeric characters after the '/d/' segment of the share URL
#or after the 'id=' parameter, see example below
#https://drive.google.com/open?id=1FSjJjpS1Ob_BEbshyl9dXb1FFmCnZrHE
#the downloaded bytes are wrapped in BytesIO so pandas can read them into a df

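def read_gdrive_data(file_id):
  # download the raw bytes of the shared file and fail loudly on an HTTP error
  response = requests.get('https://drive.google.com/uc?export=download&id=%s' % file_id)
  response.raise_for_status()
  # wrap the bytes in a file-like buffer so pandas can parse them as CSV
  df = pd.read_csv(BytesIO(response.content))
  return df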
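
The comments above describe where the file ID sits in a Google Drive share link. As a minimal sketch of that step, the helper below pulls the ID out of either URL form. The name `extract_gdrive_file_id` is hypothetical and is not part of the original notebook, and the regular expression assumes that IDs contain only letters, digits, underscores and hyphens.

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_gdrive_file_id(share_url):
    """Hypothetical helper: return the file ID embedded in a Google Drive share URL."""
    parsed = urlparse(share_url)
    # Form 1: https://drive.google.com/open?id=<file_id>
    query_ids = parse_qs(parsed.query).get('id')
    if query_ids:
        return query_ids[0]
    # Form 2: https://drive.google.com/file/d/<file_id>/view
    # Assumption: IDs are made of letters, digits, '_' and '-'
    match = re.search(r'/d/([A-Za-z0-9_-]+)', parsed.path)
    if match:
        return match.group(1)
    raise ValueError('No file ID found in URL: %s' % share_url)
```

The returned string could then be passed directly to `read_gdrive_data` instead of copying the ID out of the link by hand.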

## Introduction

### Motivation
Alzheimer's Disease (AD) is the sixth leading cause of death in the United States and the [third leading cause among people 65 and older](https://www.nia.nih.gov/news/number-alzheimers-deaths-found-be-underreported). It is an irreversible neurodegenerative disease that progressively damages an individual's cognitive functioning. It usually starts with memory loss and a decline in thinking ability, then progresses to problems with language and orientation and to loss of body functions, ultimately leading to death. Life expectancy after diagnosis ranges from 3 to 8 years, depending on the individual. The disease is significantly more common in older adults, usually those over 65.

There is currently no cure for the disease, and its causes are not yet completely understood. Given the impact of Alzheimer's, a large global research effort is devoted to its diagnosis and treatment. One of the biggest projects is the Alzheimer's Disease Neuroimaging Initiative (ADNI), whose data will be used in this project.

### ADNI

The Alzheimer’s Disease Neuroimaging Initiative is a joint research initiative whose goal is to improve the understanding, diagnosis and treatment of Alzheimer's Disease. As part of this study, researchers at more than 60 sites in the United States and Canada have collected, and continue to collect, data from over a thousand participants. The data range from demographics and family history to genetics, neuropsychological tests, imaging and biomarkers. The project is divided into several phases: ADNI1 (2004-2009), ADNIGO (2009-2011), ADNI2 (2011-2016) and ADNI3 (2016-2021).

<table>
  <tr></tr>
  <tr>
    <td>
      <img src="http://drive.google.com/uc?export=view&id=1d3P68jeOS29j2MtfXFB1mGyUh-PZ6aj8">
    </td>
  </tr>
</table>

### Objectives

Our main goals for this project are:

1) To build a model that predicts whether an individual will develop Alzheimer's disease, using the data provided by ADNI. This matters because an early diagnosis gives access to treatments that can slow the progression of the disease [[source]](https://www.alz.org/alzheimers-dementia/diagnosis/why-get-checked).

2) To identify the information and tests that are most relevant for predicting a diagnosis. This can help reduce the cost and effort of the diagnosis phase by narrowing the set of tests to the most essential ones.
