# Milestone 3: Alzheimer's Group 16


#### Daniel Graziano, Daniel Molina Hurtado, Esmail Fadae, Paxton Maeder-York

Our strategy is to review all the supporting datasets, extract the relevant columns, and join them on patient ID (RID), keeping only test rows that occurred post-diagnosis.
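The strategy above can be sketched on toy data as follows. This is only an illustrative sketch, not the project's actual pipeline: the two small tables, the helper `join_post_diagnosis`, and every column name other than `RID` (`DX_DATE`, `DX`, `EXAM_DATE`, `SCORE`, `UNUSED`) are hypothetical, and "post-diagnosis" is assumed here to mean an exam date on or after the diagnosis date.

```python
import pandas as pd

# Hypothetical toy tables; only RID comes from the text above.
diagnosis = pd.DataFrame({
    "RID": [1, 2, 3],
    "DX_DATE": pd.to_datetime(["2010-01-15", "2011-06-01", "2012-03-20"]),
    "DX": ["AD", "MCI", "CN"],
})
tests = pd.DataFrame({
    "RID": [1, 1, 2, 3, 4],
    "EXAM_DATE": pd.to_datetime(
        ["2009-12-01", "2010-02-01", "2011-07-15", "2012-01-10", "2013-05-05"]
    ),
    "SCORE": [28.0, 24.0, 26.0, 30.0, 27.0],
    "UNUSED": ["a", "b", "c", "d", "e"],
})


def join_post_diagnosis(diagnosis, tests, test_cols):
    """Keep selected test columns, join on RID, and drop pre-diagnosis rows."""
    # Pull out only the relevant columns from the supporting dataset.
    subset = tests[["RID", "EXAM_DATE"] + test_cols]
    # An inner join drops patients that are missing from either table.
    merged = diagnosis.merge(subset, on="RID", how="inner")
    # Assumption: "post-diagnosis" means the exam is on or after the diagnosis date.
    post = merged[merged["EXAM_DATE"] >= merged["DX_DATE"]]
    return post.reset_index(drop=True)


result = join_post_diagnosis(diagnosis, tests, ["SCORE"])
print(result)
```

In this toy example only the second exam of patient 1 and the single exam of patient 2 survive the filter; patient 4 has no diagnosis row and is dropped by the inner join.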

## Introduction

In this project, we work with the large body of data provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI). This data has been collected by researchers at more than 60 sites in the US and Canada, working with thousands of participants between the ages of 55 and 90. Participants can enter the study with normal cognitive function, mild cognitive impairment (MCI) or Alzheimer's Disease (AD). The study is divided into several phases: ADNI1 (2004-2009), ADNIGO (2009-2011), ADNI2 (2011-2016) and ADNI3 (2016-2021). It collects different types of data, such as demographics, family history, genetics, neuropsychological tests, imaging and biomarkers.

Our initial goal in this project is to determine which biomarkers and neuropsychological tests are most effective for predicting Alzheimer's disease. If time permits, we will also look at summary imaging data, but that is out of scope for now.

We will start by exploring and combining data from datasets in different areas, then cleaning and preparing it for modeling and prediction in later stages of the project.

#### Libraries and Imports

In [0]:
#Basic Imports
import numpy as np
import pandas as pd
import math
import requests

#Preprocessing Imports
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing.imputation import Imputer
from sklearn.preprocessing import StandardScaler

#Model Imports
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

#Plotting Imports
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')
pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)

#Other Imports
from scipy.special import gamma
from IPython.display import display
from typing import List, Tuple, Dict
from io import BytesIO

#Keras Imports
import keras 
from keras.models import Sequential
from keras import regularizers
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Conv2D, MaxPooling2D, Dense, Input, Flatten, Dropout, UpSampling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import Adam, SGD
from keras.utils import np_utils


In [0]:

#Import some data from our shared Google Drive folder

#file_id comes from the shareable link of the file on Google Drive:
#it is the set of alphanumeric characters after the '/d/' segment of the share URL
#or after the 'id=' parameter, see example below
#https://drive.google.com/open?id=1FSjJjpS1Ob_BEbshyl9dXb1FFmCnZrHE
#the response bytes are wrapped in BytesIO so pandas can read them into a df

def read_gdrive_data(file_id):
  """Download a CSV file from Google Drive by file_id and return it as a DataFrame."""
  response = requests.get('https://drive.google.com/uc?export=download&id=%s' % file_id)
  response.raise_for_status()  #fail on HTTP errors instead of parsing an error page
  #entries equal to '-4' are read as missing values (NaN)
  df = pd.read_csv(BytesIO(response.content),na_values='-4')
  return df
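To illustrate what the `na_values='-4'` argument does without touching the network, here is a minimal sketch that parses a small in-memory CSV the same way `read_gdrive_data` parses the downloaded bytes. The CSV content and the column names `MMSE` and `CDR` are made up for illustration and are not taken from the ADNI files.

```python
from io import BytesIO

import pandas as pd

# Hypothetical CSV payload standing in for response.content.
raw = b"RID,MMSE,CDR\n1,28,0.5\n2,-4,1.0\n3,25,-4\n"

# Same parsing call as in read_gdrive_data: '-4' entries become NaN.
df = pd.read_csv(BytesIO(raw), na_values="-4")

print(df)
print(df.isna().sum())
```

Columns that contain a converted `-4` are read as floating point, since NaN cannot be stored in a plain integer column.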