# DARWIN Notebook Assignment

In [3]:
#run this cell
import numpy as np
import pandas as pd
from datascience import *
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns

# About this dataset

This notebook assignment uses the DARWIN (Diagnosis AlzheimeR WIth haNdwriting) dataset from the UC Irvine Machine Learning Repository. It contains handwriting data collected to support the prediction of Alzheimer's disease (AD) through handwriting analysis. The dataset covers 174 participants: 89 AD patients and 85 healthy people.
Participants were recruited using standard clinical tests, namely, Mini-Mental State Examination (MMSE), Frontal Assessment Battery (FAB), and Montreal Cognitive Assessment (MoCA). These tests use questionnaires to assess cognitive skills covering many areas, ranging from orientation in time and place to registration recall.

The 25 tasks used to collect handwriting data fall into four groups:

1. Graphic tasks, which tested participants' ability to write elementary traits, such as joining points and drawing geometrical figures.
2. Copy tasks, which evaluated participants' ability to repeat complex graphic gestures that have semantic meaning, such as letters, words and numbers.
3. Memory tasks, which tested changes in the writing process for content previously memorized or associated with objects shown in a picture.
4. Dictation tasks, which investigated how handwriting varies when working memory is used.

# Interpreting the data

In [None]:
#load the data into a Pandas dataframe below

darwin_table = ...
darwin_table

In [None]:
#find some summary statistics for this dataset
...

#look for null values
...

1. What is the mean for 'air_time1'?
2. What are the minimum values for 'max_x_extension1' and 'max_y_extension1'?
3. Which column has the greatest standard deviation?
4. Are there missing values in the data?
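
As a hedged illustration of the kinds of pandas calls that can answer questions like these, the sketch below runs on a small made-up frame. The names `toy`, `feature_a`, and `feature_b` are hypothetical and are not columns from the DARWIN data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, np.nan, 30.0, 50.0],
})

summary = toy.describe()            # count, mean, std, min, quartiles, max per column
mean_a = toy["feature_a"].mean()    # mean of a single column
min_b = toy["feature_b"].min()      # minimum; NaN values are skipped by default
widest = toy.std().idxmax()         # name of the column with the largest standard deviation
missing = toy.isnull().sum()        # number of missing values per column

print(summary)
print(mean_a, min_b, widest)
print(missing)
```

The same calls apply to any numeric dataframe; only the column names change.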

In [None]:
#find out some information about this dataframe
...

1. What are the dimensions of this dataframe?
2. Which datatypes are included in this dataframe?

In [None]:
#The 'class' column consists of labels for each individual's handwriting data
darwin_table["class"]

#replace 'P' with Patient and 'H' with Healthy
darwin_table...


In [None]:
#find the number of healthy people in the dataset
darwin_table['class']...

number_healthy = ...
#find the number of patients(with Alzheimer's)
number_w_alz = ...

# Visual Representations of Data

In [4]:
#Select features with strong correlations to visualize data

#start off with all the features for the 1st task (column slice 1:19)

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Correlation measures the strength of the linear relationship between two variables. When two variables are strongly correlated, one can be used to predict the other. The logic behind using correlation for feature selection is that good features are highly correlated with the target but uncorrelated with one another.

If two features are strongly correlated with each other, the model only really needs one of them, because the second adds little additional information.
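
To make the redundancy idea concrete, here is a minimal sketch on synthetic data. The frame `toy` and its columns `a`, `a_scaled`, and `b` are assumptions made up for illustration; the 0.9 cutoff is simply a convenient threshold, not a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3.0 * a + rng.normal(scale=0.01, size=n),  # nearly a copy of "a"
    "b": rng.normal(size=n),                                # independent of "a"
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()            # (row, column) -> correlation
strong_pairs = pairs[pairs.abs() > 0.9]   # threshold on the absolute value
print(strong_pairs)
```

Here `a` and `a_scaled` carry essentially the same information, so a model would need only one of them, while `b` is unrelated to both.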

In [6]:
#create a correlation matrix
cor_task1 = darwin_table...
print(cor_task1)

In [None]:
#flag entries whose absolute correlation exceeds 0.9 (strong positive or negative)
print(cor_task1.abs() > 0.9)

In [None]:
#plotting heatmap
plt.figure(figsize = (100,20))
sns.heatmap(cor_task1, annot = True)

1. Which features have the strongest correlation values (positive or negative)?
2. Create at least 2 scatterplots comparing features with strong correlations. Why is this not the best visualization method? (Switch to numpy for creating scatterplots if you do not have spyder 5.1.5 and pycodestyle>=2.8.0.)

In [None]:
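#air time and total time (one option)
#assumption: np_darwin_table is a datascience Table built from the dataframe
np_darwin_table = Table.from_df(darwin_table)

#if you renamed the labels earlier, filter on 'Healthy' and 'Patient' instead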
x = np_darwin_table.where('class', are.equal_to('H')).column('air_time1')
y = np_darwin_table.where('class', are.equal_to('H')).column('total_time1')

plt.scatter(x, y, color = 'hotpink')


x = np_darwin_table.where('class', are.equal_to('P')).column('air_time1')
y = np_darwin_table.where('class', are.equal_to('P')).column('total_time1')

plt.scatter(x, y, color = '#88c999')

plt.show()

In [None]:
#choose own features
x = np_darwin_table.where('class', are.equal_to('H')).column(...)
y = np_darwin_table.where('class', are.equal_to('H')).column(...)

plt.scatter(x, y, color = 'hotpink')


x = np_darwin_table.where('class', are.equal_to('P')).column(...)
y = np_darwin_table.where('class', are.equal_to('P')).column(...)

plt.scatter(x, y, color = '#88c999')

plt.show()

# Using Information Gain

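Information gain measures how much the entropy (uncertainty) of the target variable drops once the value of a feature is known. It can be used for feature selection by computing the information gain of each feature with respect to the target and favoring the features that score highest. The `mutual_info_classif` function used below estimates this quantity as the mutual information between each feature and the target.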
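
The following sketch works through the definition by hand for discrete features. The helper names `entropy` and `information_gain` and the toy lists are hypothetical. For continuous features, scikit-learn's `mutual_info_classif` relies on a nearest-neighbor estimate instead, so its numbers will not match a hand computation like this one.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["H", "H", "P", "P"]
informative = ["low", "low", "high", "high"]    # separates the classes perfectly
uninformative = ["low", "high", "low", "high"]  # says nothing about the class

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

A feature that splits the classes cleanly removes all of the uncertainty in the labels, while a feature unrelated to the class removes none.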

In [7]:
from sklearn.feature_selection import mutual_info_classif
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#declare features and target variables 
#use just task 1
X_task1 = ... #FEATURES
Y_task1 = ... #TARGET

print(X_task1)
print(Y_task1)

importances = mutual_info_classif(X_task1, Y_task1)
feat_importances = pd.Series(importances, darwin_table.columns[1:19])
print(feat_importances)

feat_importances.plot(kind='barh', color='teal')
plt.show()

1. Which features result in the most information gain?
2. Use the two features that result in the most information gain to create a scatterplot.

In [8]:
#create a scatterplot

Now, select the three most significant features for the first task, and create a 3D scatterplot.

In [9]:
#make 3D scatterplot

Rotate the visualization. Does it make a difference? Which orientation of the features results in the best visualization?
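
One possible way to build and rotate a 3D scatterplot with matplotlib is sketched below. The arrays `healthy` and `patient` are synthetic stand-ins for three features in two classes, not DARWIN data, and the angles passed to `view_init` are arbitrary starting values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical stand-ins for three features measured in two classes.
healthy = rng.normal(loc=0.0, size=(50, 3))
patient = rng.normal(loc=2.0, size=(50, 3))

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(projection="3d")
ax.scatter(healthy[:, 0], healthy[:, 1], healthy[:, 2], color="hotpink", label="H")
ax.scatter(patient[:, 0], patient[:, 1], patient[:, 2], color="#88c999", label="P")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")
ax.legend()

# elev is the angle above the x-y plane and azim the rotation about the z axis, in degrees.
ax.view_init(elev=20, azim=60)
```

Changing `elev` and `azim` and redrawing the figure shows the same points from a different viewpoint, which can reveal or hide the separation between the classes.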

In [None]:
#rotate