# Section 08 - More Classification
### Introduction to Data Science EN.553.436/EN.553.636 - Fall 2021

For this analysis we will use the phoneme data set from OpenML. Read the full description [here](https://www.openml.org/d/1489), but key details are below:

Features:
- V1: Amplitude of first harmonic
- V2: Amplitude of second harmonic
- V3: Amplitude of third harmonic
- V4: Amplitude of fourth harmonic
- V5: Amplitude of fifth harmonic

Classes:
- 1 - Nasal vowel
- 2 - Oral vowel

First download the `vowel_data.csv` file from Blackboard and save it in the same folder as this notebook.

In [None]:
## Import data set
import pandas as pd
import numpy as np
vowel_df = pd.read_csv("vowel_data.csv")
display(vowel_df)

To make this data easier to graph and visualize, we will only work with the first two features and drop the rest:

In [None]:
X = np.array(vowel_df[['V1','V2']])
y = np.array(vowel_df['Class'])
print('Features 1 and 2: \n')
print(X)
print('\n Class:')
print(y)

# 1. Quadratic Discriminant Analysis (QDA)

## 1.1
**Split the data into training and test sets, holding out 30% of the data for testing (use random state 235). Fit both QDA and LDA on the training set and compare their accuracy scores on the test set.**
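
A minimal sketch of this workflow, run on synthetic two-class data so that it does not depend on `vowel_data.csv`. The toy data, the `_demo` and `Xd_`/`yd_` names, and the reading of "30%" as the test fraction are assumptions made for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the vowel data: two Gaussian clouds with different spreads.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(1.5, 2.0, size=(200, 2))])
y_demo = np.repeat([1, 2], 200)

# Hold out 30% of the rows for testing (assumed meaning of the 30% split).
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=235)

# Fit each model on the training rows and score it on the held-out rows.
demo_scores = {}
for name, model in [("LDA", LDA()), ("QDA", QDA())]:
    model.fit(Xd_train, yd_train)
    demo_scores[name] = accuracy_score(yd_test, model.predict(Xd_test))
    print(name, "test accuracy:", demo_scores[name])
```

Both models see exactly the same split, so any difference in accuracy comes from the models rather than from the data they were given.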

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

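## Hold out 30% of the data for testing, with the random state given above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=235)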


## 1.2 
**Compare the prediction results from QDA and LDA. Determine the number of test data points for which QDA and LDA predicted different classes.**
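
One way to count disagreements, sketched with hypothetical prediction arrays (`pred_a` and `pred_b` are made-up stand-ins, not output from the vowel data): compare the two arrays element-wise and sum the resulting boolean mask.

```python
import numpy as np

# Hypothetical predictions from two classifiers on the same six test points.
pred_a = np.array([1, 2, 2, 1, 1, 2])
pred_b = np.array([1, 2, 1, 1, 2, 2])

# Element-wise comparison gives a boolean mask; summing it counts the True entries.
disagree = pred_a != pred_b
n_disagree = int(disagree.sum())

print(n_disagree)                 # number of points with different predictions
print(np.flatnonzero(disagree))   # positions of those points in the test set
```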

## 1.3
When we are dealing with 2 classes, we can plot a decision boundary where the probability of Class 1 is equal to the probability of Class 2. This concept becomes much less straightforward when dealing with more than 2 classes.
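
As a sketch of where the two boundary shapes come from, assume the standard Gaussian class-conditional model. The symbols $\pi_k$, $\mu_k$, and $\Sigma_k$ (class prior, mean, and covariance) are notation introduced here, not taken from the notebook. Each class $k$ gets a discriminant score, and the decision boundary is the set of points where the two scores tie:

$$\delta_k^{\mathrm{LDA}}(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

$$\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2}\, (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

$$\text{boundary:} \quad \delta_1(x) = \delta_2(x)$$

With a shared covariance $\Sigma$, the terms that are quadratic in $x$ cancel when the two scores are set equal, so the tie set is a straight line in the (V1, V2) plane. With class-specific covariances $\Sigma_k$ they generally do not cancel, so the tie set is a quadratic curve.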

The plot below shows the decision boundaries for QDA and LDA (which one is which?), along with the true classes of the test points.

In [None]:
import matplotlib.pyplot as plt

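## Update these lines if you already fitted models under different names in 1.1
QDA_model = QDA().fit(X_train, y_train)
LDA_model = LDA().fit(X_train, y_train)
###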


x_min = -3
x_max = 4
y_min = -5
y_max = 4

xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),np.linspace(y_min, y_max, 200))

Z_QDA = QDA_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z_QDA = Z_QDA[:, 1].reshape(xx.shape)

Z_LDA = LDA_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z_LDA = Z_LDA[:,1].reshape(xx.shape)

plt.contour(xx, yy, Z_QDA, [0.50], linewidths=2, colors='black')  ## Draw the QDA boundary where P = 0.50
plt.contour(xx, yy, Z_LDA, [0.50], linewidths=2, colors='black')  ## Draw the LDA boundary where P = 0.50

plt.plot(X_test[y_test==1,0],X_test[y_test==1,1],'.b')
plt.plot(X_test[y_test==2,0],X_test[y_test==2,1],'.r')

plt.xlabel('V1')
plt.ylabel('V2')
plt.legend(['Class 1','Class 2'])
plt.show()
#####


**Create a new plot using the same decision boundaries, but this time add the following points from your test set (with different markers):**
- Both LDA and QDA predict class 1
- Both LDA and QDA predict class 2
- LDA and QDA have different predictions
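
A minimal sketch of the grouping and plotting step, using hypothetical points and predictions (`pts`, `pred_a`, `pred_b` are illustrative stand-ins with labels 1 and 2, not results from the vowel data). It builds one boolean mask per group and draws each group with its own marker.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical test points and predictions from two models (labels 1 and 2).
pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0], [1.5, 1.5]])
pred_a = np.array([1, 1, 2, 2, 1])
pred_b = np.array([1, 2, 2, 2, 1])

# One boolean mask per group; every point falls in exactly one of them.
both_1 = (pred_a == 1) & (pred_b == 1)
both_2 = (pred_a == 2) & (pred_b == 2)
differ = pred_a != pred_b

# A distinct marker per group keeps the groups readable on top of the boundaries.
for mask, marker, label in [(both_1, 'o', 'both predict 1'),
                            (both_2, 's', 'both predict 2'),
                            (differ, 'x', 'models disagree')]:
    plt.scatter(pts[mask, 0], pts[mask, 1], marker=marker, label=label)
plt.legend()
plt.show()
```

In the exercise itself, the same masks would be built from the two models' predictions on `X_test` and drawn after the two `plt.contour` calls.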

# 2. Binary Decision Tree

## 2.1
**Using the same train and test data sets, classify the data using a binary decision tree with maximum depth 5.**
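
A sketch of fitting a depth-limited tree, shown on synthetic data rather than the vowel data. The names `X_toy`, `y_toy`, and `toy_tree`, the disc-shaped labels, and the fixed `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn import tree

# Toy data: uniform points in a square.
rng = np.random.default_rng(1)
X_toy = rng.uniform(-1, 1, size=(300, 2))
# Label 2 inside a disc and 1 outside: a boundary no single axis-aligned split captures.
y_toy = np.where((X_toy ** 2).sum(axis=1) < 0.5, 2, 1)

# max_depth caps the number of successive splits from the root to any leaf.
toy_tree = tree.DecisionTreeClassifier(max_depth=5, random_state=0)
toy_tree.fit(X_toy, y_toy)

print("depth actually reached:", toy_tree.get_depth())
print("training accuracy:", toy_tree.score(X_toy, y_toy))
```

`max_depth` is an upper bound: the fitted tree can be shallower if its nodes become pure before the limit is reached.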

In [None]:
from sklearn import tree

## 2.2
The following cell plots the decision boundaries of your binary tree. 
**Observe the differences between the binary tree, LDA, and QDA.**

In [None]:
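## Update this line if you already fitted your tree under a different name in 2.1
Tree = tree.DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
###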

Z_Tree = Tree.predict(np.c_[xx.ravel(), yy.ravel()])
Z_Tree = Z_Tree.reshape(xx.shape)
plt.contourf(xx,yy,Z_Tree)
plt.plot(X_test[y_test==1,0],X_test[y_test==1,1],'.b')
plt.plot(X_test[y_test==2,0],X_test[y_test==2,1],'.y')
plt.xlabel('V1')
plt.ylabel('V2')
plt.legend(['Class 1','Class 2'])
plt.show()

## 2.3
Gini impurity is one criterion for judging whether one candidate split produces purer nodes than another.

The Gini impurity of a node over $K$ classes is defined by
$H(D) = \sum_{i=1}^K p_i (1-p_i)$

Here $p_i$ is the proportion of the data points in that node that belong to class $i$.
To evaluate a split, we compute the weighted average of the Gini impurities of the two child nodes, weighting each child by its share of the parent's data points. Gini is a measure of impurity, so our goal is to
**minimize** it.
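
The formula can be sketched in a few lines of NumPy. The helper names `gini` and `split_gini` and the toy `parent` array are hypothetical, and the split score weights each child by its share of the samples, the convention CART-style trees use.

```python
import numpy as np

def gini(labels):
    """Gini impurity sum_i p_i (1 - p_i) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions within the node
    return float(np.sum(p * (1.0 - p)))

def split_gini(left, right):
    """Sample-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = np.array([1, 1, 1, 2, 2, 2])
print(gini(parent))                         # 0.5 for a 50/50 two-class node
print(split_gini(parent[:3], parent[3:]))   # 0.0 when both children are pure
```

For two classes the impurity ranges from 0 (a pure node) to 0.5 (an even mix), which gives a quick sanity check on any manual calculation.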

**Determine the Gini impurity of the original (training) data set. Compare it to the value sklearn reports for the root node of the depth-1 tree below.**

In [None]:
## sklearn Gini calculation
Tree_One = tree.DecisionTreeClassifier(max_depth=1)
Tree_One.fit(X_train,y_train)
print('(sklearn) Gini Impurity in original data set: ' + str(Tree_One.tree_.impurity[0]))

## manual Gini calculation

## 2.4 
We can use cross-validation to choose the best parameters for a classification model. For a binary tree, it can be used to choose the maximum depth.

**For each maximum depth from 1 to 15, perform 10-fold cross-validation on your binary tree, using your full data set. Calculate and plot the mean accuracy for each depth.**
**What is the ideal maximum tree depth for our data? What do you observe as the depth increases to 20? 30?**
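
A sketch of the depth sweep on synthetic data, using a shorter depth range to keep it quick. The names `X_cv`, `y_cv`, `mean_acc`, and `best_depth`, the XOR-like toy labels, and `random_state=0` are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Toy data with XOR-like labels: class 1 where the two features share a sign.
rng = np.random.default_rng(2)
X_cv = rng.normal(size=(300, 2))
y_cv = np.where(X_cv[:, 0] * X_cv[:, 1] > 0, 1, 2)

depths = range(1, 6)
mean_acc = []
for d in depths:
    clf = tree.DecisionTreeClassifier(max_depth=d, random_state=0)
    # With a classifier, cv=10 means stratified 10-fold; the default score is accuracy.
    scores = cross_val_score(clf, X_cv, y_cv, cv=10)
    mean_acc.append(scores.mean())

# Depth with the highest mean cross-validated accuracy.
best_depth = depths[int(np.argmax(mean_acc))]
print(best_depth, max(mean_acc))
```

Plotting `mean_acc` against `depths` with `plt.plot` then shows where extra depth stops improving the cross-validated accuracy.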


In [None]:
from sklearn.model_selection import cross_val_score