# Linear Discriminant Analysis

scikit-learn LDA documentation: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html <br>
Iris data set: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html <br>
Original LDA tutorial: https://www.statology.org/linear-discriminant-analysis-in-python/ <br>
scikit-learn model selection documentation: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection


## 0. Load Libraries

In [4]:
#LOAD NECESSARY LIBRARIES
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn import datasets

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# pip install dmba (install dmba if you don't have it already)
from dmba import classificationSummary


## LDA on the Iris Dataset
The goal is to perform LDA on the Iris dataset.

## Step 0. Load Iris Dataset

In [5]:
#Load Iris Dataset
iris = datasets.load_iris()
print(iris)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

The iris dataset returned by scikit-learn is a dictionary-like object rather than a table. <br>
**Data:** an array with the feature values. <br>
**Target:** a separate array with the class labels. <br>
We can use `np.c_` to concatenate these two arrays column-wise.

In [6]:
#LOAD AND VIEW IRIS DATASET
df = pd.DataFrame(data = np.c_[iris['data'], iris['target']],
                 columns = iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.columns = ['s_length', 's_width', 'p_length', 'p_width', 'target', 'species']
print(df.head())

   s_length  s_width  p_length  p_width  target species
0       5.1      3.5       1.4      0.2     0.0  setosa
1       4.9      3.0       1.4      0.2     0.0  setosa
2       4.7      3.2       1.3      0.2     0.0  setosa
3       4.6      3.1       1.5      0.2     0.0  setosa
4       5.0      3.6       1.4      0.2     0.0  setosa
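
As an alternative to concatenating the arrays by hand, `load_iris` can return a DataFrame directly through its `as_frame=True` option. The sketch below assumes scikit-learn 0.23 or later; the names `iris_bunch` and `df_alt` are introduced here only for illustration.

```python
from sklearn import datasets

# as_frame=True returns pandas objects (assumes scikit-learn >= 0.23)
iris_bunch = datasets.load_iris(as_frame=True)
df_alt = iris_bunch.frame  # four feature columns plus an integer 'target' column

# Rename to the short column names used in this notebook
df_alt.columns = ['s_length', 's_width', 'p_length', 'p_width', 'target']

# Map the integer codes 0, 1, 2 to the species names
df_alt['species'] = df_alt['target'].map(dict(enumerate(iris_bunch.target_names)))
print(df_alt.head())
```

Note that `target` is an integer column here, whereas the `np.c_` approach above stores it as a float.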


## Step 1. Separate the s_length, s_width, p_length, and p_width columns as the feature columns X, and species as the response y

In [8]:
X = df[['s_length', 's_width', 'p_length', 'p_width']]
y = df['species']

## Step 2. Initialize an LDA model and fit it using all the data

In [9]:
lda = LinearDiscriminantAnalysis().fit(X,y)
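
Once the model is fitted, a few attributes show what it has learned. The sketch below refits a separate model (`lda_demo`, a name used only here) on the raw arrays so that it runs on its own; the same attributes are available on the notebook's `lda`.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X_demo = iris.data
y_demo = iris.target_names[iris.target]  # species names as labels

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

print(lda_demo.classes_)      # class labels, in the order the model uses them
print(lda_demo.priors_)       # class priors estimated from the training data
print(lda_demo.means_.shape)  # one mean vector per class: (n_classes, n_features)

# Accuracy on the same data used for fitting; this is an optimistic estimate
print(lda_demo.score(X_demo, y_demo))
```

Because the training accuracy is measured on data the model has already seen, the cross-validation in the next step gives a more honest estimate.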


## Step 3. Use RepeatedStratifiedKFold to test the model

Documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html <br>
Use `n_splits=10`, `n_repeats=3`, and `random_state=1`. <br>

Use `cross_val_score` to find the average accuracy across the 3 repeats and 10 folds.

In [20]:
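# 10 stratified folds, repeated 3 times with different shuffles: 30 train/test splits
rskf = RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)

# cross_val_score refits a copy of the model on each training split
# and returns one accuracy score per split
scores = cross_val_score(lda, X, y, scoring='accuracy', cv=rskf, n_jobs=1)

print(f'The average accuracy is {np.mean(scores)}')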

The average accuracy is 0.9800000000000001
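
To see what the average is taken over, the splitting can be written out as an explicit loop. This is a rough sketch of the idea behind `cross_val_score`, not its actual implementation; the names `rskf_demo` and `fold_scores` are introduced only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold

iris = datasets.load_iris()
X_cv, y_cv = iris.data, iris.target

rskf_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

fold_scores = []
for train_idx, test_idx in rskf_demo.split(X_cv, y_cv):
    # Fit a fresh model on the training part of this split
    model = LinearDiscriminantAnalysis().fit(X_cv[train_idx], y_cv[train_idx])
    # Accuracy on the held-out part of this split
    fold_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))

fold_scores = np.array(fold_scores)
print(len(fold_scores))  # number of train/test splits (n_splits * n_repeats)
print(f'mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}')
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from split to split.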


## Step 4. Use your model to make a prediction on a new point
Predict which type of iris has the following dimensions: <br>
s_length = 5 <br>
s_width = 3 <br>
p_length = 1 <br>
p_width = 0.4
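
One possible way to make this prediction is sketched below. It refits a model (`lda_fit`, a name used only here) so that the block runs on its own; with the notebook's fitted `lda`, only the last three statements are needed. Passing the new point as a DataFrame with the same column names as the training data keeps the feature order explicit.

```python
import pandas as pd
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
cols = ['s_length', 's_width', 'p_length', 'p_width']
X_fit = pd.DataFrame(iris.data, columns=cols)
y_fit = iris.target_names[iris.target]  # species names as labels
lda_fit = LinearDiscriminantAnalysis().fit(X_fit, y_fit)

# The new observation, with the same column names used for fitting
new_point = pd.DataFrame([[5, 3, 1, 0.4]], columns=cols)

pred = lda_fit.predict(new_point)
print(pred)                                        # predicted species label
print(lda_fit.predict_proba(new_point).round(3))   # class probabilities, in lda_fit.classes_ order
```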

## Use your LDA model to transform X into X_LDA
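
A minimal sketch of the transformation follows. It fits a separate model (`lda_2d`, a hypothetical name) on the raw arrays so the block is self-contained. With three classes, LDA yields at most two discriminant components, so `n_components=2` is stated explicitly here, although it is also what the default gives for this dataset.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()

# With 3 classes there are at most 2 discriminant components
lda_2d = LinearDiscriminantAnalysis(n_components=2)
X_LDA = lda_2d.fit(iris.data, iris.target).transform(iris.data)

print(X_LDA.shape)                       # (n_samples, n_components)
print(lda_2d.explained_variance_ratio_)  # share of between-class variance per component
```

With the model fitted in Step 2, `X_LDA = lda.transform(X)` should give the same projection, since the projection does not depend on how the class labels are encoded.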

## Plot the data in the reduced-dimension space

In [None]:
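#CREATE LDA PLOT
# X_LDA is the LDA-transformed feature matrix from the previous step
y = iris.target  # integer class codes 0, 1, 2 (replaces the species labels in y)
target_names = iris.target_names

plt.figure()
colors = ['red', 'green', 'blue']

# One scatter series per class, plotted on the two discriminant components
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_LDA[y == i, 0], X_LDA[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.xlim(-10, 10)
plt.ylim(-3, 3)
plt.show()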