<h3>Linear Discriminant Analysis</h3>
This is a **supervised** algorithm, since its purpose is to separate known categories. From now on I will refer to linear discriminant analysis as LDA. It is a classifier, and it is useful when you need to classify a data point based on many features/dimensions.<br>

LDA is a way of maximising the separability of data points. It is like PCA (Principal Component Analysis) in that it reduces the number of dimensions in the dataset. The difference is that LDA looks for the projection that best separates known categories, whereas PCA looks for the directions that capture the most variance in the data, without using the category labels.
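
To make the contrast with PCA concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the exam dataset). The two classes differ only along the second feature, while the first feature carries most of the variance, so the two methods should pick different directions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data (assumption): the classes differ only in the second feature,
# while the first feature has far more variance and no class information.
n = 500
X0 = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.5, n)])
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.5, n)])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# PCA ignores the labels and follows the direction of largest variance.
pca_direction = PCA(n_components=1).fit(X).components_[0]

# LDA uses the labels; for two classes coef_[0] is the discriminant direction.
lda = LinearDiscriminantAnalysis(solver="svd").fit(X, y)
lda_direction = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

print("PCA direction (abs):", np.round(np.abs(pca_direction), 2))
print("LDA direction (abs):", np.round(np.abs(lda_direction), 2))
```

In this setup PCA should line up with the first feature and LDA with the second, which is the difference described above.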


In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import roc_curve

df_exams = pd.read_csv('../input/StudentsPerformance.csv')

print('Finished importing modules and data')

Finished importing modules and data


In [2]:
df_exams.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [3]:
#the target 'gender' is in text form
print(df_exams['gender'].unique())

#make a label encoder and change it to numerical so it can be used in the models
le = LabelEncoder()
le.fit(df_exams['gender'])
df_exams['gender'] = le.transform(df_exams['gender'])

le_lunch = LabelEncoder()
le_lunch.fit(df_exams['lunch'])
df_exams['lunch'] = le_lunch.transform(df_exams['lunch'])

#0 = female 1 = male
print(le.classes_)
print(le_lunch.classes_)

['female' 'male']
['female' 'male']
['free/reduced' 'standard']


In [4]:
colorsIdx = {0: '#F7B5E7', 1:'#4BDEB7'}
cols      = df_exams['gender'].map(colorsIdx)

graph_fig = go.Figure(
    data=[go.Scatter(x=df_exams['reading score'], 
                     y=df_exams['math score'],
                     mode='markers',
                     marker=dict(size=7,
                                 line=dict(width=1),
                                color=cols)
                    )],
    layout=go.Layout(
        title=go.layout.Title(text="Reading score against Math score by gender (Population)")))
    
graph_fig.update_xaxes(title_text='Reading Score')
graph_fig.update_yaxes(title_text='Math Score')

graph_fig.show()

In this small sample of students, we can see that females tended to score higher on the reading test and males tended to score higher on the math test. With this in mind, let's build an LDA classifier using just the math and reading scores as predictors, then introduce the writing score and 'lunch' to see if they make a difference. 

<h3>2 feature LDA</h3>

Let's say we have some data and we want to **linearly** separate the categories on one line.<br>

LDA uses all the features we supply to it and creates a new axis (we won't see this axis in any of the graphs produced, but the classification decision is made along it). It projects the data points onto this new single axis in a way that maximises the linear separation of the categories (in Euclidean distance).<br>

<b>Step 1</b>: When the data is on the new axis, we want to maximise the distance between the means $\mu$ of the categories we're separating.<br>
<b>Step 2</b>: We want to minimise the variance $\sigma^2$ within each category. This is sometimes known as scatter in LDA terminology.<br>

The following ratio combines both steps, so maximising it performs them simultaneously (for separating two categories):

$$ \frac{(\mu_{1} - \mu_{2})^2}{\sigma_{1}^2 + \sigma_{2}^2} $$<br>

This description is roughly what LDA does, but I am using the scikit-learn implementation (`LinearDiscriminantAnalysis`), which may vary slightly in how it works.
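
The ratio above can be checked numerically. The sketch below uses NumPy only, on synthetic two-class data (an assumption, not the exam data). The helper `fisher_ratio` is a hypothetical name introduced here; the closed-form direction, proportional to $S_w^{-1}(\mu_1 - \mu_2)$, is the textbook Fisher solution rather than a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data with correlated features (illustrative assumption).
cov = [[3.0, 2.0], [2.0, 3.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=400)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=400)

def fisher_ratio(w, A, B):
    """(mu_1 - mu_2)^2 / (sigma_1^2 + sigma_2^2) after projecting onto w."""
    a, b = A @ w, B @ w
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

# Textbook maximiser: w is proportional to S_w^{-1} (mu_1 - mu_2),
# where S_w is the sum of the within-class covariance matrices.
S_w = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
w_lda = np.linalg.solve(S_w, X1.mean(axis=0) - X2.mean(axis=0))
w_lda /= np.linalg.norm(w_lda)

w_axis = np.array([1.0, 0.0])  # project onto the first feature alone

print("ratio along first feature:", round(fisher_ratio(w_axis, X1, X2), 3))
print("ratio along LDA direction:", round(fisher_ratio(w_lda, X1, X2), 3))
```

The LDA direction should give a larger ratio than the raw feature axis, because it trades a little distance between the means for a lot less scatter within each category.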

In [5]:
def make_roc_graph(fpr,tpr):
    graph_fig = go.Figure(
        data=[go.Scatter(x=fpr, 
                         y=tpr,
                         marker=dict(size=7,
                                     line=dict(width=1),
                                     color='#4BDEB7')
                        )],
        layout=go.Layout(
            width=500,
            height=500,
            title=go.layout.Title(text="Roc Two Features")))
    graph_fig.update_xaxes(title_text='false positive rate')
    graph_fig.update_yaxes(title_text='true positive rate')
    graph_fig.show()

prediction_features = ['math score','reading score']
cv = StratifiedKFold(n_splits=5)
X = df_exams[prediction_features]
y = df_exams['gender']

roc_results = []
fpr_two = []
tpr_two = []
for train, test in cv.split(X, y):
    lda_model = LDA(solver="svd", store_covariance=True)
    y_pred = lda_model.fit(X.iloc[train], y.iloc[train]).predict(X.iloc[test])
    fpr, tpr, thresholds = roc_curve(y.iloc[test],y_pred)
    roc_results.append((fpr,tpr))
    fpr_two.append(fpr)
    tpr_two.append(tpr)

for fpr,tpr in roc_results:
    make_roc_graph(fpr,tpr)

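The cross-validated ROC (receiver operating characteristic) curves show the classifier performed best on the 2nd fold of data. If you have never seen a ROC curve before, it is very similar in concept to a confusion matrix: it shows how well a classifier has performed by plotting the true positive rate against the false positive rate. Because hard class predictions are passed to `roc_curve` here, each curve has a single point between the corners (0, 0) and (1, 1).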
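
A fuller ROC curve comes from passing continuous scores instead of hard 0/1 predictions. The following is a minimal sketch on synthetic data (an assumption standing in for the exam scores), using `decision_function` from scikit-learn's `LinearDiscriminantAnalysis` and `roc_auc_score` to summarise the curve in one number.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption, not the exam dataset).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LinearDiscriminantAnalysis(solver="svd").fit(X_train, y_train)

# Hard 0/1 predictions give one operating point plus the two corners.
fpr_hard, tpr_hard, _ = roc_curve(y_test, model.predict(X_test))

# Continuous scores let roc_curve sweep the threshold and trace a full curve.
scores = model.decision_function(X_test)
fpr_soft, tpr_soft, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print("points from hard predictions:", len(fpr_hard))
print("points from decision scores: ", len(fpr_soft))
print("AUC from decision scores:    ", round(auc, 3))
```

The area under the curve (AUC) is convenient for comparing folds or feature sets, since it does not depend on choosing a single threshold.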

<h4>Multiple Categories</h4>
If we have multiple categories, then we use a slightly different formula. For simplicity, first write the two-category numerator as $\text{d}^2 = (\mu_{1} - \mu_{2})^2$.<br><br>
Now we choose a point central to all the data points.
Then, for each category, we find the squared distance $\text{d}_k^2$ between the centre of that category and the central point, and use the following formula (shown for three categories):<br><br>
$$ \frac{\text{d}_1^2 + \text{d}_2^2 + \text{d}_3^2}{\sigma_{1}^2 + \sigma_{2}^2 + \sigma_{3}^2} $$<br>

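Again I am using scikit-learn, but it is likely working in a similar way. The gender example that follows still has only two categories; what changes is the number of features.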
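
For a case that really has more than two categories, here is a short sketch on three synthetic blobs (an assumption for illustration). With three categories LDA can produce at most two discriminant axes. The helper `scatter_ratio` is a hypothetical name that evaluates the formula above on a single axis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three synthetic categories in four dimensions (illustrative assumption).
X, y = make_blobs(n_samples=600, centers=3, n_features=4, random_state=0)

lda = LinearDiscriminantAnalysis(solver="svd")
X_proj = lda.fit_transform(X, y)  # at most n_categories - 1 = 2 axes

def scatter_ratio(z, labels):
    """(d_1^2 + d_2^2 + d_3^2) / (sigma_1^2 + sigma_2^2 + sigma_3^2) on one axis."""
    centre = z.mean()
    groups = [z[labels == k] for k in np.unique(labels)]
    between = sum((g.mean() - centre) ** 2 for g in groups)
    within = sum(g.var() for g in groups)
    return between / within

ratio_raw = scatter_ratio(X[:, 0], y)       # first original feature
ratio_lda = scatter_ratio(X_proj[:, 0], y)  # first discriminant axis

print("original shape: ", X.shape)
print("projected shape:", X_proj.shape)
print("ratio on first raw feature:      ", round(ratio_raw, 3))
print("ratio on first discriminant axis:", round(ratio_lda, 3))
```

The first discriminant axis should score at least as high as any single original feature on this ratio, since that is the quantity it is chosen to maximise.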

In [6]:
def make_roc_graph(fpr,tpr):
    graph_fig = go.Figure(
        data=[go.Scatter(x=fpr, 
                         y=tpr,
                         marker=dict(size=7,
                                     line=dict(width=1),
                                     color='#f76d7f')
                        )],
        layout=go.Layout(
            width=500,
            height=500,
            title=go.layout.Title(text="Roc Multiple Features")))
    graph_fig.update_xaxes(title_text='false positive rate')
    graph_fig.update_yaxes(title_text='true positive rate')
    graph_fig.show()

prediction_features = ['math score','writing score','reading score','lunch']
    
cv = StratifiedKFold(n_splits=5)
X = df_exams[prediction_features]
y = df_exams['gender']

roc_results = []
fpr_multi = []
tpr_multi = []
for train, test in cv.split(X, y):
    lda_model = LDA(solver="svd", store_covariance=True)
    y_pred = lda_model.fit(X.iloc[train], y.iloc[train]).predict(X.iloc[test])
    fpr, tpr, thresholds = roc_curve(y.iloc[test],y_pred)
    roc_results.append((fpr,tpr))
    fpr_multi.append(fpr)
    tpr_multi.append(tpr)

for fpr,tpr in roc_results:
    make_roc_graph(fpr,tpr)

In [7]:
def sum_fpr_tpr(fpr_tpr_list):
    total = 0
    for a in fpr_tpr_list:
        for b in a:
            total = total + b
    print(total)

#false positive sums (we want this to be as low as possible)
sum_fpr_tpr(fpr_two)
sum_fpr_tpr(fpr_multi)

#true positive sums (we want this to be as high as possible)
sum_fpr_tpr(tpr_two)
sum_fpr_tpr(tpr_multi)

#we can see that using more features performs marginally 'better' for both true positive and false positive rates
#this is also evident from the ROC graphs, but here are some hard numbers

5.82029499626587
5.482356235997012
9.024484536082474
9.294888316151203


We can see in this instance that adding features has raised the summed true positive rate (from 9.02 to 9.29) and lowered the summed false positive rate (from 5.82 to 5.48). This is good, as it shows that the extra features helped us classify slightly more accurately.
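
Another way to compare feature sets is a single cross-validated score per fold. The sketch below uses scikit-learn's `cross_val_score` with `scoring="roc_auc"` on synthetic data (an assumption; the column split into "two" and "four" features is purely illustrative and does not reproduce the numbers above).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with four informative features (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

cv = StratifiedKFold(n_splits=5)
model = LinearDiscriminantAnalysis(solver="svd")

# One AUC per fold, first with two features and then with all four.
auc_two = cross_val_score(model, X[:, :2], y, cv=cv, scoring="roc_auc")
auc_all = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("mean AUC, first two features:", round(auc_two.mean(), 3))
print("mean AUC, all four features: ", round(auc_all.mean(), 3))
```

Averaging the per-fold AUC gives a number that is easier to interpret than a sum of curve points, because it always lies between 0 and 1.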

In [8]:
le_race = LabelEncoder()
le_race.fit(df_exams['race/ethnicity'])
df_exams['race/ethnicity'] = le_race.transform(df_exams['race/ethnicity'])

le_ped = LabelEncoder()
le_ped.fit(df_exams['parental level of education'])
df_exams['parental level of education'] = le_ped.transform(df_exams['parental level of education'])

le_prep = LabelEncoder()
le_prep.fit(df_exams['test preparation course'])
df_exams['test preparation course'] = le_prep.transform(df_exams['test preparation course'])

prediction_features = ['race/ethnicity',
                       'parental level of education',
                       'lunch','test preparation course',
                       'math score','reading score','writing score']

X_train, X_test, y_train, y_test = train_test_split(df_exams[prediction_features],
                                                    df_exams['gender'], test_size=0.2, random_state=0)

lda_model = LDA(solver="svd", store_covariance=True)
y_pred = lda_model.fit(X_train, y_train).predict(X_test)

<h4>Actual Vs Predicted Visualised</h4>
Although this has already been shown numerically and with ROC curves, here is the visual difference between the actual and predicted values. It shows that the LDA classifier has worked well with the dimensions it was given.

In [9]:
X_test = X_test.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X_test['actual test gender'] = y_test
colorsIdx = {0: '#F7B5E7', 1:'#4BDEB7'}
cols      = X_test['actual test gender'].map(colorsIdx)

graph_fig = go.Figure(
    data=[go.Scatter(x=X_test['reading score'], 
                     y=X_test['math score'],
                     mode='markers',
                     marker=dict(size=7,
                                 line=dict(width=1),
                                color=cols)
                    )],
    layout=go.Layout(
        title=go.layout.Title(text="Reading score against Math score by gender Actual")))
    
graph_fig.update_xaxes(title_text='Reading Score')
graph_fig.update_yaxes(title_text='Math Score')

graph_fig.show()
#this graph shows the __actual__ genders

In [10]:
X_test['predicted gender'] = y_pred
colorsIdx = {0: '#F7B5E7', 1:'#4BDEB7'}
cols      = X_test['predicted gender'].map(colorsIdx)

graph_fig = go.Figure(
    data=[go.Scatter(x=X_test['reading score'], 
                     y=X_test['math score'],
                     mode='markers',
                     marker=dict(size=7,
                                 line=dict(width=1),
                                color=cols)
                    )],
    layout=go.Layout(
        title=go.layout.Title(text="Reading score against Math score by gender Predicted")))
    
graph_fig.update_xaxes(title_text='Reading Score')
graph_fig.update_yaxes(title_text='Math Score')

graph_fig.show()
#this graph shows what was __predicted__