# Linear and Quadratic Discriminant Analysis


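### Principal Component Analysis

PCA finds the best direction (axis) onto which to project the data.
The greater the distance between the projected points, i.e. the greater their variance, the better the projection.

The 1st principal component is the direction along which the original data varies the most.

The 2nd principal component is the direction at right angles to the first that captures the most of the remaining variance.

In general, there are as many principal components as there are dimensions in the original data.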
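The points above can be checked with a small sketch. The 3-D data is synthetic and the mixing matrix is an arbitrary choice made only for illustration; it is not part of the Titanic example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic 3-D data; the mixing matrix is an arbitrary, illustrative choice
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# With n_components left unset, PCA keeps min(n_samples, n_features) components
pca = PCA()
pca.fit(X)

n_components = pca.components_.shape[0]

# Each row of components_ is one direction; the pairwise dot products
# form the identity matrix when the directions are mutually perpendicular
gram = pca.components_ @ pca.components_.T

print('number of components:', n_components)
print('pairwise dot products:')
print(np.round(gram, 6))
print('share of variance per component:', pca.explained_variance_ratio_)
```

The components come back ordered by the share of variance they explain, so the first row is the 1st principal component.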

### LDA

* Like PCA, LDA projects the data onto new axes
* LDA chooses the axes that maximise the distance between points of different categories (it preserves the separation between categories)
* LDA works fine when the x variables have the same covariance for every value of y
* LDA performs poorly when the covariances of X are a function of the value of Y
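A minimal sketch of the difference between the two projections, on synthetic data built so that the direction of greatest variance is not the direction that separates the classes. The data and the `separation` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
n = 300

# Both classes are stretched along the first axis (std 5) but differ
# only along the second axis (means -1 and +1, std 0.5)
class0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.5], size=(n, 2))
class1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.5], size=(n, 2))
X = np.vstack([class0, class1])
y = np.array([0] * n + [1] * n)


def separation(z, labels):
    """Hypothetical helper: gap between projected class means in pooled-std units."""
    z0, z1 = z[labels == 0], z[labels == 1]
    pooled_std = np.sqrt((z0.var() + z1.var()) / 2)
    return abs(z0.mean() - z1.mean()) / pooled_std


# PCA ignores the labels; LDA uses them to pick the axis
z_pca = PCA(n_components=1).fit_transform(X).ravel()
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

pca_sep = separation(z_pca, y)
lda_sep = separation(z_lda, y)

print('class separation on 1st PCA axis:', pca_sep)
print('class separation on LDA axis:    ', lda_sep)
```

On data like this the PCA axis follows the spread of the points, while the LDA axis follows the gap between the classes.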

### QDA

* A variant of LDA
* Suited to cases where the x variables corresponding to different y labels have different covariances
* QDA separates the points with a quadratic (curved) decision boundary rather than a straight line

<img src="./images/qdalda.png">
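A rough sketch of the case QDA is suited to, assuming synthetic data in which the two classes share a mean but have very different covariances. The sample size and spreads are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Same centre for both classes, very different spread
tight = rng.normal(0.0, 0.5, size=(n, 2))  # class 0: small covariance
wide = rng.normal(0.0, 3.0, size=(n, 2))   # class 1: large covariance
X = np.vstack([tight, wide])
y = np.array([0] * n + [1] * n)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# score() returns the mean accuracy on the held-out data
lda_acc = LinearDiscriminantAnalysis().fit(x_train, y_train).score(x_test, y_test)
qda_acc = QuadraticDiscriminantAnalysis().fit(x_train, y_train).score(x_test, y_test)

print('LDA test accuracy:', lda_acc)
print('QDA test accuracy:', qda_acc)
```

No straight line can separate a tight cluster from a wide one around the same centre, whereas a curved boundary around the tight cluster can.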

### Covariance
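Covariance measures the relationship between two variables, specifically whether greater values of one variable
correspond to greater values of the other.

* +ve covariance: greater values of one variable tend to go with greater values of the other
* -ve covariance: greater values of one variable tend to go with smaller values of the other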
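The two cases can be illustrated with `numpy.cov`. The variables here are synthetic and their names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1000)
noise = rng.normal(scale=0.1, size=1000)

y_up = 2.0 * x + noise     # rises when x rises
y_down = -2.0 * x + noise  # falls when x rises

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry
# is the covariance between the two variables
cov_pos = np.cov(x, y_up)[0, 1]
cov_neg = np.cov(x, y_down)[0, 1]

print('+ve covariance:', cov_pos)
print('-ve covariance:', cov_neg)
```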
    
    

### LDA

Find the best axes
> The best axes are those that best separate the data into different classes

### QDA
Find the axes that best separate the classes
> The classes are separated so that all instances of a class fall on the same side of the boundary, but the decision boundary is quadratic rather than linear


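One-hot encoding can result in collinearity of features, because the one-hot columns of a category always sum to 1 and any one of them can be derived from the others.
Instead, use dummy encoding: drop one of the one-hot encoded columns.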
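A small sketch of the two encodings using `pandas.get_dummies`. The five-row `Embarked` column is made up for illustration. Note that `drop_first=True` drops the first category, whereas the notebook further down drops the last column with `FEATURES[0:-1]`; either choice removes the redundancy.

```python
import pandas as pd

# Illustrative data, not the real Titanic file
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C']})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked', dtype=int)

# Dummy encoding: the same, minus the first category's column
dummy = pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True, dtype=int)

# Every one-hot row sums to 1, which is the source of the collinearity
row_sums = one_hot.sum(axis=1)

print(one_hot)
print(dummy)
print('row sums of one-hot columns:', row_sums.tolist())
```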

In [1]:
def summarize_classification(y_test, y_pred):
    """Return accuracy, precision and recall for a binary classifier."""
    # Fraction of correctly classified samples
    acc = accuracy_score(y_test, y_pred, normalize=True)
    # Number of correctly classified samples
    num_acc = accuracy_score(y_test, y_pred, normalize=False)

    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    return {
        'accuracy': acc,
        'precision': prec,
        'recall': recall,
        'accuracy_count': num_acc
    }
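To make the four scores concrete, here is a worked example on six hand-made labels (illustrative values only). It has 2 true positives, 2 true negatives, 1 false positive and 1 false negative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels for illustration
y_true = [1, 0, 1, 1, 0, 0]
y_hat = [1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)                     # 4 correct out of 6
count = accuracy_score(y_true, y_hat, normalize=False)  # number correct
prec = precision_score(y_true, y_hat)                   # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true, y_hat)                       # TP / (TP + FN) = 2 / 3

print('accuracy', acc)
print('accuracy_count', count)
print('precision', prec)
print('recall', rec)
```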

In [2]:
def build_model(classifier_fn,
                name_of_y_col,
                name_of_x_cols,
                dataset,
                test_frac=0.2
               ):
    """Train a classifier on a random split and summarise its scores."""
    X = dataset[name_of_x_cols]
    Y = dataset[name_of_y_col]

    # The split is random (no random_state), so scores vary between runs
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)

    model = classifier_fn(x_train, y_train)

    # Predict on both sets so training and test scores can be compared
    y_pred = model.predict(x_test)
    y_pred_train = model.predict(x_train)

    train_summary = summarize_classification(y_train, y_pred_train)
    test_summary = summarize_classification(y_test, y_pred)

    pred_results = pd.DataFrame({
        'y_test': y_test,
        'y_pred': y_pred
    })

    # Rows are predicted labels, columns are actual labels
    model_crosstab = pd.crosstab(pred_results.y_pred, pred_results.y_test)

    return {
        'training': train_summary,
        'test': test_summary,
        'confusion_matrix': model_crosstab
    }

In [3]:
def compare_results():
    """Print training and test scores for every model stored in result_dict."""
    for key in result_dict:
        print('Classification: ',key)
        print()
        print('Training data')
        for score in result_dict[key]['training']:
            print(score,result_dict[key]['training'][score])
            
        print()
        print('Test data')
        for score in result_dict[key]['test']:
            print(score,result_dict[key]['test'][score])
            
        print()
        
        
        
        


Solver
* `svd`, singular value decomposition
    * Finds the axes without calculating the covariance matrix of the features
    * Useful when the dataset has many features
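A short sketch of how the `solver` argument is passed, comparing `'svd'` with the other two solvers scikit-learn offers (`'lsqr'` and `'eigen'`). The 5-feature data is synthetic and balanced, chosen only for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n = 200

# Two balanced classes in 5 dimensions with shifted means (illustrative)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n, 5)),
    rng.normal(2.0, 1.0, size=(n, 5)),
])
y = np.array([0] * n + [1] * n)

predictions = {}
accuracies = {}
for solver in ('svd', 'lsqr', 'eigen'):
    model = LinearDiscriminantAnalysis(solver=solver).fit(X, y)
    predictions[solver] = model.predict(X)
    accuracies[solver] = model.score(X, y)

# Fraction of points on which 'svd' and 'lsqr' give the same label
agreement = np.mean(predictions['svd'] == predictions['lsqr'])

print(accuracies)
print('svd vs lsqr agreement:', agreement)
```

On a small, well-conditioned dataset like this the solvers are expected to classify almost identically; the choice matters more for cost and numerical stability on wide datasets.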


In [5]:
result_dict = {}

In [4]:
def linear_discriminant_fn(x_train,y_train,solver='svd'):
    model = LinearDiscriminantAnalysis(solver=solver)
    model.fit(x_train,y_train)
    return model

In [21]:
# QDA is useful when the x variables corresponding to different labels have different covariances,
# i.e. the covariance of x differs for each value of y

def quadratic_discriminant_fn(x_train,y_train):
    model = QuadraticDiscriminantAnalysis()
    model.fit(x_train,y_train)
    return model

In [12]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score, recall_score

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [13]:
titanic_df = pd.read_csv('solutions/02/demos/code/datasets/titanic_processed.csv')

In [14]:
FEATURES = titanic_df.columns.tolist()[1:]

In [18]:
titanic_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,0,14.0,0,0,7.8542,0,0,1
1,1,1,1,28.0,0,0,26.55,0,0,1
2,1,1,0,36.0,1,2,120.0,0,0,1
3,0,3,1,17.0,1,0,7.0542,0,0,1
4,0,3,1,4.0,4,2,31.275,0,0,1


In [15]:
result_dict['survived - linear_discriminant_analysis'] = build_model(
    linear_discriminant_fn,
    'Survived',
    FEATURES,
    titanic_df
)



In [16]:
compare_results()

Classification:  survived - linear_discriminant_analysis

Training data
accuracy 0.7838312829525483
precision 0.75
recall 0.7012987012987013
accuracy_count 446

Test data
accuracy 0.8461538461538461
precision 0.8301886792452831
recall 0.7719298245614035
accuracy_count 121



In [19]:
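# Dummy encoding: drop the last one-hot column (Embarked_S) to avoid collinearity.
# This reuses the same key, so it overwrites the earlier result.
result_dict['survived - linear_discriminant_analysis'] = build_model(
    linear_discriminant_fn,
    'Survived',
    FEATURES[0:-1],
    titanic_df
)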

In [20]:
compare_results()

Classification:  survived - linear_discriminant_analysis

Training data
accuracy 0.8101933216168717
precision 0.7699530516431925
recall 0.7354260089686099
accuracy_count 461

Test data
accuracy 0.7202797202797203
precision 0.711864406779661
recall 0.6461538461538462
accuracy_count 103



In [22]:
result_dict['survived - quadratic_discriminant_analysis'] = build_model(
    quadratic_discriminant_fn,
    'Survived',
    FEATURES[0:-1],
    titanic_df
)

In [23]:
compare_results()

Classification:  survived - linear_discriminant_analysis

Training data
accuracy 0.8101933216168717
precision 0.7699530516431925
recall 0.7354260089686099
accuracy_count 461

Test data
accuracy 0.7202797202797203
precision 0.711864406779661
recall 0.6461538461538462
accuracy_count 103

Classification:  survived - quadratic_discriminant_analysis

Training data
accuracy 0.7943760984182777
precision 0.7581395348837209
recall 0.7149122807017544
accuracy_count 452

Test data
accuracy 0.7692307692307693
precision 0.7454545454545455
recall 0.6833333333333333
accuracy_count 110

