Name: Ethan Paek

Date: 4/15/2020

Topic: COEN 140 Lab 3

Description: Use Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) to classify the provided dataset. The dataset can be downloaded at http://www.cse.scu.edu/~yfang/coen140/iris.data

In [1]:
import numpy as np
import pandas as pd

## Step 1: Import the data and split it into training and testing subsets

In [2]:
# Load the dataset from the course website; the file has no header row
data = pd.read_csv(
    'http://www.cse.scu.edu/~yfang/coen140/iris.data',
    header=None,
    sep=',',
)
print(data.head())
print(data[4].value_counts())

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: 4, dtype: int64


In [3]:
# Rows are grouped by class (50 each): the first 40 of each class (80%) go into train, the last 10 (20%) into test
train = pd.concat([data[0:40], data[50:90], data[100:140]])
test = pd.concat([data[40:50], data[90:100], data[140:150]])
print(train.shape)
print(test.shape)

(120, 5)
(30, 5)


In [4]:
# Features are the first four columns; the class label is the last column
X_train = train.iloc[:, :-1]
X_test = test.iloc[:, :-1]
y_train = train.iloc[:, -1]
y_test = test.iloc[:, -1]
print("Number of elements in training subset:", len(X_train))
print("Number of elements in testing subset:", len(X_test))

Number of elements in training subset: 120
Number of elements in testing subset: 30
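
The hard-coded slices in Step 1 rely on the file listing exactly 50 rows per class, in order. As a minimal sketch, the same "first 80% of each class" split can be written with `groupby(...).head(...)`. The `toy` frame below is a hypothetical stand-in with the same layout as `data` (four numeric columns, class label in column 4) so that the block runs without downloading anything; `n_train_per_class` is likewise an assumed name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: 150 rows, 4 numeric columns, class label in column 4
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(150, 4)))
toy[4] = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 50)

n_train_per_class = 40  # 80% of the 50 rows in each class

# The first 40 rows of every class go to train; whatever is left goes to test
toy_train = toy.groupby(4).head(n_train_per_class)
toy_test = toy.drop(toy_train.index)

print(toy_train.shape, toy_test.shape)
```

Because the split is keyed on the label column rather than on row positions, it should keep working if the rows were reordered, as long as each class still has 50 rows.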


## Step 2: Build an LDA classifier based on the training data

In [5]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(solver="svd", store_covariance=True)
# Fit once on the training data, then predict on both subsets
lda.fit(X_train, y_train)
y_pred_lda_train = lda.predict(X_train)
y_pred_lda_test = lda.predict(X_test)
print(y_pred_lda_train)
print(y_pred_lda_test)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versico

In [12]:
# Compare LDA/QDA predictions against the actual labels.
# Note: despite its name, this function returns the error rate (%) as a string.
def calculate_accuracy(analysis, test, total):
    n_correct = 0
    for i in range(total):
        if analysis[i] == test.iloc[i]:
            n_correct += 1
        else:
            # Report each misclassification: predicted label, actual label, position
            print("\nMismatch")
            print(analysis[i])
            print(test.iloc[i])
            print("Index:", i)
    accuracy = (n_correct / total) * 100
    error = 100 - accuracy
    return str(error)
print("\nError rate for LDA on training subset: " + calculate_accuracy(y_pred_lda_train, y_train, len(y_train)) + "%")
print("Error rate for LDA on testing subset: " + calculate_accuracy(y_pred_lda_test, y_test, len(y_test)) + "%")


Mismatch
Iris-virginica
Iris-versicolor
Index: 60

Mismatch
Iris-virginica
Iris-versicolor
Index: 73

Mismatch
Iris-versicolor
Iris-virginica
Index: 113

Error rate for LDA on training subset: 2.5%
Error rate for LDA on testing subset: 0.0%
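
The element-by-element loop in `calculate_accuracy` can also be expressed as a NumPy array comparison. The following is a minimal sketch, not part of the lab: `error_rate` and `mismatch_positions` are hypothetical helper names, and the toy labels are made up for illustration.

```python
import numpy as np

def error_rate(y_pred, y_true):
    """Return the percentage of positions where prediction and label differ."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred != y_true) * 100)

def mismatch_positions(y_pred, y_true):
    """Return the positional indices of the misclassified samples."""
    return np.flatnonzero(np.asarray(y_pred) != np.asarray(y_true))

# Toy labels (made up for illustration): one of four predictions is wrong
toy_true = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'])
toy_pred = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica'])

print(error_rate(toy_pred, toy_true))          # 25.0
print(mismatch_positions(toy_pred, toy_true))  # [2]
```

Converting both inputs with `np.asarray` makes the comparison positional, so a pandas Series such as `y_train` can be passed directly even though its index is not 0..n-1. Returning a float instead of a string keeps the result usable in further arithmetic.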


## Step 3: Build a QDA classifier based on the training data

In [7]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

qda = QDA(store_covariance=True)
# Fit once on the training data, then predict on both subsets
qda.fit(X_train, y_train)
y_pred_qda_train = qda.predict(X_train)
y_pred_qda_test = qda.predict(X_test)
print(y_pred_qda_train)
print(y_pred_qda_test)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versico

In [11]:
print("\nError rate for QDA on training subset: " + calculate_accuracy(y_pred_qda_train, y_train, len(y_train)) + "%")
print("Error rate for QDA on testing subset: " + calculate_accuracy(y_pred_qda_test, y_test, len(y_test)) + "%")


Mismatch
Iris-virginica
Iris-versicolor
Index: 60

Mismatch
Iris-virginica
Iris-versicolor
Index: 73

Error rate for QDA on training subset: 1.6666666666666714%
Error rate for QDA on testing subset: 0.0%
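
Both classifiers are created with `store_covariance=True`, which makes their structural difference easy to inspect: LDA estimates a single covariance matrix shared by all classes, while QDA estimates one per class. The sketch below illustrates this under two assumptions that differ from the lab: it uses scikit-learn's bundled copy of the iris data (`load_iris`) instead of the course file, and it fits on all 150 rows rather than on the training subset. The names `lda_demo` and `qda_demo` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Assumption: scikit-learn's bundled iris data stands in for the course file
X_all, y_all = load_iris(return_X_y=True)

lda_demo = LinearDiscriminantAnalysis(solver="svd", store_covariance=True).fit(X_all, y_all)
qda_demo = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_all, y_all)

# LDA: a single 4x4 covariance matrix shared by all three classes
print(lda_demo.covariance_.shape)

# QDA: one 4x4 covariance matrix per class
print(len(qda_demo.covariance_), qda_demo.covariance_[0].shape)
```

The per-class covariances are what allow QDA to draw curved decision boundaries, at the cost of estimating more parameters from the same 40 training rows per class.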


## Step 4: Check for unimportant variables in classifying iris type

In [9]:
# Map each column index of the dataset to its attribute name
feature_dict = dict(enumerate((
    'sepal length in cm',
    'sepal width in cm',
    'petal length in cm',
    'petal width in cm',
)))
print(feature_dict)

{0: 'sepal length in cm', 1: 'sepal width in cm', 2: 'petal length in cm', 3: 'petal width in cm'}


In [10]:
# calculate_accuracy returns error rates (%), so these dictionaries hold error rates
LDA_error = {}
QDA_error = {}

for feature in range(4):
    # Drop one feature from both subsets
    X_train_drop = X_train.drop(feature, axis=1)
    X_test_drop = X_test.drop(feature, axis=1)

    # Refit LDA and QDA on the remaining three features and predict on the test subset
    y_pred_lda_test_drop = lda.fit(X_train_drop, y_train).predict(X_test_drop)
    y_pred_qda_test_drop = qda.fit(X_train_drop, y_train).predict(X_test_drop)

    # Key each error rate by the name of the dropped attribute
    LDA_error[feature_dict[feature]] = calculate_accuracy(y_pred_lda_test_drop, y_test, len(y_test))
    QDA_error[feature_dict[feature]] = calculate_accuracy(y_pred_qda_test_drop, y_test, len(y_test))

print('\nLDA error rates (%) when respective attribute is dropped: ', LDA_error)
print('QDA error rates (%) when respective attribute is dropped: ', QDA_error)


Mismatch
Iris-versicolor
Iris-virginica
Index: 21

Mismatch
Iris-versicolor
Iris-virginica
Index: 21

LDA error rates (%) when respective attribute is dropped:  {'sepal length in cm': '0.0', 'sepal width in cm': '0.0', 'petal length in cm': '0.0', 'petal width in cm': '3.3333333333333286'}
QDA error rates (%) when respective attribute is dropped:  {'sepal length in cm': '0.0', 'sepal width in cm': '0.0', 'petal length in cm': '0.0', 'petal width in cm': '3.3333333333333286'}


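#### From the above results, we can conclude that 'petal width' is the most important of the four attributes for classifying iris type: dropping it raised the test error rate from 0% to about 3.33% (1 of 30 samples misclassified) for both LDA and QDA. Dropping sepal length, sepal width, or petal length left the test error rate at 0%, so none of these three appears to be essential on its own when the other attributes are present. Because the test subset contains only 30 samples, this ranking rests on a single misclassification.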
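
With only 30 test samples, each misclassification moves the error rate by about 3.33 percentage points, so a cross-validated version of the drop-one-feature comparison may give a less noisy picture. The sketch below is a swapped-in technique, not the lab's method: it assumes scikit-learn's bundled iris data (`load_iris`) in place of the course file, uses 5-fold stratified cross-validation with an arbitrary `random_state`, and evaluates LDA only. The name `cv_error` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: scikit-learn's bundled iris data stands in for the course file
iris = load_iris()
X_all, y_all = iris.data, iris.target

# Stratified folds keep the three classes balanced in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_error = {}
for j, name in enumerate(iris.feature_names):
    # Remove column j and estimate the LDA error rate (%) on the remaining three features
    X_drop = np.delete(X_all, j, axis=1)
    scores = cross_val_score(LinearDiscriminantAnalysis(), X_drop, y_all, cv=cv)
    cv_error[name] = 100 * (1 - scores.mean())

for name, err in cv_error.items():
    print(f"dropped {name}: {err:.2f}% error")
```

Every sample is used for testing exactly once across the five folds, so the resulting error rates are based on all 150 rows instead of 30.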