# Question

In this question, you will explore the concept of Mahalanobis distance and its application to classifying samples from the Iris dataset.
The Iris dataset is a commonly used dataset in machine learning and consists of three classes of iris plants: Setosa, Versicolor, and Virginica.
You will compute the Mahalanobis distance from one sample of each class to every class and classify each sample by its smallest distance.
Tasks:
a. Load the Iris dataset (the CSV file is available in the classroom).
b. Choose one random sample from each class (Setosa, Versicolor, and Virginica); these three samples act as the test data.
c. Compute the mean vector and covariance matrix for each class, excluding the test sample picked in the previous part.
d. Calculate the Mahalanobis distance between each selected sample and each class using the formula:
Mahalanobis distance = sqrt((x - μ)ᵀ Σ⁻¹ (x - μ))
Where:
x is the feature vector of the sample.
μ is the mean vector of the class.
Σ⁻¹ is the inverse of the covariance matrix of the class.
e. Compare the Mahalanobis distances from each sample to the three classes and assign the sample to the class with the smallest distance.
f. Print the original class and the predicted class for each sample, along with their Mahalanobis distances.
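
To make the formula concrete, here is a minimal worked example on made-up 2-D numbers (not Iris data). The names `mu`, `sigma`, and `x` are illustrative only. With a diagonal covariance, the Mahalanobis distance simply rescales each feature by its standard deviation, so a point that is 2 units away along a feature with variance 4 ends up at distance 1.

```python
import numpy as np

# Made-up 2-D example (not Iris data): a class centred at the origin whose
# first feature has variance 4 and whose second feature has variance 1.
mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])

x = np.array([2.0, 0.0])
diff = x - mu

# Plain Euclidean distance ignores the spread of the class
euclidean = np.sqrt(diff @ diff)
# Mahalanobis distance weights the difference by the inverse covariance
mahalanobis = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)

print(euclidean)    # 2.0
print(mahalanobis)  # 1.0
```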


## Importing Libraries

In [570]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## New Functions

In [571]:
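def mahalanobis_distance(x, mean, covariance_matrix):
    # Cast to float so that object-dtype inputs (e.g. rows taken from DataFrame.values) work
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    cov_inverse = np.linalg.inv(covariance_matrix)
    # Quadratic form (x - mean)^T Σ⁻¹ (x - mean); a 1-D array needs no explicit transpose
    squared_dist = np.dot(np.dot(diff, cov_inverse), diff)
    return np.sqrt(squared_dist)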
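
As a possible alternative to the explicit inverse, the same quadratic form can be evaluated by solving a linear system, which is often preferred numerically. This is only a sketch: the function name `mahalanobis_distance_solve` and the test numbers are made up for illustration and are not part of the notebook.

```python
import numpy as np

def mahalanobis_distance_solve(x, mean, covariance_matrix):
    """Mahalanobis distance computed without forming the explicit inverse."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve covariance_matrix @ z = diff, so that z equals Σ⁻¹ (x - mean)
    z = np.linalg.solve(np.asarray(covariance_matrix, dtype=float), diff)
    return np.sqrt(diff @ z)

# Made-up numbers for a quick comparison with the explicit-inverse form
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = np.array([1.0, 2.0])
mean = np.array([0.0, 0.0])

via_inverse = np.sqrt((x - mean) @ np.linalg.inv(cov) @ (x - mean))
via_solve = mahalanobis_distance_solve(x, mean, cov)
print(np.isclose(via_solve, via_inverse))  # True
```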

In [572]:
def mahalanobis_classification(unique_classes, class_mean_vectors, class_covariances, x_test):
    y_pred = []
    for sample in x_test:
        # Distance from this sample to every class distribution
        distances = {}
        for class_label in unique_classes:
            distances[class_label] = mahalanobis_distance(sample, class_mean_vectors[class_label], class_covariances[class_label])
        # Predict the nearest class and keep its distance alongside the label
        nearest = min(distances, key=distances.get)
        y_pred.append([nearest, distances[nearest]])
    return y_pred

In [573]:
def check_accuracy(y_pred, y_test):
    return np.sum(y_pred == y_test)/len(y_test)
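
Note that `check_accuracy` relies on `==` being element-wise, which holds for NumPy arrays; with plain Python lists the comparison yields a single boolean. A slightly more defensive variant could look like the sketch below, where the name `check_accuracy_safe` is hypothetical.

```python
import numpy as np

def check_accuracy_safe(y_pred, y_test):
    """Fraction of matching labels; accepts lists as well as arrays."""
    y_pred = np.asarray(y_pred)
    y_test = np.asarray(y_test)
    if y_pred.shape != y_test.shape:
        raise ValueError("y_pred and y_test must have the same shape")
    # The mean of a boolean array is the fraction of True entries
    return float(np.mean(y_pred == y_test))

# Two of the three made-up labels match
accuracy = check_accuracy_safe([0, 1, 2], [0, 1, 1])
print(accuracy)
```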

## Pre-processing

Importing the dataset

In [574]:
dataset = pd.read_csv('iris.csv')
dataset

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [575]:
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

In [576]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [577]:
print(y)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor

Encoding the dependent variable

In [578]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
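
For intuition, `LabelEncoder` maps the sorted unique labels to the integers 0, 1, 2, and so on. The sketch below reproduces that mapping with NumPy alone on a few made-up labels; it illustrates the idea and is not a replacement for the encoder used in this notebook.

```python
import numpy as np

labels = np.array(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

# classes holds the sorted unique labels; codes indexes into classes
classes, codes = np.unique(labels, return_inverse=True)

print(classes)
print(codes)           # [2 0 1 0]
# Indexing classes with the codes recovers the original strings
print(classes[codes])
```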

In [579]:
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


## Test Train Splitting

In [580]:
uniques_classes = np.unique(y)
uniques_classes

array([0, 1, 2])

In [581]:
test_indices = []
train_indices = []

In [582]:
for class_label in uniques_classes:
    class_indices = np.where(y == class_label)[0]
    # randrange excludes the upper bound; randint(0, n) can return n, which is out of range
    index = random.randrange(len(class_indices))
    test_indices.append(class_indices[index])
    indices_copy = class_indices.copy()
    indices_copy = np.delete(indices_copy, index)
    train_indices.extend(indices_copy)
print(test_indices)
print(train_indices)

[38, 64, 128]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]
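
If a reproducible split is wanted, one option is a seeded NumPy generator together with a boolean mask. This is a sketch under assumptions: the seed value is arbitrary, and `y_demo` is a stand-in array with the same layout as the encoded labels rather than the notebook's `y`.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed value is an arbitrary choice

# Stand-in labels with the same layout as the encoded Iris targets
y_demo = np.repeat([0, 1, 2], 50)

test_idx = []
for class_label in np.unique(y_demo):
    class_indices = np.flatnonzero(y_demo == class_label)
    # rng.choice picks one existing index, so it cannot run past the end
    test_idx.append(int(rng.choice(class_indices)))

# Everything that is not a test index is a training index
is_test = np.zeros(len(y_demo), dtype=bool)
is_test[test_idx] = True
train_idx = np.flatnonzero(~is_test)

print(test_idx)
print(len(train_idx))  # 147
```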


In [583]:
X_test = dataset.loc[test_indices].values[:, 1:-1]
X_train = dataset.loc[train_indices].values[:, 1:-1]
y_test = dataset.loc[test_indices].values[:, -1]
y_train = dataset.loc[train_indices].values[:, -1]
# Reuse the encoder fitted on the full label column; refitting on a subset could change the mapping
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [584]:
X_test

array([[4.4, 3.0, 1.3, 0.2],
       [5.6, 2.9, 3.6, 1.3],
       [6.4, 2.8, 5.6, 2.1]], dtype=object)

In [585]:
X_train

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3.0, 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5.0, 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5.0, 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3.0, 1.4, 0.1],
       [4.3, 3.0, 1.1, 0.1],
       [5.8, 4.0, 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1.0, 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5.0, 3.0, 1.6, 0.2],
       [5.0, 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [586]:
print(y_test)

[0 1 2]


In [587]:
print(y_train)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


## Training the model

In [588]:
class_mean_vectors = {}
class_covariances = {}

In [589]:
for class_label in uniques_classes:
    class_indices = np.where(y_train == class_label)[0]
    class_sample = X_train[class_indices]
    class_mean = np.mean(class_sample, axis = 0)
    class_covariance = np.cov(class_sample.astype(float), rowvar=False)
    class_mean_vectors[class_label] = class_mean
    class_covariances[class_label] = class_covariance
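
As a quick sanity check on what `np.cov(..., rowvar=False)` computes, the sketch below compares it with the textbook unbiased estimate on four made-up samples. With `rowvar=False`, rows are treated as observations and columns as features, and the default normalisation divides by n - 1.

```python
import numpy as np

# Four made-up samples with two features each (rows are samples)
samples = np.array([[1.0, 2.0],
                    [2.0, 4.5],
                    [3.0, 5.5],
                    [4.0, 8.0]])

mean = samples.mean(axis=0)
centered = samples - mean
# Unbiased estimate: divide by n - 1, which is what np.cov does by default
manual_cov = centered.T @ centered / (len(samples) - 1)

print(np.allclose(manual_cov, np.cov(samples, rowvar=False)))  # True
```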

In [590]:
class_mean_vectors

{0: array([5.018367346938775, 3.426530612244899, 1.4673469387755103,
        0.2448979591836734], dtype=object),
 1: array([5.942857142857143, 2.7673469387755105, 4.273469387755101,
        1.3265306122448979], dtype=object),
 2: array([6.591836734693876, 2.9775510204081628, 5.551020408163264,
        2.0244897959183668], dtype=object)}

In [591]:
class_covariances

{0: array([[0.11903061, 0.09700255, 0.01436224, 0.01019983],
        [0.09700255, 0.1444898 , 0.01046769, 0.01128401],
        [0.01436224, 0.01046769, 0.03016156, 0.00566327],
        [0.01019983, 0.01128401, 0.00566327, 0.01169218]]),
 1: array([[0.26958333, 0.0878869 , 0.18199405, 0.05675595],
        [0.0878869 , 0.10016156, 0.08619898, 0.04213435],
        [0.18199405, 0.08619898, 0.21615646, 0.0742602 ],
        [0.05675595, 0.04213435, 0.0742602 , 0.03990646]]),
 2: array([[0.41201531, 0.09502126, 0.30980017, 0.05041241],
        [0.09502126, 0.10552721, 0.07304422, 0.04889456],
        [0.30980017, 0.07304422, 0.31088435, 0.04976616],
        [0.05041241, 0.04889456, 0.04976616, 0.07688776]])}

## Testing the model

In [595]:
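y_pred = mahalanobis_classification(uniques_classes, class_mean_vectors, class_covariances, X_test)
# Keep the distances before the integer cast below truncates them
min_distances = np.array([dist for _, dist in y_pred], dtype=float)
y_pred = np.array(y_pred, dtype=int)
y_pred[:, 0]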

array([0, 1, 2])

In [596]:
print(f"Actual Output : {le.inverse_transform(y_test)}\nPredicted Output : {le.inverse_transform(y_pred[:, 0])}")

Actual Output : ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Predicted Output : ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
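
The task also asks for the distances themselves. One way to report them is to build a table with one row per test sample and one column per class, as sketched below. The helper `distance_table` and the 2-D toy classes are hypothetical and only show the layout; in the notebook the inputs would be `X_test`, `class_mean_vectors`, and `class_covariances`.

```python
import numpy as np

def distance_table(x_test, class_means, class_covs):
    """Return the class labels and an (n_samples, n_classes) array of distances."""
    labels = sorted(class_means)
    table = np.empty((len(x_test), len(labels)))
    for i, x in enumerate(x_test):
        for j, label in enumerate(labels):
            diff = np.asarray(x, dtype=float) - np.asarray(class_means[label], dtype=float)
            cov = np.asarray(class_covs[label], dtype=float)
            table[i, j] = np.sqrt(diff @ np.linalg.solve(cov, diff))
    return labels, table

# Made-up 2-D classes, only to show the output layout
means = {0: np.array([0.0, 0.0]), 1: np.array([4.0, 4.0])}
covs = {0: np.eye(2), 1: np.eye(2)}
labels, table = distance_table(np.array([[0.0, 1.0], [4.0, 3.0]]), means, covs)

for row in table:
    # The predicted class is the column with the smallest distance
    predicted = labels[int(np.argmin(row))]
    print(f"distances={np.round(row, 3)} predicted={predicted}")
```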


## Performance

In [594]:
print(f"Accuracy : {check_accuracy(y_pred[:, 0], y_test)*100}%")

Accuracy : 100.0%
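
With only three test samples, the accuracy can only take the values 0, 1/3, 2/3, or 1, so it is a coarse measure. A leave-one-out loop would give a finer estimate. The sketch below is one way to do that; it runs on synthetic, well-separated 2-D data rather than the Iris file, and the function name `leave_one_out_accuracy` is hypothetical.

```python
import numpy as np

def leave_one_out_accuracy(X, y):
    """Hold out each sample in turn, refit the class statistics, and classify it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    correct = 0
    for i in range(len(X)):
        keep = np.ones(len(X), dtype=bool)
        keep[i] = False  # exclude the held-out sample from training
        best_label, best_dist = None, np.inf
        for c in classes:
            members = X[keep & (y == c)]
            diff = X[i] - members.mean(axis=0)
            cov = np.cov(members, rowvar=False)
            dist = np.sqrt(diff @ np.linalg.solve(cov, diff))
            if dist < best_dist:
                best_label, best_dist = c, dist
        correct += int(best_label == y[i])
    return correct / len(X)

# Synthetic, well-separated 2-D classes (not the Iris data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
                    rng.normal(10.0, 1.0, size=(30, 2))])
y_demo = np.repeat([0, 1], 30)
loo_accuracy = leave_one_out_accuracy(X_demo, y_demo)
print(loo_accuracy)
```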
