## Supplement 4: Classification

In [12]:
%matplotlib inline
import numpy as np
import pandas as pd


### 4.1 Programming Task: Gaussian Naive-Bayes Classifier
The Iris dataset, containing measurements of the flower parts obtained from 3 different species of the Iris plant, is provided in the file __iris.csv__. The first four columns of the dataset contain the measurement values representing input features for the model and the last column contains class labels of the plant species: Iris-setosa, Iris-versicolor, and Iris-virginica.
The goal of this task is to implement a Gaussian Naive-Bayes classifier for the Iris dataset.

i\. What are the assumptions on the dataset required for the Gaussian Naive-Bayes model?

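- The features are conditionally independent given the class.

- Within each class, every feature follows a Gaussian distribution, so each class-conditional covariance matrix is diagonal.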
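
As a compact restatement of these assumptions, the class-conditional density can be written as a product of univariate Gaussians. The symbols are introduced here for illustration and do not appear in the original task: $d$ is the number of features, and $\mu_{c,j}$ and $\sigma_{c,j}^2$ are the mean and variance of feature $j$ in class $c$.

$$
p(\mathbf{x} \mid y = c) = \prod_{j=1}^{d} \mathcal{N}\!\left(x_j \mid \mu_{c,j}, \sigma_{c,j}^2\right),
\qquad
\Sigma_c = \mathrm{diag}\!\left(\sigma_{c,1}^2, \dots, \sigma_{c,d}^2\right)
$$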

ii\. Split the dataset into training and test sets in an 80:20 ratio.


In [13]:
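dataset_pd = pd.read_csv("iris.csv")

# Shuffle the rows; reset_index() keeps the original row order as an "index" column
dataset_pd = dataset_pd.sample(frac=1).reset_index()

# The first 80% of the shuffled rows form the train set, the remaining 20% the test set
n_train = int(len(dataset_pd) * 0.8)
train_pd = dataset_pd.iloc[:n_train]
test_pd = dataset_pd.iloc[n_train:]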

train_pd


Unnamed: 0,index,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,0,5.1,3.5,1.4,0.2,Iris-setosa
1,51,6.4,3.2,4.5,1.5,Iris-versicolor
2,9,4.9,3.1,1.5,0.1,Iris-setosa
3,149,5.9,3.0,5.1,1.8,Iris-virginica
4,123,6.3,2.7,4.9,1.8,Iris-virginica
...,...,...,...,...,...,...
115,40,5.0,3.5,1.3,0.3,Iris-setosa
116,78,6.0,2.9,4.5,1.5,Iris-versicolor
117,53,5.5,2.3,4.0,1.3,Iris-versicolor
118,72,6.3,2.5,4.9,1.5,Iris-versicolor
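
The shuffle above is unseeded, so the split (and any result computed from it) changes between runs, and the old row order survives as the `index` column visible in the table. A minimal sketch of a reproducible variant follows. The toy frame and the `toy_`-prefixed names are hypothetical stand-ins for the real data in `iris.csv`, and the seed value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Iris table; the real data comes from iris.csv.
toy_pd = pd.DataFrame({
    "PetalLengthCm": np.linspace(1.0, 6.9, 60),
    "Species": ["a", "b", "c"] * 20,
})

# random_state fixes the shuffle; drop=True discards the old index instead of
# keeping it as an extra "index" column.
shuffled_pd = toy_pd.sample(frac=1, random_state=0).reset_index(drop=True)

n_train = int(len(shuffled_pd) * 0.8)
toy_train_pd = shuffled_pd.iloc[:n_train]
toy_test_pd = shuffled_pd.iloc[n_train:]

print(len(toy_train_pd), len(toy_test_pd))
```

With 60 rows this yields 48 training rows and 12 test rows, and the two parts never share a row.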


iii\. Estimate the parameters of the Gaussian Naive-Bayes classifier using the train set.


In [14]:
train_X = train_pd[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].to_numpy()
train_Y = train_pd["Species"].to_numpy()

species = np.unique(train_Y)


priors = {}
means = {}
covs = {}

In [15]:


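for s in species:
    # Samples belonging to class s
    X_c = train_X[train_Y == s]
    # Prior: fraction of training samples in class s
    priors[s] = X_c.shape[0] / train_X.shape[0]
    means[s] = np.mean(X_c, axis=0)

    # Maximum-likelihood covariance: average outer product of centered samples
    cov = np.zeros((len(means[s]), len(means[s])))
    for x in X_c:
        x_m = (x - means[s]).reshape((len(x), 1))
        cov += x_m @ x_m.T
    cov /= len(X_c)
    # Naive-Bayes assumption: keep only the per-feature variances on the diagonal
    covs[s] = np.diag(np.diag(cov))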
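
Because only the diagonal of the covariance is kept, the loop above can also be expressed with per-feature variances. The sketch below runs on synthetic data; `toy_X`, `toy_Y`, and the other `toy_` names are hypothetical and not part of the original notebook. It checks that `np.var` with its default `ddof=0` matches the diagonal of the outer-product estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(30, 4))          # hypothetical feature matrix
toy_Y = np.repeat(["a", "b", "c"], 10)    # hypothetical class labels

toy_priors, toy_means, toy_vars = {}, {}, {}
for s in np.unique(toy_Y):
    X_c = toy_X[toy_Y == s]
    toy_priors[s] = len(X_c) / len(toy_X)
    toy_means[s] = X_c.mean(axis=0)
    toy_vars[s] = X_c.var(axis=0)         # ddof=0: divides by N (maximum likelihood)

# Cross-check against the explicit covariance built from centered samples
X_a = toy_X[toy_Y == "a"]
diffs = X_a - toy_means["a"]
full_cov = diffs.T @ diffs / len(X_a)
print(np.allclose(np.diag(full_cov), toy_vars["a"]))
```

Storing the variances as a vector per class avoids building and inverting a matrix whose off-diagonal entries are zero by construction.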



iv\. Using the learned parameters, predict the classes for the samples in the test set.
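
For reference, the decision rule behind this step can be written as follows. The notation is an assumption added here: $\pi_c$, $\boldsymbol{\mu}_c$, and $\Sigma_c$ denote the estimated prior, mean, and diagonal covariance of class $c$. The term $-\tfrac{d}{2}\log 2\pi$ is the same for every class and is therefore dropped.

$$
\hat{y}(\mathbf{x}) = \arg\max_{c} \left[ \log \pi_c - \tfrac{1}{2} \log \lvert \Sigma_c \rvert - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_c)^{\top} \Sigma_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) \right]
$$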


In [16]:
test_X = test_pd[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].to_numpy()
test_Y = test_pd["Species"].to_numpy()


preds = []
for x in test_X:
    # Log-posterior of each class, up to an additive constant shared by all classes
    species_probs = []
    for s in species:
        x_m = (x - means[s]).reshape((len(x), 1))
        a = np.log(priors[s])
        b = -1/2 * np.log(np.linalg.det(covs[s]))
        c = -1/2 * (x_m.T @ np.linalg.inv(covs[s]) @ x_m).item()
        species_probs.append(a + b + c)
    # Predict the class with the highest score
    preds.append(species[np.argmax(species_probs)])
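
With a diagonal covariance, the determinant and inverse reduce to a sum of log-variances and an element-wise division. The sketch below illustrates this under assumed parameter values; the function name `diag_gaussian_log_score` and all numbers are hypothetical. It compares the result with the matrix form used above.

```python
import numpy as np

def diag_gaussian_log_score(x, prior, mean, var):
    """Log prior plus Gaussian log-likelihood for a diagonal covariance,
    omitting the class-independent constant -d/2 * log(2*pi)."""
    return (np.log(prior)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum((x - mean) ** 2 / var))

# Hypothetical parameters, for illustration only
x = np.array([5.0, 3.4, 1.5, 0.2])
mean = np.array([5.0, 3.5, 1.4, 0.25])
var = np.array([0.12, 0.14, 0.03, 0.01])
prior = 1 / 3

# Same quantity computed with an explicit determinant and inverse
cov = np.diag(var)
x_m = (x - mean).reshape(-1, 1)
matrix_form = (np.log(prior)
               - 0.5 * np.log(np.linalg.det(cov))
               - 0.5 * (x_m.T @ np.linalg.inv(cov) @ x_m).item())

print(np.isclose(diag_gaussian_log_score(x, prior, mean, var), matrix_form))
```

Summing log-variances also avoids the underflow that a product of many small variances could cause in higher dimensions.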



# 1 where the predicted species matches the true label, 0 otherwise
correct_predictions = (np.array(preds) == test_Y).astype(int)

What is the accuracy of the model on the test set?

In [17]:
print("Accuracy: {:.3f}%".format(np.mean(correct_predictions) * 100))


Accuracy: 96.667%
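
Since the shuffle is unseeded, this accuracy will vary from run to run. A single accuracy figure also hides which species are confused with each other. One way to inspect that is a confusion table built with `pd.crosstab`, sketched here on hypothetical labels (`toy_true` and `toy_pred` are made up and are not results from the model above).

```python
import numpy as np
import pandas as pd

# Hypothetical labels and predictions, for illustration only
toy_true = np.array(["setosa", "setosa", "versicolor",
                     "versicolor", "virginica", "virginica"])
toy_pred = np.array(["setosa", "setosa", "versicolor",
                     "virginica", "virginica", "virginica"])

toy_accuracy = np.mean(toy_pred == toy_true)

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(pd.Series(toy_true, name="true"),
                        pd.Series(toy_pred, name="predicted"))

print("Accuracy: {:.3f}%".format(toy_accuracy * 100))
print(confusion)
```

In this toy example, the single error appears as a count in the row for the true class versicolor and the column for the predicted class virginica.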
