## Problem 4: Gaussian Naive Bayes

1. Given a data set with m continuous features, what is the log-likelihood of the Gaussian NB
model? Compute the MLE for each of the model parameters.
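
The notebook does not write out the likelihood, so the following is a sketch of the standard derivation rather than the author's own working. The notation is an assumption introduced here: $N$ training examples, $m$ features as in the question (the code below instead uses `m` for the number of rows), class prior $p_c = P(y = c)$, and a separate mean $\mu_{c,j}$ and variance $\sigma^2_{c,j}$ for each class $c$ and feature $j$.

```latex
\ell(p, \mu, \sigma^2)
  = \sum_{i=1}^{N} \Big[ \log p_{y^{(i)}}
  + \sum_{j=1}^{m} \Big( -\tfrac{1}{2} \log\big(2\pi \sigma^2_{y^{(i)},j}\big)
  - \frac{\big(x^{(i)}_j - \mu_{y^{(i)},j}\big)^2}{2 \sigma^2_{y^{(i)},j}} \Big) \Big]

\hat{p}_c = \frac{N_c}{N}, \qquad
\hat{\mu}_{c,j} = \frac{1}{N_c} \sum_{i : y^{(i)} = c} x^{(i)}_j, \qquad
\hat{\sigma}^2_{c,j} = \frac{1}{N_c} \sum_{i : y^{(i)} = c} \big(x^{(i)}_j - \hat{\mu}_{c,j}\big)^2
```

Here $N_c$ is the number of training examples with label $c$. The estimates follow from setting the partial derivatives of $\ell$ to zero (with the constraint $\sum_c p_c = 1$ for the class prior). The variance MLE divides by $N_c$, which matches NumPy's default `var(axis=0)` with `ddof=0`.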


In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
sonar_train = pd.read_csv('sonar_train.data', header=None)
sonar_test = pd.read_csv('sonar_test.data', header=None)
sonar_valid = pd.read_csv('sonar_valid.data', header=None)

sonar_train.loc[sonar_train[60] == 2, 60] = -1
sonar_test.loc[sonar_test[60] == 2, 60] = -1
sonar_valid.loc[sonar_valid[60] == 2, 60] = -1

def split_data(data):
    return data.iloc[:, :60].to_numpy(), data.iloc[:, 60:].to_numpy()

X_train, y_train = split_data(sonar_train)
X_validation, y_validation = split_data(sonar_valid)
X_test, y_test = split_data(sonar_test)

In [3]:
labels = {1, -1}
m, n = X_train.shape
data_stat = {}
for label in labels:
    mask = (y_train == label).ravel()
    y_prob = np.sum(mask.astype(np.float32)) / m
    X_label = X_train[mask]
    mean = X_label.mean(axis=0)
    var = X_label.var(axis=0)
    data_stat[label] = (y_prob, mean, var)

In [40]:
mask_1 = (y_train == 1).ravel()
X_1 = X_train[mask_1]
print("y = 1")
print("Mean MLE\n", X_1.mean(axis=0))
# Variance MLE: mean squared deviation from the class mean (divides by N_c, the class sample count)
print("Variance MLE\n", X_1.var(axis=0))
# The unbiased estimate divides by N_c - 1 instead; it is not the MLE
print("Unbiased variance estimate\n", X_1.var(axis=0, ddof=1))

y = 1
Mean MLE
 [0.0222902  0.03148824 0.03790196 0.04584314 0.0689     0.10116275
 0.11923529 0.12568235 0.13812941 0.16653529 0.17673137 0.18921569
 0.20863922 0.2563     0.30578039 0.3820451  0.42764706 0.44002157
 0.4440902  0.47582157 0.54373137 0.57872549 0.60416078 0.65514314
 0.68392745 0.70956667 0.69666275 0.67369216 0.63386471 0.57646078
 0.52397255 0.43650392 0.44531961 0.45542353 0.46760784 0.47771373
 0.44626667 0.36371765 0.31952745 0.33340392 0.29820196 0.26851569
 0.2234451  0.18886863 0.16330588 0.13898039 0.10754118 0.07797647
 0.04480588 0.01862353 0.01237647 0.01037255 0.00930784 0.00888431
 0.00874706 0.00722353 0.00836863 0.00688824 0.00730196 0.00610784]
Variance MLE
 [output truncated in the source; the values listed there were computed as (column sum - mean) / 60, which is not a variance]

In [41]:
mask_neg = (y_train == -1).ravel()
X_neg = X_train[mask_neg]
print("y = -1")
print("Mean MLE\n", X_neg.mean(axis=0))
# Variance MLE: mean squared deviation from the class mean (divides by N_c, the class sample count)
print("Variance MLE\n", X_neg.var(axis=0))
# The unbiased estimate divides by N_c - 1 instead; it is not the MLE
print("Unbiased variance estimate\n", X_neg.var(axis=0, ddof=1))

y = -1
Mean MLE
 [0.03253962 0.04104528 0.04472075 0.05795472 0.07641132 0.10650755
 0.12542075 0.14785283 0.21642453 0.26446415 0.29930566 0.30189434
 0.31160755 0.3128     0.33173774 0.3757434  0.40329623 0.44553396
 0.52754528 0.59813396 0.65068113 0.66556415 0.67275472 0.68820755
 0.67840566 0.69489245 0.70525472 0.70304906 0.65078113 0.57865094
 0.49120377 0.43141132 0.40898113 0.37285472 0.33809434 0.31356038
 0.30740377 0.30675283 0.30745283 0.28710755 0.25690566 0.27194528
 0.27778302 0.25384717 0.2275283  0.17933208 0.14156981 0.10864717
 0.06152264 0.01993396 0.01978679 0.01554717 0.01090755 0.0108
 0.00902075 0.00840377 0.00738302 0.00874151 0.00805472 0.00666792]
Variance MLE
 [output truncated in the source; the values listed there were computed as (column sum - mean) / 60, which is not a variance]

In [4]:
def compute_gaussian(x, mean, var):
    return (1/np.sqrt(2*np.pi*var)) * np.exp(-np.power(x-mean,2)/(2*var))

2. Fit a Gaussian NB model to the training data. What is the accuracy of your trained model
on the test set?


In [5]:
def predict(X):
    m, n = X.shape
    preds = np.zeros((m, 1))
    for i in range(m):
        max_likely_y = 0
        best_prob = -1
        for y in {1, -1}:
            prob, mean, var = data_stat[y]
            for j in range(len(X[i])):
                gaus = compute_gaussian(X[i][j], mean[j], var[j])
                prob *= gaus
            if prob > best_prob:
                best_prob = prob
                max_likely_y = y
        preds[i] = max_likely_y
    return preds
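
Multiplying 60 density values per example, as `predict` does, can underflow or overflow in floating point. A common alternative is to sum log-densities instead, which leaves the argmax unchanged. The block below is a minimal sketch of that idea, not the notebook's own method: `gaussian_log_pdf`, `predict_log`, and the toy numbers are names and data introduced here for illustration, and `stats` is assumed to have the same `label -> (prior, mean, var)` layout as `data_stat`.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    # Elementwise log N(x | mean, var); broadcasts across rows and features.
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict_log(X, stats):
    """Predict labels from stats: label -> (prior, mean, var), as in data_stat."""
    labels = list(stats)
    # scores[i, k] = log prior_k + sum_j log N(X[i, j] | mean_kj, var_kj)
    scores = np.column_stack([
        np.log(prior) + gaussian_log_pdf(X, mean, var).sum(axis=1)
        for prior, mean, var in (stats[c] for c in labels)
    ])
    # Pick the label with the highest log-posterior score for each row.
    return np.array(labels)[scores.argmax(axis=1)].reshape(-1, 1)

# Toy check on made-up numbers (not the sonar data).
toy_stats = {
    1: (0.5, np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    -1: (0.5, np.array([3.0, 3.0]), np.array([1.0, 1.0])),
}
X_toy = np.array([[0.1, -0.2], [2.9, 3.3]])
toy_preds = predict_log(X_toy, toy_stats)
print(toy_preds.ravel())
```

On the notebook's data this would be called as `predict_log(X_test, data_stat)`. It should agree with `predict` whenever the plain product stays within floating-point range.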

In [6]:
print("Test Accuracy: ", np.mean(predict(X_test) == y_test))

Test Accuracy:  0.6923076923076923


3. What kind of prior might make sense for this model? Explain.

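**A conjugate prior makes sense for each parameter, because the posterior then has the same form as the prior and can be computed in closed form.**

**For the mean with the variance held fixed: a Normal prior.**

**For the variance with the mean held fixed: an Inverse-Gamma prior.**

**When both are unknown, the joint conjugate prior is the Normal-Inverse-Gamma distribution.**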
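
To make the two conjugate updates above concrete, here is a small sketch of the standard closed-form posteriors for one feature. The function names, the prior hyperparameters (`prior_mean`, `prior_var`, `alpha`, `beta`), and the sample values are all illustrative assumptions, not part of the original solution.

```python
import numpy as np

def posterior_mean_known_var(x, var, prior_mean, prior_var):
    """Posterior of mu when mu ~ N(prior_mean, prior_var) and x_i ~ N(mu, var)."""
    n = x.shape[0]
    post_precision = 1.0 / prior_var + n / var
    post_mean = (prior_mean / prior_var + x.sum(axis=0) / var) / post_precision
    # Returns the posterior mean and posterior variance of mu.
    return post_mean, 1.0 / post_precision

def posterior_var_known_mean(x, mean, alpha, beta):
    """Posterior of sigma^2 when sigma^2 ~ InvGamma(alpha, beta) and the mean is known."""
    n = x.shape[0]
    alpha_post = alpha + n / 2.0
    beta_post = beta + 0.5 * ((x - mean) ** 2).sum(axis=0)
    # Returns the updated Inverse-Gamma shape and scale.
    return alpha_post, beta_post

# Made-up sample for illustration only.
x = np.array([1.0, 2.0, 3.0])
mu_post, mu_post_var = posterior_mean_known_var(x, var=1.0, prior_mean=0.0, prior_var=1.0)
a_post, b_post = posterior_var_known_mean(x, mean=2.0, alpha=2.0, beta=1.0)
print(mu_post, mu_post_var, a_post, b_post)
```

With a prior centered at 0, the posterior mean is pulled from the sample mean of 2.0 toward the prior, which is the regularizing effect a prior provides when a class has few training examples.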

4. Do you think the NB assumption is reasonable here?

**Yes. Each attribute represents the energy within a particular frequency band, integrated over
a certain period of time, so the attributes might not depend on each other. Attribute independence can be assumed.**
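
One way to probe this assumption empirically is to look at the correlations between features within a class: values near zero are consistent with independence, while large values suggest dependence. The helper below is a hypothetical sketch on synthetic data (the function name and the random inputs are introduced here for illustration), and pairwise correlation only detects linear dependence.

```python
import numpy as np

def mean_abs_offdiag_corr(X):
    """Average absolute Pearson correlation over all distinct feature pairs."""
    corr = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    return np.abs(corr[off_diag]).mean()

rng = np.random.default_rng(0)
# Five independent Gaussian features.
independent = rng.normal(size=(500, 5))
# Five features that share one common component, so they are strongly correlated.
shared = rng.normal(size=(500, 1))
correlated = shared + 0.3 * rng.normal(size=(500, 5))

corr_independent = mean_abs_offdiag_corr(independent)
corr_correlated = mean_abs_offdiag_corr(correlated)
print(corr_independent, corr_correlated)
```

On the sonar data the same check could be run per class, for example `mean_abs_offdiag_corr(X_train[(y_train == 1).ravel()])`.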