# Exploring different statistical/Bayesian classifiers

Overview:
- Own implementation of a Nearest Centroid Classifier (NCC)
- A Naive Bayesian Classifier (NBC) based on discrete feature values
- A Gaussian Naive Bayesian Classifier (GNBC) using Gaussian distributions

In [8]:
import NCC
import NBC
import datasets
import helpers

## Datasets

In [9]:
train_features_1, test_features_1, train_labels_1, test_labels_1 = datasets.sklearn_digits()
train_features_2, test_features_2, train_labels_2, test_labels_2 = datasets.sklearn_digits_summarized()
train_features_3, test_features_3, train_labels_3, test_labels_3 = datasets.MNIST_light(normalized=True)

## Nearest Centroid Classifier - own implementation
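
A nearest centroid classifier stores one mean vector per class and assigns each sample to the class whose centroid is closest. The sketch below illustrates the idea in plain Python; `CentroidSketch` is a hypothetical stand-in and not the code in the `NCC` module, and the Euclidean distance is an assumption.

```python
import math
from collections import defaultdict


class CentroidSketch:
    """Hypothetical stand-in for NCC.NearestCentroidClassifier (illustration only)."""

    def fit(self, X, y):
        sums = {}
        counts = defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            # Accumulate the per-feature sum for this class.
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        # The centroid is the per-feature mean of all samples in the class.
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        # Assumption: Euclidean distance decides which centroid is nearest.
        return [
            min(self.centroids_, key=lambda c: math.dist(row, self.centroids_[c]))
            for row in X
        ]


toy_X = [[0, 0], [0, 2], [10, 10], [10, 12]]
toy_y = [0, 0, 1, 1]
toy_clf = CentroidSketch().fit(toy_X, toy_y)
toy_pred = toy_clf.predict([[1, 1], [9, 9]])
```

Training is a single pass over the data and prediction needs only one distance per class, which is likely why this classifier runs much faster than the naive Bayes variants further down.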

### Sklearn digits

In [10]:
clf = NCC.NearestCentroidClassifier()
clf.fit(train_features_1, train_labels_1)

In [11]:
y_pred = clf.predict(test_features_1)

In [12]:
helpers.evaluate_and_print(test_labels_1, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        45
           1       0.86      0.81      0.83        52
           2       0.96      0.83      0.89        53
           3       0.92      0.81      0.86        54
           4       1.00      0.92      0.96        48
           5       0.94      0.82      0.88        57
           6       0.97      0.98      0.98        60
           7       0.84      0.98      0.90        53
           8       0.93      0.84      0.88        61
           9       0.68      0.95      0.79        57

    accuracy                           0.89       540
   macro avg       0.91      0.89      0.90       540
weighted avg       0.90      0.89      0.89       540


Confusion matrix SKLearn GNB:
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 42  1  0  0  1  1  0  1  6]
 [ 1  2 44  3  0  0  0  2  0  1]
 [ 0  0  1 44  0  0  0  2  2  5]
 [ 0  1  0  0 44  0  0  3  0  0]
 [ 0  0  0

### Sklearn digits summarized

In [13]:
clf = NCC.NearestCentroidClassifier()
clf.fit(train_features_2, train_labels_2)

In [14]:
y_pred = clf.predict(test_features_2)

In [15]:
helpers.evaluate_and_print(test_labels_2, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        45
           1       0.82      0.77      0.79        52
           2       0.89      0.79      0.84        53
           3       0.89      0.78      0.83        54
           4       0.96      0.92      0.94        48
           5       0.94      0.84      0.89        57
           6       0.94      0.98      0.96        60
           7       0.86      0.94      0.90        53
           8       0.91      0.82      0.86        61
           9       0.68      0.93      0.79        57

    accuracy                           0.88       540
   macro avg       0.89      0.88      0.88       540
weighted avg       0.88      0.88      0.88       540


Confusion matrix SKLearn GNB:
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 40  2  0  0  1  2  0  2  5]
 [ 1  4 42  4  0  0  0  1  1  0]
 [ 0  0  1 42  0  0  0  2  2  7]
 [ 0  1  0  0 44  0  0  3  0  0]
 [ 0  0  0

### MNIST_light

In [16]:
clf = NCC.NearestCentroidClassifier()
clf.fit(train_features_3, train_labels_3)

In [17]:
y_pred = clf.predict(test_features_3)

In [18]:
helpers.evaluate_and_print(test_labels_3, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       164
           1       0.71      0.97      0.82       152
           2       0.84      0.73      0.78       155
           3       0.74      0.76      0.75       154
           4       0.75      0.76      0.75       143
           5       0.72      0.69      0.70       141
           6       0.90      0.86      0.88       143
           7       0.95      0.80      0.87       158
           8       0.79      0.72      0.75       132
           9       0.76      0.80      0.78       158

    accuracy                           0.80      1500
   macro avg       0.81      0.80      0.80      1500
weighted avg       0.81      0.80      0.80      1500


Confusion matrix SKLearn GNB:
[[150   0   2   0   0   6   3   1   2   0]
 [  0 148   0   0   0   2   0   0   2   0]
 [  0  15 113   8   2   3   3   1   8   2]
 [  1   5   8 117   1   7   1   2   8   4]
 [ 

## NBC - Discrete features

Here we use naive Bayes.
It estimates the conditional probability of observing a feature value x given a class y.
These probabilities are essentially relative frequencies: the number of times a feature value appears in class y, divided by the number of training samples in that class.
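
The counting described above can be sketched as follows. This is an illustration, not the code in the `NBC` module: the function names, the `n_values` parameter (number of distinct values a feature can take) and the additive smoothing term `alpha` are assumptions. Smoothing is included because a raw count of zero would make the whole product zero; log-probabilities are summed instead of multiplying probabilities.

```python
import math
from collections import Counter, defaultdict


def fit_counts(X, y):
    """Count classes and, per (class, feature index), how often each value occurs."""
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def predict_one(row, class_counts, value_counts, n_values, alpha=1.0):
    """Return the class with the highest log posterior (up to a constant)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, value in enumerate(row):
            count = value_counts[(label, j)][value]
            # Assumption: additive (Laplace) smoothing with strength alpha.
            score += math.log((count + alpha) / (n_c + alpha * n_values))
        scores[label] = score
    return max(scores, key=scores.get)


toy_X = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_y = ["a", "a", "b", "b"]
toy_class_counts, toy_value_counts = fit_counts(toy_X, toy_y)
toy_label = predict_one([1, 0], toy_class_counts, toy_value_counts, n_values=2)
```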

### Sklearn digits

In [19]:
clf = NBC.NaiveBayesianClassifier()
clf.fit(train_features_1, train_labels_1)

In [20]:
y_pred = clf.predict(test_features_1)

100%|████████████████████████████████████████████████████████████████████████████████| 540/540 [00:19<00:00, 27.75it/s]


In [21]:
helpers.evaluate_and_print(test_labels_1, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.19      1.00      0.32        45
           1       0.84      0.62      0.71        52
           2       0.90      0.51      0.65        53
           3       0.76      0.48      0.59        54
           4       0.94      0.69      0.80        48
           5       0.92      0.39      0.54        57
           6       1.00      0.68      0.81        60
           7       0.81      0.57      0.67        53
           8       0.86      0.39      0.54        61
           9       0.74      0.49      0.59        57

    accuracy                           0.57       540
   macro avg       0.80      0.58      0.62       540
weighted avg       0.81      0.57      0.63       540


Confusion matrix SKLearn GNB:
[[45  0  0  0  0  0  0  0  0  0]
 [17 32  1  0  0  0  0  0  2  0]
 [25  0 27  1  0  0  0  0  0  0]
 [18  0  1 26  0  0  0  0  1  8]
 [14  0  0  0 33  0  0  1  0  0]
 [29  0  0

### Sklearn digits summarized

In [22]:
clf = NBC.NaiveBayesianClassifier()
clf.fit(train_features_2, train_labels_2)

In [23]:
y_pred = clf.predict(test_features_2)

100%|████████████████████████████████████████████████████████████████████████████████| 540/540 [00:20<00:00, 26.29it/s]


In [24]:
helpers.evaluate_and_print(test_labels_2, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.91      0.93      0.92        45
           1       0.69      0.90      0.78        52
           2       0.96      0.83      0.89        53
           3       0.75      0.87      0.80        54
           4       0.79      0.92      0.85        48
           5       0.91      0.89      0.90        57
           6       0.98      0.93      0.96        60
           7       0.83      0.92      0.88        53
           8       1.00      0.41      0.58        61
           9       0.78      0.88      0.83        57

    accuracy                           0.84       540
   macro avg       0.86      0.85      0.84       540
weighted avg       0.86      0.84      0.84       540


Confusion matrix SKLearn GNB:
[[42  0  0  0  3  0  0  0  0  0]
 [ 0 47  0  0  1  1  0  1  0  2]
 [ 1  5 44  1  0  0  0  0  0  2]
 [ 1  0  0 47  0  0  0  2  0  4]
 [ 0  0  0  0 44  0  0  4  0  0]
 [ 0  0  0

### MNIST_light - currently unused, may be added later out of interest

## GNBC - Features as probability distributions
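
In a Gaussian naive Bayes model, each feature is described per class by a mean and a standard deviation instead of a table of counts. The sketch below shows one way the fitting step could look. It is not the code in the `NBC` module: `fit_gaussians` is a hypothetical name, and adding `epsilon` to the standard deviation (so that constant features do not get a zero spread) is an assumption about what the `epsilon` argument does.

```python
import math
from collections import defaultdict


def fit_gaussians(X, y, epsilon=1e-2):
    """Return {label: (prior, means, stds)} with per-feature Gaussian parameters."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Assumption: epsilon is added to the standard deviation as a floor.
        stds = [
            math.sqrt(sum((v - m) ** 2 for v in col) / n) + epsilon
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(y), means, stds)
    return params


toy_X = [[0.0, 1.0], [2.0, 1.0], [10.0, 5.0], [12.0, 5.0]]
toy_y = [0, 0, 1, 1]
toy_params = fit_gaussians(toy_X, toy_y, epsilon=1e-2)
```

In the toy data the second feature is constant within each class, so without the floor its standard deviation would be zero and the density undefined.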

### Sklearn digits

In [28]:
clf = NBC.GaussianNaiveBayesianClassifier()
clf.fit(train_features_1, train_labels_1, epsilon=1e-2)

In [29]:
y_pred = clf.predict(test_features_1)

100%|████████████████████████████████████████████████████████████████████████████████| 540/540 [00:28<00:00, 18.77it/s]


In [30]:
helpers.evaluate_and_print(test_labels_1, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        45
           1       0.84      0.83      0.83        52
           2       0.95      0.72      0.82        53
           3       0.77      0.81      0.79        54
           4       0.94      0.96      0.95        48
           5       0.98      0.89      0.94        57
           6       0.98      0.98      0.98        60
           7       0.85      0.96      0.90        53
           8       0.70      0.87      0.77        61
           9       0.88      0.75      0.81        57

    accuracy                           0.88       540
   macro avg       0.89      0.88      0.88       540
weighted avg       0.88      0.88      0.88       540


Confusion matrix SKLearn GNB:
[[45  0  0  0  0  0  0  0  0  0]
 [ 1 43  0  0  0  0  0  0  6  2]
 [ 0  5 38  2  0  0  0  0  8  0]
 [ 0  0  1 44  0  0  0  1  5  3]
 [ 0  0  0  0 46  0  0  2  0  0]
 [ 0  0  0

### Sklearn digits summarized

In [31]:
clf = NBC.GaussianNaiveBayesianClassifier()
clf.fit(train_features_2, train_labels_2, epsilon=1e-2)

In [32]:
y_pred = clf.predict(test_features_2)

100%|████████████████████████████████████████████████████████████████████████████████| 540/540 [00:29<00:00, 18.57it/s]


In [33]:
helpers.evaluate_and_print(test_labels_2, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.93      0.79      0.85        52
           2       0.96      0.85      0.90        53
           3       0.95      0.76      0.85        54
           4       0.94      0.94      0.94        48
           5       0.98      0.88      0.93        57
           6       0.97      0.98      0.98        60
           7       0.84      0.98      0.90        53
           8       0.78      0.97      0.86        61
           9       0.78      0.86      0.82        57

    accuracy                           0.90       540
   macro avg       0.91      0.90      0.90       540
weighted avg       0.91      0.90      0.90       540


Confusion matrix SKLearn GNB:
[[45  0  0  0  0  0  0  0  0  0]
 [ 0 41  0  0  1  1  1  1  4  3]
 [ 0  1 45  0  0  0  0  0  6  1]
 [ 0  0  0 41  0  0  0  1  3  9]
 [ 0  0  0  0 45  0  0  3  0  0]
 [ 0  0  0

### MNIST_light

IMPORTANT: epsilon must not be too small!
For this dataset we compute the likelihood over 44 different attributes (pixels), so if epsilon is too small the product of densities underflows during the norm calculations.
Epsilon = 0.02 works best and produces the same results.
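
A product of many small densities can underflow to zero in floating point, which is one common source of this kind of warning. A frequently used alternative to tuning epsilon is to sum log-densities instead of multiplying densities. The sketch below only demonstrates the numerical effect; the feature count of 400 and the values are made up for illustration and are not taken from the dataset, and the helper functions are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_logpdf(x, mean, std):
    """Logarithm of the normal density at x, computed without exponentiating."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


# Made-up setting: 400 features, each three standard deviations from the mean.
n_features = 400
x, mean, std = 3.0, 0.0, 1.0

product = 1.0
log_sum = 0.0
for _ in range(n_features):
    product *= gaussian_pdf(x, mean, std)      # shrinks towards 0.0
    log_sum += gaussian_logpdf(x, mean, std)   # stays a moderate negative number

print(product)  # the product has underflowed
print(log_sum)  # the log-likelihood is still finite and comparable across classes
```

Because the logarithm is monotonic, picking the class with the largest summed log-density gives the same decision as picking the largest product, as long as the product itself has not underflowed.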

In [34]:
clf = NBC.GaussianNaiveBayesianClassifier()
clf.fit(train_features_3, train_labels_3, epsilon=0.02) #0.0004

In [35]:
y_pred = clf.predict(test_features_3)

  likelihood *= norm.pdf(feature[attr], loc=loc, scale=std)
100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [08:17<00:00,  3.01it/s]


In [36]:
helpers.evaluate_and_print(test_labels_3, y_pred)

Classification report SKLearn GNB:
              precision    recall  f1-score   support

           0       0.89      0.93      0.91       164
           1       0.76      0.97      0.85       152
           2       0.79      0.61      0.69       155
           3       0.76      0.77      0.76       154
           4       0.83      0.60      0.70       143
           5       0.93      0.53      0.68       141
           6       0.82      0.94      0.88       143
           7       0.95      0.78      0.86       158
           8       0.64      0.71      0.67       132
           9       0.60      0.90      0.72       158

    accuracy                           0.78      1500
   macro avg       0.80      0.77      0.77      1500
weighted avg       0.80      0.78      0.77      1500


Confusion matrix SKLearn GNB:
[[152   0   6   0   0   1   2   0   2   1]
 [  0 147   0   0   0   0   1   0   3   1]
 [  1   6  94  11   1   2  16   1  22   1]
 [  1   5  14 118   0   0   2   2   5   7]
 [ 