# Naive Bayes Classifiers

Principle: Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models discussed in the previous section, but they tend to be even faster to train. The price paid for this efficiency is that naive Bayes models often generalize slightly worse than linear classifiers such as LogisticRegression and LinearSVC.

In [7]:
%load_ext autoreload
%autoreload
from utils import feature_selection, PCA_reduction

import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
from sklearn.decomposition import PCA



In [5]:
gt = pd.read_csv('../dumps/2020.01.13-14.25.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.34
Test set accuracy: 0.35
Test set accuracy: 0.34
Test set accuracy: 0.43
Test set accuracy: 0.48
Test set accuracy: 0.53


As we can see, no matter how big the test size is, the Gaussian classifier performs really badly on this dataset. Let's try with more samples.

In [6]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.13
Test set accuracy: 0.17
Test set accuracy: 0.21
Test set accuracy: 0.24
Test set accuracy: 0.33
Test set accuracy: 0.46


This is even worse! The reason why this algorithm is so fast, but also generalizes so poorly here, is that it learns its parameters by looking at each feature individually and collecting simple per-class statistics from each feature. Since our dataset is highly diverse, GaussianNB gives quite bad results.
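To make the "simple per-class statistics" concrete, here is a minimal sketch on a tiny hand-made array (the values are hypothetical and unrelated to the dataset above). It checks that the means stored by GaussianNB in `theta_` are simply the per-class, per-feature means of the training data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic dataset (hypothetical values): 6 samples, 2 features, 2 classes.
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0],
              [7.0, 1.0], [8.0, 2.0], [9.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X, y)

# Per-class, per-feature means computed by hand, one feature at a time.
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])

print(gnb.theta_)        # means learned by GaussianNB, shape (n_classes, n_features)
print(manual_means)      # the same numbers, computed without the classifier
print(gnb.class_prior_)  # class frequencies observed in the training data
```

Because each feature is summarized independently, training amounts to a single pass over the data, which is where the speed comes from.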

Let's look at the Bernoulli distribution for different test sizes.

In [9]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    
    # Bernoulli naive Bayes on the raw (unscaled) features
    bnb = BernoulliNB()

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

Test set accuracy: 0.72
Test set accuracy: 0.70
Test set accuracy: 0.71
Test set accuracy: 0.70
Test set accuracy: 0.73
Test set accuracy: 0.76


The performance is not that bad, but keep in mind that BernoulliNB assumes binary data (unlike GaussianNB, which works for any kind of continuous data). We should therefore perform some tuning and only keep boolean values in our dataset. Let's scale the data to see if it improves the accuracy:

In [12]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)
    # Bernoulli naive Bayes on the standardized features
    bnb = BernoulliNB()

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

Test set accuracy: 0.91
Test set accuracy: 0.90
Test set accuracy: 0.89
Test set accuracy: 0.88
Test set accuracy: 0.88
Test set accuracy: 0.87


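The results are much better now! Scaling is quite a game changer for this classifier, most likely because BernoulliNB binarizes its input at a threshold of 0.0 by default (the binarize parameter): after standardization, each feature effectively becomes "above or below its training mean".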
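A minimal sketch of this thresholding behaviour on synthetic data (not the dataset above; the feature ranges and the labelling rule are assumptions made for illustration). It shows that BernoulliNB with its default `binarize=0.0` on standardized data gives the same predictions as thresholding by hand and passing `binarize=None`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic, strictly positive features (hypothetical stand-in for a real dump).
X = rng.uniform(1.0, 100.0, size=(200, 5))
y = (X[:, 0] > 50.0).astype(int)

# Without scaling every value is > 0, so every feature binarizes to 1
# and the model can only fall back on the class prior.
raw = BernoulliNB().fit(X, y)

# After standardization, the default threshold 0.0 sits at each feature's mean.
X_std = StandardScaler().fit_transform(X)
scaled = BernoulliNB().fit(X_std, y)

# Equivalent model: binarize by hand and disable the internal thresholding.
X_bin = (X_std > 0.0).astype(float)
manual = BernoulliNB(binarize=None).fit(X_bin, y)

print("raw accuracy:    {:.2f}".format(raw.score(X, y)))
print("scaled accuracy: {:.2f}".format(scaled.score(X_std, y)))
print("same predictions:", np.array_equal(scaled.predict(X_std), manual.predict(X_bin)))
```

On real data the raw features are rarely all positive, so the gap may be smaller than in this deliberately extreme example.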

Now for the multinomial distribution. Note that MultinomialNB only accepts non-negative feature values, so we either have to filter our dataset and remove all rows where a feature value is below 0...

In [13]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
for col in cols:
    gt = gt.drop(gt[gt[col] < 0 ].index)
data = gt[cols]
print(data.shape)
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    
    # Multinomial naive Bayes on the rows that contain no negative value
    mnb = MultinomialNB()

    mnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(mnb.score(data_test, target_test)))

(3771, 119)
Test set accuracy: 0.77
Test set accuracy: 0.80
Test set accuracy: 0.83
Test set accuracy: 0.87
Test set accuracy: 0.92
Test set accuracy: 0.95


... or use some preprocessing to scale all values between 0 and 1:

In [19]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
print(data.shape)
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    # Rescale every feature to [0, 1] using statistics from the training split only
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    mnb = MultinomialNB()
    mnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(mnb.score(data_test, target_test)))

(7977, 119)
Test set accuracy: 0.92
Test set accuracy: 0.90
Test set accuracy: 0.89
Test set accuracy: 0.89
Test set accuracy: 0.88
Test set accuracy: 0.88
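The two ways of satisfying the non-negativity constraint of MultinomialNB can be compared side by side on a toy frame. This is only a sketch: the column names and values are hypothetical, and the vectorized mask is an alternative to the column-by-column `drop` loop used in the cells above, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "f1": [3.0, -1.0, 2.0, 5.0, 0.0, 4.0],
    "f2": [1.0, 2.0, -4.0, 0.0, 6.0, 2.0],
    "label": [0, 1, 0, 1, 0, 1],
})
cols = [c for c in df.columns if c != "label"]

# 1) Fitting directly on data with negative values is rejected.
try:
    MultinomialNB().fit(df[cols], df["label"])
    raised = False
except ValueError:
    raised = True

# 2) Option A: drop every row that contains a negative feature value.
non_negative = df[~(df[cols] < 0).any(axis=1)]

# 3) Option B: keep all rows and rescale each feature to [0, 1].
scaled = MinMaxScaler().fit_transform(df[cols])
mnb = MultinomialNB().fit(scaled, df["label"])

print("fit on raw data raised:", raised)
print("rows kept by option A:", non_negative.shape[0], "of", df.shape[0])
print("range after option B:", scaled.min(), scaled.max())
```

Option A discards samples (here two of six), while option B keeps every sample but changes the meaning of the feature values, so the two are not interchangeable.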


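Conclusion: on this dataset, the choice of preprocessing matters a lot for all the distributions we tested. The Gaussian distribution did not perform well even after standardization, while the Bernoulli distribution provided very good results once the data was standardized (0.87 to 0.91). The multinomial distribution performed quite well too (0.88 to 0.92 with min-max scaling), but it requires non-negative inputs, which means either removing some samples or rescaling the features.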
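The repeated scale-then-fit pattern can be written more compactly with scikit-learn pipelines. The sketch below runs on synthetic data (an assumption, since the real dumps are not available here) and pairs each variant with the scaler used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
# Synthetic stand-in for the real dump: 300 samples, 6 features, 3 classes.
y = rng.randint(0, 3, size=300)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(300, 6))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each pipeline fits its scaler on the training split only.
models = {
    "gaussian": make_pipeline(StandardScaler(), GaussianNB()),
    "bernoulli": make_pipeline(StandardScaler(), BernoulliNB()),
    "multinomial": make_pipeline(MinMaxScaler(), MultinomialNB()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print("{}: test accuracy {:.2f}".format(name, scores[name]))
```

The scores on this synthetic data say nothing about the real dataset; the point is only that a pipeline keeps the scaler and the classifier together and avoids fitting the scaler on test data by mistake.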

### Feature relevance

Since Naive Bayes assumes feature independence and outputs class probabilities, most feature importance criteria are not a direct fit. In practice, the importance of a feature should be no different from how skewed its distribution is across the classes in the set.
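If a relevance estimate is still wanted, two options can be sketched. Neither comes from this notebook: the first compares the per-class means stored in `theta_`, the second uses the model-agnostic `permutation_importance` from `sklearn.inspection`. The data is synthetic, built so that only feature 0 carries information about the label.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Synthetic data (assumption): 4 features, only feature 0 determines the label.
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

gnb = GaussianNB().fit(X, y)

# Option 1: how far apart the two class means are, feature by feature.
gap = np.abs(gnb.theta_[1] - gnb.theta_[0])
print("class-mean gap per feature:", np.round(gap, 3))

# Option 2: accuracy drop when one feature's values are shuffled.
# For brevity this is computed on the training data; a held-out set is preferable.
result = permutation_importance(gnb, X, y, n_repeats=10, random_state=0)
for idx, mean in enumerate(result.importances_mean):
    print("feature {}: importance {:.3f}".format(idx, mean))
```

Both measures should single out feature 0 here, but on correlated real-world features they can disagree, so they are best read as rough indications.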

### Test with Thomas datasets

In [20]:
gt = pd.read_csv("../dumps/2019-08.Merged_thomas.csv")
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)
scaler = StandardScaler()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Gaussian naive Bayes on the standardized features
clf = GaussianNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# Bernoulli naive Bayes on the same standardized features
clf = BernoulliNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

for col in cols:
    gt = gt.drop(gt[gt[col] < 0 ].index)
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Multinomial naive Bayes on the unscaled rows without negative values
clf = MultinomialNB()

clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

Accuracy on training set: 0.886
Accuracy on test set: 0.890
Accuracy on training set: 0.903
Accuracy on test set: 0.901
Accuracy on training set: 0.248
Accuracy on test set: 0.244


In [22]:
gt = pd.read_csv("../dumps/2019-09.Merged_thomas.csv")
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)
scaler = StandardScaler()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Gaussian naive Bayes on the standardized features
clf = GaussianNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# Bernoulli naive Bayes on the same standardized features
clf = BernoulliNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

for col in cols:
    gt = gt.drop(gt[gt[col] < 0 ].index)
data = gt[cols]
target = gt['label']
data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Multinomial naive Bayes on the unscaled rows without negative values
clf = MultinomialNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

Accuracy on training set: 0.989
Accuracy on test set: 0.986
Accuracy on training set: 0.933
Accuracy on test set: 0.930
Accuracy on training set: 0.223
Accuracy on test set: 0.218


### PCA

In [8]:
PCA_reduction('../dumps/2020.03.11-17.39.csv','gaussian')

  Variance    Training acc    Test acc    Components    Time (s)
----------  --------------  ----------  ------------  ----------
      1           0.698011    0.702142           119    0.136651
      0.99        0.823499    0.815369            98    0.13692
      0.95        0.95596     0.953077            77    0.131393
      0.9         0.958085    0.955117            60    0.127462
      0.85        0.952729    0.948657            47    0.125289


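Reducing the dimensionality with PCA dramatically improves the test accuracy here, from about 70% to about 95%, in nearly the same time (even slightly faster)! One plausible reason is that principal components are uncorrelated, which fits the naive independence assumption better than the raw features.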

In [9]:
PCA_reduction('../dumps/2020.03.11-17.39.csv','bernoulli')

  Variance    Training acc    Test acc    Components    Time (s)
----------  --------------  ----------  ------------  ----------
      1           0.920847    0.920095           119    0.142111
      0.99        0.919912    0.915675            98    0.15339
      0.95        0.922207    0.914655            77    0.14584
      0.9         0.916851    0.910575            60    0.135029
      0.85        0.918466    0.913635            47    0.136369


Here PCA slightly decreases the test accuracy (from 0.920 to between 0.911 and 0.916), and the running time only improves marginally, and only at 0.90 and 0.85 retained variance.
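The `PCA_reduction` helper lives in the local `utils` module and is not shown in this notebook, so the following is only a guess at the kind of pipeline behind one row of the tables above: standardize, keep enough principal components to retain a given fraction of the variance, then fit a naive Bayes model. The data is synthetic and the 0.95 threshold is just one of the values listed in the tables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic correlated features: 10 observed columns driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))
y = (latent[:, 0] > 0).astype(int)

# A float n_components in (0, 1) asks PCA to keep just enough components
# to retain that fraction of the variance (requires the full SVD solver).
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    GaussianNB(),
)
pipe.fit(X, y)

pca = pipe.named_steps["pca"]
print("components kept:", pca.n_components_)
print("variance retained: {:.3f}".format(pca.explained_variance_ratio_.sum()))
print("training accuracy: {:.3f}".format(pipe.score(X, y)))
```

Swapping `GaussianNB()` for `BernoulliNB()` in the pipeline gives the corresponding Bernoulli variant.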