# Naive Bayes Classifiers

Principle: naive Bayes classifiers are a family of classifiers closely related to the linear models discussed in the previous section. They tend to train even faster. The price paid for this efficiency is that naive Bayes models often generalize slightly worse than linear classifiers such as LogisticRegression and LinearSVC.

In [1]:
%load_ext autoreload
%autoreload

import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [2]:
gt = pd.read_csv('../../dumps/various_sizes/1K.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.61
Test set accuracy: 0.61
Test set accuracy: 0.68
Test set accuracy: 0.69
Test set accuracy: 0.78
Test set accuracy: 0.96


As we can see, the Gaussian classifier performs poorly on this dataset for most splits: accuracy stays between 0.61 and 0.78 until the test fraction reaches 0.9. Let's try with more samples.

In [3]:
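# NOTE: this path is the same 1K.csv as in In [2], so the output below is identical;
# point it at a larger dump to actually test with more samples.
gt = pd.read_csv('../../dumps/various_sizes/1K.csv')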
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.61
Test set accuracy: 0.61
Test set accuracy: 0.68
Test set accuracy: 0.69
Test set accuracy: 0.78
Test set accuracy: 0.96


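Same problem! The reason this algorithm is so fast, but also generalizes poorly, is that it learns its parameters by looking at each feature individually and collecting simple per-class statistics from each feature. For GaussianNB these statistics are the mean and variance of each feature within each class, and the features are treated as independent of one another given the class. Since our dataset is highly diverse, this simple model is a likely reason why GaussianNB gives quite poor results here.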
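
To make the "simple per-class statistics" concrete, here is a minimal from-scratch sketch of a Gaussian naive Bayes classifier using only the standard library. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the fixed `eps` added to each variance are all hypothetical choices (scikit-learn's own variance smoothing works differently).

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels, eps=1e-9):
    """Collect, per class, the prior plus one (mean, variance) pair per feature."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        columns = list(zip(*members))  # one tuple of values per feature
        # eps is a hypothetical guard against zero variance
        stats = [(mean(col), pvariance(col) + eps) for col in columns]
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class maximizing log prior + sum of per-feature log densities."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mu, var) in zip(row, stats):
            # Each feature contributes independently: this is the "naive" part
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (made up for illustration): two features, two classes
rows = [[1.0, 20.0], [1.2, 22.0], [0.8, 19.0], [3.0, 5.0], [3.3, 4.0], [2.9, 6.0]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(rows, labels)
pred_first = predict_gaussian_nb(model, [1.1, 21.0])
pred_second = predict_gaussian_nb(model, [3.1, 5.5])
print(pred_first, pred_second)  # a b
```

Training is a single pass that computes means and variances, which is why fitting is so cheap. The model never looks at how two features vary together, which limits what it can learn.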

Let's look at the Bernoulli distribution for different test sizes.

In [5]:
gt = pd.read_csv('../../dumps/various_sizes/1K.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    
    # Bernoulli naive Bayes on the raw (unscaled) features
    bnb = BernoulliNB()

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

Test set accuracy: 0.89
Test set accuracy: 0.87
Test set accuracy: 0.87
Test set accuracy: 0.87
Test set accuracy: 0.89
Test set accuracy: 0.88


These results are not bad, but keep in mind that BernoulliNB assumes binary data (unlike GaussianNB, which works with any kind of continuous data). By default it binarizes every feature at a threshold of 0 (its binarize parameter), so the model only sees whether each value is positive. We should therefore perform some tuning and keep only boolean values in our dataset. Let's scale the data to see if it improves the accuracy:

In [6]:
gt = pd.read_csv('../../dumps/various_sizes/1K.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)
    scaler = StandardScaler()
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)
    # Bernoulli naive Bayes on the standardized features
    bnb = BernoulliNB()

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

Test set accuracy: 0.86
Test set accuracy: 0.84
Test set accuracy: 0.85
Test set accuracy: 0.87
Test set accuracy: 0.90
Test set accuracy: 0.94


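Scaling clearly changes the results, although not uniformly: accuracy drops slightly for test sizes up to 0.4 (0.84 to 0.86, versus 0.87 to 0.89 unscaled), is unchanged at 0.6, and rises at 0.8 and 0.9 (0.90 and 0.94, versus 0.89 and 0.88). We can conclude that scaling has a real impact in the case of this distribution.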
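
One plausible explanation for this sensitivity (an assumption, not something verified on this dataset) is the default threshold of 0 that BernoulliNB applies to each feature. On raw non-negative values, the threshold separates zero from non-zero. After standardization, it separates below-mean from above-mean values. The sketch below reproduces only the thresholding step in plain Python; the helper names and the sample feature are hypothetical.

```python
from statistics import mean, pstdev


def binarize(values, threshold=0.0):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [0, 1, 2, 3, 10, 40]  # hypothetical non-negative feature, mean is about 9.33

raw_bits = binarize(raw)                  # 1 wherever the value is non-zero
scaled_bits = binarize(standardize(raw))  # 1 wherever the value is above the mean

print(raw_bits)     # [0, 1, 1, 1, 1, 1]
print(scaled_bits)  # [0, 0, 0, 0, 1, 1]
```

The same column thus carries different binary information before and after scaling, so the two BernoulliNB runs are effectively trained on different features.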

Now for the multinomial distribution. Note that it only accepts non-negative values, so we have to preprocess the data first, here by scaling all values between 0 and 1:

In [9]:
gt = pd.read_csv('../../dumps/various_sizes/1K.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
print(data.shape)
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    mnb = MultinomialNB()
    # Rescale every feature to [0, 1] using the training split only
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    mnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(mnb.score(data_test, target_test)))

(1000, 119)
Test set accuracy: 0.93
Test set accuracy: 0.92
Test set accuracy: 0.91
Test set accuracy: 0.90
Test set accuracy: 0.92
Test set accuracy: 0.92
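
A caveat with this preprocessing: because the minimum and maximum are learned from the training split only, test values are not guaranteed to stay within [0, 1]. The plain-Python sketch below illustrates the arithmetic on made-up numbers; the helper names are hypothetical. Recent scikit-learn versions offer a clip option on MinMaxScaler for this situation, which may be worth checking against the installed version.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


train = [2.0, 4.0, 6.0, 10.0]  # made-up training feature
test = [1.0, 5.0, 12.0]        # made-up test feature with unseen extremes

lo, hi = fit_min_max(train)
scaled_test = apply_min_max(test, lo, hi)
print(scaled_test)  # [-0.125, 0.375, 1.25]

# Clipping forces the scaled test values back into [0, 1]
clipped_test = [min(1.0, max(0.0, v)) for v in scaled_test]
print(clipped_test)  # [0.0, 0.375, 1.0]
```

With a small training split (for example a test size of 0.9), the chance of unseen extremes in the test data grows, so this is worth keeping in mind when reading the scores above.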


Conclusion: preprocessing plays an important role for the distributions we tested. GaussianNB did not perform well on standardized data (0.61 to 0.78 for test sizes up to 0.8). BernoulliNB reached 0.84 to 0.94 after standardization, against 0.87 to 0.89 on raw data. MultinomialNB also performed well and was the most consistent (0.90 to 0.93), but it requires non-negative inputs, so its values were scaled to the range [0, 1] beforehand.
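
The experiments above all repeat the same split, scale, fit, and score loop. A possible way to factor it out is sketched below, under several assumptions: the helper name `accuracy_by_test_size` is hypothetical, synthetic data from `make_classification` stands in for the CSV, which is not available here, and the scaler is bundled with the classifier in a scikit-learn pipeline so that it is refitted on each training split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def accuracy_by_test_size(model, X, y, test_sizes, random_state=0):
    """Return {test_size: test accuracy}, refitting the model on each split."""
    scores = {}
    for size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=size, random_state=random_state)
        model.fit(X_train, y_train)  # refits the scaler and the classifier
        scores[size] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for the dataset (assumption: 1000 samples, 20 features)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipeline = make_pipeline(StandardScaler(), GaussianNB())
scores = accuracy_by_test_size(pipeline, X, y, [0.1, 0.2, 0.4, 0.6, 0.8, 0.9])

for size, acc in scores.items():
    print("test_size={:.1f}  accuracy={:.2f}".format(size, acc))
```

Swapping in `BernoulliNB()` or `make_pipeline(MinMaxScaler(), MultinomialNB())` would reproduce the other experiments with the same helper, which makes the comparison between distributions easier to keep consistent.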