# Naive Bayes Classifiers

Principle: Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models discussed in the previous section. However, they tend to be even faster to train. The price paid for this efficiency is that naive Bayes models often generalize slightly worse than linear classifiers such as LogisticRegression and LinearSVC.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

In [4]:
gt = pd.read_csv('../dumps/2020.01.13-14.25.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.33
Test set accuracy: 0.34
Test set accuracy: 0.34
Test set accuracy: 0.40
Test set accuracy: 0.46
Test set accuracy: 0.51


As we can see, no matter how large the test set is, the Gaussian classifier performs really badly on this dataset. Let's try with more samples.

In [5]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.13
Test set accuracy: 0.17
Test set accuracy: 0.21
Test set accuracy: 0.24
Test set accuracy: 0.32
Test set accuracy: 0.45


This is even worse! The reason this algorithm is so fast, but also quite bad at generalization, is that it learns its parameters by looking at each feature individually and collecting simple per-class statistics from each feature. Since our dataset is hugely diverse, GaussianNB gives quite bad results.
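To make the "simple per-class statistics" concrete, here is a minimal sketch of what a Gaussian naive Bayes model estimates: one mean and one variance per feature and per class, plus a prior per class. It runs on synthetic data, not on our dumps; the helper names `fit_per_class_stats` and `predict_from_stats` and the arrays `X_demo` and `y_demo` are made up for this illustration. It also ignores the small variance smoothing that scikit-learn applies, so it should be read as an approximation of GaussianNB rather than its actual implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def fit_per_class_stats(X, y):
    """Collect per-class, per-feature means and variances, plus class priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors


def predict_from_stats(X, classes, means, variances, priors):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log-likelihoods."""
    log_posteriors = []
    for mean, var, prior in zip(means, variances, priors):
        # Each feature is treated independently: this is the "naive" assumption.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
        log_posteriors.append(np.log(prior) + log_lik)
    return classes[np.argmax(np.array(log_posteriors), axis=0)]


# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(3.0, 1.0, size=(50, 3)),
])
y_demo = np.array([0] * 50 + [1] * 50)

classes, means, variances, priors = fit_per_class_stats(X_demo, y_demo)
manual_pred = predict_from_stats(X_demo, classes, means, variances, priors)

sk_model = GaussianNB().fit(X_demo, y_demo)
sk_pred = sk_model.predict(X_demo)
print("Agreement with GaussianNB: {:.2f}".format(np.mean(manual_pred == sk_pred)))
```

Because training only requires one pass to compute these statistics, fitting is very fast. The flip side is that each class is summarized by a single bell curve per feature, which is a poor fit when the samples within one class are very diverse.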

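Let's look at the Bernoulli distribution for different test sizes. Note that BernoulliNB binarizes each feature before fitting (with scikit-learn's default `binarize=0.0`, every value above 0 becomes 1 and everything else becomes 0).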

In [6]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    bnb = BernoulliNB()

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

Test set accuracy: 0.72
Test set accuracy: 0.70
Test set accuracy: 0.71
Test set accuracy: 0.70
Test set accuracy: 0.73
Test set accuracy: 0.76


And for the multinomial distribution:

In [8]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    mnb = MultinomialNB()

    mnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(mnb.score(data_test, target_test)))

ValueError: Negative values in data passed to MultinomialNB (input X)

Since MultinomialNB doesn't support negative values, fitting fails. Either we drop this model, or we rescale the features so that they are all non-negative (for example with min-max scaling).
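If we wanted to keep MultinomialNB, one possible way to rescale the features is sketched below: a `MinMaxScaler` maps each feature of the training set to the range [0, 1] before the classifier sees it. This was not run on our dumps; the synthetic arrays `X_demo` and `y_demo` are assumptions for illustration, and whether rescaled continuous features make sense as inputs to a model designed for count-like data remains an open question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data containing negative values (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 4)),
    rng.normal(1.0, 1.0, size=(100, 4)),
])
y_demo = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fitting on the raw data reproduces the error seen above.
try:
    MultinomialNB().fit(X_train, y_train)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True
print("Raw fit raised ValueError:", raw_fit_failed)

# The scaler is fitted on the training split only, then reused on the test split.
model = make_pipeline(MinMaxScaler(), MultinomialNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Test set accuracy: {:.2f}".format(accuracy))
```

Wrapping the scaler and the classifier in a single pipeline keeps the test split out of the scaling step, so the reported accuracy is not inflated by information leaking from the test data.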

Conclusion: naive Bayes classifiers are not well suited to our classification problem, where the features are really sparse.