## Naive Bayes Classifiers

*Naive Bayes* classifiers are a family of classifiers that are quite similar to linear models. However, they tend to be even faster to train. The price paid for this efficiency is that Naive Bayes models often generalize slightly worse than linear classifiers such as *LogisticRegression* and *LinearSVC*.

Naive Bayes models are so efficient because they learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature.

There are 3 kinds of Naive Bayes classifiers implemented in scikit-learn:
- GaussianNB (can be applied to any continuous data)
- BernoulliNB (assumes binary data)
- MultinomialNB (assumes integer count data, such as how many times each word appears in a sentence)

### BernoulliNB

The *BernoulliNB* classifier counts, for each class, how often each feature is nonzero.

The example below has 4 data points with 4 binary features each. There are 2 classes, 0 and 1.

For class 0 (the first and third data points):
- the first feature is zero 2 times and nonzero 0 times
- the second feature is zero 1 time and nonzero 1 time
- the third feature is zero 2 times and nonzero 0 times
- the fourth feature is zero 0 times and nonzero 2 times

For class 1 (the second and fourth data points):
- the first feature is zero 0 times and nonzero 2 times
- the second feature is zero 2 times and nonzero 0 times
- the third feature is zero 0 times and nonzero 2 times
- the fourth feature is zero 1 time and nonzero 1 time

In [2]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import sklearn
from IPython.display import display
import mglearn

# Suppress all warnings (this includes deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

In [3]:
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

Counting the nonzero entries per class in essence looks like this:

In [4]:
counts = {}
# Iterate over each class label
for label in np.unique(y):
    # Sum each feature over the rows of this class;
    # for binary data this counts the entries equal to 1
    counts[label] = X[y == label].sum(axis=0)

print("Feature counts:\n", counts)

Feature counts:
 {0: array([0, 1, 0, 2]), 1: array([2, 0, 2, 1])}
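
As a cross-check, the same counts can be read off a fitted *BernoulliNB* model. This is a minimal sketch that assumes a scikit-learn version exposing the `feature_count_` and `class_count_` attributes; the names `nonzero_counts` and `zero_counts` are introduced here for illustration. The zero counts from the lists above follow by subtracting the nonzero counts from the number of samples in each class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same toy data as above
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB().fit(X, y)

# Rows correspond to classes, columns to features
nonzero_counts = clf.feature_count_
# Samples per class minus nonzero counts gives the zero counts
zero_counts = clf.class_count_[:, np.newaxis] - nonzero_counts

print("Nonzero counts per class:\n", nonzero_counts)
print("Zero counts per class:\n", zero_counts)
```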


The other two Naive Bayes models, *MultinomialNB* and *GaussianNB*, differ slightly in the statistics they compute. *MultinomialNB* takes into account the average value of each feature for each class, while *GaussianNB* stores the average value as well as the standard deviation of each feature for each class.
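
To make the *GaussianNB* statistics concrete, the sketch below computes per-class means and standard deviations with NumPy and compares the means with the `theta_` attribute of a fitted model. The values in `X_cont` are made up for illustration. The variances stored by the model are left out of the comparison, because scikit-learn adds a small smoothing term to them.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous data: 4 samples, 2 features
X_cont = np.array([[1.0, 2.0],
                   [3.0, 0.5],
                   [1.4, 2.6],
                   [2.6, 0.9]])
y_cont = np.array([0, 1, 0, 1])

means = {}
stds = {}
for label in np.unique(y_cont):
    # Select the rows of this class and summarize each feature
    rows = X_cont[y_cont == label]
    means[int(label)] = rows.mean(axis=0)
    stds[int(label)] = rows.std(axis=0)

gnb = GaussianNB().fit(X_cont, y_cont)

print("Per-class means:", means)
print("Per-class standard deviations:", stds)
print("GaussianNB theta_:\n", gnb.theta_)
```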

To make a prediction, a data point is compared to the statistics for each of the classes, and the best matching class is predicted. Interestingly, for both *MultinomialNB* and *BernoulliNB*, this leads to a prediction formula that has the same form as that of the linear models.
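
One way to see the linear form for *BernoulliNB*: in the class score, each feature contributes log p when it is 1 and log(1 - p) when it is 0, where p is the estimated probability that the feature is nonzero in that class. Rearranging gives a weight per feature plus a per-class intercept. The sketch below rebuilds that form from the `feature_log_prob_` and `class_log_prior_` attributes of a fitted model; `W`, `b`, and `scores` are names chosen here for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB(alpha=1.0).fit(X, y)

log_p = clf.feature_log_prob_           # log P(x_j = 1 | class)
log_not_p = np.log1p(-np.exp(log_p))    # log P(x_j = 0 | class)

# Linear form: score = X @ W.T + b
W = log_p - log_not_p                   # shape (n_classes, n_features)
b = log_not_p.sum(axis=1) + clf.class_log_prior_

scores = X @ W.T + b
linear_pred = clf.classes_[scores.argmax(axis=1)]

print("Linear-form predictions:", linear_pred)
print("BernoulliNB predictions:", clf.predict(X))
```

Both lines should print the same labels, since taking the class with the highest linear score is equivalent to picking the most probable class.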

### Strengths, Weaknesses, and Parameters

*MultinomialNB* and *BernoulliNB* have a single parameter, *alpha*, which controls model complexity. The way *alpha* works is that the algorithm adds to the data *alpha* many virtual data points that have positive values for all the features. This results in a "smoothing" of the statistics.

A large *alpha* means more smoothing, resulting in less complex models.  Setting *alpha* isn't critical for good performance, but tuning it usually improves accuracy somewhat.
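
The effect of *alpha* can be observed on the toy data from the *BernoulliNB* section. The sketch below fits the model with three arbitrarily chosen values of *alpha* and prints the estimated probability that each feature is nonzero in each class, recovered from `feature_log_prob_`. The dictionary `probs` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

probs = {}
for alpha in [0.1, 1.0, 10.0]:
    clf = BernoulliNB(alpha=alpha).fit(X, y)
    # Estimated P(feature is nonzero | class), one row per class
    probs[alpha] = np.exp(clf.feature_log_prob_)
    print(f"alpha={alpha}:\n", probs[alpha].round(3))
```

As *alpha* grows, the estimates are pulled toward 0.5, so the two classes look more alike and the model becomes less complex.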

*GaussianNB* is mostly used on very high-dimensional data, while the other 2 variants of Naive Bayes are widely used for sparse data such as text.  *MultinomialNB* usually performs better than *BernoulliNB*, particularly on datasets with a relatively large number of nonzero features.
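
As a rough illustration of the text use case, the sketch below turns a handful of invented sentences into sparse word counts with `CountVectorizer` and fits a *MultinomialNB* on them. The documents and labels are hypothetical and far too few for a real model; they only show how the pieces fit together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = advertisement, 0 = work message
docs = ["cheap pills buy now",
        "buy cheap watches now",
        "meeting schedule for monday",
        "project meeting notes"]
labels = [1, 1, 0, 0]

# Sparse matrix of word counts, one column per vocabulary word
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0).fit(X_counts, labels)

# New documents must be transformed with the same vocabulary
new_docs = ["cheap watches", "monday meeting"]
predictions = clf.predict(vectorizer.transform(new_docs))
print(predictions)
```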

The Naive Bayes models share many of the strengths and weaknesses of the linear models.  They are very fast to train and predict, and the training procedure is easy to understand.  The models work very well with high-dimensional sparse data and are relatively robust to the parameters.  Naive Bayes models are great baseline models and are often used on very large datasets, where training even a linear model might take too long.