# Naive Bayes Classifiers

Naive Bayes classifiers tend to be even faster to train than linear models. The price for this efficiency is that their generalization performance is often slightly worse than that of linear classifiers such as LinearSVC and LogisticRegression.

The reason that naive Bayes models are so __efficient__ is that they learn parameters by
__looking at each feature individually and collecting simple per-class statistics from each
feature__.

Three kinds of naive Bayes classifiers are implemented in scikit-learn: GaussianNB, MultinomialNB, and BernoulliNB.

__GaussianNB__ can be applied to
any __continuous data__, while __BernoulliNB assumes binary data__ and __MultinomialNB
assumes count data__ (that is, that each feature represents an integer count of something,
like how often a word appears in a sentence). BernoulliNB and MultinomialNB
are mostly used in text data classification.

__Import libraries__

In [2]:
import numpy as np

The BernoulliNB classifier counts how often every feature of each class is not zero.
This is most easily understood with an example:

In [3]:
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

Here, we have four data points, with four binary features each. There are two classes,
0 and 1. For class 0 (the first and third data points), the first feature is zero two times
and nonzero zero times, the second feature is zero one time and nonzero one time,
and so on. These same counts are then calculated for the data points in the second
class. Counting the nonzero entries per class in essence looks like this:

In [4]:
counts = {}
for label in np.unique(y):
    # iterate over each class
    # count (sum) entries of 1 per feature
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts:\n{}".format(counts))

Feature counts:
{0: array([0, 1, 0, 2]), 1: array([2, 0, 2, 1])}
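
The manual count above can be compared with what scikit-learn stores after fitting. The following is a minimal sketch, assuming the same toy `X` and `y` as in the earlier cell; the fitted `feature_count_` attribute of `BernoulliNB` should hold one row of per-feature counts for each class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same toy data as in the cell above
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB()
clf.fit(X, y)

# feature_count_ has one row per class and one column per feature
print("Classes: {}".format(clf.classes_))
print("Fitted feature counts:\n{}".format(clf.feature_count_))
```

The rows follow the order of `clf.classes_`, so the first row corresponds to class 0 and the second to class 1.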


The other two naive Bayes models, MultinomialNB and GaussianNB, are slightly different
in what kinds of statistics they compute. __MultinomialNB takes into account the
average value of each feature for each class__, while __GaussianNB stores the average value
as well as the standard deviation of each feature for each class__.
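
To illustrate the statistics GaussianNB relies on, here is a small sketch on made-up continuous data (the arrays `X_cont` and `y_cont` are hypothetical and not part of the original notes). It computes the per-class mean and standard deviation by hand and prints the fitted `theta_` attribute, which holds the per-class means.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous data: 6 samples, 2 features, 2 classes
X_cont = np.array([[1.0, 2.1],
                   [1.2, 1.9],
                   [0.8, 2.0],
                   [3.0, 0.5],
                   [3.2, 0.7],
                   [2.8, 0.6]])
y_cont = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X_cont, y_cont)

# Per-class statistics computed by hand
for label in np.unique(y_cont):
    rows = X_cont[y_cont == label]
    print("class {}: mean={}, std={}".format(
        label, rows.mean(axis=0), rows.std(axis=0)))

# theta_ holds the per-class feature means learned by GaussianNB
print("gnb.theta_:\n{}".format(gnb.theta_))
```

The fitted model also keeps a per-class variance for each feature. The attribute name for it has changed between scikit-learn versions, so it is left out of this sketch.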

To make a prediction, a data point is compared to the statistics for each of the classes,
and the best matching class is predicted. Interestingly, for both MultinomialNB and
BernoulliNB, this leads to a prediction formula that is of the same form as in the linear
models.
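
The linear form can be made concrete with a short sketch. It assumes a small made-up count matrix (`X_counts` and `y_counts` are hypothetical) and uses the fitted `feature_log_prob_` and `class_log_prior_` attributes of `MultinomialNB` as the weights and intercepts of a per-class linear score.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count data: 4 samples, 3 features, 2 classes
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 4, 1],
                     [1, 3, 2]])
y_counts = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X_counts, y_counts)

# Linear score per class: w_c . x + b_c
W = mnb.feature_log_prob_   # shape (n_classes, n_features)
b = mnb.class_log_prior_    # shape (n_classes,)
scores = X_counts @ W.T + b

# Predict the class with the highest linear score
manual_pred = mnb.classes_[np.argmax(scores, axis=1)]
print("Manual:  {}".format(manual_pred))
print("predict: {}".format(mnb.predict(X_counts)))
```

For BernoulliNB the score has an additional term for features that are zero, but it remains linear in the input.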

## Strengths, Weaknesses, and Parameters

MultinomialNB and BernoulliNB have a single parameter, __alpha__, which controls
model complexity. The way alpha works is that __the algorithm adds to the data alpha
many virtual data points that have positive values for all the features__. This results in a
“smoothing” of the statistics. __A large alpha means more smoothing, resulting in less
complex models__. The algorithm’s performance is relatively robust to the setting of
alpha, meaning that __setting alpha is not critical for good performance__. However,
tuning it usually improves accuracy somewhat.
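
The effect of alpha can be seen on the toy binary data from earlier. This is a rough sketch, not a tuning recipe: it fits `BernoulliNB` with three arbitrarily chosen alpha values and prints the smoothed per-class estimates of how likely each feature is to be nonzero.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same toy binary data as in the earlier example
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

probs_by_alpha = {}
for alpha in [0.1, 1.0, 10.0]:
    clf = BernoulliNB(alpha=alpha).fit(X, y)
    # exp(feature_log_prob_) is the smoothed estimate of
    # P(feature is nonzero | class)
    probs_by_alpha[alpha] = np.exp(clf.feature_log_prob_)
    print("alpha={}:\n{}".format(alpha, probs_by_alpha[alpha].round(3)))
```

As alpha grows, the estimates are pulled toward 0.5, so the two classes look more alike and the model becomes less complex.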

__GaussianNB is mostly used on very high-dimensional data__, while the __other two variants
of naive Bayes are widely used for sparse count data such as text__. MultinomialNB
usually performs better than BernoulliNB, particularly on datasets with a relatively
large number of nonzero features (i.e., large documents).
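
A typical text workflow might look like the following sketch. The four-document corpus and its labels are invented purely for illustration; `CountVectorizer` turns each document into sparse word counts, which `MultinomialNB` then classifies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam-like, 0 = work-related
docs = ["free money now",
        "win free prize",
        "meeting at noon",
        "project meeting notes"]
labels = [1, 1, 0, 0]

# Word counts feed directly into the multinomial model
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(docs, labels)

new_docs = ["free prize money", "notes for the meeting"]
predictions = text_clf.predict(new_docs)
print("Predictions: {}".format(predictions))
```

Words that never appeared in the training documents (here "for" and "the") are ignored by the vectorizer at prediction time.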

The naive Bayes models share many of the strengths and weaknesses of the linear
models. __They are very fast to train and to predict, and the training procedure is easy
to understand__. The models work very well with high-dimensional sparse data and are
relatively robust to the parameters. Naive Bayes models are great baseline models and
are often used on very large datasets, where training even a linear model might take
too long.

---

# Important Points

- Naive Bayes classifiers tend to be faster to train than linear models, at the cost of slightly worse performance.
- They are efficient because they look at each feature individually and collect simple per-class statistics.
- GaussianNB: continuous data.
- MultinomialNB: count data.
- BernoulliNB: binary data.
- The parameter alpha (MultinomialNB and BernoulliNB) controls model complexity. A larger alpha means more smoothing and a less complex model (similar in role to regularization in linear models).
- Strengths: fast to train and predict, and relatively robust to parameter settings. Tuning alpha is not critical, though it usually improves accuracy somewhat.
- GaussianNB is mostly used for very high-dimensional data.
- MultinomialNB and BernoulliNB are mostly used for sparse count data such as text. MultinomialNB usually performs better than BernoulliNB, particularly on large documents.