# Naive Bayes Classifier for Natural Language Processing

This article performs sentiment analysis on a series of tweets.
Some of them pertain to natural disasters or terrorist incidents
(class 1); the others are innocent drivel (class 0).

In this post we will learn:

1. What a Naive Bayes classifier is.

2. How to implement one in Python.

3. How to review its performance.

##  Naive Bayes Probabilities

Training a model to label a sample of text as positive / negative
is known as Natural Language Processing (NLP) binary classification.

A common choice of model for such a task is the Naive Bayes classifier.
Using [Bayes' theorem](https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c),
we can formulate our outcome variable as the probability of a class C_j (here class 0 or class 1) given some input text X. In maths speak this is P(C_j | X). We can calculate this from Bayes' theorem as

P(C_j | X) = P(X | C_j) P(C_j) / P(X).

From the above, we have three quantities to calculate.

#### __P(X):__

Given that the probability for every class has the same P(X) as its denominator, this constant does not change which class scores highest and can be disregarded.
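As a quick check of this claim, the sketch below uses made-up values for P(X | C_j) and P(C_j) (illustrative assumptions, not numbers from the tweet data) and compares the winning class with and without dividing by P(X).

```python
# Hypothetical likelihoods P(X | C_j) and priors P(C_j) for the two classes.
likelihood = {0: 0.002, 1: 0.005}
prior = {0: 0.6, 1: 0.4}

# Unnormalised scores: P(X | C_j) * P(C_j)
scores = {c: likelihood[c] * prior[c] for c in prior}

# P(X) is the sum of the unnormalised scores over all classes.
p_x = sum(scores.values())
posterior = {c: scores[c] / p_x for c in scores}

# The class with the highest score is the same either way.
best_unnormalised = max(scores, key=scores.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalised, best_posterior)
```

Dividing every score by the same positive constant rescales them but cannot reorder them, which is why only the numerator needs to be computed.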

#### __P(C_j):__ 

This is simple enough. It is just the relative
fraction of class _j_ in the data set (i.e.
for the positive class, this is the number of
positives divided by the total number of samples
in the training data).
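A minimal sketch of this calculation, assuming the training labels are available as a plain Python list (the `labels` values below are hypothetical):

```python
from collections import Counter

# Hypothetical training labels: 1 = disaster tweet, 0 = everything else.
labels = [1, 0, 0, 1, 0]

# Count how often each class occurs and divide by the number of samples.
class_counts = Counter(labels)
total = len(labels)
priors = {c: n / total for c, n in class_counts.items()}
print(priors)
```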

#### __P(X | C_j):__

Representing the input as a set of features (words)
x1, x2 ... xn, we have X = (x1, x2 ... xn), so
we can rewrite P(X | C_j) as P(x1, x2 ... xn | C_j).
Now comes the trick. The word _Naive_ in Naive Bayes
classifiers means that we make the assumption that,
given the class, the probabilities of all words are
independent of each other. This chiefly means that we assume
the order of the words doesn't matter (an incorrect assumption,
but one that often doesn't introduce too much error). It
also allows us to rewrite our conditional
probability as

P(x1,x2...xn | C_j) = P(x1|C_j) P(x2 | C_j) .... P(xn | C_j).

The probability of word i given class j, P(xi | C_j), is then

P(xi | C_j) = count(xi, C_j) / sum_k count(xk, C_j).

i.e. this counts all occurrences of word xi in
all inputs of class C_j and divides them by the
total count of all vocabulary words in the inputs of that class.
Laplace smoothing (adding a small constant to every count)
can be used to mitigate the effects of words with zero
occurrences in a class and to avoid dividing by zero.
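The sketch below shows one way to compute this smoothed estimate for a single class. The tokenised tweets, the vocabulary, and the helper name `word_likelihood` are hypothetical, and `alpha=1.0` corresponds to add-one (Laplace) smoothing.

```python
from collections import Counter

# Hypothetical tokenised tweets that all belong to one class C_j.
class_docs = [["fire", "in", "the", "forest"], ["forest", "fire", "spreading"]]
# Hypothetical vocabulary built from the whole training set (all classes).
vocab = {"fire", "in", "the", "forest", "spreading", "lunch"}

# count(xi, C_j) for every word, and the total word count for the class.
word_counts = Counter(w for doc in class_docs for w in doc)
total_words = sum(word_counts.values())

def word_likelihood(word, alpha=1.0):
    """Laplace-smoothed estimate of P(word | C_j)."""
    return (word_counts[word] + alpha) / (total_words + alpha * len(vocab))

print(word_likelihood("fire"))   # word seen in this class
print(word_likelihood("lunch"))  # word never seen in this class, still non-zero
```

Adding `alpha` to every count and `alpha * len(vocab)` to the denominator keeps the estimates summing to one over the vocabulary while ensuring no word gets a probability of exactly zero.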

In summary, the Naive Bayes score for
input X, P(X | C_j) P(C_j), is calculated by
multiplying together P(xi | C_j) for every word
in the input and multiplying the result by
P(C_j). Let's now set one up using scikit-learn.
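Before handing the work to a library, the whole calculation can be sketched by hand. The example below is a minimal illustration on a hypothetical four-tweet training set. It makes two choices that are assumptions rather than part of the description above: it sums logarithms instead of multiplying raw probabilities (to avoid numerical underflow on long inputs), and it skips words that never appear in the training vocabulary.

```python
import math
from collections import Counter

# Hypothetical tokenised training tweets with labels (1 = disaster, 0 = other).
train = [
    (["forest", "fire", "near", "town"], 1),
    (["earthquake", "hits", "town"], 1),
    (["i", "love", "my", "lunch"], 0),
    (["lunch", "was", "fire"], 0),
]

vocab = {w for doc, _ in train for w in doc}
classes = sorted({label for _, label in train})

# P(C_j): relative frequency of each class in the training data.
priors = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
# count(xi, C_j) per class, and the total word count per class.
word_counts = {c: Counter(w for doc, y in train if y == c for w in doc) for c in classes}
totals = {c: sum(word_counts[c].values()) for c in classes}

def log_score(tokens, c, alpha=1.0):
    """log P(C_j) + sum_i log P(xi | C_j), with Laplace smoothing."""
    score = math.log(priors[c])
    for w in tokens:
        if w not in vocab:
            continue  # assumption: ignore words never seen in training
        p = (word_counts[c][w] + alpha) / (totals[c] + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(tokens):
    """Return the class with the highest log score."""
    return max(classes, key=lambda c: log_score(tokens, c))

print(predict(["earthquake", "near", "town"]))
print(predict(["love", "lunch"]))
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product P(X | C_j) P(C_j).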

## Scikit-learn implementation:

Python's scikit-learn module includes excellent Naive Bayes functionality for 