# Naive Bayes
Naive Bayes is a classification technique based on Bayes' theorem with the assumption of independence among predictors. In other words, it assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
* Example: A fruit may be considered an apple if it is red, round, and about 3 inches in diameter.
* Even if these features depend on one another, each is treated as contributing independently to the probability that the fruit is an apple, which is why the method is called "naive".

Naive Bayes is easy to build and particularly useful for large datasets.

#### Calculation
Bayes' theorem provides a way of calculating the posterior probability $P(c|x)$ from $P(c)$, $P(x)$ and $P(x|c)$:

$$P(c|x) = \frac{P(x|c) P(c)}{P(x)}$$

Where:
* $P(c|x)$ is the posterior probability of *class* (c, target) given *predictor* (x, attributes)
* $P(c)$ is the prior probability of *class* 
* $P(x|c)$ is the likelihood, which is the probability of *predictor* given *class*
* $P(x)$ is the prior probability of *predictor* 
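
The "naive" assumption matters once $x$ consists of several features $x_1, \dots, x_n$ (notation introduced here for illustration). If the features are assumed to be conditionally independent given the class, the likelihood factorizes into a product, and because $P(x)$ is the same for every class it can be dropped when classes are only being compared:

$$P(c|x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i|c)$$

The predicted class is then the one that maximizes this product:

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i|c)$$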

### How it works
We'll use an example dataset: does a person go outside and play depending on the weather? We'll try to classify whether someone will play based on the weather conditions.

**Steps**
1. Convert the data to a frequency table
2. Create a Likelihood table by finding the probabilities
    * For example, $P(Overcast)=0.29$ and $P(Playing)=0.64$
3. Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction

![NB ex](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/Bayes_41.png)
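
Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration rather than a definitive implementation: the `observations` list is an assumption, reconstructed so that its counts agree with the probabilities quoted in this section (14 days, 9 "Yes", 5 "Sunny", 4 "Overcast"). In particular, the Yes/No split for "Overcast" and "Rainy" is assumed.

```python
from collections import Counter

# Assumed (weather, played) observations, chosen to match the counts
# quoted in the text; the Overcast and Rainy splits are an assumption.
observations = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
    + [("Overcast", "Yes")] * 4
    + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 3
)

# Step 1: frequency table, i.e. counts of each (weather, played) pair
frequency = Counter(observations)

# Step 2: likelihood table, i.e. marginal and conditional probabilities
n = len(observations)
weather_counts = Counter(w for w, _ in observations)
play_counts = Counter(p for _, p in observations)

p_weather = {w: c / n for w, c in weather_counts.items()}
p_play = {p: c / n for p, c in play_counts.items()}
p_weather_given_play = {
    (w, p): frequency[(w, p)] / play_counts[p]
    for w in weather_counts
    for p in play_counts
}

print(round(p_weather["Overcast"], 2))  # 0.29
print(round(p_play["Yes"], 2))          # 0.64
```

A `Counter` returns 0 for pairs that never occur, so combinations missing from the data simply get a conditional probability of 0.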

**Hypothesis:** People will play if weather is sunny.

Solving via posterior probability:
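* $P(Yes|Sunny) = P(Sunny|Yes) \times \frac{P(Yes)}{P(Sunny)}$
    * $P(Sunny|Yes) = \frac{3}{9} = 0.33$
    * $P(Yes) = \frac{9}{14} = 0.64$
    * $P(Sunny) = \frac{5}{14} = 0.36$
* $P(Yes|Sunny) = 0.33 \times \frac{0.64}{0.36} = 0.60$
* Since $P(No|Sunny) = 1 - 0.60 = 0.40$ is lower, "Yes" has the highest posterior probability and the prediction is that people will play.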
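
The same calculation can be checked in code for both classes. The sketch below uses only the counts from the worked example; the counts for "No" follow by subtraction. The `posterior` helper is a hypothetical name introduced for illustration, and `Fraction` is used so that rounding does not affect the result.

```python
from fractions import Fraction

# Counts taken from the worked example
n_days = 14
n_yes = 9
n_sunny = 5
n_sunny_and_yes = 3

# Counts for the "No" class follow by subtraction
n_no = n_days - n_yes                       # 5
n_sunny_and_no = n_sunny - n_sunny_and_yes  # 2


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_sunny = Fraction(n_sunny, n_days)
p_yes_given_sunny = posterior(
    Fraction(n_sunny_and_yes, n_yes), Fraction(n_yes, n_days), p_sunny
)
p_no_given_sunny = posterior(
    Fraction(n_sunny_and_no, n_no), Fraction(n_no, n_days), p_sunny
)

print(float(p_yes_given_sunny))  # 0.6
print(float(p_no_given_sunny))   # 0.4
```

"Yes" has the higher posterior, so the prediction for a sunny day is that people will play.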

### Pros
* Easy and fast to predict the class of new data.
* Performs well in multi-class problems.
* When the independence assumption holds, Naive Bayes can perform better than models such as logistic regression, and it needs less training data.
* Performs well with categorical input variables.

### Cons
* If a categorical variable in the test data has a category that was not observed in training, the model assigns it a probability of 0, which zeroes out the whole posterior and makes a sensible prediction impossible
    * This is known as "Zero Frequency"
    * We can use a smoothing technique like Laplace estimation to solve this
* It is a poor probability estimator: the predicted class is often right, but the probability outputs themselves should not be taken too seriously
* The assumption of independent predictors rarely holds in real life
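
Laplace (additive) smoothing for the "Zero Frequency" problem can be sketched as follows. The function name and the counts are hypothetical, chosen only for illustration: adding `alpha` to every category count keeps an unseen category from receiving a probability of exactly 0.

```python
def smoothed_likelihood(count, class_total, n_categories, alpha=1.0):
    """Additive (Laplace) estimate of P(category | class)."""
    return (count + alpha) / (class_total + alpha * n_categories)


# Hypothetical counts: a category never seen with this class
# (0 of 5 samples), for a feature with 3 possible categories.
unsmoothed = 0 / 5
smoothed = smoothed_likelihood(count=0, class_total=5, n_categories=3)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.125
```

The smoothed estimates still sum to 1 across the categories of a feature. In scikit-learn, `MultinomialNB` and `BernoulliNB` expose this idea through their `alpha` parameter.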

## Applications
* **Real time Prediction:** Naive Bayes is an eager learning classifier and is very fast, so it can be used for making predictions in real time.
* **Multi class Prediction:** The algorithm is also well known for multi-class prediction: it can predict the probability of each of several classes of the target variable.
* **Text classification/ Spam Filtering/ Sentiment Analysis:** Naive Bayes classifiers are mostly used in text classification, where they often achieve a high success rate compared to other algorithms (helped by good results in multi-class problems and the independence assumption). As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (identifying positive and negative customer sentiment, for example in social media analysis).
* **Recommendation System:** A Naive Bayes classifier and collaborative filtering can together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

## Implementation using Scikit
scikit-learn provides several Naive Bayes variants. Three commonly used ones are:
1. **Gaussian:** Used in classification; it assumes that the features follow a normal distribution.
2. **Multinomial:** Used for discrete counts. In a text classification problem, for example, this goes one step beyond Bernoulli trials: instead of "word occurs in the document", the feature is "how often the word occurs in the document". You can think of it as "the number of times outcome $x_i$ is observed over the $n$ trials".
3. **Bernoulli:** Useful if your feature vectors are binary. One application is text classification with a "bag of words" model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document".
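
The difference between the multinomial and Bernoulli variants can be illustrated on a tiny text problem. This is a sketch under stated assumptions: the corpus and labels are hypothetical toy data, and `CountVectorizer` is used as one possible way to build count features or, with `binary=True`, presence/absence features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = [
    "free prize money",
    "win money now",
    "meeting at noon",
    "project meeting schedule",
]
labels = [1, 1, 0, 0]
new_docs = ["free money", "schedule meeting"]

# Multinomial: features are word counts per document
count_vec = CountVectorizer()
multinomial = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
multinomial_pred = multinomial.predict(count_vec.transform(new_docs))

# Bernoulli: features are binary (word present / word absent)
binary_vec = CountVectorizer(binary=True)
bernoulli = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bernoulli_pred = bernoulli.predict(binary_vec.transform(new_docs))

print(multinomial_pred)  # [1 0]
print(bernoulli_pred)    # [1 0]
```

On data this small both models agree. They tend to differ on longer documents, where repeated words carry weight in the multinomial model but not in the Bernoulli one.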

##### Import packages

```python
from sklearn.naive_bayes import GaussianNB
import numpy as np
```

##### Create dummy data

```python
np.random.seed(42)

# Twelve 2-D feature vectors
X = np.array([[-3, 7], [1, 5], [1, 2],
              [-2, 0], [2, 3], [-4, 0],
              [-1, 1], [1, 1], [-2, 2],
              [2, 7], [-4, 1], [-2, 7]])

# Class label (3 or 4) for each row of X
y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])
```

##### Create model

```python
# Create classifier
model = GaussianNB()

# Train the model
model.fit(X, y)

# Predict output
predicted = model.predict([[1, 2], [3, 4]])  # should output 3 4
print(predicted)
```

```
[3 4]
```
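
Besides hard class labels, the fitted model can also report a posterior probability for each class. The sketch below repeats the dummy data so that it runs on its own and calls `predict_proba`. As noted in the cons above, these probabilities are best read as a rough ranking rather than as calibrated estimates.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Same dummy data as above, repeated so this block is self-contained
X = np.array([[-3, 7], [1, 5], [1, 2],
              [-2, 0], [2, 3], [-4, 0],
              [-1, 1], [1, 1], [-2, 2],
              [2, 7], [-4, 1], [-2, 7]])
y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])

model = GaussianNB().fit(X, y)

# One row per sample, one column per class (in the order of model.classes_)
samples = [[1, 2], [3, 4]]
proba = model.predict_proba(samples)

print(model.classes_)  # [3 4]
print(proba.round(3))
```

Each row of `proba` sums to 1, and `predict` returns the class whose column holds the largest value in that row.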
