# Gaussian Naive Bayes

## Naive Bayes

Assume we have to classify messages as spam or not spam using the text in the messages. One approach we could follow is described below:

1. Find the probability that any message is spam: P(S) = total spam messages / total messages
2. Find the probability that any message is normal: P(N) = total normal messages / total messages
3. Count how often each word occurs in spam and in normal messages and build a histogram for each class
    Eg: P(Dear | N) = total occurrences of the word Dear in normal messages / total number of words in normal messages
        P(Friend | N) = 0.29
        P(Lunch | N) = 0.17
        P(Money | N) = 0.06
   Similarly,
        P(Dear | S) = total occurrences of the word Dear in spam messages / total number of words in spam messages = 0.29
        P(Friend | S) = 0.14
        P(Lunch | S) = 0.00
        P(Money | S) = 0.57
4. When a new message arrives, look at its words and compute a score for each class; each score is proportional to the probability that the message is normal or spam
    Eg:  If we get the text "Dear Friend"
        P(Dear | N)*P(Friend | N)*P(N) = 0.09, which is proportional to P(N | Dear Friend)
        P(Dear | S)*P(Friend | S)*P(S) = 0.01, which is proportional to P(S | Dear Friend)
5. Because the score for N given the text "Dear Friend" is higher than the score for S, we classify the message as normal
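
The steps above can be sketched in a few lines of Python. The word counts and message totals below are assumptions, chosen only so that they agree with the probabilities quoted in the list; the helper names `word_probs` and `score` are hypothetical.

```python
# Hypothetical word counts, chosen to agree with the probabilities quoted above.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}
n_normal_msgs, n_spam_msgs = 8, 4  # assumed message totals


def word_probs(counts):
    """P(word | class) = occurrences of the word / total words in the class."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def score(message, prior, probs):
    """Prior times the product of per-word probabilities (unnormalised)."""
    result = prior
    for word in message.split():
        result *= probs[word]
    return result


p_normal = n_normal_msgs / (n_normal_msgs + n_spam_msgs)  # P(N)
p_spam = n_spam_msgs / (n_normal_msgs + n_spam_msgs)      # P(S)

normal_score = score("Dear Friend", p_normal, word_probs(normal_counts))
spam_score = score("Dear Friend", p_spam, word_probs(spam_counts))
label = "normal" if normal_score > spam_score else "spam"
print(round(normal_score, 2), round(spam_score, 2), label)  # 0.09 0.01 normal
```

The scores are not true probabilities because they are not normalised; only their relative size matters for the classification.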

### Problems with the above approach

- Assume we are tasked with classifying the message "Lunch Money Money Money"
- Using the probabilities from the above example:
        P(N | Lunch Money Money Money)  proportional to P(Lunch | N)*P(Money | N)^3*P(N) = 0.0002
        P(S | Lunch Money Money Money)  proportional to P(Lunch | S)*P(Money | S)^3*P(S) = 0
- The second value is 0 because P(Lunch | S) = 0, i.e. the word "Lunch" never appeared in the messages we know are spam
- In such a scenario, no matter how many times "Money" appears in the message, the spam score stays 0, so the normal score is always higher and we classify the message as normal despite it being obviously spam
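
A minimal sketch of the problem, assuming hypothetical word counts in which "Lunch" never occurs in spam. Because the counts are assumed, the exact magnitudes need not match the figures quoted above; the point is that a single zero factor wipes out the spam score.

```python
# Hypothetical counts; the only essential assumption is that "Lunch" has count 0 in spam.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}
p_normal, p_spam = 8 / 12, 4 / 12  # assumed priors


def score(words, prior, counts):
    """Unsmoothed score: prior times the product of count / total for each word."""
    total = sum(counts.values())
    result = prior
    for word in words:
        result *= counts[word] / total
    return result


message = "Lunch Money Money Money".split()
normal_score = score(message, p_normal, normal_counts)
spam_score = score(message, p_spam, spam_counts)

# The zero count for "Lunch" forces the whole spam product to zero.
print(spam_score)  # 0.0
print(normal_score > spam_score)  # True
```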


### A Solution to the zero probability

- To avoid zero probabilities we add a value alpha (usually 1) to the count of every individual word when calculating the probability of each word given spam or normal
        Eg: P(Lunch | S) = (number of times "Lunch" appeared in spam messages + 1) / (total words in spam messages + number of distinct words)
- Adding this value does not affect the initial probabilities P(S) and P(N), as they depend only on the number of spam and normal messages; only the per-word probabilities change
- Doing this for the above example we end up with:
        P(Lunch | N)*P(Money | N)^3*P(N)= 0.00002
        P(Lunch | S)*P(Money | S)^3*P(S)= 0.00213
- Thus we would be able to classify this as a Spam message
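
The smoothing step can be sketched as follows. The counts and priors are again hypothetical, so the printed magnitudes may differ from the figures quoted above. Since alpha is added to the count of every distinct word, the denominator grows by alpha times the number of distinct words, which keeps the per-word probabilities summing to 1.

```python
# Hypothetical counts and priors, for illustration only.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}
p_normal, p_spam = 8 / 12, 4 / 12


def smoothed_score(words, prior, counts, alpha=1):
    """Score with additive smoothing: alpha is added to every word count."""
    total = sum(counts.values())
    vocab_size = len(counts)  # number of distinct words
    result = prior
    for word in words:
        result *= (counts[word] + alpha) / (total + alpha * vocab_size)
    return result


message = "Lunch Money Money Money".split()
normal_score = smoothed_score(message, p_normal, normal_counts)
spam_score = smoothed_score(message, p_spam, spam_counts)

# With smoothing the spam score is no longer zero and now wins.
print(spam_score > normal_score)  # True
```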

## Why is Naive Bayes Naive?

- The Naive Bayes method doesn't consider the relationships between the words in the above example
        ie. P(Dear Friend | N) == P(Friend Dear | N)
- Because the algorithm ignores word order and other obvious language-related clues and works only with frequencies, it is termed naive

- Because Naive Bayes ignores the relationships between words, it tends to have high bias
- In practice it still classifies well, so it tends to have low variance
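
A small illustration of the order independence: the score is a plain product of per-word probabilities, so every ordering of the same words gives the same value. The probabilities are the ones quoted for normal messages earlier; the variable names are hypothetical.

```python
from itertools import permutations
from math import isclose, prod

# Per-word probabilities for normal messages, as quoted in the earlier example.
p_word_given_normal = {"Friend": 0.29, "Lunch": 0.17, "Money": 0.06}

words = ["Friend", "Lunch", "Money"]
scores = [
    prod(p_word_given_normal[word] for word in ordering)
    for ordering in permutations(words)
]

# Compare with a tolerance, since floating-point products can differ in the last bits.
all_equal = all(isclose(s, scores[0]) for s in scores)
print(len(scores), all_equal)  # 6 True
```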

## Gaussian Naive Bayes

- Gaussian Naive Bayes is so named because it uses Gaussian (normal) curves to determine the outcome
- The core logic of Naive Bayes still applies in this case, with a small addition

The steps are as follows:
- Assume we have numerical columns which are used to determine if a person loves a movie
- We first find the mean and standard deviation of each column in 2 datasets: those who love the specific movie and those who don't
- These measurements let us plot, for each column, the normal distribution that fits its mean and standard deviation
- Then we start with an initial guess of the probability that a person loves the movie
        Eg: If we have 8 people who love the movie and 8 people who don't then our initial prediction that a new person will love the movie is 0.5
- NOTE: The initial guesses that a person will or will not love the movie are called Prior Probabilities
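
The fitting steps above might look like this in Python. The measurements are made up for illustration, and `fit` is a hypothetical helper; the sample standard deviation (`statistics.stdev`) is one possible choice, and the population version (`statistics.pstdev`) would work as well.

```python
from statistics import mean, stdev

# Hypothetical measurements in grams per day; not taken from the source.
loves = {"popcorn": [24.0, 28.0, 22.0, 26.0], "candy": [1.0, 3.0, 2.0, 2.0]}
no_love = {"popcorn": [4.0, 6.0, 2.0, 4.0], "candy": [20.0, 25.0, 30.0, 25.0]}


def fit(columns):
    """Return {column name: (mean, standard deviation)} for one class."""
    return {name: (mean(values), stdev(values)) for name, values in columns.items()}


loves_params = fit(loves)
no_love_params = fit(no_love)

# Prior probabilities come from the class sizes alone.
n_loves = len(loves["popcorn"])
n_no_love = len(no_love["popcorn"])
prior_loves = n_loves / (n_loves + n_no_love)
prior_no_love = n_no_love / (n_loves + n_no_love)

print(loves_params["popcorn"][0], prior_loves)  # 25.0 0.5
```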


- NOTE: Likelihood is defined as the y axis value for a corresponding x axis value in the normal distribution for a column
        Eg: Suppose we are looking at popcorn consumption as a parameter and want the likelihood that a person who eats 20 g of popcorn daily loves the movie. We plot the normal distribution of popcorn consumption for the people who love the movie and read off the y axis value where the x axis value is 20 g. This can be represented as L(popcorn=20 | loves the movie)
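
Reading the y axis value off the curve is the same as evaluating the normal density at that x value. A minimal sketch, assuming a hypothetical mean of 24 g and standard deviation of 4 g for popcorn consumption among people who love the movie:

```python
from math import exp, pi, sqrt


def gaussian_likelihood(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Hypothetical parameters: mean 24 g and standard deviation 4 g.
likelihood = gaussian_likelihood(20, mu=24, sigma=4)  # L(popcorn=20 | loves the movie)
peak = gaussian_likelihood(24, mu=24, sigma=4)        # the curve is highest at the mean

print(likelihood < peak)  # True
```

A likelihood is a density, not a probability, so it can exceed 1 when the standard deviation is small.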
        
        
- To predict if a new person would love the movie we calculate the following value:
        prior probability of people loving the movie * L(popcorn = 20 | love)* L(candy=25 | love) ...
- Similarly we calculate the value for the person not loving the movie:
        prior probability of people not loving the movie * L(popcorn = 20 | no love)* L(candy=25 | no love) ...
- The product of many small likelihoods can become too small to represent accurately, so we take the natural log of the above values, which turns each product into a sum of logs
- Whichever result is greater indicates the category the person more likely falls into
        Eg: loves = -129, does not love = -58. Result: the person wouldn't love the movie
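
The whole prediction step can be sketched with log scores. The per-class means, standard deviations and priors below are hypothetical, so the resulting numbers differ from the -129 and -58 in the example; the names `log_gaussian` and `log_score` are illustrative.

```python
from math import log, pi


def log_gaussian(x, mu, sigma):
    """Natural log of the normal density at x, computed without exponentiating."""
    return -0.5 * log(2 * pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


# Hypothetical parameters per class: feature -> (mean, standard deviation).
params = {
    "loves": {"popcorn": (24.0, 4.0), "candy": (2.0, 1.0)},
    "does not love": {"popcorn": (4.0, 2.0), "candy": (20.0, 5.0)},
}
priors = {"loves": 0.5, "does not love": 0.5}
person = {"popcorn": 20.0, "candy": 25.0}


def log_score(label):
    """log(prior) plus the sum of the log likelihoods of every feature."""
    total = log(priors[label])
    for feature, value in person.items():
        mu, sigma = params[label][feature]
        total += log_gaussian(value, mu, sigma)
    return total


scores = {label: log_score(label) for label in priors}
prediction = max(scores, key=scores.get)  # the larger (less negative) score wins
print(prediction)  # does not love
```

Working directly in log space avoids ever forming the tiny product, which is why the log of the density is computed in closed form rather than as `log(density)`.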

# Source:
- https://www.youtube.com/watch?v=O2L2Uv9pdDA

# Example

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Load the iris data and hold out half of it for testing
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    gnb = GaussianNB()

    # Fit on the training split, predict the test split, and count the errors
    y_pred = gnb.fit(X_train, y_train).predict(X_test)
    print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))

Output:

    Number of mislabeled points out of a total 75 points : 4
