# Naive Bayes

## Bayes' Theorem

Let's first recall Bayes' theorem.

![](https://www.saedsayad.com/images/Bayes_rule.png)
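
In case the image above does not render, the theorem can also be written out as follows. Here $ c $ is a class label and $ x $ is the observed attribute set, the same symbols used in the next section:

$$ P(c \, | \, x) = \frac{P(x \, | \, c) \, P(c)}{P(x)} $$

$ P(c) $ is the prior probability of the class, $ P(x \, | \, c) $ is the likelihood of the attributes given the class, and $ P(x) $ is the evidence.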

## Constructing a Model

Naive Bayes is a simple yet effective classifier that takes a probabilistic approach to classification.  
It is based on Bayes' theorem, with the additional assumption that the features are independent of each other given the class.

The classification step is based on the theorem. $ P(c \, | \, x) $ is the posterior probability of target $ c $ given attribute set $ x $. Therefore, for a new instance, we'll assign the class label that maximizes the posterior probability given its attributes.

![](https://www.saedsayad.com/images/naive_bayes_data.png)

![](https://www.saedsayad.com/images/Bayes_3.png)

![](https://www.saedsayad.com/images/naive_bayes_likelihood.png)

$ P(x_0, \, \dots, \, x_n \, | \, c_i) $ is difficult to compute directly. To simplify the computation, we assume that $ x_0 $ through $ x_n $ are conditionally independent given $ c_i $, which lets us factor the likelihood as $ P(x_0, \, \dots, \, x_n \, | \, c_i) \, = \, P(x_0 \, | \, c_i) \cdot P(x_1 \, | \, c_i) \cdot \, \dots \, \cdot P(x_n \, | \, c_i) $. In most cases this assumption does not hold exactly, which is why the classifier is called *naive* Bayes.

![](https://www.saedsayad.com/images/naive_bayes_example_2.png)

The final classification is made with the following expression. We can omit the $ P(x) $ term, since it is the same for every class label and therefore does not change which class attains the maximum.

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \, P(y) \prod_{i=0}^{n} \, P(x_i \, | \, y) $$
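
The decision rule above can be sketched in a few lines of plain Python. This is only an illustration: the toy `samples` data and the helper names `fit` and `predict` are made up for this example, the features are categorical, and no smoothing is applied, so an unseen feature value drives a class score to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: ((outlook, windy), play)
samples = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("overcast", "no"), "yes"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
]

def fit(samples):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][i][value]: how often feature i equals value in that class
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for x, label in samples:
        for i, value in enumerate(x):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict(x, class_counts, feature_counts):
    """Return the class maximizing P(y) * prod_i P(x_i | y), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(y)
        for i, value in enumerate(x):
            # likelihood P(x_i | y), estimated by relative frequency
            score *= feature_counts[label][i][value] / n_label
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = predict(("sunny", "no"), *fit(samples))
print(label, scores)
```

Note that the scores are unnormalized posteriors: dividing each by their sum would give $ P(y \, | \, x) $, but the argmax is the same either way.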

## Implementation

Now, let's apply naive Bayes to the well-known Titanic dataset to predict whether each passenger survived.

In [3]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

path_prefix = "/content/drive/My Drive"

Mounted at /content/drive


In [0]:
import pandas as pd
import numpy as np
from os.path import join

In [0]:
filename = "titanic.csv"
df = pd.read_csv(join(path_prefix, filename))

In [0]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,


Depending on the dataset we have, there are multiple distribution options we can choose to compute the probabilities.  

**Gaussian:** It assumes that continuous features follow a normal distribution.  
**Multinomial:** It is used for discrete counts.  
**Bernoulli:** It is useful if your features are binary.
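
As a rough illustration of how the three options map to sklearn, the sketch below fits `GaussianNB`, `MultinomialNB` and `BernoulliNB` on synthetic data. The data is randomly generated for this example only (it is not the Titanic dataset), and the threshold used to binarize the counts is an arbitrary choice.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: two well-separated normal clusters
X_cont = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(5.0, 1.0, (50, 2))])

# Count features: Poisson counts with class-dependent rates
X_count = np.vstack([rng.poisson([5, 1], (50, 2)),
                     rng.poisson([1, 5], (50, 2))])

# Binary features: whether each count exceeds an arbitrary threshold
X_bin = (X_count > 2).astype(int)

gaussian_acc = GaussianNB().fit(X_cont, y).score(X_cont, y)
multinomial_acc = MultinomialNB().fit(X_count, y).score(X_count, y)
bernoulli_acc = BernoulliNB().fit(X_bin, y).score(X_bin, y)

print(gaussian_acc, multinomial_acc, bernoulli_acc)
```

The scores here are training accuracies on toy data, so they only show that each variant accepts the kind of feature it was designed for.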

Here, we choose Gaussian naive Bayes to model the probabilities.
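
With this choice, each per-feature likelihood is modeled by a normal density whose mean and variance are estimated from the training samples of each class. The symbols $ \mu_{y,i} $ and $ \sigma_{y,i}^2 $ are notation introduced here for the mean and variance of feature $ i $ within class $ y $:

$$ P(x_i \, | \, y) = \frac{1}{\sqrt{2 \pi \sigma_{y,i}^2}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2 \sigma_{y,i}^2} \right) $$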

First of all, we can remove the `PassengerId`, `Name` and `Ticket` columns from the dataframe, since they are identifiers or free text that we do not expect to help predict the target.  
In addition, the `Cabin` column can be removed as well, since it is related to the `Pclass` column.

In [0]:
df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)

### Encoding Categorical Variables

In sklearn, almost all of the algorithms expect numeric inputs. As a result, we need to encode the categorical variables into numeric representations. So far, we have been performing this conversion manually, by assigning a number to each unique value. However, sklearn provides several encoding mechanisms.
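
For reference, the manual approach mentioned above might look like the following sketch. The `toy` dataframe and the `mapping` dictionary are hypothetical stand-ins, not part of the Titanic notebook.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical feature such as `Sex`
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Hand-written mapping from each unique value to a number
mapping = {"female": 0, "male": 1}
toy["Sex_encoded"] = toy["Sex"].map(mapping)

print(toy)
```

This works for a single small column, but the mapping has to be written and maintained by hand, which is what the encoders below automate.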

#### Ordinal Encoding

Given an ordinal feature, our goal is to represent each unique value with an integer from 0 to n-1, where n is the number of categories, hence the name ordinal.

To this end, you may use the `OrdinalEncoder` class.

``` py
>>> from sklearn.preprocessing import OrdinalEncoder
>>> encoder = OrdinalEncoder()
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]
>>> encoder.fit(X)
>>> encoder.transform([['female', 'from US', 'uses Safari']])
array([[0., 1., 1.]])
```

In [12]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

# the shape of the input must be 2D
encoded = encoder.fit_transform(df[["Sex"]])
encoded[:2]

array([[1.],
       [0.]])

You can use the `inverse_transform` method to observe the mapping.

In [0]:
encoder.inverse_transform(encoded[:2])

array([['male'],
       ['female']], dtype=object)
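
Another way to inspect the mapping is the fitted encoder's `categories_` attribute: the position of a category in that array is its integer code. The sketch below uses a small made-up list (`toy`) instead of the Titanic dataframe.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy input; the shape must be 2D (rows x features)
toy = [["male"], ["female"], ["female"]]

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

# categories_ holds one array per input feature
categories = toy_encoder.categories_[0]
mapping = {str(category): code for code, category in enumerate(categories)}

print(mapping)
print(toy_encoded.ravel())
```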

#### One-Hot Encoding

The problem with ordinal, or label, encoding is that it imposes an order on the categories. For features that have no natural order, a model may treat that artificial order as meaningful and produce misleading results.

Another option is one-hot encoding, which transforms each categorical feature with n_categories possible values into n_categories binary features, exactly one of which is 1 while all others are 0.

![](https://i.imgur.com/mtimFxh.png)

``` py
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
```

In [13]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# the shape of the input must be 2D
encoded = encoder.fit_transform(df[["Sex"]])
# the encoder returns a sparse matrix by default,
# so we call toarray() to get a dense array
# and observe the raw values
encoded[:2].toarray()

array([[0., 1.],
       [1., 0.]])

In [0]:
encoder.inverse_transform(encoded[:2])

array([['male'],
       ['female']], dtype=object)

Pandas has its own one-hot encoder function as well, named `get_dummies`.

In [15]:
df = pd.get_dummies(df)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male
0,0,3,22.0,1,0,7.25,0,1
1,1,1,38.0,1,0,71.2833,1,0
2,1,3,26.0,0,0,7.925,1,0
3,1,1,35.0,1,0,53.1,1,0
4,0,3,35.0,0,0,8.05,0,1
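
`get_dummies` also accepts a few options that can be handy here. The sketch below, on a made-up `toy` dataframe, restricts the encoding to chosen columns with `columns`, forces integer output with `dtype`, and uses `drop_first` to drop one of the redundant indicator columns.

```python
import pandas as pd

# Hypothetical toy dataframe with one numeric and one categorical column
toy = pd.DataFrame({"Pclass": [3, 1, 3],
                    "Sex": ["male", "female", "female"]})

# Encode only the listed columns and emit 0/1 integers
dummies = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# drop_first removes the first category (here Sex_female),
# since it is implied by the remaining indicator
dummies_reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(dummies)
print(dummies_reduced)
```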


### Back to Implementation

In [0]:
df.fillna(df.mean(), inplace=True)
features = df.drop(["Survived"], axis=1).values
target = df["Survived"].values

In [0]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [0]:
# creating the model
clf = GaussianNB()

In [0]:
# train-test split for the dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=42)

In [20]:
# fit the training data
clf.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [0]:
# predict test data
y_pred = clf.predict(X_test)

In [23]:
# getting the accuracy score
accuracy_score(y_test, y_pred)

0.8101694915254237
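
To connect the fitted model back to the argmax expression from the model section, here is a minimal from-scratch sketch of the Gaussian variant, compared against `GaussianNB` on synthetic data. The helper `gaussian_nb_predict` and the random data are assumptions made for this illustration. The sketch works in log space for numerical stability and omits sklearn's `var_smoothing` term, so tiny numerical differences are possible.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data, generated only for this illustration
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
               rng.normal(2.0, 1.5, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

def gaussian_nb_predict(X_train, y_train, X_new):
    """Pick, for each row of X_new, the class with the highest log posterior."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        X_c = X_train[y_train == c]
        prior = X_c.shape[0] / X_train.shape[0]  # P(y)
        mean = X_c.mean(axis=0)                  # per-feature mean
        var = X_c.var(axis=0)                    # per-feature variance
        # sum over features of log N(x_i; mean_i, var_i)
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_posteriors.append(np.log(prior) + log_likelihood)
    return classes[np.argmax(np.vstack(log_posteriors), axis=0)]

manual_pred = gaussian_nb_predict(X, y, X)
sklearn_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sklearn_pred)

print(agreement)
```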