***What is the naive Bayes algorithm?***

 It is a generative, supervised learning method.

It is a classification technique based on Bayes' theorem, i.e. $P(y|x)= \frac{P(x|y)P(y)}{P(x)}$.



It assumes that features are conditionally independent given the class, i.e. the presence of a particular feature in a class is unrelated to the presence of any other feature in that class. Hence the name 'naive'.

Using Bayes' theorem and the assumption of independence of features, we obtain the following:

$P(y|x_1,...,x_n) = \frac{P(y)\prod_{i=1}^nP(x_i|y)}{P(x_1,...,x_n)}$


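where:

$P(y|x_1,...,x_n)$ --- the posterior probability of the class $y$ (the target) given the features $x_1,...,x_n$ (the attributes).

$P(x_1,...,x_n|y)$ --- the likelihood, i.e. the probability of the features given the class; under the independence assumption it factorises as $\prod_{i=1}^nP(x_i|y)$.

$P(y)$ --- the prior probability of the class.

$P(x_1,...,x_n)$ --- the prior probability of the features (the evidence).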


Since $P(x_1,...,x_n)$ is a constant given the input, we can use the following classification rule:

$P(y|x_1,...,x_n) \propto P(y)\prod_{i=1}^nP(x_i|y)$

$\implies \hat{y} = \arg\max_{y }P(y)\prod_{i=1}^nP(x_i|y)$
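
As a minimal sketch of this rule, the snippet below scores each class by $\log P(y) + \sum_i \log P(x_i|y)$ and picks the largest score. The priors and likelihood tables are made-up numbers for illustration, and the function name `predict` is hypothetical. Summing logarithms instead of multiplying probabilities gives the same $\arg\max$ while avoiding numerical underflow when there are many features.

```python
import math

# Hypothetical, hand-picked probabilities for a two-class toy problem.
priors = {"spam": 0.4, "ham": 0.6}
# likelihoods[class][feature_index][feature_value] = P(x_i = value | class)
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"yes": 0.6, "no": 0.4}],
}

def predict(x):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        scores[y] = score
    return max(scores, key=scores.get), scores

label, scores = predict(["yes", "no"])
print(label)  # spam, since 0.4*0.8*0.7 > 0.6*0.1*0.4
```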

To estimate $P(y)$ and $P(x_i|y)$ we use Maximum A Posteriori (MAP) estimation; $P(y)$ is then the relative frequency of class $y$ in the training set.
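
In the simplest setting (categorical features, no smoothing), these estimates reduce to relative frequencies counted from the training data. The sketch below illustrates this on a tiny invented dataset; the data and the helper name `likelihood` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (outlook, windy) -> play
data = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("overcast", "no"), "yes"),
    (("rainy", "no"), "yes"),
]

# P(y): fraction of training rows that belong to each class.
class_counts = Counter(label for _, label in data)
n = len(data)
priors = {y: c / n for y, c in class_counts.items()}

# feature_counts[(i, y)][value] = number of class-y rows with x_i == value
feature_counts = defaultdict(Counter)
for x, y in data:
    for i, value in enumerate(x):
        feature_counts[(i, y)][value] += 1

def likelihood(i, value, y):
    """Relative-frequency estimate of P(x_i = value | y)."""
    return feature_counts[(i, y)][value] / class_counts[y]

print(priors)
print(likelihood(0, "sunny", "yes"))
```

Note that `likelihood(0, "overcast", "no")` is exactly 0 here, because that combination never occurs in the toy data.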

Naive Bayes is commonly used in text classification and in problems with multiple classes.


***Pros and cons of Naive Bayes***

***Pros***

-It is fast and performs well in multiclass prediction.

-It can perform better than other models such as logistic regression when the independence assumption holds.

-It requires relatively little training data.

-It performs well with categorical variables compared to numerical variables. For numerical variables, a normal distribution is assumed.

***Cons***

-The model assigns a probability of 0 to any category of a variable that appears in the test data but not in the training data (this is known as the zero-frequency problem). A smoothing technique, for example Laplace estimation, is used to solve this.

-In real life it is rare for features to be fully independent, as naive Bayes assumes.
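
The zero-frequency problem and its fix can be sketched in a few lines. The snippet assumes the usual additive (Laplace) formula, $(\text{count} + \alpha) / (N_y + \alpha k)$, where $N_y$ is the number of training rows in class $y$ and $k$ is the number of distinct categories of the feature; the function name `smoothed_likelihood` and the counts are hypothetical. In scikit-learn, the `alpha` parameter of `MultinomialNB`, `BernoulliNB` and `CategoricalNB` plays this role.

```python
def smoothed_likelihood(count, class_total, n_categories, alpha=1.0):
    """Additive (Laplace) estimate of P(x_i = value | y)."""
    return (count + alpha) / (class_total + alpha * n_categories)

# A category never seen with this class: 0 of 10 rows, 3 possible categories.
unsmoothed = 0 / 10
smoothed = smoothed_likelihood(count=0, class_total=10, n_categories=3)
print(unsmoothed, smoothed)

# The smoothed estimates over all categories still sum to 1.
total = sum(smoothed_likelihood(c, 10, 3) for c in (0, 4, 6))
print(total)
```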

***Applications of the Naive Bayes Algorithm***

**Real time prediction**

Naive Bayes is used for prediction in real time because it is fast.

**Multiclass prediction**

It can be used to predict the probability of each of multiple classes of the target variable.

**Text classification/ Spam Filtering/ Sentiment Analysis**

-It is widely used for text classification because of its good results in multiclass problems and its independence assumption.

-It is also widely used for spam filtering (identifying spam email) and for sentiment analysis (identifying positive and negative customer sentiment).

**Recommendation System**

Together with collaborative filtering, it can be used to build a recommendation system that filters unseen information and predicts whether a user would like a given resource.


***Three kinds of Naive Bayes model***

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i|y)$.

**Gaussian**

It assumes that features follow a normal distribution (that is, they are continuous numerical values).

**Multinomial**

It is used for discrete count features, such as word counts in text.

**Bernoulli**

It is used when the feature vectors are binary (that is, zeros and ones).
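
As a quick preview of how the three variants map onto feature types, the sketch below fits each scikit-learn class on synthetic data of the matching kind. The data are randomly generated for illustration only (Gaussian blobs, Poisson counts, and binary indicators derived from the counts); none of it comes from a real dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: two Gaussian blobs with different means.
X_cont = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])
# Count features: Poisson counts with class-dependent rates.
X_count = np.vstack([rng.poisson([5, 1, 1], size=(50, 3)),
                     rng.poisson([1, 1, 5], size=(50, 3))])
# Binary features: presence/absence indicators derived from the counts.
X_bin = (X_count > 2).astype(int)

scores = {}
for name, model, X in [("Gaussian", GaussianNB(), X_cont),
                       ("Multinomial", MultinomialNB(), X_count),
                       ("Bernoulli", BernoulliNB(), X_bin)]:
    model.fit(X, y)
    scores[name] = model.score(X, y)  # accuracy on the training data
    print(name, "training accuracy:", scores[name])
```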


Now we shall implement the three kinds of Naive Bayes using scikit-learn.

***Gaussian Naive Bayes implementation***

In this case, the likelihood of the features is assumed to follow a normal (Gaussian) distribution. The features are numerical (continuous).

$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma^2_y}}\exp
\left(-\frac{(x_i-\mu_y)^2}{2\sigma^2_y}\right)$
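
To make the density concrete, here is a small sketch that evaluates the Gaussian likelihood for one feature value. The function name `gaussian_likelihood` and the class statistics (mean 1.0, variance 0.25) are hypothetical; in practice the mean and variance are estimated per feature and per class from the training data.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Density of N(mean, var) at x, used as P(x_i | y) in Gaussian naive Bayes."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical class statistics for one feature.
p = gaussian_likelihood(x=1.5, mean=1.0, var=0.25)
print(p)
```

Because this is a density rather than a probability, its value can exceed 1 when the variance is small; only the comparison between classes matters for the classification rule.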

We shall use the iris data, which has continuous features.

In [13]:
#Importing libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
#Import iris dataset
dataset = pd.read_csv("/home/aims/Documents/Kaggle Data/IRIS.csv")

In [38]:
#Getting data information
dataset.info()

From the information above we can see that our data has no missing values; the dataset has 150 samples and 5 columns (4 features and the target).

In [39]:
#Getting the first few rows of our data
dataset.head()

In [40]:
#Checking for normality of features
from scipy.stats import normaltest 
print('normality result for petal_width :', normaltest(dataset['petal_width']))
print('normality result for petal_length :', normaltest(dataset['petal_length']))
print('normality result for sepal_length :', normaltest(dataset['sepal_length']))
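
Gaussian naive Bayes assumes normality within each class rather than over the pooled data, so a per-class test is arguably the more relevant check. The sketch below is one way to do that; it uses the copy of iris bundled with scikit-learn (`load_iris`), whose column names differ from the CSV used in this notebook, and the dictionary name `pvalues` is hypothetical. A small p-value suggests a departure from normality for that feature within that class.

```python
from scipy.stats import normaltest
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

# Run the D'Agostino-Pearson test separately for every class and feature.
pvalues = {}
for class_id, group in df.groupby("target"):
    for column in iris.feature_names:
        stat, p = normaltest(group[column])
        pvalues[(str(iris.target_names[class_id]), column)] = p

for key, p in pvalues.items():
    print(key, round(p, 4))
```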


In [41]:
# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# matplotlib histogram
#plt.hist(dataset['petal_width'], color = 'blue', edgecolor = 'black',
         #bins = int(180/5))

# seaborn histogram (histplot replaces the deprecated distplot)
sns.histplot(dataset['petal_width'], kde=False,
             bins=int(180/5), color='blue', edgecolor='black')
# Add labels
plt.title('Histogram of petal_width')
plt.xlabel('petal width')
plt.ylabel('count')

In [5]:
# We now split dataset into features and target
features = dataset.drop(columns=['species']).values
target = dataset['species']

In [7]:
#We now encode the target labels with values between 0 and n_classes-1
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
target= labelencoder.fit_transform(target)


In [8]:
#We now split our data into train and test set

x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 0)

In [9]:
#Calling the model
model = GaussianNB()

In [42]:
#Fitting the model with train dataset

model.fit(x_train,y_train)

In [43]:
#Predicting the target on the training dataset
predict_train = model.predict(x_train)
print('prediction classes of target based on train data: ', predict_train)

In [44]:
#Accuracy score on train dataset
train_accuracy = accuracy_score(y_train,predict_train)*100
print('train accuracy (%): ', train_accuracy)


In [45]:
#Predicting the target on the testing dataset
predict_test = model.predict(x_test)
print('prediction classes of target based on test data: ', predict_test)

In [46]:
#Accuracy score on test dataset
test_accuracy = accuracy_score(y_test,predict_test)*100
print('test accuracy (%): ', test_accuracy)

We can see that the model performs well on both the train and the test set.
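
Accuracy alone does not show which classes get confused with each other. As a self-contained sketch, the snippet below repeats the same split and model on the iris copy bundled with scikit-learn (an assumption made so that it runs without the local CSV file) and adds a confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print("test accuracy:", acc)
print(cm)
```

Off-diagonal entries of the matrix count the test samples of one species that were predicted as another.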

**Tips to improve the power of Naive Bayes Model**

-If continuous features are not normally distributed, we should consider a transformation or another method to bring them closer to a normal distribution.

-If the dataset has the zero-frequency issue, we apply a smoothing technique such as Laplace correction to predict the class of the test data.

-Remove correlated features.
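
Two of these tips can be sketched together on synthetic data. The snippet drops one feature from each highly correlated pair and then applies a Yeo-Johnson power transform before `GaussianNB`. The data, the column names, and the correlation cut-off of 0.95 are all assumptions chosen for illustration; a suitable threshold and transform depend on the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Skewed (log-normal) feature whose location depends on the class.
skewed = rng.lognormal(mean=y * 1.0, sigma=0.5)
# A second informative feature and a near-duplicate of it.
other = rng.normal(loc=y * 2.0, scale=1.0)
duplicate = other + rng.normal(scale=0.01, size=n)
X = pd.DataFrame({"skewed": skewed, "other": other, "duplicate": duplicate})

# Drop one feature from every pair whose absolute correlation exceeds the cut-off.
threshold = 0.95  # hypothetical cut-off
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)

# Yeo-Johnson transform to make the features more Gaussian-like.
model = make_pipeline(PowerTransformer(method="yeo-johnson"), GaussianNB())
model.fit(X_reduced, y)
train_score = model.score(X_reduced, y)
print("dropped:", to_drop)
print("training accuracy:", train_score)
```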