# Naive Bayes Classifier

## 1. Algorithm
Naive Bayes is a family of classifiers based on Bayes' theorem. It estimates, for each class, the probability that a given record or data point belongs to that class. The class with the highest probability is taken as the prediction.

__Bayes Theorem__

Bayes' theorem is named after Rev. Thomas Bayes. It is built on conditional probability: the probability that something will happen given that something else has already occurred. Using conditional probability, we can update the probability of an event from prior knowledge about it.

Below is the formula for calculating the conditional probability.

$P(H|E)=\frac{P(E|H) * P(H)}{P(E)}$

where
* P(H) is the probability of hypothesis H being true. This is known as the prior probability.
* P(E) is the probability of the evidence (regardless of the hypothesis).
* P(E|H) is the probability of the evidence given that the hypothesis is true. This is known as the likelihood.
* P(H|E) is the probability of the hypothesis given that the evidence has been observed. This is known as the posterior probability.
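
As a worked example of the formula, the sketch below computes P(H|E) for a binary hypothesis. The function name `posterior` and the spam numbers are made up for illustration; P(E) is obtained by summing over "H true" and "H false".

```python
def posterior(prior_h, likelihood_e_given_h, likelihood_e_given_not_h):
    """Return P(H|E) via Bayes' theorem for a binary hypothesis H."""
    # P(E) by the law of total probability over H and not-H.
    evidence = (likelihood_e_given_h * prior_h
                + likelihood_e_given_not_h * (1.0 - prior_h))
    return likelihood_e_given_h * prior_h / evidence


# Hypothetical numbers: 20% of mail is spam; the word "offer" appears in
# 60% of spam messages and in 5% of non-spam messages.
p_spam_given_offer = posterior(0.2, 0.6, 0.05)
print(round(p_spam_given_offer, 4))  # 0.75
```

Even though only 20% of mail is spam in this made-up setting, seeing the word raises the probability of spam to 75%.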

__Maximum A Posteriori (MAP)__

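The MAP estimate is the hypothesis with the highest posterior probability:

$MAP(H) = \arg\max_H P(H|E) = \arg\max_H \frac{P(E|H) * P(H)}{P(E)} = \arg\max_H P(E|H) * P(H)$

P(E) is the evidence probability, and it only normalizes the result. It is the same for every hypothesis, so dropping it does not change which hypothesis attains the maximum.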
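
The following sketch illustrates why P(E) can be dropped: the hypothesis that maximizes P(E|H) * P(H) is the same one that maximizes the normalized posterior. The three hypotheses and their probabilities are invented for the example, and `map_hypothesis` is a hypothetical helper name.

```python
def map_hypothesis(priors, likelihoods):
    """Return the hypothesis maximizing P(E|H) * P(H), plus all scores."""
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    return max(scores, key=scores.get), scores


# Hypothetical priors P(H) and likelihoods P(E|H) for one observed message.
priors = {"spam": 0.2, "promo": 0.3, "personal": 0.5}
likelihoods = {"spam": 0.6, "promo": 0.3, "personal": 0.05}

best, scores = map_hypothesis(priors, likelihoods)

# Normalizing by P(E) rescales every score by the same constant.
evidence = sum(scores.values())
posteriors = {h: s / evidence for h, s in scores.items()}

print(best)                                  # spam
print(max(posteriors, key=posteriors.get))   # spam
```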

## 2. Assumption
The Naive Bayes classifier assumes that, given the class, all features are independent of each other: within a class, the presence or absence of one feature does not influence the presence or absence of any other feature. This assumption is called class conditional independence.
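
Written as a formula, the assumption reads as follows. The notation is introduced here for illustration and is not from the original text: $C$ denotes a class and $x_1, \dots, x_n$ are the feature values of one record.

$P(x_1, \dots, x_n | C) = \prod_{i=1}^{n} P(x_i | C)$

Combined with the MAP rule from section 1, the predicted class is the one that maximizes the prior times this product:

$\hat{C} = \arg\max_{C} P(C) * \prod_{i=1}^{n} P(x_i | C)$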

## 3. Pros and Cons 
__Pros:__

* It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
* When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
* It performs well with categorical input variables compared to numerical variables. For a numerical variable, a normal distribution (bell curve) is assumed, which is a strong assumption.

__Cons:__

* If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a probability of 0 (zero). That zero wipes out the whole product for the class, so the model cannot make a meaningful prediction. This is often known as “Zero Frequency”. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation.
* Naive Bayes is also known to be a bad probability estimator, so the outputs of predict_proba should not be taken too seriously.
* Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
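
To make the zero-frequency problem concrete, here is a minimal sketch of Laplace (add-one) smoothing for one categorical feature within one class. The helper `smoothed_probs` and the toy category values are assumptions for illustration only.

```python
from collections import Counter


def smoothed_probs(observations, categories, alpha=1.0):
    """Estimate P(category) with additive smoothing; alpha=1 is Laplace."""
    counts = Counter(observations)
    # Every category gets alpha pseudo-counts, so none ends up at zero.
    total = len(observations) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}


# Toy training values for one feature within one class (hypothetical).
train = ["Private", "Private", "Self-emp", "Private"]
categories = ["Private", "Self-emp", "Never-worked"]

unsmoothed = smoothed_probs(train, categories, alpha=0.0)
smoothed = smoothed_probs(train, categories, alpha=1.0)

print(unsmoothed["Never-worked"])  # 0.0
print(smoothed["Never-worked"])    # 1/7, no longer zero
```

With `alpha=0.0` the unseen category gets probability zero; with `alpha=1.0` it gets a small non-zero probability while the estimates still sum to one.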



## 4. Applications
* __Real time Prediction__: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
* __Multi class Prediction__: The algorithm is also well known for multi-class prediction: it can estimate the probability of each of several classes of the target variable.
* __Text classification/ Spam Filtering/ Sentiment Analysis__: Naive Bayes classifiers are mostly used in text classification, where (thanks to good results on multi-class problems and the independence assumption) they often achieve a higher success rate than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, identifying positive and negative customer sentiment).
* __Recommendation System__: A Naive Bayes classifier and collaborative filtering together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

## 5. Improvement
* __Transformation__: If continuous features are not normally distributed, apply a transformation or another method to bring them closer to a normal distribution.
* __Smoothing__: If the test data set has the zero frequency issue, apply a smoothing technique such as “Laplace Correction” to predict the class of the test data set.
* __Correlation__: Remove correlated features: highly correlated features are effectively counted twice in the model, which can over-inflate their importance.
* __Pre-Processing__: Naive Bayes classifiers have limited options for parameter tuning, such as alpha=1 for smoothing and fit_prior=[True|False] to learn class prior probabilities or not (see the scikit-learn documentation linked in the Reference section for the other options). I would recommend focusing on data pre-processing and feature selection.
* __Ensemble__: You might think of applying a classifier combination technique such as ensembling, bagging, or boosting, but these methods are unlikely to help much. Their purpose here would be to reduce variance, and Naive Bayes has very little variance to reduce: NB ignores correlation among the features, which induces bias and hence reduces variance.
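
The "Correlation" point above can be illustrated with a small two-class sketch. It assumes each feature contributes a likelihood ratio P(x_i|+)/P(x_i|-) and shows what happens when the same feature is fed in twice; the function name and the numbers are hypothetical.

```python
def posterior_positive(prior, likelihood_ratios):
    """P(+ | features) for two classes under the naive independence assumption.

    likelihood_ratios holds P(x_i | +) / P(x_i | -) for each feature.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        # Independence assumption: every feature multiplies the odds.
        odds *= ratio
    return odds / (1.0 + odds)


# One feature that is 3 times more likely under the positive class.
single = posterior_positive(0.5, [3.0])

# The same feature included twice (a perfectly correlated duplicate).
duplicated = posterior_positive(0.5, [3.0, 3.0])

print(single)      # 0.75
print(duplicated)  # 0.9
```

The duplicate adds no new information, yet the posterior moves from 0.75 to 0.9, which is the over-inflation the bullet describes.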

## 6. Naive Bayes Model under Scikit Learn Library
* __Gaussian:__ Gaussian Naive Bayes is a special type of NB algorithm, used when the features have continuous values. It assumes that every feature follows a Gaussian (i.e., normal) distribution within each class.

* __Multinomial:__ It is used for discrete counts. For example, in a text classification problem, it goes one step beyond Bernoulli trials: instead of “word occurs in the document”, the feature is “how often the word occurs in the document”. You can think of it as “the number of times outcome number $x_i$ is observed over the n trials”.

* __Bernoulli:__ The Bernoulli model is useful if your feature vectors are binary (i.e., zeros and ones). One application is text classification with a ‘bag of words’ model, where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document” respectively.
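
The three variants differ only in how they model P(feature | class). The sketch below writes out the per-class log-likelihood of each model in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names, the toy word counts, and the probabilities are all invented.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """log N(x; mean, var) for one continuous feature."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def multinomial_log_likelihood(counts, word_probs):
    """Sum of count * log P(word | class); word_probs sum to 1.

    The multinomial coefficient is omitted because it is the same for
    every class and does not affect which class scores highest.
    """
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))


def bernoulli_log_likelihood(presence, word_probs):
    """Each word is present with probability p or absent with 1 - p."""
    return sum(math.log(p) if x else math.log(1.0 - p)
               for x, p in zip(presence, word_probs))


# Toy document over a 3-word vocabulary (hypothetical numbers).
counts = [3, 0, 1]                            # multinomial view: how often
presence = [1 if c > 0 else 0 for c in counts]  # Bernoulli view: whether

multinomial_ll = multinomial_log_likelihood(counts, [0.5, 0.3, 0.2])
bernoulli_ll = bernoulli_log_likelihood(presence, [0.8, 0.4, 0.3])
gaussian_ll = gaussian_log_likelihood(0.0, 0.0, 1.0)
```

Note that the Bernoulli model also scores the absent word (through `1 - p`), while the multinomial model ignores words with a count of zero.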

# Sample Code

# =============================================================================
# # Set Path
# =============================================================================
import os
# Working directory that contains adult.data.csv; adjust it to your own machine.
path = "/Users/xuefei.yang/Documents/GitHub/Machine-Learning-Notes/Naive Bayes"
os.chdir(path)

# =============================================================================
# # Import Packages
# =============================================================================
import pandas as pd
from sklearn import preprocessing
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed).
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# =============================================================================
# # Load Data
# =============================================================================
adult_df = pd.read_csv('adult.data.csv',header = None, delimiter=' *, *', engine='python')

adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']

# =============================================================================
# # Handling Missing Data
# =============================================================================
adult_df.isnull().sum()

## For string columns
for value in ['workclass', 'education',
          'marital_status', 'occupation',
          'relationship','race', 'sex',
          'native_country', 'income']:
    print (value,":", sum(adult_df[value] == '?'))
    

# =============================================================================
# # Data Preprocessing
# =============================================================================
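# Note: adult_df_rev is a second name for the same DataFrame, not a copy,
# so changes made through adult_df_rev are also visible through adult_df.
adult_df_rev = adult_df
adult_df_rev.describe(include= 'all')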

## Data Imputation Step
## Replace each "?" with the column's most frequent value (the "top" row of describe()).
for value in ['workclass', 'education',
          'marital_status', 'occupation',
          'relationship','race', 'sex',
          'native_country', 'income']:
    top_value = adult_df_rev[value].describe()['top']
    # Assign the result back instead of using inplace=True on a column selection.
    adult_df_rev[value] = adult_df_rev[value].replace('?', top_value)
   
## Label Encoder
le = preprocessing.LabelEncoder()
workclass_cat = le.fit_transform(adult_df.workclass)
education_cat = le.fit_transform(adult_df.education)
marital_cat   = le.fit_transform(adult_df.marital_status)
occupation_cat = le.fit_transform(adult_df.occupation)
relationship_cat = le.fit_transform(adult_df.relationship)
race_cat = le.fit_transform(adult_df.race)
sex_cat = le.fit_transform(adult_df.sex)
native_country_cat = le.fit_transform(adult_df.native_country)


## Initialize the encoded categorical columns
adult_df_rev['workclass_cat'] = workclass_cat
adult_df_rev['education_cat'] = education_cat
adult_df_rev['marital_cat'] = marital_cat
adult_df_rev['occupation_cat'] = occupation_cat
adult_df_rev['relationship_cat'] = relationship_cat
adult_df_rev['race_cat'] = race_cat
adult_df_rev['sex_cat'] = sex_cat
adult_df_rev['native_country_cat'] = native_country_cat


## Drop the old categorical columns from dataframe
dummy_fields = ['workclass', 'education', 'marital_status', 
                  'occupation', 'relationship', 'race',
                  'sex', 'native_country']
adult_df_rev = adult_df_rev.drop(dummy_fields, axis = 1)


## Change the order of columns
## reindex_axis was removed from pandas; reindex(columns=...) does the same job.
adult_df_rev = adult_df_rev.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                    'education_num', 'marital_cat', 'occupation_cat',
                                    'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                    'capital_loss', 'hours_per_week', 'native_country_cat',
                                    'income'])
 
adult_df_rev.head(1)


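## Standardization of Data
## Standardization puts all features on the same scale (zero mean, unit variance).
## Note: Gaussian Naive Bayes estimates a per-class mean and variance for each feature
## and does not use gradient descent, so this step matters less here than it does
## for gradient-based models.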
num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country_cat']
 
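scaled_features = {}
for each in num_features:
    mean, std = adult_df_rev[each].mean(), adult_df_rev[each].std()
    # Keep the statistics so the scaling can be reproduced or undone later.
    scaled_features[each] = [mean, std]
    # Replace the whole column so integer columns can become float.
    adult_df_rev[each] = (adult_df_rev[each] - mean)/std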


# =============================================================================
# # Data Split
# =============================================================================
# Select by column name so that the feature matrix is a purely numeric array.
features = adult_df_rev[num_features].values
target = adult_df_rev['income'].values
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size = 0.33, random_state = 10)

# =============================================================================
# # Gaussian Naive Bayes Implementation
# =============================================================================
clf = GaussianNB()
clf.fit(features_train, target_train)
target_pred = clf.predict(features_test)

## Accuracy
accuracy_score(target_test, target_pred, normalize = True)
# (target_test == target_pred).sum()/target_test.shape[0]

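Output of the sample code: the number of `?` entries per column, followed by the test-set accuracy.

workclass : 1836
education : 0
marital_status : 0
occupation : 1843
relationship : 0
race : 0
sex : 0
native_country : 583
income : 0

0.80141447980643965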

## Reference
Code Source: http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/

Data Source: https://archive.ics.uci.edu/ml/datasets/Adult

Naive Bayes Documentation: http://scikit-learn.org/stable/modules/naive_bayes.html

MultinomialNB Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

Tutorial: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/