# Naive Bayes

Its Naive(innocent) because it assumes that all the features are independent of each other. Which is almost never possible.  

1. Easy to understand.  
2. All features are independent.  
3. All impact results equally.  
4. Need small amount of data to train the model.  
5. Fast – up to 100X faster.  
6. It is highly scalable.  
7. It can make probabilistic predictions.  
8. It's simple & out-performs many sophisticated methods.  
9. Stable to data changes.  


## Bayes’s Theorem
It describes the probability of an event, based on prior knowledge of conditions that might be related to the event.  

$ P(A|B) = \frac {P(B|A)*P(A)}{P(B)}$  

Suppose:  
Fact_1 = 200 cars/day   
Fact_2 = 300 cars/day  
Out of all Cars produced: 2% are faulty/having issue
Out of these faulty cars 50% came from each Factory.

Question:
What is the probability that a car manufactured by Fact_1 is faulty? P(Faulty | Fact_1)? 

Solution:  
Car Manufctured by Factory 1: P(Fact_1) = $\frac {200}{200 + 300} = 0.4$  
Car Manufctured by Factory 2: P(Fact_2) = $\frac {300}{200 + 300} = 0.6$  

2% of cars are Faulty = P(Faulty) = 0.02

Probability of a Faulty Car coming out of Factory 1: P(Fact_1 | Faulty) = 0.5  
Probability of a Faulty Car coming out of Factory 2: P(Fact_2 | Faulty) = 0.5  


P(Faulty | Fact_1) = $\frac {P(Fact_1|Faulty)*P(Faulty)}{P(Fact_1)}  = \frac{0.5*0.02}{0.4}$ = 2.5%

Example:  
Total 500 Cars  
Fact_1 = 200  
Fact_2 = 300  
Faulty = 10  
50% came froom Fact_1 = 5  
% of Faulty Cars came from Fact_1 = $\frac{5}{200}$ = 2.5%  

In [1]:
#Import Libraries
import pandas as pd                 # pandas is a dataframe library
import matplotlib.pyplot as plt      # matplotlib.pyplot plots data
%matplotlib inline

#Read the data
df = pd.read_csv("Data\Classification\pima-data.csv")

#Check the Correlation
df.corr()
#Delete the correlated feature
del df['skin']

#Data Molding
diabetes_map = {True : 1, False : 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)

#Splitting the data
from sklearn.model_selection import train_test_split

feature_col_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness', 'insulin', 'bmi', 'diab_pred', 'age']
predicted_class_names = ['diabetes']

X = df[feature_col_names].values     # predictor feature columns (8 X m)
y = df[predicted_class_names].values # predicted class (1=true, 0=false) column (1 X m)
split_test_size = 0.30

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_test_size, random_state=42) 

#Imputing
from sklearn.impute import SimpleImputer 

#Impute with mean all 0 readings
fill_0 = SimpleImputer(missing_values=0, strategy="mean")

X_train = fill_0.fit_transform(X_train)
X_test = fill_0.fit_transform(X_test)

In [2]:
#Training with Naive Bayes
from sklearn.naive_bayes import GaussianNB

# create Gaussian Naive Bayes model object and train it with the data
nb_model = GaussianNB()

nb_model.fit(X_train, y_train.ravel())

GaussianNB(priors=None, var_smoothing=1e-09)

In [3]:
# Calculate Accuracy on training data
# predict values using the training data
nb_predict_train = nb_model.predict(X_train)

# import the performance metrics library
from sklearn import metrics

# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, nb_predict_train)))
print()

Accuracy: 0.7542



In [4]:
# Calculate Accuracy on test data
# predict values using the testing data
nb_predict_test = nb_model.predict(X_test)

from sklearn import metrics

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, nb_predict_test)))

Accuracy: 0.7359


In [8]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test, nb_predict_test) )
print("\n")
print("Classification Report")
print(metrics.classification_report(y_test, nb_predict_test))

[[118  33]
 [ 28  52]]


Classification Report
              precision    recall  f1-score   support

           0       0.81      0.78      0.79       151
           1       0.61      0.65      0.63        80

   micro avg       0.74      0.74      0.74       231
   macro avg       0.71      0.72      0.71       231
weighted avg       0.74      0.74      0.74       231

