Naive Bayes Classification With Sklearn

https://blog.sicara.com/naive-bayes-classifier-sklearn-python-example-tips-42d100429e44
    
    This notebook considers three of the Naive Bayes models available in the scikit-learn Python library:

    1) Gaussian: assumes that continuous features follow a normal distribution within each class.

    2) Multinomial: suited to discrete features such as counts.

    3) Bernoulli: suited to binary (0/1) features.
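
To make the distinction concrete, here is a minimal sketch that fits each of the three classifiers on the kind of feature it is meant for. The synthetic data, the variable names, and the 8-versus-1 Poisson rates are assumptions chosen only for illustration; they are not part of the Titanic analysis below.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)  # two balanced classes

# Continuous features: one Gaussian blob per class
X_cont = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])

# Count features: each class favours a different column
X_counts = np.vstack([rng.poisson([8, 1], size=(50, 2)),
                      rng.poisson([1, 8], size=(50, 2))])

# Binary features: whether each count exceeds a threshold
X_bin = (X_counts > 2).astype(int)

experiments = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

# Training accuracy of each model on its matching feature type
scores = {name: model.fit(X, y).score(X, y)
          for name, (model, X) in experiments.items()}

for name, score in scores.items():
    print("{:<14s} training accuracy = {:.2f}".format(name, score))
```

The Titanic features used below mix continuous columns (Age, Fare) with small integer codes, which is why the rest of the notebook uses GaussianNB.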

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Importing dataset
data = pd.read_csv("Titanic_train.csv")

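# Convert categorical variables to numeric codes
# (Sex: "male" -> 0, otherwise 1; Embarked: "S" -> 0, "C" -> 1, "Q" -> 2, anything else, including NaN -> 3)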
data["Sex_cleaned"]=np.where(data["Sex"]=="male",0,1)
data["Embarked_cleaned"]=np.where(data["Embarked"]=="S",0,
                                  np.where(data["Embarked"]=="C",1,
                                           np.where(data["Embarked"]=="Q",2,3)
                                          )
                                 )
# Keep only the columns used below and drop every row that contains a NaN
data=data[[
    "Survived",
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]].dropna(axis=0, how='any')


In [39]:
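# Split the dataset into training and test sets (both frames keep the "Survived" target column).
# The seed is taken from the clock, so the split and every number printed below change from run to run;
# pass a fixed integer as random_state to get a reproducible split.
X_train, X_test = train_test_split(data, test_size=0.5, random_state=int(time.time()))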

In [40]:
# Instantiate the classifier
gnb = GaussianNB()
used_features =[
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]

# Train classifier
gnb.fit(
    X_train[used_features].values,
    X_train["Survived"]
)
# Predict on a plain array too, so the input type matches the one used in fit
y_pred = gnb.predict(X_test[used_features].values)

# Print results
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          X_test.shape[0],
          (X_test["Survived"] != y_pred).sum(),
          100*(1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0])
))

Number of mislabeled points out of a total 357 points : 78, performance 78.15%


In [22]:
mean_survival=np.mean(X_train["Survived"])
mean_not_survival=1-mean_survival
print("Survival prob = {:03.2f}%, Not survival prob = {:03.2f}%"
.format(100*mean_survival,100*mean_not_survival))

Survival prob = 40.06%, Not survival prob = 59.94%


In [35]:

mean_fare_survived = np.mean(X_train[X_train["Survived"]==1]["Fare"])
std_fare_survived = np.std(X_train[X_train["Survived"]==1]["Fare"])
mean_fare_not_survived = np.mean(X_train[X_train["Survived"]==0]["Fare"])
std_fare_not_survived = np.std(X_train[X_train["Survived"]==0]["Fare"])

print("mean_fare_survived = {:03.3f}".format(mean_fare_survived))
print("std_fare_survived = {:03.2f}".format(std_fare_survived))
print("mean_fare_not_survived = {:03.2f}".format(mean_fare_not_survived))
print("std_fare_not_survived = {:03.2f}".format(std_fare_not_survived))



mean_fare_survived = 58.430
std_fare_survived = 84.95
mean_fare_not_survived = 20.95
std_fare_not_survived = 25.58
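
The class priors and the per-class mean and standard deviation of Fare printed above are the kind of quantities a Gaussian Naive Bayes model estimates for each feature. As a minimal sketch, assuming Fare is the only feature and plugging in the rounded values printed above, the posterior probability of survival can be computed by hand with Bayes' rule. The helper names `gaussian_pdf` and `posterior_survived` are made up for this illustration.

```python
import math

# Values copied from the outputs printed above (rounded)
prior_survived, prior_not_survived = 0.4006, 0.5994
mean_s, std_s = 58.430, 84.95   # Fare statistics for survivors
mean_n, std_n = 20.95, 25.58    # Fare statistics for non-survivors


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_survived(fare):
    """P(Survived = 1 | Fare = fare) from Bayes' rule with Gaussian likelihoods."""
    joint_s = prior_survived * gaussian_pdf(fare, mean_s, std_s)
    joint_n = prior_not_survived * gaussian_pdf(fare, mean_n, std_n)
    return joint_s / (joint_s + joint_n)


for fare in (10.0, 100.0):
    print("fare = {:6.2f} -> P(survived | fare) = {:.3f}".format(
        fare, posterior_survived(fare)))
```

With these numbers a low fare leans toward "not survived" and a high fare toward "survived". A fitted GaussianNB exposes its own estimates as `class_prior_`, `theta_` (means) and `var_` (variances, in scikit-learn 1.0 and later), and it multiplies the likelihoods of all features, so its predictions will not match this single-feature sketch exactly.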


In [44]:
# sklearn.metrics reports the same results in more detail
import sklearn.metrics as skm
print('Confusion Matrix:\n', skm.confusion_matrix(X_test['Survived'], y_pred))
print('Accuracy using sklearn.metrics : ', skm.accuracy_score(X_test['Survived'], y_pred))
print('Precision , Recall , F1-Score and Support : \n', skm.classification_report(X_test['Survived'], y_pred))

Confusion Matrix:
 [[180  42]
 [ 36  99]]
Accuracy using sklearn.metrics :  0.7815126050420168
Precision , Recall , F1-Score and Support : 
              precision    recall  f1-score   support

          0       0.83      0.81      0.82       222
          1       0.70      0.73      0.72       135

avg / total       0.78      0.78      0.78       357
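
The figures in the report follow directly from the confusion matrix. As a small worked example, the sketch below recomputes them from the four counts printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = np.array([[180, 42],
               [36, 99]])

true_pos = np.diag(cm)                 # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)     # divide by row totals (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()
mislabeled = cm.sum() - true_pos.sum()

print("precision :", np.round(precision, 2))
print("recall    :", np.round(recall, 2))
print("f1-score  :", np.round(f1, 2))
print("support   :", cm.sum(axis=1))
print("accuracy  : {:.4f}".format(accuracy))
print("mislabeled:", mislabeled)
```

The off-diagonal counts, 42 and 36, add up to the 78 mislabeled points reported earlier, and the accuracy matches the 78.15% printed by the first evaluation cell.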

