# <b> Naive Bayes </b>
___

<b> Table of Contents: </b>
<br> [Pipeline 1](#100)
<br> [Pipeline 2](#200)
<br> [Pipeline 3](#300)

Loading Modules and Datasets

In [27]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# import necessary packages  
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import confusion_matrix, classification_report 
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss 

In [28]:
# read csv file to a pandas dataframe
df_pipeline1 = pd.read_csv("pipeline_1.csv")

<a id = "100"> <h2> Pipeline 1 </h2> </a>
___

> Declare Features and Target

In [29]:
# show all columns in dataset
list(df_pipeline1.columns)

['PageValues_iqr_yj_zscore',
 'Month_Nov',
 'VisitorType_New_Visitor',
 'Browser_12',
 'TrafficType_2',
 'TrafficType_3',
 'Browser_13',
 'Month_May',
 'TrafficType_16',
 'TrafficType_13',
 'OperatingSystems_6',
 'OperatingSystems_3',
 'SpecialDay_0.8',
 'TrafficType_1',
 'Month_Mar',
 'TrafficType_15',
 'TrafficType_8',
 'Month_Feb',
 'Administrative_Duration_iqr_yj_zscore',
 'OperatingSystems_7',
 'avg_exit_bounce_rates_iqr_yj_zscore',
 'Revenue']

In [30]:
# Define Features and Target variables
X = df_pipeline1.iloc[:, :-1] # Features are all columns in the dataframe except the last column
Y = df_pipeline1.iloc[:, -1] # Target is the last column in the dataframe: 'Revenue'

In [31]:
# Split dataset into training set and test set 
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=2019)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(X_train, y_train)

print('Before OverSampling, the shape of train_X: {}'.format(X_train.shape)) 
print('Before OverSampling, the shape of train_y: {} \n'.format(y_train.shape)) 
  
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {}".format(sum(y_train == 0)))

# Predict the response for test dataset
y_pred = gnb.predict(X_test)

# print classification report 
print(classification_report(y_test, y_pred))

Before OverSampling, the shape of train_X: (8631, 21)
Before OverSampling, the shape of train_y: (8631,) 

Before OverSampling, counts of label '1': 1340
Before OverSampling, counts of label '0': 7291
              precision    recall  f1-score   support

           0       0.98      0.25      0.40      3131
           1       0.19      0.97      0.32       568

    accuracy                           0.36      3699
   macro avg       0.58      0.61      0.36      3699
weighted avg       0.86      0.36      0.39      3699



In [32]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy Naive Bayes, pipeline 1:", metrics.accuracy_score(y_test, y_pred).round(4))

Accuracy Naive Bayes, pipeline 1: 0.3601
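
To put the accuracy above in context, the short sketch below compares it with a hypothetical baseline that predicts class 0 for every test session. The class counts come from the `support` column of the classification report; the baseline itself is an illustration and not part of the original analysis.

```python
# Test-set class counts, taken from the "support" column of the report above.
n_class0 = 3131  # sessions without a purchase
n_class1 = 568   # sessions with a purchase
n_total = n_class0 + n_class1

# Hypothetical baseline: predict class 0 for every test session.
# Every class 0 row is then correct and every class 1 row is wrong.
baseline_accuracy = n_class0 / n_total

# Accuracy reported for the Gaussian Naive Bayes model above.
model_accuracy = 0.3601

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Gaussian Naive Bayes accuracy:    {model_accuracy:.4f}")
```

On an imbalanced target like 'Revenue', this kind of baseline is a useful reference point, because a model can score high accuracy without finding any buyers.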


<b> 1.1 Synthetic Minority Oversampling Technique (SMOTE) </b>

We oversample the training set because the classes of our target variable 'Revenue' are imbalanced:
* class 0: 84.53%
* class 1: 15.47%

For more information, see Prof. Jie Tao's notebook on handling imbalanced data: [link](https://github.com/DrJieTao/ba545-docs/blob/master/competition2/handling_imbalanced_data_part2.ipynb)

In [33]:
sm = SMOTE(random_state = 2019) 
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # fit_sample was renamed to fit_resample in imbalanced-learn

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

After OverSampling, the shape of train_X: (14582, 21)
After OverSampling, the shape of train_y: (14582,) 

After OverSampling, counts of label '1': 7291
After OverSampling, counts of label '0': 7291




The SMOTE algorithm has oversampled the minority class until it matches the majority class:
* Both classes (0 and 1) now have 7291 training instances, so the training set is balanced.
* Class 1 grew from 1340 to 7291 instances, an increase of 5951 synthetic instances.
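
The synthetic class 1 rows are not copies of existing rows. As a rough illustration of the idea behind SMOTE, the sketch below creates new points by interpolating between a minority sample and one of its nearest minority neighbours. The function `smote_like_samples` and the toy data are hypothetical and simplified; this is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=2019):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    k = min(k, n - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # pick a random minority sample
        neighbour = rng.choice(neighbours[base])  # pick one of its k neighbours
        gap = rng.random()                        # position along the connecting segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority class with two features (made-up values).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=6, k=2)
print(X_new.shape)
```

Each synthetic row lies on the line segment between two existing minority rows, so it stays inside the region the minority class already occupies.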

In [34]:
gnb1 = GaussianNB()
gnb1.fit(X_train_res, y_train_res)
y_pred = gnb1.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.15      0.25      3131
           1       0.17      0.97      0.29       568

    accuracy                           0.27      3699
   macro avg       0.57      0.56      0.27      3699
weighted avg       0.84      0.27      0.26      3699



In [35]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy Naive Bayes, pipeline 1:", metrics.accuracy_score(y_test, y_pred).round(4))

Accuracy Naive Bayes, pipeline 1: 0.2733


### Take-aways:
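- Overall accuracy after SMOTE oversampling is 27.33%. This is accuracy across both classes, not accuracy for class 1 (predicting whether a web user will make a purchase) alone.
- Before oversampling it was 36.01%, so accuracy fell by 8.68 percentage points. For class 1, recall stayed at 0.97 while precision slipped from 0.19 to 0.17 and the F1 score from 0.32 to 0.29, so oversampling did not make this model less biased toward predicting a purchase.
- Both results are far below what predicting class 0 for every session would score (3131 of 3699 test sessions, about 84.6%), so accuracy alone is a poor yardstick on this imbalanced target.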
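
The overall accuracy reported above can be approximately recovered from the per-class recall and support in the SMOTE classification report, which shows how the low class 0 recall drives the result. The sketch below uses only numbers from that report; because the recall values are rounded to two decimals, the result only approximates the reported 0.2733.

```python
# Per-class recall and support from the SMOTE classification report above.
recall = {0: 0.15, 1: 0.97}
support = {0: 3131, 1: 568}

# Recall times support approximates the number of correctly classified rows per class.
correct = {label: recall[label] * support[label] for label in recall}

# Accuracy is the share of all test rows that were classified correctly.
accuracy = sum(correct.values()) / sum(support.values())

for label in (0, 1):
    print(f"Approx. correct class {label}: {correct[label]:.0f} of {support[label]}")
print(f"Approx. overall accuracy: {accuracy:.3f}")
```

Class 0 makes up most of the test set, so its recall of 0.15 dominates the accuracy even though almost all class 1 rows are found.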

<b> 1.2 NearMiss Under-Sampling Technique </b>

In [36]:
# apply near miss 
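nr = NearMiss()  # recent imbalanced-learn releases no longer accept random_state for NearMiss
  
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train)  # fit_sample was renamed to fit_resample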
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

After Undersampling, the shape of train_X: (2680, 21)
After Undersampling, the shape of train_y: (2680,) 

After Undersampling, counts of label '1': 1340
After Undersampling, counts of label '0': 1340




The NearMiss algorithm has undersampled the majority class until it matches the minority class:
* Both classes (0 and 1) now have 1340 training instances, so the training set is balanced.
* Class 0 shrank from 7291 to 1340 instances, a decrease of 5951 instances.
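
Which class 0 rows survive is not random. As a rough illustration of the NearMiss-1 rule (keep the majority rows whose average distance to their closest minority rows is smallest), here is a simplified sketch. The function `nearmiss1_indices` and the toy data are hypothetical; this is not the imbalanced-learn implementation.

```python
import numpy as np

def nearmiss1_indices(X_majority, X_minority, n_keep, k=3):
    """Return indices of the majority rows kept by a NearMiss-1 style rule."""
    X_majority = np.asarray(X_majority, dtype=float)
    X_minority = np.asarray(X_minority, dtype=float)

    # Euclidean distance from every majority row to every minority row.
    diffs = X_majority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Mean distance from each majority row to its k closest minority rows.
    k = min(k, X_minority.shape[0])
    mean_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)

    # Keep the majority rows with the smallest mean distance.
    return np.argsort(mean_k, kind="stable")[:n_keep]

# Toy data with two features (made-up values).
X_maj = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [9.0, 9.0]])
X_min = np.array([[1.0, 0.0], [0.0, 1.0]])

kept = nearmiss1_indices(X_maj, X_min, n_keep=2, k=2)
print(sorted(kept.tolist()))
```

In the toy data the two majority rows closest to the minority rows are kept and the distant ones are dropped, so the retained class 0 rows sit near the class boundary rather than representing class 0 as a whole.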

In [37]:
# train the model on train set 
gnb2 = GaussianNB()
gnb2.fit(X_train_miss, y_train_miss) 
y_pred = gnb2.predict(X_test) 

# print classification report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.93      0.88      3131
           1       0.14      0.07      0.09       568

    accuracy                           0.79      3699
   macro avg       0.49      0.50      0.49      3699
weighted avg       0.74      0.79      0.76      3699



### Take-aways:
- After undersampling, the F1 score for class 1 (predicting whether a web user will make a purchase) is 0.09, with precision 0.14 and recall 0.07. Overall accuracy is 79%.
- Before undersampling, the class 1 F1 score was 0.32, so it dropped by 0.23. NearMiss removed class 0 rows, and the model now predicts class 0 for most sessions.
- The 79% accuracy is still below the roughly 84.6% that predicting class 0 for every session would score (3131 of 3699 test sessions).
- Oversampling with SMOTE gives a better class 1 F1 score (0.29 vs. 0.09) and recall (0.97 vs. 0.07), but much lower overall accuracy (27.33% vs. 79%). If the goal is to find buyers, SMOTE is probably the better of the two techniques for this dataset.
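
Because accuracy and class 1 F1 point in different directions here, a single imbalance-aware number can help compare the three runs. The sketch below computes the mean of the per-class recalls (the "macro avg" recall in the reports) from the values printed above. The dictionary layout is only for illustration.

```python
# Per-class recall copied from the three classification reports above.
reported_recall = {
    "no resampling": {0: 0.25, 1: 0.97},
    "SMOTE": {0: 0.15, 1: 0.97},
    "NearMiss": {0: 0.93, 1: 0.07},
}

# Mean of the two recalls: both classes count equally, regardless of their size.
macro_recall = {
    name: (recalls[0] + recalls[1]) / 2
    for name, recalls in reported_recall.items()
}

for name, score in macro_recall.items():
    print(f"{name:>14}: {score:.2f}")
```

A value of 0.50 is what a classifier that ignores the features would reach on this measure, so on these numbers none of the three runs separates the classes well.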

<a id = "200"> <h2> Pipeline 2 </h2> </a>
___

<a id = "300"> <h2> Pipeline 3 </h2> </a>
___