# Data understanding

#### a) Define the question

Perform classification of the testing set samples using the Naive Bayes Classifier on the data and compute the accuracy.

#### b) Metrics for success

The project will be considered successful if we are able to get a high accuracy score after comparing the different test sizes, with the Optimal Naive Bayes Classifier.

#### c) Recording the experimental design 

The following steps will be followed during this exercise:
- Data Understanding
- Data Preparation
- Data Cleaning
- Multivariate Analysis
- Modelling with Naive Bayes Classifier
- Evaluation
- Challenging the solution

#### d) Data relevance. 

This will be described in our data exploration

# Data preparation

## Importing the libraries. 

In [62]:
# Import all necessary libraries. 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
import xgboost as xgb
from xgboost import XGBClassifier




## Reading the Data 

In [3]:
# Loading our datset
#
df = pd.read_csv('spambase.data')


In [4]:
# Previewing the first 10 records of our dataframe
#
df.head(10)

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1
5,0.0,0.0,0.0,0.0,1.92,0.0,0.0,0.0,0.0,0.64,0.96,1.28,0.0,0.0,0.0,0.96,0.0,0.32,3.85,0.0,0.64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054,0.0,0.164,0.054,0.0,1.671,4,112,1
6,0.0,0.0,0.0,0.0,1.88,0.0,0.0,1.88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.206,0.0,0.0,0.0,0.0,2.45,11,49,1
7,0.15,0.0,0.46,0.0,0.61,0.0,0.3,0.0,0.92,0.76,0.76,0.92,0.0,0.0,0.0,0.0,0.0,0.15,1.23,3.53,2.0,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.271,0.0,0.181,0.203,0.022,9.744,445,1257,1
8,0.06,0.12,0.77,0.0,0.19,0.32,0.38,0.0,0.06,0.0,0.0,0.64,0.25,0.0,0.12,0.0,0.0,0.12,1.67,0.06,0.71,0.0,0.19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.04,0.03,0.0,0.244,0.081,0.0,1.729,43,749,1
9,0.0,0.0,0.0,0.0,0.0,0.0,0.96,0.0,0.0,1.92,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.96,3.84,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.462,0.0,0.0,1.312,6,21,1


In [5]:
# Checking the data types of our different columns
#
spamdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

> All the columns in the dataset seem to have the appropriate datatype. 

In [6]:
df.describe()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0
mean,0.104576,0.212922,0.280578,0.065439,0.312222,0.095922,0.114233,0.105317,0.090087,0.239465,0.059837,0.54168,0.09395,0.058639,0.049215,0.248833,0.142617,0.184504,1.662041,0.085596,0.809728,0.121228,0.101667,0.094289,0.549624,0.265441,0.767472,0.124872,0.098937,0.102874,0.064767,0.047059,0.09725,0.047846,0.105435,0.097498,0.136983,0.013204,0.078646,0.064848,0.043676,0.132367,0.046109,0.079213,0.301289,0.179863,0.005446,0.031876,0.038583,0.139061,0.01698,0.26896,0.075827,0.044248,5.191827,52.17087,283.290435,0.393913
std,0.305387,1.2907,0.50417,1.395303,0.672586,0.27385,0.39148,0.401112,0.278643,0.644816,0.201565,0.861791,0.301065,0.335219,0.258871,0.825881,0.444099,0.53093,1.775669,0.509821,1.200938,1.025866,0.350321,0.442681,1.671511,0.887043,3.367639,0.538631,0.593389,0.456729,0.403435,0.328594,0.555966,0.32948,0.532315,0.402664,0.423493,0.220675,0.434718,0.349953,0.361243,0.7669,0.223835,0.622042,1.011787,0.911214,0.076283,0.285765,0.243497,0.270377,0.109406,0.815726,0.245906,0.429388,31.732891,194.912453,606.413764,0.488669
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,1.31,0.0,0.22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.2755,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.3825,0.0,0.0,0.0,0.0,0.16,0.0,0.8,0.0,0.0,0.0,0.1,0.0,0.0,2.64,0.0,1.27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.188,0.0,0.31425,0.052,0.0,3.70525,43.0,265.25,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,2.61,9.67,5.55,10.0,4.41,20.0,7.14,9.09,18.75,18.18,11.11,17.1,5.45,12.5,20.83,16.66,33.33,9.09,14.28,5.88,12.5,4.76,18.18,4.76,20.0,7.69,6.89,8.33,11.11,4.76,7.14,14.28,3.57,20.0,21.42,22.05,2.17,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


> These are the summary statistics of our dataset

## Exploring the Data

In [7]:
# Checking the number of rows and columns
#
print(df.shape)

(4600, 58)


> Our Spam feature dataset has 58 columns and 4600 entries. 

In [8]:
# Renaming our dataset as per the description provided
# 
column_names = ['word_freq_make'  ,'word_freq_address' ,'word_freq_all' ,'word_freq_3d' ,'word_freq_our'        
,'word_freq_over' ,'word_freq_remove' ,'word_freq_internet' ,'word_freq_order' ,'word_freq_mail','word_freq_receive' ,'word_freq,_will' ,'word_freq_people' ,'word_freq_report' ,'word_freq_addresses'  
,'word_freq_free' ,'word_freq_business' ,'word_freq_email' ,'word_freq_you' ,'word_freq_credit'     
,'word_freq_your' ,'word_freq_font' ,'word_freq_000' ,'word_freq_money' ,'word_freq_hp' ,'word_freq_hpl' ,'word_freq_george' ,'word_freq_650' ,'word_freq_lab' ,'word_freq_labs'       
,'word_freq_telnet' ,'word_freq_857' ,'word_freq_data' ,'word_freq_415' ,'word_freq_85' ,'word_freq_technology' 
,'word_freq_1999' ,'word_freq_parts' ,'word_freq_pm' ,'word_freq_direct' ,'word_freq_cs' ,'word_freq_meeting' ,'word_freq_original' ,'word_freq_project' ,'word_freq_re' ,'word_freq_edu' ,'word_freq_table' ,'word_freq_conference' 
,'char_freq_;' ,'char_freq_(' ,'char_freq_[' ,'char_freq_!' ,'char_freq_$' ,'char_freq_#','capital_run_length_average'
,'capital_run_length_longest' ,'capital_run_length_total', 'class']

df.columns = column_names
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,"word_freq,_will",word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


# Data Cleaning


In [9]:
# Cheking our column names
#
df.columns

Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq,_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',

> Our column titles have no issues and are all consistent.

In [10]:
# Checking for the missing values
#
df.isna().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq,_will               0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

> We have no missing data. 

# Multivariate Analysis

In [15]:
# spliting the data into X and y
X = df.drop('class', 1)
y = df['class'].values


In [16]:
from sklearn.model_selection import train_test_split
# Split into training and testing data 

X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2, random_state = 42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [17]:
# performing LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 13)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)




In [18]:
# Getting the LDA coeficients

factors = pd.DataFrame (index = X.columns.values, data = lda.coef_[0].T)
factors.sort_values(0, ascending = False).head(10)

Unnamed: 0,0
word_freq_remove,2.122862
char_freq_$,2.022548
word_freq_000,1.858357
word_freq_over,1.096539
word_freq_internet,0.983911
word_freq_money,0.882474
word_freq_order,0.822565
word_freq_our,0.805913
word_freq_free,0.760607
word_freq_credit,0.68185


> From this we can see that our most important features are word_freq_remove then char_freq_$, word_freq_000, word_freq_over, word_freq_internet, word_freq_money, word_freq_order, word_freq_our, word_freq_free then word_freq_credit respectively in that order.

# Modelling 

## Baseline Modelling

In [45]:
# Defining our predictor and target variables
#
X = df.drop(['class'], axis = 1)
y = df['class']

# Spliting our dataset
#
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Scaling predictor variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

# Fitting the data
regressor = LogisticRegression()
regressor.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
# Making the prediction. 
#
y_pred = regressor.predict(X_test)

In [26]:
# Getting the score of the baseline model. 
#
cm = confusion_matrix(y_test, y_pred)
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('The confusuion matrix is: ', "\n",cm)
print ('The accuracy score: ', accuracy_score(y_test, y_pred))

Root Mean Squared Error: 0.2553769592276246
Mean Squared Error: 0.06521739130434782
The confusuion matrix is:  
 [[513  25]
 [ 35 347]]
The accuracy score:  0.9347826086956522


> Our model has an RMSE of 0.2553 which is sufficiently low, showing that our model is a good fit. The rest of the metrice are also displayed above, and they support that the logistic regression model is a good fit.  

## Naive Bayes Model
We shall use the Gaussian Naive Bayes model since our data set is continuous. We shall assume that our features follow a normal distribution.

In [29]:
# Fitting our model
#
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

In [36]:
# Predicting our test predictors
predicted = model.predict(X_test)
print ('The accuracy score: ', accuracy_score(y_test, predicted))

The accuracy score:  0.8402173913043478


> Using Gaussian Naive Bayes, we have an accuracy score of 84% which generally is a good fit, but not high enough for us.

### Hyperparameter tuning

In [39]:
cv_method = RepeatedStratifiedKFold(n_splits=5,  n_repeats=3, random_state=999)

In [41]:
from sklearn.model_selection import GridSearchCV
params_NB = {'var_smoothing': np.logspace(0,-7, num=100)}
gs_NB = GridSearchCV(estimator=model, param_grid=params_NB, cv=cv_method,verbose=1,scoring='accuracy')
gs_NB.fit(X_test, y_test);

Fitting 15 folds for each of 100 candidates, totalling 1500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed:    3.4s finished


In [42]:
# Predict
predict_test = gs_NB.predict(X_test)
# Check accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.8369565217391305


> Our tuned model has a lower accuracy score than our untuned model, thus we shall go with the untuned model 

In [46]:
# Defining the list of test sizes we will use for the assessment
#
test_size = [0.1, 0.2, 0.3, 0.4, 0.5]

# Using a for loop to split the dataset, fit it with the optimal parameters then get the accuracy score.
#
for test in test_size:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test, random_state = 0)
    
    # Fitting to the classifier
    clf = GaussianNB()  
    model = clf.fit(X_train, y_train)
    # Predicting our test predictors
    predicted = model.predict(X_test)

    print("Test size {} has accuracy score:".format(test), (metrics.accuracy_score(y_test, predicted)*100))

Test size 0.1 has accuracy score: 82.3913043478261
Test size 0.2 has accuracy score: 83.04347826086956
Test size 0.3 has accuracy score: 82.82608695652173
Test size 0.4 has accuracy score: 82.01086956521739
Test size 0.5 has accuracy score: 81.91304347826087


> The model with a test size of 0.2 has the highest accuracy score, thus is the most optimal test size.

# Challenging the Solution
We will challenge the solution with XGBoost algorithm.

> 

In [55]:
# Defining the regression of our classifer to linear
#
xgb_model = xgb.XGBClassifier(objective='reg:linear',random_state=42)

In [70]:
# Defining the predictor and target variables
#
import re
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
df.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in df.columns.values]


X = df.drop(['class'], axis = 1)
y = df['class']

# Splitting the data
#
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2, random_state = 42)
# Fitting the data into our model
#

xgb_model.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='reg:linear', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [71]:
# Predict the data using the test set
#
y_predict = xgb_model.predict(X_test)

In [73]:
# Printing out the various metrics
#
from sklearn.metrics import classification_report
print(confusion_matrix(y_test,y_predict))
print(classification_report(y_test,y_predict))

[[507  23]
 [ 53 337]]
              precision    recall  f1-score   support

           0       0.91      0.96      0.93       530
           1       0.94      0.86      0.90       390

    accuracy                           0.92       920
   macro avg       0.92      0.91      0.91       920
weighted avg       0.92      0.92      0.92       920



> Our XGboost model has an accuracy score of 92% and an f1-score of 93%, which is  better than the metrics we got for Gaussian Naive Bayes model, thus it's a better model for our dataset