In [None]:
"""
Project 6 README
=====================================

Name:
Classifier Performance on Real-World Data

Purpose:

In this project we will run into a new set of challenges when working 
with a “real-world” dataset, and see how an imbalanced 
dataset can influence classifier performance.

Authors:

Brandon Ryan, Yi Wei Lee, Mina Moslehpour

Last modified by Brandon Ryan, October 14 2020

Experiments:

1.  Download bank-additional.zip and extract its contents. Since the dataset is 
    large and some of the algorithms we will use can be time-consuming, we will 
    train with bank-additional.csv, which is a subset of the original dataset.

    Once our models are trained, we will test against the full dataset, which is in 
    bank-additional-full.csv.

    The archive also contains a text file, bank-additional-names.txt, which 
    describes the dataset and what each column represents.

2.  Use read_csv() to load and examine the training and test sets. Unlike most 
    CSV files, the separator is actually ';' rather than ','.

3.  The training and test DataFrames will need some significant preprocessing 
    before they can be used:
    a. Several of the features are categorical variables and will need to be 
       turned into numbers before they can be used by ML algorithms. 
       The simplest way to accomplish this is to use dummy coding using get_dummies().

       Some algorithms (e.g. logistic regression) have problems with collinear 
       features. If you use one-hot encoding, one dummy variable will be a linear 
       combination of the other dummy variables, so be sure to pass drop_first=True.

    b. Per bank-additional-names.txt, the feature duration “should be discarded 
       if the intention is to have a realistic predictive model,” so it is removed.

    c. The feature y (or y_yes after dummy coding) is the target, so it should be 
       removed from the feature set.

    d. Some algorithms (e.g. KNN and SVM) require non-categorical features to be standardized. 

4.  Fit Naive Bayes, KNN, and SVM classifiers to the training set, then score 
    each classifier on the test set. Which classifier has the highest accuracy?

5.  These numbers look pretty good, but let’s take another look at the data. 
    How many values in the training set have y_yes = 0, and how many have 
    y_yes = 1? What would be the accuracy if we simply assumed 
    that no customer ever subscribed to the product?

6.  Use np.zeros_like() to create a target vector representing the output of 
    the “dumb” classifier of experiment (5), then create a confusion matrix 
    and find its AUC.

7.  Create a confusion matrix and find the AUC for each of the classifiers 
    of experiment (4). Is the best classifier the one with the highest accuracy?

8.  One of the easiest ways to deal with an unbalanced dataset is random oversampling. 
    This can be done with an imblearn.over_sampling.RandomOverSampler object. 
    Use fit_resample() to generate a balanced training set.

9.  Repeat experiments (4) and (7) on the balanced training set of experiment (8).
    Which classifier performs the best, and how much better is its performance?

Platforms:

Jupyter Notebook

Libraries:

scikit-learn
imbalanced-learn
pandas
Matplotlib
numpy

"""



### 1. Download bank-additional.zip and extract its contents. Since the dataset is large and some of the algorithms we will use can be time-consuming, we will train with bank-additional.csv, which is a subset of the original dataset.
### Once our models are trained, we will test against the full dataset, which is in bank-additional-full.csv.
### The archive also contains a text file, bank-additional-names.txt, which describes the dataset and what each column represents.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plotter

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

%matplotlib inline

# Upload the data files to Google Colab
from google.colab import files
uploaded = files.upload()
!ls

Saving bank-additional.csv to bank-additional (2).csv
'bank-additional (1).csv'  'bank-additional-full (1).csv'
'bank-additional (2).csv'   bank-additional-full.csv
 bank-additional.csv	    sample_data


### 2. Use read_csv() to load and examine the training and test sets. Unlike most CSV files, the separator is actually ';' rather than ','.


In [None]:
pd.set_option('display.max_columns', 30)

# Data pulled from: 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/'
bank_data = pd.read_csv("bank-additional.csv", sep=';')
print(bank_data.shape)

# The full dataset is not loaded in this notebook; uncomment to load it as the test set.
#bank_data_full = pd.read_csv("bank-additional-full.csv", sep=';')
#print(bank_data_full)

(4119, 21)
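
As a small illustration of why the separator matters, the sketch below parses a semicolon-delimited snippet with and without sep=';'. The three-column sample is invented for the example and does not contain real rows from the bank dataset.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample (not real rows from the dataset).
sample = "age;job;y\n30;blue-collar;no\n39;services;yes\n"

# With the default ',' separator each line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=';' splits each line into the intended three columns.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

A quick look at the shape right after loading is usually enough to catch a wrong separator, since the column count collapses to 1.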


### 3. The training and test DataFrames will need some significant preprocessing before they can be used:

### a. Several of the features are categorical variables and will need to be turned into numbers before they can be used by ML algorithms. The simplest way to accomplish this is to use dummy coding using get_dummies().

### Some algorithms (e.g. logistic regression) have problems with collinear features. If you use one-hot encoding, one dummy variable will be a linear combination of the other dummy variables, so be sure to pass drop_first=True.

### b. Per bank-additional-names.txt, the feature duration “should be discarded if the intention is to have a realistic predictive model,” so it is removed.

### c. The feature y (or y_yes after dummy coding) is the target, so it should be removed from the feature set.
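
The preprocessing steps listed above can be sketched on a toy DataFrame as follows. The column values are invented for illustration, and calling get_dummies() on the whole frame (rather than column by column) is one possible approach, not necessarily the one used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the bank data (values invented for the example).
toy = pd.DataFrame({
    'age': [30, 39, 25, 47],
    'marital': ['married', 'single', 'married', 'divorced'],
    'duration': [487, 346, 227, 58],
    'y': ['no', 'no', 'yes', 'no'],
})

# (b) Drop 'duration' before encoding.
toy = toy.drop(columns='duration')

# (a) Dummy-code every categorical column at once; drop_first=True removes
# one level per variable so the remaining dummies are not collinear.
encoded = pd.get_dummies(toy, drop_first=True)

# (c) Separate the target (y_yes) from the features.
target = encoded['y_yes'].astype(int)
features = encoded.drop(columns='y_yes')

print(list(features.columns))
print(target.tolist())
```

Numeric columns such as age pass through get_dummies() unchanged, so they stay available as features.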

In [None]:
# Dummy-code each categorical column, dropping the first level to avoid collinearity.
# Note: these frames are not used below; the feature matrix is built by the
# LabelEncoder/OneHotEncoder path instead.
job = pd.get_dummies(bank_data['job'], drop_first=True)
marital = pd.get_dummies(bank_data['marital'], drop_first=True)
education = pd.get_dummies(bank_data['education'], drop_first=True)
default = pd.get_dummies(bank_data['default'], drop_first=True)
housing = pd.get_dummies(bank_data['housing'], drop_first=True)
loan = pd.get_dummies(bank_data['loan'], drop_first=True)
contact = pd.get_dummies(bank_data['contact'], drop_first=True)
month = pd.get_dummies(bank_data['month'], drop_first=True)
day_of_week = pd.get_dummies(bank_data['day_of_week'], drop_first=True)
poutcome = pd.get_dummies(bank_data['poutcome'], drop_first=True)

# Removing 'duration' column from bank_data
bank_data = bank_data.drop(columns='duration')

# Removing y (target) from bank_data; y gets one indicator column per label ('no', 'yes')
y = pd.get_dummies(bank_data['y'])
bank_data = bank_data.drop(columns='y')

# Keep only the categorical (object) columns.
# Note: this discards the numeric features.
bank_data = bank_data.select_dtypes(include=[object])

# Label-encode each categorical column, then one-hot encode the integer codes
le = preprocessing.LabelEncoder()
bank_data_le = bank_data.apply(le.fit_transform)
print(bank_data_le)

enc = preprocessing.OneHotEncoder()
enc.fit(bank_data_le)
onehotlabels = enc.transform(bank_data_le).toarray()
print(onehotlabels.shape)

      job  marital  education  default  housing  loan  contact  month  \
0       1        1          2        0        2     0        0      6   
1       7        2          3        0        0     0        1      6   
2       7        1          3        0        2     0        1      4   
3       7        1          2        0        1     1        1      4   
4       0        1          6        0        2     0        0      7   
...   ...      ...        ...      ...      ...   ...      ...    ...   
4114    0        1          1        0        2     2        0      3   
4115    0        1          3        0        2     0        1      3   
4116    8        2          3        0        0     0        0      6   
4117    0        1          3        0        0     0        0      1   
4118    4        2          3        0        2     0        0      7   

      day_of_week  poutcome  
0               0         1  
1               0         1  
2               4         1  
3  

### 4. Fit Naive Bayes, KNN, and SVM classifiers to the training set, then score each classifier on the test set. Which classifier has the highest accuracy?

In [None]:
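# Gaussian Naive Bayes classifier for bank_data using y_yes as target data
Y = y['yes']
X = onehotlabels
# test_size=0.8 keeps 20% of the rows for training and holds out 80%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_nor = sc.transform(X_train)

# Note: the model is fit on the unscaled training split but scored on the
# standardized training split, so this is a training-set score, not a test score
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_train_nor,y_train)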

0.8857837181044957

In [None]:
# K-nearest neighbor classifier for bank_data using y_yes as target data
Y = y['yes']
X = onehotlabels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_nor = sc.transform(X_train)

# Note: as with GNB, the model is fit on the unscaled training split and
# scored on the standardized training split
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
model.score(X_train_nor,y_train)

0.905224787363305

In [None]:
# SVM classifier for bank_data using y_yes as target data
Y = y['yes']
X = onehotlabels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=0)

# The pipeline standardizes the features itself, so no separate scaler is needed.
# Note: the score is computed on the training split
model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model.fit(X_train, y_train)
model.score(X_train,y_train)

0.9100850546780073

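The SVM classifier has the highest accuracy of the three (0.910, versus 0.905 for KNN and 0.886 for Gaussian Naive Bayes). Note that these scores are computed on the training split rather than on a held-out test set.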
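
For comparison, here is a minimal sketch of scoring the same three classifiers on data they were not trained on. It uses a synthetic imbalanced dataset from make_classification() as a stand-in (an assumption for the example); in the project itself, the training data would come from bank-additional.csv and the test data from bank-additional-full.csv. KNN and SVM are wrapped in pipelines so that the scaler is fit on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with roughly 90% negatives (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

classifiers = {
    'GNB': GaussianNB(),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    'SVM': make_pipeline(StandardScaler(), SVC(gamma='auto')),
}

# Fit on the training split, score on the held-out split.
test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    test_accuracy[name] = clf.score(X_te, y_te)
    print('{} test accuracy = {:.3f}'.format(name, test_accuracy[name]))
```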

### 5. These numbers look pretty good, but let’s take another look at the data. How many values in the training set have y_yes = 0, and how many have y_yes = 1? What would be the accuracy if we simply assumed that no customer ever subscribed to the product?

In [None]:
y_yes = y['yes'].tolist()
print("y_yes = 0 occurs: {}".format(y_yes.count(0)))
print("y_yes = 1 occurs: {}".format(y_yes.count(1)))

# Fit the three classifiers to a target that is 0 everywhere except for one
# flipped label, so the target is not strictly all zeros
Y = np.zeros_like(y['yes'])
Y[0] = 1
X = onehotlabels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_nor = sc.transform(X_train)

print('\nIf we assume no customer ever subscribed to the product then:\n')

# Note: the scores below compare predictions with the near-all-zero target,
# not with the real y_yes labels
model = GaussianNB()
model.fit(X_train, y_train)
print('GNB = {}'.format(model.score(X_train_nor,y_train)))

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print('KNN = {}'.format(model.score(X_train_nor,y_train)))

model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model.fit(X_train, y_train)
print('SVM = {}'.format(model.score(X_train,y_train)))

y_yes = 0 occurs: 3668
y_yes = 1 occurs: 451

If we assume no customer ever subscribed to the product then:

GNB = 0.9987849331713244
KNN = 0.9987849331713244
SVM = 0.9987849331713244


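If we simply assumed that no customer ever subscribed to the product, the accuracy on this data would be 3668 / 4119 ≈ 0.8905, the share of rows with y_yes = 0. The scores printed above (≈ 0.9988) measure something different: how well each classifier reproduces a target that is zero everywhere except for one flipped label.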
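
The accuracy of an “always no” prediction follows directly from the class counts printed above. A short check, using only those two counts:

```python
import numpy as np

# Class counts printed by the cell above.
n_no, n_yes = 3668, 451

# Rebuild a label vector with those counts and an all-zero prediction vector.
labels = np.concatenate([np.zeros(n_no, dtype=int), np.ones(n_yes, dtype=int)])
always_no = np.zeros_like(labels)

# Accuracy is the fraction of rows where the prediction matches the label.
baseline_accuracy = np.mean(always_no == labels)
print('Baseline accuracy = {:.4f}'.format(baseline_accuracy))
```

Any classifier whose accuracy is not clearly above this baseline has learned little beyond the class imbalance.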

### 6. Use np.zeros_like() to create a target vector representing the output of the “dumb” classifier of experiment (5), then create a confusion matrix and find its AUC.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

Y_true = y['yes']
Y_pred = np.zeros_like(y['yes'])

print('Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}'.format(roc_auc_score(Y_true, Y_pred)))

Confusion Matrix:
[[3668    0]
 [ 451    0]]

AUC of confusion matrix: 0.5
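
When roc_auc_score() is given 0/1 predictions instead of continuous scores, the ROC curve has a single interior point, and the area under it reduces to the mean of the true positive rate and the true negative rate. The helper below (a name introduced here for illustration) reproduces the 0.5 above from the confusion matrix alone.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """Area under the ROC curve when the 'scores' are 0/1 predictions."""
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return (tpr + tnr) / 2


# Confusion matrix of the "dumb" classifier above: [[3668, 0], [451, 0]].
# sklearn lays the matrix out as [[tn, fp], [fn, tp]].
dumb_auc = auc_from_hard_labels(tn=3668, fp=0, fn=451, tp=0)
print(dumb_auc)
```

The “dumb” classifier gets every negative right and every positive wrong, so its AUC is (0 + 1) / 2 = 0.5 no matter how imbalanced the classes are.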


### 7. Create a confusion matrix and find the AUC for each of the classifiers of experiment (4). Is the best classifier the one with the highest accuracy?


In [None]:
# Gaussian Naive Bayes classifier for bank_data using y_yes as target data
Y = y['yes']
X = onehotlabels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
# Predict on every row of X (training rows included) so that Y_pred lines up
# with Y_true, which was defined in experiment (6)
Y_pred = model.predict(X)

print('GNB Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}'.format(roc_auc_score(Y_true, Y_pred)))

# K-nearest neighbor classifier for bank_data using y_yes as target data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
Y_pred = model.predict(X)

print('KNN Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}'.format(roc_auc_score(Y_true, Y_pred)))

# SVC classifier for bank_data using y_yes as target data
model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model.fit(X_train, y_train)
Y_pred = model.predict(X)

print('SVC Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}'.format(roc_auc_score(Y_true, Y_pred)))

GNB Confusion Matrix:
[[ 725 2943]
 [  36  415]]

AUC of confusion matrix: 0.5589163908145476
KNN Confusion Matrix:
[[3570   98]
 [ 392   59]]

AUC of confusion matrix: 0.5520514209305868
SVC Confusion Matrix:
[[3664    4]
 [ 422   29]]

AUC of confusion matrix: 0.5316055197827679


No. Here the ranking by AUC is the reverse of the ranking by accuracy: GNB has the highest AUC (0.559) but the lowest accuracy, while SVM has the highest accuracy but the lowest AUC (0.532). All three AUCs are only slightly above the 0.5 of the “dumb” classifier.
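
The AUC values above are computed from hard 0/1 predictions. roc_auc_score() also accepts continuous scores, such as the positive-class probabilities from predict_proba(), in which case the whole ROC curve is used. The sketch below shows both calls side by side on synthetic stand-in data (an assumption for the example, not the bank data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with roughly 90% negatives (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

clf = GaussianNB().fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (what the cells above compute).
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of the positive class.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print('AUC from labels = {:.3f}'.format(auc_labels))
print('AUC from scores = {:.3f}'.format(auc_scores))
```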

### 8. One of the easiest ways to deal with an unbalanced dataset is random oversampling. This can be done with an imblearn.over_sampling.RandomOverSampler object. Use fit_resample() to generate a balanced training set.


In [None]:
from imblearn.over_sampling import RandomOverSampler

# Oversample the minority class with fit_resample.
# Note: the resampling is applied to all rows before the split, so copies of
# the same minority row can land in both the training and held-out splits.
Y = y['yes']
X = onehotlabels
ros = RandomOverSampler(random_state=42)
X_res, Y_res = ros.fit_resample(X, Y)

X_train, X_test, y_train, y_test = train_test_split(X_res, Y_res, test_size=0.8, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_nor = sc.transform(X_train)
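
To make the idea behind random oversampling concrete, the following is a hand-rolled NumPy illustration on a tiny invented dataset. It is a sketch of the general technique, not imblearn's implementation: rows of the minority class are drawn with replacement until both classes have the same count.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced training set: 8 negatives, 2 positives.
X_small = np.arange(20).reshape(10, 2)
y_small = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_small == 1)
majority_idx = np.flatnonzero(y_small == 0)

# Draw minority rows with replacement until both classes have the same count.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
keep = np.concatenate([majority_idx, minority_idx, extra_idx])

X_balanced, y_balanced = X_small[keep], y_small[keep]
print(np.bincount(y_balanced))
```

Because the added rows are exact copies, one common precaution is to resample only the training portion of the data, so that duplicated rows do not also appear in the data used for evaluation.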



### 9. Repeat experiments (4) and (7) on the balanced training set of experiment (8). Which classifier performs the best, and how much better is its performance?

In [None]:
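# Gaussian Naive Bayes classifier
# X_train and y_train are the oversampled training split from experiment (8);
# predictions are made on the original, unbalanced X and compared with Y_true
model = GaussianNB()
model.fit(X_train, y_train)
Y_pred = model.predict(X)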

print('GNB Score = {}'.format(model.score(X_train,y_train)))
print('GNB Confusion Matrix:\n{}'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}\n'.format(roc_auc_score(Y_true, Y_pred)))

# K-nearest neighbor classifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
Y_pred = model.predict(X)

print('KNN Score = {}'.format(model.score(X_train,y_train)))
print('KNN Confusion Matrix:\n{}'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}\n'.format(roc_auc_score(Y_true, Y_pred)))

# SVC classifier
model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model.fit(X_train, y_train)
Y_pred = model.predict(X)

print('SVC Score = {}'.format(model.score(X_train,y_train)))
print('SVC Confusion Matrix:\n{}'.format(confusion_matrix(Y_true, Y_pred)))
print('AUC of confusion matrix: {}\n'.format(roc_auc_score(Y_true, Y_pred)))

GNB Score = 0.6203135650988412
GNB Confusion Matrix:
[[1507 2161]
 [  84  367]]
AUC of confusion matrix: 0.6122989140816362

KNN Score = 0.8841172460804363
KNN Confusion Matrix:
[[2580 1088]
 [  94  357]]
AUC of confusion matrix: 0.7474774341279647

SVC Score = 0.841854124062713
SVC Confusion Matrix:
[[3160  508]
 [ 152  299]]
AUC of confusion matrix: 0.7622380412363656



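The SVC classifier performs best on the newly balanced dataset when judged by AUC (0.762), slightly ahead of KNN (0.747) and well ahead of GNB (0.612). Its AUC rose from 0.532 before oversampling, an improvement of about 0.23. KNN has the highest accuracy score on the balanced training split (0.884, versus 0.842 for SVC).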