# CPSC 483 Project 3 - Real-World Data, Preprocessing, and Classifier Performance ##

### Project 3, Fall 2021 (Section 01)

**In this project you will use scikit-learn, which is a higher-level machine learning library that works with NumPy data, and Pandas, a library that makes it easier to manipulate data. You will explore a variety of classification algorithms, and compare their performance on a “real-world” dataset, which will introduce its own set of challenges.**

May god have mercy on our grade.

* Janelle Estabillo estabillojanelle@csu.fullerton.edu
* Benjamin Ahn benahn333@csu.fullerton.edu


### Examining the Dataset

* Importing our Dataset using Pandas
* Note that unlike most CSV files, the separator is actually ';' rather than ','. 


In [1]:
import pandas as pd

# Read the dataset (the file is semicolon-separated)
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Print the shape and preview the first rows
print(df.shape)
df.head()


(41188, 21)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Encoding our Data

* 90% of the data will be the Training Set
* 10% of the data will be the Testing Set
* Per the description in bank-additional-names.txt, the duration “should be discarded if the intention is to have a realistic predictive model.”
* The feature y is the target response; set this aside for use in training and testing, then drop it from your features.
* Let’s take as input features the variables described as “bank client data” in bank-additional-names.txt.

**Encoding the y variable as 1s and 0s**

In [2]:
# Encode the target (response) variable y: 1 for "yes", 0 for "no"

is_yes = df["y"] == "yes"
df.loc[is_yes, "y"] = 1

is_no = df["y"] == "no"
df.loc[is_no, "y"] = 0

y = df['y'].astype(float, errors = 'raise')
y.head(5)

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: y, dtype: float64

**Grabbing Numerical Variables**

In [3]:
num = ["age", "previous"]
df[num].head(5)

Unnamed: 0,age,previous
0,56,0
1,57,0
2,37,0
3,40,0
4,56,0


**Grabbing our Categorical Variables**

In [4]:
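# 'age' is also listed in num above, so the combined frame keeps the raw age
# column alongside its one-hot age_* columns
categorical = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']

# Convert each column to the pandas 'category' dtype
for col in categorical:
    df[col] = df[col].astype('category')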
    
df[categorical].head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan
0,56,housemaid,married,basic.4y,no,no,no
1,57,services,married,high.school,unknown,no,no
2,37,services,married,high.school,no,yes,no
3,40,admin.,married,basic.6y,no,no,no
4,56,services,married,high.school,no,no,yes


**Getting our Dummy Variables**

In [5]:
dummies = pd.get_dummies(df[categorical])
#dummies.head(5)

# Combining dummy variables and numeric variables
df1 = df[num]

df2 = pd.concat([df1, dummies, y], axis=1)
df = df2

df.shape
df.head(5)

Unnamed: 0,age,previous,age_17,age_18,age_19,age_20,age_21,age_22,age_23,age_24,...,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,y
0,56,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0.0
1,57,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0.0
2,37,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,0,0.0
3,40,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0.0
4,56,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0.0


In [6]:
X = df.drop("y", axis = "columns") 
y = df["y"]

from sklearn.model_selection import train_test_split

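# Split into training (90%) and testing (10%) sets.
# Note: (2021-10-25) is integer arithmetic, so the seed is 1986, not a date.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=(2021-10-25))

# Preview the test features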
X_test.head(3)


Unnamed: 0,age,previous,age_17,age_18,age_19,age_20,age_21,age_22,age_23,age_24,...,education_unknown,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes
37215,38,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,0,0
576,41,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
14186,35,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,1


### Fitting a Categorical Naive Bayes Classifier

In [7]:
from sklearn.naive_bayes import CategoricalNB

classifier = CategoricalNB()
classifier.fit(X_train,y_train)

#Predicting Test Set Results
pred = classifier.predict(X_test)
score = classifier.score(X_test, y_test)
print("Test Score:", score)

#Predicting Train Set Results
pred1 = classifier.predict(X_train)
score1 = classifier.score(X_train, y_train)
print("Train Score:", score1)

Test Score: 0.8701141053653799
Train Score: 0.8766084868758262


**Categories of the Age Variable**
* There are 78 categories for age


In [8]:
age_categories = df['age'].unique()
print("There are", len(age_categories), "categories for age")


There are 78 categories for age


**Splitting the Ages into Bins**
* One per decade (10, 20, ...,100)

In [25]:
#One bin per decade
bins = [10,20,30,40,50,60,70,80,90,100]
labels = [1,2,3,4,5,6,7,8,9]
df['age_group'] = pd.cut(df['age'],bins = bins, labels=labels, precision = 0)
age_group = pd.get_dummies(df['age_group'], drop_first=True)

#new version
new_age_data = pd.cut(df['age'],bins = bins, labels=labels, precision = 0)
new_age_data = pd.get_dummies(new_age_data, drop_first=True)

# Print how many rows fall in each age group
#df['age_group'].value_counts(sort = False)
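
To make the bin edges concrete, here is a small sketch on made-up ages (the `sample_ages` values are illustrative, not taken from the dataset). By default `pd.cut` builds right-closed intervals, so with these edges an age of 20 falls in the first bin, (10, 20], and 21 falls in the second.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
sample_ages = pd.Series([17, 20, 21, 30, 98])

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Intervals are right-closed by default: (10, 20], (20, 30], ..., (90, 100]
sample_groups = pd.cut(sample_ages, bins=bins, labels=labels)
print(sample_groups.astype(int).tolist())
```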

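**Retraining our Classifier**
* Does the performance change? **Yes, it did**

Note: the score below shouldn't be .99. As written, the cell uses `age_group` as the target and leaves `age` (and `y`) among the features, so the classifier predicts each row's age bin from its own age, and it is scored on the same rows it was fit on. The result is not comparable to the test score of the first classifier.

In [10]:
#df['age_group'].head()
# Caution: this overwrites X and y, with age_group as the new target
X = df.drop('age_group', axis = "columns")
y = df['age_group']

#print(X_train)
#print(y_train)
#total_train = np.concatenate((X_train, y_train))
#classifier.fit(new_age_data, y)

from sklearn.naive_bayes import CategoricalNB

classifier = CategoricalNB()
classifier.fit(X, y)

# Scored on the same rows used for fitting, so this is not a held-out test score
pred_1 = classifier.predict(X)
score_1 = classifier.score(X, y)
print("Test Score:", score_1)


Test Score: 0.9990774011848111


### Fitting a KNN Classifier

In [11]: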
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

#Predicting Test Set Results
proba_knn = classifier.score(X_test, y_test)
print("KNN Score:", proba_knn)

KNN Score: 0.8640446710366594


**How many values in the test set have response 0, and how many have response 1?**
* Test Results are suspiciously similar
* What would be the score if we simply assumed that no customer ever subscribed to the product?

In [12]:
#How many values in the test set have response 0, and how many have response 1?
len([i for i in y_test if i == 0]), len([i for i in y_test if i == 1])


(3646, 473)
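
The question above can be answered directly from these two counts. A minimal sketch (the variable names are illustrative): always predicting 0 is correct for every response-0 sample and wrong for every response-1 sample.

```python
# Class counts in the test set, taken from the output above
n_zero, n_one = 3646, 473

# Accuracy of always predicting 0 ("no customer ever subscribes")
baseline_accuracy = n_zero / (n_zero + n_one)
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

That is roughly 0.885, which is higher than both the Naive Bayes test score (0.870) and the KNN score (0.864) reported above, so accuracy alone is a weak measure on this dataset.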

### Creating a Confusion Matrix and Finding AUC

**Using numpy.zeros_like() to create a target vector representing the output of the “dumb” classifier**

In [13]:
from sklearn.metrics import confusion_matrix, roc_auc_score
import numpy as np

# "Dumb" classifier: predict 0 (no subscription) for every test sample
pred_0 = np.zeros_like(y_test.to_numpy(), dtype=int)

conf_matrix = confusion_matrix(y_test, pred_0)
print("Confusion Matrix:\n", conf_matrix)

# Note: as written, `classifier` is still the KNN model fit in the previous cell,
# so this AUC describes KNN, not the all-zeros baseline
predictProb_y = classifier.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, predictProb_y)
print("AUC Score:", auc_score)

Confusion Matrix:
 [[3646    0]
 [ 473    0]]
AUC Score: 0.6286303505014038
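
The AUC printed above comes from `classifier.predict_proba`, that is, from a fitted model rather than from the all-zeros vector. For reference, the all-zeros baseline gives every sample the same score, so its ROC AUC is 0.5. A minimal sketch, rebuilding a label vector with the same class counts as the test set (the names `y_true_sketch` and `baseline_scores` are illustrative, and the row order is arbitrary):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Same class counts as the test set: 3646 zeros and 473 ones
y_true_sketch = np.concatenate([np.zeros(3646, dtype=int), np.ones(473, dtype=int)])

# The baseline assigns the same score (0) to every sample
baseline_scores = np.zeros_like(y_true_sketch)

baseline_cm = confusion_matrix(y_true_sketch, baseline_scores)
baseline_auc = roc_auc_score(y_true_sketch, baseline_scores)
print(baseline_cm)
print("Baseline AUC:", baseline_auc)
```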


**Using the data where age is split into bins**

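Note: this should be a 2x2 matrix. It is 9x9 because `y` and `pred_1` come from the cell that fits on `age_group` (In [10]), where the target is the nine-valued age bin rather than the 0/1 response.

In [14]:
conf_matrix = confusion_matrix(y, pred_1)
print("Confusion Matrix:\n", conf_matrix)

# `classifier` is still the KNN model here, so the AUC is the same as in the previous cell
predictProb_y = classifier.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, predictProb_y)
print("AUC Score:", auc_score)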

Confusion Matrix:
 [[  139     0     0     0     0     1     0     0     0]
 [   10  7233     0     0     0     0     0     0     0]
 [   10     0 16371     0     0     0     0     4     0]
 [    4     0     0 10236     0     0     0     0     0]
 [    0     0     0     0  6270     0     0     0     0]
 [    0     0     0     0     0   488     0     0     0]
 [    0     0     0     0     0     0   303     0     0]
 [    0     0     0     0     0     0     0   109     0]
 [    0     0     0     0     0     0     6     3     1]]
AUC Score: 0.6286303505014038


**Using the original values**

In [15]:
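# `pred` holds the test predictions from the first CategoricalNB model
conf_matrix = confusion_matrix(y_test, pred)
print("Confusion Matrix:\n", conf_matrix)

# As written, `classifier` is the KNN model, so this AUC is for KNN, not Naive Bayes
predictProb_y = classifier.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, predictProb_y)
print("AUC Score:", auc_score)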

Confusion Matrix:
 [[3482  164]
 [ 371  102]]
AUC Score: 0.6286303505014038
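
The confusion matrix says more than the accuracy does. A short sketch that derives accuracy, precision, and recall from the four counts printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix printed above (rows: true 0/1, columns: predicted 0/1)
cm = np.array([[3482, 164],
               [371, 102]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted subscribers, the fraction that are correct
recall = tp / (tp + fn)      # of the actual subscribers, the fraction that are found
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy reproduces the 0.8701 test score of the first classifier, while the recall shows that only about 22% of the actual subscribers are identified.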


### Balancing our Data
* We're dealing with an imbalanced dataset
* Use pandas.DataFrame.where() and pandas.DataFrame.sample() to generate balanced training sets by weighting the values with response 1 more heavily.
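
One way the weighting approach described above could look, as a minimal sketch on toy data. The `train` frame, its `feature` column, and the 900/100 class split are hypothetical stand-ins for the real training set, and the sketch uses `Series.where()`, the one-column counterpart of `DataFrame.where()`.

```python
import pandas as pd

# Hypothetical stand-in for the training data: 900 rows with y=0, 100 with y=1
train = pd.DataFrame({"feature": range(1000), "y": [0.0] * 900 + [1.0] * 100})

n_pos = int((train["y"] == 1).sum())
n_neg = len(train) - n_pos

# Weight 1 for y=0 rows; where() swaps in the class ratio for y=1 rows
weights = pd.Series(1.0, index=train.index).where(train["y"] == 0, n_neg / n_pos)

# Draw with replacement so minority rows can be picked more than once
balanced = train.sample(n=len(train), replace=True, weights=weights, random_state=0)
print(balanced["y"].value_counts())
```

With these weights each class carries the same total sampling weight, so the resampled frame comes out roughly half response 0 and half response 1. Only the training split would be resampled this way; the test split keeps its original class proportions.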

In [16]:
# Oversample the minority class with imblearn's RandomOverSampler
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import MinMaxScaler

# Caution: make_classification generates a synthetic dataset (99% class 0),
# so X_train_new and y_train_new are not drawn from the bank data
X_train_new, y_train_new = make_classification(n_samples=41888, weights=[0.99], flip_y=0)
# y currently holds the age_group labels, so this prints age-group counts
print(Counter(y))

over_sampler = RandomOverSampler(sampling_strategy='minority', random_state=(2021-10-25))
X_new, y_new = over_sampler.fit_resample(X_train_new, y_train_new)
print(Counter(y_new))

# Scale the features to [0, 1]
scaler = MinMaxScaler()
X_new = scaler.fit_transform(X_new)
# y_new is 1-D, so it has to be reshaped into a column before scaling
# (previously: y_new = scaler.fit_transform(y_new))
y_new1 = scaler.fit_transform(y_new.reshape(-1,1))

print(f"Statistics: {Counter(y_new)}")

Counter({3: 16385, 4: 10240, 2: 7243, 5: 6270, 6: 488, 7: 303, 1: 140, 8: 109, 9: 10})
Counter({0: 41469, 1: 41469})
Statistics: Counter({0: 41469, 1: 41469})


**Retraining the balanced dataset and fitting a Categorical Naive Bayes Classifier**

In [17]:
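# Note: CategoricalNB expects non-negative integer category codes; X_new holds
# MinMax-scaled floats, which may explain the chance-level score below
classifier = CategoricalNB()
classifier.fit(X_new, y_new)

# Scored on the same resampled data used for fitting, not a held-out test set
pred_y_new = classifier.predict(X_new)
score = classifier.score(X_new, y_new)
print("Test Score:", score)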

conf_matrix = confusion_matrix(y_new, pred_y_new)
print("Confusion Matrix:\n", conf_matrix)

predictProb_y = classifier.predict_proba(X_new)[:,1]
auc = roc_auc_score(y_new, predictProb_y)
print("AUC Score:", auc)

Test Score: 0.500180857990306
Confusion Matrix:
 [[   15 41454]
 [    0 41469]]
AUC Score: 0.500180857990306


**Retraining the balanced dataset and fitting a KNN Classifier**

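Note: the AUC and KNN score should be lower. As written, the model is scored on the same oversampled rows it was fit on, and every duplicated minority row has identical copies among its nearest neighbors, so these numbers are not a held-out estimate.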

In [20]:
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_new, y_new)

#Predicting Test Set Results
pred_knn = classifier.predict(X_new)
proba_knn = classifier.score(X_new, y_new)
print("KNN Score:", proba_knn)

conf_matrix = confusion_matrix(y_new, pred_knn)
print("Confusion Matrix:\n", conf_matrix)

predictProb_y = classifier.predict_proba(X_new)[:,1]
auc = roc_auc_score(y_new, predictProb_y)
print("AUC Score:", auc)

KNN Score: 0.9987942800646266
Confusion Matrix:
 [[41369   100]
 [    0 41469]]
AUC Score: 1.0
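
Why scoring KNN on duplicated training rows inflates the result can be shown on pure noise. This is a minimal sketch under stated assumptions: the data is random (nothing here comes from the bank dataset), the names are illustrative, and plain row duplication stands in for random oversampling.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Pure-noise features and labels: there is nothing real to learn
X_noise = rng.normal(size=(600, 5))
y_noise = rng.integers(0, 2, size=600)

# Duplicate every row once, mimicking what oversampling does to minority rows
X_dup = np.vstack([X_noise, X_noise])
y_dup = np.concatenate([y_noise, y_noise])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_dup, y_dup)

# Each fitted row finds itself and its copy among its 3 nearest neighbors
fitted_score = knn.score(X_dup, y_dup)

# Fresh noise the model has never seen
X_fresh = rng.normal(size=(400, 5))
y_fresh = rng.integers(0, 2, size=400)
fresh_score = knn.score(X_fresh, y_fresh)

print("Score on fitted rows:", fitted_score)
print("Score on fresh rows:", fresh_score)
```

The score on the fitted rows is near-perfect even though the labels are random, while the score on fresh rows stays close to chance. Scoring on the original, unresampled test split avoids this effect.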


### Gaussian Naive Bayes 
* Using the input variables described as “social and economic context attributes” in bank-additional-names.txt.

**Grabbing our Variables**

In [5]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score

# Keep only the "social and economic context attributes" plus the target y
GNB_df = pd.read_csv('bank-additional-full.csv', sep=';')
GNB_df = GNB_df[["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed", "y"]]

GNB_X = GNB_df.drop("y", axis = "columns")
GNB_y = GNB_df["y"]

GNB_X_train, GNB_X_test, GNB_y_train, GNB_y_test = train_test_split(GNB_X, GNB_y, test_size = 0.1, random_state=(2021-10-25))

GNB_classifier = GaussianNB()
GNB_classifier.fit(GNB_X_train, GNB_y_train)

GNB_train_pred = GNB_classifier.predict(GNB_X_train)
GNB_train_score = GNB_classifier.score(GNB_X_train, GNB_y_train)
print("GNB Train Score:", GNB_train_score)

GNB_test_pred = GNB_classifier.predict(GNB_X_test)
GNB_test_score = GNB_classifier.score(GNB_X_test, GNB_y_test)
print("GNB Test Score:", GNB_test_score)
#print(X_train)

GNB_conf_matrix = confusion_matrix(GNB_y_test, GNB_test_pred)
print("GNB Confusion Matrix:\n", GNB_conf_matrix)

GNB_predictProb_y = GNB_classifier.predict_proba(GNB_X_test)[:,1]
GNB_auc = roc_auc_score(GNB_y_test, GNB_predictProb_y)
print("GNB AUC Score:", GNB_auc)

GNB Train Score: 0.7199816558310178
GNB Test Score: 0.7193493566399611
GNB Confusion Matrix:
 [[2634 1012]
 [ 144  329]]
GNB AUC Score: 0.7348436526924581


**Gaussian Naive Bayes**
* Computing the score, confusion matrix and AUC
* Do the results of the last experiment change if the training set is balanced? **Yes, they do**