**<h1>Python Machine Learning Project</h1>**

<h2>Created by Brian Chairez<h2>

This project will be utilizing <b><a href="https://pandas.pydata.org/">pandas</a></b>, a software library used for data manipulation and analysis, and <b><a href="https://scikit-learn.org/stable/">scikit-learn</a></b>, a software library that contains various classification, regression, and clustering algorithms for predictive data analysis.

**<h3><u>Part 1 - Reading Data</u></h3>**

Using the <b>pandas.read_csv()</b> method, the provided csv files will be read in and loaded into dataframes. 
Dataframes are a type of data structure similar to a 2-Dimensional table of rows and columns that can be filled with data of various types. 
It is similar in concept to a spreadsheet as it is flexible and can be utilized to store and work with data.

The <i>"bank.csv"</i> file will be the training dataframe while <i>"bank-full.csv"</i> will be the testing dataframe.

In [51]:
import pandas as pd

trainDF = pd.read_csv('bank.csv', sep=';')
testDF = pd.read_csv('bank-full.csv', sep=';')

**<h3><u>Part 2 - Data Preprocessing</u></h3>**

Some of the features are categorical variables and will need to be turned into numbers using the <b>pandas.get_dummies()</b> method passing in <i>drop_first=True</i>.

In [52]:
trainDF = pd.get_dummies(trainDF, drop_first=True)
testDF = pd.get_dummies(testDF, drop_first=True)

The ['duration'] and ['y_yes'] feature will need to be droppped from both training and testing dataframes however the ['y_yes'] will become the target.

In [53]:
trainTarget = trainDF['y_yes']
trainDF = trainDF.drop(columns=['duration', 'y_yes'])

testTarget = testDF['y_yes']
testDF = testDF.drop(columns=['duration', 'y_yes'])

Non-categorical features must be standardized in order to utilize K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers. 

The specific features to be standardized are:
    <ul>
        <li>age</li> 
        <li>campaign</li> 
        <li>pdays</li>
        <li>previous</li>
        <li>emp.var.rate</li>
        <li>cons.price.idx</li>
        <li>cons.conf.idx</li>
        <li>euribor3m</li>
        <li>nr.employed</li>
    </ul>


This is done by subtracting the initial value with the mean and then dividing that result by the standard deviation of the respective feature:

x' = ( x<sub>n</sub> - x̅ ) / σ
    

In [54]:
# Training data standardization
for index, row in trainDF.iterrows():
  row['age'] = (row['age'] - trainDF['age'].mean())/trainDF['age'].std()
  row['campaign'] = (row['campaign'] - trainDF['campaign'].mean())/trainDF['campaign'].std()
  row['pdays'] = (row['pdays'] - trainDF['pdays'].mean())/trainDF['pdays'].std()
  row['previous'] = (row['previous'] - trainDF['previous'].mean())/trainDF['previous'].std()
  row['emp.var.rate'] = (row['emp.var.rate'] - trainDF['emp.var.rate'].mean())/trainDF['emp.var.rate'].std()
  row['cons.price.idx'] = (row['cons.price.idx'] - trainDF['cons.price.idx'].mean())/trainDF['cons.price.idx'].std()
  row['cons.conf.idx'] = (row['cons.conf.idx'] - trainDF['cons.conf.idx'].mean())/trainDF['cons.conf.idx'].std()
  row['euribor3m'] = (row['euribor3m'] - trainDF['euribor3m'].mean())/trainDF['euribor3m'].std()
  row['nr.employed'] = (row['nr.employed'] - trainDF['nr.employed'].mean())/trainDF['nr.employed'].std()

# Test data standardization
for index, row in testDF.iterrows():
  row['age'] = (row['age'] - testDF['age'].mean())/testDF['age'].std()
  row['campaign'] = (row['campaign'] - testDF['campaign'].mean())/testDF['campaign'].std()
  row['pdays'] = (row['pdays'] - testDF['pdays'].mean())/testDF['pdays'].std()
  row['previous'] = (row['previous'] - testDF['previous'].mean())/testDF['previous'].std()
  row['emp.var.rate'] = (row['emp.var.rate'] - testDF['emp.var.rate'].mean())/testDF['emp.var.rate'].std()
  row['cons.price.idx'] = (row['cons.price.idx'] - testDF['cons.price.idx'].mean())/testDF['cons.price.idx'].std()
  row['cons.conf.idx'] = (row['cons.conf.idx'] - testDF['cons.conf.idx'].mean())/testDF['cons.conf.idx'].std()
  row['euribor3m'] = (row['euribor3m'] - testDF['euribor3m'].mean())/testDF['euribor3m'].std()
  row['nr.employed'] = (row['nr.employed'] - testDF['nr.employed'].mean())/testDF['nr.employed'].std()

**<h3><u>Part 3 - Model Fitting</u></h3>**

The <i>Guassian Naive Bayes</i>, <i>K-Nearest Neighbor (KNN)</i>, and <i>Support Vector Machine (SVM)</i> are the machine learning models this project will be utilizing from the scikit-learn library.

In [55]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

The training data will be fit through instances of Naive Bayes, KNN, and SVM models by passing in the train dataframe and the target variable dataframe so each respective model can learn. 
This will be useful as once the models are trained, they could each make predictions given the test dataframe which can be used to score the model.

In [56]:
# Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(trainDF, trainTarget)
print('Gaussian Naive Bayes Score:', end=' ')
print(gnb.score(testDF, testTarget))

# K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(trainDF, trainTarget)
print('K-Nearest Neighbors Score:', end=' ')
print(knn.score(testDF, testTarget))

# Support Vector Machine
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(trainDF, trainTarget)
print('Support Vector Machine Score:', end=' ')
print(svc.score(testDF, testTarget))

Gaussian Naive Bayes Score: 0.8545450131106147
K-Nearest Neighbors Score: 0.8871030397203069
Support Vector Machine Score: 0.8982470622511411


**<h3><u>Part 4 - Model Analysis</u></h3>**

The trained models can now be used to predict how it would map new inputs to their labels. This can be done using the <b>predict()</b> method from each respective model instance and passing in the test samples as the parameter.

In [57]:
gnbPrediction = gnb.predict(testDF)

knnPrediction = knn.predict(testDF)

svmPrediction = svc.predict(testDF)

The <b>sklearn.metrics.confusion_matrix()</b> method can create a confusion matrix which is used to evaluate the accuracy of a classification by passing in the target values and the estimated targets as parameters.

What the confusion matrix does is report the number of <i>true positives</i>, <i>false negatives</i>, <i>false positives</i>, and <i>true negatives</i> as such: 
<table>
  <tr>
    <td>True Positive</td>
    <td>False Negative</td>
  </tr>
  <tr>
    <td>False Positive</td>
    <td>True Negative</td>
  </tr>
</table>

In [58]:
from sklearn.metrics import confusion_matrix

print('Gaussian Naive Bayes Confusion Matrix: ')
print(confusion_matrix(testTarget, gnbPrediction))
print('\n')

print('K-Nearest Neighbors Confusion Matrix: ')
print(confusion_matrix(testTarget, knnPrediction))
print('\n')

print('Support Vector Machine Confusion Matrix: ')
print(confusion_matrix(testTarget, svmPrediction))
print('\n')

Gaussian Naive Bayes Confusion Matrix: 
[[33020  3528]
 [ 2463  2177]]


K-Nearest Neighbors Confusion Matrix: 
[[35274  1274]
 [ 3376  1264]]


Support Vector Machine Confusion Matrix: 
[[36197   351]
 [ 3840   800]]


