## In-Class Practice (SVM

In this exercise, you have the bank data. There are 6 independent variables ('interest_rate', 'credit', 'march', 'may', 'previous' and 'duration'), and the dependent variable ('y'). You are supposed to predict the 'y'. 
Please remove the unnecessary column 'Unnamed: 0' from the dataset and encode the 'y' values ('no' to 0 and 'yes' to 1).

**Tasks**

1. Import dataset (Bank_data.csv)
2. Preprocess the dataset
3. Define the Dep and Indep variables
4. Scall the data
5. Train/Test Split
6. Build the Model using SVM (Linear kernel)
7. Predict the test data
        a. Accuracy, Recall, F1Score and Confission Matrix
8. Import the new dataset (Bank_data_testing.csv) as an testing data.
9. Apply necessary data preprocessing
10. Predict and interprest the results with new dataset
    a. Accuracy, Recall, F1Score and Confission Matrix
    
The results of new dataset should be as follow:
Accuracy: 0.86
Recall: 0.91
f1Score: 0.87

## Importing the libraries

In [1]:
# your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import sklearn
import gradio as gr

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

Double-click __here__ for the solution.

<!-- Your answer is below:

import numpy as np 
import pandas as pd
from sklearn.svm import SVC
-->

## Importing the dataset

In [2]:
# your code here
data=pd.read_csv('Bank_data.csv')

data.head()

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,0,1.334,0.0,1.0,0.0,0.0,117.0,no
1,1,0.767,0.0,0.0,2.0,1.0,274.0,yes
2,2,4.858,0.0,1.0,0.0,0.0,167.0,no
3,3,4.12,0.0,0.0,0.0,0.0,686.0,yes
4,4,4.856,0.0,1.0,0.0,0.0,157.0,no


In [19]:
# change y value
le = LabelEncoder()
df['y'] = le.fit_transform(df['y'])
data.head()

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,0,1.334,0.0,1.0,0.0,0.0,117.0,0
1,1,0.767,0.0,0.0,2.0,1.0,274.0,1
2,2,4.858,0.0,1.0,0.0,0.0,167.0,0
3,3,4.12,0.0,0.0,0.0,0.0,686.0,1
4,4,4.856,0.0,1.0,0.0,0.0,157.0,0


Double-click __here__ for the solution.

<!-- Your answer is below:

df = pd.read_csv("Bank_data.csv")
df.head()

-->

<div id="practice"> 
    <h3>Practice</h3> 
    What is the size of data? 
</div>

In [7]:
# write your code here
data.describe()

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,258.5,2.835776,0.034749,0.266409,0.388031,0.127413,382.177606
std,149.677988,1.876903,0.183321,0.442508,0.814527,0.333758,344.29599
min,0.0,0.635,0.0,0.0,0.0,0.0,9.0
25%,129.25,1.04275,0.0,0.0,0.0,0.0,155.0
50%,258.5,1.466,0.0,0.0,0.0,0.0,266.5
75%,387.75,4.9565,0.0,1.0,0.0,0.0,482.75
max,517.0,4.97,1.0,1.0,5.0,1.0,2653.0


In [9]:
data.shape

(518, 8)

In [10]:
data.columns

Index(['Unnamed: 0', 'interest_rate', 'credit', 'march', 'may', 'previous',
       'duration', 'y'],
      dtype='object')

In [12]:
data.notnull()

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...
513,True,True,True,True,True,True,True,True
514,True,True,True,True,True,True,True,True
515,True,True,True,True,True,True,True,True
516,True,True,True,True,True,True,True,True


In [13]:
data.index

RangeIndex(start=0, stop=518, step=1)

<h2>Define Dep and Indep Variables</h2>

In [20]:
# write your code here
x = df[['interest_rate','credit','march','may','previous','duration']]					

y = df['y']
x

Unnamed: 0,interest_rate,credit,march,may,previous,duration
0,1.334,0.0,1.0,0.0,0.0,117.0
1,0.767,0.0,0.0,2.0,1.0,274.0
2,4.858,0.0,1.0,0.0,0.0,167.0
3,4.120,0.0,0.0,0.0,0.0,686.0
4,4.856,0.0,1.0,0.0,0.0,157.0
...,...,...,...,...,...,...
513,1.334,0.0,1.0,0.0,0.0,204.0
514,0.861,0.0,0.0,2.0,1.0,806.0
515,0.879,0.0,0.0,0.0,0.0,290.0
516,0.877,0.0,0.0,5.0,1.0,473.0


<h2>Pre-processing</h2>

<p> Check the data and make sure the data is clean before making the model!

In [8]:
df=data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518 entries, 0 to 517
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     518 non-null    int64  
 1   interest_rate  518 non-null    float64
 2   credit         518 non-null    float64
 3   march          518 non-null    float64
 4   may            518 non-null    float64
 5   previous       518 non-null    float64
 6   duration       518 non-null    float64
 7   y              518 non-null    object 
dtypes: float64(6), int64(1), object(1)
memory usage: 32.5+ KB


<hr>

<div id="setting_up_SVM">
    <h2>Setting up the SVM</h2>
    Use <b>train/test split</b> to split your dataset (80,20).
</div>

## Splitting the dataset into the Training set and Test set

In [47]:
# your code
x_train, x_test, y_train,y_test = train_test_split(x,y, test_size=0.2, random_state=1)

In [48]:
# scalling 
import statsmodels.api as sm
ss = StandardScaler()

In [59]:
x_train[['interest_rate','credit','march','may','previous','duration']] = \
ss.fit_transform(x_train[['interest_rate','credit','march','may','previous','duration']])

x_test[['interest_rate','credit','march','may','previous','duration']] = \
ss.transform(x_test[['interest_rate','credit','march','may','previous','duration']])


In [70]:
y_train[['y']] = \
ss.fit_transform(y_train[['y']])

# y_test[['y']] = \
# ss.transform(y_test[['y']])

Double-click __here__ for the solution.

<!-- Your answer is below:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

-->

<h3>Practice</h3>
Print the shape of x_train and y_train. Ensure that the dimensions match

In [60]:
# your code
x_train[['interest_rate','credit','march','may','previous','duration']] = \
ss.fit_transform(x_train[['interest_rate','credit','march','may','previous','duration']])

x_test[['interest_rate','credit','march','may','previous','duration']] = \
ss.transform(x_test[['interest_rate','credit','march','may','previous','duration']])
y_pred = results.predict(x_test)


In [71]:
print(x_train)
print(y_train)
# print(y_pred[:10])

     const  interest_rate    credit     march       may  previous  duration
498    1.0      -0.810966 -0.200502  1.632090 -0.492854  -0.39958 -0.160923
92     1.0      -1.030896 -0.200502 -0.612711 -0.492854  -0.39958 -0.432308
233    1.0       1.162000 -0.200502 -0.612711 -0.492854  -0.39958  1.418305
119    1.0       1.161466 -0.200502 -0.612711 -0.492854  -0.39958 -0.195568
284    1.0       0.689043 -0.200502 -0.612711 -0.492854  -0.39958 -0.374566
..     ...            ...       ...       ...       ...       ...       ...
129    1.0      -0.736233 -0.200502 -0.612711 -0.492854  -0.39958 -0.334147
144    1.0       1.105416 -0.200502  1.632090 -0.492854  -0.39958  0.774488
72     1.0       1.162000 -0.200502 -0.612711 -0.492854  -0.39958 -0.204229
235    1.0      -1.022355 -0.200502 -0.612711 -0.492854  -0.39958  2.212250
37     1.0       0.689043 -0.200502 -0.612711 -0.492854  -0.39958 -0.868255

[414 rows x 7 columns]
     const    y
498    1.0  1.0
92     1.0  1.0
233    1.0  1.0


Print the shape of x_test and y_test. Ensure that the dimensions match

In [72]:
# your code
print(x_test)
print(y_test)

     const  interest_rate    credit     march       may  previous  duration
485    1.0      -1.142463 -0.200502 -0.612711 -0.492854  -0.39958  0.884197
273    1.0       1.161466 -0.200502 -0.612711 -0.492854  -0.39958 -0.793191
420    1.0      -1.017551 -0.200502 -0.612711  0.714492   2.50263  0.428040
315    1.0      -0.814169 -0.200502 -0.612711 -0.492854  -0.39958 -0.579548
256    1.0      -1.110968 -0.200502 -0.612711  1.921838   2.50263 -0.414985
..     ...            ...       ...       ...       ...       ...       ...
502    1.0      -0.926269 -0.200502 -0.612711 -0.492854  -0.39958 -0.695031
339    1.0      -1.102427 -0.200502 -0.612711  4.336530   2.50263  0.220171
6      1.0       1.162000 -0.200502 -0.612711 -0.492854  -0.39958 -0.859594
442    1.0       1.110220 -0.200502 -0.612711 -0.492854  -0.39958 -0.438082
11     1.0      -1.074135 -0.200502 -0.612711 -0.492854  -0.39958 -0.839385

[104 rows x 7 columns]
485    1
273    0
420    1
315    1
256    1
      ..
502    0
3

<hr>

<div id="modeling">
    <h2>Modeling</h2>
  
</div>

## Training the SVM model on the Training set

In [47]:
# your code


Double-click __here__ for the solution.

<!-- Your answer is below:

results = SVC(kernel="linear", random_state=0)
results.fit(x_train,y_train)

-->

<hr>

<div id="prediction">
    <h2>Prediction</h2>
    Let's make some <b>predictions</b> on the testing dataset and store it into a variable called <b>y_pred</b>.
</div>

In [73]:
# your code
y_pred = results.predict(x_test)
y_pred

Unnamed: 0,0,1
485,1.0,0.955977
273,1.0,0.104708
420,1.0,1.003790
315,1.0,0.575200
256,1.0,0.878302
...,...,...
502,1.0,0.574862
339,1.0,1.077056
6,1.0,0.090525
442,1.0,0.190973


Double-click __here__ for the solution.

<!-- Your answer is below:

y_pred = results.predict(x_test)
y_pred

-->

You can print out <b>y_pred</b> and <b>y_test</b> if you want to visually compare the prediction to the actual values.

In [74]:
print (y_pred [0:5])
print (y_test [0:5])


       0         1
485  1.0  0.955977
273  1.0  0.104708
420  1.0  1.003790
315  1.0  0.575200
256  1.0  0.878302
485    1
273    0
420    1
315    1
256    1
Name: y, dtype: int64


<hr>

## Predicting the Test set results


<div id="evaluation">
    <h2>Evaluation</h2>
    Next, let's import <b>metrics</b> from sklearn and check the accuracy of our model.
</div>

In [80]:
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
# rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(x_train, y_train)

# Save the Random Forest model to a file
joblib.dump(rf_model, "random_forest_model.pkl")
rf_y_pred = rf_model.predict(x_test)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print(f"Random Forest Accuracy: {rf_accuracy}")
print(classification_report(y_test, rf_y_pred))

ValueError: Classification metrics can't handle a mix of binary and continuous-multioutput targets

In [75]:
# your code here



NameError: name 'metrics' is not defined

Double-click __here__ for the solution.

<!-- Your answer is below:

from sklearn import metrics
import matplotlib.pyplot as plt
print("SVM's Accuracy: ", metrics.accuracy_score(y_test, y_pred))

-->

Apply the rest of Metrics for the Predict the test data <b>a. Accuracy, <b>Recall, <b>F1Score and <b>Confission Matrix

## Import the Testing Data and follow the stpes

8. Import the new dataset (Bank_data_testing.csv) as an testing data.
9. Apply necessary data preprocessing
10. Predict and interprest the results with new dataset a. Accuracy, Recall, F1Score and Confission Matrix
The results of new dataset should be as follow: Accuracy: 0.86 Recall: 0.91 f1Score: 0.87

## Practice 
Can you calculate the accuracy score without sklearn ?

In [None]:
# your code here


Source: IBM, MIT