# Classification

The data set for this exercise comes from the UCI Machine Learning Repository. It describes the clients of a Portuguese bank and the bank’s marketing efforts for a "deposit" subscription. Each row pertains to one client, and there are 9,280 clients in total. Your goal is to predict whether a client opened a deposit account (the `DEPOSIT` column, coded as 1) or not (coded as 0). This matters because it helps allocate resources for the marketing campaign.
<br><br>
Check the descriptions of the variables in the Data Dictionary file. 

## Goal

Use the **bank_deposit.csv** data set and build a model to predict **DEPOSIT**. 

# Setup

In [3]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

# Get the data

In [4]:
# Import the data set:
bank = pd.read_csv("bank_deposit.csv")
bank.head()

Unnamed: 0,AGE,MARRIED,EDUCATION,DEFAULT,HOUSING,LOAN,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NR_EMPLOYED,DEPOSIT
0,35,0,7,0,1,0,-1.8,92.893,-46.2,1.266,5099.1,0
1,42,1,6,0,0,0,1.1,93.994,-36.4,4.857,5191.0,1
2,36,1,7,0,0,0,1.4,93.444,-36.1,4.965,5228.1,1
3,37,1,5,0,1,1,1.4,93.918,-42.7,4.963,5228.1,1
4,31,0,7,0,1,0,-1.8,93.075,-47.1,1.365,5099.1,0


# Split the data into train and test

In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(bank, test_size=0.3)
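
As a quick sanity check on the split sizes, the sketch below applies the same `test_size=0.3` to a stand-in array of 9,280 row indices. The names `demo_rows`, `demo_train`, and `demo_test` are illustrative, and passing `random_state=42` is an assumption made here for a self-contained example; the notebook itself relies on `np.random.seed(42)`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 9,280 client rows (the real data is not needed to check sizes)
demo_rows = np.arange(9280)

# random_state is an illustrative assumption for reproducibility in this sketch
demo_train, demo_test = train_test_split(demo_rows, test_size=0.3, random_state=42)

# Number of rows that land in each part
print(len(demo_train), len(demo_test))
```

The two parts together should cover every row exactly once, with roughly 70% in train and 30% in test.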

## Check the missing values

In [6]:
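# A cell displays only its last expression, so the output below is for the test set.
# Wrap each call in print() to see both results.
train.isna().sum()
test.isna().sum()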

AGE               0
MARRIED           0
EDUCATION         0
DEFAULT           0
HOUSING           0
LOAN              0
EMP_VAR_RATE      0
CONS_PRICE_IDX    0
CONS_CONF_IDX     0
EURIBOR3M         0
NR_EMPLOYED       0
DEPOSIT           0
dtype: int64

# Data Prep

In [7]:
# Imports:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

## Separate the target variable (don't transform the target)

In [8]:
# Note: I recommend selecting the target with single brackets (such as ['DEPOSIT']) rather than
# double brackets (such as [['DEPOSIT']]). Single brackets return a "Series"; double brackets return a "DataFrame".
# With a DataFrame, you might have to make adjustments further down.
train_target = train['DEPOSIT']
test_target = test['DEPOSIT']

train_inputs = train.drop(['DEPOSIT'], axis=1)
test_inputs = test.drop(['DEPOSIT'], axis=1)
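
The difference between single and double brackets mentioned in the comment above can be seen on a small toy frame. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data, not taken from bank_deposit.csv
toy = pd.DataFrame({"DEPOSIT": [0, 1, 1], "AGE": [30, 41, 52]})

as_series = toy["DEPOSIT"]    # single brackets: 1-D Series
as_frame = toy[["DEPOSIT"]]   # double brackets: 2-D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Most scikit-learn estimators expect a 1-D target, which is why the Series form is the more convenient choice here.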

##  Identify the numerical and categorical columns

In [9]:
train_inputs.dtypes

AGE                 int64
MARRIED             int64
EDUCATION           int64
DEFAULT             int64
HOUSING             int64
LOAN                int64
EMP_VAR_RATE      float64
CONS_PRICE_IDX    float64
CONS_CONF_IDX     float64
EURIBOR3M         float64
NR_EMPLOYED       float64
dtype: object

In [10]:
# Identify the numerical columns
numeric_columns = train_inputs[['AGE','CONS_CONF_IDX','CONS_PRICE_IDX','EMP_VAR_RATE','EURIBOR3M','NR_EMPLOYED']].columns
numeric_columns

Index(['AGE', 'CONS_CONF_IDX', 'CONS_PRICE_IDX', 'EMP_VAR_RATE', 'EURIBOR3M',
       'NR_EMPLOYED'],
      dtype='object')

In [11]:
# Identify the categorical columns
categorical_columns = train_inputs[['DEFAULT','EDUCATION','HOUSING','LOAN','MARRIED']].columns
categorical_columns

Index(['DEFAULT', 'EDUCATION', 'HOUSING', 'LOAN', 'MARRIED'], dtype='object')

# Pipeline (recommended)

If you don't want to use pipelines, feel free to use your own data prep steps.

In [12]:
# Numeric transformer:
numeric_transformer = Pipeline(steps=[
                #('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])
#do not need to impute as there are no missing values

In [13]:
# Categorical transformer:
categorical_transformer = Pipeline(steps=[
    #('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
#do not need to impute as there are no missing values

In [14]:
# Column transformer:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)])

# Transform: fit_transform() for TRAIN

In [96]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)
train_x

array([[ 0.63002931, -1.9998282 ,  2.02471988, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.71341366, -1.13621117, -0.93823142, ...,  0.        ,
         0.        ,  1.        ],
       [-0.28719848,  0.70366857,  0.802542  , ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.78750455, -1.30517972, -0.65047414, ...,  0.        ,
         0.        ,  1.        ],
       [-1.78811669,  1.88644841, -1.3240159 , ...,  0.        ,
         1.        ,  0.        ],
       [ 0.9635667 , -1.30517972, -0.65047414, ...,  0.        ,
         0.        ,  1.        ]])

In [97]:
train_x.shape

(6496, 22)
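
The transformed array has more columns than the original inputs because each categorical column is expanded into one column per category. The sketch below shows the same effect on a toy frame; the data, the column subset, and the name `toy_preprocessor` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column and two categorical columns
toy = pd.DataFrame({
    "AGE": [25, 40, 58, 33],
    "EDUCATION": [1, 3, 3, 7],   # 3 distinct values
    "HOUSING": [0, 1, 1, 0],     # 2 distinct values
})

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["AGE"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["EDUCATION", "HOUSING"]),
])

toy_x = toy_preprocessor.fit_transform(toy)

# Number of categories learned for each categorical column
n_categories = [len(c) for c in toy_preprocessor.named_transformers_["cat"].categories_]

# Output width = numeric columns + total number of one-hot categories
print(toy_x.shape, n_categories)
```

Inspecting `categories_` on the fitted encoder in the same way would show how the categorical columns of the real data contribute to the 22 output columns.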

# Transform: transform() for TEST

In [98]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.88018235, -0.31014272,  1.54723253, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.46326062,  0.70366857,  0.802542  , ...,  0.        ,
         0.        ,  1.        ],
       [-0.53735152, -0.31014272,  1.54723253, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.37987628,  1.2293485 , -1.60861101, ...,  0.        ,
         0.        ,  1.        ],
       [-1.12104193, -0.47911127,  0.68237962, ...,  0.        ,
         1.        ,  0.        ],
       [-1.12104193, -1.13621117, -0.93823142, ...,  0.        ,
         1.        ,  0.        ]])

In [99]:
test_x.shape

(2784, 22)
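
The reason for calling `fit_transform()` on train but only `transform()` on test is that the scaling statistics should come from the training data alone. A minimal sketch with made-up numbers (`toy_train` and `toy_test` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(toy_train)                     # learns mean and std from train only
scaled_test = scaler.transform(toy_test)  # reuses the train statistics

print(scaler.mean_, scaler.scale_)
print(scaled_test.ravel())
```

The scaled test values are not centered on zero, which is expected: the test set is measured against the training distribution rather than its own.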

# Prepare Target

In [100]:
#train_target = train_target.values.reshape(-1, 1)
#test_target = test_target.values.reshape(-1, 1)

In [101]:
#from sklearn.preprocessing import OrdinalEncoder

#ord_enc = OrdinalEncoder()

#train_y = ord_enc.fit_transform(train_target)

#train_y

In [102]:
#train_y.ravel()

In [103]:
#test_y = ord_enc.transform(test_target)

#test_y

In [105]:
#target does not need to be transformed because it is already coded as 0/1
train_y = train_target
test_y = test_target

# Random Forest Model

In [106]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

#train_y is already a 1-D Series, so it can be passed to fit() directly (no .ravel() needed).
forest_clf.fit(train_x, train_y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

## Accuracy

In [107]:
from sklearn.metrics import accuracy_score

In [108]:
#Train accuracy
train_y_pred = forest_clf.predict(train_x)
train_acc = accuracy_score(train_y, train_y_pred)
print(f'Train acc: {train_acc}')

Train acc: 0.9789100985221675


In [109]:
#Test accuracy
test_y_pred = forest_clf.predict(test_x)
test_acc = accuracy_score(test_y, test_y_pred)
print(f'Test acc: {test_acc}')

Test acc: 0.7061781609195402


Training accuracy (about 97.9%) is far higher than test accuracy (about 70.6%), which indicates that the random forest is overfitting the training data.
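
One common way to probe and reduce this kind of overfitting is to constrain the trees and estimate accuracy with cross-validation. The sketch below is not part of the original exercise: it uses synthetic data from `make_classification`, and the values for `max_depth` and `min_samples_leaf` are arbitrary assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the bank data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Shallower trees with larger leaves are less able to memorize the training set
shallow_forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_leaf=10, random_state=42
)

# 5-fold cross-validated accuracy: each fold is scored on data the model did not see
cv_scores = cross_val_score(shallow_forest, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean())
```

On the real data, the same pattern could be applied to `train_x` and `train_y` to compare different settings without touching the test set.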

## Confusion Matrix

In [115]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y, test_y_pred)

array([[1016,  366],
       [ 452,  950]], dtype=int64)
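
The test accuracy reported earlier can be recovered from this matrix: rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions. A minimal sketch using the counts shown above (`cm_forest` is an illustrative name):

```python
import numpy as np

# Counts copied from the random forest confusion matrix above
cm_forest = np.array([[1016, 366],
                      [452, 950]])

# scikit-learn layout for binary labels 0/1: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_forest.ravel()

# Accuracy = correct predictions / all predictions
accuracy_forest = (tn + tp) / cm_forest.sum()
print(accuracy_forest)
```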

# Baseline Accuracy

In [112]:
# Find the majority class:
train_y.value_counts()

0    3258
1    3238
Name: DEPOSIT, dtype: int64

In [113]:
#Find the percentage of the majority class:
train_y.value_counts()/len(train_y)

0    0.501539
1    0.498461
Name: DEPOSIT, dtype: float64

Always predicting the majority class (0) would yield only about 50.2% accuracy on the training set, so this is the baseline a useful model needs to beat.
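
The same baseline can be expressed with scikit-learn's `DummyClassifier`. The sketch below rebuilds label arrays from the class counts reported in this notebook (3258/3238 in train, 1382/1402 in test); the placeholder feature arrays and the `*_demo` names are assumptions, since a majority-class baseline ignores the inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported in the notebook
y_train_demo = np.array([0] * 3258 + [1] * 3238)
y_test_demo = np.array([0] * 1382 + [1] * 1402)

# Placeholder features: the majority-class baseline never looks at them
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

train_baseline = baseline.score(X_train_demo, y_train_demo)
test_baseline = baseline.score(X_test_demo, y_test_demo)
print(train_baseline, test_baseline)
```

Because class 1 is slightly more common in the test set, the train-derived majority class scores just under 50% there.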

# Stochastic Gradient Descent Classifier

In [114]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100, tol=1e-3, random_state=42)

sgd_clf.fit(train_x, train_y)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=100, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

## Calculate the Accuracy


In [116]:
#Train accuracy
train_y_pred = sgd_clf.predict(train_x)
train_acc = accuracy_score(train_y, train_y_pred)
print(f'Train acc: {train_acc}')

Train acc: 0.7164408866995073


In [117]:
#Test accuracy
test_y_pred = sgd_clf.predict(test_x)
test_acc = accuracy_score(test_y, test_y_pred)
print(f'Test acc: {test_acc}')

Test acc: 0.711566091954023


Training accuracy (about 71.6%) and test accuracy (about 71.2%) are much closer than for the random forest, indicating negligible overfitting.

## Confusion Matrix

In [118]:
confusion_matrix(test_y, test_y_pred)

array([[ 975,  407],
       [ 396, 1006]], dtype=int64)

# Classification Report

In [119]:
from sklearn.metrics import classification_report
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.71      0.71      0.71      1382
           1       0.71      0.72      0.71      1402

    accuracy                           0.71      2784
   macro avg       0.71      0.71      0.71      2784
weighted avg       0.71      0.71      0.71      2784



## Precision

In [120]:
from sklearn.metrics import precision_score
precision_score(test_y, test_y_pred)

0.7119603680113235

## Recall

In [121]:
from sklearn.metrics import recall_score
recall_score(test_y, test_y_pred)

0.717546362339515

## F1 Score


In [122]:
from sklearn.metrics import f1_score
f1_score(test_y, test_y_pred)

0.7147424511545294
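
Precision, recall, and F1 for the positive class can all be derived by hand from the SGD confusion matrix shown earlier. This is a minimal sketch using those counts; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the SGD confusion matrix: [[TN, FP], [FN, TP]]
cm_sgd = np.array([[975, 407],
                   [396, 1006]])
tn, fp, fn, tp = cm_sgd.ravel()

# Precision: of the clients predicted as 1, how many really are 1
precision = tp / (tp + fp)

# Recall: of the clients who really are 1, how many were predicted as 1
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

These values agree with the outputs of `precision_score`, `recall_score`, and `f1_score` above, which by default report the metric for the class labeled 1.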