# Banking Classification With Logistic Regression

## About Dataset

A Portuguese bank has seen its revenue decline and wants to know what action to take. An investigation found the root cause: its customers are not investing enough in long-term deposits. The bank therefore wants to identify existing customers who are more likely to subscribe to a long-term deposit and focus its marketing efforts on them.

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (**yes**) or not (**no**).

This dataset contains two main files:

1. **train.csv**: This file contains **32950** rows with **16** features, including the target feature. The data spans May 2008 to November 2010.
2. **test.csv**: This file contains **8238** rows with **13** features, excluding the target feature. The test data has already undergone preprocessing.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/rashmiranu/banking-dataset-classification/data

## Data Dictionary

* **age**: Numeric. The person's age.
* **job**: Categorical. Type of job ('admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown').
* **marital**: Categorical. Marital status ('divorced','married','single','unknown'; note: 'divorced' means divorced or widowed).
* **education**: Categorical. Education level ('basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown').
* **default**: Categorical. Whether the person has credit in default ('no','yes','unknown').
* **housing**: Categorical. Whether the person has a housing loan.
* **loan**: Categorical. Whether the person has a personal loan.
* **contact**: Categorical. Contact communication type ('cellular','telephone').
* **month**: Categorical (ordinal). Month of the last contact with the person ('jan', 'feb', 'mar', …, 'nov', 'dec').
* **day_of_week**: Categorical (ordinal). Day of the week of the last contact with the person ('mon','tue','wed','thu','fri').
* **duration**: Numeric. Duration of the last contact, in seconds.
* **campaign**: Numeric. Number of contacts performed during this campaign for this client (includes the last contact).
* **pdays**: Numeric. Number of days since the client was last contacted in a previous campaign (999 means the client was not previously contacted).
* **previous**: Numeric. Number of contacts performed before this campaign for this client.
* **poutcome**: Categorical. Outcome of the previous marketing campaign ('failure','nonexistent','success').
* **y**: Target (binary). Whether the client has subscribed to a term deposit ('yes','no').
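
The notebook below loads an already-encoded file, so the encoding step itself is not shown. As a rough sketch only, the categorical columns above could be turned into indicator columns such as `contact_telephone` and `poutcome_success` with `pandas.get_dummies`. The toy rows, the choice of `drop_first=True`, and the 0/1 mapping of `y` are assumptions made for illustration, not the author's confirmed preprocessing.

```python
import pandas as pd

# Toy rows (made up) using column names and category values from the data dictionary.
raw = pd.DataFrame({
    "age": [49, 37, 78],
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["nonexistent", "failure", "success"],
    "y": ["no", "no", "yes"],
})

# Map the binary target to 0/1 (assumed mapping).
raw["y"] = raw["y"].map({"no": 0, "yes": 1})

# One-hot encode the categorical columns. drop_first=True drops the first
# (alphabetically sorted) level of each column, so 'cellular' and 'failure'
# become the implicit baseline levels.
encoded = pd.get_dummies(
    raw, columns=["contact", "poutcome"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Dropping one level per column avoids perfectly redundant indicator columns, which is usually preferable for a linear model such as logistic regression.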

## Problem Statements

1. **Model Building**: The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
2. **Model Evaluation**: Evaluate accuracy, precision, recall, F1, and ROC-AUC.
3. **Hyperparameter Tuning**: Tune the hyperparameters of the logistic regression model to find the best-performing configuration.

### Load Necessary Libraries

In [117]:
# General Libraries
import pandas as pd
import numpy as np
import warnings

# Preprocessing Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Model and Evaluation Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [118]:
warnings.filterwarnings("ignore")

In [80]:
csv_path = "train_encoded.csv"
# csv_path = "train_selected.csv"
df = pd.read_csv(csv_path)

In [81]:
# Show the first 5 rows
df.head()

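| Unnamed: 0 | index | age | month | day_of_week | duration | campaign | y | job_blue-collar | job_entrepreneur | job_housemaid | ... | education_unknown | default_unknown | default_yes | housing_unknown | housing_yes | loan_unknown | loan_yes | contact_telephone | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | 11 | 3 | 227 | 4 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 37 | 11 | 3 | 202 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2 | 78 | 7 | 1 | 1148 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3 | 36 | 5 | 1 | 120 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 4 | 59 | 6 | 2 | 368 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |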
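
The preview includes two columns, `Unnamed: 0` and `index`, that simply count rows in the five rows shown. If they are plain row identifiers left over from an earlier export (an assumption; only five rows are visible here), they carry no information about the client and could be removed before modelling. A minimal sketch on a made-up frame with the same leading columns:

```python
import pandas as pd

# Toy frame (made up) mimicking the first columns of the preview.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "index": [0, 1, 2],
    "age": [49, 37, 78],
    "duration": [227, 202, 1148],
    "y": [0, 0, 1],
})

# Columns assumed to be row identifiers rather than real features.
id_columns = ["Unnamed: 0", "index"]

# errors="ignore" keeps the call safe if a column is already absent.
features = toy.drop(columns=id_columns + ["y"], errors="ignore")
target = toy["y"]

print(features.columns.tolist())
```

Leaving such counters in the feature matrix lets the model fit to row order, which would not generalize to a file sorted differently.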


### Preprocessing before model training

In [82]:
# Separate the input features from the output (target) feature
X = df.drop("y", axis=1)
y = df["y"]

In [83]:
# Split the data into train and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
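
If the `yes` class of `y` is much rarer than the `no` class (an assumption; the class balance is not shown in this notebook), a purely random split can leave the two sets with slightly different class proportions. The `stratify` argument of `train_test_split` keeps the proportions equal. The sketch below uses synthetic data with a 10% positive rate chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives (made-up proportion).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate in train: {y_tr.mean():.2f}")
print(f"Positive rate in test:  {y_te.mean():.2f}")
```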

In [84]:
# Scale the input features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.values)
X_test = scaler.transform(X_test.values)

In [85]:
# Train the Logistic regression model with train data
model = LogisticRegression()
model.fit(X_train, y_train)

In [86]:
# Predict train data with trained model
y_train_pred = model.predict(X_train)

In [87]:
# Predict test data with trained model
y_test_pred = model.predict(X_test)

In [88]:
# Print accuracy, precision, recall and F1 for a set of predictions
def evaluate_model(true_val, pred_val):
    print(f"Accuracy: {accuracy_score(true_val, pred_val): 0.2f}")
    print(f"Precision: {precision_score(true_val, pred_val): 0.2f}")
    print(f"Recall: {recall_score(true_val, pred_val): 0.2f}")
    print(f"F1: {f1_score(true_val, pred_val): 0.2f}")

In [89]:
# Evaluate model for train data
evaluate_model(y_train, y_train_pred)

Accuracy:  0.92
Precision:  0.78
Recall:  0.45
F1:  0.57


In [90]:
# Evaluate model for test data
evaluate_model(y_test, y_test_pred)

Accuracy:  0.91
Precision:  0.78
Recall:  0.42
F1:  0.55
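
`confusion_matrix` is imported at the top of the notebook but never used. It helps to read a result like the one above, where precision is much higher than recall: the model raises few false alarms but misses many actual subscribers. The labels below are made up purely to show how precision and recall follow from the four confusion-matrix counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 negatives, 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:0.2f}, Recall: {recall:0.2f}")
```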


### Hyperparameter Tuning

In [119]:
# Define the hyperparameter grid
params = {
    "penalty": ["l1", "l2", "none"],
    "max_iter": [100, 200, 300],
    "solver": ["lbfgs", "liblinear", "sag"],
}
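
Not every penalty and solver pair in this grid is valid: in scikit-learn, `lbfgs` and `sag` do not support the `l1` penalty, and `liblinear` does not support fitting without a penalty. By default `GridSearchCV` scores a failing candidate as NaN, and with warnings silenced such failures are easy to miss. One way to search only valid combinations is to pass a list of grids, one per solver family. The sketch below runs on synthetic data and leaves out the no-penalty option because its spelling differs between scikit-learn versions; the names `valid_params` and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Each dict pairs solvers only with penalties they support.
valid_params = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "max_iter": [100, 200, 300]},
    {"solver": ["lbfgs", "sag"], "penalty": ["l2"], "max_iter": [100, 200, 300]},
]

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(LogisticRegression(), valid_params, cv=5)
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
print(f"Candidates: {len(scores)}, failed: {int(np.isnan(scores).sum())}")
print(search.best_params_)
```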

In [120]:
# Optimize the model
clf = LogisticRegression()
gscv = GridSearchCV(clf, params, cv=5, verbose=1)
gscv.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
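
Two design choices in this search are worth a second look. First, `X_train` was scaled before the search, so each cross-validation fold is scored on data whose scaling statistics already include that fold; wrapping the scaler and the model in a `Pipeline` refits the scaler inside every fold. Second, `GridSearchCV` uses accuracy by default for a classifier, whereas a `scoring` argument such as `"f1"` would select on a metric that reflects recall as well. The sketch below illustrates both on synthetic data; tuning `C` and scoring by F1 are illustrative assumptions, not part of the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is refit on the training part of every fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
pipe_params = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.1, 1.0, 10.0],  # inverse regularization strength
}

# Synthetic stand-in for the unscaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe_search = GridSearchCV(pipe, pipe_params, cv=5, scoring="f1")
pipe_search.fit(X_demo, y_demo)

print(pipe_search.best_params_)
print(f"Best mean CV F1: {pipe_search.best_score_:0.2f}")
```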


In [121]:
# Get the best parameters and the best mean cross-validated accuracy
best_params = gscv.best_params_
print(best_params)
print(f"Best Accuracy:{gscv.best_score_}")

{'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Accuracy:0.9192977477367542


In [123]:
# Train model with best parameters
model = LogisticRegression(**best_params)
model.fit(X_train, y_train)

In [124]:
# Predict train data with trained model
y_train_pred = model.predict(X_train)

In [125]:
# Predict test data with trained model
y_test_pred = model.predict(X_test)

In [126]:
# Evaluate model for train data
evaluate_model(y_train, y_train_pred)

Accuracy:  0.92
Precision:  0.78
Recall:  0.45
F1:  0.57


In [127]:
# Evaluate model for test data
evaluate_model(y_test, y_test_pred)

Accuracy:  0.91
Precision:  0.78
Recall:  0.42
F1:  0.55
