# Banking Classification With Logistic Regression

## About Dataset

A Portuguese bank has seen its revenue decline and wants to know what action to take. An investigation found that the root cause is that customers are not investing enough in long-term deposits. The bank would therefore like to identify existing customers who are more likely to subscribe to a long-term deposit and focus its marketing efforts on them.

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (**yes**) or not (**no**).

This dataset contains two main files:

1. **train.csv**: This file contains **32950** rows with **16** features, including the target feature. The data spans May 2008 to November 2010.
2. **test.csv**: This file contains **8238** rows with **13** features, excluding the target feature. The test data has already undergone preprocessing.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/rashmiranu/banking-dataset-classification/data

## Data Dictionary

* **age**: Numeric. Age of the person.
* **job**: Categorical. Type of job ('admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown').
* **marital**: Categorical. Marital status of the person ('divorced','married','single','unknown'; note: 'divorced' means divorced or widowed).
* **education**: Categorical. Education level of the person ('basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown').
* **default**: Categorical. Whether the person has credit in default ('no','yes','unknown').
* **housing**: Categorical. Whether the person has a housing loan.
* **loan**: Categorical. Whether the person has a personal loan.
* **contact**: Categorical. Contact communication type ('cellular','telephone').
* **month**: Categorical (ordinal). Month of the last contact with the person ('jan', 'feb', 'mar', …, 'nov', 'dec').
* **day_of_week**: Categorical (ordinal). Day of the week of the last contact with the person ('mon','tue','wed','thu','fri').
* **duration**: Numeric. Duration of the last contact, in seconds.
* **campaign**: Numeric. Number of contacts performed during this campaign for this client (includes the last contact).
* **pdays**: Numeric. Number of days since the client was last contacted in a previous campaign (999 means the client was not previously contacted).
* **previous**: Numeric. Number of contacts performed before this campaign for this client.
* **poutcome**: Categorical. Outcome of the previous marketing campaign for the person ('failure','nonexistent','success').
* **y**: Target feature (binary). Whether the client has subscribed to a term deposit ('yes','no').

## Problem Statements

1. **Feature Selection**: Select the most significant features for classification.
2. **Feature Engineering**: Encode the categorical features with one-hot encoding or another appropriate encoding technique.

### Load Necessary Libraries

In [23]:
# General Purpose Libraries
import pandas as pd
import numpy as np

# Feature selection Libraries
from sklearn.feature_selection import SelectKBest, chi2

### Load Dataset

In [82]:
# csv_path = "train_wo_dup.csv"
csv_path = "train_wo_no.csv"
# csv_path = "train_wo_out.csv"
df = pd.read_csv(csv_path)

In [83]:
# Print 1st 5 rows
df.head()

|   | index | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | blue-collar | married | basic.9y | unknown | no | no | cellular | nov | wed | 227 | 4 | nonexistent | no |
| 1 | 1 | 37 | entrepreneur | married | university.degree | no | no | no | telephone | nov | wed | 202 | 2 | failure | no |
| 2 | 2 | 78 | retired | married | basic.4y | no | no | no | cellular | jul | mon | 1148 | 1 | nonexistent | yes |
| 3 | 3 | 36 | admin. | married | university.degree | no | yes | no | telephone | may | mon | 120 | 2 | nonexistent | no |
| 4 | 4 | 59 | retired | divorced | university.degree | no | no | no | cellular | jun | tue | 368 | 2 | nonexistent | no |


### Feature Engineering

In [84]:
# Encode target feature
df["y"] = df["y"].map({"yes": 1, "no": 0})

In [85]:
# Sanity check
df.head()

|   | index | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | blue-collar | married | basic.9y | unknown | no | no | cellular | nov | wed | 227 | 4 | nonexistent | 0 |
| 1 | 1 | 37 | entrepreneur | married | university.degree | no | no | no | telephone | nov | wed | 202 | 2 | failure | 0 |
| 2 | 2 | 78 | retired | married | basic.4y | no | no | no | cellular | jul | mon | 1148 | 1 | nonexistent | 1 |
| 3 | 3 | 36 | admin. | married | university.degree | no | yes | no | telephone | may | mon | 120 | 2 | nonexistent | 0 |
| 4 | 4 | 59 | retired | divorced | university.degree | no | no | no | cellular | jun | tue | 368 | 2 | nonexistent | 0 |


In [86]:
# month and day_of_week are ordinal in nature. So encode them with their respective numeric values

# Define month dictionary
month_dict = {
    "jan": 1,
    "feb": 2,
    "mar": 3,
    "apr": 4,
    "may": 5,
    "jun": 6,
    "jul": 7,
    "aug": 8,
    "sep": 9,
    "oct": 10,
    "nov": 11,
    "dec": 12
}
# Encode month with month dictionary
df["month"] = df["month"].map(month_dict)

# Define week dictionary
week_dict = {
    "sun": 0,
    "mon": 1,
    "tue": 2,
    "wed": 3,
    "thu": 4,
    "fri": 5,
    "sat": 6
}
# Encode day_of_week with week dictionary
df["day_of_week"] = df["day_of_week"].map(week_dict)

In [87]:
# Sanity check
df.head()

|   | index | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | blue-collar | married | basic.9y | unknown | no | no | cellular | 11 | 3 | 227 | 4 | nonexistent | 0 |
| 1 | 1 | 37 | entrepreneur | married | university.degree | no | no | no | telephone | 11 | 3 | 202 | 2 | failure | 0 |
| 2 | 2 | 78 | retired | married | basic.4y | no | no | no | cellular | 7 | 1 | 1148 | 1 | nonexistent | 1 |
| 3 | 3 | 36 | admin. | married | university.degree | no | yes | no | telephone | 5 | 1 | 120 | 2 | nonexistent | 0 |
| 4 | 4 | 59 | retired | divorced | university.degree | no | no | no | cellular | 6 | 2 | 368 | 2 | nonexistent | 0 |
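
`Series.map` with a dictionary silently produces `NaN` for any label that is missing from the dictionary, so a misspelled or unexpected category would go unnoticed until later. The sketch below shows one way to guard against that. The helper name `encode_with_mapping` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def encode_with_mapping(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers and fail loudly if any value has no mapping."""
    encoded = series.map(mapping)
    # Any NaN after mapping means the original value was missing from the dictionary
    # (or was already null), so report those values instead of passing them on.
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Values without a mapping: {list(unmapped)}")
    return encoded.astype("int64")


# Toy data, not taken from train.csv
months = pd.Series(["may", "jun", "nov"])
month_codes = encode_with_mapping(months, {"may": 5, "jun": 6, "nov": 11})
print(month_codes.tolist())  # [5, 6, 11]
```

The same helper could be applied to `df["month"]` with `month_dict` and to `df["day_of_week"]` with `week_dict` if stricter checking is wanted.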


In [88]:
# Use one hot encoding to encode all other categorical features
df = pd.get_dummies(df, dtype="int", drop_first=True)

In [89]:
# Sanity check
df.head()

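|   | index | age | month | day_of_week | duration | campaign | y | job_blue-collar | job_entrepreneur | job_housemaid | ... | education_unknown | default_unknown | default_yes | housing_unknown | housing_yes | loan_unknown | loan_yes | contact_telephone | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | 11 | 3 | 227 | 4 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 37 | 11 | 3 | 202 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2 | 78 | 7 | 1 | 1148 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3 | 36 | 5 | 1 | 120 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 4 | 59 | 6 | 2 | 368 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |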
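
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame (the three rows below are made up for illustration). For each categorical column, `pd.get_dummies` sorts the categories and drops the first one, which then acts as the reference level: a row with all zeros in that column's dummies belongs to the dropped category.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "age": [49, 37, 78],
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["nonexistent", "failure", "success"],
})

# Numeric columns pass through unchanged; object columns become 0/1 indicators.
# 'cellular' and 'failure' sort first in their columns, so they are dropped.
encoded = pd.get_dummies(toy, dtype="int", drop_first=True)
print(encoded.columns.tolist())
# ['age', 'contact_telephone', 'poutcome_nonexistent', 'poutcome_success']
```

This is why the encoded data above has `contact_telephone` but no `contact_cellular`, and `poutcome_nonexistent` and `poutcome_success` but no `poutcome_failure`.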


In [90]:
# Save encoded train data
df.to_csv("train_encoded.csv", index= False)

### Feature Selection

In [65]:
# Separate X and y
X = df.drop("y", axis=1)
y = df["y"]

In [66]:
kbest = SelectKBest(score_func=chi2, k=20)
kbest.fit(X, y)
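
After fitting, `SelectKBest` exposes the per-feature statistics through its `scores_` and `pvalues_` attributes, which can help explain why a feature was kept. The sketch below ranks features by their chi-squared score on a small made-up dataset; the names `X_toy`, `y_toy`, and `ranking` are assumptions for illustration. Note that `chi2` expects non-negative feature values, which the integer and 0/1 columns produced above satisfy.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 'signal' follows the target exactly, 'noise' is unrelated to it
X_toy = pd.DataFrame({
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
    "signal": [0, 0, 0, 0, 1, 1, 1, 1],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Collect the score and p-value of every feature, highest score first
ranking = pd.DataFrame({
    "feature": X_toy.columns,
    "chi2": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("chi2", ascending=False)
print(ranking)
```

On this toy data `signal` ranks first. The same table could be built for the fitted `kbest` object by using `X.columns`, `kbest.scores_`, and `kbest.pvalues_`.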

In [67]:
# Get Selected Feature Indices
selected_feature_indices = kbest.get_support(indices= True)
selected_feature_indices

array([ 0,  1,  2,  4,  5,  6, 10, 12, 13, 15, 17, 18, 20, 21, 25, 26, 27,
       33, 34, 35])
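
The indices returned by `get_support(indices=True)` are positions within the matrix that was passed to `fit`, which here is `X` (the frame without `y`). Looking them up in a frame that still contains the target shifts every position after the target by one. The toy sketch below illustrates the difference; the frame `df_toy` and its columns are made up for this example.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy frame where the target sits between the feature columns
df_toy = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "y": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [1, 0, 1, 0, 1, 0, 1, 0],
    "c": [0, 0, 0, 1, 1, 1, 1, 1],
})
X_toy = df_toy.drop("y", axis=1)
y_toy = df_toy["y"]

selector = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy)
idx = selector.get_support(indices=True)

# Correct: the positions refer to the columns of the fitted matrix
names_from_X = X_toy.columns[idx].tolist()
# Misaligned: the full frame has an extra column ('y') before 'b' and 'c'
names_from_df = df_toy.columns[idx].tolist()

print(names_from_X)   # ['a', 'c']
print(names_from_df)  # ['a', 'b']
```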

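In [68]:
# Get selected feature names. The indices are positions in X (which excludes "y"),
# so they must be looked up in X.columns rather than df.columns
selected_features = X.columns[selected_feature_indices]
selected_features

Index(['index', 'age', 'month', 'duration', 'campaign', 'job_blue-collar',
       'job_retired', 'job_services', 'job_student', 'job_unemployed',
       'marital_married', 'marital_single', 'education_basic.6y',
       'education_basic.9y', 'education_university.degree',
       'education_unknown', 'default_unknown', 'contact_telephone',
       'poutcome_nonexistent', 'poutcome_success'],
      dtype='object')

In [69]:
# Keep the selected features together with the target
df_selected = df[list(selected_features) + ["y"]]
df_selected.head()

|   | index | age | month | duration | campaign | job_blue-collar | job_retired | job_services | job_student | job_unemployed | marital_married | marital_single | education_basic.6y | education_basic.9y | education_university.degree | education_unknown | default_unknown | contact_telephone | poutcome_nonexistent | poutcome_success | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | 11 | 227 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 37 | 11 | 202 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | 78 | 7 | 1148 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | 36 | 5 | 120 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 4 | 59 | 6 | 368 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |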


In [70]:
# Save train dataset with selected features
df_selected.to_csv("train_selected.csv", index= False)