## <center> Bank Marketing Analysis </center> 

<center>by Runtian Li, Rafe Chang, Sid Grover, Anu Banga </center>

**Repo Link:** https://github.com/UBC-MDS/dsci_522_group_8.git

In [1]:
## Import necessary Packages
import altair as alt
import altair_viewer
alt.data_transformers.enable("vegafusion")

import pandas as pd
import numpy as np
import statistics
import os
import sys

import warnings
warnings.filterwarnings("ignore")

sys.path.append("code/.")

# Data 
from ucimlrepo import fetch_ucirepo 

# Machine Learning
import IPython
import matplotlib.pyplot as plt
import mglearn
from IPython.display import HTML, display
# from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score, accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import PrecisionRecallDisplay

# %matplotlib inline
pd.set_option("display.max_colwidth", 200)

from IPython.display import Image

## <center> Summary </center>
Here we build a model of balanced SVC to try to predict if a new client will subscribe to a term deposit. We tested five different classification models, including dummy classifier, unbalanced/balanced logistic regression, and unbalanced/balanced SVC, and chose the optimal model of balanced SVC based on how the model scored on the test data; the model has the highest test recall score of 0.82, which indicates that the model makes the least false negative predictions among all five models. 

The balanced support vector machines model considers 13 different numerical/ categorical features of customers. After hyperparameter optimization, the model's test accuracy increased from 0.82 to 0.875. The results were somewhat expected, given SVC's known efficacy in classification tasks, particularly when there's a clear margin of separation. The high recall score of 0.875 indicates that the model is particularly adept at identifying clients likely to subscribe, which was the primary goal. It's noteworthy that such a high recall was achieved, as it suggests the model is highly sensitive to true positive cases.

## <center> Introduction </center>

### Background
The data set Bank Marketing was created by Sérgio Moro and Paulo Rita at the University Institute of Lisbon, and Paulo Cortez at the University of Minhom. It is sourced from the UCI Machine Learning Repository. Each row in this data set is an observation related to direct marketing campaigns (phone calls) of a Portuguese banking institution.

### Research Question

We are working on a binary classification model. The classification goal is to predict if the client will subscribe a term deposit: "yes" for will subscribe and "no" for won't subscribe.

### Data Description
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. It was sourced from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/dataset/222/bank+marketing). We will be using bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). 

These are the detail of all inputs:

| Feature Name | Type        | Description                                                                                   | Classes |
|--------------|-------------|-----------------------------------------------------------------------------------------------|---------|
| age          | Numeric     |                                                                                               |         |
| job          | Categorical | Type of job                                                                                   | 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown' |
| marital      | Categorical | Marital status                                                                                | 'divorced','married','single','unknown' |
| education    | Categorical |                                                                                               | 'primary', 'secondary', 'tertiary', 'unknown' |
| default      | Categorical | Has credit in default?                                                                        | 'no', 'yes', 'unknown' |
| housing      | Categorical | Has housing loan?                                                                             | 'no', 'yes', 'unknown' |
| loan         | Categorical | Has personal loan?                                                                            | 'no', 'yes', 'unknown' |
| balance      | Numeric     | Balance of the individual                                                                     |         |
| contact      | Categorical | Contact communication type                                                                    | 'cellular', 'telephone' |
| month        | Categorical | Last contact month of year                                                                    | 'jan', 'feb', 'mar', ..., 'nov', 'dec' |
| day          | Categorical | Last contact day of the week                                                                  | 'mon', 'tue', 'wed', 'thu', 'fri' |
| duration     | Numeric     | Last contact duration, in seconds                                                             |         |
| campaign     | Numeric     | Number of contacts performed during this campaign and for this client                        |         |
| pdays        | Numeric     | Number of days that passed by after the client was last contacted from a previous campaign    |         |
| previous     | Numeric     | Number of contacts performed before this campaign and for this client                         |         |
| poutcome     | Categorical | Outcome of the previous marketing campaign                                                    | 'failure', 'nonexistent', 'success' |
| y            | Binary      | Has the client subscribed to a term deposit?                                                  | 'yes', 'no' |


The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

## <center>Results and Discussion </center>

### Exploratory Data Analysis 

In [2]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Import the unique feature values functions function from the src folder
sys.path.append('..')
from src.unique import get_uniques

df = pd.read_csv("data/bank-full.csv", delimiter=";")
df.rename(columns={"y": "target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
    
get_uniques(df)

{'age': 0     58
 1     44
 2     33
 3     47
 4     35
       ..
 72    84
 73    87
 74    92
 75    93
 76    88
 Length: 77, dtype: int64,
 'job': 0        management
 1        technician
 2      entrepreneur
 3       blue-collar
 4           unknown
 5           retired
 6            admin.
 7          services
 8     self-employed
 9        unemployed
 10        housemaid
 11          student
 dtype: object,
 'marital': 0     married
 1      single
 2    divorced
 dtype: object,
 'education': 0     tertiary
 1    secondary
 2      unknown
 3      primary
 dtype: object,
 'default': 0     no
 1    yes
 dtype: object,
 'housing': 0    yes
 1     no
 dtype: object,
 'loan': 0     no
 1    yes
 dtype: object,
 'contact': 0      unknown
 1     cellular
 2    telephone
 dtype: object,
 'day': 0      5
 1      6
 2      7
 3      8
 4      9
 5     12
 6     13
 7     14
 8     15
 9     16
 10    19
 11    20
 12    21
 13    23
 14    26
 15    27
 16    28
 17    29
 18    30
 19   

In [3]:
# Import the eda_plotting functions function from the src folder
sys.path.append('..')
from src.eda_plotting import (
                                EDA_plot, 
                                spearman_correlation_matrix, 
                                text_EDA
                             )

In [None]:
numeric_cols = train_df.select_dtypes(include=['int64', 'float64']).columns.to_list()
categorical_cols = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]
numerical_cols = numeric_cols

In [None]:
text_EDA(train_df)

In [None]:
display(spearman_correlation_matrix(df, numerical_cols))

In [None]:
display(EDA_plot(df, numeric_cols, categorical_cols))

### Preprocessing

<div class="alert alert-info">
    
- Since there is no missing values in our dataset, we don't need to do imputation or drop NAs.   
- We are going to drop "contact", "day" and "month" column here since they are not helping us in identifying useful underlying pattern in the model.    
- We take "age", "balance", "duration", "campaign", "pdays", "previous" as numerical features and we are doing StandardScaler transformation on them.
- We take "job", "marital", "education", "default", "housing", "loan", "poutcome" as categorical features and we are doing one hot encoding on them. We dropped columns only if the categorical is binary.
    
</div>

In [None]:
numeric_looking_columns = train_df.select_dtypes(include=np.number).columns.tolist()
print(numeric_looking_columns)

In [None]:
# Lists of feature names
numerical_features = ["age", "balance", "duration", "campaign", "pdays", "previous"]
categorical_features = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]
drop_features = ["contact", "day", "month"]

# Create the column transformer
preprocessor = make_column_transformer(    
    (StandardScaler(), numerical_features),  # scaling on numeric features   
    (OneHotEncoder(drop="if_binary"), categorical_features),  # OHE on categorical features
    ("drop", drop_features),  # drop the drop features
)

# Show the preprocessor
preprocessor

In [None]:
# Seperate X and y
X_train = train_df.drop(columns=["target"])
X_test = test_df.drop(columns=["target"])
y_train = train_df["target"]
y_test = test_df["target"]

In [None]:
# This line nicely formats the feature names from `preprocessor.get_feature_names_out()` so that we can more easily use them below
preprocessor.verbose_feature_names_out = False

# Create a dataframe with the transformed features and column names
ct = preprocessor.fit(X_train)

# Columns names after one hot encoding
ohe_columns = list(
    preprocessor.named_transformers_["onehotencoder"]
    .get_feature_names_out(categorical_features)
)

# Columns after transformation
new_columns = (
    numerical_features + ohe_columns
)

# Now create the DataFrame with the dense data
X_train_enc = pd.DataFrame(preprocessor.transform(X_train), index=X_train.index, columns=new_columns)
X_train_enc.head()

### Model Selection

In [None]:
# 1. Base Model: Dummy Classifier
classification_metrics = ["accuracy", "precision", "recall", "f1"]
dc = DummyClassifier(strategy="most_frequent")
pipe_dc = make_pipeline(preprocessor, dc)
# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results = {}
scoring = {
    "accuracy": 'accuracy',
    'precision': make_scorer(precision_score, pos_label="yes", zero_division=0),
    'recall': make_scorer(recall_score, pos_label="yes"),
    'f1': make_scorer(f1_score, pos_label="yes")
}  # scoring can be a string, a list, or a dictionary

cross_val_results['dummy'] = pd.DataFrame(cross_validate(pipe_dc, X_train, y_train, return_train_score=True, scoring=scoring)).agg(['mean', 'std']).round(3).T

# Show the train and validation scores
cross_val_results['dummy']

In [None]:
# 2. Logistic regression

# The logreg model pipeline
logreg = make_pipeline(preprocessor, LogisticRegression(max_iter=1000, random_state=123))

# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results['logreg'] = pd.DataFrame(cross_validate(logreg, X_train, y_train, return_train_score=True, scoring=scoring)).agg(['mean', 'std']).round(3).T

# Show the train and validation scores
cross_val_results['logreg'] 

In [None]:
# 3. Support vector classifier

# The svc model pipeline
svc = make_pipeline(preprocessor, SVC(random_state=123))

# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results['svc'] = pd.DataFrame(cross_validate(svc, X_train, y_train, return_train_score=True, scoring=scoring)).agg(['mean', 'std']).round(3).T
# Show the train and validation scores
cross_val_results['svc'] 

In [None]:
# 4. Balanced logistic regression
logreg_bal = make_pipeline(preprocessor, 
                           LogisticRegression(max_iter=1000, 
                                              random_state=123, 
                                              class_weight="balanced"))

# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results['logreg_bal'] = pd.DataFrame(cross_validate(logreg_bal, X_train, y_train, return_train_score=True, scoring=scoring)).agg(['mean', 'std']).round(3).T

# Show the train and validation scores
cross_val_results['logreg_bal'] 

In [None]:
# 5. Balanced support vector classifier
svc_bal = make_pipeline(preprocessor, SVC(random_state=123, class_weight="balanced"))

# The mean and std of the cross validated scores for all metrics as a dataframe
cross_val_results['svc_bal'] = pd.DataFrame(cross_validate(svc_bal, X_train, y_train, return_train_score=True, scoring=scoring)).agg(['mean', 'std']).round(3).T

# Show the train and validation scores
cross_val_results['svc_bal'] 

In [None]:
# Compare the average scores of all the models
pd.concat(
    cross_val_results,
    axis='columns'
).xs(
    'mean',
    axis='columns',
    
    level=1
).style.format(
    precision=2
).background_gradient(
    axis=None
)

<div class="alert alert-info">

`Dummy Classifier` has low accuracy and zero precision, recall, and F1 scores, indicating it never predicts the positive class (in this case the client subscribed a term deposit). This is expected as it always predicts the most frequent class.

`logreg` shows improved accuracy over the dummy model. However, its recall is low, suggesting it misses a significant number of true positive cases. `svc` performed almost the same as logistic regression model among all metrics.

`logreg_bal` and `svc_bal` have lower accuracy compared to their unbalanced counterparts but significantly higher recall. This indicates they are better at identifying positive cases but at the cost of making more false positive errors.

Given the context of our bank marketing data set, we aim to detect the clients who will subscribe a term deposit given the features. Missing a potential "yes" could be more costly than false positives, as it represents a lost opportunity for the sales team to transform this potential customer. Therefore, we chose `svc_bal` as the model has the highest `test_recall` score. 
    
</div>

In [None]:
svc_bal.fit(X_train, y_train)
confmat_svc_bal = ConfusionMatrixDisplay.from_estimator(
    svc_bal,
    X_train,
    y_train,
    values_format="d",) 
confmat_svc_bal

### Hyperparameter Optimization

<div class="alert alert-info">

Optimizing hyperparameters in SVC with a smaller sample size of 10,000 instances is a strategy aimed at enhancing computational efficiency. This approach expedites the exploration of hyperparameter possibilities, aiding in the discovery of potential configurations. While the outcomes validate the concept, it's crucial to recognize and manage the constraints stemming from the smaller dataset size when interpreting the results.
    
</div>

In [None]:
# Creating a sample of 10000 observations
sample_data = df.sample(n=10000, random_state=123)
train_df_sampled, test_df_sampled = train_test_split(sample_data, test_size=0.2, random_state=123)

X_train_sampled = train_df_sampled.drop(columns=["target"])
X_test_sampled = test_df_sampled.drop(columns=["target"])
y_train_sampled = train_df_sampled["target"]
y_test_sampled = test_df_sampled["target"]

# Transformation on the sample training data
sample_preprocessor = make_column_transformer(
    (StandardScaler(), numerical_features),
    (OneHotEncoder(drop="if_binary"), categorical_features),
    ("drop", drop_features),
)

X_train_sampled_enc = pd.DataFrame(sample_preprocessor.fit_transform(X_train_sampled), index=X_train_sampled.index, columns=new_columns)

svc_bal_sample = make_pipeline(sample_preprocessor, SVC(random_state=123, class_weight="balanced"))

param_dist = {
    'svc__C': uniform(0.1, 10),
    'svc__gamma': uniform(0.001, 0.1),
    'svc__kernel': ['rbf', 'sigmoid', 'linear']
}

# Perform RandomizedSearchCV for hyperparameter optimization
random_search = RandomizedSearchCV(svc_bal_sample, param_distributions=param_dist, n_iter=25, cv=5, n_jobs=-1, random_state=123)
random_search.fit(X_train_sampled, y_train_sampled)

# Best hyperparameters
best_params_random = random_search.best_params_
print("Best Hyperparameters (Randomized Search):", best_params_random)

In [None]:
pd.DataFrame(random_search.cv_results_)[
    [
        "mean_test_score",
        "param_svc__gamma",
        "param_svc__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

### Test results after hyperparameter optimization

In [None]:
# Evaluate the best model on the test set
best_model_random = random_search.best_estimator_
accuracy_random = best_model_random.score(X_test, y_test)
print("Accuracy on Test Set:", accuracy_random)

In [None]:
predictions = best_model_random.predict(X_test)

recall = recall_score(y_test, predictions, pos_label='yes')
print("Recall on Test Set:", recall)

In [None]:
results = pd.DataFrame(random_search.cv_results_)

scatter = alt.Chart(results).mark_circle().encode(
    x='param_svc__C:Q',
    y='param_svc__gamma:Q',
    color=alt.Color('mean_test_score:Q', 
                    scale=alt.Scale(scheme='viridis', reverse=True)
                   )
).properties(
    width=400,
    height=300,
    title='C and gamma vs. Mean Test Score'
)

scatter

# <center> Discussions </center>

### Key Findings

In this bank marketing analysis project, we aimed to develop a binary classification model to predict client subscription to term deposits. We tested Logistic Regression and Support Vector Classifier (SVC) models, focusing on recall as a key performance metric. The SVC model outperformed Logistic Regression in recall, and after hyperparameter optimization, it achieved a recall score of 0.875 on the test dataset, which is quite promising!

### Reflection on Expectations

The results were somewhat expected, given SVC's known efficacy in classification tasks, particularly when there's a clear margin of separation. The high recall score of 0.875 indicates that the model is particularly adept at identifying clients likely to subscribe, which was the primary goal. It's noteworthy that such a high recall was achieved, as it suggests the model is highly sensitive to true positive cases.

### Impact of Finding

The high recall score of this model has significant implications for targeted marketing strategies. It suggests that the bank can confidently use the model's predictions to focus its marketing efforts on clients predicted to subscribe, potentially increasing the efficiency and effectiveness of its campaigns. This targeted approach could lead to higher conversion rates with lower marketing expenses. However, it's important to balance such a high recall with precision to ensure that the bank doesn't unnecessarily target unlikely prospects.

### Future Improvements

The success of this model leads to several potential areas for further exploration:

- Balancing Precision and Recall: Investigating methods to enhance precision without substantially reducing recall.
- Feature Analysis: Identifying which features most significantly influence subscription predictions.
Model Interpretability: Improving the model's interpretability to better understand the basis for its predictions.
- Temporal Adaptability: Assessing the model's adaptability to evolving trends and customer behaviors over time.
- Testing Alternative Models: Exploring whether ensemble methods or more advanced machine learning algorithms could yield better or comparable results.
- Customer Segmentation: Evaluating the model's performance across different customer segments to tailor more specific marketing strategies.

# <center> References </center>

Moro,S., Rita,P., and Cortez,P., 2012. Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306"

Timbers,T. , Ostblom,J., and Lee,M., 2023. Breast Cancer Predictor Report. GitHub repository, https://github.com/ttimbers/breast_cancer_predictor_py/blob/0.0.1/src/breast_cancer_predictor_report.ipynb",

Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., 62, 22-31.

Alsolami, F.J., Saleem, F., & Al-Ghamdi, A.S. (2020). Predicting the Accuracy for Telemarketing Process in Banks Using Data Mining.

Vajiramedhin, C., & Suebsing, A. (2014). Feature Selection with Data Balancing for Prediction of Bank Telemarketing. Applied mathematical sciences, 8, 5667-5672.

Moura, A.F., Pinho, C.M., Napolitano, D.M., Martins, F.S., & Fornari Junior, J.C. (2020). Optimization of operational costs of Call centers employing classification techniques. Research, Society and Development, 9.