## Identifying the Top Three Predictors of Term Deposit Subscriptions

by Jerry Yu, John Shiu, Sophia Zhao, & Zeily Garcia
2023/11/19

In [1]:
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, classification_report,
    PrecisionRecallDisplay, average_precision_score, recall_score, precision_score
)
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer

# Summary

This report analyzes the factors influencing client subscriptions to term deposits at a Portuguese banking institution. Using a dataset of 45,211 client contacts with 16 input features and a binary target, we apply logistic regression and decision tree classifiers to identify the top three predictors of term deposit subscriptions. Preprocessing involves handling "unknown" values, encoding categorical variables, and standardizing numerical variables. Our exploratory data analysis uses visualizations to understand feature distributions and correlations, while model evaluation focuses on precision and recall because of the dataset's class imbalance. Logistic regression proved slightly superior in precision to the decision tree classifier. The analysis identifies the outcome of previous campaigns, the month of contact, and the call duration as the most important predictors. These findings offer insight into clients' decisions regarding term deposit subscriptions and suggest areas for future research.

# Introduction

In an age where banking and finance are rapidly transforming, the ability to analyze and predict customer behavior has become indispensable for crafting impactful marketing strategies. Term deposits, a staple financial product, offer customers a secure investment option, while providing banks with a stable funding source. Understanding what motivates a customer to commit to these term deposits is of particular interest to financial institutions seeking to bolster their customer base and secure funds.

Our investigation centers on identifying the key determinants that influence a customer's decision to subscribe to a term deposit. We pose the following question: What are the primary factors that predict whether a client will subscribe to a term deposit at a banking institution? To address this question, we use `bank-full.csv`, a dataset provided by a Portuguese bank that details the outcomes of its marketing campaigns for term deposits. The dataset is distributed alongside several companion files: `bank.csv`, a 10% subset of `bank-full.csv`; `bank-additional-full.csv`, a newer version covering May 2008 to November 2010; `bank-additional.csv`, a 10% subset of `bank-additional-full.csv`; and the description files `bank-names.txt` and `bank-additional-names.txt`. `bank-full.csv` includes a range of variables, from client demographic information to details of contact during the campaigns, and encompasses over 45,000 individual client records.

To uncover the patterns within this data, we employ two widely recognized machine learning models: logistic regression and decision tree classifiers. These models are chosen for their ability to handle both numerical and categorical data, providing a robust framework for our analysis. Through this data-driven investigation, we aim to offer reliable and actionable insights that could improve the efficiency of marketing strategies aimed at increasing term deposit subscriptions. A significant challenge is the class imbalance within the dataset. This imbalance can skew predictive modeling, leading to overfitting and reduced model performance on unseen data. To address this, we evaluate our models with precision and recall in addition to accuracy, so that our conclusions reflect the underlying patterns rather than the dominance of the majority class.

# Methods

## Data
The data set used in this project was created by Sérgio Moro, P. Cortez, and P. Rita, and is hosted on the UCI Machine Learning Repository (Moro et al., 2012). It can be accessed [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data was sourced from the marketing campaigns of a Portuguese banking institution, focusing on client subscriptions to term deposits. The primary dataset, `bank-full.csv`, includes 45,211 examples with 16 input features and one target column. Each row in the dataset represents information from a client contact during the campaign. The features cover client demographic information (such as age, job type, marital status, and education level), financial information (including credit default history, average yearly balance, and housing and personal loan status), and information about the bank's marketing (such as the type of contact, date of the last contact, call duration, number of contacts in the current campaign, days elapsed since the last contact in a previous campaign, and outcome of the previous campaign). The target, variable `y`, is a binary outcome indicating whether the client subscribed to a term deposit.

## Analysis
Logistic regression and decision tree classifiers were employed to develop a model for determining the top 3 factors that aid in predicting whether a client will subscribe to a term deposit (variable `y`). All variables included in the original data set were used to fit the models. The "unknown" entries, which are especially common in `contact` and `poutcome`, were kept as a category of their own rather than removed. The exploratory data analysis focused on understanding variable distributions and relationships between features. It employed bar charts for categorical variables and histograms for numerical variables. Correlation heatmaps were generated to assess the relationships among numerical features and to detect potential multicollinearity.

In the preprocessing phase, categorical variables were transformed using one-hot encoding, ordered variables were ordinal encoded, and numerical variables were standardized. The data was split, assigning 75% to the training set and 25% to the test set. The logistic regression model was used to examine the relationship between the predictors and the likelihood of a client subscribing to a term deposit. Similarly, the decision tree classifier was employed to identify the decision-making framework that leads to subscriptions based on the features. Both models were evaluated with five-fold cross-validation using accuracy, precision, and recall.

The Python language (Van Rossum & Drake, 2009) and its libraries, including pandas (McKinney, 2010), NumPy (Harris et al., 2020), scikit-learn (Pedregosa et al., 2011), and Altair (VanderPlas, 2018), were essential for the analysis. The complete code and methodology are documented in [this notebook](https://github.com/UBC-MDS/dsci522_group21/blob/add-models/term_deposit_analysis.ipynb).

The analysis revealed that logistic regression exhibited slightly better precision than the decision tree classifier, with a smaller gap between training and testing scores, suggesting better generalizability. After model fitting and evaluation, the top three features predicting a client's decision to subscribe to a term deposit were identified by ranking the features by the absolute value of their logistic regression coefficients.


# Results & Discussion

**Data Preprocessing**

The dataset was randomly divided into a training set (75%) and a test set (25%), using a fixed random seed so that the split is reproducible.

In [2]:
# load the data set (semicolon-delimited CSV)
df_bank = pd.read_csv("../data/bank-full.csv", delimiter=";")

In [3]:
# split into train & test
train_df, test_df = train_test_split(df_bank, test_size=0.25, random_state=123)
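
The split above is a plain random split. Because the target is imbalanced, one possible variant (not used in this report) is a stratified split, which keeps the share of each class the same in both partitions. The sketch below illustrates the idea on a small hypothetical frame, `toy_df`, rather than on the bank data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows with a 12% positive class (not the bank data).
toy_df = pd.DataFrame({
    "balance": range(100),
    "y": ["yes"] * 12 + ["no"] * 88,
})

# stratify keeps the yes/no ratio the same in the train and test partitions.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.25, random_state=123, stratify=toy_df["y"]
)

train_share = (toy_train["y"] == "yes").mean()
test_share = (toy_test["y"] == "yes").mean()
print(f"train yes-share: {train_share:.2f}, test yes-share: {test_share:.2f}")
```

Whether stratification changes the conclusions for a dataset of this size is an open question; with over 45,000 rows a random split is likely to give similar class proportions anyway.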

**Exploratory Data Analysis**

During our analysis, we noted that the target classes are imbalanced. Additionally, the features `job`, `education`, `contact`, and `poutcome` include "unknown" entries, which are distinct from null values. Because the dataset documentation does not provide enough information, these unknown values cannot be properly imputed. The `contact` and `poutcome` columns contain a particularly large share of unknown values, so dropping the affected rows would have significantly reduced the data. We therefore retained all four features and treated "unknown" as a category in its own right: it is one-hot encoded for `job`, `contact`, and `poutcome`, and placed at the lowest level of the ordinal `education` scale.
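
The two checks described above, class proportions in the target and the share of "unknown" entries per column, can be sketched as follows. The frame `toy_df` and its values are hypothetical and only stand in for `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df with a few "unknown" entries.
toy_df = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services", "admin.", "unknown"],
    "education": ["primary", "secondary", "unknown", "tertiary", "secondary", "secondary"],
    "poutcome": ["unknown", "unknown", "success", "unknown", "failure", "unknown"],
    "y": ["no", "no", "yes", "no", "no", "no"],
})

# Share of each class in the target.
class_share = toy_df["y"].value_counts(normalize=True)

# Share of "unknown" entries in each feature column, largest first.
unknown_share = (
    (toy_df.drop(columns=["y"]) == "unknown").mean().sort_values(ascending=False)
)

print(class_share)
print(unknown_share)
```

Applied to the real training data, the same two lines would show which columns are dominated by "unknown" values.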

In [4]:
alt.data_transformers.enable("vegafusion")
alt.Chart(train_df).mark_bar().encode(
    x=alt.X('job:N', title='Job Types'), 
    y='count()'
).properties(
    title='Figure 1 - Distribution of Job Types'
)

As in *Figure 1*, bar charts were created for the categorical variables and histograms for the numerical variables. Correlation heatmaps based on two metrics (the Spearman and Pearson correlation coefficients) were generated to investigate the relationships between numerical variables. The Pearson heatmap is shown in this report because it measures linear correlation, which is the relevant quantity for identifying potential multicollinearity.
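
To illustrate how the two coefficients can differ, the sketch below computes both on a hypothetical pair of columns in which one grows exponentially with the other. The column names are borrowed from the dataset, but the values are invented for the example.

```python
import pandas as pd

# Hypothetical values: "previous" doubles with every step of "pdays",
# so the relationship is perfectly monotonic but not linear.
toy_df = pd.DataFrame({
    "pdays": [1, 2, 3, 4, 5, 6],
    "previous": [1, 2, 4, 8, 16, 32],
})

# Pearson measures linear association; Spearman measures rank (monotonic) association.
pearson = toy_df.corr(method="pearson").loc["pdays", "previous"]
spearman = toy_df.corr(method="spearman").loc["pdays", "previous"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the Spearman coefficient is 1 because the ranks agree exactly, while the Pearson coefficient is lower because the relationship is not a straight line.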

In [5]:
categorical_cols = list(train_df.drop(columns=["y"]).select_dtypes(include=["object"]).columns)
numerical_cols = list(train_df.select_dtypes(include=["int64"]).columns)

In [6]:
alt.data_transformers.enable("vegafusion")

Fig_2 = alt.Chart(train_df).mark_bar().encode(
    x=alt.X('previous', type='quantitative', bin=alt.Bin(maxbins=40), title='Interactions'),
    y='count()'
).properties(
    title='Figure 2 - Number of Contacts Before this Campaign (previous)'
)

Fig_3 = alt.Chart(train_df).mark_bar().encode(
    x=alt.X('pdays', type='quantitative', bin=alt.Bin(maxbins=40), title='Days'),
    y='count()'
).properties(
    title='Figure 3 - Days Passed after Last Contact (pdays)'
)


person_corr_df = train_df[numerical_cols].corr("pearson").unstack().reset_index()
person_corr_df.columns = ["num_variable_0", "num_variable_1", "correlation"]

corr_heatmap = alt.Chart(person_corr_df, title="Figure 4 - Pearson Correlation").mark_rect().encode(
    x=alt.X("num_variable_0").title("Numerical Variable"),
    y=alt.Y("num_variable_1").title("Numerical Variable"),
    color="correlation:Q"
).properties(
    width=250,
    height=250
)

text = alt.Chart(person_corr_df).mark_text().encode(
    x=alt.X("num_variable_0").title("Numerical Variable"),
    y=alt.Y("num_variable_1").title("Numerical Variable"),
    text=alt.Text("correlation:Q", format=".2f")
)

Fig_4 = corr_heatmap + text

(Fig_2 | Fig_3 | Fig_4)

The analysis revealed that the distributions of the `pdays` and `previous` variables are strongly skewed. Moreover, the two variables exhibit a moderate Pearson correlation of 0.44, which suggests potential multicollinearity between the recency of client engagement (`pdays`) and the intensity of past efforts to connect with clients (`previous`).

In [7]:
Fig_5 = alt.Chart(train_df, title="Figure 5 - Recency (pdays) vs Intensity (previous)").mark_point().encode(
    x=alt.X("pdays", title='Days (pdays)'),
    y=alt.Y("previous", title='Interactions (previous)').scale(domain=(0, 50), clamp=True)
)
Fig_5

However, a visual inspection of *Figure 5* suggests that the correlation between `pdays` and `previous` is not strong enough to warrant concern. Thus, we retained both variables as features in the dataset.

**Data Preprocessing Cont.**

In [8]:
X_train, y_train = train_df.drop(columns=["y"]), train_df["y"]
X_test, y_test = test_df.drop(columns=["y"]), test_df["y"]

We identified categorical, ordinal, and numerical features and applied appropriate transformations, including one-hot encoding for categorical features, ordinal encoding for education levels, and standardization for numerical features.

In [9]:
categorical_feats = ["job", "marital", "default", "housing", "loan", "contact", "day", "month", "poutcome"]
ordinal_feats = ["education"]
numeric_feats = ["age", "balance", "duration", "campaign", "previous", "pdays"]

education_levels = ["unknown", "primary", "secondary", "tertiary"]

preprocessor = make_column_transformer(
    (OneHotEncoder(sparse_output=False, drop="if_binary"), categorical_feats),
    (OrdinalEncoder(categories=[education_levels], dtype=int), ordinal_feats),
    (StandardScaler(), numeric_feats),
)

**Model Training and Evaluation**

Our analysis evaluated three distinct models: a baseline dummy classifier, a decision tree, and logistic regression. We assessed these models using five-fold cross-validation on three key metrics: accuracy, precision, and recall. Because the dataset is imbalanced, accuracy alone can be misleading. Precision and recall show how prone each model is to Type I errors (false positives) and Type II errors (false negatives), and enable a more thorough comparison of each model's effectiveness.

In [10]:
model_pipes = {
    "Baseline": DummyClassifier(strategy="most_frequent", random_state=522),
    "DecisionTree": make_pipeline(preprocessor, DecisionTreeClassifier(max_depth=5, random_state=522)),
    "LogisticRegression": make_pipeline(preprocessor, LogisticRegression(max_iter=2000, random_state=522)),
}

mod_precision_score = make_scorer(precision_score, zero_division=0)

classification_metrics = {
    "accuracy": "accuracy",
    "precision": mod_precision_score,
    "recall": "recall", 
}
cross_val_results = {}

for name, pipe in model_pipes.items():
    cross_val_results[name] = pd.DataFrame(
        cross_validate(
            pipe, 
            X_train, 
            y_train=="yes", 
            cv=5,
            return_train_score=True, 
            scoring=classification_metrics)
    ).agg(['mean', 'std']).round(3).T

In [11]:
# tabulate the mean cross-validation scores for each model
Fig_6 = pd.concat(
    cross_val_results,
    axis='columns'
).xs(
    'mean',
    axis='columns',
    level=1
).style.format(
    precision=2
).background_gradient(
    axis=None
).set_caption("Figure 6 - Scores")

Fig_6

| Metric | Baseline | DecisionTree | LogisticRegression |
| --- | --- | --- | --- |
| fit_time | 0.01 | 0.28 | 0.74 |
| score_time | 0.04 | 0.04 | 0.05 |
| test_accuracy | 0.88 | 0.90 | 0.90 |
| train_accuracy | 0.88 | 0.90 | 0.90 |
| test_precision | 0.00 | 0.64 | 0.65 |
| train_precision | 0.00 | 0.67 | 0.66 |
| test_recall | 0.00 | 0.35 | 0.35 |
| train_recall | 0.00 | 0.37 | 0.36 |


*Figure 6* shows that logistic regression slightly outperforms the decision tree classifier in test precision (0.65 versus 0.64), while both models achieve the same test recall (0.35). The logistic regression model also shows a narrower gap between training and test scores, suggesting that it generalizes better. There is still room to improve precision: in future work, raising the decision threshold for predicting a subscription could increase precision, typically at the cost of recall.
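
The effect of the decision threshold can be sketched as follows. This is not part of the analysis in this report: the data is synthetic (generated with `make_classification` to mimic a roughly 88/12 class split), and the thresholds 0.3, 0.5, and 0.7 are arbitrary values chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: stands in for the bank data).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.88, 0.12], random_state=522
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

threshold_scores = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict "positive" only when the probability reaches the threshold.
    pred = (proba >= threshold).astype(int)
    threshold_scores[threshold] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
        "n_predicted_positive": int(pred.sum()),
    }

for threshold, scores in threshold_scores.items():
    print(threshold, scores)
```

A higher threshold never increases the number of predicted positives, so recall cannot rise. Precision usually improves, although this is not guaranteed for every dataset.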

**Analysis of Model Performance**

Upon further analysis, we observed that the Logistic Regression model, fitted on the training data and evaluated on the test set, achieved a precision of 0.66 for the positive class (clients subscribing to a term deposit).

In [12]:
lr_pipe = model_pipes['LogisticRegression']

lr_pipe.fit(X_train, y_train)

In [13]:
report = classification_report(y_test, lr_pipe.predict(X_test))
title = "Figure 7 - Classification Report for Logistic Regression Model"
print(title + "\n" + report)

Figure 7 - Classification Report for Logistic Regression Model
              precision    recall  f1-score   support

          no       0.92      0.98      0.95     10003
         yes       0.66      0.34      0.45      1300

    accuracy                           0.90     11303
   macro avg       0.79      0.66      0.70     11303
weighted avg       0.89      0.90      0.89     11303



The test precision is similar to both the cross-validation precision and the training precision. We therefore believe that the feature importance conclusions drawn from this model are generalizable.

**Feature Importance**

To obtain the top 3 driving factors behind the predictions, we looked into the feature importance derived from the Logistic Regression model.

In [14]:
categorical_cols = lr_pipe.named_steps['columntransformer'].named_transformers_['onehotencoder'].get_feature_names_out().tolist()
ordinal_cols = lr_pipe.named_steps['columntransformer'].named_transformers_['ordinalencoder'].get_feature_names_out().tolist()
numeric_cols = lr_pipe.named_steps['columntransformer'].named_transformers_['standardscaler'].get_feature_names_out().tolist()

coefs = lr_pipe.named_steps['logisticregression'].coef_[0]

# one row per transformed feature, with its signed and absolute coefficient
feature_importance = pd.DataFrame({
    'feature': categorical_cols + ordinal_cols + numeric_cols,
    'coef': coefs,
    'coef_abs': abs(coefs),
})

The analysis reveals that `poutcome`, `month`, and `duration` are the three most influential features for a client's likelihood of subscribing to a term deposit. Clients whose previous campaign outcome (`poutcome`) was a success tend to be more inclined to subscribe, likely because of their positive experience with the bank's services. Interestingly, `month` shows a pattern in which contacts in March ("mar") are associated with a higher subscription rate and contacts in January ("jan") with a lower one. This finding could suggest a seasonal factor, possibly linked to financial periods, though its exact cause remains unclear and would require further external research. Longer call `duration` is associated with increased subscription rates. This may reflect greater client interest, and a longer call also gives salespersons more time for an effective pitch, which might add to the likelihood of subscription.

In [15]:
feature_importance.sort_values('coef_abs', ascending=False).drop(columns=["coef_abs"]).head(8).style.format(
    precision=3
).background_gradient(
    cmap="PiYG",
    vmin=-2,
    vmax=2,
    axis=None
).set_caption("Figure 8 - Feature Importance")

| Index | feature | coef |
| --- | --- | --- |
| 66 | poutcome_success | 1.607 |
| 59 | month_mar | 1.457 |
| 56 | month_jan | -1.115 |
| 71 | duration | 1.084 |
| 20 | contact_unknown | -0.963 |
| 57 | month_jul | -0.918 |
| 62 | month_oct | 0.844 |
| 63 | month_sep | 0.778 |

**Final Insights**
- **Past Outcome:** A successful outcome in past campaigns increases the likelihood of subscription in future ones.
- **Month of Contact:** The subscription patterns in March and January require further exploration to understand their underlying causes.
- **Call Duration:** There is a positive relationship between longer call durations and higher subscription rates, indicating that extended interactions reflect and possibly enhance client interest.

Our Logistic Regression model showed promising results in predicting client subscriptions to term deposits. Future improvements could involve exploring more complex models such as Random Forests and adjusting the decision threshold to enhance precision. Understanding the reasons behind the potential seasonality trend and incorporating additional data could also yield more insightful observations. Moreover, given that the differences in scores during model comparison were minimal, the top three features selected by the Decision Tree Classifier are also worth examining.
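
As a starting point for that last suggestion, the sketch below shows how the top features of a fitted decision tree could be read from its `feature_importances_` attribute. It uses synthetic data and hypothetical feature names, so it shows the mechanics only, not the result for the bank data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names (assumption: not the bank data).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=522
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=5, random_state=522).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
tree_importance = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)

top_three = tree_importance.head(3)
print(top_three)
```

Unlike logistic regression coefficients, these importances are unsigned, so they indicate how much a feature contributes to the splits but not the direction of its effect.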

# References
- Harris, C.R. et al., 2020. Array programming with NumPy. Nature, 585, pp.357–362.
- Moro, S., Rita, P., and Cortez, P., 2012. Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
- Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.
- Timbers, T., Ostblom, J., and Lee, M., 2023. Breast Cancer Predictor Report. GitHub repository, https://github.com/ttimbers/breast_cancer_predictor_py/blob/0.0.1/src/breast_cancer_predictor_report.ipynb
- Van Rossum, G. and Drake, F.L., 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.
- VanderPlas, J. et al., 2018. Altair: Interactive statistical visualizations for Python. Journal of open source software, 3(32), p.1057.