## Identifying the Top Three Predictors of Term Deposit Subscriptions

by Jerry Yu, John Shiu, Sophia Zhao, & Zeily Garcia
2023/11/19

In [None]:
from IPython import display
import pandas as pd
from myst_nb import glue

# Summary

This report presents an analysis of the factors influencing client subscriptions to term deposits at a Portuguese banking institution. Using a dataset of 45,211 client interactions with one target variable and 16 input features, we apply logistic regression and decision tree classifiers to identify the top three predictors of term deposit subscriptions. The data preprocessing involves handling missing values, encoding categorical variables, and standardizing numerical variables. Our exploratory data analysis uses visualizations to understand feature distributions and correlations, while model evaluation focuses on precision and recall because of the dataset's class imbalance. Logistic regression proved slightly superior to the decision tree classifier in precision. The analysis identifies the outcome of previous campaigns, the month of contact, and the call duration as the most significant predictors. These findings offer insight into how clients decide on term deposit subscriptions and suggest areas for future research.

# Introduction

In the evolving world of banking and finance, the ability to understand and predict customer behavior is key to developing successful marketing strategies. Term deposits, a core financial product, serve a dual purpose: they provide a safe investment option for customers and a reliable source of funding for banks. For financial institutions looking to grow their customer base and ensure financial stability, it's crucial to understand what drives customers to invest in these term deposits.

Our study focuses on uncovering the factors that influence a customer's decision to subscribe to a term deposit at a Portuguese banking institution. We pose a critical question: What are the top three factors that determine whether a client will invest in a term deposit at this bank? To explore this, we analyze a dataset that contains over 45,000 individual client records, including demographic data and details of interactions from the bank's term deposit marketing campaigns. We consider a logistic regression model and a decision tree classifier, both of which can work with numeric and (encoded) categorical variables, to detect significant patterns in this dataset. Our objective is to provide insights that could make the institution's marketing strategies more effective at boosting term deposit subscriptions.

The importance of this research lies in its potential to transform marketing strategies for the banking institution. By accurately understanding and predicting customer behavior, our findings can guide the Portuguese financial institution in crafting more effective and tailored marketing approaches. This not only aids in customer acquisition and retention but also ensures that customers are presented with options that align with their financial goals and needs. In an era where personalized services are becoming the norm, such insights are invaluable for banks seeking to remain competitive and foster long-term customer relationships.


# Methods

## Data
The data set utilized in this project was created by Sérgio Moro, P. Cortez, and P. Rita, and is hosted on the UCI Machine Learning Repository {cite}`Moro2012`. It can be accessed [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data was sourced from the marketing campaigns of a Portuguese banking institution, focusing on client subscriptions to term deposits. The primary dataset, `bank-full.csv`, includes 45,211 examples with 17 variables. Each row in the dataset represents information from a client contact during the campaign. 

This dataset comprehensively covers client demographic information (such as age, job type, marital status, and education level), financial information (including credit default history, average yearly balance, housing and personal loan status), and information about bank marketing (like the type of contact, date of the last contact, call duration, number of contacts in the current campaign, days elapsed since the last contact in a previous campaign, outcome of the previous campaigns, and finally, whether the client will subscribe to a term deposit).

### Variable Description
The table below describes each variable in the primary dataset, `bank-full.csv`.

| Variable    | Role             | Description                                                                          | Type |
| :--------- | :--------------:  | :------------------------------------------------------------------------------------------- | :------------------- |
| `age`       | feature          | client age                                                                                   | numeric              |
| `job`       | feature          | type of job                                                                                  | categorical          |
| `marital`   | feature          | marital status                                                                               | categorical          |
| `education` | feature          | education                                                                                    | categorical (ordered)|
| `default`   | feature          | does this client have credit in default?                                                     | binary               |
| `balance`   | feature          | average yearly balance (in Euros)                                                            | numeric              |
| `housing`   | feature          | does this client have housing loan?                                                          | binary               |
| `loan`      | feature          | does this client have personal loan?                                                         | binary               |
| `contact`   | feature          | contact communication type                                                                   | categorical          |
| `day`       | feature          | last contact day of the month                                                                | categorical          |
| `month`     | feature          | last contact month of year                                                                   | categorical          |
| `duration`  | feature          | last contact duration (in seconds)                                                           | numeric              |
| `campaign`  | feature          | number of contacts performed during this campaign and for this client                        | numeric              |
| `pdays`     | feature          | number of days that passed by after this client was last contacted from a previous campaign  | numeric              |
| `previous`  | feature          | number of contacts performed before this campaign and for this client                        | numeric              |
| `poutcome`  | feature          | outcome of the previous marketing campaign                                                   | categorical          |
| `y`         | target           | has this client subscribed to a term deposit?                                                | binary               |
														

## Analysis
A logistic regression model and a decision tree classifier were considered for determining the top three factors that help predict whether a client will subscribe to a term deposit. To explore the significance of each attribute, all variables in the original data set were used to fit the models. The exploratory part of our analysis examined the distributions of the categorical and numerical variables and the correlations between the numerical variables.

In the preprocessing phase, categorical and binary variables were transformed using one-hot encoding, the ordered categorical variable was ordinal encoded, and numerical variables were standardized. The data was then split, assigning 75% to the training set and 25% to the test set. Both the logistic regression and the decision tree classifier were evaluated on accuracy, precision, and recall, with precision given priority. This emphasis on precision reflects the aim of our study to minimize Type I errors (false positives).

The Python programming language {cite}`VanRossum2009` and its libraries, including pandas {cite}`Mckinney2010`, NumPy {cite}`Harris2020`, scikit-learn {cite}`Pedregosa2011`, and Altair {cite}`VanderPlas2018`, were used for the analysis. The environment file used to run the analysis was adapted from the University of British Columbia DSCI 573 Feature and Model Selection course repository {cite}`Ostblom2023`.

The analysis revealed that logistic regression exhibited slightly better precision than the decision tree classifier, with a smaller gap between training and testing scores, suggesting better generalizability. Logistic regression was also preferred as the model for assessing feature importance, given its strong interpretability. After model fitting and evaluation, the top three features predicting a client's decision to subscribe to a term deposit were identified by ranking the features by the absolute value of their coefficients.

The full coding process and methods are thoroughly recorded in [term_deposit_full_analysis.ipynb](https://github.com/UBC-MDS/dsci522_group21/blob/main/src/term_deposit_full_analysis.ipynb).


# Results & Discussion

**Data Preprocessing**

The dataset was divided into training and testing sets with a 75-25 split, ensuring a representative sample in both sets.
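A 75-25 split of this kind could be sketched as follows. This is a minimal illustration on a toy data frame, not the project's actual script: the toy values, the `random_state`, and the use of `stratify` to keep the class proportions equal in both sets are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for bank-full.csv (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61, 33, 47, 26, 58, 41, 36, 50, 29, 44, 63],
    "y": ["no"] * 12 + ["yes"] * 4,
})

# 75% train / 25% test; random_state and stratify are assumptions,
# chosen here so the split is reproducible and preserves the class ratio
train_df, test_df = train_test_split(
    toy_df, test_size=0.25, random_state=123, stratify=toy_df["y"]
)

print(train_df["y"].value_counts())
print(test_df["y"].value_counts())
```

Stratifying on the target is one common way to keep an imbalanced target equally represented in the training and test sets.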

In [None]:
train_df = pd.read_csv("../data/processed/train_df.csv") 
glue("train_df", train_df)

```{glue:figure} train_df

 Training data set 
```

**Exploratory Data Analysis**

In [None]:
y_train_distribution = pd.read_csv("../data/processed/y_train_distribution.csv") 
glue("y_train_distribution", y_train_distribution)

```{glue:figure} y_train_distribution
:figwidth: 300px
:name: "y_train_distribution"

 Target Distribution on Training Dataset
```

{numref}`Figure {number} <y_train_distribution>` shows that the target classes in the training set are imbalanced (88.2358% 'no' subscription rate versus 11.7642% 'yes' rate).
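Class proportions like these can be obtained directly from the target column. The sketch below uses a toy series (hypothetical values, not the real training target) to show the calculation.

```python
import pandas as pd

# Toy target column (hypothetical values, not the real training data)
y_train = pd.Series(["no"] * 88 + ["yes"] * 12, name="y")

# Share of each class, expressed as a percentage
proportions = y_train.value_counts(normalize=True).mul(100).round(4)
print(proportions)
```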

In addition, the features `job`, `education`, `contact`, and `poutcome` include "unknown" entries, which are distinct from null values. Because the dataset documentation does not provide enough information, these unknown values cannot be properly imputed. Dropping the rows that contain unknown values in these columns was deemed impractical, as it would significantly reduce the data.

Additionally, the variable `poutcome` represents the result of the previous marketing campaign. The prevalence of "unknown" values in this field largely corresponds to cases where `previous` equals 0, indicating that the individual had not been previously contacted for marketing purposes. In such scenarios, labeling `poutcome` as "unknown" is logical, as there is no prior marketing engagement to reference. However, for individuals who have been contacted before, the `poutcome` data could be insightful.
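The relationship between "unknown" outcomes and clients with no previous contact can be checked with a cross-tabulation. The following is a minimal sketch on toy rows (hypothetical values); only the column names `previous` and `poutcome` come from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) using the dataset's column names
toy_df = pd.DataFrame({
    "previous": [0, 0, 0, 2, 1, 3],
    "poutcome": ["unknown", "unknown", "unknown", "failure", "success", "unknown"],
})

# Label each row by whether the client was contacted before this campaign
contact_history = toy_df["previous"].eq(0).map(
    {True: "previous == 0", False: "previous > 0"}
)

# Rows: previous campaign outcome; columns: contact history
table = pd.crosstab(toy_df["poutcome"], contact_history)

# Share of "unknown" outcomes that belong to never-contacted clients
unknown_share = table.loc["unknown", "previous == 0"] / table.loc["unknown"].sum()
print(table)
print(unknown_share)
```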

The features `job`, `education`, and `contact` are also worth retaining in the analysis. Although they contain "unknown" values, the proportion of such values is not high. Furthermore, our objective is to assess the relevance of each feature. Keeping these variables allows for a comprehensive analysis, even though some of them may turn out to be of low importance.

Overall, we opted to maintain all columns and rows, considering the scope of our analysis. For a comprehensive evaluation of each feature's significance, it's crucial to consider the entire dataset.

```{figure} ../img/job_types.png
---
height: 300px
name: JobTypes
---
Distribution of Job Types
```

```{figure} ../img/previous_and_pdays.png
---
height: 400px
name: PreviousPdays
---
Left: Number of Contacts Before this Campaign (previous), Right: Days Passed after Last Contact (pdays)
```

Bar charts similar to {numref}`Figure {number} <JobTypes>` were created for the categorical variables, and histograms were created for the numerical variables. {numref}`Figure {number} <PreviousPdays>` shows that the distributions of the `pdays` and `previous` variables are heavily skewed.

A heatmap illustrating the Pearson correlation ({numref}`Figure {number} <HeatMap>`) was created to explore the relationships among numerical variables. The Pearson heatmap was included given that it evaluates linear correlation, an essential factor in determining potential multicollinearity.

```{figure} ../img/correlation_heatmap.png
---
height: 300px
name: HeatMap
---
Pearson Correlation Heatmap
```

```{figure} ../img/pdays_vs_previous_scatter.png
---
height: 300px
name: PdaysScatterPlot
---
Scatterplot for Recency (pdays) vs Intensity (previous)
```

As shown in {numref}`Figure {number} <HeatMap>`, the `pdays` and `previous` variables exhibit a moderate Pearson correlation of 0.44. This coefficient suggests potential multicollinearity between the recency of client engagement (`pdays`) and the intensity of past efforts to connect with clients (`previous`). However, a visual inspection of {numref}`Figure {number} <PdaysScatterPlot>` suggests that the correlation between `pdays` and `previous` is not strong enough to warrant concern. Thus, we retained both variables as features in the dataset.
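A Pearson correlation matrix like the one behind the heatmap can be computed with pandas. The sketch below uses toy values (hypothetical, so the coefficient will not equal the 0.44 reported above) and leaves out the plotting step.

```python
import pandas as pd

# Toy numeric columns (hypothetical values) using the dataset's column names
toy_df = pd.DataFrame({
    "pdays": [5, 90, 180, 365, 10, 200],
    "previous": [1, 2, 1, 4, 3, 2],
})

# Pairwise Pearson correlation between the numerical variables
corr = toy_df[["pdays", "previous"]].corr(method="pearson")
r = corr.loc["pdays", "previous"]
print(corr)
```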

**Data Preprocessing Cont.**

In our analysis, specific data transformations were applied to make the dataset suitable for modeling techniques. These transformations were based on the nature and characteristics of individual variables in the dataset.

Variables such as `job`, `marital`, `default`, `housing`, `loan`, `contact`, `day`, `month`, and `poutcome` were transformed using one-hot encoding. This method is effective for categorical variables without an inherent order. For instance, `marital` status has distinct categories that don't imply any ordinal relationship. This encoding creates a new binary column for each category of a variable, which is needed because the logistic regression and decision tree models we use operate on numerical inputs.

The `education` variable was treated with ordinal encoding. Education levels (`unknown`, `primary`, `secondary`, `tertiary`) have a natural order, and ordinal encoding maintains this hierarchy. This method assigns an integer to each category according to its rank, so that the encoded values can be interpreted along that ranking.

Numerical variables like `age`, `balance`, `duration`, `campaign`, `previous`, and `pdays` were standardized. Standardization rescales these variables to have a mean of zero and a standard deviation of one. This step ensures that all variables are on the same scale and prevents variables with large ranges from disproportionately influencing the model, which matters especially for models like logistic regression.
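The three transformations described above could be combined in a single scikit-learn column transformer. The sketch below is an illustration on a toy frame with only four of the columns. The toy values, `handle_unknown="ignore"`, and `sparse_threshold=0` (used to force a dense array) are assumptions, not settings taken from the project.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy feature frame (hypothetical values) with a subset of the dataset's columns
toy_X = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "balance": [100, 2500, -50, 800],
    "marital": ["single", "married", "divorced", "married"],
    "education": ["primary", "tertiary", "unknown", "secondary"],
})

# Rank order for education as described in the text
education_order = ["unknown", "primary", "secondary", "tertiary"]

preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "balance"]),                 # mean 0, std 1
    (OneHotEncoder(handle_unknown="ignore"), ["marital"]),  # one column per category
    (OrdinalEncoder(categories=[education_order]), ["education"]),  # 0..3 by rank
    sparse_threshold=0,  # assumption: always return a dense array
)

X_transformed = preprocessor.fit_transform(toy_X)
print(X_transformed.shape)  # 2 scaled + 3 one-hot + 1 ordinal column
```

In practice the transformer would be fitted on the training set only and then applied to the test set, so that no information from the test set leaks into the scaling.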

**Model Training and Evaluation**

Our analysis evaluates three distinct models: a baseline dummy classifier, a decision tree, and logistic regression. We assessed these models using five-fold cross-validation based on key metrics: accuracy, precision, and recall. 

Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of examples. Precision, on the other hand, reflects the proportion of true positive outcomes among all positive predictions made by the model. Precision is significant to assess when the cost of false positives is high. Recall measures the proportion of actual positive cases that are correctly identified by the model. This metric is important when the cost of false negatives is high. 
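As a small worked example of these definitions, the sketch below scores ten toy predictions (hypothetical labels) with scikit-learn, treating "yes" as the positive class. There are 2 true positives, 1 false positive, 1 false negative, and 6 true negatives.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (hypothetical): TP = 2, FP = 1, FN = 1, TN = 6
y_true = ["no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "no", "no", "no", "yes", "yes", "yes", "no", "no"]

accuracy = accuracy_score(y_true, y_pred)                     # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred, pos_label="yes")  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred, pos_label="yes")        # TP / (TP + FN) = 2 / 3

print(accuracy, precision, recall)
```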

Looking at all three metrics was particularly vital in examining the impact of our dataset's unbalanced nature on the likelihood of errors. Our analysis places a higher emphasis on precision. This is because our objective is to identify true positives with minimal false positives, in the context of examining client subscriptions to term deposits at the Portuguese banking institution. By centering our analysis on precision, we aim to ensure that the features under consideration are truly indicative of successful term deposit subscriptions, thus providing more reliable insights into client behavior.
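A five-fold comparison of the three models on these metrics could look like the sketch below. It runs on synthetic imbalanced data from `make_classification` rather than the bank dataset. The hyperparameters, the dummy strategy, and the `zero_division=0` setting are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 88% / 12%), a stand-in for the real features
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.88, 0.12], random_state=123
)

# Hyperparameters and the dummy strategy are assumptions for this sketch
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "decision_tree": DecisionTreeClassifier(random_state=123),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scoring = {
    "accuracy": "accuracy",
    # zero_division=0 avoids warnings when a model never predicts the positive class
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": "recall",
}

rows = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=5, scoring=scoring, return_train_score=True
    )
    rows[name] = pd.DataFrame(scores).mean()  # average over the five folds

cv_summary = pd.DataFrame(rows).T
print(cv_summary[["train_precision", "test_precision", "test_recall"]])
```

Reporting both training and validation scores makes the gap between them visible, which is the basis for the generalization comparison discussed in this section.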

In [None]:
cv_results = pd.read_csv("../data/processed/cv_results.csv") 
glue("cv_results", cv_results)

```{glue:figure} cv_results
:figwidth: 300px
:name: "cv_results"

 Model Comparison - Cross Validation Scores
```

The decision to use logistic regression and decision trees was influenced by several factors. Logistic regression was favored for its simplicity and strong interpretability, which is especially useful for understanding the impact of each feature on the probability of a client subscribing to a term deposit. It is suitable for binary classification problems and provides a framework based on probabilities. Decision trees offer a visual representation of the decision-making process, making them easy to interpret. They are also non-parametric and can handle non-linear relationships.

Both models are well established and widely used in the data science field, making them reliable choices for our analysis.

{numref}`Figure {number} <cv_results>` shows that the Logistic Regression model slightly outperforms the Decision Tree Classifier in test precision, while both models achieve the same test recall. The Logistic Regression model also shows a narrower gap between training and testing scores, which suggests that it generalizes better. Despite these findings, there remains room to improve precision. In future efforts to predict subscription outcomes, raising the classification threshold (the predicted probability above which a client is labeled 'yes') could improve precision, typically at the cost of recall.
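The effect of a higher classification threshold can be explored by thresholding predicted probabilities by hand. The sketch below uses synthetic data and two illustrative thresholds (0.5 and 0.7, both assumptions). A higher threshold can only reduce the number of 'yes' predictions; whether precision actually improves depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, a stand-in for the preprocessed bank features
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.88, 0.12], random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_yes = model.predict_proba(X_test)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.7):  # illustrative thresholds, not tuned values
    y_pred = (proba_yes >= threshold).astype(int)
    results[threshold] = {
        "n_predicted_yes": int(y_pred.sum()),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }

print(results)
```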

**Analysis of Model Performance**

Upon further analysis, we observed that the Logistic Regression model, when applied on the test data, achieved a precision of 0.66 for predicting positive outcomes (clients subscribing to a term deposit).

In [None]:
classification_report = pd.read_csv("../data/processed/classification_report.csv") 
glue("classification_report", classification_report)

```{glue:figure} classification_report
:figwidth: 300px
:name: "classification_report"

 Classification Report for Logistic Regression Model
```

The test precision (0.66) is similar to the validation precision (0.65) and the training precision (0.66). We therefore believe that the feature importance conclusions drawn from this model are generalizable.

**Feature Importance**

To obtain the top 3 driving factors behind the predictions, we looked into the feature importance derived from the Logistic Regression model.

The analysis reveals that `poutcome`, `month`, and `duration` are the three most significant features influencing a client's likelihood to subscribe to a term deposit. Clients with a previous successful `poutcome` tend to be more inclined to subscribe, likely due to their positive experiences with the bank's services. Interestingly, `month` shows a pattern in which March ("mar") is associated with a higher subscription rate and January ("jan") with a lower one. This finding could suggest a seasonal factor, possibly linked to financial periods, though its exact cause remains unclear and would require further external research. Longer call `duration` is linked with increased subscription rates. This may reflect greater client interest, and longer calls may also give salespersons more time to pitch effectively, which could further raise the likelihood of subscription.
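Ranking features by the absolute value of their logistic regression coefficients can be sketched as follows. The example fits a model on synthetic standardized data with placeholder feature names (`feature_0`, `feature_1`, ...), so the ranking it prints says nothing about the bank dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with placeholder feature names (hypothetical)
X, y = make_classification(n_samples=500, n_features=6, random_state=123)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

importance = (
    pd.DataFrame({"feature": feature_names, "coefficient": model.coef_[0]})
    .assign(abs_coefficient=lambda d: d["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)

top_three = importance.head(3)
print(top_three)
```

The sign of each coefficient indicates the direction of the association, while the absolute value is what determines the ranking.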

In [None]:
feature_importance = pd.read_csv("../data/processed/feature_importance.csv") 
glue("feature_importance", feature_importance)

```{glue:figure} feature_importance
:figwidth: 300px
:name: "feature_importance"

 Feature Importance
```

**Final Insights**

- **Past Outcome:** A successful outcome in past campaigns increases the likelihood of subscription in future ones.
- **Month of Contact:** The subscription patterns in March and January require further exploration to understand their underlying causes.
- **Call Duration:** There is a positive relationship between longer call durations and higher subscription rates, indicating that extended interactions reflect and possibly enhance client interest.

Our Logistic Regression model showed stable predictive performance for client subscriptions to term deposits, as indicated by similar precision scores on the training, validation, and test sets. Nonetheless, these results should be read with our limitations and assumptions in mind.

**Limitations & Assumptions**

Our first limitation is the significant imbalance in the target distribution of our dataset (88.2358% 'no' subscription rate versus 11.7642% 'yes' rate). Although we tried to account for the imbalance by evaluating precision and recall in addition to accuracy, it could still affect the overall generalizability of our findings.

Another limitation stems from our interpretation of the potential seasonal trend suggested by the importance of January and March for subscription rates. We initially linked these insights to the fiscal calendar of Portuguese banking institutions, thinking that tax or bonus periods could influence client receptiveness. This idea came from our team's knowledge that not all countries follow the same financial quarters. However, we dropped this theory after discovering that the financial year in Portugal ends in December. This points to the possibility that other unknown external factors, which are beyond the scope of our dataset, influence the `month` feature. Understanding the reasons behind the potential seasonality and incorporating additional data could generate more insightful observations.

Additionally, our decision to choose the Logistic Regression model over the Decision Tree Classifier rested on a very small difference in test precision (1%). This decision was guided by how easily feature importance can be interpreted in Logistic Regression models, which might have led us to overlook insights that Decision Tree models could offer. The top three features selected by the Decision Tree Classifier are therefore worth examining as well.

Time constraints also impacted our analysis. This limitation might have restricted the depth of our analysis, potentially overlooking more sophisticated modeling approaches that could further improve our understanding of client subscriptions to term deposits at the Portuguese banking institution. Future improvements could involve exploring more complex models like Random Forests and adjusting the threshold to enhance precision.

# References

```{bibliography}
```