In [None]:
import pandas as pd
import numpy as np
import altair_ally as aly
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from scipy.stats import loguniform, randint, uniform
from sklearn.impute import SimpleImputer
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
aly.alt.data_transformers.enable('vegafusion')


# Summary:
---
We built classification models using Logistic Regression and a Support Vector Classifier (SVC) that use information about the client and the marketing contact to predict whether a client will subscribe to a term deposit.

Our final classifier, the Logistic Regression model, performed well on an unseen test data set, achieving a test accuracy of **0.90375** (compared to **0.90125** for SVC). Its train accuracy was 0.905625, so the gap between train and test is small and there is no sign of significant overfitting. Accuracy alone does not show the balance of True Positives versus False Negatives, and because most clients in the data do not subscribe, an accuracy above **90%** is encouraging but should be read alongside the confusion matrix.

Because our goal is to increase the subscription rate, the most costly error is a False Negative: predicting that a client will not subscribe when they would have. The model should therefore keep the False Negative rate low. The current performance suggests that using this model for initial client prioritization could improve resource allocation, which makes it a useful starting point for the business. However, further analysis of precision and recall is necessary to optimize its practical utility.

# Introduction:
---
Direct marketing campaigns, particularly those relying on phone calls, are a significant investment for banking institutions. The success of these campaigns is measured by the client subscription rate to a product like a term deposit. Another dataset shows that the subscription rate for term deposits at a Portuguese bank is only around 11.70% ([Ngu Hui En 2024](https://medium.com/@nguhe/predictive-analysis-of-client-subscription-rates-in-the-portuguese-banking-sector-using-sas-40fb04a9dcd3)), so optimizing the targeting strategy is crucial to maximize return on investment and minimize operational costs.

Here we ask whether a machine learning algorithm can predict if a client will subscribe to a term deposit based on information about the client (such as type of job and education level) and about the marketing contact (such as the number of contacts during the campaign and the number of days since the last contact). Answering this question is important because term deposit campaigns often require multiple contacts with the same client, making the process labor-intensive and expensive. If a machine learning algorithm can accurately predict client subscription, the bank could prioritize the clients who are most likely to convert, leading to more efficient resource allocation, a higher overall subscription rate, and better campaign results.

# Methods:
---

## Data: 
The data we used was obtained from the UCI Machine Learning Repository, specifically the Bank Marketing dataset of a Portuguese banking institution, which can be found [here](https://doi.org/10.24432/C5K306). The dataset contains various features about bank customers and whether they subscribed to a term deposit, an investment product offered by the bank (variable y). Each row in the dataset contains details of a customer, which were used to predict whether they would subscribe to the term deposit. The original dataset contains 45,211 records with 16 features and one target (17 columns). For this analysis, we sampled 4,000 records from the original dataset to speed up the EDA and model training process.


## Analysis: 
We started this analysis by performing an exploratory data analysis (EDA) to understand the nature of the variables and their relationships. We observed some missing values in the dataset. We also observed that the target variable (y) was imbalanced, with a higher proportion of customers not subscribing to the term deposit.
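
With an imbalanced target, accuracy is easier to judge against a majority-class baseline. The sketch below is an illustration only: it uses a made-up label vector rather than the bank data, the 88/12 split is an assumption, and the names `X_demo`, `y_demo` and `baseline` are hypothetical. It relies on scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 88 "no" (0) and 12 "yes" (1); not the real data
y_demo = np.array([0] * 88 + [1] * 12)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_recall = recall_score(y_demo, y_pred)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall for 'yes': {baseline_recall:.2f}")
```

A model that never predicts "yes" reaches the majority-class share in accuracy while finding no subscribers at all, which is why accuracy needs a reference point on data like this.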

Furthermore, the distribution plots for the variables previous, pdays, campaign, duration, and balance were highly right-skewed. This implies that most customers had low values for these variables and a few customers had high values. In addition, the correlation plots in Figure 3 showed that "previous" and "pdays" had the highest positive correlation.
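
Right skew can also be checked numerically. The following is a minimal sketch on simulated columns, not the bank data; the column names `skewed` and `symmetric` are hypothetical. It uses pandas' `DataFrame.skew`, where clearly positive values indicate a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(522)

# Hypothetical right-skewed column (exponential) and a symmetric one (normal)
demo = pd.DataFrame({
    "skewed": rng.exponential(scale=100.0, size=2000),
    "symmetric": rng.normal(loc=40.0, scale=10.0, size=2000),
})

# Sample skewness per column
skewness = demo.skew()

# In a right-skewed column the median sits below the mean
median_below_mean = demo["skewed"].median() < demo["skewed"].mean()

print(skewness)
print(f"Median below mean for 'skewed': {median_below_mean}")
```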

We decided to use both Logistic Regression and Support Vector Classifier (SVC) models for this analysis. The [sklearn](https://scikit-learn.org/stable/) package was used throughout these steps. We performed hyperparameter tuning using scikit-learn's RandomizedSearchCV to find the best parameters for each model. We mapped the values of the target y using 'yes': 1, 'no': 0. The data was split using 80% for the training set and 20% for the test set. In the preprocessing, we derived a categorical 'pdays_contacted' feature from 'pdays' (whether the client had been contacted before) and then dropped 'day_of_week' and 'pdays' because we considered them not relevant for the analysis. The 'poutcome' variable was also dropped since it had a high number of missing values. Categorical variables were one-hot encoded, the ordinal 'education' variable was ordinal encoded, numerical variables were scaled using StandardScaler, and missing values in the categorical and ordinal features were imputed using SimpleImputer with the "most frequent" strategy.

The models were evaluated based on their accuracy on the test set. 


### Loading The Data

In [None]:
# Download data. The download code below is adapted from the ucimlrepo package (see References).

# Uncomment and run to install the necessary packages
#!pip3 install -U ucimlrepo 

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 
  
# variable information 
print(bank_marketing.variables) 

# Combine features and target into one DataFrame without modifying X in place
bank_marketing_data = pd.concat([X, y], axis=1)
#bank_marketing_data.to_csv('data/bank_marketing.csv')

bank_marketing_sample = bank_marketing_data.sample(4000, random_state=522)
#bank_marketing_sample.to_csv('data/bank_marketing_small.csv')

### EDA

In [None]:
bank_marketing.data.features.head(10)

In [None]:
bank_marketing.data.targets.head(10)

In [None]:
bank_marketing_data.info()

In [None]:
bank_marketing_sample.info()

In [None]:

numeric_plot = aly.dist(bank_marketing_sample, color='y')

numeric_plot.properties(
    title="Figure 1: Univariate distributions of numeric variables in Bank Marketing Dataset"
)

In [None]:
# univariate distributions (counts) for the categorical variables

categorical_plots = aly.dist(
    bank_marketing_sample, dtype='object', color='y')

categorical_plots.properties(
    title="Figure 2: Univariate distributions of categorical variables in Bank Marketing Dataset"
)


In [None]:
correlation_plots = aly.corr(bank_marketing_sample)
correlation_plots.properties(
    title="Figure 3: Correlation plots for numeric variables in Bank Marketing Dataset"
)


### Model Building and Evaluation

In [None]:
bank_marketing_sample.isnull().sum() / bank_marketing_sample.shape[0] * 100

In [None]:
# preprocessing
#bank_marketing_sample = bank_marketing_sample.copy()

# map the target variable to numeric
bank_marketing_sample['y'] = bank_marketing_sample['y'].map({'yes': 1, 'no': 0})


# feature engineering on 'pdays' column into categorical determining if client was contacted before or not
bank_marketing_sample['pdays_contacted'] = bank_marketing_sample['pdays'].apply(lambda x: 'never' if x == -1 else 'contacted')

# dropping columns
bank_marketing_sample = bank_marketing_sample.drop(columns=['day_of_week', 'pdays', 'poutcome'])


# split data
X_train, X_test, y_train, y_test = train_test_split(bank_marketing_sample.drop(columns='y'), bank_marketing_sample['y'], test_size=0.2, random_state=522)

In [None]:
#df.select_dtypes(include=['object']).nunique()

bank_marketing_sample.select_dtypes(include=['number']).columns

In [None]:
bank_marketing_sample.select_dtypes(include=['object']).columns

In [None]:
bank_marketing_sample.info()

In [None]:
# separating columns by type of transformation required

# One-hot encoding
categorical_cols = ['job', 'marital', 'default', 'housing', 'loan', 'contact','month', 'pdays_contacted']
# Ordinal encoding
ordinal_cols = ['education']
# Standard scaling
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'previous']



We decided to train both a Logistic Regression model and a Support Vector Classifier (SVC) to determine which was better at predicting whether a customer would subscribe to the bank's term deposit offering. We performed hyperparameter tuning using RandomizedSearchCV to find the best parameters for each model. The models were evaluated based on their accuracy on the test set.

In [None]:
# defining the preprocessor

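data_preprocessor = make_column_transformer(
    # categorical columns: impute the most frequent value, then one-hot encode
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OneHotEncoder(handle_unknown='ignore')),
     categorical_cols),
    # ordinal column: impute the most frequent value, then encode in the given order
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OrdinalEncoder(categories=[['unknown', 'primary', 'secondary', 'tertiary']],
                                  dtype=object)),
     ordinal_cols),
    # numeric columns: standardize
    (StandardScaler(), numerical_cols),
)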

In [None]:
# Logistic Regression cross-validation with RandomizedSearchCV 
lr_pipe = make_pipeline(data_preprocessor, LogisticRegression(random_state=42, max_iter=1000))
param_dist1 = {"logisticregression__C": loguniform(1e-4, 1e3)} 
random_lr = RandomizedSearchCV(lr_pipe, param_distributions=param_dist1,
                                n_iter=100, n_jobs=-1, return_train_score=True, random_state=522)

In [None]:
# Fit model
random_lr.fit(X_train, y_train)
print(f'Train Score: {random_lr.score(X_train, y_train)}')
print(f'Test Score: {random_lr.score(X_test, y_test)}')

In [None]:
# SVC cross-validation with RandomizedSearchCV 
svc_pipe = make_pipeline(data_preprocessor, SVC(random_state=42))
param_dist = { "svc__C": loguniform(1e-2, 1e3), "svc__gamma": loguniform(1e-2, 1e3)}
random_svc = RandomizedSearchCV(svc_pipe, param_distributions=param_dist,
                                n_iter=100, n_jobs=-1, return_train_score=True, random_state=522)


In [None]:
# Fit model
random_svc.fit(X_train, y_train)
print(f'Train Score: {random_svc.score(X_train, y_train)}')
print(f'Test Score: {random_svc.score(X_test, y_test)}')

In [None]:
#Confusion Matrix for Logistic Regression model
ConfusionMatrixDisplay.from_estimator(
    random_lr,
    X_test,
    y_test,
    values_format="d",
);
plt.title("Figure 4: Confusion Matrix for Logistic Regression model")
plt.show()

In [None]:
#Confusion Matrix for SVC model

ConfusionMatrixDisplay.from_estimator(
    random_svc,
    X_test,
    y_test,
    values_format="d",
);
plt.title("Figure 5: Confusion Matrix for SVC model")
plt.show()


# Results & Discussion
---

Both our Logistic Regression and SVC classification models performed well on the testing data, with test accuracies of 0.90375 and 0.90125, respectively. These are close to the corresponding training accuracies of 0.905625 and 0.919375, which suggests that neither model is substantially overfitting.

Considering the imbalance in our target class, accuracy alone is not sufficient for determining the suitability of our model. Therefore, exploring metrics from the confusion matrix and classification report is recommended as a next step.
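
As a sketch of what that next step could look like, the example below computes a confusion matrix, recall, and a classification report. It runs on synthetic imbalanced data from `make_classification`, not on the bank data; the class weights and the names `X_demo`, `y_demo` and `model` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 88% class 0); a stand-in for the bank data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=522
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=522
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
recall = recall_score(y_te, y_pred)  # tp / (tp + fn)

print(classification_report(y_te, y_pred, target_names=["no", "yes"]))
print(f"False negatives: {fn}, recall for 'yes': {recall:.3f}")
```

In this notebook the same metric calls could be applied to `y_test` and `random_lr.predict(X_test)`. Recall for the "yes" class is the metric most directly tied to the False Negative rate.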

A second area for further analysis is determining which specific features are most important for predicting whether a client will subscribe to a term deposit. Identifying these key features will enable the bank to better tailor its actions to increase subscription rates. In light of this, the Logistic Regression model is a better choice, as it provides more interpretable results. However, because of interactions between features, preprocessing steps such as scaling and one-hot encoding, and regularization, the coefficients of the Logistic Regression model can be difficult to interpret. Nevertheless, if properly examined, the coefficient estimates can help identify the important characteristics the bank should focus on to increase its rate of subscriptions to term deposits.

## References:
---
<br> Ngu, H. E. (2024). Predictive analysis of client subscription rates in the Portuguese banking sector using SAS. Medium. https://medium.com/@nguhe/predictive-analysis-of-client-subscription-rates-in-the-portuguese-banking-sector-using-sas-40fb04a9dcd3 <br>
<br>Timbers, T. (n.d.). breast_cancer_predictor (Version 0.0.1) [Computer software]. GitHub. https://github.com/ttimbers/breast_cancer_predictor/tree/0.0.1<br>
<br>Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306. <br>
<br>Scikit-learn. (n.d.). scikit-learn: Machine learning in Python. Retrieved November 21, 2025, from https://scikit-learn.org/stable/<br>
<br>uci-ml-repo (2025). ucimlrepo: Python package for dataset imports from the UCI Machine Learning Repository [Computer software]. GitHub. https://github.com/uci-ml-repo/ucimlrepo<br>