# Problem Description

**How do you turn a "maybe" into a "yes"?**

For banks, this is the key to running successful marketing campaigns. Phone calls are a powerful tool for offering financial products like term deposits, but they come with challenges. Contacting uninterested clients wastes time, resources, and trust. Missing the right clients means losing valuable opportunities. So, how can banks know who to call?

This project focuses on **predicting whether a client will subscribe to a term deposit ("yes" or "no")**, based on past marketing campaigns conducted by a Portuguese bank. The dataset includes rich information about clients, such as their demographics, previous interactions, and economic conditions. By analyzing this data, we can uncover the patterns behind client decisions.

However, the problem isn’t as simple as it seems. The dataset is **imbalanced**, meaning most clients say "no," and client profiles are highly **diverse**, making accurate predictions tricky. For example, a younger client might respond differently to a call than a retiree, and these differences can complicate modeling. Overcoming these hurdles requires careful feature engineering and robust machine learning techniques.

The rewards, however, are worth it. Solving this problem can help banks run **smarter, more personalized campaigns**, saving money, improving efficiency, and delivering a better experience for clients. Beyond banking, these insights can inspire other industries—like e-commerce or healthcare—to adopt data-driven approaches for better decision-making. 

**The question remains:** *Can we uncover the "yes" that makes the difference?*


# Dataset Description

The dataset provides detailed information about clients and their interactions with a Portuguese banking institution’s marketing campaigns. The goal is to predict whether a client will subscribe to a term deposit (**"yes" or "no"**). This data includes a variety of features, ranging from demographic details to historical campaign outcomes, as well as economic indicators.

---

### **Columns Description**

#### **Baseline Data**
1. **`age`** – Age of the client (numeric).  
2. **`job`** – Type of job (categorical):  
   - Values: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services".  
3. **`marital`** – Marital status (categorical):  
   - Values: "married", "divorced" (includes widowed), "single".  
4. **`education`** – Education level (categorical):  
   - Values: "unknown", "secondary", "primary", "tertiary".  
5. **`default`** – Has credit in default? (binary):  
   - Values: "yes", "no".  
6. **`balance`** – Average yearly balance, in euros (numeric).  
7. **`housing`** – Has a housing loan? (binary):  
   - Values: "yes", "no".  
8. **`loan`** – Has a personal loan? (binary):  
   - Values: "yes", "no".  

#### **Campaign Data**
9. **`contact`** – Contact communication type (categorical):  
   - Values: "unknown", "telephone", "cellular".  
10. **`day`** – Last contact day of the month (numeric).  
11. **`month`** – Last contact month of the year (categorical):  
   - Values: "jan", "feb", "mar", ..., "nov", "dec".  
12. **`duration`** – Last contact duration, in seconds (numeric).  
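   - **Important Note:** Duration is highly predictive of the target, but it is only known once the call has ended, at which point the outcome is known as well. Including it leaks information that is unavailable when deciding whom to call, so it is best reserved for benchmarking.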

#### **Previous Campaign Data**
13. **`campaign`** – Number of contacts performed during this campaign (numeric, includes last contact).  
14. **`pdays`** – Number of days since the client was last contacted from a previous campaign (numeric; `-1` indicates no prior contact).  
15. **`previous`** – Number of contacts performed before this campaign for this client (numeric).  
16. **`poutcome`** – Outcome of the previous marketing campaign (categorical):  
   - Values: "unknown", "other", "failure", "success".  

#### **Target Variable**
17. **`y`** – Did the client subscribe to a term deposit? (binary):  
   - Values: "yes", "no".  

---

### **Dataset Characteristics**
- **Type**: Multivariate  
- **Subject Area**: Business, Marketing  
- **Associated Task**: Classification  
- **Feature Types**: Categorical, Numeric  
- **Instances**: 45,211  
- **Features**: 16  

---

### **Challenges**
1. **Class Imbalance**:  
   - Most clients do not subscribe to term deposits (`"no"`), so the positive class is a small minority (roughly 12% of records). Handling this imbalance is key to building accurate and fair models.
2. **Feature Interactions**:  
   - Numerical features (e.g., `balance`, `duration`) and categorical features (e.g., `job`, `contact`) interact in complex ways, requiring thoughtful feature engineering.  
3. **Bias in Predictive Features**:  
   - Features like `duration` are strong predictors, but `duration` is only observed after the call, so a model that relies on it cannot be used to decide whom to call. Models need to consider the practical relevance of each feature.
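
To make the `duration` caveat concrete, here is a minimal sketch, assuming a toy DataFrame with made-up values and the column names listed above. It builds two feature sets: one that keeps `duration` (benchmark only) and one that drops it (usable before a call is made). The names `toy`, `X_benchmark`, and `X_realistic` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    'age': [58, 44, 33],
    'balance': [2143, 29, 2],
    'duration': [261, 151, 76],  # known only after the call has ended
    'y': ['no', 'no', 'yes'],
})

# Benchmark feature set: keeps `duration`, so scores will look optimistic.
X_benchmark = toy.drop(columns=['y'])

# Realistic feature set: only information available before dialing.
X_realistic = toy.drop(columns=['y', 'duration'])

print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```

Comparing a model trained on each feature set is one way to see how much of the reported performance depends on `duration`.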

---

### **Potential Applications**
1. **Smarter Marketing Strategies**:  
   - Improve campaign targeting by identifying clients most likely to subscribe, saving resources and time.
2. **Customer Insights**:  
   - Analyze demographic and behavioral factors influencing client decisions.
3. **Cross-Industry Impact**:  
   - Insights from this dataset can also apply to industries like e-commerce and telecom, where customer retention and product adoption are key.

# Dependencies loading

In [364]:
# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for Gradient Boosting and preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Advanced Boosting Models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Handling imbalanced data (optional, install imbalanced-learn if needed)
from imblearn.over_sampling import SMOTE

# Utilities
import os
import warnings
warnings.filterwarnings('ignore')

# Project setup

In [367]:
bank_full_url = 'https://raw.githubusercontent.com/filipecorreia23/Bank-Term-Deposit-Prediction/main/input/bank-full.csv'

# Data preparation

## Data loading

In [371]:
data = pd.read_csv(bank_full_url, sep=';')
df = data.copy()

In [372]:
df.sample(10)

index,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
27534,46,self-employed,divorced,tertiary,no,8017,yes,yes,cellular,21,nov,169,2,-1,0,unknown,no
40308,28,admin.,single,secondary,no,136,no,no,cellular,16,jun,355,1,-1,0,unknown,yes
37889,46,admin.,single,secondary,no,-976,yes,yes,cellular,14,may,111,1,-1,0,unknown,no
14595,43,management,married,tertiary,no,348,no,yes,cellular,15,jul,88,3,-1,0,unknown,no
7162,48,management,married,primary,no,1910,yes,no,unknown,29,may,413,2,-1,0,unknown,no
609,32,services,married,secondary,no,243,yes,yes,unknown,6,may,144,1,-1,0,unknown,no
41278,36,management,married,tertiary,no,255,no,no,cellular,25,aug,242,6,95,4,success,yes
13100,28,admin.,single,secondary,no,144,no,yes,cellular,8,jul,136,1,-1,0,unknown,no
38960,32,management,single,tertiary,no,7290,yes,no,cellular,18,may,357,1,-1,0,unknown,yes
9371,39,admin.,married,secondary,no,441,no,no,unknown,6,jun,101,2,-1,0,unknown,no


# Dataset adjustment

## Removing Redundant Variables

The `day` and `month` columns are combined into a single string variable, `contact_date` (for example `5-may`), so that the contact date is held in one column instead of two. The dataset does not record the year, so `contact_date` identifies a day of the year rather than a full calendar date.
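
As a possible complement to the string column, the same two columns can also be turned into a numeric day-of-year value, which keeps the ordering of dates. The sketch below uses a toy DataFrame with made-up rows; the names `MONTH_TO_NUM` and `contact_doy`, and the reference year 2001, are assumptions introduced for illustration and are not part of the original notebook.

```python
import pandas as pd

# Month abbreviations as they appear in the `month` column.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
MONTH_TO_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}

# Toy rows with made-up values.
toy = pd.DataFrame({'day': [5, 30, 17], 'month': ['may', 'jul', 'apr']})

# String version, built the same way as in the notebook.
toy['contact_date'] = toy['day'].astype(str) + '-' + toy['month']

# Numeric version: the dataset has no year, so 2001 (a non-leap year)
# is an arbitrary assumption used only to compute the day of the year.
parts = pd.DataFrame({
    'year': 2001,
    'month': toy['month'].map(MONTH_TO_NUM),
    'day': toy['day'],
})
toy['contact_doy'] = pd.to_datetime(parts).dt.dayofyear

print(toy)
```

With a non-leap reference year, a date such as 29 February would fail to parse, so the reference year would need adjusting if such rows exist.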


In [375]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

In [377]:
# Combine 'day' and 'month' into a string
df['contact_date'] = df['day'].astype(str) + '-' + df['month']

# Drop the original 'day' and 'month' columns
df = df.drop(columns=['day', 'month'])

df.sample(10)

index,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,contact_date
5921,23,management,married,tertiary,no,175,yes,no,unknown,796,1,-1,0,unknown,no,26-may
18028,53,self-employed,married,primary,no,4576,no,no,telephone,82,2,-1,0,unknown,no,30-jul
18641,53,services,married,secondary,no,514,yes,no,cellular,10,3,-1,0,unknown,no,31-jul
8112,55,blue-collar,single,primary,no,1187,yes,no,unknown,193,2,-1,0,unknown,no,2-jun
41633,34,admin.,married,tertiary,no,2552,yes,yes,cellular,74,1,-1,0,unknown,no,25-sep
11266,53,technician,married,secondary,no,1776,no,no,unknown,443,7,-1,0,unknown,no,18-jun
32581,28,technician,married,secondary,no,-92,yes,no,cellular,873,1,-1,0,unknown,no,17-apr
32749,38,technician,single,tertiary,no,2273,yes,no,cellular,222,1,-1,0,unknown,no,17-apr
1976,32,blue-collar,single,secondary,no,-83,yes,no,unknown,116,3,-1,0,unknown,no,9-may
6834,41,blue-collar,single,unknown,no,2835,yes,no,unknown,53,1,-1,0,unknown,no,28-may


## Handling 'Unknown' values

Before further analysis, we need to locate missing values, including placeholders such as `"unknown"`, which pandas does not treat as missing. The counts help decide whether to fill them, remove them, or leave them as they are.


In [382]:
unknown_counts = df.apply(lambda x: (x == 'unknown').sum())
total_counts = df.shape[0]
unknown_percentage = (unknown_counts / total_counts) * 100

print("Count of 'unknown' values per column:")
print(unknown_counts)

Count of 'unknown' values per column:
age                 0
job               288
marital             0
education        1857
default             0
balance             0
housing             0
loan                0
contact         13020
duration            0
campaign            0
pdays               0
previous            0
poutcome        36959
y                   0
contact_date        0
dtype: int64


In [384]:
print("Percentage of 'unknown' values per column:")
print(unknown_percentage)

Percentage of 'unknown' values per column:
age              0.000000
job              0.637013
marital          0.000000
education        4.107407
default          0.000000
balance          0.000000
housing          0.000000
loan             0.000000
contact         28.798301
duration         0.000000
campaign         0.000000
pdays            0.000000
previous         0.000000
poutcome        81.747805
y                0.000000
contact_date     0.000000
dtype: float64


In [386]:
df.shape

(45211, 16)

In [388]:
print("Value counts for 'contact':")
df.contact.value_counts()

Value counts for 'contact':


contact
cellular     29285
unknown      13020
telephone     2906
Name: count, dtype: int64

In [390]:
print("Value counts for 'poutcome':")
df.poutcome.value_counts()

Value counts for 'poutcome':


poutcome
unknown    36959
failure     4901
other       1840
success     1511
Name: count, dtype: int64

**Removing 'unknown' values**: We will drop rows where important columns like `education` and `job` have the value `'unknown'` to keep the data clean and reliable.


In [393]:
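# Drop rows where 'education' or 'job' is 'unknown'.
# .copy() makes df an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
df = df[(df['education'] != 'unknown') & (df['job'] != 'unknown')].copy()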

**Handling 'unknown' values in `poutcome`**: The majority of values in the `poutcome` column (about 82%) are labeled as `'unknown'`, outnumbering all other categories combined. To simplify and retain this data, we replace `'unknown'` with `'other'`, which merges these rows into the existing `'other'` category.

In [396]:
# Replace the 'unknown' values in the 'poutcome' column with 'other'
df['poutcome'] = df['poutcome'].replace('unknown', 'other')

# Check the updated value counts in 'poutcome'
print(df['poutcome'].value_counts())

poutcome
other      37060
failure     4709
success     1424
Name: count, dtype: int64


In [398]:
df.head(10)

index,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,contact_date
0,58,management,married,tertiary,no,2143,yes,no,unknown,261,1,-1,0,other,no,5-may
1,44,technician,single,secondary,no,29,yes,no,unknown,151,1,-1,0,other,no,5-may
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,76,1,-1,0,other,no,5-may
5,35,management,married,tertiary,no,231,yes,no,unknown,139,1,-1,0,other,no,5-may
6,28,management,single,tertiary,no,447,yes,yes,unknown,217,1,-1,0,other,no,5-may
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,380,1,-1,0,other,no,5-may
8,58,retired,married,primary,no,121,yes,no,unknown,50,1,-1,0,other,no,5-may
9,43,technician,single,secondary,no,593,yes,no,unknown,55,1,-1,0,other,no,5-may
10,41,admin.,divorced,secondary,no,270,yes,no,unknown,222,1,-1,0,other,no,5-may
11,29,admin.,single,secondary,no,390,yes,no,unknown,137,1,-1,0,other,no,5-may


In [400]:
df.shape

(43193, 16)

**Keeping 'unknown' values in `contact`:** For now, we will keep the `'unknown'` values in the `contact` column as a separate category.

## Converting Target Variable (`y`)

We are converting the `y` column (target variable) to numerical values: `1` for "yes" and `0` for "no". This ensures compatibility with machine learning algorithms, simplifies evaluation metrics, and makes exploratory data analysis (EDA) more straightforward. By doing this early, we streamline the workflow for subsequent steps like visualization, splitting, and modeling.


In [404]:
# Convert the target variable 'y' to binary 0 and 1
df['y'] = df['y'].apply(lambda x: 1 if x == 'yes' else 0)

df['y'].value_counts()

y
0    38172
1     5021
Name: count, dtype: int64
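
The counts above can be turned into two numbers that are often useful when handling imbalance: the positive rate and the negative-to-positive ratio. This is a small sketch using only the counts printed above; the variable names are illustrative, and using the ratio as XGBoost's `scale_pos_weight` is one common option rather than the approach this notebook necessarily takes.

```python
# Class counts reported by df['y'].value_counts() above.
n_neg, n_pos = 38172, 5021

# Share of clients who subscribed.
positive_rate = n_pos / (n_neg + n_pos)

# Ratio of negatives to positives, a value commonly suggested for
# XGBoost's `scale_pos_weight` parameter.
neg_pos_ratio = n_neg / n_pos

print(f"positive rate: {positive_rate:.3f}")
print(f"neg/pos ratio: {neg_pos_ratio:.2f}")
```

Class weighting of this kind is an alternative to resampling methods such as SMOTE, which is imported earlier in the notebook.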

## Handling Binary Variables (`yes`/`no`)

We are transforming binary features (`yes`/`no`) into numeric values (`1` for `yes` and `0` for `no`) to ensure compatibility across different machine learning models. Here's how the different boosting algorithms handle categorical variables:

- **LightGBM:** Supports categorical variables natively (for example, pandas columns of `category` dtype), but transforming binary features is still useful for simplicity and consistency.
- **CatBoost:** Handles categorical variables natively, but binary features are simple to convert and are often treated as numeric.
- **XGBoost:** Expects numeric input by default, so categorical features must be encoded. Recent versions also offer optional categorical support through `enable_categorical=True`.
- **Gradient Boosting (Scikit-Learn):** `GradientBoostingClassifier` does not handle categorical variables, requiring all features to be numeric.

By converting these binary variables now, we prepare the dataset for compatibility with all algorithms, ensuring clarity and flexibility for future analysis and modeling steps.

In [407]:
binary_columns = ['default', 'housing', 'loan']

for col in binary_columns:
    df[col] = df[col].map({'yes': 1, 'no': 0})

df.head(10)

index,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,contact_date
0,58,management,married,tertiary,0,2143,1,0,unknown,261,1,-1,0,other,0,5-may
1,44,technician,single,secondary,0,29,1,0,unknown,151,1,-1,0,other,0,5-may
2,33,entrepreneur,married,secondary,0,2,1,1,unknown,76,1,-1,0,other,0,5-may
5,35,management,married,tertiary,0,231,1,0,unknown,139,1,-1,0,other,0,5-may
6,28,management,single,tertiary,0,447,1,1,unknown,217,1,-1,0,other,0,5-may
7,42,entrepreneur,divorced,tertiary,1,2,1,0,unknown,380,1,-1,0,other,0,5-may
8,58,retired,married,primary,0,121,1,0,unknown,50,1,-1,0,other,0,5-may
9,43,technician,single,secondary,0,593,1,0,unknown,55,1,-1,0,other,0,5-may
10,41,admin.,divorced,secondary,0,270,1,0,unknown,222,1,-1,0,other,0,5-may
11,29,admin.,single,secondary,0,390,1,0,unknown,137,1,-1,0,other,0,5-may


For the other categorical features, we chose to retain them as they are for now, planning to handle them natively in models that support it and numerically for XGBoost and the baseline model.
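
The two routes mentioned here can be sketched with pandas alone. The example below uses a toy DataFrame with made-up rows; `cat_cols`, `X_native`, and `X_onehot` are illustrative names, and one-hot encoding is only one of several possible numeric encodings, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the dataset.
toy = pd.DataFrame({
    'age': [58, 44, 33],
    'job': ['management', 'technician', 'management'],
    'marital': ['married', 'single', 'married'],
    'housing': [1, 1, 0],
})
cat_cols = ['job', 'marital']

# Option 1: pandas `category` dtype, for models with native categorical support.
X_native = toy.astype({col: 'category' for col in cat_cols})

# Option 2: one-hot encoding, for models that need purely numeric input.
X_onehot = pd.get_dummies(toy, columns=cat_cols, dtype=int)

print(X_native.dtypes)
print(X_onehot.columns.tolist())
```

Keeping both versions side by side makes it easy to feed each model the representation it handles best.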

## Sorting
We sort the dataset by age and balance to make it easier to inspect. The order has no effect on modeling, because `train_test_split` shuffles rows by default before splitting.

In [411]:
df.sort_values(by=['age', 'balance'], inplace=True)
df.head(10)

index,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,contact_date
41252,18,student,single,secondary,0,5,0,0,cellular,143,2,-1,0,other,0,24-aug
42146,18,student,single,secondary,0,156,0,0,cellular,298,2,82,4,other,0,4-nov
40887,18,student,single,primary,0,608,0,0,cellular,267,1,-1,0,other,1,12-aug
42274,18,student,single,primary,0,608,0,0,cellular,210,1,93,1,success,1,13-nov
40736,18,student,single,primary,0,1944,0,0,telephone,122,3,-1,0,other,0,10-aug
34288,19,student,single,primary,0,0,0,0,cellular,72,4,-1,0,other,0,4-may
41402,19,student,single,secondary,0,4,0,0,cellular,114,1,-1,0,other,0,3-sep
41706,19,student,single,secondary,0,55,0,0,telephone,89,2,193,1,other,0,6-oct
40927,19,student,single,primary,0,56,0,0,cellular,246,1,-1,0,other,0,12-aug
41500,19,student,single,secondary,0,88,0,0,cellular,191,1,-1,0,other,0,8-sep


# Dataset Splitting

We split the dataset into:
- **Training (& Validation) dataset**: 80% of the data.
- **Test dataset**: 20% of the data for final evaluation.

The test dataset will be used only for the final predictions. We assume that during the entire study we do not have access to it and do not study its statistical properties.

In [414]:
X = df.drop('y', axis=1)  # Features
y = df['y']               # Target

**Description:** This separates the dataset into the input features `X` and the target variable `y`. The models will learn to predict `y`, while `X` includes all other columns.

In [418]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

**Description:** The `train_test_split` function splits the dataset into training and test sets:

- **test_size=0.2**: 20% of the data is reserved for testing.
- **random_state=42**: Ensures reproducibility of the split.
- **stratify=y**: Maintains the proportion of target classes in both the training and test sets, so each split has the same share of "yes" and "no" as the full dataset. It does not rebalance the classes.
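
The effect of `stratify` can be checked on a small synthetic example. This is a sketch on made-up data (900 negatives and 100 positives), not the bank dataset; `X_toy`, `y_toy`, and the other variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 10% positives (made-up data, not the bank dataset).
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original positive rate.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Running the same check on `y_train` and `y_test` is a quick way to confirm that the real split preserved the class proportions.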

In [424]:
print(X_train.shape, X_test.shape)

(34554, 15) (8639, 15)


In [422]:
print(y_train.shape, y_test.shape)

(34554,) (8639,)


# EDA (Exploratory Data Analysis)