# Table of Contents

[1. Project Overview](#project-overview) 

[2. Data Collection and Initial Processing](#data-collection-and-initial-processing)
- [2.1 Data Overview](#data-overview)
- [2.2 Data Description](#data-description)
- [2.3 Analytical Relevance](#analytical-relevance)
- [2.4 Project Alignment](#project-alignment)
- [2.5 Data Ingestion and Integration](#data-ingestion-and-integration)

[3. Exploratory Data Analysis](#2-exploratory-data-analysis)
- [3.1 Univariate Analysis](#univariate-analysis)
- [3.2 Bivariate Analysis](#bivariate-analysis)
- [3.3 Multivariate Analysis](#multivariate-analysis)


[4. Predictive Modeling](#predictive-modeling) 

[5. Prescriptive Analytics and Recommendations](#prescriptive-analytics-and-recommendation)

[6. Model Deployment](#model-deployment)





# Project Overview
<a id='project-overview'></a>

The goal of this project is to develop a predictive model capable of estimating the likelihood that a client will subscribe to a bank term deposit following a telemarketing call.

This project uses the Bank Marketing dataset from the UCI Machine Learning Repository, originally used by Moro, Cortez, and Rita in “A Data-Driven Approach to Predict the Success of Bank Telemarketing” (Decision Support Systems, 2014).

The dataset contains detailed information on clients’ demographic, financial, and behavioral attributes, along with five macroeconomic indicators.
Moro et al. (2014) report that including these socio-economic features, such as the employment variation rate, consumer confidence index, and EURIBOR rate, substantially improves predictive performance, which is why this version of the dataset is used for this analysis.

<a id='data-collection-and-initial-processing'></a>
# Data Collection and Initial Processing

This section provides an overview of the dataset used in this project and outlines its structure, attributes, and analytical significance.

<a id='data-overview'></a>
## Dataset Overview

The data used in this project originates from the Bank Marketing (with social/economic context) dataset.
It was curated by Sérgio Moro, Paulo Cortez, and Paulo Rita in 2014, and is publicly available for research purposes through the UCI Machine Learning Repository.

Two datasets are provided within the original archive:

- bank-additional-full.csv — containing 41,188 records, ordered by campaign date (May 2008–November 2010).

- bank-additional.csv — a 10% random sample of the full dataset (4,119 records).

For this project, analysis will focus on bank-additional-full.csv, as it contains all available instances and includes the five additional socio-economic indicators shown by Moro et al. (2014) to improve predictive accuracy.

## Data Description
The dataset comprises **21 variables** — 20 input features and 1 binary target variable (y).

**Input Features:**

- **Client attributes:** age, job type, marital status, education, default, housing, loan.

- **Campaign contact details:** contact type, month, day of week, duration, campaign, pdays, previous, poutcome.

- **Macroeconomic indicators:** employment variation rate, consumer price index, consumer confidence index, EURIBOR 3-month rate, number of employees.

**Target Variable:**

- **y:** indicates whether the client subscribed to a term deposit (yes / no).

**Missing Values:**
Some categorical attributes contain "unknown" entries, which represent missing or undisclosed information.
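
The grouping above can be written down as a small lookup so that later cells can select features by group. This is a minimal sketch: `FEATURE_GROUPS`, `TARGET`, and `check_feature_groups` are illustrative names that do not appear in the original notebook, and the column names are assumed to match the CSV header.

```python
# Hypothetical lookup of feature groups; column names follow the CSV header.
FEATURE_GROUPS = {
    "client": ["age", "job", "marital", "education", "default", "housing", "loan"],
    "campaign": ["contact", "month", "day_of_week", "duration",
                 "campaign", "pdays", "previous", "poutcome"],
    "macro": ["emp.var.rate", "cons.price.idx", "cons.conf.idx",
              "euribor3m", "nr.employed"],
}
TARGET = "y"


def check_feature_groups(columns):
    """Return (missing, uncovered) column names relative to the groups above."""
    grouped = [c for cols in FEATURE_GROUPS.values() for c in cols]
    expected = set(grouped) | {TARGET}
    missing = sorted(expected - set(columns))      # expected but absent
    uncovered = sorted(set(columns) - expected)    # present but not in any group
    return missing, uncovered
```

Calling `check_feature_groups(df.columns)` after loading should return two empty lists if the file matches the description.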

## Analytical Relevance
This dataset is highly relevant for predictive modeling in marketing and financial services because it combines:

- Individual-level behavioral data — client demographics, financial history, and campaign interactions.

- Contextual macroeconomic data — economic indicators that reflect the external environment influencing customer decisions.

Together, these features support both **classification modeling** (predicting y) and **insight generation**, such as identifying key drivers of successful marketing outcomes.

## Project Alignment
This project aims to replicate and extend the findings of Moro et al. (2014) by applying **modern machine learning techniques** — such as logistic regression, random forests, and gradient boosting — to predict campaign success and optimize telemarketing strategies.

By focusing on the enriched dataset (bank-additional-full.csv), this analysis seeks to:

- Accurately **predict telemarketing success outcomes**.

- **Identify influential factors** affecting client response.

- **Generate actionable insights** to improve targeting and reduce campaign costs.

## Data Ingestion and Integration

The next step is to import and prepare the dataset for analysis.

The selected dataset — **bank-additional-full.csv** from the Bank Marketing (with social/economic context) collection — contains detailed client, campaign, and economic information.
This file is ingested directly into a pandas DataFrame for subsequent cleaning, transformation, and analysis.

During the ingestion process:

- The dataset is read from its CSV source using pandas.read_csv().

- Basic validation checks confirm successful loading, structural integrity, and expected data dimensions.

- Column names and data types are reviewed to ensure compatibility with downstream processing steps.

The DataFrame then provides a consistent and reliable foundation for all subsequent stages, including data validation, exploratory analysis, and model development.
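
The ingestion checks described above could be wrapped in a small loader. This is a sketch under stated assumptions: `load_bank_data` and its `expected_columns` parameter are hypothetical helpers rather than part of the notebook, and the checks shown (non-empty, 21 columns, target present) are only one reasonable choice.

```python
import io

import pandas as pd


def load_bank_data(source, sep=";", expected_columns=21):
    """Read the semicolon-delimited CSV and run basic structural checks."""
    df = pd.read_csv(source, sep=sep)
    if df.empty:
        raise ValueError("No rows were loaded.")
    if df.shape[1] != expected_columns:
        # A single wide column usually means the wrong delimiter was used.
        raise ValueError(
            f"Expected {expected_columns} columns, got {df.shape[1]}; check the delimiter."
        )
    if "y" not in df.columns:
        raise ValueError("Target column 'y' is missing.")
    return df


# Tiny in-memory demo with a made-up three-column header.
demo_df = load_bank_data(io.StringIO("a;b;y\n1;2;no\n"), expected_columns=3)
```

For the real file the call would be `load_bank_data("data/bank-additional-full.csv")`.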

### Library Imports
To load the data, we import the pandas library, which is used to read the CSV file.

Libraries used in later parts of the project are also imported here to keep the workflow consistent and organized.

In [4]:
# library imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency  # used for Cramér's V in the EDA section
from sklearn.feature_selection import mutual_info_classif  # used in feature selection
import os





### Creating DataFrames from Data Source

With pandas imported, the next step is to read the data file into a pandas DataFrame.


In [5]:
# reading datasets into dataframes

df = pd.read_csv("data/bank-additional-full.csv", sep=";")


### Data Inspection, Cleaning and Validation

After loading the dataset into a DataFrame, the next step is to inspect and validate it to ensure the file was read correctly and that the structure is as expected.

We begin by:

- Viewing sample records with head() to confirm data integrity.

- Checking dataset dimensions using shape.

- Reviewing data types and non-null counts with info().

- Identifying missing values using isnull().sum().

- Checking whether key identifiers (e.g., customer_id, campaign_id) are present and properly formatted.

These checks help detect potential issues early and ensure a smooth integration process in the next step.

#### df

In [6]:
# view sample records to confirm data integrity

df.head(10)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
5,45,services,married,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
6,59,admin.,married,professional.course,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
7,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
8,24,technician,single,professional.course,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [7]:
# Checking dataset dimensions
df.shape

(41188, 21)

In [8]:
# Reviewing data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
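 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  y               41188 non-null  object 
dtypes: float64(5), int64(5), object(11)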

In [9]:
# Identifying missing values

df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

##### Observation
- Columns: `'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'`
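- No key or unique identifier (e.g., a customer or campaign ID) was found.
- In the 10-row preview, several columns (e.g., contact, month, pdays, and the economic indicators) show the same value in every row. Because the file is ordered by date, this may only reflect the first rows, so further exploration is needed.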
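
One way to follow up on the last observation is to count distinct values over the whole DataFrame rather than the first ten rows. The helper below is a minimal sketch; `low_variety_columns` and its `max_unique` parameter are hypothetical names introduced here.

```python
import pandas as pd


def low_variety_columns(df, max_unique=1):
    """Return columns whose number of distinct values is at most `max_unique`."""
    counts = df.nunique(dropna=False)
    return counts[counts <= max_unique].index.tolist()
```

With the default `max_unique=1`, `low_variety_columns(df)` lists only columns that are constant across every row. An empty list would indicate that the repetition seen in the preview comes from the date ordering of the file.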

#### Next Steps

The results of the data inspection and validation indicate the following actions:

- Standardize column names to ensure a consistent naming style.

- Check for and remove duplicate rows.

- Convert column data types to appropriate formats where necessary (e.g., numeric, datetime, categorical).

- Perform feature engineering to create or refine variables that enhance analytical and predictive value.

#### Standardize Column Names

To keep naming consistent, column names are standardized by stripping surrounding whitespace, converting them to lowercase, and replacing spaces with underscores.


In [10]:

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')



#### Removing Duplicate Rows

Duplicate rows in the bank-additional-full.csv dataset (loaded as df) can bias the analysis.
To preserve data integrity, the cell below counts the duplicates, removes them, and verifies the result.

In [11]:
# Check for duplicate rows
print("Number of duplicate rows before removal:", df.duplicated().sum())

# Remove duplicate rows
df = df.drop_duplicates()

# Verify that duplicates have been removed
print("Number of duplicate rows after removal:", df.duplicated().sum())

# Confirm resulting shape
print("Updated shape of df:", df.shape)


Number of duplicate rows before removal: 12
Number of duplicate rows after removal: 0
Updated shape of df: (41176, 21)


#### Data Type Conversion and Feature Engineering

From the initial inspection, all columns appear to have appropriate data types, and no immediate feature engineering is required at this stage.

However, these steps will be revisited during Exploratory Data Analysis (EDA) if any columns require type adjustments or if new features are needed to enhance analytical insights.

# Exploratory Data Analysis (EDA)

EDA follows the data collection and initial processing step. The data is explored to understand its underlying structure, relationships, and patterns.
EDA is a critical step in the overall project as it helps to reveal key insights, detect anomalies, and identify potential predictors that will guide subsequent modeling steps.

## Further Preparation and Initial Structure Review

Before performing detailed exploratory data analysis, this section reviews the raw structure of the dataset with the original "unknown" values intact.
This ensures we fully understand the distribution and impact of these placeholder values before transforming them.

Steps covered in this subsection:

- Save a snapshot of the current dataset (pre-EDA, pre-cleaning).
- Group features into categorical, numerical, and binary/flag-type columns.
- Generate summary statistics for each group.
- Explore the frequency and placement of "unknown" across attributes.
- Document initial observations to guide deeper EDA and data cleaning.

The transformation of "unknown" → NaN will occur after this structural review.

### Save Snapshot of Raw Data


In [12]:
# Ensure artifacts directory exists
ARTIFACTS_DIR = "artifacts"
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

# Save raw snapshot (for audit traceability)
raw_snapshot_path = os.path.join(ARTIFACTS_DIR, "raw_snapshot_before_eda.csv")
df.to_csv(raw_snapshot_path, index=False)

print(f"Raw snapshot saved to: {raw_snapshot_path}")
print("Current shape:", df.shape)

Raw snapshot saved to: artifacts\raw_snapshot_before_eda.csv
Current shape: (41176, 21)


### Identify Categorical and Numerical Columns

In [14]:
# Identify numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Identify object/categorical columns
categorical_cols = df.select_dtypes(include="object").columns.tolist()

# Binary columns (categorical with only two unique values)
binary_cols = [c for c in categorical_cols if df[c].nunique() == 2]

print("Numeric columns:", numeric_cols)
print("\nCategorical columns:", categorical_cols)
print("\nBinary columns:", binary_cols)

Numeric columns: ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

Categorical columns: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']

Binary columns: ['contact', 'y']


### Summary Statistics for Numeric Features


In [15]:

df[numeric_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,41176.0,40.0238,10.42068,17.0,32.0,38.0,47.0,98.0
duration,41176.0,258.315815,259.305321,0.0,102.0,180.0,319.0,4918.0
campaign,41176.0,2.567879,2.770318,1.0,1.0,2.0,3.0,56.0
pdays,41176.0,962.46481,186.937102,0.0,999.0,999.0,999.0,999.0
previous,41176.0,0.173013,0.494964,0.0,0.0,0.0,0.0,7.0
emp.var.rate,41176.0,0.081922,1.570883,-3.4,-1.8,1.1,1.4,1.4
cons.price.idx,41176.0,93.57572,0.578839,92.201,93.075,93.749,93.994,94.767
cons.conf.idx,41176.0,-40.502863,4.62786,-50.8,-42.7,-41.8,-36.4,-26.9
euribor3m,41176.0,3.621293,1.734437,0.634,1.344,4.857,4.961,5.045
nr.employed,41176.0,5167.03487,72.251364,4963.6,5099.1,5191.0,5228.1,5228.1


#### Interpretation

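- Several features are heavily right-skewed (duration, campaign, previous): their maxima (4918, 56, and 7) far exceed their 75th percentiles (319, 3, and 0), which affects visualization and potentially model performance.
- Economic features (emp.var.rate, euribor3m, cons.conf.idx, nr.employed) take values in narrow ranges because they describe the economic context at the time of contact rather than individual clients, which may make them valuable high-level predictors.
- pdays is not a true numeric feature: the value 999 is a code meaning the client was not previously contacted, and it accounts for all three quartiles, so it requires special handling.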
- No obvious missing numeric values from the describe() output.

#### Outlier Assessment

Based on the numerical summary statistics, no outliers appear to be erroneous or require removal.
Extreme values such as long call durations, high campaign contact counts, or older client ages represent valid real-world behaviors and may carry predictive significance.

Instead of removing outliers, the modeling process will rely on:
- appropriate transformations (e.g., log-transform for skewed features),
- correct handling of coded values (e.g., pdays = 999), and
- model families that are naturally robust to outliers (e.g., tree-based models).

Therefore, no outlier removal will be performed at this stage.
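
The first two items in the list above (transformations and coded values) could look like the following. This is an illustrative sketch, not the notebook's preprocessing: `prepare_numeric` and the new column names `previously_contacted`, `log_duration`, and `log_campaign` are assumptions introduced here.

```python
import numpy as np
import pandas as pd


def prepare_numeric(df):
    """Return a copy with pdays decoded and skewed counts log-transformed."""
    out = df.copy()
    # 999 is the dataset's code for "not previously contacted", not a day count.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    out["pdays"] = out["pdays"].replace(999, np.nan)
    # log1p handles zeros (e.g. duration == 0) without producing -inf.
    for col in ["duration", "campaign"]:
        out[f"log_{col}"] = np.log1p(out[col])
    return out
```

Keeping the flag alongside the decoded `pdays` lets a model distinguish "never contacted" from "contacted recently" without treating 999 as a large number of days.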

### Summary Statistics For Categorical Features

In [26]:

cat_summary = pd.DataFrame({
    "column": categorical_cols,
    "unique_values": [df[c].nunique(dropna=False) for c in categorical_cols],
    "top_5_categories": [df[c].value_counts(dropna=False).head(5).to_dict() for c in categorical_cols]
})
cat_summary

Unnamed: 0,column,unique_values,top_5_categories
0,job,12,"{'admin.': 10419, 'blue-collar': 9253, 'techni..."
1,marital,4,"{'married': 24921, 'single': 11564, 'divorced'..."
2,education,8,"{'university.degree': 12164, 'high.school': 95..."
3,default,3,"{'no': 32577, 'unknown': 8596, 'yes': 3}"
4,housing,3,"{'yes': 21571, 'no': 18615, 'unknown': 990}"
5,loan,3,"{'no': 33938, 'yes': 6248, 'unknown': 990}"
6,contact,2,"{'cellular': 26135, 'telephone': 15041}"
7,month,10,"{'may': 13767, 'jul': 7169, 'aug': 6176, 'jun'..."
8,day_of_week,5,"{'thu': 8618, 'mon': 8512, 'wed': 8134, 'tue':..."
9,poutcome,3,"{'nonexistent': 35551, 'failure': 4252, 'succe..."


### Count of 'unknown' Values per Column

In [27]:

unknown_counts = {
    c: int((df[c] == 'unknown').sum()) 
    for c in categorical_cols 
    if (df[c] == 'unknown').sum() > 0
}

unknown_df = (
    pd.DataFrame
    .from_dict(unknown_counts, orient='index', columns=['unknown_count'])
    .assign(unknown_pct=lambda x: (x['unknown_count'] / len(df) * 100).round(2))
    .sort_values('unknown_pct', ascending=False)
)

unknown_df

Unnamed: 0,unknown_count,unknown_pct
default,8596,20.88
education,1730,4.2
housing,990,2.4
loan,990,2.4
job,330,0.8
marital,80,0.19


#### Categorical Feature Insights

- Several categorical features contain `"unknown"` values, most notably `default` (20.88%) and `education` (4.2%), followed by `housing` and `loan` (2.4% each).
- The `default` column is highly uninformative (only 3 "yes" values).
- Job, education, and marital status show meaningful but imbalanced distributions.
- Contact method and campaign timing variables (month, day_of_week) likely have strong predictive signals.
- The target variable is imbalanced (approx. 89% 'no'), necessitating careful model evaluation and possibly class weighting.
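
As a sketch of the class weighting mentioned in the last point, the weights below follow the common "balanced" heuristic, n_samples / (n_classes * class_count). The function name `balanced_class_weights` is hypothetical, and whether these weights suit the final model is left open.

```python
import pandas as pd


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return (len(y) / (len(counts) * counts)).to_dict()
```

Applied to `df["y"]`, the minority class ("yes") receives the larger weight, so errors on subscribers count more during training.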


### Assessing the Predictive Value of "unknown" Categories

Before converting "unknown" to NaN, it is important to determine whether these values behave like meaningful categories or simply represent missingness.
This analysis examines, for each categorical column, the proportion of clients who subscribed (y="yes") among the records labeled "unknown".



In [30]:
target_col = "y"
unknown_target_summary = {}

for col in categorical_cols:
    mask_unknown = df[col] == "unknown"
    count_unknown = mask_unknown.sum()
    
    if count_unknown > 0:
        proportions = (
            df.loc[mask_unknown, target_col]
              .value_counts(normalize=True)
              .rename("proportion")
              .round(4)
              .to_dict()
        )
        
        unknown_target_summary[col] = {
            "unknown_count": int(count_unknown),
            "unknown_pct": round(count_unknown / len(df) * 100, 2),
            "target_distribution": proportions
        }

unknown_target_summary

# Convert unknown_target_summary into a clean DataFrame
rows = []

for col, stats in unknown_target_summary.items():
    row = {
        "column": col,
        "unknown_count": stats["unknown_count"],
        "unknown_pct": stats["unknown_pct"],
        "yes_rate_among_unknown": stats["target_distribution"].get("yes", 0),
        "no_rate_among_unknown": stats["target_distribution"].get("no", 0),
    }
    rows.append(row)

unknown_target_df = pd.DataFrame(rows).sort_values("unknown_pct", ascending=False)
unknown_target_df

Unnamed: 0,column,unknown_count,unknown_pct,yes_rate_among_unknown,no_rate_among_unknown
3,default,8596,20.88,0.0515,0.9485
2,education,1730,4.2,0.1451,0.8549
4,housing,990,2.4,0.1081,0.8919
5,loan,990,2.4,0.1081,0.8919
0,job,330,0.8,0.1121,0.8879
1,marital,80,0.19,0.15,0.85


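From the table above, we can deduce that:
- For `housing`, `loan`, and `job`, the subscription rate among "unknown" records (about 11%) is close to the overall rate of roughly 11%, so "unknown" does not behave as a distinct category there.
- `education` and `marital` show somewhat higher rates (about 15%), although `marital` has only 80 such records.
- `default` is the exception: its "unknown" records subscribe at about 5%, well below the overall rate, so missingness in this column may itself carry signal.
- All "unknown" values are converted into true missing values (NaN) for consistent handling; `default` is treated separately and may be dropped entirely, since it has only 3 "yes" values.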
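
One way to put the rates among "unknown" records in context is to compare them with the rate among the remaining records of the same column. The helper below is a minimal sketch; `unknown_vs_known_rate` and its parameters are hypothetical names, not part of the notebook.

```python
import pandas as pd


def unknown_vs_known_rate(df, col, target="y", positive="yes", label="unknown"):
    """Compare the positive-class rate for `label` rows against all other rows."""
    is_unknown = df[col] == label
    is_positive = df[target] == positive
    return {
        "unknown_rate": round(is_positive[is_unknown].mean(), 4),
        "known_rate": round(is_positive[~is_unknown].mean(), 4),
        "n_unknown": int(is_unknown.sum()),
    }
```

A large gap between `unknown_rate` and `known_rate` on a sizeable `n_unknown` suggests that the placeholder carries information and may be worth keeping as a flag.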

### Handling Missing Values by Replacing "unknown" with NaN

Based on the analysis of how "unknown" values relate to the target variable, these placeholders are treated as missing data rather than as meaningful categories.
They are therefore converted to true missing values (NaN) to enable appropriate imputation or encoding during preprocessing.

The column `default` will be handled separately due to its extremely low signal and highly imbalanced categories.



In [31]:
# Create a working copy of the dataframe for clean EDA
df_clean = df.copy()

# Identify categorical columns
categorical_cols = df_clean.select_dtypes(include="object").columns.tolist()

# Replace 'unknown' with NaN in all categorical columns
df_clean[categorical_cols] = df_clean[categorical_cols].replace("unknown", np.nan)

print("Replaced 'unknown' with NaN in categorical columns.")



Replaced 'unknown' with NaN in categorical columns.


#### Missingness Summary After Replacement

In [36]:

missing_summary = (
    df_clean.isnull().mean()
    .mul(100)
    .round(2)
    .reset_index()
    .rename(columns={"index": "column", 0: "missing_%"})
    .sort_values("missing_%", ascending=False)
)

missing_summary


Unnamed: 0,column,missing_%
4,default,20.88
3,education,4.2
5,housing,2.4
6,loan,2.4
1,job,0.8
2,marital,0.19
0,age,0.0
7,contact,0.0
8,month,0.0
9,day_of_week,0.0


#### Inspect Distribution of Missingness

In [38]:
# Show columns with > 0% missing
missing_summary[missing_summary["missing_%"] > 0]

Unnamed: 0,column,missing_%
4,default,20.88
3,education,4.2
5,housing,2.4
6,loan,2.4
1,job,0.8
2,marital,0.19


#### Interpretation of Missingness After Conversion

From the updated missingness summary, we observe:

- `default` now has the highest missingness, at about 20.9%.

- `education` has moderate missingness (~4%).

- `housing` and `loan` have ~2.4% missing each.

- `job` and `marital` have very small missing percentages (<1%).

These levels of missingness are acceptable and manageable.


**Special Note on default**

Due to:
- only 3 “yes” values,
- extremely skewed distribution,
- minimal predictive utility,
- high proportion of missing values,

the default column may be dropped altogether during preprocessing.
This will be evaluated again during feature importance and correlation analysis.
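
As one possible way to handle the remaining missing values during preprocessing, the sketch below fills each categorical column with its most frequent value and keeps a flag marking which rows were imputed. `impute_categoricals` and the `<col>_missing` column names are assumptions introduced for illustration; the notebook does not commit to this strategy.

```python
import numpy as np
import pandas as pd


def impute_categoricals(df, cols):
    """Fill NaN in `cols` with each column's mode and keep a missing flag."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype(int)
        mode = out[col].mode(dropna=True)
        if not mode.empty:  # skip columns that are entirely NaN
            out[col] = out[col].fillna(mode.iloc[0])
    return out
```

In a modeling pipeline the mode should be computed on the training split only and then applied to the validation and test splits, to avoid leaking information.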


## Univariate Analysis

## Bivariate Analysis

## Multivariate Analysis
To understand interactions among multiple features simultaneously, we use pairwise plots and conditional visualizations.

In [33]:
# Pairwise Relationships (Selected Features)
selected_features = ['age', 'duration', 'campaign', 'euribor3m', 'emp.var.rate', 'nr.employed', 'y']

# Plot a random sample: KDE diagonals on all ~41k rows are slow to render
# (the full-data run was interrupted). The sample size of 5,000 is an arbitrary choice.
pair_sample = df[selected_features].sample(n=5000, random_state=42)

sns.pairplot(pair_sample, hue='y', diag_kind='kde', plot_kws={'alpha':0.5})
plt.suptitle("Pairwise Feature Relationships", y=1.02)
plt.show()

### Insights:

- Clusters indicate interaction effects between economic indicators and call-related features.
- High duration combined with favorable economic conditions (low euribor3m) aligns with higher subscription likelihood.


### Cramér’s V for Categorical Associations

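In [None]:
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical Series (0 = no association, 1 = perfect)."""
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    return np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))

# Compare every categorical feature (excluding the target itself) with y
cat_cols = [c for c in categorical_cols if c != 'y']
categorical_corr = {col: cramers_v(df[col], df['y']) for col in cat_cols}
pd.Series(categorical_corr).sort_values(ascending=False)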

**Note:**

Cramér's V quantifies the strength of association between each categorical variable and the target on a scale from 0 (no association) to 1 (perfect association).
Higher values indicate stronger relationships worth prioritizing for modeling.
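
For reference, the quantity computed by `cramers_v` is the standard (uncorrected) Cramér's V. The symbols are introduced here for illustration: $\chi^2$ is the chi-square statistic of the contingency table, $n$ is the total number of observations, and $r$ and $c$ are the numbers of rows and columns of the table.

```latex
V = \sqrt{\frac{\chi^2}{n \,\bigl(\min(r, c) - 1\bigr)}}
```

Because the target `y` has two classes, $\min(r, c) - 1 = 1$ for every feature with at least two categories, so the expression reduces to $V = \sqrt{\chi^2 / n}$. Note that `chi2_contingency` applies a continuity correction by default for 2×2 tables, which slightly lowers the value for binary features such as `contact`.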


## Feature Selection for Modeling

Here we identify the most relevant features to include in predictive modeling.


#### Filter Method (Correlation & Mutual Information)


In [None]:

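from sklearn.feature_selection import mutual_info_classif

# One-hot encode categorical features (numeric columns pass through unchanged)
df_encoded = pd.get_dummies(df.drop('y', axis=1), drop_first=True)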
y_encoded = df['y'].map({'no': 0, 'yes': 1})

# Calculate mutual information scores
mi_scores = mutual_info_classif(df_encoded, y_encoded, random_state=42)
mi = pd.Series(mi_scores, index=df_encoded.columns).sort_values(ascending=False)

# Display top 15 features
mi.head(15)



#### Summary of Results:
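- Features such as duration, poutcome_success, euribor3m, and emp.var.rate show the highest predictive power.
- This aligns with literature findings that both client behavior and economic conditions drive subscription outcomes.
- Note that duration is only known after a call ends, so it is not available when deciding whom to call and should be excluded from a realistic pre-call model.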

# Predictive Modeling

# Prescriptive Analytics and Recommendations

# Model Deployment