# Table of Contents

# Project Overview


# Data Collection and Initial Processing



## Dataset Overview
The dataset used in this project originates from a Portuguese retail bank and contains detailed records of telemarketing campaigns conducted between 2008 and 2013. These campaigns were aimed at promoting long-term deposit subscriptions among existing and potential customers.
The data was collected and organized across two versions:

- `bank.zip` – containing data from the initial campaigns between 2008 and 2010.
- `bank-additional.zip` – an extended version collected between 2008 and 2013 with richer socio-economic indicators and additional campaign details.
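The full collection is described as including up to 150 attributes, covering customer demographics, banking product information, campaign interaction details, and external macroeconomic variables. The CSV files loaded in this notebook expose a subset of these: 17 columns in the `bank.zip` files and 21 in the `bank-additional.zip` files.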
The dataset is widely recognized for benchmarking predictive modeling techniques in marketing analytics and serves as an excellent real-world example for classification problems in data science.
The target variable, y, indicates whether the telemarketing call resulted in a successful sale of a term deposit (yes) or not (no).

## Data Description
The dataset’s structure integrates multiple domains of information that collectively influence telemarketing outcomes. The features can be grouped into the following main categories:
1. Customer Demographics

   These variables describe the socio-demographic profile of each client:

   - `age` – Client’s age (numeric).
   - `job` – Type of occupation (e.g., admin, technician, blue-collar, services, etc.).
   - `marital` – Marital status (married, single, divorced).
   - `education` – Education level (basic, secondary, tertiary, unknown).
   - `default` – Indicates if the client has credit in default (yes, no).
   - `housing` – Has a housing loan (yes, no).
   - `loan` – Has a personal loan (yes, no).
2. Campaign and Communication Attributes

   These describe the telemarketing contact details and campaign context:

   - `contact` – Communication type (cellular or telephone).
   - `month` – Last contact month of the year.
   - `day_of_week` – Last contact day of the week (bank-additional files; the bank files instead record the day of the month as `day`).
   - `duration` – Duration of the last call in seconds.
   - `campaign` – Number of contacts performed during this campaign for the client.
   - `pdays` – Number of days since the client was last contacted in a previous campaign (-1 if never contacted in the bank files; the bank-additional files use 999 for the same case).
   - `previous` – Number of contacts performed before this campaign.
   - `poutcome` – Outcome of the previous marketing campaign (e.g., success, failure, nonexistent).
3. Banking Product Details

   Information about the client’s relationship with the bank and existing products:

   - `balance` – Average yearly balance in euros (present only in the bank files).
   - `y` (deposit subscription) – The target variable indicating campaign success (yes for successful subscription, no otherwise).
4. Socio-Economic Context

   These external indicators reflect macroeconomic conditions at the time of each campaign:

   - `emp.var.rate` – Employment variation rate (quarterly indicator).
   - `cons.price.idx` – Consumer price index.
   - `cons.conf.idx` – Consumer confidence index.
   - `euribor3m` – Euribor 3-month rate.
   - `nr.employed` – Number of employees in the economy.
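
Because `pdays` mixes real day counts with a "never contacted" sentinel, it is usually worth separating the two before computing statistics. The following is a minimal sketch on made-up values, assuming the -1 sentinel described above; the `contacted_before` and `pdays_clean` column names are illustrative, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"pdays": [-1, 5, -1, 12]})

# Sentinel as described above (assumption: adjust if a file encodes it differently).
NEVER_CONTACTED = -1

# Flag clients with a prior contact, then blank out the sentinel so it is
# not averaged in as if it were a real number of days.
toy["contacted_before"] = toy["pdays"] != NEVER_CONTACTED
toy["pdays_clean"] = toy["pdays"].where(toy["contacted_before"])

mean_days = toy["pdays_clean"].mean()  # NaN entries are skipped by default
```

Keeping the boolean flag alongside the cleaned column preserves the "never contacted" information that would otherwise be lost when the sentinel is masked.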

## Analytical Relevance
This dataset provides a rich foundation for:

- Exploratory Data Analysis (EDA) to uncover patterns in client behavior.
- Feature engineering to enhance predictive modeling.
- Machine learning classification to predict `y` (success of the campaign).
- Model interpretation using tools such as SHAP and LIME to derive actionable insights for marketing optimization.

## Project Alignment
By combining these data attributes with modern data science techniques (such as logistic regression, random forests, gradient boosting with XGBoost, and neural networks), the project aims to:

- Predict telemarketing call success more accurately.
- Identify key drivers of positive campaign outcomes.
- Provide strategic recommendations to improve campaign efficiency and reduce operational costs.

## Data Ingestion and Integration

Following the project alignment phase, the next step assembles a clean and comprehensive dataset for analysis. The CSV files containing campaign details, customer demographics, and call outcomes are read with Python’s pandas library and combined into a single DataFrame. Where the files share identifiers, those identifiers are used to align records; NumPy and pandas utilities support validation checks that confirm consistency, resolve missing or mismatched entries, and verify structural integrity. The resulting dataset provides a reliable foundation for the subsequent preprocessing, modeling, and analysis stages.
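
When source files turn out to have no shared key, one common alternative to a key-based join is to stack them row-wise and record where each row came from. The sketch below illustrates that idea on two made-up frames; the `source` column and the toy values are assumptions for illustration, not the project's actual integration step.

```python
import pandas as pd

# Toy frames standing in for two source files with partly different columns.
a = pd.DataFrame({"age": [30, 41], "day_of_week": ["mon", "tue"], "y": ["no", "yes"]})
b = pd.DataFrame({"age": [52], "balance": [1200], "y": ["no"]})

# Stack the rows and label each one with its (hypothetical) source name.
combined = pd.concat(
    [a.assign(source="a"), b.assign(source="b")],
    ignore_index=True,
    sort=False,
)

# Columns that exist in only one file are filled with NaN for the other file's rows.
missing_by_column = combined.isna().sum()
```

Counting the NaN values introduced by the stack makes it easy to see which columns are file-specific before deciding whether to keep, impute, or drop them.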

### Library Imports
To load the data, we import the pandas library, which is used to read and combine the CSV files.

Libraries used in later parts of the project will also be imported here to keep the workflow consistent and organized.

In [None]:
# library imports

import pandas as pd

### Creating DataFrames from Data Sources

With pandas imported, the next step is to read in the data from the various data files.
Each source file is loaded into a separate pandas DataFrame.

This approach allows each dataset to be inspected on its own before they are merged into a single DataFrame for unified analysis.

In [None]:
# reading datasets into dataframes

df1 = pd.read_csv("data/bank-additional-full.csv", sep=";")
df2 = pd.read_csv("data/bank-additional.csv", sep=";")
df3 = pd.read_csv("data/bank-full.csv", sep=";")
df4 = pd.read_csv("data/bank.csv", sep=";")
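
The `sep=";"` argument matters because these files are semicolon-delimited rather than comma-delimited. The short sketch below uses an in-memory string with made-up rows to show what happens with and without it; the variable names are illustrative.

```python
import io

import pandas as pd

# A tiny semicolon-delimited sample with made-up rows.
raw = "age;job;y\n30;admin.;no\n41;services;yes\n"

# With the default comma separator, every line is read as one wide column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=";", the three fields are split into separate columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
```

A quick look at `shape` after loading is therefore a cheap way to confirm that the delimiter was interpreted as intended.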

### Data Inspection and Validation

After loading the datasets into separate DataFrames, the next step is to inspect and validate each one before merging.
This helps ensure that all files were read correctly and that their structures are consistent.

We begin by:

- Viewing sample records with head() to confirm data integrity.

- Checking dataset dimensions using shape.

- Reviewing data types and non-null counts with info().

- Identifying missing values using isnull().sum().

- Verifying that key identifiers (e.g., customer_id, campaign_id) are present and properly formatted.

These checks help detect potential issues early and ensure a smooth integration process in the next step.
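
The checks listed above can be bundled into one helper so that every DataFrame is validated the same way. This is a minimal sketch: the function name `quick_checks` is hypothetical, the identifier names are the examples from the list, and counting the text placeholder `"unknown"` is an added assumption, included because `isnull()` does not see missing values that are stored as text.

```python
import pandas as pd

def quick_checks(df, id_columns=("customer_id", "campaign_id"), placeholder="unknown"):
    """Return a small dict of validation results for one DataFrame."""
    non_numeric = df.select_dtypes(exclude="number")
    return {
        "shape": df.shape,
        "null_cells": int(df.isnull().sum().sum()),
        # Text placeholders are invisible to isnull(), so count them separately.
        "placeholder_cells": int(non_numeric.eq(placeholder).sum().sum()),
        # Identifier columns that were expected but are not present.
        "missing_id_columns": [c for c in id_columns if c not in df.columns],
    }

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [30, 41, 25],
    "job": ["admin.", "unknown", "services"],
    "y": ["no", "yes", "no"],
})
result = quick_checks(toy)
```

Returning a plain dict keeps the helper easy to collect into a summary table across several DataFrames.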

#### df1

In [None]:
# view sample records to confirm data integrity

df1.head(10)

In [None]:
# Checking dataset dimensions

df1.shape

In [None]:
# Reviewing data types and non-null counts
df1.info()

In [None]:
# Identifying missing values

df1.isnull().sum()

##### Observation
- Columns: `'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'`
- No key/unique identifier (e.g., campaign IDs) found.
- Some columns seem to have the same value for all rows. Further exploration needed.
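
An impression of constant columns based on the first ten rows can be misleading, so it helps to count distinct values over the whole frame. The sketch below shows one way to do that on made-up data; the toy frame and variable names are illustrative.

```python
import pandas as pd

# Made-up rows; "default" is deliberately constant in this toy sample.
toy = pd.DataFrame({
    "age": [30, 41, 25, 30],
    "default": ["no", "no", "no", "no"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

# Count distinct values per column; dropna=False treats NaN as a value,
# so an all-NaN column is also reported as constant.
distinct = toy.nunique(dropna=False)
constant_columns = distinct[distinct <= 1].index.tolist()
```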

#### df2

In [None]:
# view sample records to confirm data integrity

df2.head(10)

In [None]:
# Checking dataset dimensions

df2.shape

In [None]:
# Reviewing data types and non-null counts
df2.info()

In [None]:
# Identifying missing values

df2.isnull().sum()

##### Observation
- Columns: `'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'`
- No key/unique identifier (e.g., campaign IDs) found.
- The column list is identical to that of df1, and most categorical classes also appear in df1.
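
The overlap between two frames with the same columns can be checked directly rather than by eye. The sketch below, on made-up data, lists category levels that appear only in the smaller frame and measures what share of its rows also occur in the larger one; the names `full` and `sample` are stand-ins, and whether df2 is actually drawn from df1 is an assumption to verify, not an established result.

```python
import pandas as pd

# Made-up stand-ins for a larger frame and a smaller one with the same columns.
full = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "job": ["admin.", "services", "technician", "retired"],
    "y": ["no", "yes", "no", "no"],
})
sample = pd.DataFrame({
    "age": [41, 52, 60],
    "job": ["services", "retired", "student"],
    "y": ["yes", "no", "yes"],
})

# Category levels present in the smaller frame but not in the larger one.
extra_levels = {
    col: sorted(set(sample[col]) - set(full[col]))
    for col in sample.select_dtypes(exclude="number").columns
}

# Left-join on all shared columns; `_merge` records whether each row of
# `sample` was also found in `full`.
matched = sample.merge(full.drop_duplicates(), how="left", indicator=True)
share_found = (matched["_merge"] == "both").mean()
```

A `share_found` close to 1 would suggest the smaller file adds little new information when the two are combined.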

#### df3

In [None]:
# view sample records to confirm data integrity

df3.head(10)

In [None]:
# Checking dataset dimensions

df3.shape

In [None]:
# Reviewing data types and non-null counts
df3.info()

In [None]:
# Identifying missing values

df3.isnull().sum()

##### Observation
- Columns: `'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y'`
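- No key/unique identifier (e.g., campaign IDs) found.
- Fewer columns than df1 and df2 (17 versus 21).
- Most columns are shared with df1 and df2; `balance` and `day` appear only in df3 and df4, while `day_of_week` and the five socio-economic indicators are absent.
- Some columns seem to have the same value for all rows. Further exploration needed.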
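
The column differences between the two file families can be derived from the column lists recorded in the observations. The sketch below uses plain Python sets on those lists; the variable names are illustrative.

```python
# Column lists copied from the observations for df1/df2 and df3/df4.
cols_additional = [
    "age", "job", "marital", "education", "default", "housing", "loan",
    "contact", "month", "day_of_week", "duration", "campaign", "pdays",
    "previous", "poutcome", "emp.var.rate", "cons.price.idx",
    "cons.conf.idx", "euribor3m", "nr.employed", "y",
]
cols_bank = [
    "age", "job", "marital", "education", "default", "balance", "housing",
    "loan", "contact", "day", "month", "duration", "campaign", "pdays",
    "previous", "poutcome", "y",
]

# Columns unique to each family, and the shared columns in their original order.
only_in_additional = sorted(set(cols_additional) - set(cols_bank))
only_in_bank = sorted(set(cols_bank) - set(cols_additional))
shared = [c for c in cols_additional if c in cols_bank]
```

With real DataFrames the same comparison works on `df.columns` directly, which avoids retyping the lists.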

#### df4

In [None]:
# view sample records to confirm data integrity

df4.head(10)

In [None]:
# Checking dataset dimensions

df4.shape

In [None]:
# Reviewing data types and non-null counts
df4.info()

In [None]:
# Identifying missing values

df4.isnull().sum()

##### Observation
- Columns: `'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y'`
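- No key/unique identifier (e.g., campaign IDs) found.
- Fewer columns than df1 and df2 (17 versus 21).
- Most columns are shared with df1 and df2; `balance` and `day` appear only in df3 and df4, while `day_of_week` and the five socio-economic indicators are absent.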

#### Tabular Summary

The shape and structure of all the dataframes are summarized in tabular form below.

In [None]:

# dictionary of all dataframes
dataframes = {
    "df1": df1,
    "df2": df2,
    "df3": df3,
    "df4": df4
}

# Function to summarize key info
def summarize_df(df):
    return {
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Missing Values": df.isnull().sum().sum(),
        "Duplicate Rows": df.duplicated().sum(),
        "Numeric Columns": df.select_dtypes(include='number').shape[1],
        "Categorical Columns": df.select_dtypes(exclude='number').shape[1]
    }

# Generate summary table
summary = pd.DataFrame({name: summarize_df(df) for name, df in dataframes.items()}).T
summary


#### Tabular Summary II: Understand Potential Relevance of Columns
What to look for:

- Columns with many missing or constant values → likely low value.

- Columns with clear categorical or numerical variation → likely valuable for modeling.

In [None]:
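for df_name, df in dataframes.items():
    print(f"--- {df_name} ---")
    # describe() reports the non-null count as 'count'; it has no 'non-null' column.
    # All columns are shown (no .head()) so that none are skipped in the review.
    print(df.describe(include='all').T[['unique', 'count', 'top', 'freq']])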
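
To make the two criteria above measurable, each column can be profiled by its share of missing values and the share taken by its most frequent value. The following is a minimal sketch on made-up data; `column_profile` and its output column names are hypothetical helpers, not part of pandas.

```python
import pandas as pd

def column_profile(df):
    """Per-column distinct count, missing share, and share of the most frequent value."""
    rows = {}
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)  # sorted, most frequent first
        rows[col] = {
            "n_unique": df[col].nunique(dropna=False),
            "missing_share": df[col].isna().mean(),
            "top_share": counts.iloc[0] / len(df),
        }
    return pd.DataFrame(rows).T

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "default": ["no", "no", "no", "no", "yes"],
    "job": ["admin.", "services", "admin.", "retired", "student"],
    "age": [30, 41, 25, 52, 30],
})
profile = column_profile(toy)
```

Columns with a `top_share` near 1 are close to constant, and columns with a high `missing_share` carry little information; both are candidates for closer review before modeling.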


# Exploratory Data Analysis (EDA)

# Predictive Modeling

# Prescriptive Analytics and Recommendations

# Model Deployment