# **01 — Data Loading & Initial Data Readiness Check**


# 1. Project Context and Objective

## 1.1 Business Context

This project addresses the inefficiencies of traditional telemarketing campaigns conducted by a Portuguese banking institution. Historically, client outreach was performed without predictive targeting, leading to high operational costs, low conversion rates, and unnecessary customer contact.

The goal of this end-to-end data science project is to support a transition toward a data-driven marketing strategy by building predictive models capable of identifying clients with a higher probability of subscribing to a term deposit.

This notebook represents the first stage of the analytical workflow: ensuring that the dataset is correctly loaded, well understood at a structural level, and suitable for downstream analysis and modeling.

# 2. Data Source and Provenance

## 2.1 Dataset Origin
The analysis is based on the Bank Marketing Data Set from the UCI Machine Learning Repository.

* Authors: S. Moro, P. Rita, & P. Cortez (2014)
* Institution: University of Minho, Portugal
* Time Period: Marketing campaigns conducted between May 2008 and November 2010

Official Repository Reference:
[https://archive.ics.uci.edu/dataset/222/bank+marketing](https://archive.ics.uci.edu/dataset/222/bank+marketing)

The dataset reflects real-world phone-based marketing campaigns, making it well suited for applied predictive analytics and business-oriented modeling.

## 2.2 Local Data File Description
| Feature | Detail |
| :--- | :--- |
| **File name** | `bank-additional-full.csv` |
| **Location** | `data/` directory |
| **Format** | CSV (text file) |
| **Delimiter** | Semicolon (`;`) |
| **Observations** | 41,188 |
| **Variables** | 21 (20 features + 1 target) |
| **Target Variable** | `y` (term deposit subscription: yes / no) |

**Important Note**
This dataset uses a semicolon rather than the comma that `pd.read_csv` expects by default.
Explicitly specifying `sep=';'` is required; otherwise each row is parsed as a single column.
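
To illustrate why the delimiter matters, the following minimal sketch parses a small in-memory sample with and without `sep=';'`. The sample string and the names `sample`, `wrong`, and `right` are assumptions made for illustration; the sample mirrors the file layout with only three columns and is not read from the actual file.

```python
import io

import pandas as pd

# Hypothetical three-column sample in a semicolon-delimited layout.
sample = "age;job;y\n56;housemaid;no\n57;services;no\n"

# Default comma delimiter: every line is read as one single field.
wrong = pd.read_csv(io.StringIO(sample))

# Explicit semicolon delimiter: the columns are split as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # a single column
print(right.shape)  # three columns
```

A frame with exactly one column after loading is therefore a quick signal that the delimiter was not applied.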

# 3. Data Loading

## 3.1 Library Imports

In [1]:
# Import necessary libraries for data manipulation and numerical operations
import pandas as pd
import numpy as np

## 3.2 Dataset Ingestion

In [2]:
# Load the dataset using pandas.
# Specify the semicolon (;) as the delimiter
# to ensure correct column parsing.
df = pd.read_csv(
    '/content/drive/MyDrive/ClassFiles/bank-additional-full.csv',
    sep=';'
)

## 3.3 Structural Validation

In [9]:
# Confirm dataset dimensions
print("Dataset Dimensions")
print(df.shape)

Dataset Dimensions
(41188, 21)
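
The printed dimensions can also be enforced programmatically so that a wrong file or delimiter fails loudly. The sketch below is one possible way to do this; the helper name `assert_expected_shape` and the constant `EXPECTED_SHAPE` are hypothetical, and the expected values are taken from the file description in Section 2.2.

```python
import pandas as pd

# Rows and columns as listed in the file description (Section 2.2).
EXPECTED_SHAPE = (41188, 21)


def assert_expected_shape(frame: pd.DataFrame, expected: tuple = EXPECTED_SHAPE) -> None:
    """Raise ValueError if the frame's dimensions differ from the expected ones."""
    if frame.shape != expected:
        raise ValueError(f"Expected shape {expected}, got {frame.shape}")
```

Under these assumptions, calling `assert_expected_shape(df)` right after loading would stop the notebook early if the dimensions do not match.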


## 3.4 Initial Data Inspection

In [11]:
# Preview the first five rows to validate:
# - Correct column separation
# - Column names
# - Overall data structure
display(df.head())

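| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |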


# 4. Initial Data Readiness Assessment

## 4.1 Variable Types and Structure

In [13]:
# Inspect column names, data types, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

## 4.2 Preliminary Observations

At this stage, several important characteristics are identified and documented for later stages:

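* The target variable `y` is binary (`yes` / `no`).
* Several categorical variables include values such as `"unknown"` or `"nonexistent"`. These are placeholder labels rather than nulls, which is why `df.info()` counts them as non-null. They carry information of their own and will require explicit handling.
* The dataset includes call-related features (e.g., `duration`) that are known from prior research to introduce data leakage if used incorrectly: the duration of a call is only known once the call has ended, at which point the outcome is known as well.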

These observations are intentionally recorded here to justify preprocessing and modeling decisions made in subsequent notebooks.
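
The observations above can be quantified with a few lines of pandas. The following is a minimal sketch, not the preprocessing used in later notebooks: the helper names `count_placeholders` and `drop_leaky_columns` are hypothetical, the mini-frame `demo` contains made-up rows, and dropping `duration` is shown only as one common way to avoid the leakage described above.

```python
import pandas as pd

# Placeholder labels named in the observations above.
PLACEHOLDERS = ("unknown", "nonexistent")


def count_placeholders(frame: pd.DataFrame, labels=PLACEHOLDERS) -> pd.Series:
    """Count, per non-numeric column, the cells holding a placeholder label."""
    text_cols = frame.select_dtypes(exclude="number")
    return text_cols.isin(list(labels)).sum()


def drop_leaky_columns(frame: pd.DataFrame, leaky=("duration",)) -> pd.DataFrame:
    """Return a copy without columns that are only known after the call."""
    return frame.drop(columns=list(leaky), errors="ignore")


# Hypothetical mini-frame for illustration; not real rows from the dataset.
demo = pd.DataFrame({
    "default": ["no", "unknown", "no"],
    "poutcome": ["nonexistent", "nonexistent", "success"],
    "duration": [261, 149, 226],
    "y": ["no", "no", "yes"],
})

placeholder_counts = count_placeholders(demo)
features = drop_leaky_columns(demo)
print(placeholder_counts)
print(list(features.columns))
```

Applied to the full frame, `count_placeholders(df)` would show which columns carry placeholder labels and how often, which can inform the handling strategy chosen later.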

# 5. Conclusion of the Data Loading Stage

This notebook confirms that:

* The dataset source and provenance are clearly documented.
* The data is correctly loaded using the appropriate delimiter.
* The dataset structure matches the official documentation (41,188 rows, 21 columns).
* No ingestion or formatting issues are present.

The dataset is therefore considered ready for exploratory data analysis.
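
The checks summarized in this section could be gathered into a single reusable report. The sketch below is one possible shape for such a helper; the function name `readiness_report`, its dictionary keys, and the mini-frame `demo` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def readiness_report(frame: pd.DataFrame, target: str = "y") -> dict:
    """Collect a few basic ingestion checks into a dictionary."""
    return {
        "shape": frame.shape,
        # Total number of true nulls (NaN/None) across all cells.
        "null_cells": int(frame.isna().sum().sum()),
        # A single column is the typical symptom of a wrong delimiter.
        "single_column": frame.shape[1] == 1,
        # Distinct labels found in the target column.
        "target_values": sorted(frame[target].dropna().unique()),
    }


# Hypothetical mini-frame for illustration; not real rows from the dataset.
demo = pd.DataFrame({"age": [56, 57, 37], "y": ["no", "no", "yes"]})
report = readiness_report(demo)
print(report)
```

Running `readiness_report(df)` on the loaded data would give a compact record of the state in which the dataset is handed over to exploratory analysis.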