
# 📊 Bank Marketing Campaign Analysis

## 📌 Business Understanding

This project analyzes data from a Portuguese bank's telemarketing campaigns. The goal is to **predict whether a customer will subscribe to a term deposit**. 

We use a real-world dataset, perform exploratory data analysis (EDA), train predictive models, evaluate them, and provide actionable insights to guide marketing strategies.


In [None]:

import pandas as pd

# Load the dataset directly from GitHub
url = "https://raw.githubusercontent.com/bedrock510/bank_marketing_colab_ready/main/bank-additional-full.csv"
df = pd.read_csv(url, sep=';')
print("✅ Loaded data with shape:", df.shape)


In [None]:

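# Drop 'duration' if present to avoid data leakage: call length is only known
# after the call ends, so it is unavailable when deciding whom to contact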
if 'duration' in df.columns:
    df = df.drop(columns=['duration'])
    print("✅ Dropped 'duration' column")
else:
    print("ℹ️ 'duration' not found — possibly already removed")
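Before moving on to EDA, it can help to check how missing values are encoded. The UCI description of this dataset marks missing categorical values with the string `unknown` rather than NaN, so `df.isna()` will not surface them. The sketch below is a minimal illustration on a toy frame with made-up values; `unknown_share` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd


def unknown_share(frame: pd.DataFrame, label: str = "unknown") -> pd.Series:
    """Return, per non-numeric column, the fraction of rows equal to `label`."""
    non_numeric = frame.select_dtypes(exclude="number")
    return non_numeric.eq(label).mean().sort_values(ascending=False)


# Toy data (hypothetical values) standing in for the real DataFrame
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "default": ["no", "unknown", "no", "no"],
    "age": [30, 41, 52, 38],
})

shares = unknown_share(toy)
print(shares)
```

On the real data, the same call would be `unknown_share(df)`.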



## 🔍 Exploratory Data Analysis (EDA)

Let's visualize the distribution of the target variable, as well as categorical and numerical features.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.figure(figsize=(6, 4))
sns.countplot(x='y', data=df)
plt.title("Target Variable: Subscription to Term Deposit")
plt.xlabel("Subscribed")
plt.ylabel("Count")
plt.show()

# Categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols.remove('y')

fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(18, 20))
for ax, col in zip(axes.flatten(), categorical_cols):
    sns.countplot(x=col, data=df, ax=ax, order=df[col].value_counts().index)
    ax.set_title(f"Distribution of {col}")
    ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Numerical features
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))
for ax, col in zip(axes.flatten(), numerical_cols):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
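The count plots above show how often each category occurs, but not how often each category converts. A small sketch of a per-category subscription rate follows, assuming the target column is `y` with values `yes`/`no` as in the cells above. The toy frame and the helper name `subscription_rate_by` are illustrative assumptions only.

```python
import pandas as pd


def subscription_rate_by(frame, col, target="y", positive="yes"):
    """Share of positive outcomes and row count for each value of `col`."""
    is_positive = frame[target].eq(positive)
    summary = is_positive.groupby(frame[col]).agg(rate="mean", n="count")
    return summary.sort_values("rate", ascending=False)


# Toy data (hypothetical values), not taken from the real dataset
toy = pd.DataFrame({
    "contact": ["cellular"] * 4 + ["telephone"] * 4,
    "y": ["yes", "yes", "no", "no", "yes", "no", "no", "no"],
})

rates = subscription_rate_by(toy, "contact")
print(rates)
```

With the real data, `subscription_rate_by(df, "month")` would give a table of the same shape. Reading the `n` column alongside `rate` avoids over-interpreting categories with very few rows.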



## 🤖 Modeling

We'll use Logistic Regression and Random Forest classifiers to predict whether a client will subscribe to a term deposit.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Prepare target and features
y = df['y'].map({'no': 0, 'yes': 1})
X = df.drop(columns=['y'])

categorical_features = X.select_dtypes(include='object').columns.tolist()
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Preprocessing pipelines
# Note: this dataset encodes missing categorical values as the string 'unknown',
# so the imputers below only guard against true NaNs.
# Scaling the numeric features helps LogisticRegression converge;
# it does not affect the Random Forest.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Train/test split (stratified so both sets keep the same class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic Regression
logreg_pipeline = Pipeline([
    ('pre', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])
logreg_pipeline.fit(X_train, y_train)
log_preds = logreg_pipeline.predict(X_test)
log_proba = logreg_pipeline.predict_proba(X_test)[:, 1]

# Random Forest
rf_pipeline = Pipeline([
    ('pre', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])
rf_pipeline.fit(X_train, y_train)
rf_preds = rf_pipeline.predict(X_test)
rf_proba = rf_pipeline.predict_proba(X_test)[:, 1]

# Evaluation
print("📊 Logistic Regression:")
print(confusion_matrix(y_test, log_preds))
print(classification_report(y_test, log_preds))
print("ROC AUC:", roc_auc_score(y_test, log_proba))

print("\n📊 Random Forest:")
print(confusion_matrix(y_test, rf_preds))
print(classification_report(y_test, rf_preds))
print("ROC AUC:", roc_auc_score(y_test, rf_proba))
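With an imbalanced target, accuracy and ROC AUC can look strong even when the model finds few actual subscribers. Average precision (the area under the precision-recall curve) is a common complement, and its natural baseline is the positive-class rate. The sketch below shows the calculation on synthetic data from `make_classification`; the data and variable names are assumptions for illustration and say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 10% positives) standing in for the bank data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

ap = average_precision_score(y_te, proba)
baseline = y_te.mean()  # average precision of a random ranking is about this value
print(f"Average precision: {ap:.3f} (baseline {baseline:.3f})")
```

For the models above, the equivalent calls would be `average_precision_score(y_test, log_proba)` and `average_precision_score(y_test, rf_proba)`.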



## 🧠 Findings & Insights

- The **dataset is imbalanced**: far more customers decline than subscribe.
- **Random Forest slightly outperforms Logistic Regression** in precision and ROC AUC.
- Key predictive features (based on feature importance and domain knowledge) include:
  - `month`: the time of year affects campaign results
  - `contact`: contact channel (`cellular` vs. `telephone`)
  - `education`: education level correlates with subscription likelihood

### 📌 Business Insight:
- Campaigns may perform better in **specific months** and with **certain client profiles**.
- Use model outputs to **target high-likelihood clients** in future outreach.
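Targeting high-likelihood clients can be made concrete by ranking clients by predicted probability and calling only the top fraction. The sketch below computes the subscription rate within that top slice and its lift over the overall rate. The function name `top_fraction_lift` and the demo arrays are assumptions for illustration only.

```python
import numpy as np


def top_fraction_lift(y_true, scores, fraction=0.1):
    """Return (positive rate in the top-scored fraction, lift over the base rate)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(-scores, kind="stable")[:k]  # highest scores first
    base_rate = y_true.mean()  # assumes at least one positive in y_true
    top_rate = y_true[top_idx].mean()
    return top_rate, top_rate / base_rate


# Hypothetical outcomes and model scores for ten clients
y_demo = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
scores_demo = np.array([0.9, 0.2, 0.1, 0.8, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35])

rate, lift = top_fraction_lift(y_demo, scores_demo, fraction=0.2)
print(f"Top 20% subscription rate: {rate:.2f}, lift: {lift:.1f}x")
```

With the models above, `top_fraction_lift(y_test, rf_proba, fraction=0.1)` would estimate how much better calling the top decile performs than calling at random.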

---



## ✅ Next Steps

- Tune models using **GridSearchCV** for better performance
- Explore advanced algorithms like **XGBoost or LightGBM**
- Consider **sampling methods** (SMOTE or undersampling) to handle class imbalance
- Deploy the model to assist call center agents in **real-time targeting**
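As a starting point for the tuning and imbalance items above, the sketch below runs `GridSearchCV` over a logistic regression's regularization strength and its `class_weight` setting, scored by average precision with stratified folds. Class weighting is swapped in here as a simpler alternative to SMOTE or undersampling, which would require the separate `imbalanced-learn` package. The synthetic data and the grid values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the bank data
X_grid, y_grid = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "model__C": [0.1, 1.0, 10.0],
    "model__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="average_precision",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_grid, y_grid)
print(search.best_params_, round(search.best_score_, 3))
```

The same grid keys work with the notebook's `logreg_pipeline`, since its final step is also named `model`.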

---
