<a href="https://www.kaggle.com/code/m26102002/bank-marketing-campaign-python-statistics?scriptVersionId=261067141" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Marketing Campaign Analysis (Python + Statistics)
**Author:** Avnish Thakur  
**Dataset:** Bank Marketing Campaign Dataset ([Kaggle](https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset/data))

# 1. Business Problem
Banks invest in marketing campaigns to attract customers.  
However, not all campaigns succeed — calls are costly and only some customers subscribe to term deposits.  

**Goal:**  
- Analyze current customer and campaign data.  
- Identify key factors driving campaign success.  
- Build a logistic regression model to predict response (yes/no).  
- Provide actionable strategies to improve future campaign efficiency.  

# 2. Data Import

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Load dataset

df = pd.read_csv("/kaggle/input/bank-marketing-dataset/bank.csv")
df.head()

In [None]:
# basic info

df.info()
print(df.shape)

In [None]:
print(df.describe())
print(df.describe(include='object'))

# 3. Data Cleaning

In [None]:
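# Check for null and 'unknown' values.
# Both results are printed because a notebook cell only displays its last expression.

print(df.isnull().sum())
print(df.apply(lambda x: (x == 'unknown').sum()))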

In [None]:
# replace unknown with nan

df = df.replace('unknown', np.nan)

 > `unknown` marks values that were not recorded. For `poutcome` it most likely identifies customers who were not contacted in a previous campaign, i.e. new customers.
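
To see how much of each column is affected by the replacement, the share of `unknown` entries can be compared with the share of missing values afterwards. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `unknown_share` helper are illustrative assumptions, not part of the bank dataset or the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (values are illustrative only)
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "student", "retired"],
    "poutcome": ["unknown", "unknown", "success", "failure"],
    "age": [30, 45, 22, 67],
})

def unknown_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of 'unknown' entries in each column."""
    return frame.eq("unknown").mean() * 100

# Share of 'unknown' before cleaning
share_before = unknown_share(toy)

# Replace the placeholder with NaN, as in the cell above
cleaned = toy.replace("unknown", np.nan)

# Share of missing values after cleaning (should match share_before)
missing_after = cleaned.isna().mean() * 100

print(share_before)
print(missing_after)
```

Columns with a large share of missing values may be better kept as their own category than dropped or imputed.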

# 4. Exploratory Data Analysis

**Bivariate (Feature vs Target)**

In [None]:
def subscription_rate_by_feature(feature):
    # Crosstab for subscription rates (%)
    rate_table = pd.crosstab(df[feature], df['deposit'], normalize='index') * 100
    print(f"\nSubscription Rate by {feature}:")
    print(rate_table)

    plt.figure(figsize=(7,4))
    sns.countplot(x=feature, hue='deposit', data=df)
    plt.title(f"Subscription Distribution by {feature}")
    plt.xticks(rotation=45)
    plt.show()

campaign_features = [
    'job','marital','education','default','housing','loan',
    'contact','month','poutcome'
]

for feature in campaign_features:
    subscription_rate_by_feature(feature)
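
A rate on its own can be misleading when a category has only a few contacts. As a hedged complement to the crosstab above, the sketch below puts the number of contacts next to the subscription rate. The `toy` frame and the `rate_with_volume` helper are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Toy data standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "month": ["may", "may", "may", "may", "mar", "mar", "sep", "sep"],
    "deposit": ["no", "no", "no", "yes", "yes", "yes", "yes", "no"],
})

def rate_with_volume(frame, feature, target="deposit"):
    """Return contacts and subscription rate (%) per category, sorted by rate."""
    is_yes = frame[target].eq("yes")
    summary = is_yes.groupby(frame[feature]).agg(contacts="size", rate_pct="mean")
    summary["rate_pct"] = summary["rate_pct"] * 100
    return summary.sort_values("rate_pct", ascending=False)

summary = rate_with_volume(toy, "month")
print(summary)
```

On the real data, calling the helper as `rate_with_volume(df, 'month')` would show whether a high rate comes from a large campaign or from a handful of calls.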


In [None]:
# Observations from the bivariate analysis

# 1. Month
# Most contacts happen in May, but the success rate is very low.
# March, September, December → smaller campaigns, but higher subscription rates.

# 2. Contact type
# Cellular contacts convert much higher than telephone.

# 3. Previous campaign outcome (poutcome)
# Customers with "success" in past campaigns have a much higher chance of subscribing again.
# "Failure" → very low conversion.
# "Unknown" → average (this group only appears if the table is built before 'unknown' is replaced with NaN).

# 4. Housing loan
# Customers with housing loans tend to subscribe less.

# 5. Job
# Students, Retired → surprisingly high subscription rates.
# Blue-collar, Services → lower subscription rates.

# 6. Education
# Higher education → slightly better conversion rates.
# Secondary education → most volume, but moderate success.

# 7. Marital
# Single clients subscribe more than married/divorced clients.

# 5. Statistical Test (Analysis)

In [None]:
from scipy.stats import chi2_contingency, ttest_ind

# Binary target: 1 = subscribed, 0 = did not subscribe
df['response_flag'] = df['deposit'].map({'yes':1, 'no':0})

In [None]:
def feature_stat_test(df, feature, target):
    """Test whether a feature is associated with the binary target."""
    if df[feature].dtype == 'object':
        # Categorical feature: chi-square test of independence
        contingency = pd.crosstab(df[feature], df[target])
        chi2, p, dof, expected = chi2_contingency(contingency)
        print(f"Chi-square test: {feature} vs {target}, p-value = {p:.5f}")
    else:
        # Numeric feature: two-sample t-test between subscribers and non-subscribers
        group1 = df[df[target] == 1][feature]
        group0 = df[df[target] == 0][feature]
        t, p = ttest_ind(group1, group0)
        print(f"T-test: {feature} vs {target}, p-value = {p:.5f}")

features = ['age','job','marital','education','default','balance',
            'housing','loan','contact','day','month','duration',
            'campaign','previous','poutcome']
for f in features:
    feature_stat_test(df, f, 'response_flag')
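
With thousands of rows, even weak associations tend to produce very small p-values, so an effect size helps to rank the categorical features. One common choice is Cramér's V, computed from the chi-square statistic. The sketch below is an assumption about how it could be added here. The `cramers_v` helper and the toy series are not part of the original notebook:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: 0 means no association, 1 means perfect association."""
    table = pd.crosstab(x, y)
    # Skip the Yates correction so that a perfect 2x2 association gives exactly 1
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # assumes both variables have at least two levels
    return float(np.sqrt(chi2 / (n * k)))

# Toy examples (illustrative only)
perfect = cramers_v(
    pd.Series(["a", "a", "b", "b"] * 5),
    pd.Series(["yes", "yes", "no", "no"] * 5),
)
independent = cramers_v(
    pd.Series(["a", "a", "b", "b"]),
    pd.Series(["yes", "no", "yes", "no"]),
)
print(perfect, independent)
```

On the real data it could be called as `cramers_v(df['poutcome'], df['response_flag'])` for each categorical feature.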

In [None]:
# Demographics: Job, age, marital, education → certain groups subscribe more.

# Financials: High balance, no default, stable loans → higher likelihood.

# Campaign history: Previous success (poutcome) → strong predictor.

# Campaign strategy: Optimize number/timing of calls, prefer July–Nov.

# Channel: Cellular slightly better than telephone.

# Duration: Indicates engagement, but it is only known after a call ends, so use it only for post-campaign analysis.

# 6. Encoding Categorical Features

In [None]:
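from sklearn.preprocessing import LabelEncoder

# 'contact' is not encoded because it is dropped before modeling
categorical_features = [
    'job','marital','education','default','housing','loan',
    'month','poutcome'
]
le = LabelEncoder()

# Note: label encoding assigns arbitrary integer codes to categories,
# which a linear model will treat as if they were ordered.
for col in categorical_features:
    df[col] = le.fit_transform(df[col])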
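
Integer codes impose an arbitrary order on nominal features such as `job` or `month`, which a logistic regression interprets as a numeric scale. A possible alternative, shown here only as a sketch on a made-up frame (the `toy` data is an assumption, not the bank dataset), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "job": ["student", "retired", "services", "student"],
    "marital": ["single", "married", "married", "divorced"],
    "age": [22, 67, 41, 25],
})

# One indicator column per category; drop_first avoids a redundant column
# per feature (the dropped category becomes the reference level).
encoded = pd.get_dummies(toy, columns=["job", "marital"], drop_first=True, dtype=int)

print(encoded)
```

Each coefficient of a model fitted on such columns then compares one category against the dropped reference category.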

# 7. Split Dataset

In [None]:
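# Drop the target columns, 'duration' (only known after a call ends, so it would leak the outcome)
# and 'contact' (not encoded above).
X = df.drop(columns=['response_flag', 'deposit', 'contact', 'duration'])
y = df['response_flag']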

In [None]:
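# Train-test split (fixed seed for reproducibility, stratified to keep the class ratio)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)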
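
A single split gives one estimate that depends on which rows land in the test set. As a hedged alternative, scaling and the model can be bundled in a scikit-learn pipeline and evaluated with cross-validation, so the scaler is refit on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in, since the bank data is not loaded here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# Scaling happens inside the pipeline, so each fold fits its own scaler
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# 5-fold cross-validated ROC-AUC
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Mean ROC-AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.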

# 8. Logistic Regression Model

In [None]:
# train model

model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
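
Because the inputs are standardized, the size of each fitted coefficient gives a rough indication of how strongly a feature is associated with the outcome. The following sketch shows one way to rank them. It runs on synthetic data with placeholder feature names (`feature_0` to `feature_4` are assumptions), not on the bank data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustrative only)
feature_names = [f"feature_{i}" for i in range(5)]
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=1
)

X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# Rank features by the absolute value of their coefficient
coef_table = pd.Series(clf.coef_[0], index=feature_names).sort_values(
    key=abs, ascending=False
)
print(coef_table)
```

In the notebook, the equivalent would be `pd.Series(model.coef_[0], index=X.columns)`. Coefficients of label-encoded nominal features should be read with caution, since their integer codes carry no real order.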

In [None]:
# Predictions

y_pred = model.predict(X_test_scaled)

In [None]:
# Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")

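> Moderate performance. Accuracy alone says little without a baseline, so the confusion matrix and classification report below give a fuller picture.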

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

In [None]:
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

In [None]:
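# Accuracy: ~0.64

# Class balance: check it with df['deposit'].value_counts(normalize=True) before blaming imbalance.
# If only ~11% had subscribed, always predicting "no" would already reach ~0.89 accuracy,
# so imbalance alone does not explain a score of ~0.64.

# Better metrics: precision, recall, and ROC-AUC are more informative than accuracy alone.

# Business value: even with modest accuracy, the model indicates which features are associated with success, guiding resource allocation.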
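
To put an accuracy figure in context, it helps to compare it with a majority-class baseline and to report ROC-AUC from the predicted probabilities. The sketch below illustrates this on synthetic data (the 60/40 class split is an arbitrary assumption, not the bank dataset's ratio):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a 60/40 class split (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.6, 0.4], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))
# ROC-AUC needs scores or probabilities, not hard class labels
model_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(f"Model ROC-AUC:     {model_auc:.3f}")
```

In the notebook, the same comparison would use `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test`. A model is only useful if it clearly beats the baseline.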

# 9. Business Strategy

- Prioritize quality over quantity → Avoid calling the same customer too many times.
- Focus on the right channels → Use cellular over telephone.
- Target segments → Educated professionals, mid-age customers, and those with stable jobs show higher response.
- Optimize call duration → Train agents for meaningful, longer conversations rather than rushed calls.
- Re-engagement strategy → Don’t retry too soon; wait longer before calling the same customer again.