Email Campaign Performance Analysis

Section 1: Loading and Inspecting the Data

In [1]:
import pandas as pd

# Load CSV files
email_df = pd.read_csv("Dataset/email_table.csv")
opened_df = pd.read_csv("Dataset/email_opened_table.csv")
clicked_df = pd.read_csv("Dataset/link_clicked_table.csv")

# Display structure
email_df.head()

Unnamed: 0,email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases
0,85120,short_email,personalized,2,Sunday,US,5
1,966622,long_email,personalized,12,Sunday,UK,2
2,777221,long_email,personalized,11,Wednesday,US,2
3,493711,short_email,generic,6,Monday,UK,1
4,106887,long_email,generic,14,Monday,US,6


Section 2: Calculating the percentage of emails that were opened and the percentage whose link was clicked

In [11]:
# Add flags for whether an email was opened or clicked
email_df['opened'] = email_df['email_id'].isin(opened_df['email_id']).astype(int)
email_df['clicked'] = email_df['email_id'].isin(clicked_df['email_id']).astype(int)

# Calculate open and click-through rates
open_rate = email_df['opened'].mean() * 100
click_rate = email_df['clicked'].mean() * 100

print(f"Open Rate: {open_rate:.2f}%")
print(f"Click-Through Rate: {click_rate:.2f}%")

Open Rate: 10.35%
Click-Through Rate: 2.12%
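
A related metric is the click-to-open rate, the share of opened emails whose link was clicked. The sketch below computes all three rates from plain collections of email IDs. The helper name `engagement_rates` and the toy IDs are illustrative assumptions, not part of the notebook.

```python
def engagement_rates(sent_ids, opened_ids, clicked_ids):
    """Return open rate, click-through rate and click-to-open rate in percent.

    Hypothetical helper: IDs that do not appear in sent_ids are ignored.
    """
    sent = set(sent_ids)
    if not sent:
        raise ValueError("sent_ids must not be empty")
    opened = sent & set(opened_ids)
    clicked = sent & set(clicked_ids)
    # Only clicks on emails that were also recorded as opened count here
    clicked_after_open = clicked & opened
    return {
        "open_rate": 100 * len(opened) / len(sent),
        "click_through_rate": 100 * len(clicked) / len(sent),
        "click_to_open_rate": (
            100 * len(clicked_after_open) / len(opened) if opened else 0.0
        ),
    }


# Toy data (made up): 10 emails sent, 4 opened, 1 clicked
rates = engagement_rates(range(10), [0, 1, 2, 3], [2])
print(rates)
```

With the notebook's data, the same call could take `email_df['email_id']`, `opened_df['email_id']` and `clicked_df['email_id']` as inputs.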


Section 3: Building a model to optimize email targeting based on click probability

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define input features and target
features = ['email_text', 'email_version', 'hour', 'weekday', 'user_country', 'user_past_purchases']
target = 'clicked'
X = email_df[features]
y = email_df[target]

# Preprocessing
categorical_cols = ['email_text', 'email_version', 'weekday', 'user_country']
numeric_cols = ['hour', 'user_past_purchases']

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', 'passthrough', numeric_cols)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
pipeline.fit(X_train, y_train)

# Evaluate model
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     24470
           1       0.09      0.02      0.03       530

    accuracy                           0.98     25000
   macro avg       0.53      0.51      0.51     25000
weighted avg       0.96      0.98      0.97     25000

ROC AUC Score: 0.5815941738439832
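
The 0.98 accuracy mostly reflects class imbalance rather than predictive skill. As a quick check, the sketch below computes the accuracy of a trivial baseline that predicts "no click" for every email, using the support counts from the classification report. The variable names are illustrative.

```python
# Support counts taken from the classification report
n_not_clicked = 24470
n_clicked = 530
n_total = n_not_clicked + n_clicked

# A model that always predicts class 0 is correct on every non-click
baseline_accuracy = n_not_clicked / n_total
positive_rate = n_clicked / n_total

print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
print(f"Share of positives:        {positive_rate:.4f}")
```

Because this baseline already reaches about 0.98, the recall on class 1 and the ROC AUC say more about the model than accuracy does.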


Section 4: Calculating the rate by which the model can improve the click-through rate

In [15]:
# Add predicted probabilities to test set
X_test_copy = X_test.copy()
X_test_copy['true_clicked'] = y_test.values
X_test_copy['predicted_proba'] = y_proba

# Simulate targeting only top 10% most likely to click
top_10_percent = int(0.10 * len(X_test_copy))
top_users = X_test_copy.sort_values('predicted_proba', ascending=False).head(top_10_percent)

# Model-based CTR among top 10% likely users
model_ctr = top_users['true_clicked'].mean() * 100

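# Baseline CTR from the same held-out test set, so both rates cover the same emails
baseline_ctr = y_test.mean() * 100

print(f"Model-Based CTR (Top 10% Users): {model_ctr:.2f}%")
print(f"Improvement Over Actual CTR ({baseline_ctr:.2f}%): {model_ctr - baseline_ctr:.2f} percentage points")

Model-Based CTR (Top 10% Users): 3.96%
Improvement Over Actual CTR (2.12%): 1.84 percentage points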


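To test whether the model is useful for targeting, we simulate sending only to the 10% of test-set emails with the highest predicted click probability, then measure the actual click rate among them using the ground-truth labels. That group clicked at 3.96%, compared with 2.12% overall: 1.84 percentage points higher, or roughly 1.9 times the baseline. The model therefore ranks likely clickers better than chance, although the ROC AUC of 0.58 shows that the ranking is only modestly informative.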
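
The top-10% comparison can be wrapped in a small reusable function so that other cut-offs are easy to try. This is a minimal sketch using plain Python lists; the name `top_fraction_lift` and the toy labels and scores are assumptions made for illustration.

```python
def top_fraction_lift(y_true, scores, fraction=0.10):
    """Compare the click rate in the top-scored fraction with the overall rate.

    y_true and scores are equal-length lists; labels are 0 or 1.
    Returns (top_rate, baseline_rate, lift), with rates as fractions.
    """
    if len(y_true) != len(scores) or len(y_true) == 0:
        raise ValueError("y_true and scores must be non-empty and equal length")
    k = max(1, int(fraction * len(y_true)))
    # Rank emails from highest to lowest predicted click probability
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    baseline_rate = sum(y_true) / len(y_true)
    lift = top_rate / baseline_rate if baseline_rate else float("nan")
    return top_rate, baseline_rate, lift


# Toy example (made-up labels and scores): 10 emails, 2 clicks
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.1, 0.2, 0.1, 0.05]
top_rate, baseline_rate, lift = top_fraction_lift(labels, scores, fraction=0.2)
print(top_rate, baseline_rate, lift)
```

Taking the baseline from the same labels as the top group keeps the two rates comparable. In the notebook, `list(y_test)` and `list(y_proba)` would be the natural inputs.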

Section 5: Analysing the segment performance: did different user groups behave differently?

In [16]:
print("Click Rate by Email Version:")
print(email_df.groupby('email_version')['clicked'].mean() * 100)

print("\nClick Rate by Email Text:")
print(email_df.groupby('email_text')['clicked'].mean() * 100)

print("\nClick Rate by Country:")
print(email_df.groupby('user_country')['clicked'].mean() * 100)

print("\nClick Rate by Weekday:")
print(email_df.groupby('weekday')['clicked'].mean() * 100)

Click Rate by Email Version:
email_version
generic         1.513673
personalized    2.729409
Name: clicked, dtype: float64

Click Rate by Email Text:
email_text
long_email     1.853767
short_email    2.387177
Name: clicked, dtype: float64

Click Rate by Country:
user_country
ES    0.832748
FR    0.800400
UK    2.467526
US    2.435981
Name: clicked, dtype: float64

Click Rate by Weekday:
weekday
Friday       1.403682
Monday       2.290608
Saturday     1.784611
Sunday       1.675123
Thursday     2.444491
Tuesday      2.488864
Wednesday    2.761999
Name: clicked, dtype: float64


How the email campaign performed across different user segments:

1. Email Version Matters
Personalized emails clearly outperformed generic ones: 2.73% of personalized emails were clicked, compared with 1.51% of generic emails, roughly 1.8 times the rate. This indicates that tailoring content to individual users resonates better and encourages engagement. It supports the idea that a one-size-fits-all strategy is less effective and that incorporating user-specific details can boost results.

2. Short Emails Perform Better
Short emails were clicked more often than long ones (2.39% versus 1.85%). Users appeared to respond more positively when the message was concise and to the point. This suggests that simplifying the content and reducing the time needed to read it can improve click behavior.

3. Geographic Differences
Click performance differed sharply between countries. Users in the United States (2.44%) and the United Kingdom (2.47%) clicked at roughly three times the rate of users in France (0.80%) and Spain (0.83%). This may point to regional differences in interest, familiarity with the platform, or content relevance. It highlights the importance of localization: future campaigns should consider messages that better match regional preferences and cultural context, especially in the underperforming markets.

4. Day of the Week Impact
Engagement was not uniform throughout the week. Wednesday had the highest click-through rate (2.76%), followed by Tuesday (2.49%), Thursday (2.44%), and Monday (2.29%). Friday (1.40%), Sunday (1.68%), and Saturday (1.78%) were noticeably lower. Users therefore appear more responsive to marketing emails from Monday through Thursday, possibly because of workday routines, although the data here cannot confirm the cause.
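
Whether a gap between two segments is more than noise depends on the segment sizes, which the grouped output does not show. A two-proportion z-test is one common way to check. The sketch below uses only the standard library; the function name is an assumption, and the counts in the example are hypothetical rather than taken from the dataset.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two click rates.

    Returns (z, p_value) using the pooled-proportion standard error.
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 273 clicks in 10,000 emails vs 151 clicks in 10,000
z, p_value = two_proportion_z_test(273, 10_000, 151, 10_000)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

With the real data, the click counts and group sizes could come from `email_df.groupby(...)['clicked'].agg(['sum', 'count'])`.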

Summary of Recommendations:
- Use personalized and short-format emails.
- Target mid-week (especially Wednesday) for best engagement.
- Consider regional strategies: test localized content for low-performing countries such as France and Spain.
- Use these segment differences to guide feature engineering for the targeting model and to tailor content and send day for each user segment.