# Email Marketing Campaign


## Goal


Optimizing marketing campaigns is one of the most common data science tasks. Among the many possible marketing tools, one of the most efficient is email.

Emails are great because they are free, scalable, and easy to personalize. Email optimization involves personalizing the text and/or the subject, choosing who should receive the email, deciding when it should be sent, and so on. Machine learning excels at this.


## Challenge Description


The marketing team of an e-commerce site has launched an email campaign. This site has email addresses from all the users who created an account in the past.

They have chosen a random sample of users and emailed them. The email lets the user know about a new feature implemented on the site. From the marketing team's perspective, success means that the user clicks on the link inside the email. This link takes the user to the company site.

You are in charge of figuring out how the email campaign performed and have been asked the following questions:

- What percentage of users opened the email and what percentage clicked on the link within the email?

- The VP of marketing thinks that it is stupid to send emails in a random way. Based on all the information you have about the emails that were sent, can you build a model to optimize how future emails are sent, so as to maximize the probability of users clicking on the link inside the email?

- By how much do you think your model would improve click-through rate (defined as # of users who click on the link / total users who receive the email)? How would you test that?

- Did you find any interesting patterns in how the email campaign performed for different segments of users? Explain.

In [1]:
import pandas as pd

# Load the provided CSV files
email_opened = pd.read_csv("./email/email_opened_table.csv")
email_table = pd.read_csv("./email/email_table.csv")
link_clicked = pd.read_csv("./email/link_clicked_table.csv")

# Display the first few rows of each table to understand the structure
email_opened.head(), email_table.head(), link_clicked.head()

(   email_id
 0    284534
 1    609056
 2    220820
 3    905936
 4    164034,
    email_id   email_text email_version  hour    weekday user_country  \
 0     85120  short_email  personalized     2     Sunday           US   
 1    966622   long_email  personalized    12     Sunday           UK   
 2    777221   long_email  personalized    11  Wednesday           US   
 3    493711  short_email       generic     6     Monday           UK   
 4    106887   long_email       generic    14     Monday           US   
 
    user_past_purchases  
 0                    5  
 1                    2  
 2                    2  
 3                    1  
 4                    6  ,
    email_id
 0    609056
 1    870980
 2    935124
 3    158501
 4    177561)

### 1. What percentage of users opened the email and what percentage clicked on the link within the email?
To find these percentages, we'll calculate:

- The percentage of emails opened: $\frac{\text{Number of emails opened}}{\text{Total number of emails sent}} \times 100$

- The percentage of links clicked: $\frac{\text{Number of links clicked}}{\text{Total number of emails sent}} \times 100$

In [2]:
# Calculate the total number of emails sent
total_emails_sent = len(email_table)

# Calculate the number of emails opened
num_emails_opened = len(email_opened)

# Calculate the number of links clicked
num_links_clicked = len(link_clicked)

# Calculate the percentage of emails opened
percent_emails_opened = (num_emails_opened / total_emails_sent) * 100

# Calculate the percentage of links clicked
percent_links_clicked = (num_links_clicked / total_emails_sent) * 100

percent_emails_opened, percent_links_clicked

(10.345, 2.119)

1. Approximately 10.35% of users opened the email.
2. Approximately 2.12% of users clicked on the link within the email.

### 2. Can we build a model to optimize in the future how to send emails to maximize the probability of users clicking on the link inside the email?
To do this, we'll first need to preprocess our data to create a unified dataset with features that might impact the click-through rate (CTR). Then, we can build and evaluate a predictive model based on this dataset.

In [3]:
# Merge the dataframes to create a unified dataset
merged_data = email_table.merge(email_opened, on="email_id", how="left", indicator="opened")
merged_data = merged_data.merge(link_clicked, on="email_id", how="left", indicator="clicked")

# Create binary columns for 'opened' and 'clicked'
merged_data["opened"] = (merged_data["opened"] == "both").astype(int)
merged_data["clicked"] = (merged_data["clicked"] == "both").astype(int)

# Display the first few rows of the merged data
merged_data.head()

| | email_id | email_text | email_version | hour | weekday | user_country | user_past_purchases | opened | clicked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 85120 | short_email | personalized | 2 | Sunday | US | 5 | 0 | 0 |
| 1 | 966622 | long_email | personalized | 12 | Sunday | UK | 2 | 1 | 1 |
| 2 | 777221 | long_email | personalized | 11 | Wednesday | US | 2 | 0 | 0 |
| 3 | 493711 | short_email | generic | 6 | Monday | UK | 1 | 0 | 0 |
| 4 | 106887 | long_email | generic | 14 | Monday | US | 6 | 0 | 0 |
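As a small illustration of the merge pattern used above, the sketch below runs the same `indicator` trick on made-up toy tables. The names `sent` and `opened_log` and their values are hypothetical and not part of the dataset. A left merge labels each row `both` or `left_only`, and the binary flag is derived from that label.

```python
import pandas as pd

# Toy stand-ins for email_table and email_opened (hypothetical values)
sent = pd.DataFrame({"email_id": [1, 2, 3, 4]})
opened_log = pd.DataFrame({"email_id": [2, 4]})

# A left merge keeps every sent email; the indicator column records
# whether a matching row was found in the right-hand table
flagged = sent.merge(opened_log, on="email_id", how="left", indicator="opened")
flagged["opened"] = (flagged["opened"] == "both").astype(int)

# Equivalent flag without a merge, using membership in the log table
alt_flag = sent["email_id"].isin(opened_log["email_id"]).astype(int)

print(flagged)
```

The `isin` variant gives the same flag and avoids adding rows if the log table ever contained duplicate `email_id` values, which a merge would multiply.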


Now that we have merged the datasets, we need to preprocess the data further to make it suitable for modeling:

1. Convert categorical variables (like email_text, email_version, weekday, and user_country) into a numerical format using one-hot encoding.
2. Split the dataset into training and test sets.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical variables
encoder = OneHotEncoder(drop='first')
encoded_features = encoder.fit_transform(merged_data[['email_text', 'email_version', 'weekday', 'user_country']])
encoded_df = pd.DataFrame(encoded_features.toarray(), columns=encoder.get_feature_names_out(['email_text', 'email_version', 'weekday', 'user_country']))

# Concatenate the encoded features with the original dataframe
final_data = pd.concat([merged_data, encoded_df], axis=1)

# Drop the original categorical columns and email_id column
final_data = final_data.drop(columns=['email_id', 'email_text', 'email_version', 'weekday', 'user_country'])

# Split the data into training and test sets
X = final_data.drop(columns='clicked')
y = final_data['clicked']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head()

| | hour | user_past_purchases | opened | email_text_short_email | email_version_personalized | weekday_Monday | weekday_Saturday | weekday_Sunday | weekday_Thursday | weekday_Tuesday | weekday_Wednesday | user_country_FR | user_country_UK | user_country_US |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75220 | 6 | 0 | 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 48955 | 9 | 9 | 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 44966 | 11 | 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 13568 | 11 | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 92727 | 5 | 5 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Initialize the logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]

# Calculate accuracy and ROC AUC score
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

accuracy, roc_auc

(0.97735, 0.9508795859778058)

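The logistic regression model yields the following results:

- Accuracy: Approximately 97.74\%
- ROC AUC Score: Approximately 0.951

Accuracy is not informative here: only about 2.27\% of the test emails were clicked, so a model that never predicts a click would also score 97.74\%. The ROC AUC score measures how well the model ranks clicked emails above non-clicked ones, with 1 indicating perfect separation and 0.5 indicating random ranking. A score of 0.951 looks strong, but it is likely driven largely by the `opened` feature, which is only observed after an email has been sent. A model meant to decide how to send future emails cannot rely on `opened`, so this score probably overstates what is achievable at send time.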
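One way to sanity-check the accuracy figure is to compare it with a trivial baseline. The short sketch below uses only the two numbers reported in the notebook outputs (the test-set click rate of 0.02265 and the accuracy of 0.97735). It computes the accuracy of a hypothetical classifier that always predicts "no click".

```python
# Figures reported in the notebook outputs
baseline_ctr = 0.02265    # share of clicked emails in the test set
model_accuracy = 0.97735  # accuracy of the logistic regression

# A classifier that always predicts "no click" is right on every non-click
always_negative_accuracy = 1 - baseline_ctr

print(round(always_negative_accuracy, 5))
print(abs(always_negative_accuracy - model_accuracy) < 1e-9)
```

The two accuracies coincide, which is why a ranking metric such as ROC AUC is a better guide than accuracy on data this imbalanced.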

### 3. By how much do you think your model would improve click-through rate (CTR)? How would you test that?
To determine by how much our model could improve CTR, we can compare the model's predicted probabilities of clicking with the actual outcomes. If we use the model to prioritize sending emails to users with the highest predicted probabilities, the CTR should increase.

To test this:

1. We can sort users by their predicted probabilities of clicking from the model.
2. We can then simulate sending emails to the top N users and calculate the resulting CTR.
3. Compare this simulated CTR to the baseline CTR (without the model).

In [8]:
# Simulate sending emails based on the model's predicted probabilities

# Create a dataframe with actual labels and predicted probabilities
simulation_df = pd.DataFrame({
    'actual': y_test,
    'predicted_prob': y_pred_proba
})

# Sort the dataframe by predicted probabilities
simulation_df = simulation_df.sort_values(by='predicted_prob', ascending=False)

# Simulate sending emails to top N users (for demonstration, let's say N = 5000)
N = 5000
simulated_clicks = simulation_df.head(N)['actual'].sum()

# Calculate simulated CTR
simulated_ctr = simulated_clicks / N

# Calculate baseline CTR from the test set
baseline_ctr = y_test.sum() / len(y_test)

simulated_ctr, baseline_ctr

(0.088, 0.02265)

Based on the simulation:

- The simulated CTR (using model predictions to prioritize sending emails to the top 5,000 users with the highest predicted probabilities) is approximately 8.8\%.
- The baseline CTR (without using the model) from the test set is approximately 2.27\%.

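Thus, using the model's predictions to prioritize sending emails could potentially increase the CTR from 2.27\% to 8.8\%, roughly a 3.9-fold improvement. This estimate is optimistic: the ranking relies on the `opened` feature, which is not available before an email is sent, and the higher CTR comes from emailing only the top-ranked subset of users rather than from changing how any individual email is sent.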
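The size of the improvement can be expressed as a lift over the baseline. The sketch below is plain arithmetic on the figures reported above (`N`, the simulated CTR, and the baseline CTR) and introduces no new data.

```python
# Figures taken from the simulation output above
N = 5000
simulated_ctr = 0.088
baseline_ctr = 0.02265

# Lift: how many times higher the CTR is among the top-N users
lift = simulated_ctr / baseline_ctr

# Clicks observed among the top-N users versus the number expected
# if N users were picked at random from the test set
clicks_in_top_n = round(simulated_ctr * N)
expected_random_clicks = baseline_ctr * N

print(f"lift: {lift:.2f}x")
print(f"clicks in top {N}: {clicks_in_top_n} vs about {expected_random_clicks:.0f} at random")
```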

To validate this in a real-world scenario, one could conduct an A/B test with users randomly assigned to two groups:

- Group A (Control Group): Send emails randomly without using the model's predictions.
- Group B (Treatment Group): Use the model's predictions to prioritize sending emails.

After a predefined period, one can compare the CTRs of both groups, for example with a two-proportion significance test, to assess whether the model improves CTR.
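A minimal sketch of how the two groups might be compared, assuming a pooled two-proportion z-test is an acceptable choice for this experiment. The function name `two_proportion_z_test` and the click counts are hypothetical and serve only as illustration; they are not results from the campaign.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled click rate under the null hypothesis of no difference
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, for illustration only
z, p_value = two_proportion_z_test(clicks_a=210, n_a=10_000, clicks_b=270, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The group sizes and the test duration should be fixed before the experiment starts, so that the comparison is not biased by stopping as soon as a difference appears.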

### 4. Did you find any interesting pattern on how the email campaign performed for different segments of users?
To answer this, we can explore the feature importance from the logistic regression model and also visualize the CTR across different segments (e.g., based on user_country, email_version, etc.). Let's start by checking the feature importances.

In [9]:
# Extract feature importances from the logistic regression model (coefficients in this case)
feature_importances = logreg.coef_[0]
features = X_train.columns

# Create a dataframe for feature importances
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
})

# Sort the dataframe by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

importance_df

| | Feature | Importance |
|---|---|---|
| 2 | opened | 5.907902 |
| 10 | weekday_Wednesday | 0.2529 |
| 4 | email_version_personalized | 0.214044 |
| 1 | user_past_purchases | 0.167203 |
| 6 | weekday_Saturday | 0.159822 |
| 8 | weekday_Thursday | 0.151944 |
| 5 | weekday_Monday | 0.117337 |
| 9 | weekday_Tuesday | 0.095828 |
| 12 | user_country_UK | 0.069209 |
| 13 | user_country_US | 0.060273 |


Based on the logistic regression model's coefficients (used here as a rough proxy for feature importance; the table lists the ten largest of the 14 coefficients):

1. The most influential feature is whether the email was opened, which makes intuitive sense: a user who opens the email has a much higher chance of clicking the link inside. Note that `opened` is only known after the email has been sent.
2. Emails sent on Wednesdays seem to have a higher probability of getting clicked than emails sent on Fridays, the weekday dropped as the reference category by `drop='first'`.
3. Personalized emails have a positive coefficient relative to generic ones.
4. The number of past purchases has a positive coefficient, suggesting that users who have made more purchases are more likely to click on the link.
5. Emails to users from France (FR) have a negative coefficient (not among the rows shown above), implying a lower probability of getting clicked than for users from Spain (ES), the reference country.
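Logistic regression coefficients are changes in log-odds, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for three of the coefficients reported in the table above; the dictionary is simply a hand-copied subset of that output.

```python
from math import exp

# Coefficients copied from the table above
coefficients = {
    "weekday_Wednesday": 0.2529,
    "email_version_personalized": 0.214044,
    "user_past_purchases": 0.167203,
}

# exp(coefficient) is the multiplicative change in the odds of a click
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by about {ratio:.2f}")
```

For the binary indicator columns a one-unit increase means switching from the reference level to that level. For `user_past_purchases` it means one additional purchase, so the magnitudes are not directly comparable across the two kinds of feature.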


To better understand these patterns and visualize the CTR across different segments, we can explore the CTR based on user_country, email_version, and other categorical variables. Let's start with the user_country variable.

In [11]:
# Calculate CTR for each user country
country_ctr = merged_data.groupby('user_country')['clicked'].mean().sort_values(ascending=False)

country_ctr

user_country
UK    0.024675
US    0.024360
ES    0.008327
FR    0.008004
Name: clicked, dtype: float64

Based on the click-through rate (CTR) for different user countries:

- UK has the highest CTR at approximately 2.47\%.
- US follows closely with a CTR of approximately 2.44\%.
- ES (Spain) and FR (France) have much lower CTRs, around 0.83\% and 0.80\% respectively, roughly a third of the UK and US rates.

This aligns with the model's coefficients: France had a negative coefficient, indicating a lower probability of clicks, while the UK and the US had positive ones.

In [12]:
# Calculate CTR for each email version
email_version_ctr = merged_data.groupby('email_version')['clicked'].mean().sort_values(ascending=False)

email_version_ctr

email_version
personalized    0.027294
generic         0.015137
Name: clicked, dtype: float64

Based on the click-through rate (CTR) for different email versions:

- Personalized emails have a CTR of approximately 2.73\%.
- Generic emails have a lower CTR of approximately 1.51\%.

Personalized emails have a CTR about 1.8 times that of generic emails (2.73\% versus 1.51\%), which is consistent with the positive coefficient for `email_version_personalized` in the model.

This suggests that personalizing emails can lead to better user engagement and a higher likelihood of users clicking on the link inside the email.
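The same per-segment calculation can be repeated for any categorical column, and it helps to report the segment size next to the CTR so that small segments are not over-interpreted. The sketch below shows one way to do that on a made-up toy frame. The helper name `segment_ctr` and the toy values are hypothetical; on the real data the helper would be applied to `merged_data`.

```python
import pandas as pd

# Toy data with the same column names as merged_data (values are made up)
toy = pd.DataFrame({
    "email_version": ["personalized", "generic", "personalized",
                      "generic", "personalized", "generic"],
    "user_country": ["US", "US", "UK", "FR", "FR", "UK"],
    "clicked": [1, 0, 1, 0, 0, 0],
})


def segment_ctr(df, column):
    """Emails sent, clicks, and CTR for each level of a categorical column."""
    return (
        df.groupby(column)["clicked"]
        .agg(emails="count", clicks="sum", ctr="mean")
        .sort_values("ctr", ascending=False)
    )


# One summary table per segmentation variable
summaries = {col: segment_ctr(toy, col) for col in ["email_version", "user_country"]}

for col, table in summaries.items():
    print(table, end="\n\n")
```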

In summary, several patterns emerge from the data and the model:

- Whether an email was opened is by far the strongest predictor of a click.
- The day of the week seems to influence the likelihood of users clicking on the link, with Wednesday having the largest weekday coefficient.
- Personalized emails perform better than generic ones.
- CTR differs across countries, with the UK and the US having higher CTRs than Spain and France.