# Email Marketing Campaign


## Goal


Optimizing marketing campaigns is one of the most common data science tasks. Among the many marketing tools available, email is one of the most efficient.

Emails are great because they are free, scalable, and easily personalized. Optimizing an email campaign involves personalizing the text and/or the subject line, choosing who should receive the email, deciding when it should be sent, and so on. Machine learning excels at this.


## Challenge Description


The marketing team of an e-commerce site has launched an email campaign. This site has email addresses from all the users who created an account in the past.

They have chosen a random sample of users and emailed them. The email lets the user know about a new feature implemented on the site. From the marketing team's perspective, the email is a success if the user clicks on the link inside it. This link takes the user to the company site.

You are in charge of figuring out how the email campaign performed, and you have been asked the following questions:

- What percentage of users opened the email and what percentage clicked on the link within the email?

- The VP of marketing thinks that it is stupid to send emails in a random way. Based on all the information you have about the emails that were sent, can you build a model to optimize how future emails are sent so as to maximize the probability of users clicking on the link inside the email?

- By how much do you think your model would improve the click-through rate (defined as # of users who click on the link / total users who receive the email)? How would you test that?

- Did you find any interesting patterns in how the email campaign performed for different segments of users? Explain.

In [17]:
import pandas as pd

# Load the provided CSV files
email_opened = pd.read_csv("./email/email_opened_table.csv")
email_table = pd.read_csv("./email/email_table.csv")
link_clicked = pd.read_csv("./email/link_clicked_table.csv")

# Display the first few rows of each table to understand the structure
email_opened.head(), email_table.head(), link_clicked.head()

(   email_id
 0    284534
 1    609056
 2    220820
 3    905936
 4    164034,
    email_id   email_text email_version  hour    weekday user_country  \
 0     85120  short_email  personalized     2     Sunday           US   
 1    966622   long_email  personalized    12     Sunday           UK   
 2    777221   long_email  personalized    11  Wednesday           US   
 3    493711  short_email       generic     6     Monday           UK   
 4    106887   long_email       generic    14     Monday           US   
 
    user_past_purchases  
 0                    5  
 1                    2  
 2                    2  
 3                    1  
 4                    6  ,
    email_id
 0    609056
 1    870980
 2    935124
 3    158501
 4    177561)

### 1. What percentage of users opened the email and what percentage clicked on the link within the email?
To find these percentages, we'll calculate:

- The percentage of emails opened: $\frac{\text{Number of emails opened}}{\text{Total number of emails sent}} \times 100$

- The percentage of links clicked: $\frac{\text{Number of links clicked}}{\text{Total number of emails sent}} \times 100$

In [18]:
# Calculate the total number of emails sent
total_emails_sent = len(email_table)

# Calculate the number of emails opened
num_emails_opened = len(email_opened)

# Calculate the number of links clicked
num_links_clicked = len(link_clicked)

# Calculate the percentage of emails opened
percent_emails_opened = (num_emails_opened / total_emails_sent) * 100

# Calculate the percentage of links clicked
percent_links_clicked = (num_links_clicked / total_emails_sent) * 100

percent_emails_opened, percent_links_clicked

(10.345, 2.119)

1. Approximately 10.35% of users opened the email.
2. Approximately 2.12% of users clicked on the link within the email.
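
Because the emailed users are a random sample, these rates are estimates, and it can help to attach an uncertainty range to them. The following is a minimal sketch using the normal-approximation (Wald) interval for a proportion. The helper name `proportion_ci` and the counts passed to it are hypothetical and for illustration only; in the notebook, `num_links_clicked` and `total_emails_sent` would be used instead.

```python
import math

def proportion_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion, in percent."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * (p - half_width), 100 * (p + half_width)

# Hypothetical counts for illustration only (not taken from the dataset)
low, high = proportion_ci(successes=20, total=1000)
print(f"Click rate 95% CI: {low:.2f}% to {high:.2f}%")
```

The interval narrows as the number of emails grows, so with a large sample the point estimates above should be fairly tight.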

### 2. Can we build a model to optimize in the future how to send emails to maximize the probability of users clicking on the link inside the email?
To do this, we'll first need to preprocess our data to create a unified dataset with features that might impact the click-through rate (CTR). Then, we can build and evaluate a predictive model based on this dataset.

In [19]:
# Merge the dataframes to create a unified dataset
merged_data = email_table.merge(email_opened, on="email_id", how="left", indicator="opened")
merged_data = merged_data.merge(link_clicked, on="email_id", how="left", indicator="clicked")

# Create binary columns for 'opened' and 'clicked'
merged_data["opened"] = (merged_data["opened"] == "both").astype(int)
merged_data["clicked"] = (merged_data["clicked"] == "both").astype(int)

# Display the first few rows of the merged data
merged_data.head()

|   | email_id | email_text | email_version | hour | weekday | user_country | user_past_purchases | opened | clicked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 85120 | short_email | personalized | 2 | Sunday | US | 5 | 0 | 0 |
| 1 | 966622 | long_email | personalized | 12 | Sunday | UK | 2 | 1 | 1 |
| 2 | 777221 | long_email | personalized | 11 | Wednesday | US | 2 | 0 | 0 |
| 3 | 493711 | short_email | generic | 6 | Monday | UK | 1 | 0 | 0 |
| 4 | 106887 | long_email | generic | 14 | Monday | US | 6 | 0 | 0 |
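
With `opened` and `clicked` stored as 0/1 columns, the open and click rates for any segment are simply group means. The sketch below shows one way this could be computed. The helper `segment_rates` and the toy DataFrame are assumptions made for illustration (the values are made up); in the notebook the function would be applied to `merged_data`.

```python
import pandas as pd

# Toy stand-in for merged_data; the values are made up for illustration
toy = pd.DataFrame({
    "email_version": ["personalized", "generic", "personalized", "generic"],
    "weekday": ["Sunday", "Monday", "Sunday", "Monday"],
    "opened": [1, 0, 1, 1],
    "clicked": [1, 0, 0, 0],
})

def segment_rates(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return email counts plus open and click rates (in percent) per segment."""
    out = df.groupby(by).agg(
        emails=("clicked", "size"),
        open_rate=("opened", "mean"),
        click_rate=("clicked", "mean"),
    )
    # The mean of a 0/1 column is a fraction; convert it to a percentage
    out[["open_rate", "click_rate"]] = out[["open_rate", "click_rate"]] * 100
    return out.sort_values("click_rate", ascending=False)

rates = segment_rates(toy, "email_version")
print(rates)
```

Calling the same helper with `"weekday"`, `"user_country"`, or `"email_text"` would give a first look at which segments respond differently.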


Now that we have merged the datasets, we need to preprocess the data further to make it suitable for modeling:

1. Convert categorical variables (like email_text, email_version, weekday, and user_country) into a numerical format using one-hot encoding.
2. Split the dataset into training and test sets.
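
The two steps above can also be sketched with pandas alone, which may be convenient when scikit-learn is not installed. This is an illustrative alternative, not the notebook's method: it swaps in `pd.get_dummies` for `OneHotEncoder` and random row sampling for `train_test_split`. The toy rows mirror the five preview rows of the merged data shown earlier.

```python
import pandas as pd

# Toy rows with the same column names as merged_data (copied from the preview above)
toy = pd.DataFrame({
    "email_text": ["short_email", "long_email", "long_email", "short_email", "long_email"],
    "email_version": ["personalized", "personalized", "generic", "generic", "generic"],
    "weekday": ["Sunday", "Sunday", "Wednesday", "Monday", "Monday"],
    "user_country": ["US", "UK", "US", "UK", "US"],
    "hour": [2, 12, 11, 6, 14],
    "user_past_purchases": [5, 2, 2, 1, 6],
    "clicked": [0, 1, 0, 0, 0],
})

categorical = ["email_text", "email_version", "weekday", "user_country"]

# drop_first=True drops one level per variable, similar to OneHotEncoder(drop='first')
encoded = pd.get_dummies(toy, columns=categorical, drop_first=True, dtype=int)

# 80/20 split by randomly sampling rows; the remaining rows form the test set
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)

X_train, y_train = train.drop(columns="clicked"), train["clicked"]
X_test, y_test = test.drop(columns="clicked"), test["clicked"]
print(X_train.shape, X_test.shape)
```

Note that `pd.get_dummies` only creates columns for categories present in the data it is given, so for a production pipeline a fitted encoder is usually the safer choice.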

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical variables
encoder = OneHotEncoder(drop='first')
encoded_features = encoder.fit_transform(merged_data[['email_text', 'email_version', 'weekday', 'user_country']])
encoded_df = pd.DataFrame(encoded_features.toarray(), columns=encoder.get_feature_names_out(['email_text', 'email_version', 'weekday', 'user_country']))

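# Concatenate the encoded features with the original dataframe
final_data = pd.concat([merged_data, encoded_df], axis=1)

# Drop the original categorical columns and the email_id column
final_data = final_data.drop(columns=['email_id', 'email_text', 'email_version', 'weekday', 'user_country'])

# Split the data into training and test sets.
# 'opened' is excluded from the features: it is only known after an email has been sent,
# so it cannot guide how to send future emails and would leak the outcome into the model.
X = final_data.drop(columns=['clicked', 'opened'])
y = final_data['clicked']
# Stratify on y so the rare positive class (about 2% clicks) is equally represented in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)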

X_train.head()

ModuleNotFoundError: No module named 'sklearn'

In [16]:
# The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn

