# Email marketing campaign

Assume we have run a marketing campaign and want to explore how it performed. We want to find out which segments responded better, and which sending strategy performs better than sending the emails to users at random.

There are three data sets. 

1- ``emails``: the characteristics of each email sent, e.g., the length of the email, whether it was personalized, the time it was sent, the user's country, and the user's number of past purchases.

2- ``opened``: the ``email_id`` of each email that was opened.

3- ``clicked``: the ``email_id`` of each email whose link was clicked.

In [57]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

In [2]:
opened = pd.read_csv('data/email/email_opened_table.csv')
# add an indicator column that marks the email as opened
opened['open'] = 1

In [3]:
opened.head()

Unnamed: 0,email_id,open
0,284534,1
1,609056,1
2,220820,1
3,905936,1
4,164034,1


In [4]:
clicked = pd.read_csv('data/email/link_clicked_table.csv')
clicked['click'] = 1
clicked.head()

Unnamed: 0,email_id,click
0,609056,1
1,870980,1
2,935124,1
3,158501,1
4,177561,1


In [5]:
email = pd.read_csv('data/email/email_table.csv')
email.head()

Unnamed: 0,email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases
0,85120,short_email,personalized,2,Sunday,US,5
1,966622,long_email,personalized,12,Sunday,UK,2
2,777221,long_email,personalized,11,Wednesday,US,2
3,493711,short_email,generic,6,Monday,UK,1
4,106887,long_email,generic,14,Monday,US,6


In [6]:
email.info() # there is no null value in the email data set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
email_id               100000 non-null int64
email_text             100000 non-null object
email_version          100000 non-null object
hour                   100000 non-null int64
weekday                100000 non-null object
user_country           100000 non-null object
user_past_purchases    100000 non-null int64
dtypes: int64(3), object(4)
memory usage: 5.3+ MB


### Joining the tables

In [7]:
table = email.merge(opened, how = 'left', on = 'email_id')
table = table.merge(clicked, how = 'left', on = 'email_id')
table.head()

Unnamed: 0,email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases,open,click
0,85120,short_email,personalized,2,Sunday,US,5,,
1,966622,long_email,personalized,12,Sunday,UK,2,1.0,1.0
2,777221,long_email,personalized,11,Wednesday,US,2,,
3,493711,short_email,generic,6,Monday,UK,1,,
4,106887,long_email,generic,14,Monday,US,6,,


Since the original email data set had no nulls, the only NaNs here come from the left joins (emails that were never opened or clicked), so we can replace all of them with 0.

In [8]:
table.fillna(0, inplace=True) # replacing all nans with zeros
table.head(2)

Unnamed: 0,email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases,open,click
0,85120,short_email,personalized,2,Sunday,US,5,0.0,0.0
1,966622,long_email,personalized,12,Sunday,UK,2,1.0,1.0
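
A slightly more defensive variant of the join is sketched below on toy data. The frame names ending in `_toy` and their values are illustrative assumptions, not the campaign data. It fills only the two flag columns, stores them as integers, and uses the `validate` argument of `merge` to fail early if an `email_id` is duplicated.

```python
import pandas as pd

# Toy stand-ins for the three tables (illustrative values only).
email_toy = pd.DataFrame({"email_id": [1, 2, 3], "hour": [2, 12, 11]})
opened_toy = pd.DataFrame({"email_id": [2], "open": [1]})
clicked_toy = pd.DataFrame({"email_id": [2], "click": [1]})

# validate="one_to_one" raises a MergeError if email_id is not unique on both sides.
merged = (
    email_toy
    .merge(opened_toy, how="left", on="email_id", validate="one_to_one")
    .merge(clicked_toy, how="left", on="email_id", validate="one_to_one")
)

# Fill only the flag columns, and keep them as integers instead of floats.
for col in ["open", "click"]:
    merged[col] = merged[col].fillna(0).astype(int)

print(merged)
```

Filling only the flag columns means a NaN that appears later in another column would not be silently turned into 0.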


In [9]:
print("Percentage of people who opened the emails:\t%", np.mean(table.open) * 100)
print("Percentage of people who also clicked the emails:\t%", np.mean(table.click) * 100)

Percentage of people who opened the emails:	% 10.345
Percentage of people who also clicked the emails:	% 2.119


The conversion rate (the share of emails whose link was clicked) is about 2%, which is below the industry standard.
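
The two percentages can also be combined into a click-to-open rate. The sketch below uses only the counts implied by the output above, and it assumes that every click comes from an opened email, which would need to be confirmed on the data.

```python
# Counts implied by the output above: 100,000 emails, 10.345% opened, 2.119% clicked.
n_sent = 100_000
n_opened = 10_345
n_clicked = 2_119

open_rate = n_opened / n_sent
click_rate = n_clicked / n_sent
# Share of opened emails that also produced a click
# (assumption: clicks only happen on opened emails).
click_to_open = n_clicked / n_opened

print(f"open rate:          {open_rate:.3%}")
print(f"click rate:         {click_rate:.3%}")
print(f"click-to-open rate: {click_to_open:.1%}")
```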

### EDA
First, let's take a look at the distribution of the data to make sure there are no outliers.

In [10]:
table.describe() # numerical variables seem consistent and clean

Unnamed: 0,email_id,hour,user_past_purchases,open,click
count,100000.0,100000.0,100000.0,100000.0,100000.0
mean,498690.19616,9.0593,3.87845,0.10345,0.02119
std,289230.727534,4.439637,3.19611,0.304547,0.144018
min,8.0,1.0,0.0,0.0,0.0
25%,246708.25,6.0,1.0,0.0,0.0
50%,498447.0,9.0,3.0,0.0,0.0
75%,749942.75,12.0,6.0,0.0,0.0
max,999998.0,24.0,22.0,1.0,1.0


In [11]:
table.describe(include = ['O']) # similarly categorical variables are also clean

Unnamed: 0,email_text,email_version,weekday,user_country
count,100000,100000,100000,100000
unique,2,2,7,4
top,long_email,generic,Saturday,US
freq,50276,50209,14569,60099


### Segment analysis
I perform the segment analysis based on customer behaviour and email characteristics. Our end metric is the conversion rate, so we focus on that metric and drop the open metric, which is an intermediate step (I will drop it for the final model).

#### Categorical variables

In [12]:
columns_to_drop = ['open']

In [13]:
def categorical_seg_analysis(category):
    ''' Small helper function: click rate and row count per level of `category`. '''
    return table.groupby(category).click.agg(['mean','count']).sort_values('mean', ascending = False)

In [14]:
categorical_seg_analysis('email_text')

Unnamed: 0_level_0,mean,count
email_text,Unnamed: 1_level_1,Unnamed: 2_level_1
short_email,0.023872,49724
long_email,0.018538,50276


Short emails show a noticeably higher conversion rate than long ones (2.39% vs 1.85%).
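
One way to check whether a gap like this is larger than sampling noise is a two-proportion z-test. The sketch below is an addition, not part of the original analysis. The click counts are reconstructed from the table above as `round(mean * count)`, and the helper name `two_proportion_z_test` is hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one click rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# short_email: 1187 clicks out of 49724; long_email: 932 clicks out of 50276.
z, p_value = two_proportion_z_test(1187, 49724, 932, 50276)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

The same helper can be applied to the other two-level split (`email_version`) by swapping in its counts.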

In [15]:
categorical_seg_analysis('email_version')

Unnamed: 0_level_0,mean,count
email_version,Unnamed: 1_level_1,Unnamed: 2_level_1
personalized,0.027294,49791
generic,0.015137,50209


Personalized emails convert at roughly 1.8 times the rate of generic ones (2.7% vs 1.5%).

In [16]:
categorical_seg_analysis('weekday')

Unnamed: 0_level_0,mean,count
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1
Wednesday,0.02762,14084
Tuesday,0.024889,14143
Thursday,0.024445,14277
Monday,0.022906,14363
Saturday,0.017846,14569
Sunday,0.016751,14387
Friday,0.014037,14177


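Day of the week also has a large effect on the conversion rate: emails sent Monday to Thursday convert at 2.3% to 2.8%, while those sent Friday to Sunday convert at 1.4% to 1.8%.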

In [17]:
categorical_seg_analysis('user_country')

Unnamed: 0_level_0,mean,count
user_country,Unnamed: 1_level_1,Unnamed: 2_level_1
UK,0.024675,19939
US,0.02436,60099
ES,0.008327,9967
FR,0.008004,9995


These results do not directly affect our analysis, since the user's country is not something we control. However, we see that different countries show very different conversion rates. We need to check whether this pattern is consistent across other marketing platforms. If it is not, we may have an issue specific to non-English-speaking countries: the email may be in English, and we may need to translate it accurately into other languages. More research is needed on this topic.
We will drop this column from the final model.

In [18]:
columns_to_drop.append('user_country')

In [19]:
categorical_seg_analysis('hour')

Unnamed: 0_level_0,mean,count
hour,Unnamed: 1_level_1,Unnamed: 2_level_1
23,0.041379,145
24,0.028986,69
10,0.02824,8180
11,0.027128,7483
9,0.025794,8529
12,0.025661,6508
15,0.024907,3493
16,0.023197,2759
14,0.020742,4580
13,0.019889,5581


Again, we see very different conversion rates depending on the hour at which the email is sent. For example, the conversion rate for emails sent at 23:00 is 4%, while at 21:00 it is 0.8% (almost one fifth). Note, however, that only 145 emails were sent at 23:00 and 69 at 24:00, so the estimates for these late hours are noisy.
This seems a very valuable result, as hour of sending is a very actionable feature.
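
To get a feel for how uncertain a rate based on 145 emails is, a confidence interval can be sketched. The block below is an addition to the analysis. It uses the Wilson score interval, the count of 6 clicks is reconstructed from the table as `round(0.041379 * 145)`, and the helper name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hour 23: 6 clicks out of 145 emails.
low, high = wilson_interval(6, 145)
overall_rate = 0.02119  # overall click rate of the campaign

print(f"95% interval for hour 23: {low:.3%} to {high:.3%}")
print("overall rate inside interval:", low < overall_rate < high)
```

If the overall campaign rate falls inside the interval, the data alone cannot rule out that 23:00 performs no better than average.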

#### Numerical variables

We will group by the target so that we can capture how the numerical variables differ between emails that were clicked and those that were not.

In [20]:
table.groupby('click')['user_past_purchases'].agg(['mean','count']).sort_values('mean', ascending=True)

Unnamed: 0_level_0,mean,count
click,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,3.828864,97881
1.0,6.168948,2119


It shows that users who clicked had more past purchases on average (6.2 vs 3.8), i.e., loyal customers click through more often. However, this feature is also not actionable, though it suggests that building a loyal user base would also make users more responsive to marketing campaigns.

In [21]:
columns_to_drop.append('user_past_purchases')
columns_to_drop.append('email_id')

### Developing model

I suggest using a random forest classifier as our model, for two reasons. First, it provides feature importances, which help us choose our marketing strategy by focusing on the features that matter most.
Second, the data consists mostly of categorical variables, which tree-based models handle well, and no feature scaling step is required.

The first step is to clean the data set.

In [22]:
table = table.drop(labels=columns_to_drop, axis=1)
table.head()

Unnamed: 0,email_text,email_version,hour,weekday,click
0,short_email,personalized,2,Sunday,0.0
1,long_email,personalized,12,Sunday,1.0
2,long_email,personalized,11,Wednesday,0.0
3,short_email,generic,6,Monday,0.0
4,long_email,generic,14,Monday,0.0


I am building dictionaries to convert the categorical variables to numerical codes. We could also use the pandas Categorical type. However, since we want to use this model later to predict the best possible conversion rate, we need to keep the dictionaries for future mapping.

In [27]:
email_text_categories = table.email_text.unique()
email_text_dict = {email_text_categories[i]:i for i in range(len(email_text_categories))}
email_text_dict

{'long_email': 1, 'short_email': 0}

In [29]:
email_version_categories = table.email_version.unique()
email_version_dict = {email_version_categories[i]:i for i in range(len(email_version_categories))}
email_version_dict

{'generic': 1, 'personalized': 0}

In [30]:
weekday_categories = table.weekday.unique()
weekday_dict = {weekday_categories[i]:i for i in range(len(weekday_categories))}
weekday_dict

{'Friday': 4,
 'Monday': 2,
 'Saturday': 3,
 'Sunday': 0,
 'Thursday': 6,
 'Tuesday': 5,
 'Wednesday': 1}
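
The three cells above repeat the same pattern, so it can be factored into a small helper. This is a minimal sketch in plain Python. The names `build_category_mapping` and `encode` are hypothetical, and the short list stands in for a dataframe column.

```python
def build_category_mapping(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    distinct = list(dict.fromkeys(values))  # dicts keep insertion order (Python 3.7+)
    return {category: code for code, category in enumerate(distinct)}

def encode(values, mapping):
    """Encode values with a saved mapping; an unseen category raises KeyError."""
    return [mapping[value] for value in values]

# Stand-in for table.email_text (illustrative values only).
email_text_column = ["short_email", "long_email", "long_email", "short_email"]
mapping = build_category_mapping(email_text_column)
codes = encode(["short_email", "long_email"], mapping)

print(mapping)
print(codes)
```

Because the codes follow order of first appearance, they depend on row order, which is one more reason to save the dictionaries rather than rebuild them on new data.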

In [31]:
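def data_frame_normalizer(df):
    ''' Replace the categorical columns with the integer codes from the dictionaries above. '''
    df = df.copy() # work on a copy so the caller's dataframe is not modified in place
    # an unseen category raises a KeyError instead of silently producing NaN
    df['weekday'] = df['weekday'].apply(lambda x: weekday_dict[x])
    df['email_version'] = df['email_version'].apply(lambda x: email_version_dict[x])
    df['email_text'] = df['email_text'].apply(lambda x: email_text_dict[x])
    return df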

In [32]:
table = data_frame_normalizer(table)
table.head()

Unnamed: 0,email_text,email_version,hour,weekday,click
0,0,0,2,0,0.0
1,1,0,12,0,1.0
2,1,0,11,1,0.0
3,0,1,6,2,0.0
4,1,1,14,2,0.0


In [33]:
x_col = table.columns.values[:-1] 
y_col = table.columns.values[-1]

In [39]:
X = table[x_col]
y = table[y_col]

In [40]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y) # we stratify on y since the classes are heavily skewed

Continuing our EDA, let's perform a feature importance analysis using an ensemble of extremely randomized trees (`ExtraTreesClassifier`), a close relative of random forest.

In [None]:
forest = ExtraTreesClassifier(n_estimators=1000, random_state=0, min_samples_leaf=100, max_features=3)
forest.fit(X, y)

In [55]:
importance = forest.feature_importances_ # impurity-based importance of each feature
indices = np.argsort(importance)[::-1] # feature indices sorted by descending importance
print("Feature ranking:")
for f in range(len(x_col)):
    print("feature {} - {}:\t{} ".format(f+1, x_col[indices[f]], round(100*importance[indices[f]],2)))

Feature ranking:
feature 1 - hour:	42.7 
feature 2 - weekday:	25.9 
feature 3 - email_version:	25.39 
feature 4 - email_text:	6.01 


The results also coincide with our intuition from the segment analysis. Hour of the day accounts for the largest share of the importance (about 43%), meaning that splits on hour produce the largest reduction in impurity.
Second is day of the week, closely followed by email version, while email text contributes the least.
Therefore, if we have limited resources/time for the next email marketing campaign, it is best to focus on time of the day and then day of the week.
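
Impurity-based importances can favor features with many distinct values, and `hour` has 24 levels while the email features have 2. Permutation importance on held-out data is a common cross-check. The sketch below is an addition and runs on synthetic data, not the campaign data: by construction `hour` has no effect and `version` drives the clicks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Synthetic stand-in data: hour is pure noise, version changes the click probability.
hour = rng.integers(1, 25, size=n)
version = rng.integers(0, 2, size=n)
p_click = np.where(version == 1, 0.30, 0.10)
y = (rng.random(n) < p_click).astype(int)
X = np.column_stack([hour, version])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X_tr, y_tr)

# Drop in held-out ROC AUC when each column is shuffled, averaged over 5 repeats.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)
for name, mean, std in zip(["hour", "version"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Running the same check on the real held-out split would show whether the high rank of `hour` survives once cardinality is taken out of the picture.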

Next, let's do some modelling and fit a random forest classifier. This allows us to predict the highest conversion rate we can expect.

In [60]:
scoring = {'AUC': 'roc_auc'}
rf=RandomForestClassifier(n_jobs=-1)
param_grid = {"max_depth": [3, 10, None],
              "max_features": ["log2"],
              "min_samples_split": [50, 100,200],
              "min_samples_leaf": [50, 100],
              "bootstrap": [True],
              "criterion": ["gini"]}
clf_rf = GridSearchCV(rf, param_grid=param_grid,scoring=scoring, cv=3, refit='AUC')
clf_rf.fit(x_train, y_train)
print("Best parameters are {}".format(clf_rf.best_params_))
print("Best score is {}".format(clf_rf.best_score_))

Best parameters are {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 100, 'min_samples_split': 100}
Best score is 0.6110770858248712


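The cross-validated ROC AUC is only 0.61, where 0.5 is chance level, so the model ranks clickers above non-clickers only slightly better than random. Whatever results we get from it should be treated with corresponding caution. Below I loosely treat this score as a 60% confidence level, which is a rough heuristic: AUC measures ranking quality, not the probability that a prediction is correct.

Let's build an observation that combines all the email segments we found to lean toward a higher conversion rate, and see what the model predicts for it.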
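
To make the meaning of the ROC AUC score concrete: it equals the share of (clicked, not clicked) pairs in which the clicked email receives the higher score. The sketch below computes it by brute force on a handful of made-up scores. The helper name `pairwise_auc` and the numbers are illustrative assumptions.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up model scores and true click labels.
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

A score of 0.61 therefore says that in about 61% of such pairs the model ranks the clicked email higher. It says nothing directly about how accurate an individual predicted probability is.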

In [68]:
best_conversion_rate_df = pd.DataFrame({'email_text':['short_email'], 'email_version':['personalized'], 'weekday':['Wednesday'], 'hour':[23]})
best_conversion_rate_df.head()

Unnamed: 0,email_text,email_version,hour,weekday
0,short_email,personalized,23,Wednesday


In [64]:
best_conversion_rate_df = data_frame_normalizer(best_conversion_rate_df)

In [65]:
best_conversion_rate_df.head()

Unnamed: 0,email_text,email_version,hour,weekday
0,0,0,23,1


In [67]:
clf_rf.predict_proba(best_conversion_rate_df)

array([[0.96887866, 0.03112134]])

This shows that the model predicts (with an AUC of only about 0.6, which I loosely treat as 60% confidence) that if we send emails on Wednesdays at 11:00 pm and keep them short and personalized, we can increase the overall conversion rate to 3.1%. Recall that our base conversion rate was 2.1%. This is a relative increase of about 47% in click-throughs for this marketing campaign, and in revenue only if revenue scales with clicks.

If we treat the 60% as a weight on the model's prediction, a rough expected conversion rate is 40% * old conversion rate + 60% * new conversion rate = 0.4 * 2.1% + 0.6 * 3.1%, which is about 2.7%. Therefore, under this heuristic, we could expect an increase of around 0.6 percentage points in the conversion rate.
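
The weighted-average arithmetic can be written out as follows. The 0.6 weight is the heuristic used in the text, not a calibrated probability, and the two rates are copied from the outputs above.

```python
base_rate = 0.02119          # overall click rate from the joined table
predicted_rate = 0.03112134  # predict_proba output for the hand-picked segment
weight = 0.6                 # heuristic weight on the model's prediction (assumption)

blended_rate = (1 - weight) * base_rate + weight * predicted_rate
absolute_uplift = blended_rate - base_rate        # difference in rates
relative_uplift = predicted_rate / base_rate - 1  # predicted vs base, before weighting

print(f"blended rate:    {blended_rate:.2%}")
print(f"absolute uplift: {absolute_uplift * 100:.2f} percentage points")
print(f"relative uplift: {relative_uplift:.0%}")
```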

# Conclusion

We have performed an analysis of an email marketing campaign. About 10% of the emails in this campaign were opened, and about 2% led to a click through to the website.
We noticed that customers tend to react better to this campaign when the email is short and personalized and is sent on Wednesdays at 11:00 pm, although the 11:00 pm result rests on only 145 emails. We also showed that hour of the day is the most important factor in predicting the conversion rate.

Finally, we built a model suggesting that we can expect an increase of roughly 0.6 percentage points in the conversion rate (from about 2.1% to about 2.7%) if we use this campaign strategy.