I work in the marketing department, and my product manager has asked me to come back to her with project ideas for improving __email click-through rate__. The company has been sending marketing emails and wants to increase __the percentage of people who click on the company link inside the email__.

__table 1__
email_id, email_text, email_version, clicked
8       , long_email, generic,       0
9       , short_email, personalized, 1

- email_id : the ID of the email that was sent. It is unique per email
- email_text : two different versions of the email have been sent: one has “long text” (i.e. four paragraphs) and one has “short text” (just two paragraphs)
- email_version : some emails were “personalized” (i.e. they had the name of the recipient in the greeting, such as “Hi John,”), while others were “generic” (the greeting was just “Hi,”)
- clicked : whether the user clicked on the link inside the email. This is our label, and the goal of the project is to increase it

__table 2__
email_id, hour, weekday
9       , 14  , Thursday

- hour : the local time at which the email was sent
- weekday : the weekday on which the email was sent

__table 3__
user_id, user_country

- user_country : the country where the user receiving the email is based. It is derived from the user's IP address at the time they created the account

__table 4__
user_id, user_past_purchases

- user_past_purchases : how many items in the past were bought by the user receiving the email

__table 5__
user_id, email_id
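
The notebook below reads a single pre-joined file, but the five tables above could be combined along these lines. This is only a sketch: the frame names (`email`, `sent`, `country`, `purchases`, `user_email`) and all the values are hypothetical stand-ins, not the real data.

```python
import pandas

# Hypothetical miniature versions of tables 1-5 (values are made up).
email = pandas.DataFrame({
    "email_id": [101, 102],
    "email_text": ["long_email", "short_email"],
    "email_version": ["generic", "personalized"],
    "clicked": [0, 1],
})
sent = pandas.DataFrame({"email_id": [101, 102], "hour": [9, 14],
                         "weekday": ["Monday", "Thursday"]})
country = pandas.DataFrame({"user_id": [1, 2], "user_country": ["US", "UK"]})
purchases = pandas.DataFrame({"user_id": [1, 2], "user_past_purchases": [3, 0]})
user_email = pandas.DataFrame({"user_id": [1, 2], "email_id": [101, 102]})

# Email-level tables join on email_id; table 5 links each email to a user,
# which lets the user-level tables join on user_id.
merged = (
    email.merge(sent, on="email_id", how="left")
         .merge(user_email, on="email_id", how="left")
         .merge(country, on="user_id", how="left")
         .merge(purchases, on="user_id", how="left")
         .drop(columns="user_id")
)
print(merged.columns.tolist())
```

Left joins keep every email even if a user-level attribute is missing, which seems the safer default when the label lives in the email table.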

> I will use logistic regression given that the label I am trying to predict ("clicked") is binary.

In [1]:
import pandas
import statsmodels.api as sm
pandas.set_option('display.max_columns', 20)
pandas.set_option('display.width', 350)

In [2]:
from pandas.api.types import CategoricalDtype

In [3]:
data_path = '/Users/Bien/Documents/Data_Science/Project_data/'

In [4]:
data = pandas.read_csv(data_path + "emails.csv")

In [5]:
data.head()

,email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases,clicked
0,8,short_email,generic,9,Thursday,US,3,0
1,33,long_email,personalized,6,Monday,US,0,0
2,46,short_email,generic,14,Tuesday,US,3,0
3,49,long_email,personalized,11,Thursday,US,10,0
4,65,short_email,generic,8,Wednesday,UK,3,0


In [6]:
data.shape

(99950, 8)

In [7]:
data.describe()

,email_id,hour,user_past_purchases,clicked
count,99950.0,99950.0,99950.0,99950.0
mean,498695.729065,9.0591,3.878559,0.0207
std,289226.115244,4.439618,3.196324,0.14238
min,8.0,1.0,0.0,0.0
25%,246721.5,6.0,1.0,0.0
50%,498441.5,9.0,3.0,0.0
75%,749936.75,12.0,6.0,0.0
max,999998.0,24.0,22.0,1.0


Before building the regression, we need to know the reference level of each categorical variable.

In [8]:
data.dtypes

email_id                int64
email_text             object
email_version          object
hour                    int64
weekday                object
user_country           object
user_past_purchases     int64
clicked                 int64
dtype: object

In [9]:
#Set the dtypes of the "object" columns to "category", so we can check and change the reference level
data[data.select_dtypes(['object']).columns] = data[data.select_dtypes(['object']).columns].astype("category")

In [10]:
# Select the categorical variables
data_categorical = data.select_dtypes(['category'])
#find reference level, i.e. the first level
print(data_categorical.apply(lambda x: x.cat.categories[0])) 
#DataFrame.apply passes each column of the dataframe to the lambda as x

email_text       long_email
email_version       generic
weekday              Friday
user_country             ES
dtype: object


> Change the __reference level__ of a categorical column

In [13]:
data['email_text'].dtype

CategoricalDtype(categories=['long_email', 'short_email'], ordered=False)

In [14]:
cat_type = CategoricalDtype(categories=['short_email','long_email'], ordered=False) # the first category listed becomes the reference level

In [15]:
data['email_text'] = data['email_text'].astype(cat_type)

In [16]:
data_categorical = data.select_dtypes(['category'])
print(data_categorical.apply(lambda x: x.cat.categories[0])) 

email_text       short_email
email_version        generic
weekday               Friday
user_country              ES
dtype: object
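
The same idea can be wrapped in a small helper so that any column's reference level can be changed by name. This is a sketch under assumptions: `set_reference_level` is a hypothetical name, not a pandas function, and the toy `weekday` series is made up.

```python
import pandas

def set_reference_level(series, reference):
    """Return a categorical copy of `series` whose first category is `reference`."""
    series = series.astype("category")
    others = [c for c in series.cat.categories if c != reference]
    # reorder_categories raises ValueError if `reference` is not an existing category
    return series.cat.reorder_categories([reference] + others)

weekday = pandas.Series(["Friday", "Monday", "Wednesday", "Monday"])
weekday = set_reference_level(weekday, "Monday")
print(weekday.cat.categories[0])
```

Only the category order changes; the values in the column stay the same.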


In [17]:
#make dummy variables from categorical ones. Using one-hot encoding and drop_first=True 
data = pandas.get_dummies(data, drop_first=True) # it will only get dummies for categorical columns
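
To see what `drop_first=True` does, here is a toy example on made-up data. The `toy` frame is hypothetical, and `dtype=int` is an assumption added here because recent pandas releases return boolean dummy columns by default; whether that matters depends on the installed versions.

```python
import pandas

# Made-up frame with one categorical and one numeric column.
toy = pandas.DataFrame({
    "weekday": pandas.Categorical(["Friday", "Monday", "Sunday"]),
    "hour": [9, 14, 20],
})

# The first category ("Friday") gets no column of its own: it is the reference level.
dummies = pandas.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

A row with zeros in every `weekday_*` column is therefore a Friday row, which is why each weekday coefficient is read relative to Friday.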

In [18]:
#add intercept
data['intercept'] = 1
#drop the label
train_cols = data.drop('clicked', axis=1)

In [19]:
#Build Logistic Regression
logit = sm.Logit(data['clicked'], train_cols)
output = logit.fit()

Optimization terminated successfully.
         Current function value: 0.092770
         Iterations 9


In [20]:
output_table = pandas.DataFrame(dict(coefficients = output.params, SE = output.bse, z = output.tvalues, p_values = output.pvalues))
#get coefficients and pvalues
print(output_table)

                            coefficients            SE          z       p_values
email_id                   -3.848609e-08  7.780379e-08  -0.494656   6.208432e-01
hour                        1.670684e-02  5.005879e-03   3.337445   8.455247e-04
user_past_purchases         1.878107e-01  5.725787e-03  32.800855  5.725039e-236
email_text_long_email      -2.793085e-01  4.530477e-02  -6.165101   7.043829e-10
email_version_personalized  6.387251e-01  4.691461e-02  13.614631   3.277989e-42
weekday_Monday              5.410326e-01  9.341014e-02   5.792011   6.954864e-09
weekday_Saturday            2.828638e-01  9.777629e-02   2.892969   3.816190e-03
weekday_Sunday              1.836278e-01  1.001194e-01   1.834088   6.664099e-02
weekday_Thursday            6.254040e-01  9.233999e-02   6.772839   1.262790e-11
weekday_Tuesday             6.162222e-01  9.237223e-02   6.671077   2.539336e-11
weekday_Wednesday           7.554637e-01  9.084515e-02   8.315950   9.102053e-17
user_country_FR            -

In [21]:
#only keep significant variables and order results by coefficient value
print(output_table.loc[output_table['p_values'] < 0.05].sort_values("coefficients", ascending=False))
  

                            coefficients        SE          z       p_values
user_country_UK                 1.155255  0.122060   9.464618   2.946372e-21
user_country_US                 1.141360  0.115963   9.842487   7.386228e-23
weekday_Wednesday               0.755464  0.090845   8.315950   9.102053e-17
email_version_personalized      0.638725  0.046915  13.614631   3.277989e-42
weekday_Thursday                0.625404  0.092340   6.772839   1.262790e-11
weekday_Tuesday                 0.616222  0.092372   6.671077   2.539336e-11
weekday_Monday                  0.541033  0.093410   5.792011   6.954864e-09
weekday_Saturday                0.282864  0.097776   2.892969   3.816190e-03
user_past_purchases             0.187811  0.005726  32.800855  5.725039e-236
hour                            0.016707  0.005006   3.337445   8.455247e-04
email_text_long_email          -0.279308  0.045305  -6.165101   7.043829e-10
intercept                      -6.601613  0.154950 -42.604781   0.000000e+00
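
The coefficients above are on the log-odds scale. As a rough aid for reading them, the sketch below exponentiates a few of them into odds ratios and pushes the intercept through the logistic function. The numbers are copied from the table; the variable and function names are just illustrative.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "email_version_personalized": 0.638725,
    "email_text_long_email": -0.279308,
    "weekday_Wednesday": 0.755464,
}
intercept = -6.601613

def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# exp(coefficient) is the multiplicative change in the odds of a click.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# Probability with every variable at zero / its reference level (an unrealistic baseline).
baseline = sigmoid(intercept)
# Same baseline, but personalized and sent on Wednesday.
wednesday_personalized = sigmoid(
    intercept
    + coefficients["weekday_Wednesday"]
    + coefficients["email_version_personalized"]
)

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
print(f"baseline {baseline:.4%} vs Wednesday + personalized {wednesday_personalized:.4%}")
```

The baseline sets `hour`, `user_past_purchases` and `email_id` to 0, so the absolute probabilities are not meaningful on their own; the comparison between the two is the point.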

# Understanding the output

1. User country seems very important. Especially interesting is that English-speaking countries (US, UK) are doing significantly better than non-English-speaking countries (ES, FR). That could mean a bad translation or, more generally, a non-localized version of the email. The first thing to do here is probably to get in touch with the international team and ask them to review the French and Spanish email templates
2. Not surprisingly, all weekday coefficients are positive. Sunday is (barely) non-significant; all others are significant. This is a consequence of having Friday as the reference level. It is a well-known fact that sending marketing emails on Friday is not a great idea. Wednesday seems to be the best day, but in general Monday through Thursday perform similarly. Friday through Sunday are much worse. The company should probably start sending emails only Monday through Thursday, with a particular focus on the middle of the week
3. Personalized emails are doing better. So the company should stop sending generic emails. But most importantly, this can be a huge insight from a product standpoint. If just adding the name at the top is increasing clicks significantly, imagine what would happen with even more personalization. Definitely worth investing in this
4. Sending short emails appears to be better, but personalizing emails should take priority over finding a general optimal email template that works best on average for everyone (see the much lower coefficient compared to the personalization one)
5. Hour nicely illustrates a limitation of logistic regression with numerical variables. The best time is likely during the day, while early mornings and late nights are probably bad. But the model can only fit a relationship between hour and the output that is linear in the log-odds. In most cases, this means it will not find a significant relationship. If it does find significance, the result can be highly misleading. In this case, it is telling us that the larger the value of hour, the better. So the best time would be 24 (midnight)! To solve this, you should manually create segments (i.e. indicator variables) before building the model. One segment could be night time, another morning to noon, etc.
6. Email_id is not significant (p ≈ 0.62), but it is still worth keeping in mind. Email_id could be interesting because it can be seen as a proxy for time, i.e. the first email sent gets id 1, the second id 2, etc. So a significant and negative coefficient would mean that as time goes by, fewer and fewer people are clicking on the email. This could be a big red flag, for instance if Google had started labeling us as spam. That doesn’t look like the case here, but it is still something to monitor
7. More importantly, note the very small coefficient for email_id compared to the other ones. That alone doesn’t mean that the variable is irrelevant. The small coefficient simply reflects the fact that the scale of email_id is far larger than that of the other variables. Among the other variables, the largest maximum is 24 (hour), whereas email_id goes up to about 1 million (999,998 in this dataset)! So the small coefficient balances the different scale; otherwise email_id would entirely drive the regression output.
8. The highly negative and significant intercept is the regression output (on the log-odds scale) when all variables are set to zero. So, basically, categorical variables are all set to their reference levels and numerical variables are set to 0. Intercepts are almost always negative and significant, given that in the majority of cases you are dealing with imbalanced classes, where 1s are <5% of the events. And in a logistic regression a negative output means a higher probability of predicting class zero. Don’t read too much into it. After all, the all-values-are-0 scenario is unrealistic at best, and often impossible. Here, for instance, “hour” is coded from 1 to 24, so it cannot even take the value 0! __One useful thing, though: comparing the scale of the intercept with the scale of the other coefficients multiplied by the possible values of those variables gives a sense of how much you can affect the output__
    1. If I send emails on Wednesday, that term contributes about 0.76 (i.e. the 0.76 coefficient times the value of the variable, which would be 1), which is pretty high relative to the -6.6 intercept. So there is room for meaningful improvement. Imagine my intercept were -1000 and the Wednesday coefficient were the same. Then optimizing the day would be almost irrelevant from a practical standpoint.
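
The segmentation suggested for hour in point 5 could look roughly like this. The bin edges and segment names are assumptions chosen only for illustration, `hours` is a made-up sample, and `dtype=int` is passed so the indicators come out as 0/1.

```python
import pandas

# Made-up sample of send hours; hour is coded 1 to 24 in this dataset.
hours = pandas.Series([1, 6, 9, 14, 20, 24], name="hour")

# Hypothetical segment boundaries; intervals are right-closed: (0, 6], (6, 12], (12, 18], (18, 24].
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]
hour_segment = pandas.cut(hours, bins=bins, labels=labels)

# Indicator variables, with the first segment ("night") as the reference level.
dummies = pandas.get_dummies(hour_segment, prefix="hour", drop_first=True, dtype=int)
print(dummies)
```

With indicators in place of the raw hour, each segment gets its own coefficient relative to night, so the model no longer has to assume that later always means better.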