# Goal
Optimizing marketing campaigns is one of the most common data science tasks. Among the many possible marketing tools, one of the most efficient is email. Emails are great because they are free and can be easily personalized. Email optimization involves personalizing the text and/or the subject, choosing who should receive the email, when it should be sent, etc. Machine learning excels at this.

# Challenge Description

The marketing team of an e-commerce site has launched an email campaign. This site has email addresses from all the users who created an account in the past. They have chosen a random sample of users and emailed them. The email lets the user know about a new feature implemented on the site. From the marketing team's perspective, a success is when the user clicks on the link inside the email. This link takes the user to the company site.

You are in charge of figuring out how the email campaign performed and were asked the following questions: What percentage of users opened the email and what percentage clicked on the link within the email?
The VP of marketing thinks that it is stupid to send emails to a random subset and in a random way. Based on all the information you have about the emails that were sent, can you build a model to optimize future email campaigns so as to maximize the probability of users clicking on the link inside the email?
By how much do you think your model would improve click-through rate (defined as # of users who clicked on the link / total users who received the email)? How would you test that? Did you find any interesting patterns in how the email campaign performed for different segments of users? Explain.


In [28]:
#data read from input file 
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt


emailOpen_file='data/email/email_opened_table.csv'
email_file='data/email/email_table.csv'
linkClicked_file='data/email/link_clicked_table.csv'

emailOpen=pd.read_csv(emailOpen_file)
email=pd.read_csv(email_file,index_col="email_id")
linkClicked=pd.read_csv(linkClicked_file)


In [29]:
emailOpen.head()
linkClicked.head()
email.head()
#emailOpen[emailOpen["email_id"]==966622] 

email_id,email_text,email_version,hour,weekday,user_country,user_past_purchases
85120,short_email,personalized,2,Sunday,US,5
966622,long_email,personalized,12,Sunday,UK,2
777221,long_email,personalized,11,Wednesday,US,2
493711,short_email,generic,6,Monday,UK,1
106887,long_email,generic,14,Monday,US,6


In [30]:
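#work on a copy so the raw email table stays unchanged
data=email.copy()
#flag opened emails: 1 if the email_id appears in the opened table, else 0
data["email_status"]=0
email_open_id=emailOpen["email_id"]

data.loc[email_open_id,"email_status"]=1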

In [31]:
#flag emails whose link was clicked: 1 if the email_id appears in the clicked table, else 0
data["link_status"]=0

data.loc[linkClicked["email_id"],"link_status"]=1
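
The label-based assignment above relies on every `email_id` in the event tables being present in the index of `email`. A minimal sketch of an alternative, using `Index.isin` on made-up toy data (the frames and ids below are hypothetical, not the challenge data), builds the same 0/1 flag without depending on that:

```python
import pandas as pd

# Hypothetical toy frames standing in for email_table / email_opened_table.
email_toy = pd.DataFrame(
    {"email_text": ["short_email", "long_email", "long_email"]},
    index=pd.Index([101, 102, 103], name="email_id"),
)
opened_toy = pd.DataFrame({"email_id": [102, 999]})  # 999 is not in email_toy

# isin returns a boolean mask over the index; astype(int) turns it into 0/1.
email_toy["email_status"] = email_toy.index.isin(opened_toy["email_id"]).astype(int)
print(email_toy["email_status"].tolist())
```

Ids that appear only in the event table are simply ignored by the mask, so the flag column always has one value per row of the email table.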

In [35]:
data.head()
data.groupby("email_text")["email_status"].mean() #open rate by email length (long vs. short)


email_text
long_email     0.091177
short_email    0.115860
Name: email_status, dtype: float64

For short emails the open rate is about 11.6%, while for long emails it is about 9.1%.

In [36]:
data.groupby("email_text")["link_status"].mean() #link click rate by email length (long vs. short)

email_text
long_email     0.018538
short_email    0.023872
Name: link_status, dtype: float64

For short emails the link click rate is about 2.4%, compared to about 1.9% for long emails.
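
Whether a gap like this is larger than random noise depends on the group sizes, which the outputs above do not show. As a rough sketch, a pooled two-proportion z-test can be written with the standard library alone. The function name `two_proportion_z_test` and the counts passed to it are assumptions for illustration: 50,000 emails per group is made up, and the click counts are picked only to land near the printed rates.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one click rate."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (short emails as group a, long emails as group b).
z, p = two_proportion_z_test(clicks_a=1194, n_a=50_000, clicks_b=927, n_b=50_000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

With the real data, the counts would come from `data.groupby("email_text")["link_status"].agg(["sum", "count"])` rather than from hard-coded numbers.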

In [38]:
data.groupby(["email_text","email_version"])["email_status"].mean() #open rate by email length and version (generic vs. personalized)

email_text   email_version
long_email   generic          0.070812
             personalized     0.111701
short_email  generic          0.087975
             personalized     0.143994
Name: email_status, dtype: float64

In [39]:
data.groupby(["email_text","email_version"])["link_status"].mean() #link click rate by email length and version (generic vs. personalized)


email_text   email_version
long_email   generic          0.013711
             personalized     0.023403
short_email  generic          0.016578
             personalized     0.031231
Name: link_status, dtype: float64
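
The two breakdowns by email length and version can also be produced in one call by selecting both flag columns before taking the mean. The sketch below uses a hypothetical four-row frame with the same column names as `data`; the output names `open_rate` and `click_rate` are illustrative choices, not columns from the challenge tables.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as `data`.
toy = pd.DataFrame({
    "email_text":    ["short_email", "short_email", "long_email", "long_email"],
    "email_version": ["personalized", "generic", "personalized", "generic"],
    "email_status":  [1, 0, 1, 0],
    "link_status":   [1, 0, 0, 0],
})

# The mean of a 0/1 flag is the share of rows where the flag is 1.
rates = (
    toy.groupby(["email_text", "email_version"])[["email_status", "link_status"]]
       .mean()
       .rename(columns={"email_status": "open_rate", "link_status": "click_rate"})
)
print(rates)
```

Putting both rates side by side makes it easier to see whether a segment that opens more also clicks more.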

In [42]:
data.groupby("email_status")["link_status"].mean()
#about 0.06% of emails never recorded as opened still show a link click, which is likely a tracking or logging inconsistency
#among opened emails, 20% had the link clicked

email_status
0    0.000558
1    0.200000
Name: link_status, dtype: float64

In [41]:
data["email_status"].mean() #10%
data["link_status"].mean() #2.1%

0.021190000000000001
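
The conditional click rates and the overall click rate printed above are tied together by the law of total probability, which gives a quick consistency check on the open rate quoted in the comments. The sketch below only reuses the rounded figures from the outputs, so the result is approximate; the variable names are illustrative.

```python
# Rates copied from the outputs in this notebook (rounded as printed).
click_given_open = 0.200000      # link_status mean where email_status == 1
click_given_unopened = 0.000558  # link_status mean where email_status == 0
overall_click = 0.02119          # data["link_status"].mean()

# Law of total probability:
#   overall = p_open * click_given_open + (1 - p_open) * click_given_unopened
# Solving for p_open gives the open rate implied by the three numbers above.
p_open = (overall_click - click_given_unopened) / (click_given_open - click_given_unopened)
print(f"implied open rate: {p_open:.4f}")
```

The implied value comes out close to 10%, in line with the open rate noted next to `data["email_status"].mean()`.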

## Question 1:
    Overall, about 10% of emails were opened and 2.1% of emails had the link clicked.
    Open and click rates differ across email settings. Short, personalized emails have the highest open rate and click rate, at 14.4% and 3.1% respectively.
    
    