# Design experiment impact in improving retention 

In [14]:
import math as mt
import numpy as np
import pandas as pd
from scipy.stats import norm

In [2]:
import pandas as pd
control=pd.read_csv("control_data.csv")
experiment=pd.read_csv("experiment_data.csv")

In [5]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


In [4]:
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


# A/B TEST EXPERIMENT DESIGN 

# Step 1: Sanity checks on Invariant Metrics 

### 3 Invariant metrics 

1) Number of cookies that view the course overview page 

2) Number of cookies to click ‘Start Free Trial’ button which happens before the ‘screener’ pops up

3) Free trial button Click through Probability: Number of cookies that click on ‘Start Free Trial’/ Number of cookies that view the course overview page

In [54]:
# Sanity Check on Invariant metric 1 : Cookies that view the course overview page 
pageviews_cont=control['Pageviews'].sum()
pageviews_exp=experiment['Pageviews'].sum()
pageviews_total=pageviews_cont+pageviews_exp
print ("#pageviews in control:", pageviews_cont)
print ("#pageviews in experiment:" ,pageviews_exp)
print ("#pageviews in total:" ,pageviews_total)

('#pageviews in control:', 345543)
('#pageviews in experiment:', 344660)
('#pageviews in total:', 690203)


Let's make sure this difference in amounts is random and not significant. 
We expect the amount of pageviews in the control group to be about a half (50%) of the total pageviews in both groups. This can be represented by a binomial random distribution where a cookie is assigned to one of the groups wit 0.5 proability. Using Central Limit Theorem, we can approximate this binomial distribution to normal distribution because the sample size is greater than 30 with a mean p (0.05) and a standard deviation of sqrt(p(1-p)/n) 

Therefore p=0.5 

### Null Hypothesis: Any difference between the metric value of the two groups is due to chance. 
### Alternate Hypothesis: The difference between the metric value of the two groups is meaningful, and significant. It cannot be due to random change.


If the observed p(pageviews in one of the groups/ total pageviews of two groups) lies within 95% confidence interval of p=0.5, then there's no significant difference between the groups which is necessary for the A/B Testing to happen the right way. 

In [55]:
p=0.5
alpha=0.05
def get_z_score(alpha):
    return norm.ppf(alpha)
def standard_error(p,N):
    return round(mt.sqrt(p*(1-p)/(N)),4)

In [57]:
## Page views 
observed_probability=round(pageviews_cont*1.0/(pageviews_total),4)
Margin_error=round(get_z_score(1-(alpha/2))*mt.sqrt(p*(1-p)/(pageviews_total)),4)
print ("The confidence interval is between",p-Margin_error,"and",p+Margin_error)
print ("observed_probability is ", observed_probability)

('The confidence interval is between', 0.4988, 'and', 0.5012)
('observed_probability is ', 0.5006)


The observed probablity lies within confidence interval which means we cannot reject null hypothesis. Therefore, there's no significant difference between the pageview of the two groups and hence the pageview metrics is indeed an invariable metric

In [60]:
# Sanity Check on Invariant metric 2 : Number of cookies to click ‘Start Free Trial’ button
clicks_cont=control['Clicks'].sum()
clicks_exp=experiment['Clicks'].sum()
clicks_total=clicks_cont+clicks_exp
print ("#clicks in control:", clicks_cont)
print ("#clicks in experiment:" ,clicks_exp)
print ("#clicks in total:" ,clicks_total)

('#clicks in control:', 28378)
('#clicks in experiment:', 28325)
('#clicks in total:', 56703)


In [61]:
## Clicks
observed_probability=round(clicks_cont*1.0/(clicks_total),4)
Margin_error=round(get_z_score(1-(alpha/2))*mt.sqrt(p*(1-p)/(clicks_total)),4)
print ("The confidence interval is between",p-Margin_error,"and",p+Margin_error)
print ("observed_probability is ", observed_probability)

('The confidence interval is between', 0.4959, 'and', 0.5041)
('observed_probability is ', 0.5005)


The observed probablity lies within confidence interval which means we cannot reject null hypothesis. Therefore, there's no significant difference between the pageview of the two groups and hence the clicks metrics is indeed an invariable metric

The first two metrics are invariant metrics and are aggregate counts, hence used the above approach (approximating binomial to normal).  In the case of Click through proability, we want to make sure the proportion of clicks given a pageview (observed CTP) is about the same in both groups. In order to check this out, we will calculate the Click through proability for each of the group and calculate a confidence interval for the expected difference between them. 

Therefore,

Null Hypothesis: CTP of control - CTP of experiment = 0. No significant difference

Alternate Hypothesis: CTP of control - CTP of experiment != 0. There is a sgnificant difference

Note: Here while calculating Confidence Interval, we consider the standard pooled error. We should understand that CTP is a proportion in a population (amount of events x in a population n) like the amount of clicks out of the amount of pageviews.

SDpool=sqrt(ppool^(1−ppool^(1Ncont+1Nexp))
 
ppool=(xcont+xexp)/ (Ncont+Nexp)


In [95]:
# Sanity Check on Invariant metric 3 : Click through proability 
p=0.0
ctp_cont=clicks_cont*1.0/pageviews_cont
ctp_exp=clicks_exp*1.0/pageviews_exp
d_observed=round(ctp_exp-ctp_cont,5)
p_pooled=clicks_total*1.0/pageviews_total
sd_pooled=mt.sqrt(p_pooled*(1-p_pooled)*(1.0/pageviews_cont+1.0/pageviews_exp*1.0))
margin_error=round(get_z_score(1-(alpha/2))*sd_pooled,4)
print ("The confidence interval is between",p-margin_error,"and",p+margin_error)
print ('Observed difference is ', d_observed)

('The confidence interval is between', -0.0013, 'and', 0.0013)
('Observed difference is ', 3e-05)


The observed CTP lies within confidence interval which means we cannot reject null hypothesis. Therefore, there's no significant difference between the pageview of the two groups and hence the CTP is indeed an invariable metric

### All the sanity checks are satisfied and proves that the control and experiment groups have similar distribution and also the pageviews have been randomly assigned into the experiment and control groups. Hence our A/B Testing can be proceeded further

# Step 2: Change in Evaluation Metrics 

If the difference in evaluation metrics between control and experiment group is significant both statistically and practically, it means that the suggested change is beneficical to the company 

Evaluation Metrics: 

These are the performance indicators (sums, counts, means, percentiles, probability, rates or ratios) designed to measure the objective and the experiment impact. You want to choose a metric that is has high sensitivity, that means the metric can pick up the change you care about. You also want the metric to be robust against changes you don’t care about. 

1) Retention: # user-ids remain enrolled past 14 day trial period and made payment divided by # user-ids who enrolled in the course through free trial where they enter card information

2) Net conversion: # user-ids remain enrolled past 14 day trial period and made payment divided by # unique cookies to click the ‘Start Free Trial’ button 

* Cookies and user-ids are different because user-ids are tracked only when the person enrolls in the course. 

We expect the Net Conversion rate to drop if the experiment works because the users now would be more informed and hence would access free course material than enroll and make payment. The numerator lowered for the experiment group while denominator is almost the same. 

Retention, on the other hand is expected to be more in the experiment group because the denominator decreases because of the very same reason that users are more informed and would enroll through free trial only when they can contribute more than 5 hours per week which acts as a proxy for improved course completion probability. 


From the sample size calculations, we have concluded that with the data available we can only arrive at solid conclusions on net conversion evaluation metric

In [99]:
clicks_cont=control["Clicks"].loc[control["Enrollments"].notnull()].sum()
clicks_exp=experiment["Clicks"].loc[experiment["Enrollments"].notnull()].sum()
enrollments_cont=control["Enrollments"].sum()
enrollments_exp=experiment["Enrollments"].sum()
payments_cont=control["Payments"].sum()
payments_exp=experiment["Payments"].sum()
# Expected p 
p=0.0 # as no difference is expected according to null hypothesis 
# calculating observed net conversion rate
NC_cont_obs=payments_cont/clicks_cont
NC_exp_obs=payments_exp/clicks_exp
# calculating observed difference in rates 
NC_diff_obs=round(NC_exp_obs-NC_cont_obs,4)
# calculating standard pooled error and its confidence interval 
NC_pooled=(payments_cont+payments_exp)/(clicks_cont+clicks_exp)
NC_sd_pooled=mt.sqrt(NC_pooled*(1-NC_pooled)*(1.0/clicks_cont+1.0/clicks_exp))
NC_margin_error=round(get_z_score(1-alpha/2)*NC_sd_pooled,4)
print ("The confidence interval is between",NC_diff_obs-NC_margin_error,"and",NC_diff_obs+NC_margin_error)
print ('Observed difference is ', NC_diff_obs)
print ('Expected difference is ', p)

('The confidence interval is between', -0.0116, 'and', 0.0018000000000000004)
('Observed difference is ', -0.0049)
('Expected difference is ', 0.0)


Observe that the net conversion rate decreased by 0.49% units which is expected as stated above. Net conversion rate in the experiment group is lower than the control group. This change is significant if confidence interval doesn't contain zero because only then we can reject null hypothesis.Thus proving that change in the metric is because of the experiment and not by random . Also, it is practically significant if dmin doesn't fall in the Confidence interval  

But in this case, the CI contains zero therefore, the change is not significant. Also, you can notice that the change can range between -1.6% and 1.8% which is less than the practical significance of 2%. 

## Conclusion:  So this change is neither statistically nor practically significant  

# Step 3: Sign tests to find patterns 

Sign test helps in checking whether the direction of differences (either positive or negative) in the evaluation metrics between two groups are meaningfully distributed over the days of the experiment or not. If the change is negative for all the days, then our experiment is indeed driving the rate lower. Basically, this test is helpful to see the behaviour of signs between weekdays/weekends, different regions and spot common trends. If any, then probably a further breakdown of data and traffic would be required before running another set of A/B tests

Intuition: Computing the evaluation metric for every day and then check for how many days the metric was lower in the experiment group. This similar to assign a binomial variable being assigned 1 when Netconversion of experiment group is less than control group.  and this will be the number of succssesses for our binomial variable. Once this is defined we can look at the proportion of days of success out of all the available days.

In [118]:
# Merging two groups 
whole_exp=control.join(other=experiment,how="inner",lsuffix="_cont",rsuffix="_exp")
whole_exp=whole_exp.loc[whole_exp["Enrollments_cont"].notnull()]
whole_exp.head()

Unnamed: 0,Date_cont,Pageviews_cont,Clicks_cont,Enrollments_cont,Payments_cont,Date_exp,Pageviews_exp,Clicks_exp,Enrollments_exp,Payments_exp
0,"Sat, Oct 11",7723,687,134.0,70.0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9102,779,147.0,70.0,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10511,909,167.0,95.0,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9871,836,156.0,105.0,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",10014,837,163.0,64.0,"Wed, Oct 15",9793,832,140.0,94.0


In [119]:
whole_exp.shape
# 23 days data 

(23, 10)

In [120]:
whole_exp['NC_cont']=whole_exp['Payments_cont']/whole_exp['Clicks_cont']
whole_exp['NC_exp']=whole_exp['Payments_cont']/whole_exp['Clicks_exp']
# wherever net_conversion of experiment group is lower than control group assigning 1 to the new column as a success - 'Binomial' 
whole_exp['NC_success'] = np.where(whole_exp['NC_cont']>=whole_exp['NC_exp'],1,0)
whole_exp.drop(columns=['NC_exp','NC_cont'],inplace=True)
whole_exp.head()

Unnamed: 0,Date_cont,Pageviews_cont,Clicks_cont,Enrollments_cont,Payments_cont,Date_exp,Pageviews_exp,Clicks_exp,Enrollments_exp,Payments_exp,NC_success
0,"Sat, Oct 11",7723,687,134.0,70.0,"Sat, Oct 11",7716,686,105.0,34.0,0
1,"Sun, Oct 12",9102,779,147.0,70.0,"Sun, Oct 12",9288,785,116.0,91.0,1
2,"Mon, Oct 13",10511,909,167.0,95.0,"Mon, Oct 13",10480,884,145.0,79.0,0
3,"Tue, Oct 14",9871,836,156.0,105.0,"Tue, Oct 14",9867,827,138.0,92.0,0
4,"Wed, Oct 15",10014,837,163.0,64.0,"Wed, Oct 15",9793,832,140.0,94.0,0


In [124]:
print '# days with net_conversion of experiment group is lower than control group out of 23 days is',whole_exp.NC_success[whole_exp["NC_success"]==1].count(), 'days'

# days with net_conversion of experiment group is lower than control group out of 23 days is 10 days


According to the binomial distribution with  p=0.5  and  n= total number of days;we'll now calucate the probability of  x  days being a success. Because we are doing a two-tailed test we want to double this probability and once we have we can call it the  p−value  and compare it to our  α . If the  p−value  is greater than the  α  the result is not significant

In [144]:
#first a function for calculating probability of x=number of successes out of n 
# in this case x succesful days out of 23(n) days 
def get_success_prob(x,p,n):
    p_observed=round(mt.factorial(n)/(mt.factorial(x)*mt.factorial(n-x))*p**x*(1-p)**(n-x),4)
    return p_observed
#next a function to compute the p-value from probabilities of maximum x
def get_cumulative_2side_pvalue(x,p,n):
    p_observed=0
    for i in range(0,x+1):
        p_observed=p_observed+get_success_prob(i,p,n)
    return 2*p_observed*1.0

In [145]:
get_cumulative_2side_pvalue(10,0.5,23) 

0.6774

p-value is greater then 0.05 , thus the change is not significant is again validated 

## Conclusion is that it is better to not continue with the new screener message as it didn't really result in a significant change in our net conversion rate

### Having said that because it is not a harmful/risky change, we can still tune the parameters and try our A/B Testing again to observe significant changes 
