## Enhancing Student Experience Without Compromising Revenue: A/B Testing the Implementation of Free Trial Screener.
##### The final project of the course Udacity A/B Testing by Google 

## Table of Contents

## TBD: Need to change later, not matching right section
- [1. Project Overview](#1-project-overview)
  - [1.1 Situation](#11-situation)
  - [1.2 Goal](#12-goal)
  - [1.3 Treatment](#13-treatment)
  - [1.4 Expected Result](#14-expected-result)
- [2. Experiment Setup](#2-experiment-setup)
  - [2.1 Metric Choice](#21-metric-choice)
    - [2.1.1 Choosing Invariant Metrics](#211-choosing-invariant-metrics)
    - [2.1.2 Choosing Evaluation Metrics](#212-choosing-evaluation-metrics)
  - [2.2 Measuring Variability](#22-measuring-variability)
    - [2.2.1 Scaling Given Sample Size of 5000](#221-scaling-given-sample-size-of-5000)
    - [2.2.2 SE for Evaluation Metrics](#222-se-for-evaluation-metrics)
- [3. Sizing](#3-sizing)
  - [3.1 Alpha and Beta Set Up](#31-alpha-and-beta-set-up)
  - [3.2 Sample Size in Pageview](#32-sample-size-in-pageview)
- [4. Exposure and Duration](#4-exposure-and-duration)
  - [4.1 Final Number of Pageviews](#41-final-number-of-pageviews)
  - [4.2 Duration in Days](#42-duration-in-days)
  - [4.3 Fraction of Traffic](#43-fraction-of-traffic)
- [5. Multiple Metrics Alpha Correction?](#5-multiple-metrics-alpha-correction)

- [6. Data Analysis](#6-data-analysis)
  - [6.1 EDA](#61-eda)
  - [6.2 Sanity Check](#62-sanity-check)
  - [6.3 Effect Size Testing](#63-effect-size-testing)
  - [6.4 Sign Test](#64-sign-test)
- [7. Interpretation & Recommendation](#7-interpretation-recommendation)

- [8. Follow-Up Experiment](#8-follow-up-experiment)



In [2]:
import os       
import numpy as np 
import pandas as pd
from scipy import stats 
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import binom_test



#### 1. Project Overview

This is the [final project](https://learn.udacity.com/courses/ud257/lessons/811aec6b-bd88-4da7-adb9-0fbd96e74238/concepts/5345b415-6c2f-431a-8dd5-c212bb3b7c20)
 of the course [Udacity A/B Testing by Google](https://www.udacity.com/course/ab-testing--ud257)


1.1 Situation

Udacity's course overview page currently presents two options to users: 

1. "start free trial," which requires entering credit card information and enrolls the user in a free trial of the paid version. Students are automatically transitioned into a paid subscription after 14 days unless they cancel first. 

2. "Access course materials," which allows users to view course content and quizzes for free but without coaching support or certification. They cannot submit their final project for review either. 

A significant challenge has been the frustration and dropout of users who start the free trial without understanding the time commitment required, leading to dissatisfaction and cancellations.

1.2 Goal  

The goal of the experiment is to ensure that users have a clear understanding of the time commitment required for the course before enrolling in the free trial. By setting clearer expectations upfront, Udacity aims to reduce the number of students who leave the free trial out of frustration, without significantly impacting the number of students who continue past the free trial and complete the course.

1.3 Treatment

In the experiment, Udacity introduces an additional step in the enrollment process. When a user clicks "start free trial," they are prompted to indicate the amount of time they can dedicate to the course each week. Please see the [screenshot](https://github.com/emmaliberkeley/AB-Testing-Project/blob/main/Basic_Info/Final%20Project_%20Experiment%20Screenshot.png)  on what the experiment should look like.

Users who indicate 5 or more hours per week proceed to the normal checkout process, while those who indicate fewer than 5 hours receive a message highlighting the typical time requirement and suggesting they might prefer to access course materials for free. At this point, students have the option to continue enrolling in the free trial, or access the course materials for free instead.

1.4 Expected Result

The expected result is a decrease in the number of users who sign up for the free trial without sufficient time to dedicate to the course, thereby reducing early dropouts and increasing the overall satisfaction of Udacity's user base. The experiment seeks to validate the hypothesis that clearer communication of course requirements will lead to an improved student experience and more efficient use of coaching resources.

# TBD: a customer funnel viz

#### 2. Experiment Setup

2.1 Unit of Diversion

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

2.2 Hypotheses 

- Null Hypothesis (H0): The treatment does not influence the proportion of individuals enrolling in the free trial. 

- Alternative Hypothesis (H1): The treatment decreases the proportion of individuals enrolling in the free trial.

- Null Hypothesis (H0): The treatment has no impact on the percentage of individuals who exit the free trial.

- Alternative Hypothesis (H1): The treatment enhances the overall student experience, thereby reducing the percentage of individuals leaving the free trial.


- Null Hypothesis (H0): The treatment does not affect the count of individuals continuing beyond the free trial.

- Alternative Hypothesis (H1): The treatment influences the count of individuals who continue beyond the free trial.

2.2 Metric Choice and Practical Significance (Provided by Udacity)

To test these hypotheses, we need to define appropriate evaluation metrics.
Evaluation metrics measure the performance of what we care about. They can be high-level business metrics or detailed metrics from user experience research. A good evaluation metric balances sensitivity and robustness: it is sensitive enough to pick up the changes we care about and robust against the changes we do not care about.

We also need to define a set of invariant metrics/control variables. Invariant metrics are metrics that we expect not to change between test and control group. They will be used in the "sanity check" section. 

How many metrics do we need? For sanity checking, we need multiple metrics. For evaluation, some companies want just one metric so that different teams work toward one goal collectively. If we have multiple evaluation metrics, we can test each of them, though we might need to adjust the alpha (more on this in section [5. Multiple Metrics Alpha Correction?](#5-multiple-metrics-alpha-correction)). We can also create a composite metric, an Objective Function or OEC (overall evaluation criterion), which is a weighted combination of the individual metrics. An OEC is not advised here because it still requires defining all of the individual metrics and may lead to confusion. 

What's more, we cannot implement every change that is statistically significant: the cost of implementation may be too high, or we may want to wait for a more substantial change. In addition to assessing statistical significance, we therefore also consider practical significance, denoted as 'dmin.' 

Udacity provides seven metrics to choose from. Additionally, the practical relevance level of each metric is given. Below is the visualization of the metrics provided, their definitions in one sentence, and the practical significance level. 



| Metric Name               | Definition                                                                                      | dmin |
| ------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------------- |
| Number of cookies         | Number of unique cookies to view the course overview page.                      | 3000                            |
| Number of user-ids        | Number of users who enroll in the free trial.                                       | 50                              |
| Number of clicks          | Number of unique cookies to click the "Start free trial" button (before the screener is triggered). | 240 |
| Click-through-probability | Number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page.  | 0.01 |
| Gross conversion          | Number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button.  | 0.01 |
| Retention                 | Number of user-ids to remain enrolled past the 14-day boundary (and make at least one payment) divided by the number of user-ids who complete checkout.  | 0.01 |
| Net conversion            | Number of user-ids to remain enrolled past the 14-day boundary (and make at least one payment) divided by the number of unique cookies to click the "Start free trial" button.  | 0.0075 |


2.2.1 Choosing Invariant Metrics

We expect the invariant metrics to NOT be significantly different between the experiment and control group. 

Even though the click-through-probability already includes the number of clicks and the number of cookies as its numerator and denominator, we still want to include both counts as invariant metrics. One group could have a small number of cookies and clicks while the other has a large number of cookies and clicks, and their click-through probabilities could still be the same. Using multiple invariant metrics catches such imbalances and ensures a valid comparison between the treatment and control groups.

| Metric                   | Formula                                                             | dmin  |
| ------------------------ | ------------------------------------------------------------------- | ----- |
| Number of cookies        | C = # of unique daily cookies to view course overview page          | 3000  |
| Number of clicks         | CL = # of unique daily cookies to click "start free trial" button   | 240   |
| Click-through-probability| CTP = CL / C                                                        | 0.01  |



2.2.2 Choosing Evaluation Metrics

We expect the evaluation metrics to be influenced by the treatment and to differ between the experiment and control groups.

| Metric            | Formula                                   | dmin   |
| ----------------- | -----------------------------------------| ------ |
| Gross Conversion  | CG = (# of user-ids that enrolled) / CL  | -0.01  |
| Retention         | R = (# of user-ids that paid) / (# of user-ids that enrolled) | 0.01   |
| Net Conversion    | CN = (# of user-ids that paid) / CL      | 0.0075 |



2.3 Hypotheses Restated with Evaluation Metrics

H<sub>0</sub>: C<sub>G<sub>treatment</sub></sub> = C<sub>G<sub>control</sub></sub>
H<sub>1</sub>: C<sub>G<sub>treatment</sub></sub> ≠ C<sub>G<sub>control</sub></sub>

H<sub>0</sub>: R<sub>treatment</sub> = R<sub>control</sub>
H<sub>1</sub>: R<sub>treatment</sub> ≠ R<sub>control</sub>

H<sub>0</sub>: C<sub>N<sub>treatment</sub></sub> = C<sub>N<sub>control</sub></sub>
H<sub>1</sub>: C<sub>N<sub>treatment</sub></sub> ≠ C<sub>N<sub>control</sub></sub>

2.4 Measuring Variability 


If the unit of diversion is the same as the unit of analysis, then the analytical estimate of the variability is accurate. 

Recall that "the unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward." 

- Before Enrollment (Unit of Diversion: Cookie)
- After Enrollment (Unit of Diversion: User-ID)

The unit of diversion is the same as the unit of analysis (the denominator of the metric formula) for each evaluation metric: cookie in the case of Gross Conversion and Net Conversion, user-id in the case of Retention. Since we can also assume that the metrics follow a binomial distribution, we can estimate the variability analytically instead of empirically.

We are given a sample size of 5000 cookies visiting the course overview page each day. Thus, we make an analytical estimate of each evaluation metric's standard error. Since all of our evaluation metrics are proportions, the standard error here is the standard deviation of the sampling distribution of the sample proportion: an estimate of how far the sample proportion is likely to be from the population proportion.

In [3]:
# Baseline values and practical significance levels (dmin) for all metrics, as given by Udacity
# (changed from Udacity's true numbers).
baseline = {"Metric": ["Cookies", "Clicks", "User-ids", "Click-through-probability", "Gross Conversion", "Retention", "Net Conversion"], 
     "Baseline": [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
     "dmin": [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]}
bs = pd.DataFrame(data=baseline, index=["C", "CL", "ID", "CTP", "GC", "R", "NC"])
bs.insert(2, "scaled_baseline", np.nan)
# Scale the count metrics from 40,000 daily cookies down to the given sample of 5,000 cookies
factor = 5000 / bs.loc['C', 'Baseline']
for i in ['C', 'CL', 'ID']:
    bs.at[i, 'scaled_baseline'] = bs.loc[i, 'Baseline'] * factor
pd.options.display.float_format = '{:.5g}'.format
bs

Unnamed: 0,Metric,Baseline,scaled_baseline,dmin
C,Cookies,40000.0,5000.0,3000.0
CL,Clicks,3200.0,400.0,240.0
ID,User-ids,660.0,82.5,-50.0
CTP,Click-through-probability,0.08,,0.01
GC,Gross Conversion,0.20625,,-0.01
R,Retention,0.53,,0.01
NC,Net Conversion,0.10931,,0.0075


To calculate the SE, we are going to apply the central limit theorem (CLT), which requires: 
- The sample size is large enough: a common guideline is that both n*p and n*(1-p) should be greater than or equal to 10.

- Each event is independent and identically distributed.

All of the evaluation metrics have a large enough sample size, so we can apply the CLT.
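The sample-size guideline can be checked numerically. The sketch below is a minimal check, assuming the scaled baseline counts from the table above (400 clicks and 82.5 enrollments per 5,000 cookies). The helper name `clt_ok` and its `threshold` argument are illustrative and not part of the original notebook.

```python
# Baseline proportions and denominators (units of analysis) from the table above.
metrics = {
    "GC": {"p": 0.20625, "n": 400},     # enrollments / clicks
    "R": {"p": 0.53, "n": 82.5},        # payments / enrollments
    "NC": {"p": 0.109313, "n": 400},    # payments / clicks
}

def clt_ok(n, p, threshold=10):
    """Return True when both n*p and n*(1-p) reach the rule-of-thumb threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# Hypothetical helper applied to each evaluation metric
clt_check = {name: clt_ok(m["n"], m["p"]) for name, m in metrics.items()}
print(clt_check)
```

Retention has the smallest denominator, so it is the metric where this guideline is closest to binding.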


To calculate the SE:

The standard error (SE) is given by:

SE = $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Where $\hat{p}(1 - \hat{p})$ is the estimated population variance; its square root is the estimated population standard deviation.

In [4]:
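bs["SE"] = np.nan

def se(n, p):
    '''Return the standard error of a sample proportion p with sample size n'''
    return (p*(1-p)/n)**0.5

# Gross Conversion and Net Conversion use clicks as the denominator
for i in ['GC', 'NC']:
    bs.at[i, "SE"] = se(bs.loc["CL", "scaled_baseline"], bs.loc[i, "Baseline"])
# Retention uses enrolled user-ids as the denominator
bs.at["R", "SE"] = se(bs.loc["ID", "scaled_baseline"], bs.loc["R", "Baseline"])
bs 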

Unnamed: 0,Metric,Baseline,scaled_baseline,dmin,SE
C,Cookies,40000.0,5000.0,3000.0,
CL,Clicks,3200.0,400.0,240.0,
ID,User-ids,660.0,82.5,-50.0,
CTP,Click-through-probability,0.08,,0.01,
GC,Gross Conversion,0.20625,,-0.01,0.020231
R,Retention,0.53,,0.01,0.054949
NC,Net Conversion,0.10931,,0.0075,0.015602


#### 3. Sizing

3.1 Alpha and Beta Set Up

In [5]:
# Define a dictionary to store alpha and beta values
alpha_beta_dict = {
    'alpha': 0.05,
    'beta': 0.20
}
alpha_value = alpha_beta_dict['alpha']
beta_value = alpha_beta_dict['beta']
print(f'Alpha: {alpha_value}')
print(f'Beta: {beta_value}')

Alpha: 0.05
Beta: 0.2


3.2 Sample Size in Pageview

We will use [Evan Miller's sample size calculator](https://www.evanmiller.org/ab-testing/sample-size.html). The calculator's output n is the sample size per group. Since an A/B test has two groups, the final number must be doubled. 

We want to express the experiment sample size in terms of cookies that visit the page. Thus, we also need to account for the fact that our evaluation metrics' units of analysis are clicks (Gross Conversion and Net Conversion) and user-ids (Retention), not pageviews.


$\text{Sample Size}_{\text{GC or NC}} = \left( \frac{n_{\text{Evan's}}}{\text{CTP}} \right) \times 2$

$\text{Sample Size}_{\text{R}} = \left( \frac{\frac{n_{\text{Evan's}}}{\text{CTP}}}{\text{GC}} \right) \times 2$

Where ${n_{\text{Evan's}}}$ is the output from Evan's sample size calculator.

There are four key factors in determining the sample size:
1. Baseline Conversion Rate
2. Minimum Detectable Effect
3. Statistical Power: 1-$\beta$
4. Significance level: $\alpha$
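To make the role of these four factors concrete, here is a minimal sketch of a normal-approximation sample size formula for a two-sided test of two proportions. It is not the calculator's code. The function names are hypothetical, and folding a baseline above 0.5 to `1 - p` is an assumption made here so that Retention is handled symmetrically. Results may differ from the calculator by rounding.

```python
from scipy.stats import norm

def per_group_sample_size(p, d, alpha=0.05, beta=0.20):
    """Approximate per-group n to detect an absolute change d from baseline p."""
    p = min(p, 1 - p)                  # assumption: fold baselines above 0.5
    d = abs(d)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
    z_beta = norm.ppf(1 - beta)        # statistical power = 1 - beta
    sd_null = (2 * p * (1 - p)) ** 0.5
    sd_alt = (p * (1 - p) + (p + d) * (1 - (p + d))) ** 0.5
    return round(((z_alpha * sd_null + z_beta * sd_alt) / d) ** 2)

def pageviews_needed(n_per_group, ctp, gross_conversion=None):
    """Convert a per-group n (clicks or enrollments) into total pageviews for two groups."""
    per_group_pageviews = n_per_group / ctp
    if gross_conversion is not None:   # unit of analysis is an enrolled user-id
        per_group_pageviews /= gross_conversion
    return per_group_pageviews * 2

n_gc = per_group_sample_size(0.20625, 0.01)
n_r = per_group_sample_size(0.53, 0.01)
n_nc = per_group_sample_size(0.109313, 0.0075)
print(pageviews_needed(n_gc, 0.08))
print(pageviews_needed(n_r, 0.08, 0.20625))
print(pageviews_needed(n_nc, 0.08))
```

A smaller minimum detectable effect or a lower baseline click-through-probability both push the required pageviews up quickly, which is why Retention, whose denominator sits two steps down the funnel, needs by far the most traffic.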

In [6]:
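bs["sample_size"] = np.nan
# Total pageviews required (both groups) for each evaluation metric,
# stored as numbers so they fit the float column
total_sample_size = {"GC": 645875, "R": 4741212, "NC": 685325}
for metric, size in total_sample_size.items():
    bs.at[metric, "sample_size"] = size
bs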



Unnamed: 0,Metric,Baseline,scaled_baseline,dmin,SE,sample_size
C,Cookies,40000.0,5000.0,3000.0,,
CL,Clicks,3200.0,400.0,240.0,,
ID,User-ids,660.0,82.5,-50.0,,
CTP,Click-through-probability,0.08,,0.01,,
GC,Gross Conversion,0.20625,,-0.01,0.020231,645875.0
R,Retention,0.53,,0.01,0.054949,4741212.0
NC,Net Conversion,0.10931,,0.0075,0.015602,685325.0


To test the three hypotheses from section 2.3, we need 645,875 pageviews for Gross Conversion, 4,741,212 pageviews for Retention, and 685,325 pageviews for Net Conversion. 

Given our estimate of 40,000 unique pageviews per day, 4,741,212 is too large: testing the second hypothesis would require running on all traffic for 119 days. 

Thus, we are going to test hypotheses 1 and 3 with the metrics GC and NC. Taking the larger of the two requirements, we need 685,325 pageviews to run the experiment.  

#### 4. Exposure and Duration
Our goal is to ensure that the experiment is executed efficiently and doesn't extend beyond a few weeks in duration. Additionally, given that the experiment doesn't involve any high-risk features or aspects that could lead to significant media attention, we are comfortable directing a substantial portion of our traffic toward it. However, because there's always a possibility of unexpected issues during the implementation phase, we choose not to allocate our entire traffic volume to the experiment and instead aim to expose around 50% of the traffic. 

Another important factor is the variation in website visitor behavior between weekdays and weekends, so the experiment duration should be a multiple of 7 days. With a required sample size of 685,325 pageviews and 40,000 total pageviews per day, we choose 35 days (5 weeks) as the final duration, which requires diverting about 49% of traffic.

- Final number of pageviews: at least 685,325
- Duration in days: 35
- Fraction of traffic: 49%
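The arithmetic behind these three numbers can be sketched as follows. The function and constant names are hypothetical. The inputs are the 40,000 daily pageviews and the pageview requirements stated earlier in this notebook.

```python
import math

DAILY_PAGEVIEWS = 40_000      # baseline unique cookies per day
REQUIRED_PAGEVIEWS = 685_325  # requirement for Net Conversion from the sizing section

def days_needed(required, daily, fraction=1.0):
    """Days required when `fraction` of daily traffic is diverted to the experiment."""
    return math.ceil(required / (daily * fraction))

def fraction_needed(required, daily, days):
    """Minimum share of daily traffic needed to collect `required` pageviews in `days` days."""
    return required / (daily * days)

frac_35 = fraction_needed(REQUIRED_PAGEVIEWS, DAILY_PAGEVIEWS, 35)
print(f"Traffic share for 35 days: {frac_35:.1%}")
print(f"Days at 50% traffic: {days_needed(REQUIRED_PAGEVIEWS, DAILY_PAGEVIEWS, 0.5)}")
print(f"Days for Retention at 100% traffic: {days_needed(4_741_212, DAILY_PAGEVIEWS)}")
```

Fixing the duration first and solving for the traffic share keeps the run length a whole number of weeks.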

#### 5. Multiple Metrics Alpha Correction?

With multiple hypotheses at play, the risk of encountering false positives increases. However, since our metrics are not entirely independent, the actual probability of false positives remains below the 9.75% threshold for independent metrics. While we could use methods like Bonferroni or false discovery rate to address multiple hypotheses, these approaches have their limitations, including the potential for more false negatives. Given the slightly elevated risk of false positives in this case, we've chosen not to implement strict control measures for multiple hypotheses. 

If the metrics are independent, $\text{Actual Probability of False Positive} = 1 - (1 - \alpha)^n$, where n is the number of evaluation metrics. 

In our case, $1 - (1 - 0.05)^2 = 9.75\%$
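As a quick numerical check of the formula, the sketch below computes the familywise false positive rate for independent metrics and, for comparison, the per-metric alpha that a Bonferroni correction would use. Both function names are illustrative.

```python
def familywise_error_rate(alpha, n_metrics):
    """Probability of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics

def bonferroni_alpha(alpha, n_metrics):
    """Per-metric significance level under a Bonferroni correction."""
    return alpha / n_metrics

fwer = familywise_error_rate(0.05, 2)
print(f"Familywise error rate with 2 metrics: {fwer:.4f}")
print(f"Bonferroni per-metric alpha: {bonferroni_alpha(0.05, 2)}")
```

Because Gross Conversion and Net Conversion share the same denominator and are positively related, the independent-metric figure is an upper bound rather than the expected rate.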






#### 6. Data Analysis

6.1 EDA

In [7]:
directory_path = '/Users/emmababe/Documents/AB-Testing-Project/Results'

for dirname, _, filenames in os.walk(directory_path):
    for filename in filenames:
        print(filename)

experiment_results.xlsx
control_results.xlsx


In [8]:
control = pd.read_excel('/Users/emmababe/Documents/AB-Testing-Project/Results/control_results.xlsx')
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134,70
1,"Sun, Oct 12",9102,779,147,70
2,"Mon, Oct 13",10511,909,167,95
3,"Tue, Oct 14",9871,836,156,105
4,"Wed, Oct 15",10014,837,163,64


In [9]:
experiment = pd.read_excel('/Users/emmababe/Documents/AB-Testing-Project/Results/experiment_results.xlsx')
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105,34
1,"Sun, Oct 12",9288,785,116,91
2,"Mon, Oct 13",10480,884,145,79
3,"Tue, Oct 14",9867,827,138,92
4,"Wed, Oct 15",9793,832,140,94


In [10]:
# Check that the total sample size is above the required minimum
sample_size_control = control["Pageviews"].sum()
sample_size_experiment = experiment["Pageviews"].sum()
sample_size = sample_size_control+sample_size_experiment
sample_min = 685325
print(f'The sample size is {sample_size}')
print(f'The minimum sample size is {sample_min}')
print(f'Sample size greater than minimum sample requirement: {sample_size > sample_min}')

The sample size is 690203
The minimum sample size is 685325
Sample size greater than minimum sample requirement: True


6.2 Sanity Check

As stated in section 2.2.1, the three invariant metrics are number of cookies, number of clicks, and click-through probability. 
We expect these metrics not to differ significantly between the control and treatment groups. 

6.2.1 Sanity Check Part 1: number of cookies and number of clicks


If we treat a cookie being assigned to the control group as a success, then the number of successes should follow a binomial distribution with a 50% probability of success, because a cookie should have a 50-50 chance of being assigned to either the treatment or the control group. The same applies to the number of clicks. 
Since n is large enough and each event is independent and identically distributed, we can assume that the sampling distribution of the sample proportion follows a normal distribution according to the CLT. 
There are two ways to do the sanity check: 1. use the CLT to get a confidence interval, i.e. a one-proportion z-test (used in this project); 2. use an exact binomial test to calculate the probability of observing the exact number of successes or more extreme outcomes on both sides of the distribution.

Side note on the CLT approach
Usually we calculate the confidence interval as p_hat +/- margin of error because we do not know the true probability; in most empirical experiments, the true population proportion p is unknown. Here, however, we know that if the experiment is set up properly, the true probability is 0.5. Thus, we use 0.5 +/- margin of error, and if p_hat is not in the CI, we reject the null. 
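The second option, the exact binomial test, is not used in this project, but a minimal sketch of it could look like the following. It assumes SciPy 1.7 or later, where `scipy.stats.binomtest` is available. The helper name and the counts are illustrative only and are not taken from the experiment data.

```python
from scipy.stats import binomtest

def exact_split_pvalue(n_control, n_experiment, expected=0.5):
    """Two-sided exact binomial p-value for the control group's share of the total."""
    total = n_control + n_experiment
    result = binomtest(n_control, total, p=expected, alternative="two-sided")
    return result.pvalue

# Illustrative counts only, not the experiment's pageview totals.
p_value = exact_split_pvalue(5_030, 4_970)
print(f"p-value: {p_value:.3f}, passes sanity check: {p_value > 0.05}")
```

With samples as large as the ones in this experiment, the exact test and the CLT interval should lead to the same conclusion.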

In [17]:
from scipy import stats
#1. Apply CLT to get the confidence interval and see if the observed value falls into the confidence interval
sanity_check = pd.DataFrame(columns=["CI_left", "CI_right", "p_hat","passed?"], index=["Cookies", "Clicks", "CTP"])

#set alpha and the expected proportion p under the null hypothesis
p = 0.5
alpha = 0.05

def standardError (n, p):
    '''Return the standard deviation for a given probability p and sample size n'''
    return (p*(1-p)/n)**0.5

for i,j in zip(["Cookies", "Clicks"], ["Pageviews", "Clicks"]):
    #calculate the number of successes (n_control) and number of observations (n)
    n = control[j].sum()+experiment[j].sum()
    n_control = control[j].sum()
    
    #compute confidence interval
    sanity_check.at[i, "CI_left"] = p-(stats.norm.ppf(1-alpha/2)*standardError(n,p))
    sanity_check.at[i, "CI_right"] = p+(stats.norm.ppf(1-alpha/2)*standardError(n,p))
    
    #compute observed fraction of successes
    sanity_check.at[i, "p_hat"] = round(n_control/(n),4)
    
    #check if the observed fraction of successes lies within the 95% confidence interval
    if sanity_check.at[i, "CI_left"] <= sanity_check.at[i, "p_hat"] <= sanity_check.at[i, "CI_right"]:
        sanity_check.at[i, "passed?"] = "yes"
    else:
        sanity_check.at[i, "passed?"] = "no"

#return results
sanity_check

Unnamed: 0,CI_left,CI_right,p_hat,passed?
Cookies,0.49882,0.50118,0.5006,yes
Clicks,0.49588,0.50412,0.5005,yes
CTP,,,,


6.2.2 Sanity Check Part 2: click-through probability

We use a two-proportion z-test to check whether the CTP passes the sanity check. Under the null hypothesis, the difference between the CTPs of the two groups is 0, so the confidence interval is centered at 0.

$CI = \left[0 - Z_{1-\alpha/2} \cdot SE; 0 + Z_{1-\alpha/2} \cdot SE\right]$

In this project we use the second way of calculating SE_pooled shown below (strictly speaking, an unpooled standard error), which allows the SDs of the experiment and control groups to differ. 

$SE_{\text{pooled}} = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$

$SD = \sqrt{p \cdot (1-p)}$

$p_{\text{Clicks}} = \frac{n_{\text{Clicks}}}{n_{\text{Cookies}}}$

$d_\text{hat} = CTP_{\text{experiment}} - CTP_{\text{control}}$





What type of SE_pooled to use?
1. Pooled: when comparing proportions between two groups, such as in a two-sample z-test for proportions or a confidence interval for the difference in proportions, assuming the two samples have similar SDs. Here $p_{\text{pooled}}$ is the total number of successes divided by $n_1 + n_2$.

    $SE_{\text{pooled}} = \sqrt{p_{\text{pooled}} \cdot (1 - p_{\text{pooled}}) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

2. Unpooled: when dealing with continuous data and comparing means between two groups, assuming different SDs between the two groups.

    $SE_{\text{pooled}} = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$

    The same form applies to proportions with $SD = \sqrt{p \cdot (1-p)}$, which gives the z statistic:

    $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1 \times (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 \times (1 - \hat{p}_2)}{n_2}}}$
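To see how much the choice between the two standard errors matters, the sketch below implements both forms for proportions. The function names and the counts are illustrative only. When the two groups have similar sizes and similar proportions, as in a well-randomized sanity check, the two values are nearly identical.

```python
def pooled_se(x1, n1, x2, n2):
    """Standard error of the difference using a single pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    return (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5

def unpooled_se(x1, n1, x2, n2):
    """Standard error of the difference using each group's own proportion."""
    p1, p2 = x1 / n1, x2 / n2
    return (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5

# Illustrative counts only: successes and observations per group.
se_pooled = pooled_se(820, 10_000, 830, 10_000)
se_unpooled = unpooled_se(820, 10_000, 830, 10_000)
print(f"pooled: {se_pooled:.6f}, unpooled: {se_unpooled:.6f}")
```

The two forms diverge mainly when group sizes are unbalanced and the proportions differ noticeably.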




In [20]:
#compute CTP for both groups
CTP_control = control["Clicks"].sum()/control["Pageviews"].sum()
CTP_experiment = experiment["Clicks"].sum()/experiment["Pageviews"].sum()

#compute sample standard deviations for both groups
S_control = (CTP_control*(1-CTP_control))**0.5
S_experiment = (CTP_experiment*(1-CTP_experiment))**0.5

#compute SE_pooled
SE_pooled = (S_control**2/control["Pageviews"].sum()+S_experiment**2/experiment["Pageviews"].sum())**0.5

#compute 95% confidence interval and store it in sanity check
alpha = 0.05

sanity_check.at["CTP", "CI_left"] = 0-(stats.norm.ppf(1-alpha/2)*SE_pooled)
sanity_check.at["CTP", "CI_right"] = 0+(stats.norm.ppf(1-alpha/2)*SE_pooled)

#compute observed difference d and store it in sanity check
sanity_check.at["CTP", "p_hat"] = round(CTP_experiment - CTP_control,4)

#check if sanity check is passed
if sanity_check.at["CTP", "CI_left"] <= sanity_check.at["CTP", "p_hat"] <= sanity_check.at["CTP", "CI_right"]:
    sanity_check.at["CTP", "passed?"] = "yes"
else:
    sanity_check.at["CTP", "passed?"] = "no"

#return results
sanity_check

Unnamed: 0,CI_left,CI_right,p_hat,passed?
Cookies,0.49882,0.50118,0.5006,yes
Clicks,0.49588,0.50412,0.5005,yes
CTP,-0.0012957,0.0012957,0.0001,yes


6.3 Effect Size Testing


Based on the hypotheses from section 2.3, we will test the following two sets of hypotheses:
1. - $H_\text{0}: GC_{\text{control}} = GC_{\text{experiment}}$
   - $H_\text{1}: GC_{\text{control}} \neq GC_{\text{experiment}}$

2. - $H_\text{0}: NC_{\text{control}} = NC_{\text{experiment}}$
   - $H_\text{1}: NC_{\text{control}} \neq NC_{\text{experiment}}$
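One way to test these hypotheses is to build a confidence interval for the difference in proportions and compare it against both 0 and dmin. The sketch below is a minimal version under stated assumptions: it uses a pooled standard error (the CTP sanity check above uses the unpooled form instead), it calls a result practically significant only when the whole interval lies beyond dmin, and the function name and counts are illustrative rather than taken from the experiment data.

```python
from scipy.stats import norm

def effect_size_test(x_control, n_control, x_experiment, n_experiment, dmin, alpha=0.05):
    """Confidence interval for the difference (experiment - control) of two proportions."""
    p_control = x_control / n_control
    p_experiment = x_experiment / n_experiment
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    se = (p_pool * (1 - p_pool) * (1 / n_control + 1 / n_experiment)) ** 0.5
    d_hat = p_experiment - p_control
    margin = norm.ppf(1 - alpha / 2) * se
    lower, upper = d_hat - margin, d_hat + margin
    return {
        "d_hat": d_hat,
        "ci": (lower, upper),
        # Statistically significant when the interval excludes 0
        "statistically_significant": not (lower <= 0 <= upper),
        # Assumption: practically significant when the interval lies entirely beyond dmin
        "practically_significant": upper < -abs(dmin) or lower > abs(dmin),
    }

# Illustrative counts only: enrollments and clicks per group.
result = effect_size_test(200, 1_000, 150, 1_000, dmin=0.01)
print(result)
```

For Gross Conversion the numerators would be enrollments and for Net Conversion payments, both with clicks as the denominator.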




#### 7. Interpretation & Recommendation
The original question is "How to reduce early cancellations?" In other words, the goal is to increase the number of students who stick with the course to the end, which translates to a better student (user) experience without sacrificing revenue (the total number of payments). In fact, user experience and revenue can increase at the same time through one strategy: increasing the coaching capacity. By investing in more coaching resources, Udacity gives students sufficient coaching support to help them stay longer in each course, which increases the number of payments. At the same time, with more support and a better user experience, Udacity can build a positive brand image, and the word-of-mouth effect will attract more students to enroll and convert to paying users. That would be a win-win-win situation. 

One drawback of the "Free Trial Screener" treatment is that it discourages low-commitment students from becoming paid users. Those students may not have an immediate commitment, but they hope that investing in the course will motivate them to take it. This is similar to gym memberships, which people like to buy, especially around New Year, to motivate themselves to keep working out. Essentially, there are two types of students. The first type needs the course certificate, so they pay and finish the course. The second type needs motivation to study, so they make the payment first and then figure out how to keep studying, though they might not be as likely to finish the course as the first type. We do not want to miss out on revenue from the second cohort. 

#### 8. Follow-Up Experiment
Below are the three evaluation metrics that I want to test if we decide to increase the coaching resources. 

1. **Number of Payments:**
   - Null Hypothesis (H0): There is no significant difference in the number of payments between the control and experimental groups.
   - Alternative Hypothesis (H1): The experimental group will have a significantly higher number of payments compared to the control group.

2. **Number of Enrollments:**
   - Null Hypothesis (H0): There is no significant difference in the number of enrollments between the control and experimental groups.
   - Alternative Hypothesis (H1): The experimental group will have a significantly higher number of enrollments compared to the control group.

3. **Number of Questions Answered per Payment User:**
   - Null Hypothesis (H0): There is no significant difference in the number of questions answered per payment user between the control and experimental groups.
   - Alternative Hypothesis (H1): The experimental group will have a significantly higher number of questions answered per payment user compared to the control group.

These hypotheses will help in assessing the impact of the increased coaching support on payments, enrollments, and user engagement with questions.