# AB Testing Final Project Udacity
project description: https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub

## Import module 

In [4]:
import pandas as pd
import math

# 1. Metric Choice
> See readme.md - https://github.com/Wei-Kuang/Udacity_AB_Testing_Final_Project#readme

# 2. Measuring Variability
This **spreadsheet** (see the DF-dataframe that I created) contains rough estimates of the baseline values for these metrics (again, these numbers have been changed from Udacity's true numbers).

In [5]:
x_dict = {
    "metric":['Unique cookies to view course overview page per day', 
              'Unique cookies to click "Start free trial" per day', 
              'Enrollments per day',  
              'Click-through-probability on "Start free trial"', 
              'Probability of enrolling, given click',  
              'Probability of payment, given enroll', 
              'Probability of payment, given click']
    
    ,"alias":['n_pageview', 'n_click', 'n_enroll', 'p_click', 'p_enroll_click' , 'p_payment_enroll','p_payment_click']
    
    ,"value": [40000,3200,660,0.08,0.20625, 0.53, 0.1093125]
}

pd.set_option('display.max_colwidth', None) # display all content
DF = pd.DataFrame(x_dict)
DF

Unnamed: 0,metric,alias,value
0,Unique cookies to view course overview page per day,n_pageview,40000.0
1,"Unique cookies to click ""Start free trial"" per day",n_click,3200.0
2,Enrollments per day,n_enroll,660.0
3,"Click-through-probability on ""Start free trial""",p_click,0.08
4,"Probability of enrolling, given click",p_enroll_click,0.20625
5,"Probability of payment, given enroll",p_payment_enroll,0.53
6,"Probability of payment, given click",p_payment_click,0.109313


For each metric **you selected as an evaluation metric**, estimate its **standard deviation** analytically, given that the sample size is 5000 cookies visiting course overview page. 

### Evaluation metrics
I select three evaluation metrics and their **baseline value**:
1. Gross conversion = enrollment / click = probability of enrolling, given click = **0.20625**
2. Retention = payment / enrollment = probability of payment, given enroll = **0.53**
3. Net conversion = payment / click = probability of payment, given click  = **0.109313**

### Analytic standard deviation of the evaluation metrics, given 5000 page views

My thinking process:  
1. Basically, this is a scale-down process.


2. The baseline values are based on a sample size of 40000 page views. Now we need to re-estimate the statistics for a sample size of 5000 page views.


3. Standard error (SE) of a probability:

$ SE = \sqrt{P(1-P) \over N}  $

4. The key is to find N (the denominator) for each evaluation metric, given 5000 page views. That is, we need to figure out **n_click** and **n_enroll**.


5. In other words, we solve these two equations for the number of clicks and enrollments, given 5000 page views.

$$  {3200 \over 40000} = { click \over 5000}  $$

$$  {660 \over 40000} = { enroll \over 5000}  $$

In [6]:
# Data given pageview = 5000
n_Click  = (3200 * 5000 ) /40000
n_Enroll = (660 * 5000 ) /40000

# computation result given page view = 5000
print('N of Click:', n_Click)
print('N of Enrollment: ', n_Enroll)
print('Given that course pageview = 5000')

N of Click: 400.0
N of Enrollment:  82.5
Given that course pageview = 5000


### Calculate the analytic standard error, given that page views = 5000

Let's use the analytic form to compute the standard error (SE) of each probability. The helper is named `get_prob_sd`, but the quantity it returns is this SE:

$ SE = \sqrt{P(1-P) \over N}  $

In [7]:
def get_prob_sd(p,n):
    import math
    sd = math.sqrt(  ( p*(1-p) /n ) )
    return round(sd, 4)

#### SE - Gross Conversion


In [8]:
# Probability of enrolling, given click = 0.206250
# n_click= 400, given the page view = 5000
get_prob_sd(p=0.20625, n = 400)

0.0202

#### SE - Retention

In [9]:
# Probability of payment, given enroll = 0.530000
# N of enrolling = 82.5 , given pageview = 5000 
get_prob_sd(p=0.53, n = 82.5)

0.0549

#### SE - Net Conversion

In [10]:
# NC =  (n of payment) / n_click = Probability of payment, given click = 0.109313
# n_click= 400, given the page view = 5000
get_prob_sd(p=0.109313, n = 400)

0.0156

**Summary of Variability**

The estimated variability is based on 5000 page views. This gives us an idea of which metric might be problematic.

|Metric| Standard Error|
|---   |                ---|
|Gross Conversion|0.0202|
|Retention       |0.0549|
|Net Conversion  |0.0156|

> **Note:** Retention's SE is relatively large. We will take a look if we can collect enough data to overcome this large variability.

**Question:** Do you expect the analytic estimates to be accurate? That is, for which metrics, if any, would you want to collect an empirical estimate of the variability if you had time?

> **My answer:** 
I would like to collect an empirical estimate of variability for **Retention**, because its unit of analysis (user-id, the enrolled users in the denominator) is not the same as the unit of diversion (cookie). In such a case, the analytic estimate of variability usually underestimates the true variability, so it is better to use a bootstrap method to obtain an empirical estimate. For **Gross Conversion** and **Net Conversion**, the denominator is clicks, which are counted by unique cookies, so the unit of analysis matches the unit of diversion and the analytic estimates are likely to be accurate.


|Evaluation metric | Numerator                     | Denominator (unit of analysis)  | Unit of diversion |
|------------------|-------------------------------|---------------------------------|-------------------|
|Gross Conversion  | enrollment (user-id)          | click (cookie)                  | cookie            |
|Retention         | payment (user-id)             | enroll (user-id)                | cookie            |
|Net Conversion    | payment (user-id)             | click (cookie)                  | cookie            |
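The bootstrap idea mentioned above can be sketched as follows. This is only an illustration, not part of the original analysis: it uses the ten Control rows displayed in section 4 (Enrollments and Payments), resamples whole days with replacement, and compares the spread of the resampled retention ratio with the analytic binomial SE at the same sample size. Resampling by day, the helper names `analytic_se` and `bootstrap_se`, the number of resamples, and the seed are all my assumptions.

```python
import math
import random

# First ten Control rows shown in section 4 (Enrollments and Payments columns).
enrollments = [134, 147, 167, 156, 163, 138, 146, 110, 131, 165]
payments = [70, 70, 95, 105, 64, 82, 76, 70, 60, 97]

def analytic_se(numerators, denominators):
    """Binomial SE of the pooled ratio: sqrt(p * (1 - p) / N)."""
    n = sum(denominators)
    p = sum(numerators) / n
    return math.sqrt(p * (1 - p) / n)

def bootstrap_se(numerators, denominators, n_boot=2000, seed=0):
    """Empirical SE: resample whole days with replacement and recompute the ratio."""
    rng = random.Random(seed)  # fixed, arbitrary seed so the run is repeatable
    days = list(zip(numerators, denominators))
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(days) for _ in days]
        estimates.append(sum(x for x, _ in sample) / sum(n for _, n in sample))
    mean = sum(estimates) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))

analytic = analytic_se(payments, enrollments)
boot = bootstrap_se(payments, enrollments)
print("Analytic SE of retention: ", round(analytic, 4))
print("Bootstrap SE of retention:", round(boot, 4))
```

Both numbers refer to the roughly 1,457 enrollments in those ten rows, so they are not directly comparable to the SE table above, which is scaled to 5000 page views. With only ten days the bootstrap estimate is itself noisy; the point is the comparison between the two figures, not their exact values.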

# 3.  Sizing vs. Duration vs. Exposure

### Choosing Number of Samples given Power
Using the analytic estimates of variance, how many page views total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

#### My thinking process:  
1. I will compute the required sample size for each evaluation metric, based on the (1) baseline data, (2) minimally important difference (dmin) , (3) alpha = 0.05, and (4) beta=0.2.

2. On-line calculator will be a great tool for this task: https://www.evanmiller.org/ab-testing/sample-size.html

3. Convert the required "sample size" into required "page views" for both control and experiment groups.

4. Then, check whether our system can reach the sample size in a short time (usually ~ 30 days). If a metric needs more than 30 days to collect, we might need to drop it or re-design the experiment.

5. Based on the required page views, I will select the maximum page views among evaluation metrics.
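To make step 1 less of a black box, here is a minimal sketch of a common normal-approximation formula for the per-group sample size of a two-proportion test. I am not claiming this is the online calculator's implementation; the function name `required_sample_size` and the mirroring of baselines above 0.5 are my assumptions. For the three baselines in this notebook it appears to land within rounding of the calculator figures quoted in the sections that follow.

```python
import math
from statistics import NormalDist

def required_sample_size(p, dmin, alpha=0.05, beta=0.2):
    """Approximate per-group N to detect an absolute change dmin from baseline p."""
    if p > 0.5:
        # Assumption: mirror baselines above 0.5 so that p + dmin moves toward 0.5,
        # the higher-variance (more conservative) side.
        p = 1.0 - p
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for the desired power
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + dmin) * (1 - p - dmin))
    return ((z_alpha * sd_null + z_beta * sd_alt) / dmin) ** 2

for name, p, dmin in [("Gross Conversion", 0.20625, 0.01),
                      ("Retention", 0.53, 0.01),
                      ("Net Conversion", 0.1093125, 0.0075)]:
    print(name, round(required_sample_size(p, dmin)))
```

The result is a count of the metric's denominator (clicks or enrollments) per group, so it still has to be doubled and converted to page views.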



#### Size  -  Gross Conversion 
* dmin = 0.01

* GC = n_enroll / n_click  = Probability of enrolling, given click  = 0.206250 (baseline)

* Required Sample Size in each group, N=25835 (using Online-calculator).

* Required n-click (two groups) = 2 * 25835 = 51670

* Then, we want to know how many page views are needed to reach the required number of clicks.

* Based on the baseline ratio, there are 3,200 clicks per 40,000 page views.

* We need 2 * 25835 = 51,670 clicks, so we need (40000 * 51670) / 3200 = 645,875 page views.

In [11]:
N_Click_both_group = 2*25835
N_Pageview = (N_Click_both_group * 40000) /3200
N_Pageview

645875.0

#### Size - Retention
* dmin = 0.01
* RT = (n of payment) / n_enroll = Probability of payment, given enroll = 0.530000
* Required Sample Size in each group, N=39115 (using Online-calculator).
* Two groups of required enrollments = 2 * 39115 = 78230
* Based on the ratio, there will be 660 enrollments given 40000 page views.
* Thus, to reach 78,230 enrollments, we need about 4,741,212 page views.

In [12]:
N_Enrol_two_groups = 2*39115
N_Pageview = (N_Enrol_two_groups * 40000)/660
round(N_Pageview,0)

4741212.0

#### Size - Net Conversion
* dmin = 0.0075
* NC =  (n of payment) / n_click = Probability of payment, given click = 0.109313
* Required Sample Size in each group, N=27,413 (using Online-calculator).
* Two groups of required n-click = 2 * 27413 = 54,826
* Based on the ratio, there will be 3,200 clicks given 40000 page views.
* We need 2 * 27413 = 54,826 clicks, so we need (40000 * 54826) / 3200 = 685,325 page views.

In [13]:
N_Click_two_groups = 2*27413
N_Pageview = (N_Click_two_groups * 40000)/3200
N_Pageview

685325.0

#### How many days are needed to reach the required page views for each evaluation metric?

In [14]:
print('Gross Conversion: '  ,  math.ceil(645875/(40000*0.5)  ),  ' days, using 50% traffic' )
print('Retention:        '  ,  math.ceil(4741212/(40000*0.5) ),  ' days, using 50% traffic' )
print('Net Conversion:   '  ,  math.ceil(685325/(40000*0.5)  ),  ' days, using 50% traffic' )
print('\n')
print('Gross Conversion: '  ,  math.ceil(645875/(40000*1.00)  ),  ' days, using 100% traffic' )
print('Retention:        '  ,  math.ceil(4741212/(40000*1.00) ),  ' days, using 100% traffic' )
print('Net Conversion:   '  ,  math.ceil(685325/(40000*1.00)  ),  ' days, using 100% traffic' )

Gross Conversion:  33  days, using 50% traffic
Retention:         238  days, using 50% traffic
Net Conversion:    35  days, using 50% traffic


Gross Conversion:  17  days, using 100% traffic
Retention:         119  days, using 100% traffic
Net Conversion:    18  days, using 100% traffic


**Summary of Sizing**

* This is the sample size to reach these wanted statistics parameters: 0.05 alpha (Type I error) and 0.2 Beta (Type II error)
* With 100% traffic, the site estimated that there will be 40000 page views per day.

|Metrics| Required page view (cookies)| Required days with 50% traffic | Required days with 100% traffic |
|-|-|-|-|
|Gross Conversion| 645,875|33 days| 17 days|
|Retention|4,741,212|238 days|119 days|
|Net Conversion|685,325|35 days|18 days|

> Note: Retention needs 119 days (17 weeks, about 4 months) to reach the required sample size, even if we use 100% of the traffic. This is too long for an online experiment, so I will drop "Retention" from the list of evaluation metrics.

**Question:** Which evaluation metrics did you choose?

> **ANS:** (1) gross conversion and (2) net conversion


**Question:** Will you use Bonferroni Correction? 

> **ANS:** These evaluation metrics are highly correlated, so Bonferroni would be too conservative.


**Question:** Is the change risky enough that you wouldn't want to run on all traffic?

> **ANS:** The feature adds a question asking users about their time commitment and then diverts them into different processes. I think this new feature is not very risky, because it only adds a branch to the existing process instead of replacing the entire user funnel.



**Question:** What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)?
 
> **ANS:** Just to be safe, we can use 90% of the traffic. 

**Question:** How many page views do we need? 

> **ANS:** 685,325, which is the larger of the required page views for (1) gross conversion and (2) net conversion.

## 4. Data
The data for you to analyze is here. This data contains the raw information needed to compute the above metrics, broken down day by day. Note that there are two sheets within the spreadsheet - one for the experiment group, and one for the control group.


#### Column definition:

* **Page views:** Number of unique cookies to view the course overview page that day.

* **Clicks:** Number of unique cookies to click the course overview page that day.

* **Enrollments:** Number of user-ids to enroll in the free trial that day.

* **Payments:** Number of user-ids who enrolled on that day and remained enrolled for 14 days, and thus made a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

#### Control Group Data

In [15]:
df_Ctrl = pd.read_excel("C:/Users/15177/Desktop/Learn_Analysis/AB_Test_Udacity/Final_Project_Results.xlsx", sheet_name="Control")
df_Ctrl

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723.0,687.0,134.0,70.0
1,"Sun, Oct 12",9102.0,779.0,147.0,70.0
2,"Mon, Oct 13",10511.0,909.0,167.0,95.0
3,"Tue, Oct 14",9871.0,836.0,156.0,105.0
4,"Wed, Oct 15",10014.0,837.0,163.0,64.0
5,"Thu, Oct 16",9670.0,823.0,138.0,82.0
6,"Fri, Oct 17",9008.0,748.0,146.0,76.0
7,"Sat, Oct 18",7434.0,632.0,110.0,70.0
8,"Sun, Oct 19",8459.0,691.0,131.0,60.0
9,"Mon, Oct 20",10667.0,861.0,165.0,97.0


> Note: There are missing values starting at row-23 for Enrollment and Payments

#### Experiment Group Data

In [16]:
df_Exp = pd.read_excel("C:/Users/15177/Desktop/Learn_Analysis/AB_Test_Udacity/Final_Project_Results.xlsx", sheet_name="Experiment")
df_Exp

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716.0,686.0,105.0,34.0
1,"Sun, Oct 12",9288.0,785.0,116.0,91.0
2,"Mon, Oct 13",10480.0,884.0,145.0,79.0
3,"Tue, Oct 14",9867.0,827.0,138.0,92.0
4,"Wed, Oct 15",9793.0,832.0,140.0,94.0
5,"Thu, Oct 16",9500.0,788.0,129.0,61.0
6,"Fri, Oct 17",9088.0,780.0,127.0,44.0
7,"Sat, Oct 18",7664.0,652.0,94.0,62.0
8,"Sun, Oct 19",8434.0,697.0,120.0,77.0
9,"Mon, Oct 20",10496.0,860.0,153.0,98.0


> Note: There are missing values starting at row-23 for Enrollment and Payments

## 5. Sanity Check
Checking whether your invariant metrics are equivalent between the Control and Experiment groups. 

Here are my invariant metrics:
* Number of cookies (simple count)
* Number of clicks (simple count)
* Click-through-probability (probability)


Hints:
> If the invariant metric is a **simple count** that should be randomly split between the 2 groups, you can use a **binomial test** as demonstrated in Lesson 5. Otherwise, you will need to construct a **confidence interval** for a **difference in proportions** using a similar strategy as in Lesson 1, then check whether the difference between group values falls within that confidence level.


## Sanity Check: [1] Number of cookies
**Note:** Page view is tracked by cookies.  Thus, the invariant metric (the number of cookies) = the number of page view.

**My thinking process for Sign Test (Binomial test) :** 
1. Page view is a simple count, so I can use Sign Test (Binomial test) to perform the sanity check. 


2. The key is to realize each day is an independent trial and the success can be defined as "Control > Experiment". 


3. The assumption is that **page views** should be equally and randomly divided into two groups. Therefore, on some days the Control group may have more page views than the Experiment group, and the chance of this is expected to be 0.5. 


4. Then, the "binomial test" checks whether the observed data match this expectation. If the p-value > 0.05, we fail to reject the null hypothesis, i.e. there is no statistical evidence of a difference between the two groups.

**My thinking process for using Confidence interval to perform sanity check :** 
1. The assumption is that the page views should be equally and randomly divided into two groups, so the ratio of Control to Experiment should be close to 1:1; that is, p = (N of Control)/(total) should be about 0.5.


2. The strategy is to obtain the 95% confidence interval (95%CI) on the expected p, given the expected p=0.5 and observed N.


3. The decision rule is that if the observed $ \hat{p} $ is in this 95%CI, then the sanity check passes, because it's acceptable to observe such a $ \hat{p} $ at the 95% confidence level.
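Written out, the check in these three steps is the following, where $N_{ctrl}$ and $N_{exp}$ are my shorthand for the total counts in the Control and Experiment groups:

$$  SE = \sqrt{0.5(1-0.5) \over N}, \qquad N = N_{ctrl} + N_{exp}  $$

$$  95\%CI = 0.5 \pm 1.96 \cdot SE, \qquad \hat{p} = { N_{ctrl} \over N }  $$

The sanity check passes when $\hat{p}$ falls inside this interval.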

In [17]:
##################################
#### Sign Test for Page views ####
##################################
from scipy import stats
N_trials = len(df_Ctrl['Pageviews']) 
N_success = sum( df_Ctrl['Pageviews'] > df_Exp['Pageviews'])
# stats.binom_test is deprecated and absent from recent SciPy releases; binomtest(...).pvalue replaces it
p_value_SignTest = stats.binomtest(k=N_success, n=N_trials, p=0.5, alternative='two-sided').pvalue
p_value_SignTest = round(p_value_SignTest,6)

# Sign Test Result
print( "N_trials: ", N_trials)
print( "N_success: ", N_success)
print( "Sign Test (two-sided) P-value: ", p_value_SignTest)
print('')
print('Conclusion: Sign Test (two-sided) P-value:  0.324009 > 0.05, so we fail to reject the null hypothesis. There is no evidence of a difference. Sanity Check passes!')

N_trials:  37
N_success:  22
Sign Test (two-sided) P-value:  0.324009

Conclusion: Sign Test (two-sided) P-value:  0.324009 > 0.05, so we fail to reject the null hypothesis. There is no evidence of a difference. Sanity Check passes!


In [18]:
##########################################
#### Using CI to perform Sanity Check ####
##########################################
N_cookies_Ctrl = df_Ctrl['Pageviews'].sum()
N_cookies_Exp = df_Exp['Pageviews'].sum()

# Compute the 95% CI for the value that we expect to observe. Thus, the SE is based on p = 0.5 
N = N_cookies_Ctrl+N_cookies_Exp
SE_expected = math.sqrt( 0.5 * (1-0.5) * (1/N))
SE_expected = round(SE_expected, 4)

# 95% CI for the value that we expect to observe 
margin_error = 1.96* SE_expected
CI95_expected = [0.5-margin_error, 0.5 + margin_error]
CI95_expected = [round(i,4) for i in CI95_expected]

# observed probability
P_Obs = N_cookies_Ctrl / (N_cookies_Ctrl+N_cookies_Exp)
P_Obs = round(P_Obs,4)

# Result
print('SE                                              ', SE_expected)
print('95% Confidence interval on expected probability ', CI95_expected)
print('Observed probability                            ', P_Obs)

SE                                               0.0006
95% Confidence interval on expected probability  [0.4988, 0.5012]
Observed probability                             0.5006


**Result of Sanity Check: Number of cookies**
* Sanity Check passes!
* Since the observed probability (0.5006) is in the expected 95%CI [0.4988, 0.5012], this matches our expectation. 
* The sign test (binomial test) also agrees with this conclusion.

## Sanity Check: [2] Number of clicks

**My thinking process for using Sign Test:**
1. The number of clicks is a simple count, so I can use Sign Test (Binomial test) to perform the sanity check. 


2. The key is to realize each day is an independent trial and the success can be defined as "Control > Experiment". 


3. The assumption is that **clicks** should be equally and randomly divided into two groups. Therefore, on some days the Control group may have more clicks than the Experiment group, and the chance of this is expected to be 0.5. 


4. Then, the "binomial test" checks whether the observed data match this expectation. If the p-value > 0.05, we fail to reject the null hypothesis, i.e. there is no statistical evidence of a difference between the two groups.

**My thinking process for using Confidence interval to perform sanity check :** 
1. The assumption is that the clicks should be equally and randomly divided into two groups, so the ratio of Control to Experiment should be close to 1:1; that is, p = (N of Control)/(total) should be about 0.5.


2. The strategy is to obtain the 95% confidence interval (95%CI) on the expected p, given the expected p=0.5 and observed N.


3. The decision rule is that if the observed $ \hat{p} $ is in this 95%CI, then the sanity check passes, because it's acceptable to observe such a $ \hat{p} $ at the 95% confidence level.

In [19]:
##############################
#### Sign Test for clicks ####
##############################
from scipy import stats

N_trials = len(df_Ctrl['Clicks']) 
N_success = sum( df_Ctrl['Clicks'] > df_Exp['Clicks'])

# stats.binom_test is deprecated and absent from recent SciPy releases; binomtest(...).pvalue replaces it
p_value_SignTest = stats.binomtest(k=N_success, n=N_trials, p=0.5, alternative='two-sided').pvalue
p_value_SignTest = round(p_value_SignTest,6)

print( "N_trials: ", N_trials)
print( "N_success: ", N_success)
print( "Sign Test (two-sided) P-value: ", p_value_SignTest)
print('')
print( 'Conclusion: Sign Test (two-sided) P-value:  1.0 > 0.05, so we fail to reject the null hypothesis. There is no evidence of a difference. Sanity Check passes!')

N_trials:  37
N_success:  18
Sign Test (two-sided) P-value:  1.0

Conclusion: Sign Test (two-sided) P-value:  1.0 > 0.05, so we fail to reject the null hypothesis. There is no evidence of a difference. Sanity Check passes!


In [20]:
##########################################
#### Using CI to perform Sanity Check ####
##########################################
N_clicks_Ctrl = df_Ctrl['Clicks'].sum()
N_clicks_Exp = df_Exp['Clicks'].sum()

# Compute the 95% CI for the value that we expect to observe 
# Thus, the SE is based on p = 0.5 
N = N_clicks_Ctrl+N_clicks_Exp
SE_expected = math.sqrt( 0.5 * (1-0.5) * (1/N))
SE_expected = round(SE_expected, 4)

# 95% CI for the value that we expect to observe 
margin_error = 1.96* SE_expected
CI95_expected = [0.5-margin_error, 0.5 + margin_error]
CI95_expected = [round(i,4) for i in CI95_expected]

# observed probability

P_Obs = N_clicks_Ctrl / (N_clicks_Ctrl+N_clicks_Exp)
P_Obs = round(P_Obs,4)

# Result
print('SE                                              ', SE_expected)
print('95% Confidence interval on expected probability ', CI95_expected)
print('Observed probability                            ', P_Obs)

SE                                               0.0021
95% Confidence interval on expected probability  [0.4959, 0.5041]
Observed probability                             0.5005


**Sanity Check Result - Number of Clicks**

* Sanity Check passes!

* Since the observed probability (0.5005) is in the expected 95%CI [0.4959, 0.5041], this matches our expectation. 

* The sign test (binomial test) also agrees with this conclusion.



## Sanity Check: [3] Click-through-probability (CTP)

**My thinking process for using Confidence interval to perform sanity check :** 
    
1. The assumption is that the Click-through-probability should be the same in two groups.


2. Thus, we can perform the test to see if the difference between two groups is close to zero.


3. The strategy is to obtain the 95% confidence interval on the difference. By using this method, we need to use pooled probability to compute the SE for difference.


4. The decision rule is that if this 95%CI of the difference includes zero, then the sanity check passes.
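In formula form, the pooled approach in steps 3 and 4 reads as follows. The symbols are my shorthand: $X$ is the number of clicks and $N$ the number of page views in each group, and $\hat{d}$ is the observed difference.

$$  \hat{p}_{pool} = { X_{ctrl} + X_{exp} \over N_{ctrl} + N_{exp} }  $$

$$  SE_{diff} = \sqrt{ \hat{p}_{pool} (1-\hat{p}_{pool}) \left( {1 \over N_{ctrl}} + {1 \over N_{exp}} \right) }  $$

$$  95\%CI = \hat{d} \pm 1.96 \cdot SE_{diff}, \qquad \hat{d} = { X_{exp} \over N_{exp} } - { X_{ctrl} \over N_{ctrl} }  $$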

In [21]:
# control
N_pageviews_Ctrl = df_Ctrl['Pageviews'].sum()
N_clicks_Ctrl    = df_Ctrl['Clicks'].sum()
CTP_Ctrl         = N_clicks_Ctrl /N_pageviews_Ctrl 
CTP_Ctrl         = round(CTP_Ctrl, 6)

# experiment 
N_pageviews_Exp  = df_Exp['Pageviews'].sum()
N_clicks_Exp     = df_Exp['Clicks'].sum()
CTP_Exp          = N_clicks_Exp / N_pageviews_Exp
CTP_Exp          = round(CTP_Exp, 6)

# difference
diff = round(CTP_Exp - CTP_Ctrl, 6)

print( "Observed data:")
print( "Click-through-probability ", "Control :", CTP_Ctrl)
print( "Click-through-probability ", "Exp :", CTP_Exp)
print( "Difference:", "Exp - Control = ", diff)

Observed data:
Click-through-probability  Control : 0.082126
Click-through-probability  Exp : 0.082182
Difference: Exp - Control =  5.6e-05


In [22]:
# Compute the 95% CI for the difference 
# SE of the two-group difference = sqrt( P.pooled * (1-P.pooled) * (1/N.control + 1/N.exp) ) 
p_pooled = (N_clicks_Ctrl+ N_clicks_Exp ) / (N_pageviews_Ctrl+N_pageviews_Exp)
p_pooled

0.08215409089789526

In [23]:
Two_Group_Diff_SE = math.sqrt( p_pooled * (1-p_pooled) * (1/N_pageviews_Ctrl  + 1/N_pageviews_Exp))
Two_Group_Diff_SE

0.0006610608156387222

In [24]:
# 95% CI for the difference of Click-through-probability between Control and Exp.
margin_error = 1.96* Two_Group_Diff_SE
CI95_expected = [diff-margin_error, diff+ margin_error]
CI95_expected = [round(i,4) for i in CI95_expected]
CI95_expected

[-0.0012, 0.0014]

**Sanity Check result for Click-through-probability:**
* The sanity check passes! 


* Since the 95%CI on the difference in Click-through-probability between Control and Experiment, [-0.0012, 0.0014], includes zero, we fail to reject the null hypothesis. That is, we find no evidence that Click-through-probability differs between the Control and Experiment groups, which is consistent with our expectation.

### Summary of Sanity Check

* Count Metric

> **My thinking process for Sign Test (Binomial test) :** 
>> 1. If the metric is a simple count, then I will use Sign Test (Binomial test) to perform the sanity check. 
>> 2. The key is to realize each day is an independent trial and the success can be defined as "Control > Experiment". 
>> 3. The assumption is that the **counts** should be equally and randomly divided into two groups. Therefore, on some days the Control group may have a higher count than the Experiment group, and the chance of this is expected to be 0.5.
>> 4. Then, the "binomial test" checks whether the observed data match this expectation. If the p-value > 0.05, we fail to reject the null hypothesis, i.e. there is no statistical evidence of a difference between the two groups.

> **My thinking process for using Confidence interval (CI) to perform sanity check :** 
>> 1. The assumption is that the invariant metric should be the same between the two groups, so the ratio of Control to Experiment should be close to 1:1; that is, p = (N of Control)/(total) should be about 0.5.
>> 2. The strategy is to obtain the 95% confidence interval (95%CI) on the expected p, given the expected p=0.5 and observed N.
>> 3. The decision rule is that if the observed $\hat{p}$ is in this 95%CI, then the sanity check passes, because it's acceptable to observe such a $\hat{p}$ at the 95% confidence level.

|Metric |Sign-Test p-value |Observed values |95% confidence interval on expected value| Sanity Check Results|
|--|--|--|--|--|
|Number of cookies (simple count)     |p-value 0.324|0.5006 |[0.4988, 0.5012]| Sign Test and 95%CI check are all good|
|Number of clicks (simple count)      |p-value 1.0  |0.5005 |[0.4959, 0.5041]| Sign Test and 95%CI check are all good|


* Probability Metric

> **My thinking process for using Confidence interval to perform sanity check :**     
>> 1. The assumption is that the probability should be the same in two groups.
>> 2. Thus, we can test if the difference between two groups is close to zero.
>> 3. The strategy is to obtain the 95% confidence interval on the difference. By using this method, we need to use pooled probability to compute the SE for difference.
>> 4. The decision rule is that if this 95%CI of the difference includes zero, then the sanity check passes.


    
|Metric |Observed difference |95% confidence interval on difference| Sanity Check Results|
|--|--|--|--|
|Click-through-probability (probability) |0.000056 |[-0.0012, 0.0014]| Good, because the 95%CI of difference includes zero |


# 6. Effect Size Test and Sign Test
Now it's time to use the data to decide whether we should launch this new feature. For each evaluation metric, compute a Confidence Interval (CI) around the difference. Explain whether to use the Bonferroni correction.


**My evaluation metrics:**
1. Gross Conversion = enrollment / click 
2. Net Conversion   = payment / click

**Question: Would I use Bonferroni correction?**
* **ANS:** No, because these two evaluation metrics are highly correlated, so the Bonferroni correction would make the test too conservative.
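For reference, this is how much the critical value would grow if a Bonferroni correction were applied across the two evaluation metrics. It is a small sketch using only the standard library; the helper name `z_critical` is hypothetical and is not used elsewhere in this notebook.

```python
from statistics import NormalDist

def z_critical(alpha, n_metrics=1):
    """Two-sided critical z value, with alpha split evenly across n_metrics (Bonferroni)."""
    return NormalDist().inv_cdf(1 - (alpha / n_metrics) / 2)

print("Uncorrected z:            ", round(z_critical(0.05), 4))
print("Bonferroni z (2 metrics): ", round(z_critical(0.05, 2), 4))
```

Since the margin of error is the critical value times the SE, the corrected confidence intervals would be wider by the ratio of these two values.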

### Check if sample size is enough?

Based on previous analysis, we initially aimed for (n = 685,325).  However, in our actual A/B test dataset, the sample size is lower (n_true = 423,525). That is, we may not have enough power to detect true difference!

In [25]:
df_Ctrl.loc[range(0,23), 'Pageviews'].sum() + df_Exp.loc[range(0,23), 'Pageviews'].sum()

423525.0
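To get a rough feel for how much power is lost with the smaller sample, the sizing formula can be inverted. This is a sketch under stated assumptions: the function name `approx_power` is hypothetical, the formula is a normal approximation, and the number of clicks per group is approximated as the observed page views times the baseline click-through-probability (0.08), split evenly between the two groups.

```python
import math
from statistics import NormalDist

def approx_power(p, dmin, n_per_group, alpha=0.05):
    """Approximate power to detect an absolute change dmin from baseline p."""
    if p > 0.5:
        p = 1.0 - p  # assumption: mirror baselines above 0.5 (conservative side)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + dmin) * (1 - p - dmin))
    z_beta = (dmin * math.sqrt(n_per_group) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z_beta)

# Assumption: clicks per group ~ observed page views * baseline CTP / 2
clicks_per_group = 423525 * 0.08 / 2

power_gc = approx_power(0.20625, 0.01, clicks_per_group)
power_nc = approx_power(0.1093125, 0.0075, clicks_per_group)
print("Approx. power, Gross Conversion:", round(power_gc, 2))
print("Approx. power, Net Conversion:  ", round(power_nc, 2))
```

Under these assumptions both values come out below the 0.8 target, which is consistent with the concern above that a true difference of size dmin might be missed.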

### Two tailed test procedure

In [26]:
def two_tailed_testing( metric_name, alpha, dmin, expected_direction_of_change ,Ctrl_numerator, Ctrl_denominator, Exp_numerator, Exp_denominator):
    import math
    from scipy import stats
    
    Ctrl_p = Ctrl_numerator / Ctrl_denominator
    Exp_p = Exp_numerator / Exp_denominator
    Diff = Exp_p - Ctrl_p
    
    P_pooled = (Ctrl_numerator + Exp_numerator ) / (Ctrl_denominator + Exp_denominator)
    Diff_SE = math.sqrt(P_pooled*(1-P_pooled)*(1/Ctrl_denominator + 1/Exp_denominator)) 
    Z_score = stats.norm.ppf(1-alpha/2)
    margin_error = Diff_SE * Z_score
    Confidence_interval = [Diff - margin_error , Diff + margin_error]
    Confidence_interval = [round(i,5) for i in  Confidence_interval]
    
    #### Result 
    print(metric_name)
    print('Observed Difference: ', round(Diff,5))
    print('95% confidence interval on difference: ', Confidence_interval)

    #### Check statistical significance
    print('\n')
    print('Check Statistical Significance')
    if (0 >= Confidence_interval[0]) and (0 <=Confidence_interval[1]) :
        print('This interval includes zero, so the difference is "NOT" statistically significant.')
    else:
        print('This confidence interval does not includes zero, so the difference is statistically significant.')  
         
    #### Check practical significance - two steps 
    print('\n')
    print('Check Practical Significance')
    
    #### 1st Step: Check if the direction of difference is expected?
    if Diff > 0:
        direction = "positive"
    else:
        direction = "negative"
        
    if direction != expected_direction_of_change:
        print('The direction of the observed difference is "Not" expected, so it is not pratically significant.')
        
    else:
        #### Second step: check whether the confidence interval satisfies the minimal practical difference
        if expected_direction_of_change == 'positive':
            if Confidence_interval[0] > dmin:
                print ("The confidence interval is beyond the minimal practical difference, so it is practically significant.")
            else:
                print ('it is "NOT" practically significant.')
        if expected_direction_of_change == 'negative':
            # dmin is passed as a positive magnitude, so a decrease must lie below -dmin
            if  Confidence_interval[1] < -dmin:
                print ("The confidence interval is beyond the minimal practical difference, so it is practically significant.")
            else:
                print ('it is "NOT" practically significant.')

#### Two tailed test - [1] Gross Conversion

In [27]:
two_tailed_testing(metric_name = 'Gross Conversion',
                   alpha=0.05, 
                   dmin=0.01,
                   expected_direction_of_change= 'negative',
                   Ctrl_numerator  =df_Ctrl.loc[range(0,23), 'Enrollments'].sum(),
                   Ctrl_denominator=df_Ctrl.loc[range(0,23), 'Clicks'     ].sum(), 
                   Exp_numerator    =df_Exp.loc[range(0,23), 'Enrollments'].sum(), 
                   Exp_denominator  =df_Exp.loc[range(0,23), 'Clicks'     ].sum()
                  )

Gross Conversion
Observed Difference:  -0.02055
95% confidence interval on difference:  [-0.02912, -0.01199]


Check Statistical Significance
This confidence interval does not includes zero, so the difference is statistically significant.


Check Practical Significance
The confidence interval is beyond the minimal practical difference, so it is practically significant.


####  Two tailed test - [2] Net Conversion

In [28]:
two_tailed_testing(metric_name = 'Net Conversion',
                   alpha=0.05, 
                   dmin=0.0075,
                   expected_direction_of_change= 'positive',
                   Ctrl_numerator  =df_Ctrl.loc[range(0,23), 'Payments'].sum(),
                   Ctrl_denominator=df_Ctrl.loc[range(0,23), 'Clicks'  ].sum(), 
                   Exp_numerator    =df_Exp.loc[range(0,23), 'Payments'].sum(), 
                   Exp_denominator  =df_Exp.loc[range(0,23), 'Clicks'  ].sum()
                  )

Net Conversion
Observed Difference:  -0.00487
95% confidence interval on difference:  [-0.0116, 0.00186]


Check Statistical Significance
This interval includes zero, so the difference is "NOT" statistically significant.


Check Practical Significance
The direction of the observed difference is "Not" expected, so it is not pratically significant.


### Sign test procedure

In [29]:
############################
#### Sign Test function ####
############################
def SingTest(Trials, Successes, Expected_Prob):
    from scipy import stats
    # stats.binom_test is deprecated and absent from recent SciPy releases; binomtest(...).pvalue replaces it
    p_value_SingTest = stats.binomtest(k=Successes, n=Trials, p=Expected_Prob, alternative='two-sided').pvalue
    
    # Print the function's own arguments rather than the globals N_trials / N_success
    print( "N_trials: ", Trials)
    print( "N_success: ", Successes)
    print( "Sign Test (two-sided) P-value: ", round(p_value_SingTest,5)  )

    if p_value_SingTest < 0.05:
        print('Test Result: Statistically significant.')
    else:
        print('Test Result: Not statistically significant.')  

#### Sign Test - [1] Gross Conversion

In [30]:
N_trials     = len( df_Ctrl.loc[range(0,23), 'Enrollments'] )
Ctrl_P_array = df_Ctrl.loc[range(0,23), 'Enrollments'] / df_Ctrl.loc[range(0,23), 'Clicks']
Exp_P_array  = df_Exp.loc[range(0,23), 'Enrollments'] / df_Exp.loc[range(0,23), 'Clicks']
N_success    = sum( Ctrl_P_array> Exp_P_array)

SingTest(Trials= N_trials,
         Successes= N_success,
         Expected_Prob = 0.5)

N_trials:  23
N_success:  19
Sign Test (two-sided) P-value:  0.0026
Test Result: Statistically significant.


#### Sign Test - [2] Net Conversion 

In [31]:
N_trials = len( df_Ctrl.loc[range(0,23), 'Payments'] )
Ctrl_P_array = df_Ctrl.loc[range(0,23), 'Payments'] / df_Ctrl.loc[range(0,23), 'Clicks']
Exp_P_array =  df_Exp.loc[range(0,23), 'Payments'] / df_Exp.loc[range(0,23), 'Clicks']
N_success = sum( Ctrl_P_array> Exp_P_array)

SingTest(Trials= N_trials,
         Successes= N_success,
         Expected_Prob = 0.5)

N_trials:  23
N_success:  13
Sign Test (two-sided) P-value:  0.67764
Test Result: Not statistically significant.
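
As a cross-check, the two sign test p-values above can be reproduced without SciPy. The sketch below assumes the null probability is 0.5, so the binomial distribution is symmetric and the two-sided p-value is twice the smaller tail. `sign_test_p_value` is a hypothetical helper name.

```python
from math import comb

def sign_test_p_value(n_trials, n_success):
    """Exact two-sided sign test p-value under a null probability of 0.5."""
    # Under p = 0.5 the distribution is symmetric, so double the smaller
    # tail and cap the result at 1.
    k = min(n_success, n_trials - n_success)
    tail = sum(comb(n_trials, i) for i in range(k + 1)) / 2 ** n_trials
    return min(1.0, 2 * tail)

print(round(sign_test_p_value(23, 19), 5))  # 0.0026
print(round(sign_test_p_value(23, 13), 5))  # 0.67764
```

Both values match the outputs above: 19 of 23 days favoring one group is unlikely under chance, while 13 of 23 is not.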


# 7. Results Summary

|Evaluation Metric| Observed Difference| Minimal Practical Significance (dmin)| 95% Confidence Interval on Diff.|Statistical Significance?| Practical Significance?| Sign Test - Statistical Significance?| Final Comment|
|---|---|---|---|---|---|---|---|
|Gross Conversion (enrollment/click)|-0.02055|-0.01|[-0.0291, -0.0120]|Yes |Yes |Yes| Passes all tests|
|Net Conversion (payment/click)|-0.00487|+0.0075|[-0.0116, 0.0019]|No |No, because the observed difference is negative, opposite to the expected direction |No| Fails all tests|

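* Gross conversion passed all of the tests, suggesting that the new feature reduces the share of users who enroll in the free trial after clicking "Start free trial".

* However, net conversion did not pass: its confidence interval [-0.0116, 0.0019] includes zero, so there is no evidence that the new feature increases the probability of payment. The lower bound also falls below -0.0075, so a practically significant decrease cannot be ruled out.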

# 8. Recommendation

* Gross conversion decreased as intended, but net conversion did not pass its tests, so I will not launch this feature.

# 9. Follow-up Experiment

#### This is my idea-1:
> **Hypothesis:** 
> * Users don't have enough support to go through assignments or lectures

> **Experiment:**
> * Create a forum with a clear, dedicated section for each lecture and assignment, so users can easily find the resources they need to get through the course.

> **Expectation and Evaluation Metric**
> * If the hypothesis is true, then we should see a **higher Net Conversion** in Experiment Group, compared to Control Group.

#### This is my idea-2:
> **Hypothesis:** 
> * Users simply lose motivation over time.

> **Experiment:**
> * For users already in the free trial, when they click "cancel" on the enrollment, the system reminds them of the advantages of taking this course. If they still want to cancel, they can drop out of the free trial.

> **Expectation and Evaluation Metric**
> * If the hypothesis is true, then we should see a **higher Net Conversion** in Experiment Group, compared to Control Group.

#### Another good idea based on this post: 
https://zacks.one/udacity-a-b-testing-by-google/#Follow-Up-Experiment-How-to-Reduce-Early-Cancellations

> **Hypothesis:** 
> * Users have not taken enough prerequisite courses.

> **Experiment:**
> * After a user clicks "Start Free Trial", we ask whether they have enough prerequisite knowledge. If not, they are diverted to the prerequisite courses. If yes, they enroll in the 14-day free trial.

> **Expectation and Evaluation Metric**
> * If the hypothesis is true, then we should see a **lower Gross Conversion** but a **higher Net Conversion** in Experiment Group, compared to Control Group.

# 10. Resources
These are solutions by other people who completed this project:
* solution-1: https://nancyyanyu.github.io/posts/8fdfc10f/#more
> Solution Author contact information - https://github.com/nancyyanyu

* solution-2: https://github.com/shubhamlal11/Udacity-AB-Testing-Final-Project 
* solution-3: https://www.kaggle.com/code/mariusmesserschmied/udacity-a-b-testing-final-course-project/notebook
* solution-4: https://zacks.one/udacity-a-b-testing-by-google/#Familywise-Error-Rate-FWER
* solution-5: https://towardsdatascience.com/a-b-testing-part-2-42b94e1fb1dc
* solution-6: https://www.xueni.me/2020-07-05-udacity-ab-testing-final-project/
* solution-7: https://rstudio-pubs-static.s3.amazonaws.com/347758_9da9522d18a8455fb810c48b11ff9824.html#3measure_of_variability
* solution-8: https://medium.com/@zhouyuchen999/a-b-testing-experiment-a-udacity-course-project-f958f7236278