# **Assignment Overview**
# (This is the final project of Udacity's A/B testing course)

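On the page for its free 14-day trial, Udacity asks for credit card information and also wants to learn how many hours the student is willing to spend studying. If the answer is below a threshold (5 hours), the page advises the student not to enroll and to use the free videos instead, so that resources are not wasted and the learning success rate is not dragged down.

Our question is whether adding this page produces a statistically significant change (alpha = 0.05, power = 0.8) in Gross Conversion (GC) and Net Conversion (NC), given minimum detectable effects of d = 0.01 for GC and d = 0.0075 for NC.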

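CI = number of clicks (in these definitions CI stands for clicks, not for the confidence intervals computed later)

GC = enrollments / CI (the proportion of students who still enroll after seeing the recommendation)

NC = payments / CI (the proportion of students who pay and stay enrolled after the 14 days)

We expect GC to fall below its baseline while NC does not fall, which would mean resources are saved without reducing revenue.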
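To make the two definitions concrete, here is a minimal sketch that computes GC and NC from raw counts. The function name `conversion_rates` and the payment count are hypothetical and chosen only for illustration; the click and enrollment counts are the baseline values defined in the notebook's baseline cell.

```python
def conversion_rates(clicks, enrollments, payments):
    """Return (gross conversion, net conversion) for one group."""
    if clicks <= 0:
        raise ValueError("clicks must be positive")
    gc = enrollments / clicks  # share of clicks that lead to an enrollment
    nc = payments / clicks     # share of clicks that lead to a payment
    return gc, nc

# clicks and enrollments come from the notebook's baseline;
# payments=350 is a made-up value for illustration.
gc, nc = conversion_rates(clicks=3200, enrollments=660, payments=350)
print(f"GC = {gc:.5f}, NC = {nc:.5f}")
```

Both metrics share the same denominator (clicks), which is why the later cells reuse the click counts for GC and NC.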

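File name: ab-tests-with-python.ipynb

**Assignment Goals**

1.   Learn the steps of an A/B test by working through the sample code
2.   Learn how to compute the minimum sample size
3.   Write your own function for computing a confidence interval

Assignment
Work through the sample code to become familiar with the steps of an A/B test.
Follow the program step by step to understand each stage of the test.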

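# **Assignment: compute the sample size with a function**

# **Assignment:** write your own confidence-interval function for the difference of two proportions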


In [2]:
# Load libraries
import numpy as np
import math as mt
import pandas as pd
from scipy.stats import norm

In [3]:
# Define the baseline values
baseline = {"Cookies": 40000, "Clicks": 3200, "Enrollments": 660, "CTP": 0.08, "GConversion": 0.20625,
           "Retention": 0.53, "NConversion": 0.109313}

In [4]:
# Rescale the counts to a base of Cookies = 5000
baseline["Cookies"] = 5000
baseline["Clicks"] = baseline["Clicks"]*(5000/40000)
baseline["Enrollments"] = baseline["Enrollments"]*(5000/40000)
baseline

{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollments': 82.5,
 'CTP': 0.08,
 'GConversion': 0.20625,
 'Retention': 0.53,
 'NConversion': 0.109313}

In [5]:
# Compute p, n, and the standard deviation (sd) for Gross Conversion (GC)
GC = {}
GC["d_min"] = 0.01
GC["p"] = baseline["GConversion"]
GC["n"] = baseline["Clicks"]
GC["sd"] = round(mt.sqrt((GC["p"] * (1 - GC["p"])) / GC["n"]), 4)
GC

{'d_min': 0.01, 'p': 0.20625, 'n': 400.0, 'sd': 0.0202}

In [6]:
# Compute p, n, and the standard deviation (sd) for Retention (R)
R = {}
R["d_min"] = 0.01
R["p"] = baseline["Retention"]
R["n"] = baseline["Enrollments"]
R["sd"] = round(mt.sqrt((R["p"] * (1 - R["p"])) / R["n"]), 4)
R

{'d_min': 0.01, 'p': 0.53, 'n': 82.5, 'sd': 0.0549}

In [7]:
# Compute p, n, and the standard deviation (sd) for Net Conversion (NC)
NC = {}
NC["d_min"] = 0.0075
NC["p"] = baseline["NConversion"]
NC["n"] = baseline["Clicks"]
NC["sd"] = round(mt.sqrt((NC["p"] * (1 - NC["p"])) / NC["n"]), 4)
NC

{'d_min': 0.0075, 'p': 0.109313, 'n': 400.0, 'sd': 0.0156}
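The three cells above apply the same analytic formula, sd = sqrt(p(1 - p) / n), to different metrics. A minimal sketch of that shared step follows; the helper name `binomial_sd` and the `metrics` dictionary are hypothetical, and the p and n values are copied from the outputs above.

```python
import math

def binomial_sd(p, n):
    """Analytic standard deviation of a proportion p estimated from n units."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)

# (p, n) pairs taken from the GC, R, and NC cells above
metrics = {
    "GC": (0.20625, 400.0),
    "R": (0.53, 82.5),
    "NC": (0.109313, 400.0),
}
sds = {name: round(binomial_sd(p, n), 4) for name, (p, n) in metrics.items()}
print(sds)
```

Retention has the largest standard deviation because its denominator is enrollments (82.5) rather than clicks (400).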

In [21]:
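# Assignment: compute the sample size
# Use a library to solve for the sample size
import statsmodels.stats.api as sms
from math import ceil

# Effect size between the reduced rate (p - d_min) and the baseline rate p
effect_size = sms.proportion_effectsize(GC["p"] - GC["d_min"], GC["p"])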
required_n = sms.NormalIndPower().solve_power(
    effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
)
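# solve_power returns the number of observations in each group; round up
required_n = ceil(required_n)
print("Required sample size:", required_n)

Required sample size: 25231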
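For intuition, the library result can be approximated by hand. The sketch below assumes the effect size is Cohen's h (the arcsine-transformed difference of the two proportions) and that the required size per group follows the usual normal approximation n = 2((z_{1-alpha/2} + z_{power}) / h)^2. The function names are hypothetical, and the standard library's `NormalDist` stands in for `scipy.stats.norm`. It is an approximation of what statsmodels computes, not its actual implementation.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Arcsine-transformed difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def approx_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided test."""
    h = cohens_h(p1, p2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)

# Baseline GC of 0.20625 and a minimum detectable drop of 0.01
n = approx_n_per_group(0.20625 - 0.01, 0.20625)
print(n)
```

The result should land close to the statsmodels figure above; small differences are expected because this sketch ignores the negligible opposite-tail term.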


In [9]:
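def get_z_score(alpha):
    # Standard normal quantile at cumulative probability `alpha`
    # (callers pass 1 - alpha/2 or 1 - beta, not the significance level itself)
    return norm.ppf(alpha)

# Compute the two standard deviations:
# sd1 assumes both groups share p; sd2 assumes rates of p and p + d
def get_sds(p, d):
    sd1 = mt.sqrt(2*p*(1-p))
    sd2 = mt.sqrt(p*(1-p) + (p+d)*(1-(p+d)))
    return [sd1, sd2]

# Find the sample size per group
def get_sampSize(sds, alpha, beta, d):
    n = pow(get_z_score(1-alpha/2)*sds[0] + get_z_score(1-beta)*sds[1], 2) / pow(d, 2)
    return n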
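
Written as a single formula, `get_sampSize` combined with `get_sds` evaluates the expression below. The symbols are chosen here for illustration: $n$ is the sample size per group, $p$ the baseline proportion, $d$ the minimum detectable effect, and $z_q$ the standard normal quantile at cumulative probability $q$.

$$
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2p(1-p)} + z_{1-\beta}\,\sqrt{p(1-p) + (p+d)\bigl(1-(p+d)\bigr)} \right)^{2}}{d^{2}}
$$

The first square root corresponds to `sds[0]` and the second to `sds[1]`.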

In [10]:
GC["d"]=0.01
R["d"]=0.01
NC["d"]=0.0075

In [11]:
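# Sample size from the Gross Conversion perspective
# Per-group sample size in clicks
GC["SampSize"]=round(get_sampSize(get_sds(GC["p"], GC["d"]), 0.05, 0.2, GC["d"]))
# Convert clicks to pageviews (divide by CTP) and cover both groups (times 2)
GC["SampSize"]=round(GC["SampSize"]/baseline["CTP"] *2)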
GC["SampSize"]

645875

In [12]:
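# Sample size from the Retention perspective
# Per-group sample size in enrollments
R["SampSize"]=round(get_sampSize(get_sds(R["p"], R["d"]), 0.05, 0.2, R["d"]))
# Convert enrollments to clicks (divide by GC), then to pageviews (divide by CTP), times 2 groups
R["SampSize"]=round(R["SampSize"]/baseline["GConversion"]/baseline["CTP"] *2)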
R["SampSize"]

4737818

In [13]:
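# Sample size from the Net Conversion perspective
# Per-group sample size in clicks
NC["SampSize"]=round(get_sampSize(get_sds(NC["p"], NC["d"]), 0.05, 0.2, NC["d"]))
# Convert clicks to pageviews (divide by CTP) and cover both groups (times 2)
NC["SampSize"]=round(NC["SampSize"]/baseline["CTP"] *2)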
NC["SampSize"]

685325
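
The three cells above share one conversion step: a per-group sample size, counted in the metric's own unit, is scaled up to total pageviews. The sketch below isolates that step. The function name `pageviews_needed` and its parameters are hypothetical, the factor of 2 is assumed to stand for the control and experiment groups, and the per-group inputs are back-calculated from the pageview totals printed above rather than taken from the notebook directly.

```python
def pageviews_needed(n_per_group, ctp, upstream_rate=1.0, groups=2):
    """Convert a per-group sample size into total pageviews.

    n_per_group   -- required units per group (clicks or enrollments)
    ctp           -- click-through probability (clicks / pageviews)
    upstream_rate -- extra rate linking the unit to clicks
                     (e.g. gross conversion when the unit is enrollments)
    groups        -- number of groups sharing the traffic
    """
    return round(n_per_group / upstream_rate / ctp * groups)

# Per-group inputs inferred from the outputs above (assumption)
gc_views = pageviews_needed(25835, ctp=0.08)
r_views = pageviews_needed(39087, ctp=0.08, upstream_rate=0.20625)
nc_views = pageviews_needed(27413, ctp=0.08)
print(gc_views, r_views, nc_views)
```

Retention needs far more pageviews because each enrollment requires roughly five clicks, and each click requires about twelve pageviews.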

In [14]:
df_control = pd.read_csv("control_data.csv", encoding="utf-8")
df_experiment = pd.read_csv("experiment_data.csv", encoding="utf-8")
df_control.head()

| Unnamed: 0 | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [29]:
# Total pageviews per group
pageviews_cont = df_control["Pageviews"].sum()
pageviews_exp = df_experiment["Pageviews"].sum()
# Count clicks only on rows that also have enrollment data,
# so numerators and denominators cover the same days
clicks_cont = df_control["Clicks"].loc[df_control["Enrollments"].notnull()].sum()
clicks_exp = df_experiment["Clicks"].loc[df_experiment["Enrollments"].notnull()].sum()
# Total enrollments and payments per group (missing values are skipped by sum)
enrollments_cont = df_control["Enrollments"].sum()
enrollments_exp = df_experiment["Enrollments"].sum()
payments_cont = df_control["Payments"].sum()
payments_exp = df_experiment["Payments"].sum()
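
The `notnull()` filter on clicks matters when some rows have clicks but no enrollment figure yet. That is presumably the case here, since `Enrollments` is stored as floats. A minimal sketch with a hypothetical three-day table shows how the conversion rate is understated without the filter.

```python
import pandas as pd

# Hypothetical data: the last day has clicks but no enrollment figure yet
df = pd.DataFrame({
    "Clicks": [100, 120, 90],
    "Enrollments": [20.0, 30.0, None],
})

all_clicks = df["Clicks"].sum()
matched_clicks = df["Clicks"].loc[df["Enrollments"].notnull()].sum()
enrollments = df["Enrollments"].sum()  # NaN is skipped by default

# Denominator includes the day with no enrollment data: rate is too low
print(enrollments / all_clicks)
# Denominator covers only days with enrollment data
print(enrollments / matched_clicks)
```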

In [32]:
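alpha=0.05
# Observed gross conversion in each group
GC_cont = enrollments_cont / clicks_cont
GC_exp = enrollments_exp / clicks_exp
# Pooled proportion and pooled standard error of the difference
GC_pooled = (enrollments_cont + enrollments_exp) / (clicks_cont + clicks_exp)
GC_sd_pooled = mt.sqrt(GC_pooled*(1-GC_pooled)*(1/clicks_cont + 1/clicks_exp))
# Margin of error at the 1 - alpha confidence level, and the observed difference
GC_ME = round(get_z_score(1-alpha/2)*GC_sd_pooled, 4)
GC_diff = round((GC_exp - GC_cont), 4)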
print("The change due to the experiment is ", GC_diff*100, "%")
print("Confidence interval: [",GC_diff-GC_ME,",",GC_diff+GC_ME,"]")
print ("The change is statistically significant if the CI doesn't include 0. In that case, it is practically significant if",-GC["d_min"],"is not in the CI as well.")

The change due to the experiment is  -2.06 %
Confidence interval: [ -0.0292 , -0.012 ]
The change is statistically significant if the CI doesn't include 0. In that case, it is practically significant if -0.01 is not in the CI as well.


In [33]:
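# Observed net conversion in each group
NC_cont = payments_cont / clicks_cont
NC_exp = payments_exp / clicks_exp
# Pooled proportion and pooled standard error of the difference
NC_pooled = (payments_cont + payments_exp) / (clicks_cont + clicks_exp)
NC_sd_pooled = mt.sqrt(NC_pooled*(1-NC_pooled)*(1/clicks_cont + 1/clicks_exp))
# Margin of error at the 1 - alpha confidence level, and the observed difference
NC_ME = round(get_z_score(1-alpha/2)*NC_sd_pooled, 4)
NC_diff = round((NC_exp - NC_cont), 4)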
print("The change due to the experiment is ", NC_diff*100, "%")
print("Confidence interval: [",NC_diff-NC_ME,",",NC_diff+NC_ME,"]")
print ("The change is statistically significant if the CI doesn't include 0. In that case, it is practically significant if",-NC["d_min"],"is not in the CI as well.")

The change due to the experiment is  -0.49 %
Confidence interval: [ -0.0116 , 0.0018000000000000004 ]
The change is statistically significant if the CI doesn't include 0. In that case, it is practically significant if -0.0075 is not in the CI as well.
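
The decision rule printed by the two cells above can be sketched as a small helper. The function name `assess_interval` is hypothetical, and it follows the rule exactly as the notebook states it, checking only the negative bound `-d_min`. The interval endpoints are copied from the printed outputs.

```python
def assess_interval(lower, upper, d_min):
    """Apply the notebook's rule to a confidence interval for a difference.

    Statistically significant: the interval excludes 0.
    Practically significant: it is statistically significant and
    the interval also excludes -d_min.
    """
    stat_sig = not (lower <= 0 <= upper)
    prac_sig = stat_sig and not (lower <= -d_min <= upper)
    return stat_sig, prac_sig

gc_result = assess_interval(-0.0292, -0.012, d_min=0.01)    # GC interval
nc_result = assess_interval(-0.0116, 0.0018, d_min=0.0075)  # NC interval
print("GC:", gc_result)
print("NC:", nc_result)
```

Under this rule the drop in GC is both statistically and practically significant, while the change in NC is not statistically significant because its interval contains 0.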


In [38]:
# Assignment: write your own confidence-interval function for the difference of two proportions
import scipy.stats as stats

def two_proportions_confint(success_a, size_a, success_b, size_b, significance=0.05):
    prop_a = success_a / size_a
    prop_b = success_b / size_b
    sd = np.sqrt(prop_a * (1 - prop_a) / size_a + prop_b * (1 - prop_b) / size_b)
    confidence = 1 - significance
    
    z = stats.norm(loc=0, scale=1).ppf(confidence + significance / 2)
    
    prop_diff = prop_b - prop_a
    confint = prop_diff + np.array([-1, 1]) * z * sd
    return prop_diff, confint

# CI for GConversion
two_proportions_confint(enrollments_cont, clicks_cont, enrollments_exp, clicks_exp, significance=0.05)

(-0.020554874580361565, array([-0.02912016, -0.01198959]))

In [39]:
# CI for NConversion
two_proportions_confint(payments_cont, clicks_cont, payments_exp, clicks_exp, significance=0.05)

(-0.0048737226745441675, array([-0.01160419,  0.00185674]))
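
The intervals from `two_proportions_confint` differ slightly from the earlier cells (for example -0.02912 versus -0.0292 for the lower GC bound). Part of that is rounding, and part is that the earlier cells use a pooled standard error while the function uses an unpooled one. The sketch below compares the two formulas; the function names and the counts are hypothetical.

```python
import math

def pooled_se(x_a, n_a, x_b, n_b):
    """Standard error using one pooled proportion for both groups."""
    p = (x_a + x_b) / (n_a + n_b)
    return math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))

def unpooled_se(x_a, n_a, x_b, n_b):
    """Standard error using each group's own proportion."""
    p_a, p_b = x_a / n_a, x_b / n_b
    return math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# Made-up counts: 200 of 1000 successes in group A, 170 of 1000 in group B
se_pooled = pooled_se(200, 1000, 170, 1000)
se_unpooled = unpooled_se(200, 1000, 170, 1000)
print(se_pooled, se_unpooled)
```

When the two proportions are close and the groups are of similar size, the two standard errors nearly coincide, so either choice leads to the same conclusion here.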