# **[Assignment] How an eCommerce site can use an A/B test to verify that a new page design works**
## **Can an eCommerce UX change boost the conversion rate from 0.11 to 0.12?**
Key concepts:
*   effect size
*   sample size for an A/B test
*   type I error (alpha) = 0.05 and power = 0.8
*   z-score, confidence interval

Data: ab_data.csv from Kaggle

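# **[Assignment objectives]**
1.   Understand the binomial distribution and how to obtain statistical results with the normal approximation
2.   Interpret the result of an A/B test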
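
As a rough illustration of the first objective, the sketch below compares an exact binomial probability with its normal approximation. The visitor count, conversion rate, and cutoff `k` are hypothetical values chosen for the example and do not come from ab_data.csv.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Hypothetical numbers: 200 visitors, each converting with probability 0.12
n, p = 200, 0.12
mean = n * p                  # mean of the binomial distribution
sd = sqrt(n * p * (1 - p))    # standard deviation of the binomial distribution

k = 30
exact = binom_cdf(k, n, p)
# Normal approximation with a continuity correction of +0.5
approx = NormalDist(mean, sd).cdf(k + 0.5)

print(f"exact P(X<={k}) = {exact:.4f}")
print(f"normal approx.  = {approx:.4f}")
```

The two values should be close whenever both `n*p` and `n*(1-p)` are reasonably large, which is why the z-test can be used for conversion rates at the sample sizes in this assignment.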

# **[Key points]**
1.   How to determine the minimum sample size
2.   How to use the z-score, p-value, and confidence interval to judge whether an A/B result is significant

In [4]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from math import ceil

In [10]:
# Reading data
raw_data=pd.read_csv('ab_data.csv')
#show info
raw_data.info()
#show head
raw_data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [12]:
# Check whether user_id is unique (a count of 2 means the user appears twice)
raw_data['user_id'].value_counts()

637561    2
821876    2
643869    2
938802    2
916765    2
         ..
710897    1
708848    1
665839    1
663790    1
630836    1
Name: user_id, Length: 290584, dtype: int64

In [21]:
# Compute the effect size (Cohen's h) from the two conversion rates prob1 and prob2
import statsmodels.stats.api as sms
prob1=0.12
prob2=0.11
effect_size=sms.proportion_effectsize(prob1,prob2)
print('effect size=\t',effect_size)
# Compute the required sample size per group (power=0.8, alpha=0.05, equal group sizes)
reuqired_n=sms.NormalIndPower().solve_power(effect_size=effect_size,power=0.8,alpha=0.05,ratio=1)
reuqired_n=ceil(reuqired_n)
print('required n=\t',reuqired_n)

effect size=	 0.031352702218681694
required n=	 15970
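
To see roughly where these two numbers come from, here is a minimal sketch that recomputes them with the standard library only. It assumes the usual closed-form approximation for a two-sided test, n = 2 * ((z_alpha + z_beta) / h)^2, whereas statsmodels solves the power equation numerically, so tiny differences before rounding are possible. The helper names `cohen_h` and `required_n_per_group` are made up for this example.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist

def cohen_h(p1, p2):
    """Cohen's h: difference of the arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def required_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Closed-form approximation of the per-group sample size (two-sided test)."""
    h = cohen_h(p1, p2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / h) ** 2)

h = cohen_h(0.12, 0.11)
n_per_group = required_n_per_group(0.12, 0.11)
print("effect size =", h)
print("required n  =", n_per_group)
```

Because the effect size enters squared in the denominator, halving the detectable difference roughly quadruples the required sample size.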


In [94]:
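# Drop every user_id that appears more than once (keep=False removes all copies, not just the repeats)
data=raw_data.drop_duplicates(subset=['user_id'],keep=False)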
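
The effect of `keep=False` can be sketched in plain Python: any user who appears more than once is removed entirely, instead of keeping the first or last record. The toy rows below are hypothetical and only illustrate the logic.

```python
from collections import Counter

# Hypothetical toy rows; user 2 appears twice
rows = [
    {"user_id": 1, "group": "control", "converted": 0},
    {"user_id": 2, "group": "treatment", "converted": 1},
    {"user_id": 2, "group": "control", "converted": 0},
    {"user_id": 3, "group": "treatment", "converted": 0},
]

# Count how many times each user_id occurs
counts = Counter(row["user_id"] for row in rows)

# Keep only users that occur exactly once (same idea as keep=False)
kept = [row for row in rows if counts[row["user_id"]] == 1]
kept_ids = [row["user_id"] for row in kept]
print(kept_ids)
```

Dropping such users entirely is one reasonable choice here, since a user recorded twice may have been exposed to both pages and cannot be assigned cleanly to one group.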

In [98]:
# Separate the users into "control" and "treatment"
control=data[data['group']=='control']
treatment=data[data['group']=='treatment']

# Randomly draw reuqired_n users from each group (fixed seed for reproducibility)
control_sample=control.sample(reuqired_n,random_state=22)
treatment_sample=treatment.sample(reuqired_n,random_state=22)

In [89]:
ab_test_data=pd.concat([control_sample,treatment_sample])
ab_test_data=ab_test_data.reset_index(drop=True)
print(ab_test_data)

       user_id                   timestamp      group landing_page  converted
0       763854  2017-01-21 03:43:17.188315    control     old_page          0
1       690555  2017-01-18 06:38:13.079449    control     old_page          0
2       861520  2017-01-06 21:13:40.044766    control     old_page          0
3       630778  2017-01-05 16:42:36.995204    control     old_page          0
4       656634  2017-01-04 15:31:21.676130    control     old_page          0
...        ...                         ...        ...          ...        ...
31935   780954  2017-01-19 07:49:58.295232  treatment     new_page          0
31936   700881  2017-01-13 12:31:37.243352  treatment     new_page          0
31937   829626  2017-01-14 09:07:57.784950  treatment     new_page          0
31938   773197  2017-01-05 00:23:56.177295  treatment     new_page          0
31939   720502  2017-01-15 15:54:12.049241  treatment     new_page          0

[31940 rows x 5 columns]


In [100]:
# Compute z_stat, pval, and the confidence intervals with statsmodels functions
from statsmodels.stats.proportion import proportion_confint, proportions_ztest
control_reuslt=ab_test_data[ab_test_data['group']=='control']['converted']
treatment_result=ab_test_data[ab_test_data['group']=='treatment']['converted']

n_control=control_reuslt.count()
n_treatment=treatment_result.count()

In [102]:
successes=[control_reuslt.sum(),treatment_result.sum()]
nobs=[n_control,n_treatment]
print(successes)
print(nobs)

[1932, 1928]
[15970, 15970]


In [104]:
z_stat,pval=proportions_ztest(count=successes,nobs=nobs)
(lcl_control,lcl_treatment),(ucl_control,ucl_treatment)=proportion_confint(count=successes,nobs=nobs)

In [112]:
print("('z_stat','pval')=({0:.3f},{1:.3f})".format(z_stat,pval))
print("ci 95% for control group: (LCL,UCL)=[{0:.3f},{1:.3f}]".format(lcl_control,ucl_control))
print("ci 95% for treatment group: (LCL,UCL)=[{0:.3f},{1:.3f}]".format(lcl_treatment,ucl_treatment))
print('>>>>>>>> The effect is not significant <<<<<<<<')

('z_stat','pval')=(0.069,0.945)
ci 95% for control group: (LCL,UCL)=[0.116,0.126]
ci 95% for treatment group: (LCL,UCL)=[0.116,0.126]
>>>>>>>> The effect is not significant <<<<<<<<
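
To make the reported numbers easier to interpret, the sketch below redoes the calculation by hand with the standard library. It assumes the defaults of the two statsmodels functions used above: a pooled-variance two-sided z-test and a normal-approximation (Wald) interval. The helper names `two_proportion_ztest` and `wald_ci` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(successes, nobs):
    """Two-sided z-test for equal proportions using the pooled variance."""
    x1, x2 = successes
    n1, n2 = nobs
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                   # conversion rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    pval = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, pval

def wald_ci(x, n, alpha=0.05):
    """Normal-approximation confidence interval for a single proportion."""
    p = x / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Counts taken from the output above
z, pval = two_proportion_ztest([1932, 1928], [15970, 15970])
ci_control = wald_ci(1932, 15970)
ci_treatment = wald_ci(1928, 15970)

alpha = 0.05
print(f"z = {z:.3f}, p-value = {pval:.3f}")
print("significant" if pval < alpha else "not significant")
```

Since the p-value is far above 0.05 and the two confidence intervals overlap almost completely, the data give no evidence that the new page changes the conversion rate.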
