## Data Cleaning and Overview

In [19]:
import numpy as np
import pandas as pd 
from scipy.stats import shapiro
import matplotlib.pyplot as plt


In [3]:
data = pd.read_csv('C:/Users/WEI/Desktop/数分面试与项目/ab_test_results.csv')
data.head()

Unnamed: 0,user_id,group,views,clicks
0,1,control,10.0,0.0
1,2,control,1.0,0.0
2,3,control,1.0,0.0
3,4,control,2.0,0.0
4,5,control,3.0,0.0


In [5]:
data.info() # check dtypes and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   user_id  120000 non-null  int64  
 1   group    120000 non-null  object 
 2   views    120000 non-null  float64
 3   clicks   120000 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 3.7+ MB


In [6]:
data[data.duplicated()] # empty result: no duplicated rows

Unnamed: 0,user_id,group,views,clicks


In [9]:
data['group'].value_counts() # number of rows in the control and test groups

control    60000
test       60000
Name: group, dtype: int64
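
The checks above can be bundled into one helper. This is a minimal sketch, assuming the column names shown in `data.head()`; the function name `basic_checks` and the small example frame are hypothetical and stand in for the real file.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicates and group sizes (hypothetical helper)."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_user_ids": int(df["user_id"].duplicated().sum()),
        "group_sizes": df["group"].value_counts().to_dict(),
    }

# Tiny made-up frame, not the real ab_test_results.csv
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "group": ["control", "control", "test", "test"],
    "views": [10.0, 1.0, 2.0, 3.0],
    "clicks": [0.0, 0.0, 1.0, 0.0],
})
report = basic_checks(toy)
print(report)
```

Checking `user_id` separately is worthwhile because a user who appears twice, or in both groups, would not necessarily show up as a fully duplicated row.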

## Calculate the click rate

Formula: click rate = (total clicks) / (total views)

In [17]:
# column totals of views and clicks for each group
test_group = data.loc[data['group'] == 'test', ['views', 'clicks']].sum()
ctrl_group = data.loc[data['group'] == 'control', ['views', 'clicks']].sum()

In [39]:
test_click_rate = test_group['clicks']/test_group['views']* 100 # click rate percentage
ctrl_click_rate = ctrl_group['clicks']/ctrl_group['views']* 100
print('test_click_rate:' , test_click_rate)
print('ctrl_click_rate' , ctrl_click_rate)

test_click_rate: 3.8504233146113953
ctrl_click_rate 3.4673424332983336
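
The same click rate can also be computed for both groups in one step with `groupby`. This is a sketch on a small made-up frame, assuming the `group`, `views` and `clicks` columns of the dataset; `click_rate_by_group` is a hypothetical helper name.

```python
import pandas as pd

def click_rate_by_group(df: pd.DataFrame) -> pd.Series:
    """Return total clicks / total views per group, in percent."""
    totals = df.groupby("group")[["views", "clicks"]].sum()
    return totals["clicks"] / totals["views"] * 100

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "group": ["control", "control", "test", "test"],
    "views": [10.0, 10.0, 5.0, 15.0],
    "clicks": [1.0, 0.0, 1.0, 1.0],
})
rates = click_rate_by_group(toy)
print(rates)
```

Restricting the sum to the numeric columns avoids summing the `group` strings along with the counts.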


It is easy to see that the test click rate is about 0.38 percentage points higher than the control click rate (3.85% vs. 3.47%). However, this alone does not tell us whether the difference is statistically significant.
So we need a hypothesis test to confirm it.

## How to choose proper statistical methods

In A/B testing, several hypothesis tests are commonly used:

·**Z-test**: used when the sample size is large (normally over 30) and the population standard deviation is known. With a large sample, the sample standard deviation is usually an adequate substitute.

·**t-test**: used when the sample size is small (normally less than 30) or the population standard deviation is unknown.

·**Mann-Whitney U test**: a non-parametric test for whether two samples differ, often interpreted as a comparison of medians. It does not require the data to be normally distributed, so it suits many types of data.

·**Welch's t-test**: similar to the t-test, but it allows the two samples to have different variances.

·**Kolmogorov-Smirnov test**: a non-parametric test that compares whether two samples come from the same distribution function.

·**Chi-square test**: compares whether a categorical variable has the same distribution across groups; suitable for count data and categorical data.

·**McNemar test**: compares the distribution of a categorical variable in related or paired samples.
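
Several of the tests listed above are available in `scipy.stats`. The sketch below calls them on synthetic Poisson counts, which are an assumption standing in for per-user clicks and are not the real data. It covers only a subset of the list (the Z-test and the McNemar test are left out), and the 2x2 table used for the chi-square test (users with at least one click vs. none) is one possible choice of categories.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, skewed samples standing in for per-user clicks (not the real data)
a = rng.poisson(0.20, size=2000)
b = rng.poisson(0.17, size=2000)

student = stats.ttest_ind(a, b, equal_var=True)    # Student's t-test
welch = stats.ttest_ind(a, b, equal_var=False)     # Welch's t-test
mwu = stats.mannwhitneyu(a, b, alternative="two-sided")
ks = stats.ks_2samp(a, b)

# Chi-square test on a 2x2 table: users with at least one click vs. none
table = np.array([
    [(a > 0).sum(), (a == 0).sum()],
    [(b > 0).sum(), (b == 0).sum()],
])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

p_values = {
    "Student t": student.pvalue,
    "Welch t": welch.pvalue,
    "Mann-Whitney U": mwu.pvalue,
    "Kolmogorov-Smirnov": ks.pvalue,
    "Chi-square": chi2_p,
}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```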


In this project, the sample is large (far more than 30 per group) and the standard deviation can be estimated from the data, but we do not know whether the data meet the assumption of normality. So both the **Z-test** and the **Mann-Whitney U test** can be used.

We introduce these two test methods in turn:

In [43]:
# get clicks columns of test and control group
test_data = data.loc[data['group'] == 'test','clicks' ]
ctrl_data = data.loc[data['group'] == 'control','clicks']

## Z-test

**H0**: the mean number of clicks per user is the same in the ctrl group and the test group

**H1**: the mean number of clicks per user differs between the ctrl group and the test group


In [57]:
from scipy import stats
# sample mean and sample standard deviation (ddof=1) of clicks per user
mean1, mean2 = np.mean(test_data), np.mean(ctrl_data)
std1, std2 = np.std(test_data, ddof=1), np.std(ctrl_data, ddof=1)
# ttest_ind runs a two-sample t-test; with 60,000 users per group
# the t statistic is practically identical to a z-score
z_score, p_value = stats.ttest_ind(test_data, ctrl_data, equal_var=True)
# result
print(f"test group mean:{mean1:.4f}, std:{std1:.4f}")
print(f"ctrl group mean:{mean2:.4f}, std:{std2:.4f}")
print(f"z-score:{z_score:.4f}, p-value:{p_value:.4f}")


test group mean:0.1937, std:0.4903
ctrl group mean:0.1717, std:0.4538
z-score:8.0476, p-value:0.0000


The p-value is less than 0.05, so we reject H0: the mean number of clicks per user differs significantly between the test and ctrl groups.
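
The cell above relies on `ttest_ind`, which is a t-test. For comparison, a z statistic can be computed directly as z = (mean1 - mean2) / sqrt(s1²/n1 + s2²/n2), with the p-value taken from the standard normal distribution. The following is a minimal sketch under that formula; `two_sample_z_test` is a hypothetical helper, and the Poisson samples are synthetic stand-ins for the click columns.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z_test(x, y):
    """Two-sided z-test for a difference in means, using sample variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Standard error of the difference in means (variances not pooled)
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    z = (x.mean() - y.mean()) / se
    p = 2 * norm.sf(abs(z))  # two-sided p-value from the normal distribution
    return z, p

rng = np.random.default_rng(42)
x = rng.poisson(0.20, size=5000)  # synthetic "test" clicks
y = rng.poisson(0.17, size=5000)  # synthetic "control" clicks
z, p = two_sample_z_test(x, y)
print(f"z = {z:.4f}, p = {p:.4f}")
```

When both groups have the same size, as here, this statistic coincides with the pooled t statistic, so on large samples the two approaches should lead to the same conclusion.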

## Normality Test

Before using the Mann-Whitney U test, we check whether the data are normally distributed. If they are not, the Mann-Whitney U test is a suitable choice.

**H0**: The data conform to the assumption of normality.

**H1**: The data do not conform to the assumption of normality.

Two common methods for testing normality are used here.

### Shapiro-Wilk Normality Test

In [58]:
# shapiro() returns two values: the statistic W and the p-value.
# SciPy notes that for N > 5000 the p-value may not be accurate.

In [62]:
from scipy.stats import shapiro
W,p_value = shapiro(test_data)
print('test_group')
print("Statistic W:", W)
print("p-value:", p_value)

W, p_value = shapiro(ctrl_data)
print('ctrl_group')
print("Statistic W:", W)
print("p-value:", p_value)

test_group
Statistic W: 0.43864309787750244
p-value: 0.0
ctrl_group
Statistic W: 0.41856634616851807
p-value: 0.0




### Anderson-Darling Normality Test

In [64]:
from scipy.stats import anderson
result = anderson(test_data)
print('test_group')
print("statistic:", result.statistic)
print("critical values:", result.critical_values)
print("significance levels:", result.significance_level)
result = anderson(ctrl_data)
print('ctrl_group')
print("statistic:", result.statistic)
print("critical values:", result.critical_values)
print("significance levels:", result.significance_level)

test_group
statistic: 15253.934731520058
critical values: [0.576 0.656 0.787 0.918 1.092]
significance levels: [15.  10.   5.   2.5  1. ]
ctrl_group
statistic: 16077.556844314357
critical values: [0.576 0.656 0.787 0.918 1.092]
significance levels: [15.  10.   5.   2.5  1. ]


The critical values [0.576, 0.656, 0.787, 0.918, 1.092] correspond to significance levels of 15%, 10%, 5%, 2.5% and 1%. The null hypothesis of normality is rejected at a given level when the Anderson-Darling statistic exceeds the corresponding critical value. Here the statistics (about 15254 for the test group and 16078 for the ctrl group) are far above every critical value.

Combining the results of the two methods, the null hypothesis **can be rejected** for both groups, that is, the data **do not follow a normal distribution**.
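
The comparison between the Anderson-Darling statistic and its critical values can also be done in code. This is a minimal sketch, assuming the `statistic`, `critical_values` and `significance_level` attributes shown in the output above; `rejected_levels` is a hypothetical helper and the Poisson sample is synthetic.

```python
import numpy as np
from scipy.stats import anderson

def rejected_levels(sample):
    """Return the significance levels (in %) at which normality is rejected."""
    result = anderson(sample, dist="norm")
    return [
        float(level)
        for level, critical in zip(result.significance_level, result.critical_values)
        if result.statistic > critical
    ]

rng = np.random.default_rng(1)
skewed = rng.poisson(0.2, size=5000)  # synthetic click-like counts, mostly zeros
levels = rejected_levels(skewed)
print(levels)
```

A list containing every level means normality is rejected even at the strictest level (1%); an empty list means it is not rejected at any of the tabulated levels.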

## Mann-Whitney U Test

Since the data are not normally distributed, we use the Mann-Whitney U test to check whether clicks per user differ significantly between the two groups.

**H0**: the distribution of clicks is the same in ctrl_group and test_group

**H1**: the distribution of clicks differs between ctrl_group and test_group

In [65]:
from scipy.stats import mannwhitneyu
statistic, p_value = mannwhitneyu(test_data,ctrl_data)
print("Statistic:", statistic)
print("p-value:", p_value)

Statistic: 1827378002.0
p-value: 3.362997071095112e-13


Since p < 0.05, we reject H0: clicks per user differ significantly between the test and ctrl groups.
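
With 60,000 users per group, even a small difference can produce a very small p-value, so it helps to look at the size of the effect as well. One common summary is U / (n1 * n2), which estimates the probability that a randomly chosen user from the first sample has more clicks than one from the second (ties counted as one half). The sketch below assumes SciPy 1.7 or later, where the returned statistic is the U value of the first sample; `probability_of_superiority` is a hypothetical helper and the data are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def probability_of_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) from the Mann-Whitney U statistic."""
    result = mannwhitneyu(x, y, alternative="two-sided")
    return result.statistic / (len(x) * len(y))

rng = np.random.default_rng(7)
x = rng.poisson(0.20, size=3000)  # synthetic "test" clicks
y = rng.poisson(0.17, size=3000)  # synthetic "control" clicks
ps = probability_of_superiority(x, y)
print(f"probability of superiority: {ps:.4f}")
```

If the statistic reported above is indeed the U value of the test sample, then 1827378002 / (60000 * 60000) is about 0.5076, only slightly above the 0.5 expected when the two groups do not differ.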