In [80]:
import pandas as pd
import numpy as np

## 1. Load Data

In [97]:
df = pd.read_csv('effect_tb.csv')

In [98]:
df.head(3)

| | 1 | 1.1 | 0 | 1.2 |
|---|---|---|---|---|
| 0 | 1 | 1000004 | 0 | 1 |
| 1 | 1 | 1000004 | 0 | 2 |
| 2 | 1 | 1000006 | 0 | 1 |
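
The column labels in this output (`1`, `1.1`, `0`, `1.2`) look like data values rather than names, which suggests the file has no header row and `read_csv` took the first record as the header. A minimal sketch of the difference, using a made-up three-row sample in place of the real file (the sample rows and the names `sample`, `df_default`, and `df_fixed` are illustrative only):

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless CSV such as effect_tb.csv
sample = "1,1000001,0,1\n1,1000004,0,1\n1,1000004,0,2\n"

# Default behaviour: the first record becomes the header, so one row is lost
df_default = pd.read_csv(io.StringIO(sample))

# header=None keeps every record; names= assigns the column labels directly
df_fixed = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["dt", "user_id", "label", "dmp_id"],
)

print(len(df_default), len(df_fixed))
print(list(df_fixed.columns))
```

If the real file is indeed headerless, loading it this way would keep the first record, and the row counts reported in this notebook would shift by one.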


In [85]:
# rename the columns
df.columns = ["dt","user_id","label","dmp_id"]

In [86]:
# drop dt column
df = df.drop(columns = "dt")

In [95]:
df.head(3)

| | user_id | label | dmp_id |
|---|---|---|---|
| 0 | 1000004 | 0 | 1 |
| 1 | 1000004 | 0 | 2 |
| 2 | 1000006 | 0 | 1 |


## 2. Data Cleaning 

### 2.1 Duplicate Values

In [88]:
df.duplicated().sum()

12983

In [89]:
df = df.drop_duplicates()
df.duplicated().sum()

0

### 2.2 Missing Values

In [90]:
df.isnull().sum()

user_id    0
label      0
dmp_id     0
dtype: int64

There is no missing data in the dataset.

### 2.3 Outliers

In [38]:
cols = ['label','dmp_id']
for col in cols:
    print('Column:{}'.format(col))
    print(df[col].unique())
    print('*'*100)

Column:label
[0 1]
****************************************************************************************************
Column:dmp_id
[1 2 3]
****************************************************************************************************


There are no outliers: `label` takes only the values 0 and 1, and `dmp_id` only the values 1, 2, and 3.

## 3. Check sample size


Before performing the A/B test, check whether the sample size meets the minimum required by the test.

We use Evan Miller's sample size calculator to find the minimum required sample size.

First, set the click-through rate baseline and the minimum increase ratio. We set the click-through rate of the control group as the baseline.

In [57]:
# CTR of the control group
print(str(round(df[df["dmp_id"] == 1]["label"].mean()*100,2)) +'%')

1.26%


<img src="abtest.png">


In [49]:
df["dmp_id"].value_counts()

1    1905662
2     411107
3     316205
Name: dmp_id, dtype: int64

The two experimental groups contain 411,107 and 316,205 users respectively, so both meet the minimum sample size requirement.
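
As a rough cross-check of the calculator, the minimum per-group sample size can be sketched with a standard normal-approximation formula for comparing two proportions. The minimum lift chosen in the screenshot is not reproduced in the text, so `abs_lift=0.01` (one percentage point) is an assumption here, as are the conventional defaults `alpha=0.05` and `power=0.8`; the function name `min_sample_size` is illustrative, and the result may differ somewhat from the tool's output.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size(p_base, abs_lift, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-proportion z-test."""
    p_new = p_base + abs_lift
    p_bar = (p_base + p_new) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return ceil(numerator / abs_lift ** 2)


# Baseline CTR of 1.26% from the control group; abs_lift is an assumed value
n = min_sample_size(p_base=0.0126, abs_lift=0.01)
print(n, n < 316205)
```

Under these assumptions the required size is a few thousand users per group, far below either experimental group.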


## 4. Hypothesis Testing


In [59]:
print('Control Group:',str(round(df[df["dmp_id"] == 1]["label"].mean()*100,2)) +'%')
print("Marketing Strategy 1： " ,str(round(df[df["dmp_id"] == 2]["label"].mean()*100,2)) +'%')
print("Marketing Strategy 2： " ,str(round(df[df["dmp_id"] == 3]["label"].mean()*100,2)) +'%')

Control Group: 1.26%
Marketing Strategy 1：  1.53%
Marketing Strategy 2：  2.62%


Both strategies increase the click-through rate relative to the control group, but to different degrees.

Strategy one raises CTR by about 0.3 percentage points, while strategy two raises it by about 1.4 percentage points. Only strategy two meets our previously set requirement for the minimum increase in click-through rate.
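
The lifts can be worked out from the rounded CTRs printed above. This short sketch reports both the absolute lift (in percentage points) and the relative lift (as a percentage of the control CTR); the dictionary names are illustrative only.

```python
# Rounded CTRs (in percent) as printed above
ctr = {"control": 1.26, "strategy_1": 1.53, "strategy_2": 2.62}

lifts = {}
for name in ("strategy_1", "strategy_2"):
    absolute = ctr[name] - ctr["control"]        # percentage points
    relative = absolute / ctr["control"] * 100   # percent of the control CTR
    lifts[name] = (absolute, relative)
    print(f"{name}: +{absolute:.2f} pp absolute, +{relative:.1f}% relative")
```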

Next, a hypothesis test needs to be performed to see if the increase is significant.

### a. Null and alternative hypotheses
p1: CTR of the control group

p2: CTR of the experimental group with strategy two

* Null hypothesis H0: p1 ≥ p2
* Alternative hypothesis H1: p1 < p2

Perform a one-sided z-test at a significance level of α = 0.05 (95% confidence level).
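
For reference, the pooled two-proportion z statistic computed in the methods below can be written as follows. The symbols $x_i$ and $n_i$ (click count and user count of group $i$) are notation introduced here for illustration, and $\hat{p}$ is the pooled CTR, called `r` in the code:

$$
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

At the boundary of H0 (p1 = p2), $z$ is approximately standard normal, so H0 is rejected when $z$ falls below the lower-tail critical value $z_\alpha$.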

### 4.1 Method One


In [78]:
# Number of users
user_old = len(df[df.dmp_id == 1])  # Control group
user_new = len(df[df.dmp_id == 3])  # Experimental group (strategy two)

# Number of clicks
click_old = len(df[(df.dmp_id == 1) & (df.label == 1)])
click_new = len(df[(df.dmp_id == 3) & (df.label == 1)])

# CTR (click-through rate)
ctr_old = click_old / user_old
ctr_new = click_new / user_new


In [92]:
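# Pooled CTR across both groups
r = (click_old + click_new) / (user_old + user_new)

# Z-score of the pooled two-proportion z-test
z = (ctr_old - ctr_new) / np.sqrt(r * (1 - r) * (1/user_old + 1/user_new))

print("Z-score：", z)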

Z-score： -59.44164223047762


In [94]:
# Check the z-score corresponding to α=0.05

from scipy.stats import norm
z_alpha = norm.ppf(0.05)
z_alpha

-1.6448536269514729

z_alpha = -1.64 and z-score = -59.44. Since the z-score is far below z_alpha, we reject the null hypothesis: the increase in click-through rate under strategy two is statistically significant.
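
The whole procedure of Method One can be condensed into one helper. This is a minimal standard-library sketch, not the notebook's own code: the function name `pooled_z_test` and the click counts are made up for illustration and are not taken from the dataset.

```python
from math import sqrt
from statistics import NormalDist


def pooled_z_test(x_old, n_old, x_new, n_new):
    """One-sided pooled z-test of H0: p_old >= p_new against H1: p_old < p_new."""
    p_old, p_new = x_old / n_old, x_new / n_new
    pooled = (x_old + x_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_old - p_new) / se
    p_value = NormalDist().cdf(z)  # lower-tail probability
    return z, p_value


# Made-up counts for illustration, not taken from the dataset
z, p_value = pooled_z_test(x_old=100, n_old=10_000, x_new=150, n_new=10_000)
z_alpha = NormalDist().inv_cdf(0.05)  # critical value for alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.5f}, reject H0: {z < z_alpha}")
```

Comparing `z` with `z_alpha` and comparing `p_value` with 0.05 are equivalent decision rules for this one-sided test.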


### 4.2 Method Two


In [77]:
import statsmodels.stats.proportion as sp
z_score, p = sp.proportions_ztest([click_old, click_new],[user_old, user_new], alternative = "smaller")
print("Z-score：",z_score,"，p-value：", p)

Z-score： -59.44164223047762 ，p-value： 0.0


The p-value is below 0.05, so the null hypothesis can be rejected.
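
The printed p-value of exactly 0.0 is most likely a floating-point underflow rather than a true zero. The sketch below, using only the standard library, shows the direct tail probability collapsing to 0.0 and estimates the order of magnitude with the usual asymptotic approximation for the lower normal tail; the variable names are illustrative.

```python
from math import log, pi
from statistics import NormalDist

z = -59.44164223047762  # z-score reported above

# The lower-tail probability underflows to exactly 0.0 in double precision
p_direct = NormalDist().cdf(z)

# Asymptotic approximation for z << 0: ln Phi(z) ≈ -z²/2 - ln(-z) - ln(2π)/2
ln_p = -z * z / 2 - log(-z) - 0.5 * log(2 * pi)
log10_p = ln_p / log(10)

print(p_direct)
print(f"log10(p) ≈ {log10_p:.0f}")
```

The true p-value is positive but hundreds of orders of magnitude below the 0.05 threshold, so the conclusion is unaffected.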


## 5. Summary 

In summary, the second strategy significantly increases the click-through rate, roughly doubling it relative to the control group (2.62% vs. 1.26%). The second marketing strategy should therefore be selected for promotion.
