# A/B Testing Case Study on Web Page Design

## Objective 
The objective of this case study is to illustrate the process of conducting A/B testing. An artificial dataset from Kaggle has been utilized for this purpose.

# Part 1. Business Goal and Product Feature Detail

## Defining Test Variations and Choosing the Right Metrics

Surveys, focus groups, Google Analytics, and customer interviews can help identify the areas where the business needs optimization.

Commonly used metrics include the number of visits to the page, session duration, and click-through rate.


This case study employs a sample dataset obtained from Kaggle to exemplify the process of A/B testing. The dataset includes a 'landing_page' column with two distinct categorical values: 'old_page' and 'new_page'. The objective of this study is to apply A/B testing methodology to determine whether the new page design yields a 2 percentage point increase in conversion rate.

Consequently, our primary metric of interest is the conversion rate.

The study comprises two groups: the control group and the treatment group. The control group will be exposed to the old design, while the treatment group will experience the new design.


# Part 2. State Business Hypothesis and Null Hypothesis

Null Hypothesis (H0): There is no effect of the new design on the conversion rate.

Alternative Hypothesis (Ha): There is a change in the conversion rate.
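
Written as a formula, the two hypotheses describe a two-sided test on the conversion rates. The symbols $p_{\text{old}}$ and $p_{\text{new}}$ are notation introduced here for the true conversion rates of the old and new page; they do not appear in the dataset.

$$
H_0: p_{\text{new}} = p_{\text{old}} \qquad \text{versus} \qquad H_a: p_{\text{new}} \neq p_{\text{old}}
$$
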

Our primary metric is the conversion rate.
Our secondary metrics could be revenue, average order value, and click-through rate.
Our guardrail metric could be the bounce rate.

In this particular case, as the dataset in question originates from Kaggle, we will not be evaluating the secondary metrics and guardrail metrics due to the absence of available information.

# Part 3. Choose alpha, Power, Minimum Detectable Effect (MDE), and Calculate Sample Size and Duration for the Test

### Randomization Units

The initial step in determining randomization units involves assessing the presence of a network effect. If a network effect exists, the selection of randomization units needs to be adapted accordingly. Depending on the study's characteristics, potential approaches include geo-based randomization, time-based randomization, network-cluster randomization, or network ego-centric randomization.

In our specific case study, we will assume the absence of a network effect that requires adjustment. Therefore, the randomization unit for the experiment will be the individual user.
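
User-level randomization is often implemented by hashing the user ID, so that the same user always lands in the same group. The sketch below is illustrative only: the function name `assign_group`, the experiment label, and the 50/50 split are assumptions, not a description of how the Kaggle dataset was generated.

```python
import hashlib


def assign_group(user_id, experiment="landing_page_test"):
    """Deterministically map a user to 'control' or 'treatment' (hypothetical helper)."""
    # Salting with the experiment name keeps assignments independent across experiments
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "treatment" if bucket < 50 else "control"


# The split should be close to 50/50 over many users
assignments = [assign_group(uid) for uid in range(10_000)]
share_treatment = assignments.count("treatment") / len(assignments)
print(f"Share in treatment: {share_treatment:.3f}")
```

Because the assignment depends only on the user ID and the experiment label, a returning user sees the same page on every visit.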

### Sample size calculation
Baseline conversion rate, minimum detectable effect, statistical power, and significance level are four important parameters required to estimate sample size.

In this test, a significance level of 0.05 is used, which corresponds to a 95% confidence level. As given in the case study, the minimum detectable effect is 2 percentage points. Statistical power is set to the conventional 80%, which means that if the new design truly changes the conversion rate by at least the minimum detectable effect, the test has an 80% chance of detecting it.

Let's assume the baseline conversion rate is 12%; with an MDE of 2 percentage points, the expected rate is 14%.


In [1]:
# import libraries
import numpy as np
import pandas as pd
import statsmodels.stats.api as sm
from statsmodels.stats.proportion import proportions_ztest, proportion_confint


In [2]:
# Calculate the effect size (Cohen's h) first, then the sample size
# We use a 1:1 ratio of the treatment to control group

base_cr = 0.12
expected_cr = 0.14

effect_size = sm.proportion_effectsize(base_cr, expected_cr)

# Required sample size per group for a two-sided test
sample_size = sm.NormalIndPower().solve_power(effect_size, power=0.8, alpha=0.05, ratio=1)
sample_size = round(sample_size)
sample_size


4432

The test requires a minimum of 4432 samples in each group to detect a 2 percentage point change in conversion rate with 80% power at the 0.05 significance level.
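
As a cross-check, the same figure can be approximated with the closed-form normal-approximation formula, using only the standard library. This is a sketch, not the statsmodels implementation: statsmodels solves the power equation numerically, while the formula below ignores the negligible far tail of the two-sided test. The variable names `h`, `z_alpha`, `z_beta`, and `n_per_group` are introduced here for illustration.

```python
import math
from statistics import NormalDist

base_cr = 0.12
expected_cr = 0.14
alpha = 0.05
power = 0.80

# Cohen's h: difference of arcsine-transformed proportions
h = 2 * math.asin(math.sqrt(expected_cr)) - 2 * math.asin(math.sqrt(base_cr))

# Critical values of the standard normal distribution
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
z_beta = NormalDist().inv_cdf(power)

# Sample size per group for two equally sized groups
n_per_group = 2 * ((z_alpha + z_beta) / h) ** 2
print(round(n_per_group))
```

Rounding to the nearest integer reproduces the 4432 obtained from statsmodels. Rounding up instead would be the more conservative choice.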

### Duration of the Test

To estimate the duration of an A/B test, the first step is to assess the number of daily visitors or users on your website. Next, divide the required sample size by the expected daily traffic or user volume. This calculation will provide an estimate of the test duration necessary to collect the required data. However, it's important to note that the estimated duration is an approximation, as unforeseen factors and fluctuations in traffic patterns can influence the actual test duration. It is advisable to run the test for an adequate period to accommodate any weekly or seasonal variations in user behavior.
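
The calculation described above can be sketched as follows. The daily traffic figure and the share of traffic enrolled in the experiment are hypothetical values chosen for illustration; the dataset does not provide them. The helper names are likewise assumptions.

```python
import math


def estimate_test_duration_days(sample_size_per_group, n_groups, daily_visitors, traffic_share=1.0):
    """Days needed to collect the required sample (hypothetical helper)."""
    total_required = sample_size_per_group * n_groups
    eligible_per_day = daily_visitors * traffic_share
    return math.ceil(total_required / eligible_per_day)


def round_up_to_full_weeks(days):
    """Extend the duration to whole weeks so every weekday is covered equally."""
    return math.ceil(days / 7) * 7


# Assumed values: 1,000 visitors per day, half of them enrolled in the test
days = estimate_test_duration_days(4432, n_groups=2, daily_visitors=1000, traffic_share=0.5)
print(f"Minimum duration: {days} days")
print(f"Rounded to full weeks: {round_up_to_full_weeks(days)} days")
```

Rounding up to whole weeks is one simple way to account for the weekly variation in user behavior mentioned above.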

# Part 3. Data Collection and Exploration

This case study uses an example dataset from Kaggle to demonstrate A/B testing.

The table below shows the first five rows of the dataset.
The dataset has five columns; group, landing_page, and converted are the most relevant for this test.

In [3]:
# Import data
data = pd.read_csv(r'C:\Users\freta\Downloads\archive.zip')
data.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


### Data cleaning:
Any analysis should begin with exploratory data analysis. As part of that, it is good practice to check for null and duplicate values in the dataset and remove them.

In [4]:
# check null value

data.isnull().sum()

# The dataset doesn't have any null values

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [5]:
# check duplicate users
len(data) - data['user_id'].nunique()

3894


The dataset contains 3894 duplicate user records. These arise either from users participating more than once or from the same user appearing in both the control and treatment (experiment) groups. Only the first record for each user is kept.
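
To tell the two kinds of duplicates apart, one option is to count how many distinct groups each repeated user appears in. The sketch below runs on a small made-up DataFrame with the same column names as the Kaggle data; the rows are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows, same column names)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 4],
    "group": ["control", "treatment", "control", "control", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "old_page", "old_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 0, 1, 0],
})

# All rows that belong to users appearing more than once
dup_rows = toy[toy["user_id"].duplicated(keep=False)]

# Number of distinct groups each duplicated user appears in
groups_per_user = dup_rows.groupby("user_id")["group"].nunique()

cross_group_users = groups_per_user[groups_per_user > 1].index.tolist()
same_group_users = groups_per_user[groups_per_user == 1].index.tolist()

print("Users in both groups:", cross_group_users)
print("Users repeated within one group:", same_group_users)

# Stricter alternative: drop every user that appears more than once
strict = toy.drop_duplicates(subset=["user_id"], keep=False)
```

Dropping all records of a repeated user (`keep=False`) is a more conservative alternative to keeping the first record, since a user who appears in both groups may have been exposed to both designs.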

In [6]:
# remove duplicate data, keeping the first record for each user
data_dedup = data.drop_duplicates(subset=['user_id'])

# confirm that no duplicate users remain
len(data_dedup) - data_dedup['user_id'].nunique()

0

### Checking the consistency of user allocation to each group

The dataset consists of two groups: control and treatment. The control group should see the old design, and the experiment/treatment group should see the new design. If there exists any discrepancy, then it should be tackled at this stage.


In [7]:
# Check group
pd.crosstab(data_dedup['group'], data_dedup['landing_page'])

| group | new_page | old_page |
|---|---|---|
| control | 1006 | 144226 |
| treatment | 144314 | 1038 |

The table shows that 1006 users from the control group were exposed to the new design, while 1038 users from the treatment group were exposed to the old design. These mismatched users should be removed before the analysis.

In [8]:
# Remove the users that were exposed to the wrong landing page
treat_mis_index = data_dedup[(data_dedup['group'] == 'treatment') & (data_dedup['landing_page'] == 'old_page')].index
control_mis_index = data_dedup[(data_dedup['group'] == 'control') & (data_dedup['landing_page'] == 'new_page')].index

df_cleaned = data_dedup.drop(treat_mis_index).drop(control_mis_index)

pd.crosstab(df_cleaned['group'], df_cleaned['landing_page'])

| group | new_page | old_page |
|---|---|---|
| control | 0 | 144226 |
| treatment | 144314 | 0 |


The power calculation calls for at least 4432 samples from each group to detect the minimum detectable effect with 80% power. If more data is available, using it would increase the statistical power of the test. For the sake of demonstration, this test uses only 4432 samples from each group.

In [9]:
control = df_cleaned.query("group == 'control'").sample(n=sample_size, random_state=42)
treatment = df_cleaned.query("group == 'treatment'").sample(n=sample_size, random_state=42)

df_ab = pd.concat([control, treatment], axis=0, ignore_index=True)

pd.crosstab(df_ab['group'], df_ab['landing_page'])

| group | new_page | old_page |
|---|---|---|
| control | 0 | 4432 |
| treatment | 4432 | 0 |


# Part 4. Analyze the Results and Draw Conclusions

### Randomization Check
To check randomization, we analyze the distribution of important variables (e.g., age, gender, race, location) across the control and treatment groups. For categorical variables like gender, we compare the proportions within each group. For continuous variables like age, we employ a t-test to compare the mean values between the two groups. However, this case study lacks variables of this nature, so a randomization check cannot be performed.
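
Had such covariates been available, the check could look like the sketch below. Everything in it is an assumption for illustration: the data is synthetic, the `gender` and `age` columns do not exist in the Kaggle dataset, the categorical comparison uses a chi-square test of independence, and the t-test is Welch's variant, which does not assume equal variances.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic users with hypothetical covariates, randomly assigned to groups
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "group": rng.choice(["control", "treatment"], size=n),
    "gender": rng.choice(["female", "male"], size=n),
    "age": rng.normal(loc=35, scale=10, size=n),
})

# Categorical covariate: chi-square test of independence between group and gender
gender_table = pd.crosstab(toy["group"], toy["gender"])
chi2, p_gender, dof, expected = stats.chi2_contingency(gender_table)

# Continuous covariate: Welch's t-test on mean age
age_control = toy.loc[toy["group"] == "control", "age"]
age_treatment = toy.loc[toy["group"] == "treatment", "age"]
t_stat, p_age = stats.ttest_ind(age_control, age_treatment, equal_var=False)

print(f"Gender balance p-value: {p_gender:.3f}")
print(f"Age balance p-value: {p_age:.3f}")
```

Large p-values are consistent with balanced groups, while a small p-value would suggest that the covariate is distributed differently across the groups and that the randomization deserves a closer look.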

### Statistical Test

The choice of an appropriate statistical test in an A/B test relies on specific details such as the type of data, sample size, and experimental design. The two most commonly used statistical tests in A/B testing are:

1. Z-test: This test is employed when comparing proportions or percentages between two groups. It relies on the sampling distribution of the proportion being approximately normal, which holds for large sample sizes.

2. T-test: The t-test is used to compare means between two groups. It can be an independent samples t-test or a paired samples t-test, depending on whether the observations in the groups are independent or paired, respectively. The t-test assumes that the data follows a normal distribution and is robust to moderate departures from normality, particularly for large sample sizes.


In this case study, the appropriate statistical test would be the Z-test since the target metric is the conversion rate, which is a proportion. For the Z-test of proportions, both the control and treatment groups should have sufficient sample sizes. It is recommended that both groups have at least 10 successes and 10 failures as a general rule of thumb. This ensures that the sample sizes are large enough to support the normal approximation for the distribution of proportions.


In [16]:

# Select the relevant data for control and treatment groups
control_results = df_ab.loc[df_ab['group'] == 'control', 'converted']
treatment_results = df_ab.loc[df_ab['group'] == 'treatment', 'converted']

# Calculate the sample sizes
n_control = control_results.shape[0]
n_treatment = treatment_results.shape[0]

# Count the number of conversions
converted_control = control_results.sum()
converted_treatment = treatment_results.sum()

# Create the conversion and sample size lists
conversion_list = [converted_control, converted_treatment]
n_list = [n_control, n_treatment]

# Check if both groups have at least 10 successes and 10 failures

print(f"treatment success: {converted_treatment}")
print(f"treatment failure: {n_treatment - converted_treatment}")
print(f"control success: {converted_control}")
print(f"control failure: {n_control - converted_control}")


treatment success: 536
treatment failure: 3896
control success: 544
control failure: 3888


Since the observed numbers of successes and failures are well above 10 in each group, both groups meet the rule of thumb.

In [17]:
# Perform the z-test
z_stat, p_value = proportions_ztest(conversion_list, n_list)

# Printing the results
print(f"Converted control: {converted_control}/{n_control}")
print(f"Converted treatment: {converted_treatment}/{n_treatment}")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

Converted control: 544/4432
Converted treatment: 536/4432
Z-statistic: 0.2598
P-value: 0.7950
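
As a cross-check, the pooled two-proportion z-statistic can be computed by hand from the counts printed above, using only the standard library. This is a sketch of the textbook formula rather than the statsmodels source; the names `p_pooled`, `se_pooled`, `z_manual`, and `p_manual` are introduced here for illustration.

```python
import math
from statistics import NormalDist

# Counts taken from the output above
converted_control, n_control = 544, 4432
converted_treatment, n_treatment = 536, 4432

p_control = converted_control / n_control
p_treatment = converted_treatment / n_treatment

# Pooled conversion rate under the null hypothesis of equal rates
p_pooled = (converted_control + converted_treatment) / (n_control + n_treatment)

# Standard error of the difference under the null hypothesis
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))

# Same ordering as the lists passed to the test: control minus treatment
z_manual = (p_control - p_treatment) / se_pooled

# Two-sided p-value from the standard normal distribution
p_manual = 2 * (1 - NormalDist().cdf(abs(z_manual)))

print(f"Z-statistic: {z_manual:.4f}")
print(f"P-value: {p_manual:.4f}")
```

The hand calculation agrees with the Z-statistic and P-value reported above to four decimal places.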


### Conclusion

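The p-value of 0.795 is above the significance level of 0.05, so we fail to reject the null hypothesis. The test provides no evidence that the new design changes the conversion rate, let alone raises it by 2 percentage points; in this sample the treatment group actually converted slightly less often than the control group (536 versus 544 conversions out of 4432).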
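
A confidence interval for the difference in conversion rates shows which effect sizes remain plausible. The sketch below computes a Wald interval with the unpooled standard error from the counts reported above; the choice of interval type and the variable names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

# Counts taken from the z-test output
converted_control, n_control = 544, 4432
converted_treatment, n_treatment = 536, 4432
alpha = 0.05

p_control = converted_control / n_control
p_treatment = converted_treatment / n_treatment

# Observed lift of the new page over the old page
diff = p_treatment - p_control

# Unpooled standard error of the difference in proportions
se_unpooled = math.sqrt(
    p_control * (1 - p_control) / n_control
    + p_treatment * (1 - p_treatment) / n_treatment
)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f"Difference (treatment - control): {diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

The interval contains zero, which is consistent with the non-significant z-test, and its upper bound lies below 0.02, so a lift of 2 percentage points is not supported by this sample.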

### Recommendation

Since there is no statistically significant increase in the conversion rate for the new web page, my recommendation is to refrain from implementing it.