<a href="https://colab.research.google.com/github/basava-999/A-B-Testing-For-Conversion-Rate/blob/main/A_B_Market_Ad_Campaign.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A/B Testing for a UI/UX design to investigate improvement in CRs for Control Vs Experiment Versions**

### **Scenario**

This test compares two versions of a product page being investigated by the Product Team at an **Online E-Commerce Company**. The UI/UX team worked hard on a new version of the product page, hoping it will lead to a higher conversion rate. The Product Manager told us that the **CR** on the current page has averaged about **13%** throughout the year, and that the team would be satisfied with a **2 percentage point lift over the previous CR**, meaning the new **UI/UX** is considered a success if it raises the conversion rate to **15%**.

## **Experiment Design**

**Formulate Hypothesis**

Given we don't know if the new design will perform better or worse (or the same?) as our current design, we'll choose a two-tailed test:

<br>

$$
H_0 : P_o = P_n
$$

$$
H_1 : P_o \neq P_n
$$

<br>

where $P_o$ and $P_n$ stand for the conversion rates of the old and new UI/UX. We also set our **False Positive Rate to 0.05**:

$$
\alpha = 0.05
$$

The $\alpha$ value is a threshold we set, by which we say "if the probability of observing a result as extreme or more ($p$-value) is lower than $\alpha$, then we reject the null hypothesis". Since our $\alpha = 0.05$ (indicating 5% probability), our confidence ($1 - \alpha$) is 95%.

### **Choosing the variables**

**For our test we'll need two groups:**
<br>

- `A control group` - They'll be shown the old design

- `A treatment (or experimental) group` - They'll be shown the new design
<br>

This will be our Independent Variable. The reason we have two groups even though we know the baseline conversion rate is that we want to control for other variables that could have an effect on our results, such as seasonality: by having a `control group` we can directly compare their results to the `treatment group`, because the only systematic difference between the groups is the design of the product page, and we can therefore attribute any differences in results to the designs.<br>
<br>

For our *Dependent Variable* (i.e. what we are trying to measure), we are interested in capturing the **conversion rate**. A way we can code this is by labelling each user session with a binary variable:

- `0` - The user did not buy the product during this user session

- `1` - The user bought the product during this user session
<br>

This way, we can easily calculate the mean for each group to get the conversion rate of each design.
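
A minimal sketch of this coding, using a handful of made-up sessions (the values are illustrative only, not taken from the dataset). Because `converted` is either 0 or 1, the group mean is simply the share of sessions that converted.

```python
import pandas as pd

# Hypothetical toy sessions (illustrative values only, not from the real data)
sessions = pd.DataFrame({
    "group":     ["control", "control", "control", "control",
                  "treatment", "treatment", "treatment", "treatment"],
    "converted": [0, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the conversion rate
conversion_rates = sessions.groupby("group")["converted"].mean()
print(conversion_rates)
```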

### **Choosing Minimum Sample Size**
<br>
It is important to note that since we won't test the whole user base (our population), the conversion rates that we'll get will inevitably be only estimates of the true rates.

The number of people (or user sessions) we decide to capture in each group will have an effect on the precision of our estimated conversion rates: `the larger the sample size`, the more precise our estimates (i.e. the smaller our confidence intervals), `the higher the chance to detect a difference` in the two groups, if present.
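
To make the link between sample size and precision concrete, here is a small sketch that computes the half-width of a 95% normal-approximation (Wald) confidence interval around the 13% baseline for a few sample sizes. The helper name `wald_half_width` and the chosen sample sizes are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def wald_half_width(p, n, alpha=0.05):
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(p * (1 - p) / n)

baseline = 0.13  # baseline conversion rate from the scenario
for n in (100, 1_000, 10_000):
    half_width = wald_half_width(baseline, n)
    print(f"n = {n:>6}: 13% +/- {100 * half_width:.2f} percentage points")
```

The half-width scales with $1/\sqrt{n}$, so a hundred times more sessions gives an interval roughly ten times narrower.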

<br>

*The sample size we need is estimated through something called Power analysis, and it depends on a few factors:*

<br>

- **Power of the test ($1-\beta$)** - This represents the probability of finding a statistically significant difference between the groups in our test when a difference is actually present. A common convention is **0.8**; here we use the stricter **0.9**
- **Alpha value ($\alpha$)** - The significance level we set earlier to **0.05**
- **Effect size** - How big of a difference we expect there to be between the conversion rates

<br>

Since the team would be happy with a difference of **2 percentage points**, we can use **13%** and **15%** to calculate the effect size we expect.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.stats.api as ssa

**Effect size for two sample proportions (Cohen's h)** :

$$ h = 2 \arcsin\left(\sqrt{P_n}\right) - 2 \arcsin\left(\sqrt{P_o}\right) $$
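
As a quick check, Cohen's h can be computed by hand. It is the difference between the two proportions after an arcsine transform of their square roots. The function name `cohens_h` below is just an illustrative helper.

```python
import math

def cohens_h(p_new, p_old):
    """Cohen's h: difference of arcsine-square-root transformed proportions."""
    return 2 * math.asin(math.sqrt(p_new)) - 2 * math.asin(math.sqrt(p_old))

h = cohens_h(0.15, 0.13)
print(f"Cohen's h for 13% -> 15%: {h:.4f}")
```

By Cohen's rule of thumb, values of h below 0.2 count as small effects, which is why detecting this lift requires thousands of sessions per group.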

**Power (for group sizes $n_1$, $n_2$ and conversion rates $p_1$, $p_2$)** :

$$
\text{Power} = P\left( Z > Z_{1-\alpha/2} - \frac{|\Delta p|}{SE} \right)
$$  

$$
\Delta p = p_1 - p_2
$$  

$$
SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}
$$  

$$
\text{Power} = 1 - \beta
$$
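
The power formula can be evaluated directly. The sketch below is an approximation that ignores the negligible probability of rejecting in the wrong tail, and it uses the raw difference in proportions rather than Cohen's h, so its values may differ slightly from what statsmodels reports. The function name `approx_power` and the sample sizes tried are assumptions for illustration.

```python
import math
from statistics import NormalDist

def approx_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (main tail only)."""
    std_normal = NormalDist()
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # P(Z > z_crit - |delta_p| / SE) = 1 - Phi(z_crit - |delta_p| / SE)
    return 1 - std_normal.cdf(z_crit - abs(p1 - p2) / se)

for n in (1_000, 3_000, 6_000, 9_000):
    print(f"n per group = {n:>5}: power ~ {approx_power(0.15, 0.13, n, n):.3f}")
```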


In [2]:

effect_size = ssa.proportion_effectsize( .15, .13)                      # minimum detectable effect size (Cohen's h)

sample_size = ssa.NormalIndPower().solve_power(
    effect_size = effect_size,
    power = 0.9,
    alpha = 0.05,
    ratio = 1
)

sample_size = np.ceil( sample_size ).astype( int )                      # sample size for each group

print( 'Effect Size: {}'.format( effect_size ) )
print( 'Sample Size: {}'.format( sample_size ) )

Effect Size: 0.0576728617308947
Sample Size: 6319



We'd need **at least 6319 observations for each group**.

Having set the power parameter to `0.9` in practice means that if there exists an actual difference in conversion rate between our designs, assuming the difference is the one we estimated **(13% vs. 15%)**, we have about **90%** chance to detect it as statistically significant in our test with the sample size we calculated.
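
One way to see what this 90% means is a small Monte Carlo sketch: simulate many experiments in which the true rates really are 13% and 15%, run a pooled two-proportion z-test on each, and count how often the test rejects $H_0$. The seed and the number of simulations are arbitrary choices, and the z-test is written out by hand here rather than taken from statsmodels.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n, p_old, p_new = 6319, 0.13, 0.15      # per-group size and the assumed true rates
n_sims = 5_000                          # number of simulated experiments

# Simulate conversion counts for both groups in every experiment
conv_old = rng.binomial(n, p_old, size=n_sims)
conv_new = rng.binomial(n, p_new, size=n_sims)

# Pooled two-proportion z statistic for each simulated experiment
pooled = (conv_old + conv_new) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (conv_new - conv_old) / n / se

z_crit = NormalDist().inv_cdf(0.975)    # two-sided test at alpha = 0.05
detected = np.mean(np.abs(z) > z_crit)
print(f"Share of simulated experiments that reject H0: {detected:.3f}")
```

The printed share should land close to 0.9, with some simulation noise.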



# **Data Collection**

So now that we have our required sample size, we need to collect the data. Usually at this point you would work with your team to set up the experiment, likely with the help of the Engineering team, and make sure that you collect enough data based on the sample size needed.

<br>

However, since we'll use a dataset that we found online, in order to simulate this situation we'll:
<br>

1. Download the dataset from https://www.kaggle.com/datasets/zhangluyuan/ab-testing
2. Read the data into a pandas DataFrame
3. Check and clean the data as needed
4. Randomly sample **n=6319** rows from the DataFrame for each group
<br>
**Note**: Generally, we would not need to perform step 4; this is just for the sake of the exercise.

In [3]:

data = pd.read_csv( '/content/ab_data.csv' )
data.sample(5)

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 66609 | 929300 | 2017-01-17 06:50:35.348516 | control | old_page | 0 |
| 99336 | 741263 | 2017-01-06 03:10:05.177089 | treatment | new_page | 0 |
| 223566 | 692444 | 2017-01-23 11:38:10.962384 | control | old_page | 1 |
| 230168 | 735846 | 2017-01-24 07:41:23.421190 | treatment | new_page | 0 |
| 204243 | 918108 | 2017-01-13 07:16:42.025496 | treatment | new_page | 1 |


In [6]:
data['timestamp'] = pd.to_datetime( data['timestamp'] )
data.set_index( 'timestamp',inplace = True )
data.head()

| timestamp | user_id | group | landing_page | converted |
|---|---|---|---|---|
| 2017-01-21 22:11:48.556739 | 851104 | control | old_page | 0 |
| 2017-01-12 08:01:45.159739 | 804228 | control | old_page | 0 |
| 2017-01-11 16:55:06.154213 | 661590 | treatment | new_page | 0 |
| 2017-01-08 18:28:03.143765 | 853541 | treatment | new_page | 0 |
| 2017-01-21 01:52:26.210827 | 864975 | control | old_page | 1 |


In [7]:

data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294478 entries, 2017-01-21 22:11:48.556739 to 2017-01-16 12:40:24.467417
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   group         294478 non-null  object
 2   landing_page  294478 non-null  object
 3   converted     294478 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 11.2+ MB


In [9]:

cross_tab = pd.crosstab(
    data['group'],
    data['landing_page']
)

cross_tab

| group / landing_page | new_page | old_page |
|---|---|---|
| control | 1928 | 145274 |
| treatment | 145311 | 1965 |


There are **294478 rows** in the DataFrame, each representing a user session, as well as **4 columns** :

<br>

1. **user_id** - The user ID of each session
2. **group** - Which group the user was assigned to for that session `{control, treatment}`
3. **landing_page** - Which design each user saw on that session `{old_page, new_page}`
4. **converted** - Whether the session ended in a conversion or not `(binary, 0=not converted, 1=converted)`

<br>

We'll actually only use the group and converted columns for the analysis.

Before we go ahead and sample the data to get our subset, let's make sure there are no users that have been sampled multiple times.

In [13]:

if data.user_id.nunique() != len(data):
  print( 'Multiple Entries Count : ', len(data) - data.user_id.nunique() )


Multiple Entries Count :  3894


There are 3894 more rows than unique users, so some users appear multiple times in the dataset. We keep only the first session of each user.

In [16]:

data.drop_duplicates( subset = 'user_id', inplace = True )


if data.user_id.nunique() != len(data):
  print( 'Multiple Entries Count : ', len(data) - data.user_id.nunique() )

else:
  print( 'No Duplicates Found' )


No Duplicates Found
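
`drop_duplicates( subset = 'user_id' )` keeps the first row of each repeated user (pandas' default `keep='first'`). A stricter alternative is `keep=False`, which removes every row of a repeated user. The sketch below shows the difference on a few made-up rows. Which option is appropriate is a judgment call, and this notebook uses the default.

```python
import pandas as pd

# Hypothetical rows: user 2 appears twice, once in each group
toy = pd.DataFrame({
    "user_id":   [1, 2, 2, 3],
    "group":     ["control", "control", "treatment", "treatment"],
    "converted": [0, 1, 0, 1],
})

keep_first = toy.drop_duplicates(subset="user_id")              # default keep='first'
drop_all = toy.drop_duplicates(subset="user_id", keep=False)    # drop repeated users entirely

print("keep='first':", keep_first["user_id"].tolist())
print("keep=False  :", drop_all["user_id"].tolist())
```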


In [17]:


cross_tab = pd.crosstab(
    data['group'],
    data['landing_page']
)

cross_tab

| group / landing_page | new_page | old_page |
|---|---|---|
| control | 1006 | 144226 |
| treatment | 144314 | 1038 |


In [23]:

miss_exposure = cross_tab.iloc[ 0, 0] + cross_tab.iloc[ 1, 1]

miss_exposure_P = miss_exposure * 100 / cross_tab.sum().sum()

if miss_exposure_P > 5:
  print( f'Large Mis-Exposure : { np.round(miss_exposure_P, 4)} %')

else :
  print( f'Small Mis-Exposure : { np.round( miss_exposure_P, 4 )} %')

Small Mis-Exposure : 0.7034 %
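
The mis-exposed sessions are the off-diagonal cells of the crosstab: control users who saw `new_page` and treatment users who saw `old_page`. The analysis here keeps them because their share is small. If one preferred to exclude them instead, a sketch on made-up rows could look like the following. The `expected_page` mapping is an assumption based on the group definitions given earlier.

```python
import pandas as pd

# Hypothetical rows: the second and fourth sessions saw the wrong page
toy = pd.DataFrame({
    "group":        ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted":    [0, 1, 1, 0],
})

# Page each group is supposed to see (assumed from the experiment design)
expected_page = {"control": "old_page", "treatment": "new_page"}

# True where the page shown matches the page the group should have seen
is_consistent = toy["group"].map(expected_page) == toy["landing_page"]
consistent = toy[is_consistent]

print(f"Mis-exposed share: {100 * (1 - is_consistent.mean()):.1f} %")
print(consistent)
```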


## **Sampling**

Now we can proceed and sample **n=6319** entries for each of the groups. We can use pandas' **DataFrame.sample()** method to do this, which performs simple random sampling for us.

In [25]:


# Draw a simple random sample of equal size from each group (fixed seed for reproducibility)
control_group = data[ data['group'] == 'control' ].sample( n = 6319, random_state = 9 )
treatment_group = data[ data['group'] == 'treatment' ].sample( n = 6319, random_state = 9 )
data_downsampled = pd.concat( [ control_group, treatment_group ] )
data_downsampled.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12638 entries, 2017-01-11 04:49:36.225453 to 2017-01-13 11:44:12.722206
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       12638 non-null  int64 
 1   group         12638 non-null  object
 2   landing_page  12638 non-null  object
 3   converted     12638 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 493.7+ KB


In [26]:

data_downsampled.group.value_counts()

| group | count |
|---|---|
| control | 6319 |
| treatment | 6319 |


### **Basic Stats**

In [27]:

data_downsampled.describe()

| | user_id | converted |
|---|---|---|
| count | 12638.0 | 12638.0 |
| mean | 789141.139816 | 0.123754 |
| std | 91162.729696 | 0.329313 |
| min | 630003.0 | 0.0 |
| 25% | 711038.25 | 0.0 |
| 50% | 789852.5 | 0.0 |
| 75% | 868095.5 | 0.0 |
| max | 945999.0 | 1.0 |


In [33]:

grouped = data_downsampled.groupby( 'group' )['converted']
grouped.describe()

| group | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| control | 6319.0 | 0.12502 | 0.330768 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| treatment | 6319.0 | 0.122488 | 0.327874 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


<br>

Judging by the stats above, it does look like our two designs performed very similarly, with our new design performing slightly worse, approx. **12.5% vs. 12.2%** conversion rate.

<br>

In [34]:

data.groupby( 'group' )['converted'].mean()

| group | converted |
|---|---|
| control | 0.120297 |
| treatment | 0.118843 |



# **Testing the hypothesis**

<br>

The final step of our analysis is testing our hypothesis. Since we have a very large sample, we can use the normal approximation for calculating our $p$-value **(i.e. z-test)**.

Again, Python makes all the calculations very easy. We can use the statsmodels.stats.proportion module to get the $p$-value and confidence intervals:

In [35]:

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

In [36]:

control_results = data_downsampled[ data_downsampled['group'] == 'control' ]['converted']
treatment_results = data_downsampled[ data_downsampled['group'] == 'treatment' ]['converted']

In [41]:

n_con = control_results.count()
n_treat = treatment_results.count()

success_con = control_results.sum()
success_treat = treatment_results.sum()

successes = [ success_con, success_treat ]
totals    = [ n_con, n_treat ]

z_stat, p_value = proportions_ztest( successes, totals )

(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint( successes, totals, alpha = 0.05 )

print( f'Z-Statistic               : { np.round( z_stat, 4 ) }' )
print( f'P-Value                   : { np.round( p_value, 4 ) }\n' )

if p_value < 0.05:
    print( 'Reject null hypothesis' )
    print( 'There is a significant difference between the two designs' )
else:
    print( 'Fail to reject null hypothesis' )
    print( 'There is no significant difference between the two designs' )

print( f'\nCI 95% for control group  : [{ np.round( lower_con, 4 ) }, { np.round( upper_con, 4 ) }]' )
print( f'CI 95% for treatment group: [{ np.round( lower_treat, 4 ) }, { np.round( upper_treat, 4 ) }]' )


Z-Statistic               : 0.4322
P-Value                   : 0.6656

Fail to reject null hypothesis
There is no significant difference between the two designs

CI 95% for control group  : [0.1169, 0.1332]
CI 95% for treatment group: [0.1144, 0.1306]
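
The z statistic and $p$-value above can be reproduced by hand with the pooled two-proportion formula. In this sketch the conversion counts (790 and 774) are back-calculated from the group means shown in the Basic Stats section (0.12502 and 0.122488 of 6319 sessions), so treat them as an assumption rather than values read from the data.

```python
import math
from statistics import NormalDist

n_con = n_treat = 6319
# Conversions back-calculated from the reported group means (an assumption)
success_con, success_treat = 790, 774

p_con, p_treat = success_con / n_con, success_treat / n_treat
p_pool = (success_con + success_treat) / (n_con + n_treat)

# Standard error under H0, where both groups share the pooled rate
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_con + 1 / n_treat))
z_stat = (p_con - p_treat) / se_pool

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"z = {z_stat:.4f}, p = {p_value:.4f}")
```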


## **Conclusion**


Since our **$p$-value = 0.6656** is way above our $\alpha = 0.05$, we cannot reject the null hypothesis $H_0$, which means that our new design did not perform significantly differently (let alone better) than our old one.

<br>

Additionally, if we look at the confidence interval for the treatment group **([0.1144, 0.1306], i.e. 11.44% - 13.06%)** we notice that:

- It includes our baseline value of **13%** conversion rate
- It does not include our target value of **15% (the 2% uplift we were aiming for)**

<br>

This suggests that the true conversion rate of the new design is more likely to be similar to our baseline, or lower, than to reach the **15%** target we had hoped for. This is further evidence that our new design is not likely to be an improvement on our old design.
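
A complementary check is a confidence interval for the difference between the two groups, compared against the 2 percentage point lift the team wanted. This is a sketch using the unpooled normal approximation. The conversion counts are back-calculated from the reported group means (0.12502 and 0.122488 of 6319 sessions), so they are an assumption rather than values read from the data.

```python
import math
from statistics import NormalDist

n = 6319
# Conversions back-calculated from the reported group means (an assumption)
success_con, success_treat = 790, 774
p_con, p_treat = success_con / n, success_treat / n

diff = p_treat - p_con                      # new design minus old design
# Unpooled standard error of the difference between two proportions
se_diff = math.sqrt(p_con * (1 - p_con) / n + p_treat * (1 - p_treat) / n)
z_crit = NormalDist().inv_cdf(0.975)        # 95% two-sided interval

lower, upper = diff - z_crit * se_diff, diff + z_crit * se_diff
print(f"Difference: {diff:.4f}, 95% CI: [{lower:.4f}, {upper:.4f}]")
print("Contains the +0.02 target lift:", lower <= 0.02 <= upper)
```

The interval contains zero but not +0.02, which is consistent with the reading above: the sample gives no sign of the hoped-for lift.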