# A/B Testing Conversion Rate Optimization

In this notebook we’ll go over the process of analysing an A/B experiment, from formulating a hypothesis and testing it to interpreting the results.

**Potential scenario**: Let’s imagine you work on the product team at a medium-sized e-commerce business. The UX designer worked really hard on a new version of the product page, with the hope that it will lead to a higher conversion rate. The product manager (PM) told you that the current conversion rate is about 13% on average throughout the year, and that the team would be happy with an increase of 2 percentage points, meaning that the new design will be considered a success if it raises the conversion rate to 15%.

Before rolling out the change, the team would be more comfortable testing it on a small number of users to see how it performs, so you suggest running an A/B test on a subset of your user base.

## Designing our experiment

**Formulating a hypothesis**: Given that we don’t know whether the new design will perform better, worse, or the same as our current design, we’ll choose a two-tailed test, where $p$ and $p_0$ denote the conversion rates of the new and old designs, respectively:
$$
H_0: p = p_0
$$
$$
H_1: p\neq p_0
$$
We’ll also set a confidence level of 95%, which corresponds to a significance level of:
$$
\alpha = 0.05
$$

The $\alpha$ value is a threshold we set: if the probability of observing a result at least as extreme as ours under the Null hypothesis (the p-value) is lower than $\alpha$, we reject the Null hypothesis.
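
As a rough sketch of this decision rule, assuming the test statistic is approximately standard normal (as in the z-test used later in this notebook), the p-value and the comparison against $\alpha$ could look like the snippet below. The function names and the z values are made up for illustration and are not part of the analysis.

```python
from statistics import NormalDist

ALPHA = 0.05  # significance level chosen above


def two_tailed_p_value(z_stat: float) -> float:
    """Probability of a standard normal value at least as extreme as z_stat."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


def reject_null(z_stat: float, alpha: float = ALPHA) -> bool:
    """Apply the decision rule: reject H0 when the p-value is below alpha."""
    return two_tailed_p_value(z_stat) < alpha


# Hypothetical z statistics, for illustration only
for z in (0.5, 2.5):
    print(f'z = {z}: p-value = {two_tailed_p_value(z):.3f}, reject H0: {reject_null(z)}')
```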

## Choosing the variables
For our test we’ll need two groups:
- A control group - They'll be shown the old design
- A treatment (or experimental) group - They'll be shown the new design

The group a user is assigned to will be our Independent Variable. The reason we have two groups even though we know the baseline conversion rate is that we want to control for other variables that could have an effect on our results, such as seasonality.

For our Dependent Variable (i.e. what we are trying to measure), we are interested in capturing the conversion rate. One way to do this is to code each user session with a binary variable:

- 0 - The user did not buy the product during this user session
- 1 - The user bought the product during this user session
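
With this coding, the conversion rate of a group is simply the mean of the binary variable. A tiny sketch with made-up session outcomes (the list below is hypothetical, not taken from any dataset):

```python
# Hypothetical session outcomes: 1 = bought, 0 = did not buy
sessions = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# The mean of a 0/1 variable is the share of sessions that converted
conversion_rate = sum(sessions) / len(sessions)
print(f'conversion rate: {conversion_rate:.1%}')
```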

## Choosing a sample size

It is important to note that since we won’t test the whole user base (our population), the conversion rates that we’ll get will inevitably be only estimates of the true rates.

**The larger the sample size**, the more precise our estimates (i.e. the smaller our confidence intervals), and **the higher the chance of detecting a difference between the two groups, if one exists**.

On the other hand, the larger our sample gets, the more expensive (and impractical) our study becomes.
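
To make the precision argument concrete, here is a small sketch of how the half-width of a normal-approximation 95% confidence interval around the 13% baseline shrinks as the sample grows. The helper name `ci_half_width` and the sample sizes are arbitrary choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


baseline = 0.13  # baseline conversion rate from the scenario
for n in (100, 1_000, 10_000, 100_000):  # arbitrary illustrative sample sizes
    print(f'n = {n:>7}: 13% +/- {ci_half_width(baseline, n):.2%}')
```

Because the half-width scales with $1/\sqrt{n}$, a sample 100 times larger only narrows the interval by a factor of 10, which is why precision gets expensive quickly.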

The sample size we need is estimated through something called Power analysis, and it depends on a few factors:

- **Power of the test** $(1 - \beta)$: This represents the probability of finding a statistically significant difference between the groups in our test when a difference is actually present. This is usually set at 0.8 by convention
- **Alpha value** $(\alpha)$: The significance level we set earlier to 0.05
- **Effect size**: How big a difference we expect there to be between the conversion rates

Since our team would be happy with a difference of 2 percentage points, we can use 13% and 15% to calculate the effect size we expect.
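
For two proportions, the effect size is commonly expressed as Cohen's h, $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, and to our understanding this is what `proportion_effectsize` computes. A minimal standard-library sketch of that formula (the function name `cohens_h` is ours):

```python
from math import asin, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


h = cohens_h(0.13, 0.15)
print(f"Cohen's h: {h:.4f}")  # negative because the first rate is the smaller one
```

The sign only reflects the order of the two rates; for a two-tailed test it is the magnitude that matters.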

In [None]:
# Packages imports
import pandas as pd
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

%matplotlib inline

effect_size = sms.proportion_effectsize(0.13, 0.15)    # Calculating effect size based on our expected rates

print(f'effect size: {effect_size}')

required_n = sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, 
    alpha=0.05, 
    ratio=1
    )                                                  # Calculating sample size needed

required_n = ceil(required_n)                          # Rounding up to next whole number                          
print(f'required n: {required_n}')

We’d need **at least 4720 observations for each group.**

Having set the power parameter to 0.8 means in practice that if there is an actual difference in conversion rate between our designs, and that difference is the one we assumed (13% vs. 15%), we have about an 80% chance of detecting it as statistically significant with the sample size we calculated.
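
As a sanity check, the required sample size can be approximated by hand with the textbook formula $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$ for two equally sized groups. This is a sketch under the normal approximation that ignores the tiny chance of rejecting in the wrong direction, so it is not necessarily identical to what `solve_power` does internally.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

alpha, power = 0.05, 0.8
h = 2 * asin(sqrt(0.13)) - 2 * asin(sqrt(0.15))  # Cohen's h for 13% vs. 15%

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
z_power = NormalDist().inv_cdf(power)          # quantile matching the target power

# Approximate per-group sample size for two equally sized groups
n_per_group = ceil(2 * ((z_alpha + z_power) / h) ** 2)
print(f'approximate n per group: {n_per_group}')
```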

## Collecting and preparing the data

Since we’ll use a dataset that we found online rather than running a live experiment, we’ll simulate this situation as follows:

1. Load the dataset
2. Read the data into a pandas DataFrame
3. Check and clean the data as needed
4. Randomly sample $n=4720$ rows from the DataFrame for each group

**Note**: Normally, we would not need to perform step 4; it is included only for the sake of the exercise.

In [None]:
df = pd.read_csv('ab_data.csv')
df.head()

In [None]:
df.info()

In [None]:
# Check that every control user saw the old page and every treatment user saw the new page
pd.crosstab(df['group'], df['landing_page'])

There are **294478 rows** in the DataFrame, each representing a user session, as well as **5 columns**:

- user_id - The user ID of each session
- timestamp - Timestamp for the session
- group - Which group the user was assigned to for that session {control, treatment}
- landing_page - Which design each user saw on that session {old_page, new_page}
- converted - Whether the session ended in a conversion or not (binary, 0=not converted, 1=converted)

Before we go ahead and sample the data to get our subset, let’s make sure there are no users that have been sampled multiple times.

In [None]:
session_counts = df['user_id'].value_counts(ascending=False)
multi_users = session_counts[session_counts > 1].count()

print(f'There are {multi_users} users that appear multiple times in the dataset')

There are, in fact, 3894 users that appear more than once. Since this is a small number relative to the size of the dataset, we’ll go ahead and remove them from the DataFrame to avoid sampling the same users twice.

In [None]:
users_to_drop = session_counts[session_counts > 1].index

df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

## Sampling

Now that our DataFrame is nice and clean, we can proceed and sample $n=4720$ entries for each of the groups. We can use pandas’ `DataFrame.sample()` method to do this, which will perform Simple Random Sampling for us.

Note: We’ve set `random_state=22` so that the results are reproducible.

In [None]:
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)

ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)
ab_test

In [None]:
ab_test.info()

In [None]:
ab_test['group'].value_counts()

## Visualising the results

The first thing we can do is to calculate some basic statistics to get an idea of what our samples look like.

In [None]:
conversion_rates = ab_test.groupby('group')['converted'].agg(['mean', 'std', 'sem'])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']
conversion_rates.style.format('{:.3f}')

Judging by the stats above, it does look like **our two designs performed very similarly**, with our new design performing slightly better, approx. **12.3% vs. 12.6% conversion rate**.
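
For a binary variable, the standard deviation and standard error follow directly from the conversion rate: $\sqrt{p(1-p)}$ and $\sqrt{p(1-p)/n}$. The sketch below plugs in the approximate control rate quoted above. Note that pandas' `std` uses the sample formula (`ddof=1`) by default, so its output differs very slightly from this population version.

```python
from math import sqrt

p = 0.123  # approximate control conversion rate quoted above
n = 4720   # sessions per group

std_deviation = sqrt(p * (1 - p))    # population formula for a 0/1 variable
std_error = std_deviation / sqrt(n)  # uncertainty of the estimated rate

print(f'std deviation: {std_deviation:.3f}')
print(f'std error:     {std_error:.4f}')
```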

In [None]:
# Plotting the data will make these results easier to grasp:
plt.figure(figsize=(8,6))
sns.barplot(x='group', y='converted', data=ab_test, errorbar=('ci', 95))
plt.title('Conversion rate by group')
plt.show()

The conversion rates for our groups are indeed very close. Also note that the conversion rate of the control group is lower than what we would have expected given what we knew about our avg. conversion rate (12.3% vs. 13%). This goes to show that there is some variation in results when sampling from a population.

So… the treatment group's value is higher. **Is this difference statistically significant?**

## Testing the hypothesis

The last step of our analysis is testing our hypothesis. Since we have a very large sample, we can use the normal approximation for calculating our p-value (i.e. z-test).

In [None]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint


control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_con_succ = control_results.sum()

n_treat = treatment_results.count()
n_treat_succ = treatment_results.sum()

successes = [n_con_succ, n_treat_succ]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')
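
To see what the z-test does under the hood, here is a hand-rolled sketch of a pooled two-proportion z-test, which to our understanding matches the default behaviour of `proportions_ztest`. The counts are hypothetical, picked only to give rates of roughly 12.3% and 12.6%; they are not read from the data, so the output will not exactly match the cell above.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts (NOT taken from the dataset)
control_n, control_conv = 4720, 580
treat_n, treat_conv = 4720, 593

p_control = control_conv / control_n
p_treat = treat_conv / treat_n

# Pooled rate under H0, where both groups share one conversion rate
p_pooled = (control_conv + treat_conv) / (control_n + treat_n)
std_diff = sqrt(p_pooled * (1 - p_pooled) * (1 / control_n + 1 / treat_n))

z_manual = (p_control - p_treat) / std_diff
p_manual = 2 * (1 - NormalDist().cdf(abs(z_manual)))  # two-tailed p-value

print(f'z statistic: {z_manual:.2f}')
print(f'p-value: {p_manual:.3f}')
```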

## Drawing conclusions

Since our p-value of 0.732 is well above our $\alpha = 0.05$ threshold, we cannot reject the Null hypothesis $H_0$, which means that our new design did not perform significantly differently from (let alone better than) our old one :(

Additionally, if we look at the confidence interval for the treatment group ([0.116, 0.135], or 11.6% to 13.5%) we notice that:

- It includes our baseline value of 13% conversion rate
- It does not include our target value of 15% (the 2 percentage point uplift we were aiming for)

What this means is that the true conversion rate of the new design is more likely to be similar to our baseline than to the 15% target we had hoped for. This is further evidence that our new design is not likely to be an improvement on our old design, and that unfortunately we are back to the drawing board!
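
The interval check above can be sketched with the normal-approximation (Wald) formula $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, which we believe is the default method of `proportion_confint`. The success count below is a hypothetical value giving a rate of roughly 12.6%, not a figure read from the data, and `normal_confint` is our own helper name.

```python
from math import sqrt
from statistics import NormalDist


def normal_confint(successes: int, nobs: int, alpha: float = 0.05) -> tuple:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / nobs
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / nobs)
    return p_hat - margin, p_hat + margin


# Hypothetical count giving roughly a 12.6% rate; not read from the data
lower, upper = normal_confint(successes=593, nobs=4720)
print(f'ci 95%: [{lower:.3f}, {upper:.3f}]')
print(f'contains 13% baseline: {lower <= 0.13 <= upper}')
print(f'contains 15% target:   {lower <= 0.15 <= upper}')
```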