# AB Test Case Study

### Homework

Before you begin, remember to import the necessary libraries.

In [19]:
import pandas as pd
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

The Marketing and Product teams are studying a new bundle offer that they want to test in the checkout funnel of the e-commerce platform. Specifically, once a user adds an item to the cart, they see an additional screen with a special offer on a related product, valid if they add it to the cart and purchase both items together.

To be sure that this is a good idea, they ran an A/B test and, after the fact, called you in to analyse the results and verify whether the increase in conversion rate is statistically significant.

1. The test results are in the `data/ab-test-homework.csv` file. Import them, familiarise yourself with the data and perform a permutation test to assess whether the treatment's higher conversion rate is statistically significant (note: control group = 'GRP A'; treatment group = 'GRP B').
2. Are you satisfied with the results? If you had been consulted before the test, would you have done things differently?

***

In [20]:
df = pd.read_csv('data/ab-test-homework.csv')
df.sample(10)

Unnamed: 0,idUser,dtPurchased,nmGroup,isConv
5198,60091027,2017-10-02,GRP B,0
1705,17538793,2017-03-14,GRP A,0
368,75180339,2017-04-26,GRP A,1
1062,58094131,2017-04-23,GRP A,0
3070,76253764,2017-03-03,GRP B,0
2132,11785259,2017-05-29,GRP A,0
1803,68088927,2017-03-16,GRP A,1
1566,64967316,2017-12-14,GRP A,0
2695,32356826,2017-09-23,GRP A,0
5967,52539873,2017-09-14,GRP B,0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6177 entries, 0 to 6176
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   idUser       6177 non-null   int64 
 1   dtPurchased  6177 non-null   object
 2   nmGroup      6177 non-null   object
 3   isConv       6177 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 193.2+ KB


In [22]:
df['dtPurchased'] = pd.to_datetime(df['dtPurchased'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6177 entries, 0 to 6176
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   idUser       6177 non-null   int64         
 1   dtPurchased  6177 non-null   datetime64[ns]
 2   nmGroup      6177 non-null   object        
 3   isConv       6177 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 193.2+ KB
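
With `dtPurchased` parsed as a datetime, one quick sanity check is to compare the period covered by each group. The sketch below runs on a small made-up frame (`toy`, with invented dates) that only borrows the column names from the homework file; it is an illustration of the idea, not output from the real data.

```python
import pandas as pd

# Toy frame reusing the homework column names; the rows are invented.
toy = pd.DataFrame({
    "dtPurchased": ["2017-03-03", "2017-04-26", "2017-09-14", "2017-12-23"],
    "nmGroup": ["GRP A", "GRP A", "GRP B", "GRP B"],
})
toy["dtPurchased"] = pd.to_datetime(toy["dtPurchased"])

# Earliest and latest date observed in each group.
date_range = toy.groupby("nmGroup")["dtPurchased"].agg(["min", "max"])
print(date_range)
```

If the two groups cover clearly different periods, seasonality could be confounded with the treatment, which is worth keeping in mind for question 2.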


In [23]:
df.sample(5)

Unnamed: 0,idUser,dtPurchased,nmGroup,isConv
713,37444470,2017-09-16,GRP A,0
1911,21102250,2017-06-04,GRP A,0
540,89217852,2017-12-09,GRP A,1
4464,40083177,2017-05-03,GRP B,1
6141,52142418,2017-12-23,GRP B,0


In [24]:
# 1. The test results are in the `data/ab-test-homework.csv` file.
# Import them, familiarise yourself with the data and perform a permutation test to assess whether the treatment's higher conversion rate is statistically significant
# (note: control group = 'GRP A'; treatment group = 'GRP B').

In [25]:
df.groupby('nmGroup')['isConv'].mean()

nmGroup
GRP A    0.267745
GRP B    0.290661
Name: isConv, dtype: float64
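
The mean alone hides how many users sit behind each rate. A minimal sketch, assuming a toy frame (`toy`, with made-up rows) that reuses the `nmGroup` and `isConv` column names, shows how group size, conversion count and conversion rate could be read off in one call:

```python
import pandas as pd

# Toy data with the homework column names; the values are invented.
toy = pd.DataFrame({
    "nmGroup": ["GRP A"] * 4 + ["GRP B"] * 6,
    "isConv": [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

summary = toy.groupby("nmGroup")["isConv"].agg(
    n_users="count",       # non-null rows per group
    n_conversions="sum",   # isConv is 0/1, so the sum counts conversions
    conv_rate="mean",      # share of users who converted
)
print(summary)
```

Unequal group sizes matter later on, because a permutation test should reshuffle users into groups of the same sizes as the original ones.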

In [26]:
control_group = df[df['nmGroup']=='GRP A']['isConv'].values
control_group

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [27]:
treatment_group = df[df['nmGroup']=='GRP B']['isConv'].values
treatment_group

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

In [28]:
difference_mean = np.mean(treatment_group) - np.mean(control_group)
difference_mean

0.022915606569783087

In [33]:
n = 1000
# np.random.seed(1)  # uncomment for reproducible results

mu_diff_samples = []
for _ in range(n):
    # draw a random 50% of the rows without replacement
    # (note: this splits 50/50 regardless of the original group sizes)
    df_sample1 = df.sample(frac=0.5, replace=False)
    # the rows that were not drawn form the second sample;
    # dropping by label avoids mixing index labels with iloc positions
    df_sample2 = df.drop(df_sample1.index)
    # difference in conversion rate between the two random halves
    mu_diff_temp = df_sample1['isConv'].mean() - df_sample2['isConv'].mean()
    # append sampled mean difference to mu_diff_samples list
    mu_diff_samples.append(mu_diff_temp)
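
A common variant of the permutation test shuffles the pooled outcomes and re-splits them into groups of the original sizes, then compares the observed difference with the shuffled ones. The sketch below is one possible way to do that, not necessarily the intended solution. It runs on synthetic 0/1 arrays (`control` and `treatment`, with invented sizes and rates) rather than the homework data, and `n_perm`, the fixed seed and the one-sided p-value with a +1 correction are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins for control_group / treatment_group (made-up rates and sizes).
control = rng.binomial(1, 0.27, size=3000)
treatment = rng.binomial(1, 0.29, size=3200)

observed_diff = treatment.mean() - control.mean()

pooled = np.concatenate([control, treatment])
n_treat = treatment.size
n_perm = 2000  # number of shuffles (assumption)

perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle the pooled outcomes, then re-split using the original group sizes.
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()

# One-sided p-value: share of shuffles with a difference at least as large
# as the observed one; the +1 keeps the estimate strictly above zero.
p_value = (np.sum(perm_diffs >= observed_diff) + 1) / (n_perm + 1)
print(f"observed diff = {observed_diff:.4f}, p-value = {p_value:.4f}")
```

With the real data, `control_group` and `treatment_group` from the cells above would take the place of the synthetic arrays.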

In [30]:
df.shape

(6177, 4)