# A/B Test Challenge



---

#### What is an A/B Test? 

An A/B test is a research methodology that supports decision making by measuring the impact of a change in a product (e.g., a digital product). In this challenge you will analyse the data from an A/B test performed on a digital product in which a new set of sponsored ads was introduced.


#### Measure of success

Metrics are needed to measure the success of your product. They are typically split into the following categories:

- __Engagement-based metrics:__ number of users, number of downloads, number of active users, user retention, etc.

- __Revenue and monetization metrics:__ ads and affiliate links, subscription-based, in-app purchases, etc.

- __Technical metrics:__ service level indicators (uptime of the app, downtime of the app, latency).



---

## Metrics understanding

In this part you must analyse the metrics involved in the test. We will focus on the following metrics:

- Activity level + Daily active users (DAU).

- Click-through rate (CTR)

### Activity level

In the following part you must perform every calculation you consider necessary in order to answer the following questions:

- How many activity levels can you find in the dataset? (An activity level of zero means no activity.)

- How many users are there at each activity level?

- How many activity levels are there per day, and how many records are there for each activity level?

At the end of this section you must provide your conclusions about the _activity level_ of the users.

__Dataset:__ `activity_pretest.csv`

In [36]:
import pandas as pd
import numpy as np
from statsmodels.stats.weightstats import ztest
from scipy import stats
from scipy.stats import norm
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# your-code
activity_pretest_csv = pd.read_csv('../../data/AB_test/activity_pretest.csv')

In [3]:
activity_pretest = activity_pretest_csv.sort_values(by = 'activity_level', ascending = False)
activity_pretest

Unnamed: 0,userid,dt,activity_level
1859999,bac5da9e-ef79-4ae9-9efe-cd6eca093db2,2021-10-31,20
1843656,46ddc861-1060-4f2e-bd5d-141951e767a9,2021-10-11,20
1843647,2beee4b9-4d65-40c2-bf6e-8e03e57590d3,2021-10-11,20
1843648,94f643c2-d29e-42f8-8af5-2790e992971d,2021-10-11,20
1843649,e99c93b7-de5a-40fa-b1b5-4dceb2b382e9,2021-10-11,20
...,...,...,...
606082,5f456e7d-9786-4a3d-90b0-aaf24825c1c7,2021-10-21,0
606081,c5c95a01-106a-4093-a9da-00887cd8d19d,2021-10-21,0
606080,cdfcb25d-5394-43b8-9562-bd4f4d7c3ea9,2021-10-21,0
606079,b9b52ddf-b7e5-4ee8-8e51-f2d3e4f5d819,2021-10-21,0


- How many activity levels can you find in the dataset? (An activity level of zero means no activity.)

In [4]:
activity_pretest['activity_level'].value_counts()

0     909125
5      49227
2      49074
18     48982
10     48943
16     48934
12     48911
19     48901
6      48901
11     48832
9      48820
1      48732
3      48659
14     48620
15     48599
4      48556
13     48534
8      48396
17     48395
7      48339
20     24520
Name: activity_level, dtype: int64
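
The `value_counts()` output above lists 21 distinct levels (0 through 20). A minimal sketch of how the count itself could be computed directly is shown here on a small hypothetical stand-in DataFrame (the `activity` frame and its values are invented for illustration; only the column names follow the dataset).

```python
import pandas as pd

# Hypothetical miniature stand-in for activity_pretest (same column names).
activity = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 5, 20, 3, 0, 5],
})

# Number of distinct levels, including 0 (no activity).
n_levels = activity["activity_level"].nunique()

# Number of distinct levels that represent real activity (level > 0).
is_active = activity["activity_level"] > 0
n_active_levels = activity.loc[is_active, "activity_level"].nunique()

print(n_levels, n_active_levels)
```

On the real data the same two expressions would be applied to `activity_pretest`.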

* How many users are there at each activity level?

In [5]:
activity_pretest_groupby_userid = activity_pretest.groupby(['userid'])
activity_pretest_groupby_userid.value_counts()

userid                                dt          activity_level
0002a1ca-0b76-41cd-91e6-9aa51947b7fc  2021-10-01  0                 1
                                      2021-10-02  14                1
                                      2021-10-27  0                 1
                                      2021-10-26  8                 1
                                      2021-10-25  3                 1
                                                                   ..
ffff4881-db48-4836-b669-676f3bb5a761  2021-10-10  5                 1
                                      2021-10-09  0                 1
                                      2021-10-08  9                 1
                                      2021-10-07  10                1
                                      2021-10-31  1                 1
Length: 1860000, dtype: int64
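
The output above counts records per user, date, and level rather than users per level. One possible way to answer the question, sketched on a hypothetical stand-in DataFrame (values invented for illustration), is to group by `activity_level` and count distinct `userid` values.

```python
import pandas as pd

# Hypothetical miniature stand-in for activity_pretest (same column names).
activity = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 5, 20, 3, 0, 5],
})

# Distinct users observed at each activity level (a user can appear at
# several levels because the level is recorded per day).
users_per_level = (
    activity.groupby("activity_level")["userid"]
    .nunique()
    .sort_index()
)

print(users_per_level)
```

Because a user's level changes from day to day, the per-level counts overlap and are not expected to sum to the total number of users.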

* How many activity levels are there per day, and how many records are there for each activity level?

This question has two parts: the number of activity levels per day, and the number of records per activity level.

In [6]:
activity_pretest_groupby_dt = activity_pretest.groupby(by = ['dt'], as_index=False)
activity_pretest_groupby_dt

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000020838E0ACA0>

In [7]:
activity_pretest_dt = activity_pretest['dt'].value_counts().sort_values()
activity_pretest_dt

2021-10-31    60000
2021-10-12    60000
2021-10-16    60000
2021-10-09    60000
2021-10-08    60000
2021-10-15    60000
2021-10-14    60000
2021-10-13    60000
2021-10-03    60000
2021-10-04    60000
2021-10-01    60000
2021-10-02    60000
2021-10-07    60000
2021-10-11    60000
2021-10-10    60000
2021-10-06    60000
2021-10-27    60000
2021-10-28    60000
2021-10-24    60000
2021-10-25    60000
2021-10-30    60000
2021-10-29    60000
2021-10-19    60000
2021-10-18    60000
2021-10-20    60000
2021-10-17    60000
2021-10-23    60000
2021-10-22    60000
2021-10-05    60000
2021-10-26    60000
2021-10-21    60000
Name: dt, dtype: int64
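
The counts above show 60000 records on each of the 31 days, but not how they split by level. A minimal sketch of both parts of the question, using a hypothetical stand-in DataFrame (values invented for illustration), could look like this.

```python
import pandas as pd

# Hypothetical miniature stand-in for activity_pretest (same column names).
activity = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 5, 20, 5, 0, 5],
})

# Part 1: number of distinct activity levels observed on each day.
levels_per_day = activity.groupby("dt")["activity_level"].nunique()

# Part 2: number of records for each (day, level) pair, as a day x level table.
records_per_day_level = (
    activity.groupby(["dt", "activity_level"])
    .size()
    .unstack(fill_value=0)
)

print(levels_per_day)
print(records_per_day_level)
```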

In [8]:
activity_pretest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1860000 entries, 1859999 to 0
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   userid          object
 1   dt              object
 2   activity_level  int64 
dtypes: int64(1), object(2)
memory usage: 56.8+ MB



### Daily active users (DAU)

![ab_test](./img/user_activity_ab_testing.JPG)


The daily active users (DAU) metric is the number of users who are active on a given day (an activity level of zero means no activity). You must calculate this metric and provide your insights about it.

__Dataset:__ `activity_pretest.csv`

In [9]:
# your-code
activity_pretest.sort_values(by='dt')

Unnamed: 0,userid,dt,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0
16369,f9bf447e-6358-41b7-b402-11fabc876b2f,2021-10-01,0
16370,efb8ac42-8ac5-4690-9b24-5062ca38aaf8,2021-10-01,0
16399,679f2548-f23c-4213-b014-ee710c238505,2021-10-01,0
16400,593c5cc9-9bde-43cb-a14c-b8adc357e3a9,2021-10-01,0
...,...,...,...
884210,faa87376-fb4b-4244-94a8-b22c208e8e63,2021-10-31,0
884170,19f03ec6-f25e-4b29-93ec-36fea54267a4,2021-10-31,0
884172,4ad4ad0d-499e-4c48-bc8b-58fb7acfc65d,2021-10-31,0
884161,b900f2ea-03ab-47c1-b1df-c4d54d7a4c78,2021-10-31,0
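
Sorting by date does not yet yield the metric. Following the definition above, DAU can be sketched as the number of distinct users with an activity level greater than zero on each day. The stand-in DataFrame below is hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for activity_pretest (same column names).
activity = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 5, 20, 3, 0, 0],
})

# Keep only records with real activity, then count distinct users per day.
active = activity[activity["activity_level"] > 0]
dau = active.groupby("dt")["userid"].nunique()

print(dau)
print("mean DAU:", dau.mean())
```

Summary statistics such as `dau.mean()` or `dau.describe()` are one way to support the insights requested for this metric.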


### Click-through rate (CTR)

![ab_test](./img/ad_click_through_rate_ab_testing.JPG)

Click-through rate (CTR) is the percentage of the ads shown to a user on a given day that the user clicked. You must analyse this metric (e.g., average CTR per day) and provide your insights about it.

__Dataset:__ `ctr_pretest.csv`

In [10]:
# your-code
ctr_pretest_csv = pd.read_csv('../../data/AB_test/ctr_pretest.csv')

In [11]:
ctr_pretest = activity_pretest_csv  # note: this is the activity data, not ctr_pretest_csv, hence the activity_level column below
ctr_pretest

Unnamed: 0,userid,dt,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,0
...,...,...,...
1859995,200d65e6-b1ce-4a47-8c2b-946db5c5a3a0,2021-10-31,20
1859996,535dafe4-de7c-4b56-acf6-aa94f21653bc,2021-10-31,20
1859997,0428ca3c-e666-4ef4-8588-3a2af904a123,2021-10-31,20
1859998,a8cd1579-44d4-48b3-b3d6-47ae5197dbc6,2021-10-31,20
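
The table above shows activity data, so the CTR analysis itself is still missing. A minimal sketch of the average CTR per day follows. It assumes that `ctr_pretest.csv` has `userid`, `dt`, and `ctr` columns (an assumption based on the columns of `ctr_all.csv`), and the stand-in values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CTR pretest data; the column names are assumed.
ctr = pd.DataFrame({
    "userid": ["u1", "u2", "u1", "u2"],
    "dt": ["2021-10-01", "2021-10-01", "2021-10-02", "2021-10-02"],
    "ctr": [30.0, 34.0, 32.0, 36.0],
})

# Average CTR per day across all users who were shown ads that day.
avg_ctr_per_day = ctr.groupby("dt")["ctr"].mean()

print(avg_ctr_per_day)
```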


---

## Pretest metrics 

In this section you will analyse the metrics using the dataset that includes the results for the test and control groups, but only for the pretest data (i.e., prior to November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups prior to the start of the experiment. You must try different approaches (i.e., __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

---

# <span style="color:brown"> **We seek to reject the null hypothesis, and "confirm" the alternative hypothesis, which we think is true.** </span>

---

In [12]:
# your-code
activity_all_csv = pd.read_csv('../../data/AB_test/activity_all.csv')
ctr_all_csv = pd.read_csv('../../data/AB_test/ctr_all.csv')

In [13]:
activity_all = activity_all_csv
activity_all.sort_values(by='dt')

Unnamed: 0,userid,dt,groupid,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0,0
2067287,465e71e4-49fd-4103-b852-254dd5dcf2de,2021-10-01,0,7
2067288,0052685a-a5fc-48b4-b7b1-53475cfbb5fd,2021-10-01,0,7
2067289,dd6dbd85-6fd8-43df-9a9a-d1b59b04b2a4,2021-10-01,1,7
2067290,7343e7db-9f5a-4240-b224-ab50821f7615,2021-10-01,1,7
...,...,...,...,...
2654417,3cfc6c6c-750b-4d78-8c97-5c7f66b87470,2021-11-30,1,11
2654416,3c99a48a-645c-49f0-9cf3-7e8fb8f18ca8,2021-11-30,0,11
2654415,9aa701ba-18ff-4e9e-b4cc-0b225586f8a6,2021-11-30,1,11
2654413,7f27e2d6-b51d-4b34-b590-8f8f50a2fe03,2021-11-30,0,11


In [14]:
ctr_all = ctr_all_csv
ctr_all

Unnamed: 0,userid,dt,groupid,ctr
0,60389fa7-2d71-4cdf-831c-c2bb277ffa1e,2021-11-13,0,31.81
1,b59cb225-d160-4851-92d2-7cc8120a2f63,2021-11-13,0,30.46
2,aa336050-934e-453f-a5b0-dd881fcd114e,2021-11-13,0,34.25
3,8df767f4-a10f-4322-a722-676b7e02b372,2021-11-13,0,34.92
4,a74762ed-4da0-42ab-91d2-40d7e808dfe9,2021-11-13,0,34.95
...,...,...,...,...
2303403,932e0348-ea2d-4b98-8782-aa84420f0796,2021-11-12,1,37.27
2303404,6775a825-6d3d-4dc3-9335-cad061736752,2021-11-12,1,39.14
2303405,a7b55365-21f1-4123-b2b5-485a8c7b98da,2021-11-12,1,40.05
2303406,a6fa937c-6f40-4f04-b15b-f1de09e179db,2021-11-12,1,38.14
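
The pretest comparison described in this section could be sketched as follows: keep rows before 2021-11-01, split them by `groupid`, and compare the two groups with a z-test and a t-test. The data below is a randomly generated stand-in, not the real `ctr_all.csv`. The z statistic is computed by hand with unpooled variances, which is a choice made for this sketch and not necessarily the default of `statsmodels`' `ztest`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for ctr_all.csv (columns userid, dt, groupid, ctr).
rng = np.random.default_rng(0)
n = 400
ctr_all_demo = pd.DataFrame({
    "userid": [f"u{i}" for i in range(n)],
    "dt": pd.to_datetime(rng.choice(["2021-10-15", "2021-11-15"], size=n)),
    "groupid": rng.integers(0, 2, size=n),
    "ctr": rng.normal(33.0, 1.7, size=n),
})

# Keep only pretest rows (before the experiment started on 2021-11-01).
pretest = ctr_all_demo[ctr_all_demo["dt"] < "2021-11-01"]
control = pretest.loc[pretest["groupid"] == 0, "ctr"].to_numpy()
treated = pretest.loc[pretest["groupid"] == 1, "ctr"].to_numpy()

# Two-sample z statistic with unpooled variances, computed by hand.
se = np.sqrt(control.var(ddof=1) / control.size
             + treated.var(ddof=1) / treated.size)
z_stat = (control.mean() - treated.mean()) / se
z_p = 2 * stats.norm.sf(abs(z_stat))

# Welch's t-test on the same two samples.
t_stat, t_p = stats.ttest_ind(control, treated, equal_var=False)

print(f"z = {z_stat:.3f}, p = {z_p:.3f}")
print(f"t = {t_stat:.3f}, p = {t_p:.3f}")
```

With these definitions the two statistics are identical and only the reference distribution differs, so the p-values converge as the samples grow. A large p-value before the experiment would suggest that the groups were comparable at the start.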


---

## Experiment metrics 

In this section you must perform the same analysis as in the previous section, but using the data generated during the experiment (i.e., from November 1st, 2021 onwards). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups during the experiment. You must try different approaches (i.e., __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

### <span style ='color:red'>Control group and experiment group, split at 1 November 2021: four DataFrames</span>

In [15]:
def row_filter(df, cat_var, cat_values):
    """Keep rows where df[cat_var] equals cat_values, newest dates first."""
    filtered = df[df[cat_var].isin([cat_values])].sort_values(by='dt', ascending=False)
    return filtered.reset_index(drop=True)

In [16]:
# your-code
ctr_all_0 = row_filter(ctr_all,'groupid', 0)
ctr_all_0['dt'] = pd.to_datetime(ctr_all_0['dt'], format='%Y-%m-%d')
ctr_all_0

Unnamed: 0,userid,dt,groupid,ctr
0,f48dbfa5-b518-4b54-a962-e2e9142436b6,2021-11-30,0,32.86
1,dfa190da-3afe-4500-befa-30d8ebdeaf80,2021-11-30,0,34.62
2,c8691dbd-a40e-44c3-8871-71c853a93111,2021-11-30,0,34.71
3,13a0ed53-3ac4-4603-bd68-ef8e1cf45551,2021-11-30,0,31.61
4,b1e9bae4-9b25-4f18-8113-9ab158c4cb4d,2021-11-30,0,30.17
...,...,...,...,...
948402,dac525fd-3274-4cf4-8861-2cd04ca4ff70,2021-10-01,0,32.86
948403,ace0bd91-fe87-49ba-8db2-93cb34a963c5,2021-10-01,0,34.92
948404,0ca96593-77c7-4526-b287-9031a008a4f2,2021-10-01,0,31.07
948405,6e5fa72b-7bd3-45ef-af33-45803cca27bc,2021-10-01,0,33.00


In [17]:
ctr_all_1 = row_filter(ctr_all,'groupid', 1)
ctr_all_1['dt'] = pd.to_datetime(ctr_all_1['dt'], format='%Y-%m-%d')
ctr_all_1_before = ctr_all_1.loc[(ctr_all_1['dt'] < '2021-11-01')]
ctr_all_1_before

Unnamed: 0,userid,dt,groupid,ctr
879073,482aa08c-51b5-48cf-9e9d-d36f939f69b9,2021-10-31,1,32.99
879074,91f98d0a-532b-4565-9078-9fc2d9c4c8b0,2021-10-31,1,30.01
879075,2156bade-1847-496e-97b4-eae2494406fd,2021-10-31,1,31.95
879076,e56bab21-6b25-496b-9cc0-f4c638e34024,2021-10-31,1,33.59
879077,a9a346c8-1f96-4a19-b989-372f8d859366,2021-10-31,1,33.39
...,...,...,...,...
1354996,31d05431-db75-4990-9bae-84425f1a5c7c,2021-10-01,1,35.24
1354997,70f0a24c-bd7b-406f-a761-3e15d84b6a14,2021-10-01,1,32.65
1354998,ff1f4add-83da-4a00-9754-fc8d554a9cdf,2021-10-01,1,31.78
1354999,8d54137a-fcea-4c21-a6bf-fe97ab0372ee,2021-10-01,1,34.27


In [18]:
ctr_all_1_after = ctr_all_1.loc[(ctr_all_1['dt'] >= '2021-11-01')]
ctr_all_1_after

Unnamed: 0,userid,dt,groupid,ctr
0,fc5e8542-255d-4b76-a4c5-a22dcd48bb8e,2021-11-30,1,37.25
1,cd88d415-cad7-45f5-9e29-43e15ea92139,2021-11-30,1,40.46
2,434ff241-4d68-4139-82d7-189ee5b06ec8,2021-11-30,1,40.82
3,0156bc52-b152-448d-b174-a01f1aea8bb9,2021-11-30,1,37.69
4,6bd40a10-9796-46f6-948d-7f4b74d63e69,2021-11-30,1,37.33
...,...,...,...,...
879068,b0949c78-5f08-427a-9f25-62e446b6d3a9,2021-11-01,1,36.24
879069,01b5d9d8-8b61-4eab-b3b7-e0de9e14ba1e,2021-11-01,1,37.30
879070,70ebbd95-fcb8-4972-bd6f-e802a71e4e8c,2021-11-01,1,35.38
879071,4b4d7f8b-ab22-44d8-9152-ff73db704a3f,2021-11-01,1,39.49


In [19]:
ctr_all_0_before = ctr_all_0.loc[(ctr_all_0['dt'] < '2021-11-01')]
ctr_all_0_before

Unnamed: 0,userid,dt,groupid,ctr
473460,3f40c765-ea51-4746-b9af-9c4b2e507050,2021-10-31,0,30.05
473461,b5512988-58a3-4d46-b136-1fe8ef6e26a7,2021-10-31,0,35.96
473462,aea081ef-b452-4da4-b7f6-a50dcd978660,2021-10-31,0,30.62
473463,ad15f543-06a4-4bea-983e-073c0df8688f,2021-10-31,0,30.85
473464,2ccd3500-af47-4431-8c68-4988f712880a,2021-10-31,0,32.31
...,...,...,...,...
948402,dac525fd-3274-4cf4-8861-2cd04ca4ff70,2021-10-01,0,32.86
948403,ace0bd91-fe87-49ba-8db2-93cb34a963c5,2021-10-01,0,34.92
948404,0ca96593-77c7-4526-b287-9031a008a4f2,2021-10-01,0,31.07
948405,6e5fa72b-7bd3-45ef-af33-45803cca27bc,2021-10-01,0,33.00


In [20]:
ctr_all_0_after = ctr_all_0.loc[(ctr_all_0['dt'] >= '2021-11-01')]
ctr_all_0_after

Unnamed: 0,userid,dt,groupid,ctr
0,f48dbfa5-b518-4b54-a962-e2e9142436b6,2021-11-30,0,32.86
1,dfa190da-3afe-4500-befa-30d8ebdeaf80,2021-11-30,0,34.62
2,c8691dbd-a40e-44c3-8871-71c853a93111,2021-11-30,0,34.71
3,13a0ed53-3ac4-4603-bd68-ef8e1cf45551,2021-11-30,0,31.61
4,b1e9bae4-9b25-4f18-8113-9ab158c4cb4d,2021-11-30,0,30.17
...,...,...,...,...
473455,9b3497ee-fe9b-4218-8c01-40ff99df321e,2021-11-01,0,33.69
473456,7636a583-877f-4179-a7c0-e487692a12d1,2021-11-01,0,31.13
473457,f721dbfd-0498-40af-b47e-3f3ebe695bdd,2021-11-01,0,30.58
473458,85e3b8fb-8715-4479-96e5-737ecb21d98f,2021-11-01,0,35.88


In [21]:
ctr_all_0_after_mean = ctr_all_0_after['ctr'].mean()    # Control group, during the experiment
ctr_all_1_after_mean = ctr_all_1_after['ctr'].mean()    # Experiment group, during the experiment
ctr_all_0_before_mean = ctr_all_0_before['ctr'].mean()  # Control group, before the experiment
ctr_all_1_before_mean = ctr_all_1_before['ctr'].mean()  # Experiment group, before the experiment

In [28]:
ctr_all_1_before_mean

32.99957172093257

In [31]:
# One-sample z-test, control group: experiment-period CTR vs. its pretest mean
hypothesis_mean = ctr_all_0_before_mean
sample = ctr_all_0_after['ctr']  # full sample of CTR values, not a mean
alpha = 0.05

Z_score, p_value = ztest(sample, value=hypothesis_mean)
print(f'Z_score: {Z_score}', f'\np-value: {p_value}')

Z_score: -1.5622853085868977 
p-value: 0.1182207914176757


In [32]:
# One-sample z-test, experiment group: experiment-period CTR vs. its pretest mean
hypothesis_mean = ctr_all_1_before_mean
sample = ctr_all_1_after['ctr']  # full sample of CTR values, not a mean
alpha = 0.05

Z_score, p_value = ztest(sample, value=hypothesis_mean)
print(f'Z_score: {Z_score}', f'\np-value: {p_value}')

Z_score: 2704.670233205835 
p-value: 0.0
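
For reference, when `ztest` is called with a single sample and a `value`, the statistic it reports is the standard one-sample z statistic:

$$
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

Here $\bar{x}$ is the mean CTR of the experiment-period sample, $\mu_0$ is the pretest mean passed as `value`, $s$ is the sample standard deviation, and $n$ is the number of records. Because $n$ is in the hundreds of thousands, the denominator is very small, so a shift of a few CTR points produces a z-score in the thousands and a p-value that prints as 0.0.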


In [34]:
# t test (scalar means are passed instead of the full samples, so the result below is nan)
stats.ttest_ind(a=ctr_all_1_after_mean, b=ctr_all_1_before_mean, equal_var=True)

Ttest_indResult(statistic=nan, pvalue=nan)
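
`scipy.stats.ttest_ind` expects two array-like samples. Two scalars carry no within-sample variance, which is why the result is `nan`. A minimal sketch with the full samples is shown below on randomly generated stand-in arrays (`before` and `after` are hypothetical; in the notebook they would be `ctr_all_1_before['ctr']` and `ctr_all_1_after['ctr']`). It uses Welch's variant (`equal_var=False`), which is a choice made for this sketch.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the experiment group's CTR before and during the test.
rng = np.random.default_rng(1)
before = rng.normal(33.0, 1.5, size=500)
after = rng.normal(38.0, 1.5, size=500)

# Pass the samples themselves, not their means.
result = stats.ttest_ind(after, before, equal_var=False)

print(result.statistic, result.pvalue)
```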

In [37]:
confidence = 0.95
# Note: scale=1 fixes the spread at one CTR point; it is not estimated from the data
norm.interval(confidence,
              loc=ctr_all_1_after_mean,
              scale=1)

(36.03699514172137, 39.956923110801476)
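
The interval above has a fixed half-width of about 1.96 because `scale=1`. A confidence interval for the mean would normally use the standard error of the mean as the scale. A minimal sketch on a randomly generated stand-in sample (`after` is hypothetical; in the notebook it would be `ctr_all_1_after['ctr']`):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the experiment group's CTR during the test.
rng = np.random.default_rng(2)
after = rng.normal(38.0, 1.5, size=1000)

confidence = 0.95
mean = after.mean()
se = stats.sem(after)  # sample standard deviation (ddof=1) divided by sqrt(n)

# Normal-approximation confidence interval for the mean.
low, high = stats.norm.interval(confidence, loc=mean, scale=se)

print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With hundreds of thousands of records the standard error is tiny, so the resulting interval would be far narrower than the one obtained with `scale=1`.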

---

## Conclusions

Please provide your conclusions after the analyses and your recommendation whether we may or may not implement the changes in the digital product.

# <font color = 'blue'> your-conclusions
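The experiment was successful. The experiment group's mean CTR rose from about 33.0 before the test to about 38.0 during it (one-sample z-test, p-value reported as 0.0), while the control group's CTR did not change significantly (p-value ≈ 0.118). Based on CTR, the recommendation is to implement the change.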



---