# **Testing the AB ANALYTICS library**

In [1]:
import pandas as pd
from ab_analytics import *

## Dataset

In [2]:
data = pd.read_csv('data/ab_data_tourist.csv')

data

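|  | user_id | date | group | purchase | price |
| --- | --- | --- | --- | --- | --- |
| 0 | 851104 | 2021-01-21 | A | 0 | 0 |
| 1 | 804228 | 2021-01-12 | A | 0 | 0 |
| 2 | 661590 | 2021-01-11 | B | 0 | 0 |
| 3 | 853541 | 2021-01-08 | B | 0 | 0 |
| 4 | 864975 | 2021-01-21 | A | 1 | 150000 |
| ... | ... | ... | ... | ... | ... |
| 294473 | 751197 | 2021-01-03 | A | 0 | 0 |
| 294474 | 945152 | 2021-01-12 | A | 0 | 0 |
| 294475 | 734608 | 2021-01-22 | A | 0 | 0 |
| 294476 | 697314 | 2021-01-15 | A | 0 | 0 |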


-----

## **Function `data_divider`**

### Divided data

In [3]:
datas = data_divider(data, 'group')

In [4]:
data_a = datas[0]  # caution: the output below shows that datas[0] holds the group B rows

data_a

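|  | user_id | date | group | purchase | price |
| --- | --- | --- | --- | --- | --- |
| 2 | 661590 | 2021-01-11 | B | 0 | 0 |
| 3 | 853541 | 2021-01-08 | B | 0 | 0 |
| 6 | 679687 | 2021-01-19 | B | 1 | 10000 |
| 8 | 817355 | 2021-01-04 | B | 1 | 60000 |
| 9 | 839785 | 2021-01-15 | B | 1 | 10000 |
| ... | ... | ... | ... | ... | ... |
| 294462 | 677163 | 2021-01-03 | B | 0 | 0 |
| 294465 | 925675 | 2021-01-07 | B | 0 | 0 |
| 294468 | 643562 | 2021-01-02 | B | 0 | 0 |
| 294472 | 822004 | 2021-01-04 | B | 0 | 0 |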


In [5]:
data_b = datas[1]  # caution: the output below shows that datas[1] holds the group A rows

data_b

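|  | user_id | date | group | purchase | price |
| --- | --- | --- | --- | --- | --- |
| 0 | 851104 | 2021-01-21 | A | 0 | 0 |
| 1 | 804228 | 2021-01-12 | A | 0 | 0 |
| 4 | 864975 | 2021-01-21 | A | 1 | 150000 |
| 5 | 936923 | 2021-01-10 | A | 0 | 0 |
| 7 | 719014 | 2021-01-17 | A | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 294471 | 718310 | 2021-01-21 | A | 0 | 0 |
| 294473 | 751197 | 2021-01-03 | A | 0 | 0 |
| 294474 | 945152 | 2021-01-12 | A | 0 | 0 |
| 294475 | 734608 | 2021-01-22 | A | 0 | 0 |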
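
Positional indexing into the result depends on the order in which `data_divider` returns the groups. A minimal sketch, assuming a plain pandas `groupby` and not the library's own implementation, of splitting a frame into a dict keyed by group value so that each part is picked by name. `split_by_group` is a hypothetical helper, and the sample rows are copied from the dataset preview.

```python
import pandas as pd

def split_by_group(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical helper: map each distinct value of `column` to its rows."""
    return {key: part for key, part in df.groupby(column)}

# Tiny frame built from rows shown in the dataset preview.
sample = pd.DataFrame({
    "user_id": [851104, 804228, 661590, 853541, 864975],
    "group": ["A", "A", "B", "B", "A"],
    "purchase": [0, 0, 0, 0, 1],
})

parts = split_by_group(sample, "group")
sample_a, sample_b = parts["A"], parts["B"]  # selected by label, not by position
```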


----

## **Function `data_intersect_cleaner`**

### Cleaned data

In [6]:
data_cld = data_intersect_cleaner(data_a, data_b, 'user_id')

data_cld

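|  | user_id | date | group | purchase | price |
| --- | --- | --- | --- | --- | --- |
| 2 | 661590 | 2021-01-11 | B | 0 | 0 |
| 3 | 853541 | 2021-01-08 | B | 0 | 0 |
| 6 | 679687 | 2021-01-19 | B | 1 | 10000 |
| 8 | 817355 | 2021-01-04 | B | 1 | 60000 |
| 9 | 839785 | 2021-01-15 | B | 1 | 10000 |
| ... | ... | ... | ... | ... | ... |
| 294471 | 718310 | 2021-01-21 | A | 0 | 0 |
| 294473 | 751197 | 2021-01-03 | A | 0 | 0 |
| 294474 | 945152 | 2021-01-12 | A | 0 | 0 |
| 294475 | 734608 | 2021-01-22 | A | 0 | 0 |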


In [7]:
new_rows = data_cld.shape[0]
old_rows = data.shape[0]

print(f"Old data rows number is {old_rows} \n" 
      f"New data rows number is {new_rows} \n"
      f"Are they equal? - {old_rows == new_rows}") 

Old data rows number is 294478 
New data rows number is 290688 
Are they equal? - False
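
The row count drops by 3790, which is consistent with removing users who appear in both groups. Whether `data_intersect_cleaner` does exactly this is an assumption; the sketch below only illustrates that idea with plain pandas. `drop_shared_users` is a hypothetical helper and the user IDs are made up.

```python
import pandas as pd

def drop_shared_users(df_a, df_b, key):
    """Hypothetical helper: drop rows whose `key` occurs in both frames, then stack the rest."""
    shared = list(set(df_a[key]) & set(df_b[key]))
    kept_a = df_a[~df_a[key].isin(shared)]
    kept_b = df_b[~df_b[key].isin(shared)]
    return pd.concat([kept_a, kept_b])

# Made-up user IDs for illustration; user 3 appears in both groups.
toy_a = pd.DataFrame({"user_id": [1, 2, 3], "group": "A"})
toy_b = pd.DataFrame({"user_id": [3, 4], "group": "B"})

cleaned = drop_shared_users(toy_a, toy_b, "user_id")
```

In this toy case the shared user is removed from both sides, so three of the five rows remain.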


---

## **Function `pivot_and_conversion`**

### Pivot without renaming

In [8]:
grouped = pivot_and_conversion(data_cld, 
                               ['date', 'group'], 
                               count='user_id', 
                               converted='purchase')

grouped.head(10)

|  | date | group | user_id | purchase | conversion |
| --- | --- | --- | --- | --- | --- |
| 0 | 2021-01-02 | A | 2854 | 359 | 12.578837 |
| 1 | 2021-01-02 | B | 2861 | 345 | 12.058721 |
| 2 | 2021-01-03 | A | 6596 | 748 | 11.340206 |
| 3 | 2021-01-03 | B | 6619 | 754 | 11.391449 |
| 4 | 2021-01-04 | A | 6566 | 799 | 12.168748 |
| 5 | 2021-01-04 | B | 6536 | 764 | 11.689106 |
| 6 | 2021-01-05 | A | 6433 | 790 | 12.280429 |
| 7 | 2021-01-05 | B | 6511 | 749 | 11.503609 |
| 8 | 2021-01-06 | A | 6606 | 764 | 11.565244 |
| 9 | 2021-01-06 | B | 6751 | 838 | 12.412976 |
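
The `conversion` column in this output matches purchases divided by users, in percent (for example, 359 / 2854 × 100 ≈ 12.578837). A minimal sketch of the same aggregation in plain pandas follows; it is an illustration, not the library's code. `conversion_table` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def conversion_table(df, by, count, converted):
    """Hypothetical helper: count rows and sum conversions per group, then add a percent column."""
    table = (
        df.groupby(by)
          .agg(**{count: (count, "count"), converted: (converted, "sum")})
          .reset_index()
    )
    table["conversion"] = table[converted] / table[count] * 100
    return table

# Made-up rows for illustration.
toy = pd.DataFrame({
    "date": ["2021-01-02"] * 5,
    "group": ["A", "A", "A", "B", "B"],
    "user_id": [1, 2, 3, 4, 5],
    "purchase": [1, 0, 0, 1, 1],
})

result = conversion_table(toy, ["date", "group"], count="user_id", converted="purchase")
```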


### Pivot with renaming

In [9]:
renamed = pivot_and_conversion(data_cld, 
                               ['date', 'group'], 
                               count='user_id', 
                               converted='purchase', 
                               rename=['count_visits', 'count_purchases'])

renamed.head(10)

|  | date | group | count_visits | count_purchases | conversion |
| --- | --- | --- | --- | --- | --- |
| 0 | 2021-01-02 | A | 2854 | 359 | 12.578837 |
| 1 | 2021-01-02 | B | 2861 | 345 | 12.058721 |
| 2 | 2021-01-03 | A | 6596 | 748 | 11.340206 |
| 3 | 2021-01-03 | B | 6619 | 754 | 11.391449 |
| 4 | 2021-01-04 | A | 6566 | 799 | 12.168748 |
| 5 | 2021-01-04 | B | 6536 | 764 | 11.689106 |
| 6 | 2021-01-05 | A | 6433 | 790 | 12.280429 |
| 7 | 2021-01-05 | B | 6511 | 749 | 11.503609 |
| 8 | 2021-01-06 | A | 6606 | 764 | 11.565244 |
| 9 | 2021-01-06 | B | 6751 | 838 | 12.412976 |


### Invalid renaming

In [10]:
pivot_and_conversion(data_cld, 
                     ['date', 'group'], 
                     count='user_id', 
                     converted='purchase', 
                     rename='count_visits')

In 'rename' list must be 2 names: first is for 'count', second is for 'converted'


---

## **Function `cumulative_metrics`**

### Cumulative data

In [11]:
cumulative = cumulative_metrics(renamed, 
                                cols=['count_visits', 'count_purchases'], 
                                groupby='group')

cumulative.head(10)

|  | date | group | count_visits | count_purchases | conversion | cum_count_visits | cum_count_purchases |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-01-02 | A | 2854 | 359 | 12.578837 | 2854 | 359 |
| 1 | 2021-01-02 | B | 2861 | 345 | 12.058721 | 5715 | 704 |
| 2 | 2021-01-03 | A | 6596 | 748 | 11.340206 | 12311 | 1452 |
| 3 | 2021-01-03 | B | 6619 | 754 | 11.391449 | 18930 | 2206 |
| 4 | 2021-01-04 | A | 6566 | 799 | 12.168748 | 25496 | 3005 |
| 5 | 2021-01-04 | B | 6536 | 764 | 11.689106 | 32032 | 3769 |
| 6 | 2021-01-05 | A | 6433 | 790 | 12.280429 | 38465 | 4559 |
| 7 | 2021-01-05 | B | 6511 | 749 | 11.503609 | 44976 | 5308 |
| 8 | 2021-01-06 | A | 6606 | 764 | 11.565244 | 51582 | 6072 |
| 9 | 2021-01-06 | B | 6751 | 838 | 12.412976 | 58333 | 6910 |
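
In this output the running totals span both groups (row 1 shows 5715 = 2854 + 2861), which may or may not be what `groupby='group'` is meant to produce. If per-group totals are wanted, one possible approach in plain pandas is a cumulative sum inside a `groupby`. The sketch below reuses the first four pivot rows; the `cum_conversion` column is an addition for illustration.

```python
import pandas as pd

# First four rows of the renamed pivot.
daily = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-02", "2021-01-03", "2021-01-03"],
    "group": ["A", "B", "A", "B"],
    "count_visits": [2854, 2861, 6596, 6619],
    "count_purchases": [359, 345, 748, 754],
})

cols = ["count_visits", "count_purchases"]
# Running totals restart for each group because cumsum runs inside the groupby.
per_group = daily.groupby("group")[cols].cumsum().add_prefix("cum_")
daily = pd.concat([daily, per_group], axis=1)
# Cumulative conversion in percent, computed from the per-group totals.
daily["cum_conversion"] = daily["cum_count_purchases"] / daily["cum_count_visits"] * 100
```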


---

## **Functions `z_proportions_advanced` and `hypothesis_decision`**

### Test and decision

In [12]:
H0 = "Conversions in A and B groups are equal."
H1 = "Conversion in A group is larger than conversion in B group."

pvalue = z_proportions_advanced(data_cld, 'group', 'purchase', 'larger')

hypothesis_decision(pvalue, 0.05, H0, H1)

0.12 > 0.05 
We accept null hypothesis. Conversions in A and B groups are equal.
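
For reference, a one-sided two-proportion z-test can be sketched with the standard library alone. This is a textbook pooled-variance version and is not confirmed to match what `z_proportions_advanced` computes. `z_test_larger` is a hypothetical name, and the counts are the 2021-01-02 rows of the pivot table, so the result covers one day rather than the full sample.

```python
from math import sqrt
from statistics import NormalDist

def z_test_larger(success_a, n_a, success_b, n_b):
    """One-sided two-proportion z-test; H1: the proportion in A is larger than in B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    return 1 - NormalDist().cdf(z)

# Counts for 2021-01-02 from the pivot table (one day only, not the full test).
p_value = z_test_larger(359, 2854, 345, 2861)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
```

Strictly speaking, a p-value above the significance level means the test fails to reject the null hypothesis; it does not prove that the conversions are equal.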


-----

## **Functions `conf_interval` and `conf_message`**

### Confidence interval and message

In [13]:
ci = conf_interval(data_cld['price'], rnd=2)

conf_message(ci, 'price')

Confident interval for 'price' is (10023.27, 10278.84)
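
A confidence interval for a mean can be sketched with the standard library as follows. This version uses the normal approximation, which suits large samples such as this dataset; whether `conf_interval` uses the normal or the t distribution is not shown here, so treat the choice as an assumption. `mean_conf_interval`, its `gamma` parameter, and the prices are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_conf_interval(values, gamma=0.95, rnd=2):
    """Normal-approximation interval for the mean; reasonable for large samples."""
    n = len(values)
    z_crit = NormalDist().inv_cdf(1 - (1 - gamma) / 2)
    margin = z_crit * stdev(values) / sqrt(n)
    center = mean(values)
    return round(center - margin, rnd), round(center + margin, rnd)

# Made-up prices for illustration only; a real sample should be much larger.
prices = [0, 0, 10000, 60000, 10000, 0, 150000, 0, 0, 10000]
low, high = mean_conf_interval(prices)
```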


### Invalid data

In [14]:
conf_interval(data_cld, rnd=2)

The 'data' must be a single column, not a 2D table.


---

## **Function `proportions_conf_interval`**

### Interval for proportions

In [15]:
conf_interval_a = proportions_conf_interval(data_a, 
                                            n='user_id', 
                                            xp='purchase', 
                                            rnd=2)
conf_interval_b = proportions_conf_interval(data_b, 
                                            n='user_id', 
                                            xp='purchase', 
                                            rnd=2)

conf_message(conf_interval_a, 'conversion', 'A')
conf_message(conf_interval_b, 'conversion', 'B')

Confident interval for 'conversion' of A group is (11.73, 12.06)
Confident interval for 'conversion' of B group is (11.87, 12.21)
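
As an illustration of how such an interval can be computed, here is a sketch of the Wald interval for a proportion, returned in percent. It is an assumption that `proportions_conf_interval` uses this formula. `proportion_conf_interval` and `gamma` are hypothetical names, and the counts are the group A row for 2021-01-02 from the pivot table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_conf_interval(successes, n, gamma=0.95, rnd=2):
    """Wald interval for a proportion, returned in percent."""
    p = successes / n
    z_crit = NormalDist().inv_cdf(1 - (1 - gamma) / 2)
    margin = z_crit * sqrt(p * (1 - p) / n)
    return round((p - margin) * 100, rnd), round((p + margin) * 100, rnd)

# Group A on 2021-01-02 from the pivot table: 359 purchases among 2854 users.
# One day of data gives a much wider interval than the full-sample ones.
interval = proportion_conf_interval(359, 2854)
```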


---

## **Functions `diff_prop_conf_interval` and `diff_decision`**

### The interval and decision

In [16]:
diff_interval = diff_prop_conf_interval(data_a, 
                                        data_b, 
                                        n='user_id', 
                                        xp='purchase', 
                                        rnd=2)

diff_decision(diff_interval)

Difference of confident intervals is: (-0.38, 0.09)
Both variants are equivalent.
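
A sketch of an interval for the difference of two proportions, with a decision rule based on whether the interval contains zero. The difference is taken as first minus second, which is consistent with the printed interval, but the formula and the rule are assumptions about what `diff_prop_conf_interval` and `diff_decision` do. All names here are hypothetical; the counts are the 2021-01-02 rows of the pivot table.

```python
from math import sqrt
from statistics import NormalDist

def diff_conf_interval(x_1, n_1, x_2, n_2, gamma=0.95, rnd=2):
    """Interval for p_1 - p_2 in percent, using unpooled standard errors."""
    p_1, p_2 = x_1 / n_1, x_2 / n_2
    z_crit = NormalDist().inv_cdf(1 - (1 - gamma) / 2)
    margin = z_crit * sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    diff = p_1 - p_2
    return round((diff - margin) * 100, rnd), round((diff + margin) * 100, rnd)

def describe(interval):
    """An interval that contains zero gives no evidence that either variant is better."""
    low, high = interval
    if low > 0:
        return "first variant is better"
    if high < 0:
        return "second variant is better"
    return "no significant difference"

# 2021-01-02 counts from the pivot table: A = 359/2854, B = 345/2861.
day_interval = diff_conf_interval(359, 2854, 345, 2861)
verdict = describe(day_interval)
```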


-----

## **Function `multiple_decision`**

### The decision

In [18]:
from scipy.stats import shapiro  # explicit import; the star import above may already expose it

H0 = "Values in all groups have normal distribution"
H1 = "In one of the groups, or in both, value distribution is not normal"

_, pv_a = shapiro(data_a['price'])
_, pv_b = shapiro(data_b['price'])

multiple_decision(pv_a, pv_b, 0.05, H0, H1)

We reject null hypothesis. In one of the groups, or in both, value distribution is not normal
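
The decision over several p-values can be sketched as "keep H0 only if no group rejects it". This is an assumption about the rule behind `multiple_decision`, and `decide_all_normal` is a hypothetical name. Note also that SciPy documents the Shapiro-Wilk p-value as potentially inaccurate for samples larger than 5000, and each group here is far larger than that.

```python
def decide_all_normal(p_values, alpha=0.05):
    """Return True when H0 is kept for every group, False if any group rejects it."""
    return all(p >= alpha for p in p_values)

# Made-up p-values for illustration; the first group fails the normality test.
keep_h0 = decide_all_normal([0.001, 0.40])
message = "We fail to reject H0." if keep_h0 else "We reject H0."
```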
