In [62]:
import math
from scipy import stats
import pandas as pd
from datetime import datetime

Let's load the data and take a look at our groups.

In [5]:
df = pd.read_csv('/Users/anastasiapiterskaa/Downloads/logs_exp_us.csv', sep='\t')

# rename the columns and convert Unix timestamps (seconds) to datetimes
df.columns = ['event_name', 'user_id', 'timestamp', 'experiment_id']
df['timestamp'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x))

df.head()

Unnamed: 0,event_name,user_id,timestamp,experiment_id
0,MainScreenAppear,4575588528974610257,2019-07-25 07:43:36,246
1,MainScreenAppear,7416695313311560658,2019-07-25 14:11:42,246
2,PaymentScreenSuccessful,3518123091307005509,2019-07-25 14:28:47,248
3,CartScreenAppear,3518123091307005509,2019-07-25 14:28:47,248
4,PaymentScreenSuccessful,6217807653094995999,2019-07-25 14:48:42,248


To calculate proportions, we need to know the number of users in each group. We will call groups 246 and 247 test_1 and test_2, and group 248 control.

In [8]:
test_1=df[df.experiment_id==246]['user_id'].unique()

In [9]:
test_1

array([4575588528974610257, 7416695313311560658, 8351860793733343758, ...,
       4876403292056911122, 6660805781687343085, 7823752606740475984])

Number of users in the first group

In [45]:
len(test_1)

2489

In [10]:
test_2=df[df.experiment_id==247]['user_id'].unique()
control=df[df.experiment_id==248]['user_id'].unique()

Number of users in the second group

In [46]:
len(test_2)

2520

Now we have arrays with the unique users of each group, and we can use them to calculate our proportions.
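As an alternative to filtering the log once per group, the group sizes can be computed in a single pass. The snippet below is a minimal sketch on a tiny made-up log (the `toy` frame and its values are hypothetical; only the column names match `df`):

```python
import pandas as pd

# Toy event log (hypothetical data, same column names as df)
toy = pd.DataFrame({
    'event_name': ['MainScreenAppear', 'CartScreenAppear', 'MainScreenAppear',
                   'MainScreenAppear', 'MainScreenAppear', 'CartScreenAppear'],
    'user_id': [1, 1, 2, 3, 4, 4],
    'experiment_id': [246, 246, 246, 247, 248, 248],
})

# Number of unique users in every group, computed in one pass
group_sizes = toy.groupby('experiment_id')['user_id'].nunique()
print(group_sizes)

# Size of a single group, looked up by its experiment_id
print(group_sizes.loc[246])
```

On the real data, the same `groupby` call on `df` should give the sizes of all three groups at once.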

Suppose that in the middle of the project you get a pivot table like the one below and need to compare the 'conversion' of two test groups for different events. In other words, you need to check whether the two test groups differ significantly for each event.

In [18]:
pivot = df.pivot_table(index='event_name', values='user_id', columns='experiment_id', aggfunc='nunique').reset_index()
pivot

experiment_id,event_name,246,247,248
0,CartScreenAppear,1270,1240,1239
1,MainScreenAppear,2456,2482,2501
2,OffersScreenAppear,1545,1530,1538
3,PaymentScreenSuccessful,1202,1160,1185
4,Tutorial,279,286,282


What are we actually checking? Whether there is a statistically significant difference in conversion, i.e. in a proportion: "Is the share of users (out of all users in the group) who had CartScreenAppear in one test group statistically different from the share in the other?"

To do this, we need the z-test for two proportions. Let's revisit the example from the theory.
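For reference, the example from the theory computes the pooled two-proportion z statistic. Writing $s_i$ for the successes and $n_i$ for the trials of group $i$ (notation introduced here only for brevity):

$$
p_1 = \frac{s_1}{n_1}, \qquad
p_2 = \frac{s_2}{n_2}, \qquad
p = \frac{s_1 + s_2}{n_1 + n_2}, \qquad
z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

Under the null hypothesis $p_1 = p_2$, $z$ is approximately standard normal, so the two-sided p-value is $2\,\bigl(1 - \Phi(|z|)\bigr)$, where $\Phi$ is the standard normal CDF.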

In [29]:
from scipy import stats as st
import numpy as np
import math as mth

alpha = .05 # significance level

successes = np.array([78, 120])
trials = np.array([830, 909])

# success proportion in the first group:
p1 = successes[0]/trials[0]

# success proportion in the second group:
p2 = successes[1]/trials[1]

# success proportion in the combined dataset:
p_combined = (successes[0] + successes[1]) / (trials[0] + trials[1])

# the difference between the datasets' proportions
difference = p1 - p2
# calculating the statistic in standard deviations of the standard normal distribution
z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))

# setting up the standard normal distribution (mean 0, standard deviation 1)
distr = st.norm(0, 1)

p_value = (1 - distr.cdf(abs(z_value))) * 2

print('p-value: ', p_value)

if (p_value < alpha):
    print("Rejecting the null hypothesis: there is a significant difference between the proportions")
else:
    print("Failed to reject the null hypothesis: there is no reason to consider the proportions different")

p-value:  0.012621025223628068
Rejecting the null hypothesis: there is a significant difference between the proportions


In our case, the number of **successes** for each group is the number of unique users who performed the event, taken from the pivot table. For example, the number of successes for CartScreenAppear in group 246 is 1270.

We can access this value directly from our pivot table.

In [27]:
pivot[pivot.event_name=='CartScreenAppear'][246].iloc[0]

1270

Thus, knowing the event name and the group ID, we can read the number of successes for any event from our pivot table. Keep in mind that the number of trials for a group is the same for every event: it is the number of users in that group.

First, let's define the variables for the test.

*group1* - the ID of the first group (e.g. 246)

*group2* - the ID of the second group (e.g. 247)

In [53]:
def check_hypothesis(group1,group2, event, alpha=0.05):
    #let's start with successes, using indexing on our pivot table
    successes1=pivot[pivot.event_name==event][group1].iloc[0]
    successes2=pivot[pivot.event_name==event][group2].iloc[0]
    print( "Success for group", group1, "is",successes1, "for event", event,"\n",
          "Success for group", group2, "is",successes2, "for event", event,"\n")

In [54]:
check_hypothesis(246,247, 'CartScreenAppear', alpha=0.05)

Success for group 246 is 1270 for event CartScreenAppear 
 Success for group 247 is 1240 for event CartScreenAppear 



Now let's add trials. Remember, the trials are the total number of users in each group.

In [55]:
def check_hypothesis(group1,group2, event, alpha=0.05):
    # let's start with successes, using indexing on our pivot table
    successes1=pivot[pivot.event_name==event][group1].iloc[0]
    successes2=pivot[pivot.event_name==event][group2].iloc[0]
    
    # for trials we can go back to the original df or use pre-aggregated data
    trials1=df[df.experiment_id==group1]['user_id'].nunique()
    trials2=df[df.experiment_id==group2]['user_id'].nunique()
    
    print( "Success for group", group1, "is",successes1, "for event", event,"\n",
          "Success for group", group2, "is",successes2, "for event", event,"\n",
         "Trials for group", group1, "is",trials1,"\n",
         "Trials for group", group2, "is",trials2,"\n")

In [56]:
check_hypothesis(246,247, 'CartScreenAppear', alpha=0.05)

Success for group 246 is 1270 for event CartScreenAppear 
 Success for group 247 is 1240 for event CartScreenAppear 
 Trials for group 246 is 2489 
 Trials for group 247 is 2520 



Now let's add the rest of the function!

In [69]:
def check_hypothesis(group1, group2, event, alpha=0.05):
    # let's start with successes, using indexing on our pivot table
    successes1 = pivot[pivot.event_name == event][group1].iloc[0]
    successes2 = pivot[pivot.event_name == event][group2].iloc[0]

    # for trials we can go back to the original df or use pre-aggregated data
    trials1 = df[df.experiment_id == group1]['user_id'].nunique()
    trials2 = df[df.experiment_id == group2]['user_id'].nunique()

    # success proportion in the first group
    p1 = successes1 / trials1

    # success proportion in the second group
    p2 = successes2 / trials2

    # success proportion in the combined dataset
    p_combined = (successes1 + successes2) / (trials1 + trials2)

    # the difference between the groups' proportions
    difference = p1 - p2

    # the statistic in standard deviations of the standard normal distribution
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))

    # standard normal distribution (mean 0, standard deviation 1)
    distr = stats.norm(0, 1)

    # two-sided p-value
    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if p_value < alpha:
        print("Reject H0 for", event, 'and groups', group1, group2)
    else:
        print("Fail to Reject H0 for", event, 'and groups', group1, group2)

In [70]:
check_hypothesis(246,247, 'CartScreenAppear', alpha=0.05)

p-value:  0.19817540651836918
Fail to Reject H0 for CartScreenAppear and groups 246 247
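
If you want to reuse the numbers instead of only printing them, one option is to separate the calculation from the data lookup. The sketch below is not part of the original notebook: `two_proportion_ztest` is a hypothetical helper name, and it uses only the standard library (`statistics.NormalDist` in place of `scipy.stats.norm`) so that it runs on its own.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(successes1, trials1, successes2, trials2):
    """Return (z_value, p_value) for a two-sided pooled z-test of two proportions."""
    p1 = successes1 / trials1
    p2 = successes2 / trials2
    # pooled proportion, used for the standard error under H0
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    std_error = math.sqrt(p_combined * (1 - p_combined) * (1 / trials1 + 1 / trials2))
    z_value = (p1 - p2) / std_error
    # two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z_value)))
    return z_value, p_value


# CartScreenAppear, groups 246 and 247 (counts taken from the outputs above)
z_value, p_value = two_proportion_ztest(1270, 2489, 1240, 2520)
print(round(p_value, 4))  # 0.1982, in line with the p-value printed above
```

Returning the values makes it easy to store them, compare them against a different alpha, or apply the helper to counts that come from somewhere other than `pivot`.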


Now you can use this function to iterate over a list of events or groups. Good luck!

In [74]:
for event in pivot.event_name.unique():
    check_hypothesis(246, 247, event, alpha=0.05)

p-value:  0.19817540651836918
Fail to Reject H0 for CartScreenAppear and groups 246 247
p-value:  0.5856639791209877
Fail to Reject H0 for MainScreenAppear and groups 246 247
p-value:  0.3233225048227193
Fail to Reject H0 for OffersScreenAppear and groups 246 247
p-value:  0.10902546202332886
Fail to Reject H0 for PaymentScreenSuccessful and groups 246 247
p-value:  0.8756574258016974
Fail to Reject H0 for Tutorial and groups 246 247
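
The loop above prints one result per event. As a rough alternative, the same test can be run for all events at once and collected into a table. In this sketch the counts and group sizes are copied by hand from the outputs above, and the names `counts` and `results` are illustrative rather than part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Unique-user counts per event, copied from the pivot table above
counts = pd.DataFrame({
    'event_name': ['CartScreenAppear', 'MainScreenAppear', 'OffersScreenAppear',
                   'PaymentScreenSuccessful', 'Tutorial'],
    246: [1270, 2456, 1545, 1202, 279],
    247: [1240, 2482, 1530, 1160, 286],
})
trials1, trials2 = 2489, 2520  # users in groups 246 and 247
alpha = 0.05

successes1, successes2 = counts[246], counts[247]
p_combined = (successes1 + successes2) / (trials1 + trials2)

# z statistic for every event in one vectorized expression
z_values = (successes1 / trials1 - successes2 / trials2) / np.sqrt(
    p_combined * (1 - p_combined) * (1 / trials1 + 1 / trials2))

results = pd.DataFrame({
    'event_name': counts['event_name'],
    # sf(x) = 1 - cdf(x), so this is the two-sided p-value
    'p_value': 2 * stats.norm.sf(np.abs(z_values)),
})
results['reject_h0'] = results['p_value'] < alpha
print(results)
```

A table like `results` is easier to sort, filter, or export than printed lines, and the p-values should agree with the ones printed by the loop.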
