**T-test for difference in means**


These experiments compare checkout page designs to decide which one yields the highest order value and the shortest purchase decision time.

In [10]:
import pingouin
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

checkout = pd.read_csv(r'//Users//EJEGUS//Desktop//AB Testing DataCamp//checkout.csv')

In [12]:
checkout.head()

Unnamed: 0.1,Unnamed: 0,user_id,checkout_page,order_value,purchased,gender,browser,time_on_page
0,0,877621,A,29.410131,1,F,chrome,66.168628
1,1,876599,A,,0,M,firefox,49.801887
2,2,905407,A,27.446845,1,M,chrome,56.744856
3,3,883562,A,30.602233,1,M,safari,71.890718
4,4,840542,A,29.668895,1,F,safari,67.410696


In [57]:
# Calculate the mean order_value per variant, and run a t-test for difference in order_value between variants A and B.

print ('avg order value:', checkout.groupby('checkout_page')['order_value'].mean())

avg order value: checkout_page
A    24.956437
B    29.876202
C    34.917589
Name: order_value, dtype: float64


In [24]:
# H0: mean order_value of A = mean order_value of B
# HA: mean order_value of A != mean order_value of B
ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='A']['order_value'], 
                       y=checkout[checkout['checkout_page']=='B']['order_value'],
                       paired=False,
                       alternative="two-sided")
ttest

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-32.285094,3110.673039,two-sided,1.772895e-197,"[-5.22, -4.62]",0.901468,5.0700000000000004e+203,1.0


The t-test comparing order values between checkout page 'A' and checkout page 'B' gives the following results:

T: The t-statistic is -32.285094. The negative sign means the mean order value of 'A' is lower than that of 'B'.

dof: The degrees of freedom for the test are 3110.673039.

alternative: The alternative hypothesis specified for the test is two-sided.

p-val: The p-value is 1.772895e-197, far below any conventional significance level.

CI95%: The 95% confidence interval for the difference in means (A minus B) is [-5.22, -4.62], so the mean order value on 'B' is estimated to be 4.62 to 5.22 units higher than on 'A'.

cohen-d: The effect size (Cohen's d) is 0.901468, a large effect by the usual convention (d of 0.8 or more).

BF10: The Bayes factor is 5.07e+203, very strong evidence in favor of the alternative hypothesis.

power: The achieved power of the test is 1.0, meaning an effect of this size would almost certainly be detected with these sample sizes.

Overall, we reject the null hypothesis: order values differ significantly between checkout page 'A' and checkout page 'B', with 'B' yielding the higher mean. The extremely low p-value and the large effect size both support this conclusion.
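The non-integer degrees of freedom in the output above (3110.67) are characteristic of Welch's unequal-variance t-test, which pingouin appears to apply here, presumably because the two groups have different numbers of non-missing order values. The sketch below is a minimal standard-library illustration of that calculation, not pingouin's implementation. The helper name `welch_t` and the synthetic samples are assumptions made for illustration.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, dof) for Welch's unequal-variance t-test (hypothetical helper)."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 in the denominator)
    se2 = vx / nx + vy / ny            # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, dof


# Synthetic data (an assumption, not the checkout dataset):
# unequal group sizes and unequal spreads
random.seed(0)
a = [random.gauss(25, 5) for _ in range(200)]
b = [random.gauss(30, 7) for _ in range(150)]

t_stat, dof = welch_t(a, b)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

With equal group sizes and equal variances the Welch degrees of freedom reduce to n1 + n2 - 2, which matches the integer value of 5998 reported for the time_on_page comparisons.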



In [59]:
# Calculate the mean time_on_page per variant, run a t-test for its difference between variants A and B,
# and note the p-value and confidence interval of the difference. Will you reject the Null hypothesis?

print ('avg time on page:', checkout.groupby('checkout_page')['time_on_page'].mean() ) 

avg time on page: checkout_page
A    44.668527
B    42.723772
C    42.223772
Name: time_on_page, dtype: float64


In [28]:
# H0: mean time_on_page of A = mean time_on_page of B
# HA: mean time_on_page of A != mean time_on_page of B
ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='A']['time_on_page'], 
                       y=checkout[checkout['checkout_page']=='B']['time_on_page'],
                       paired=False,
                       alternative="two-sided")
ttest

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,7.026673,5998,two-sided,2.349604e-12,"[1.4, 2.49]",0.181428,1305000000.0,1.0


In [38]:
# Calculate the mean time_on_page per variant, run a t-test for its difference between variants A and C, and note the p-value 
# and confidence interval of the difference. Will you reject the Null hypothesis?

ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='A']['time_on_page'], 
                       y=checkout[checkout['checkout_page']=='C']['time_on_page'],
                       paired=False,
                       alternative="two-sided")
ttest

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,8.833244,5998,two-sided,1.316118e-18,"[1.9, 2.99]",0.228073,1811000000000000.0,1.0


All of the differences tested so far were statistically significant, and page design C has the highest mean order_value and the shortest mean time_on_page. But notice how we made multiple comparisons and analyzed multiple metrics using the same experiment data. Is there something we need to account for?
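One way to see why multiple comparisons matter is to compute the family-wise error rate, the probability of at least one false positive across several tests. The sketch below assumes the tests are independent, which is a simplification for comparisons that share the same data, and the function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for n_tests in (1, 3, 6):
    fwer = family_wise_error_rate(alpha, n_tests)
    print(f"{n_tests} test(s) at alpha={alpha}: FWER = {fwer:.4f}")
```

With three pairwise comparisons at a 0.05 threshold, the chance of at least one false positive under this assumption rises to roughly 14%.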

## Pairwise t-tests

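Manually running separate comparisons using individual t-tests can be a pain as the number of groups gets larger. Thankfully, the pingouin package's pairwise_tests() function can make things easier: it runs every pairwise comparison in one call and can adjust the p-values for multiple comparisons.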


In [50]:
# Perform pairwise t-tests on time_on_page, grouped by checkout_page, with Bonferroni-adjusted p-values
pairwise = pingouin.pairwise_tests(data=checkout,
                                   dv="time_on_page",
                                   between="checkout_page",
                                   padjust="bonf")

pairwise

Unnamed: 0,Contrast,A,B,Paired,Parametric,T,dof,alternative,p-unc,p-corr,p-adjust,BF10,hedges
0,checkout_page,A,B,False,True,7.026673,5998.0,two-sided,2.349604e-12,7.048812e-12,bonf,1305000000.0,0.181405
1,checkout_page,A,C,False,True,8.833244,5998.0,two-sided,1.316118e-18,3.948354e-18,bonf,1811000000000000.0,0.228045
2,checkout_page,B,C,False,True,1.995423,5998.0,two-sided,0.04604195,0.1381258,bonf,0.212,0.051515


Based on the pairwise comparisons between checkout pages A, B, and C, we can draw the following conclusions:

A vs. B Comparison:

The time spent on checkout page A is significantly different from the time spent on checkout page B (Bonferroni-corrected p < 0.001).
The effect size (Hedges' g) for this comparison is small (0.181), indicating a relatively small difference between the two groups.

A vs. C Comparison:

The time spent on checkout page A is significantly different from the time spent on checkout page C (Bonferroni-corrected p < 0.001).
The effect size (Hedges' g) for this comparison is also small (0.228), though slightly larger than for the A vs. B comparison.

B vs. C Comparison:

The time spent on checkout page B is not significantly different from the time spent on checkout page C once the Bonferroni correction is applied (p-corr = 0.138). The uncorrected p-value (p-unc = 0.046) falls just below 0.05, which shows how the correction can change the conclusion.
The effect size (Hedges' g) for this comparison is negligible (0.052).

Overall, the time spent on checkout page A differs significantly from the time spent on pages B and C, while B and C do not differ significantly from each other after correction. All three effect sizes are small.
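The p-corr column in the output above can be reproduced by hand: the Bonferroni adjustment multiplies each uncorrected p-value by the number of comparisons and caps the result at 1. The sketch below applies that rule to the p-unc values from the table; the helper name `bonferroni` is an assumption for illustration.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Uncorrected p-values (p-unc) copied from the pairwise output
p_unc = {"A vs B": 2.349604e-12, "A vs C": 1.316118e-18, "B vs C": 0.04604195}
p_corr = dict(zip(p_unc, bonferroni(list(p_unc.values()))))

for pair, p in p_corr.items():
    print(f"{pair}: p-corr = {p:.6g}, significant at 0.05: {p < 0.05}")
```

The B vs C comparison is the only one whose conclusion changes: 0.046 multiplied by 3 is about 0.138, which is above the 0.05 threshold.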

# Non-parametric statistical tests

A non-parametric statistical test is a type of hypothesis test that makes few or no assumptions about the distribution of the data being analyzed. Unlike parametric tests, which assume that the data follows a specific distribution (e.g., normal distribution), non-parametric tests typically work on the ranks of the observations and are used when the data does not meet the assumptions of parametric tests, for example when samples are small or heavily skewed.
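A rank-based statistic depends only on the ordering of the observations, not on their scale or shape. The sketch below computes a Mann-Whitney U statistic by counting pairs, under one common definition, and shows that it is unchanged by a log transform. The helper name and the sample values are hypothetical and are not taken from the checkout data.

```python
import math


def mann_whitney_u(x, y):
    """U for x: number of (x_i, y_j) pairs with x_i > y_j, counting ties as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical time-on-page values for two small groups
x = [41.2, 44.8, 39.5, 47.1, 43.0]
y = [40.1, 38.7, 42.5, 36.9, 39.0]

u_raw = mann_whitney_u(x, y)
# A monotone transform (here log) preserves the ordering, so U does not change
u_log = mann_whitney_u([math.log(v) for v in x], [math.log(v) for v in y])
print(u_raw, u_log)
```

A t-statistic computed on the same data would change under the log transform, because it depends on the means and variances rather than the ranks.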

In [98]:
# Calculate the mean and count of time on page by variant
# Pass a list rather than a set so the column order (mean, count) is deterministic
print(checkout.groupby('checkout_page')['time_on_page'].agg(['mean', 'count']))

                    mean  count
checkout_page                  
A              44.668527   3000
B              42.723772   3000
C              42.223772   3000


In [114]:
# Purposely take a small sample (n < 30) so the parametric test assumptions cannot be relied on:
# draw a random sample of size 25 from each variant
np.random.seed(40)

ToP_samp_B = checkout[checkout['checkout_page'] == 'B'].sample(25)['time_on_page']
ToP_samp_C = checkout[checkout['checkout_page'] == 'C'].sample(25)['time_on_page']

In [116]:
# Run a Mann-Whitney U test
mwu_test = pingouin.mwu(x=ToP_samp_B,
                        y=ToP_samp_C,
                        alternative='two-sided')

mwu_test

Unnamed: 0,U-val,alternative,p-val,RBC,CLES
MWU,416.0,two-sided,0.045663,0.3312,0.6656


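The p-value of 0.046 falls just below the 0.05 threshold, so the null hypothesis that the two samples are drawn from the same distribution is rejected only narrowly. With a result this borderline and only 25 observations per group, the decision to reject or fail to reject the null hypothesis may need to take other factors into account.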
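The two effect-size columns in the Mann-Whitney output can be related back to U-val with simple arithmetic. The sketch below assumes the common definitions CLES = U / (n1 * n2) and RBC = 2 * CLES - 1; the reported values are consistent with these, although sign conventions for RBC may differ between library versions.

```python
# Values taken from the Mann-Whitney output: U-val and the two sample sizes
u_val = 416.0
n_b, n_c = 25, 25

# Common-language effect size: share of (B, C) pairs in which the B value is larger
cles = u_val / (n_b * n_c)

# Rank-biserial correlation expressed in terms of CLES (assumed convention)
rbc = 2 * cles - 1

print(f"CLES = {cles:.4f}, RBC = {rbc:.4f}")
```

Read this way, a randomly chosen visitor from the page B sample spent longer on the page than a randomly chosen visitor from the page C sample in about two thirds of the possible pairs.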