# Practice: Statistical Significance

Let's say that we've collected data for a web-based experiment. In the experiment, we're testing the change in layout of a product information page to see if this affects the proportion of people who click on a button to go to the download page. This experiment has been designed to have a cookie-based diversion, and we record two things from each user: which page version they received, and whether or not they accessed the download page during the data recording period. (We aren't keeping track of any other factors in this example, such as number of pageviews, or time between accessing the page and making the download, that might be of further interest.)

Your objective in this notebook is to perform a statistical test on both recorded metrics to see if there is a statistically significant difference between the two groups.

In [1]:
# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import rpy2.robjects as robjects
from IPython.display import Latex, display
from statsmodels.stats.proportion import proportions_ztest

%matplotlib inline

In [2]:
# import data
data = pd.read_csv("data/statistical_significance_data.csv")
data.head(10)

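| Unnamed: 0 | condition | click |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
| 5 | 1 | 0 |
| 6 | 0 | 0 |
| 7 | 1 | 1 |
| 8 | 0 | 0 |
| 9 | 1 | 0 |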


In the dataset, the 'condition' column takes a value of 0 for the control group and 1 for the experimental group. The 'click' column takes a value of 0 for no click and 1 for a click.

## Checking the Invariant Metric

First of all, we should check that the number of visitors assigned to each group is similar. It's important to check the invariant metrics as a prerequisite so that our inferences on the evaluation metrics are founded on solid ground. If we find that the two groups are imbalanced on the invariant metric, then this will require us to look carefully at how the visitors were split so that any sources of bias are accounted for. It's possible that a statistically significant difference in an invariant metric will require us to revise random assignment procedures and re-do data collection.

In this case, we want to do a two-sided hypothesis test on the proportion of visitors assigned to one of our conditions. Choosing the control or the experimental condition doesn't matter: you'll get the same result either way. Feel free to use whatever method you'd like: we'll highlight two main avenues below.

If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned to each group for the number of total observations, assuming that we have an expected 50/50 split. Do this many times (200 000 repetitions should provide a good speed-variability balance in this case) and then see in how many simulated cases the deviation from 50/50 is as extreme as, or more extreme than, the one we actually observed. Don't forget that, since we have a two-sided test, an extreme case also includes values on the opposite side of 50/50. (For example, since simulated outcomes of 0.48 and lower count as at least as extreme as an actual observation of 0.48, so do simulated outcomes of 0.52 and higher.) The proportion of flagged simulation outcomes gives us a p-value with which to assess our observed proportion. We hope to see a large p-value, meaning there is insufficient evidence to reject the null hypothesis.
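The two-sided counting rule described above can be sketched in a few lines. This is a minimal illustration, not the notebook's solution: the function name `simulate_two_sided_pvalue` and the counts (1000 visitors, 480 in one group, matching the 0.48 example) are hypothetical, and the trial count is reduced so that it runs quickly with only the standard library. The bit-counting trick for drawing binomial values is valid only because the null proportion is exactly 0.5.

```python
import random


def simulate_two_sided_pvalue(n_obs, n_observed, n_trials=20_000, seed=0):
    """Estimate a two-sided p-value for a 50/50 split by simulation."""
    rng = random.Random(seed)
    expected = n_obs / 2
    observed_deviation = abs(n_observed - expected)

    extreme = 0
    for _ in range(n_trials):
        # Counting the 1-bits in n_obs random bits is one Binomial(n_obs, 0.5) draw.
        simulated = bin(rng.getrandbits(n_obs)).count("1")
        # Flag deviations on either side of 50/50 that are at least as large
        # as the observed one.
        if abs(simulated - expected) >= observed_deviation:
            extreme += 1
    return extreme / n_trials


# Hypothetical counts: 480 of 1000 visitors in one group (a 0.48 split).
p_value = simulate_two_sided_pvalue(n_obs=1000, n_observed=480)
print(p_value)
```

Comparing absolute deviations from the expected count handles both tails at once, so there is no need to flag the "0.48 and lower" and "0.52 and higher" cases separately.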

If you want to take an analytic approach, you could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. Recall that this is possible thanks to our large sample size and the central limit theorem. To get a more precise p-value, you should also perform a continuity correction, shifting the observed group count by 0.5 toward the expected value before computing the area underneath the curve. (For example, if we had 415 / 850 assigned to the control group, then the normal approximation would take the area to the left of $(415 + 0.5) / 850 = 0.489$ and to the right of $(435 - 0.5) / 850 = 0.511$.)
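As a rough check on the 415 / 850 example, the sketch below computes the continuity-corrected normal approximation and compares it with the exact binomial p-value. It uses only the standard library (`statistics.NormalDist` and `math.comb`) rather than SciPy, and the variable names are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

n, k, p_null = 850, 415, 0.5  # counts from the example above

# Normal approximation to the sampling distribution of the proportion under H0.
se = sqrt(p_null * (1 - p_null) / n)
null_dist = NormalDist(mu=p_null, sigma=se)

# Continuity correction: shift each boundary by 0.5 counts toward the center.
lower = (k + 0.5) / n        # about 0.489
upper = (n - k - 0.5) / n    # about 0.511
p_normal = null_dist.cdf(lower) + (1 - null_dist.cdf(upper))

# Exact binomial p-value; doubling one tail is valid here because the
# Binomial(n, 0.5) distribution is symmetric.
p_exact = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n

print(round(p_normal, 4), round(p_exact, 4))
```

With the correction applied, the two p-values should agree closely, which is the reason for using it when the counts are discrete.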

You can check your results by completing the following workspace and comparing against the solution on the following page. You could also try using multiple approaches and seeing if they come up with similar outcomes!

- $H_{0}: \text{The two groups are balanced.}$
- $H_{\alpha}: \text{The two groups are imbalanced.}$

- $\alpha = 0.05$

### Analytic Approach

Conditions for inference with a one-sample z-test for a proportion:

- **Random**: The data needs to come from a random sample or randomized experiment.
- **Normal**: The sampling distribution of $\hat p$ needs to be approximately normal — needs at least $10$ expected successes and $10$ expected failures.
- **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.
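The normal condition above can be checked directly before running the test. A minimal sketch, assuming the total of 999 visitors and the null proportion of 0.5 used in this notebook (the variable names are illustrative):

```python
n_obs, p_null = 999, 0.5  # total visitors and null proportion used in this notebook

# Expected counts under the null hypothesis.
expected_successes = n_obs * p_null
expected_failures = n_obs * (1 - p_null)

# Rule of thumb from the list above: at least 10 of each.
normal_condition_met = min(expected_successes, expected_failures) >= 10
print(expected_successes, expected_failures, normal_condition_met)
```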

In [3]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control, n_treatment = data.groupby('condition').size()  # similar to data.groupby('condition').count()

In [4]:
# One-proportion z test (manually)
k, n, p = n_control, n_obs, 0.5  # 491, 999, 0.5
p_hat = k / n  # 0.4914914914914915
# SE = stats.bernoulli.std(p) / np.sqrt(n)  # 0.015819299929208316
SE = np.sqrt(p * (1 - p) / n)  # 0.015819299929208316

zstat = (p_hat - p) / SE  # -0.5378561975930812
if zstat < 0:
    pval = stats.norm.cdf(x=zstat) * 2  # 0.5906763307135386
    # OR
    # pval = stats.norm(loc=p, scale=SE).cdf(x=p_hat) * 2  # 0.5906763307135386
else:
    pval = stats.norm.sf(x=zstat) * 2

zstat, pval

(-0.5378561975930812, 0.5906763307135386)

In [5]:
# One-proportion z-test
k, n, p = n_control, n_obs, 0.5

zstat, pval = proportions_ztest(
    count=k, nobs=n, value=p, alternative="two-sided", prop_var=p
)
zstat, pval

(-0.5378561975930812, 0.5906763307135386)

In [6]:
# R
# Define an R code snippet to run
r_code = f'''
prop.test(
    x=c({k}),
    n=c({n}),
    p=.5,
    alternative="two.sided",
    # Note: by default, prop.test() applies the Yates continuity correction,
    # which matters most when the expected number of successes or failures is < 5.
    # Pass correct = FALSE to disable it (the default is TRUE); this makes the test
    # mathematically equivalent to the uncorrected z-test of a proportion.
    correct=FALSE
)
'''

# Run the R code
print(robjects.r(r_code))


	1-sample proportions test without continuity correction

data:  c(491) out of c(999), null probability 0.5
X-squared = 0.28929, df = 1, p-value = 0.5907
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4605827 0.5224654
sample estimates:
        p 
0.4914915 




Conditions for a goodness-of-fit test:
- **Random**: The data came from a random sample from the population of interest, or a randomized experiment.
- **Large counts**: All expected counts are at least $5$. There are no conditions attached to the observed counts.
- **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.

In [7]:
# Chi-square goodness-of-fit test (manually)
p = np.array([0.5, 0.5])
total = n_obs  # 999
k = len(p) - 1  # 1
expected = p * total  # [499.5, 499.5]
observed = np.array([n_control, n_treatment])  # [491, 508]

# calculate chi-square statistic manually
chisq = np.sum((observed - expected) ** 2 / expected)

# calculate p-value by standard chi-square distribution
pval = stats.chi2.sf(x=chisq, df=k)

chisq, pval

(0.28928928928928926, 0.5906763307135376)

In [8]:
# Chi-square goodness-of-fit test
p = np.array([0.5, 0.5])
total = n_obs
expected = p * total
observed = np.array([n_control, n_treatment])
chisq, pval = stats.chisquare(f_obs=observed, f_exp=expected)

chisq, pval

(0.28928928928928926, 0.5906763307135376)
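The z-test and the chi-square goodness-of-fit test report the same p-value because, with two categories, the chi-square statistic is the square of the z statistic. The sketch below checks this with the group sizes from the cells above, using only the standard library; the variable names are illustrative.

```python
from math import isclose, sqrt
from statistics import NormalDist

n_control, n_treatment = 491, 508  # group sizes from the cells above
n_obs = n_control + n_treatment
p_null = 0.5

# One-proportion z statistic for the control group's share of visitors.
z = (n_control / n_obs - p_null) / sqrt(p_null * (1 - p_null) / n_obs)

# Chi-square goodness-of-fit statistic against a 50/50 split.
expected = n_obs * p_null
chisq = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected

# With 1 degree of freedom, P(chi2 > x) equals P(|Z| > sqrt(x)).
p_from_z = 2 * NormalDist().cdf(-abs(z))
p_from_chisq = 2 * NormalDist().cdf(-sqrt(chisq))

print(isclose(z**2, chisq), isclose(p_from_z, p_from_chisq))
```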

### Simulation Approach (Bootstrap)

In [9]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control, n_treatment = data.groupby("condition").size()

In [10]:
# simulate outcomes under null, compare to observed outcome
p = 0.5
n_trials = 200_000

rvs = stats.binom.rvs(n=n_obs, p=p, size=n_trials)

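# tail proportions of the simulated null (binomial) distribution
normL = (rvs <= n_control).mean()  # lower tail: draws at or below the observed control count
# upper tail; note that `>` leaves out draws exactly equal to n_treatment,
# so use `>=` to also count outcomes exactly as extreme as the observed one
normR = (rvs > (n_treatment)).mean()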
print(normL + normR)

0.59094


## Checking the Evaluation Metric

After performing our checks on the invariant metric, we can move on to performing a hypothesis test on the evaluation metric: the click-through rate. In this case, we want to see that the experimental group has a significantly larger click-through rate than the control group, a one-tailed test.

The simulation approach for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.

There are a few analytic approaches possible here, but you'll probably make use of the normal approximation again in these cases. In addition to the pooled click-through rate, you'll need a pooled standard deviation in order to compute a z-score. While there is a continuity correction possible in this case as well, it's much more conservative than the p-value that a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.
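For reference, the pooled z-score described above can be written out as follows. The notation is an assumption made here for illustration: $n_1, n_2$ are the control and experimental group sizes, $k_1, k_2$ the corresponding click counts, and $\hat{p}_i = k_i / n_i$.

$$
\hat{p} = \frac{k_1 + k_2}{n_1 + n_2}, \qquad
SE = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \qquad
z = \frac{\hat{p}_2 - \hat{p}_1}{SE}
$$

For the one-tailed alternative that the experimental rate is larger, the p-value without continuity correction is the upper-tail area $1 - \Phi(z)$, where $\Phi$ is the standard normal CDF.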

As with the previous question, you'll find a quiz and solution following the workspace for you to check your results.

- $H_{0}: \text{P(Click-through rate | control group)} = \text{P(Click-through rate | treatment group)}$
- $H_{\alpha}: \text{P(Click-through rate | control group)} < \text{P(Click-through rate | treatment group)}$

- $\alpha = 0.05$

### Analytic Approach

> [AB testing calculator](https://abtestguide.com/calc/?ua=491&ub=508&ca=39&cb=57)

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20230212193535.png)

In [11]:
# Two-sample pooled z-test for proportion manually
n1, n2 = data.groupby("condition").size()  # 491, 508
k1, k2 = data.groupby("condition")["click"].sum()  # 39, 57
alpha = 0.05
CL = 1 - alpha

# estimated proportion
p_hat_1 = k1 / n1
p_hat_2 = k2 / n2
p_hat = (k1 + k2) / (n1 + n2)

# difference
d0 = 0
d = p_hat_2 - p_hat_1

# standard error
SE = np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

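# margin of error
# note: norm.ppf(CL) is the one-sided critical value (about 1.645 for CL = 0.95),
# so d - MOE is a one-sided 95% lower bound and (d - MOE, d + MOE) is a 90% two-sided interval;
# the pooled SE is used here, which differs slightly from the unpooled SE usually used for intervals
critical_value = stats.norm.ppf(CL)
MOE = critical_value * SE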

# confidence interval
ci_low, ci_upp = d - MOE, d + MOE
print("CI:", (ci_low, ci_upp))

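# z statistic & p-value
zstat = (d - d0) / SE  # pooled
# one-sided test (H_a: treatment rate > control rate), so always use the upper tail;
# a negative zstat should give a large p-value, not a small one
pval = stats.norm.sf(zstat)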
print("zstat, pval:", (zstat, pval))

CI: (0.002095268435703467, 0.06345470991476239)
zstat, pval: (1.7571887396196666, 0.039442821974613705)


In [12]:
# Two-sample pooled z-test for proportion by proportions_ztest
n1, n2 = data.groupby("condition").size()  # 491, 508
k1, k2 = data.groupby("condition")["click"].sum()  # 39, 57

# null value of the difference
d0 = 0  # assume no difference in click-through rate between the two groups
count = [k1, k2]
nobs = [n1, n2]

zstat, pval = proportions_ztest(
    count,
    nobs,
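    value=d0,  # null hypothesis: p_1 - p_2 = 0
    alternative="smaller",  # H_a: p_1 - p_2 < 0 (control rate below treatment rate), so zstat is negative here
    prop_var=False,  # use the pooled sample proportion for the variance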
)
zstat, pval

(-1.7571887396196666, 0.039442821974613705)
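The z statistic here has the opposite sign from the manual calculation only because of the direction in which the difference is taken (control minus treatment rather than treatment minus control). The one-sided p-value is the same either way, as long as the matching tail is used. A minimal standard-library sketch with the counts from the cells above (the variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 491, 508  # control, treatment group sizes (from the cells above)
k1, k2 = 39, 57    # control, treatment clicks (from the cells above)

# Pooled click-through rate and pooled standard error.
p_pool = (k1 + k2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# The same test statistic with the difference taken in each direction.
z_treatment_minus_control = (k2 / n2 - k1 / n1) / se
z_control_minus_treatment = (k1 / n1 - k2 / n2) / se

# Upper tail for the positive statistic, lower tail for the negative one.
p_upper = 1 - NormalDist().cdf(z_treatment_minus_control)
p_lower = NormalDist().cdf(z_control_minus_treatment)

print(z_treatment_minus_control, z_control_minus_treatment)
print(p_upper, p_lower)
```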

In [13]:
# R
# Define an R code snippet to run
r_code = f'''
prop.test(
    x=c({k1}, {k2}),
    n=c({n1}, {n2}),
    p=NULL,
    alternative="less",
    conf.level={CL},
    # Note: by default, prop.test() applies the Yates continuity correction,
    # which matters most when the expected number of successes or failures is < 5.
    # Pass correct = FALSE to disable it (the default is TRUE); this makes the test
    # mathematically equivalent to the uncorrected z-test of a proportion.
    correct=FALSE
)
'''

# Run the R code
print(robjects.r(r_code))


	2-sample test for equality of proportions without continuity correction

data:  c(39, 57) out of c(491, 508)
X-squared = 3.0877, df = 1, p-value = 0.03944
alternative hypothesis: less
95 percent confidence interval:
 -1.000000000 -0.002222566
sample estimates:
    prop 1     prop 2 
0.07942974 0.11220472 




### Simulation Approach (Bootstrap)

In [14]:
# Simulation approach
# get the group sizes and the pooled click-through rate
A, B = data.groupby("condition").size()
p_click = data["click"].mean()

# get difference of click-through rate
A_click_rate, B_click_rate = data.groupby("condition")["click"].mean()
d_click_rate = A_click_rate - B_click_rate

In [15]:
# simulate outcomes under null, compare to observed outcome
n_trials = 200_000

# under the null, both groups share the same (pooled) click probability
A_clicks = stats.binom.rvs(n=A, p=p_click, size=n_trials)
B_clicks = stats.binom.rvs(n=B, p=p_click, size=n_trials)
# simulated difference in click-through rate (control - treatment)
samples = A_clicks / A - B_clicks / B

# one-sided p-value: proportion of simulated differences below the observed difference
(samples < d_click_rate).mean()

0.03902