#### Metrics
Generally, a rate (e.g. click-through rate) is used when you want to measure the usability of the site, and a probability (e.g. click-through probability) when you want to measure the impact.



### Theory - https://rpubs.com/superseer/ab_testing
1. Binomial Distribution
For a binomial distribution with success probability $p$, the observed proportion of successes has mean $p$ and standard deviation $\sqrt{p(1-p)/N}$, where $N$ is the number of trials. A binomial distribution can be used when:

- The outcomes are of 2 types
- Each event is independent of the others
- Each event has an identical distribution (i.e. $p$ is the same for all)
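
As a quick sanity check of the standard deviation formula above, the sketch below compares the analytic value with a simulated one. The values `p = 0.1` and `N = 500`, the function names, and the number of repetitions are illustrative assumptions, not part of the original notes.

```python
import math
import random


def proportion_sd(p, n):
    """Analytic standard deviation of the observed proportion over n trials."""
    return math.sqrt(p * (1 - p) / n)


def simulated_proportion_sd(p, n, repeats=2000, seed=0):
    """Estimate the same quantity by repeating the n-trial experiment."""
    rng = random.Random(seed)
    # Each repetition runs n independent trials with identical probability p.
    proportions = [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repeats)
    ]
    mean = sum(proportions) / repeats
    variance = sum((x - mean) ** 2 for x in proportions) / (repeats - 1)
    return math.sqrt(variance)


# Hypothetical values: click probability 0.1, 500 users per experiment.
print(proportion_sd(0.1, 500))
print(simulated_proportion_sd(0.1, 500))
```

The two printed values should agree to within a few percent; the gap shrinks as `repeats` grows.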
2. Confidence Interval
A confidence interval is a range constructed so that, over many repetitions of the experiment, a fixed fraction of the intervals (for example 95%) contains the true value. For example, consider $\hat{p}$, the proportion of users that click, where $N$ is the number of users. Let us approximate the binomial distribution with a normal distribution (this requires $N\hat{p}>5$ and $N(1-\hat{p})>5$). The margin of error is given by
$$ m = z \cdot SE $$
$$ m = z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} $$
For a 95% confidence interval, z = 1.96.
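
The margin-of-error formula can be sketched in a few lines of Python. The function name and the sample data (100 clicks out of 1000 users) are hypothetical; the z-score is looked up with the standard library's `statistics.NormalDist` rather than hard-coded, which is a choice made here for illustration.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(clicks, n, confidence=0.95):
    """Normal-approximation confidence interval for a click probability."""
    p_hat = clicks / n
    # The normal approximation needs N*p_hat > 5 and N*(1 - p_hat) > 5.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    # Two-sided z-score, approximately 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 100 clicks out of 1000 users.
low, high = proportion_confidence_interval(100, 1000)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```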

3. Hypothesis Testing
The null hypothesis states that any observed difference between the control and experiment groups is due to chance. If $p_{cont}$ and $p_{exp}$ are the control and experiment probabilities, then according to the null hypothesis
$$ H_0: p_{exp} - p_{cont} = 0 $$

The alternative hypothesis is that
$$ H_1: p_{exp} - p_{cont} \neq 0 $$

4. Comparing two samples
To compare two samples, we calculate the pooled standard error. For example, suppose $X_{cont}$ is the number of users that click in the control group and $N_{cont}$ is the total number of users in the control group. Let $X_{exp}$ and $N_{exp}$ be the corresponding values for the experiment group, with $\hat{p}_{cont} = X_{cont}/N_{cont}$ and $\hat{p}_{exp} = X_{exp}/N_{exp}$. The pooled probability and standard error are given by
$$ \hat{p}_{pool} = \frac{X_{cont}+X_{exp}}{N_{cont}+N_{exp}} $$
$$ SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)} $$
$$ \hat{d} = \hat{p}_{exp}-\hat{p}_{cont} $$
$$ H_0: d = 0 \text{  where  } \hat{d} \sim N(0, SE_{pool}) $$
If $\hat{d} > 1.96 \cdot SE_{pool}$ or $\hat{d} < -1.96 \cdot SE_{pool}$, then we can reject the null hypothesis at the 95% confidence level and state that the difference is statistically significant.
5. Practical significance
Practical significance is the level of change that you would expect to see from a business standpoint for the change to be valuable. What is considered practically significant can vary by field. In medicine, one would expect a 5, 10 or 15% improvement for the result to be considered practically significant. At Google, for example, a 1-2% improvement in click-through probability is practically significant.

The statistical significance bar is often lower than the practical significance bar, so that if the outcome is practically significant, it is also statistically significant.

6. Size vs Power trade-off
One of the decisions when designing an experiment is how many data points are needed to obtain a statistically significant result. This depends on the statistical power you want. Power has an inverse trade-off with size: the smaller the change you want to detect, or the more confidence you want in the result, the larger the experiment you have to run.

As you increase the number of samples, the confidence interval narrows around the mean.

$$\alpha = P(\text{reject null | null true})$$
$$\beta = P(\text{fail to reject null | null false})$$
$1−β$ is referred to as the sensitivity of the experiment, or statistical power. People often choose high sensitivity, typically around 80%.

For a small sample, $\alpha$ is low and $\beta$ is high. For a large sample, $\alpha$ remains the same but $\beta$ goes down (i.e. sensitivity increases). A good online calculator for determining the number of samples is here. As you change one of the parameters, your sample size will change as well. For example:

- If you increase the baseline click-through probability (while it stays under 0.5), the standard error increases, so you need more samples
- If you increase the practical significance level, you need fewer samples, since larger changes are easier to detect
- If you increase the confidence level, you want to be more certain that you are correctly rejecting the null. At the same sensitivity, this requires more samples
- If you want to increase the sensitivity, you need to collect more samples
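
The four relationships above can be explored with a rough sample-size estimate. The sketch below uses one common normal-approximation formula for comparing two proportions; it is not necessarily the formula behind the online calculator mentioned above, and the function name, defaults, and example inputs are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(p_base, d_min, alpha=0.05, power=0.80):
    """Approximate samples per group needed to detect an absolute change d_min.

    p_base: baseline click-through probability
    d_min:  practical significance level (absolute difference)
    alpha:  1 - confidence level (two-sided test)
    power:  sensitivity, i.e. 1 - beta
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_base + d_min
    # Spread of the difference under the null (both groups at p_base) ...
    sd_null = math.sqrt(2 * p_base * (1 - p_base))
    # ... and under the alternative (experiment group at p_base + d_min).
    sd_alt = math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    return math.ceil(((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2)


# Hypothetical inputs: vary one parameter at a time against a reference case.
print("reference          ", samples_per_group(0.10, 0.02))
print("higher baseline    ", samples_per_group(0.20, 0.02))
print("larger d_min       ", samples_per_group(0.10, 0.05))
print("higher confidence  ", samples_per_group(0.10, 0.02, alpha=0.01))
print("higher sensitivity ", samples_per_group(0.10, 0.02, power=0.90))
```

Every row except "larger d_min" should report more samples than the reference case, matching the list above.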


In [2]:
import numpy as np
N_cont = 10072  # Control samples (pageviews)
N_exp = 9886  # Test samples (pageviews)
X_cont = 974  # Control clicks
X_exp = 1242  # Exp. clicks

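# Pooled click-through probability and pooled standard error
p_pool = (X_cont + X_exp)/(N_cont+N_exp)
se_pool = np.sqrt(p_pool*(1-p_pool)*(1/N_cont + 1/N_exp))

# Observed click-through probabilities and their difference
p_cont = X_cont/N_cont
p_exp = X_exp/N_exp
d = p_exp - p_cont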
d


0.028928474040117697

In [3]:
# Margin of error at 95% confidence (z = 1.96)
m = 1.96*se_pool
m

0.0087179739320380565

In [4]:
# Lower and upper bounds of the 95% confidence interval for d
cf_min = d-m
cf_max = d+m
d_min = 0.02 # Minimum practical significance value for difference
cf_min


0.020210500108079642

In the above example, the lower bound of the confidence interval (about 0.0202) is greater than both 0 and the practical significance level of 0.02. We therefore conclude that the change in click-through probability is both statistically and practically significant: it is highly probable that the true difference exceeds 0.02. Based on this, one would launch the new version.
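
The same calculation and decision rule can be wrapped in a reusable function. This is a minimal sketch: `ExperimentResult` and `evaluate_experiment` are hypothetical names, it uses only the standard library instead of NumPy, and it assumes that an improvement means a positive difference.

```python
import math
from typing import NamedTuple


class ExperimentResult(NamedTuple):
    d_hat: float   # observed difference p_exp - p_cont
    lower: float   # lower bound of the confidence interval
    upper: float   # upper bound of the confidence interval
    statistically_significant: bool
    practically_significant: bool


def evaluate_experiment(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """Pooled-standard-error comparison of two click-through probabilities."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    margin = z * se_pool
    lower, upper = d_hat - margin, d_hat + margin
    return ExperimentResult(
        d_hat, lower, upper,
        # Statistically significant when the interval excludes 0.
        statistically_significant=lower > 0 or upper < 0,
        # Practically significant (assuming positive is better) when the
        # whole interval lies above d_min.
        practically_significant=lower > d_min,
    )


# Same counts as the notebook cells above.
result = evaluate_experiment(974, 10072, 1242, 9886, d_min=0.02)
print(result)
```

With the counts used above, both flags come out `True`, which corresponds to the decision to launch.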



### 2. Metrics