# A/B Testing Problem Setup

## A/B Testing
- Scenario
- You run a Software as a Service (SaaS) startup
- You have a landing page where you get people to sign up
- Signup = enter email address and click signup button
- Not everyone who visits your site will signup
- Conversion rate = proportion of people who sign up

- Suppose your lead marketer has identified problems with your landing page (not responsive, slow load time, bad copy, etc.)
- They create a new, possibly better page
- You (as a data scientist) want to measure which page is better, using data and math
- Recall confidence interval concept
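- we know rate(page 1) = 1/10 vs rate(page 2) = 2/10 is weaker evidence than rate(page 1) = 10/100 vs rate(page 2) = 20/100, even though the rates are the same, because the larger samples give narrower confidence intervals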
- How do we quantify this?
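
As a rough illustration of why sample size matters, the sketch below computes approximate 95% confidence intervals for the four rates above. It assumes the normal-approximation (Wald) interval, which is crude for samples this small (the lower bound can even go negative); `wald_interval` is a hypothetical helper name, not something defined elsewhere in these notes.

```python
import numpy as np

def wald_interval(successes, trials, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation)."""
    p_hat = successes / trials
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width

for successes, trials in [(1, 10), (2, 10), (10, 100), (20, 100)]:
    low, high = wald_interval(successes, trials)
    print(f"{successes}/{trials}: [{low:.3f}, {high:.3f}]")
```

With ten times as many visitors, each interval is narrower by a factor of $\sqrt{10}$, so the same observed rates carry more evidence.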


- We quantify this with statistical significance testing


- Significance level = $\alpha$ (5%, 1% common)
- Is the difference in mean height between men and women statistically significant at significance level $\alpha$?

$$
\mu_{1} \stackrel{?}{=} \mu_{2}
$$

## Hypotheses
- Null hypothesis(no difference):
    - Example: no difference in height between men and women
    - Example: no difference in effect between drug and placebo
    - $H_{0}: \mu_{1} = \mu_{2}  $
- Alternative hypothesis(one-sided test):
    - $H_{1}: \mu_{1} > \mu_{2}$
    

- Alternative hypothesis (2-sided test):
    - Example: test if men are taller than women OR women are taller than men
    - Example: test if drug works better than placebo OR worse than placebo
    - $H_{1}: \mu_{1} \neq \mu_{2}$

- we will do the 2-sided test($\neq$)
- we will show quantitatively if the 2 groups are different

# A/B testing Recipe
- Continue our example (height of men vs women, drug vs placebo)
- As is typical in frequentist statistics, we will assume data is Gaussian-distributed
- We collect some data, 2 lists of heights, one for men, one for women
- $X_{1} = \{x_{11}, x_{12}, \ldots, x_{1N}\}$
- $X_{2} = \{x_{21}, x_{22}, \ldots, x_{2N}\}$

- We create a test statistic (called __t__)
$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$

- $s_{p}$ = pooled std dev (use the unbiased estimates for $s_{1}^2$ and $s_{2}^2$: divide by N-1, not N)
- N = size of each group

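- Recall the sample mean and std dev estimates below; $\hat{\sigma}$ divides by N, while the $s$ used in the t-test divides by N-1
$$
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_{i} \\
\hat{\sigma} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\hat{\mu})^2} \\
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\hat{\mu})^2}
$$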

- Recall the estimate of the mean: it is a sum of random variables, therefore also a random variable
- t is also a function of random variables, therefore also a random variable
- it can be shown that t is t-distributed
- we will see the t-distribution and other more exotic distributions a lot when studying statistical testing + Bayesian methods

## t-distribution
- looks like a Gaussian with fatter tails

![](https://study.com/cimages/videopreview/o84agbwoq4.jpg)
![](https://cn.bing.com/th?id=OIP.-3zt17eY9pfOiFcewnoiEQHaFj&pid=Api&rs=1)
![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Student_t_pdf.svg/1200px-Student_t_pdf.svg.png)
![](http://www.obg.cuhk.edu.hk/ResearchSupport/StatTools/Pics/t.png)

- PDF
![](https://www.thoughtco.com/thmb/gAsKXrZg1kekAQnZVxb4J94QklA=/768x0/filters:no_upscale():max_bytes(150000):strip_icc()/tdist-56b749523df78c0b135f5be6.jpg)

- we won't use it directly
- 1 parameter: $\nu$ = degrees of freedom (df)
- for our statistical test, $\nu$ = 2N - 2
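
To make "fatter tails" concrete, the short sketch below compares the probability of landing above 3 under a standard normal and under t-distributions with a few degrees of freedom. The cutoff 3 and the df values are arbitrary choices for illustration.

```python
from scipy import stats

# Upper-tail probability P(X > 3); sf is the survival function, 1 - cdf
normal_tail = stats.norm.sf(3)
print("normal:", normal_tail)

t_tails = {}
for nu in (2, 5, 30):
    t_tails[nu] = stats.t.sf(3, df=nu)
    print(f"t (df={nu}):", t_tails[nu])
```

The tail probability shrinks toward the normal value as the degrees of freedom grow, which is why the t-distribution matters most for small samples.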

## Test Statistic
- If mean(X1) = mean(X2) -> t = 0
    - falls in center of t-distribution
- If mean(X1) >> mean(X2) -> t = large and positive
    - falls in right tail
- If mean(X1) << mean(X2) -> t = large and negative
    - falls in left tail
    
- symmetry -> doesn't matter if we call men = 1, women = 2 or vice versa

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$


## Area under t-distribution
- Use scipy.stats.t.cdf
- If in left tail -> CDF close to 0
- If in right tail -> CDF close to 1
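- For a significance level $\alpha$ = 0.05 (2-sided), we need CDF < 0.025 or CDF > 0.975; the matching cutoff for t depends on the degrees of freedom, e.g. with 4 degrees of freedom, t < -2.776 or t > 2.776
- we call that a statistically significant difference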
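
The cutoff on t can be looked up with the inverse CDF, `scipy.stats.t.ppf`. The sketch below prints the 2-sided critical value for a few degrees of freedom (the df values are chosen only for illustration; 18 corresponds to N = 10 per group).

```python
from scipy import stats

alpha = 0.05
critical = {}
for nu in (4, 18, 98):
    # 2-sided test: put alpha / 2 in each tail
    critical[nu] = stats.t.ppf(1 - alpha / 2, df=nu)
    print(f"df={nu}: significant if |t| > {critical[nu]:.3f}")
```

With 4 degrees of freedom the cutoff is about 2.776; with 18 it drops to about 2.101, so larger samples need a less extreme t.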



# p-values

## A/B testing Cont
- Previously: I showed you an algorithm to determine whether the difference between 2 groups was statistically significant or not
- Algorithm is done, but we need some more terminology
- In stats, we are looking for one number < significance level($\alpha$)
- Should be small no matter which tail end we're on

## p-value
- What's this small number? it's called a p-value
- Much controversy
- People still discussing it today
- Our goal: get to the Bayesian way of doing things
- p-value definition
    - the probability of obtaining a result equal to or more extreme than what was actually observed, when the null hypothesis is true
    

- If: average height of men == average height of women (null hypothesis)
- Then: p-value is the probability of observing the difference we measured(or any larger difference)
- In other words: If t $\propto$ mean(X1) - mean(X2) is very large, p-value should be very small


- we want to keep using our significance level $\alpha$
- if p-value < $\alpha$:
    - difference is statistically significant
    - we reject the null hypothesis
- otherwise:
    - we cannot reject the null hypothesis
    - this does not mean the null hypothesis is true, it just means we can't reject it with the data we collected

- For our example, we want p-value < 5% = 0.05
- But the answer we got before (for statistical significance) was:
    - Area < 0.025 or Area > 0.975
- If we get Area > 0.975, take 1 - Area < 0.025
- Now we have a small number < 0.025 in either tail
- Multiply both sides by 2: 2 x (small number) < 0.05
- p-value = 2 x (small number)
- Multiply by 2 only for the 2-sided test
- One-sided test -> we are only checking if mean(X1) > mean(X2), so don't multiply by 2
- This means the one-sided test has more power than the two-sided test
    - it doesn't require the test statistic to be as extreme in order to be significant
- In general:
    - the more assumptions you make, the more power your test has
    - the opposite is also true: fewer assumptions, less power
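
The area-to-p-value steps above can be sketched as a small function. `p_values` is a hypothetical helper, and the inputs t = 2.0 with 18 degrees of freedom are made-up numbers chosen to show a case where the two tests disagree.

```python
from scipy import stats

def p_values(t, df):
    """Return (one-sided, two-sided) p-values for a t statistic.

    The one-sided value tests H1: mean(X1) > mean(X2).
    """
    one_sided = stats.t.sf(t, df=df)           # area to the right of t
    two_sided = 2 * stats.t.sf(abs(t), df=df)  # area beyond |t| in both tails
    return one_sided, two_sided

one_sided, two_sided = p_values(t=2.0, df=18)
print("one-sided p:", one_sided)
print("two-sided p:", two_sided)
```

Here the one-sided p-value falls below 0.05 while the two-sided one does not, which is the extra power of the one-sided test in action.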


## Testing Characteristics
- we already know if mean(X1) >> mean(X2) or vice versa, t will be larger in magnitude
- what about $s_{p}$ and N?
- $t \propto \sqrt{N}/s_{p}$
- bigger N -> bigger t (smaller p-value)
- bigger $s_{p}$ -> smaller t (bigger p-value) 

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$


- same situation as with confidence interval
- t grows slowly (square root) with N
- t decreases faster (inversely proportional) with $s_{p}$
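
A quick numeric check of these scaling rules, assuming a fixed difference in means of 0.5 (an arbitrary value) and using a hypothetical helper `t_stat` that evaluates the equal-group-size formula directly:

```python
import numpy as np

def t_stat(diff, s_p, N):
    """t statistic for two groups of size N with pooled std dev s_p."""
    return diff / (s_p * np.sqrt(2.0 / N))

# Quadrupling N only doubles t (square-root growth)
for N in (10, 40, 160):
    print(f"N={N:4d}, s_p=1.0 -> t={t_stat(0.5, 1.0, N):.3f}")

# Halving s_p doubles t (inverse proportionality)
for s_p in (1.0, 0.5, 0.25):
    print(f"N=  10, s_p={s_p} -> t={t_stat(0.5, s_p, 10):.3f}")
```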

## small sample sizes?
- t depends on N - if N is small, t is smaller, p-value is bigger
- since statistical significance is a function of N, it already takes sample size into account
- not correct to say small value of N makes a finding false

## Pooled standard deviation
- recall: N is the size of each group
- what if each group is of a different size? take a weighted combination

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{ \frac{1}{n_{1}}+\frac{1}{n_{2}}  }}\\
s_{p} = \sqrt{\frac{(n_{1}-1)s_{X_{1}}^2+ (n_{2}-1)s_{X_{2}}^2}{n_{1}+n_{2}-2} }
$$
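
A minimal sketch of the unequal-size formulas, checked against `scipy.stats.ttest_ind` (which assumes equal variances by default). The group sizes, means, and random seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(loc=1.0, scale=1.0, size=12)  # group 1, n1 = 12
x2 = rng.normal(loc=0.0, scale=1.0, size=20)  # group 2, n2 = 20
n1, n2 = len(x1), len(x2)

# Pooled std dev: variances weighted by their degrees of freedom
s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
              / (n1 + n2 - 2))
t = (x1.mean() - x2.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))
p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)  # 2-sided p-value

t_ref, p_ref = stats.ttest_ind(x1, x2)  # pooled-variance t-test
print("manual:", t, p)
print("scipy: ", t_ref, p_ref)
```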

## Another assumption
- not obvious: we've assumed that the standard deviation of both groups is the same
- what if they're different? use Welch's t-test
- important: the steps are the same: find t -> find df -> find p-value

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{\bar{\Delta}} }\\
s_{\bar{\Delta}} = \sqrt{\frac{s_{1}^2}{n_{1}}+ \frac{s_{2}^2}{n_{2}} }\\
d.f.= \frac{(s_{1}^2/n_{1}+s_{2}^2/n_{2})^2 }{(s_{1}^2/n_{1})^2/(n_{1}-1)+ (s_{2}^2/n_{2})^2/(n_{2}-1) }
$$
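
The same three steps (find t, find df, find p-value) for Welch's t-test can be sketched as follows and compared with `scipy.stats.ttest_ind(..., equal_var=False)`. The simulated groups, with deliberately different spreads and sizes, are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(loc=1.0, scale=1.0, size=15)
x2 = rng.normal(loc=0.0, scale=3.0, size=40)  # much larger spread

# Per-group variance of the sample mean: s^2 / n
v1 = x1.var(ddof=1) / len(x1)
v2 = x2.var(ddof=1) / len(x2)

t = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
# Welch-Satterthwaite degrees of freedom (generally not an integer)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(x1) - 1) + v2 ** 2 / (len(x2) - 1))
p = 2 * stats.t.sf(abs(t), df=df)  # 2-sided p-value

t_ref, p_ref = stats.ttest_ind(x1, x2, equal_var=False)
print("manual:", t, df, p)
print("scipy: ", t_ref, p_ref)
```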

## More Assumptions
- we assumed data was gaussian
- we will see what test goes with click data later
- what if we don't want to assume a distribution?
- use nonparametric tests/distribution-free tests
- some popular ones are
    - Kolmogorov-Smirnov test
    - Kruskal-Wallis test
    - Mann-Whitney U test
- the procedure is the same: compute a test statistic, then get a p-value
- fewer assumptions -> less power -> need a more extreme difference for a statistically significant p-value
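
In SciPy, these three tests are available as `scipy.stats.ks_2samp`, `scipy.stats.kruskal`, and `scipy.stats.mannwhitneyu`. The sketch below runs them on simulated, clearly non-Gaussian (exponential) data; the data and seed are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.exponential(scale=1.0, size=50)
x2 = rng.exponential(scale=2.0, size=50)

# Each call returns a result holding the test statistic and the p-value
ks = stats.ks_2samp(x1, x2)
kw = stats.kruskal(x1, x2)
mw = stats.mannwhitneyu(x1, x2, alternative="two-sided")

for name, result in [("Kolmogorov-Smirnov", ks),
                     ("Kruskal-Wallis", kw),
                     ("Mann-Whitney U", mw)]:
    print(f"{name}: statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```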

## 1-sided vs 2 sided
- we did not make the 1 sided assumption
- effect of 1 sided test on p-value: it's easier to show significance because we don't multiply area by 2
- sometimes you don't want to do 1 sided test
- Drug testing: you want to test if drug is better, you also want to test if drug is worse
- But if you have an effective drug, and you want to test if a new drug is better, you can use a 1 sided test

## Summary
- compute a test statistic whose distribution under the null hypothesis is known
- look to see if it's at the extreme values of the distribution(statistical significance)
- if statistically significant, reject the null hypothesis

## In Code

```python
import numpy as np
from scipy import stats

# generate data
N = 10
a = np.random.randn(N) + 2 # mean 2, variance 1
b = np.random.randn(N) # mean 0, variance 1
```

```python
a, b

# Output from one run (values differ on each run because the data is random):
# (array([0.47644157, 2.65279007, 0.66894388, 1.8490248 , 1.60699778,
#         3.02461137, 1.78354336, 0.17284328, 2.68667113, 1.66448736]),
#  array([ 0.78296108, -0.14376871, -0.87388677, -0.63003727,  1.1729029 ,
#          0.02082976, -0.5373537 , -0.56871037,  0.10464732,  0.47919825]))
```

```python

# ddof : int, optional
# "Delta Degrees of Freedom": the divisor used in the calculation is N - ddof,
# where N represents the number of elements. By default ddof is zero.
# roll your own t-test:
var_a = a.var(ddof=1) # unbiased estimator, divide by N-1 instead of N
var_b = b.var(ddof=1)
```

```python
var_a, var_b

# Output from the same run:
# (0.9500324342271126, 0.4466096488668789)
```


$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$
```python
s = np.sqrt( (var_a + var_b) / 2 ) # pooled standard deviation (equal group sizes)
t = (a.mean() - b.mean()) / (s * np.sqrt(2.0/N)) # t-statistic
```

```python
df = 2*N - 2 # degrees of freedom
p = 1 - stats.t.cdf(np.abs(t), df=df) # one-sided test p-value
print("t:\t", t, "p:\t", 2*p) # two-sided test p-value
```
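
As a sanity check, the hand-rolled result can be compared with SciPy's built-in `scipy.stats.ttest_ind`, which performs the same equal-variance, 2-sided test by default. The sketch below is self-contained and reuses the sample arrays printed above; the `_manual` and `_scipy` variable names are just illustrative.

```python
import numpy as np
from scipy import stats

a = np.array([0.47644157, 2.65279007, 0.66894388, 1.8490248, 1.60699778,
              3.02461137, 1.78354336, 0.17284328, 2.68667113, 1.66448736])
b = np.array([0.78296108, -0.14376871, -0.87388677, -0.63003727, 1.1729029,
              0.02082976, -0.5373537, -0.56871037, 0.10464732, 0.47919825])
N = len(a)

# Same steps as the roll-your-own version
s = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
t_manual = (a.mean() - b.mean()) / (s * np.sqrt(2.0 / N))
p_manual = 2 * (1 - stats.t.cdf(np.abs(t_manual), df=2 * N - 2))

# Built-in version
t_scipy, p_scipy = stats.ttest_ind(a, b)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```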


# Reference
[t_distribution calc](http://www.dmbru.dentistry.ubc.ca/Calculating_Companion/applets/t_distribution/t_distribution.php)
![](http://kisi.deu.edu.tr/joshua.cowley/StudentTTable.png)