# A/B Testing Problem Setup

## A/B Testing
- Scenario
- You run a Software as a Service(SaaS) startup
- You have a landing page where you get people to sign up
- Signup = enter email address and click signup button
- Not everyone who visits your site will signup
- Conversion rate = proportion of people who sign up

- Suppose your lead marketer has identified problems with your landing page (not responsive, slow load time, bad copy, etc.)
- They create a new, possibly better page
- You (as a data scientist) want to measure which page is better, using data and math
- Recall confidence interval concept
- We know that rate(page 1) = 1/10 vs rate(page 2) = 2/10 is weaker evidence than rate(page 1) = 10/100 vs rate(page 2) = 20/100, because the larger samples give narrower confidence intervals
- How do we quantify this?


- We quantify this with statistical significance testing
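
To make the 1/10 vs 10/100 comparison concrete, here is a minimal sketch that computes an approximate 95% confidence interval for each conversion rate. The helper name `wald_interval` and the normal-approximation formula are assumptions chosen for illustration; they are not part of the original notes.

```python
import numpy as np

def wald_interval(signups, visits, z=1.96):
    """Approximate 95% confidence interval for a conversion rate
    (normal approximation; rough for very small samples)."""
    p_hat = signups / visits
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / visits)
    return p_hat - half_width, p_hat + half_width

# same observed rates (10% vs 20%), different sample sizes
small = [wald_interval(1, 10), wald_interval(2, 10)]
large = [wald_interval(10, 100), wald_interval(20, 100)]

def width(ci):
    return ci[1] - ci[0]

print("n=10 :", small, "widths:", [width(ci) for ci in small])
print("n=100:", large, "widths:", [width(ci) for ci in large])
```

The observed rates are identical, but the intervals for 100 visitors are much narrower, which is why the larger sample is more convincing.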


- Significance level = $\alpha$ (5%, 1% common)
- Is the difference in mean height between men and women statistically significant at significance level $\alpha$?

$$
\mu_{1} \stackrel{?}{=} \mu_{2}
$$

## Hypotheses
- Null hypothesis(no difference):
    - Example: no difference in height between men and women
    - Example: no difference in effect between drug and placebo
    - $H_{0}: \mu_{1} = \mu_{2}  $
- Alternative hypothesis(one-sided test):
    - $H_{1}: \mu_{1} > \mu_{2}$
    

- Alternative hypothesis(2-sided test):
    - Example: test if men taller than women OR women taller than men
    - Example: test if drug works better than placebo OR worse than placebo
    - $H_{1}: \mu_{1} \neq \mu_{2}$

- we will do the 2-sided test($\neq$)
- we will show quantitatively if the 2 groups are different

# A/B testing Recipe
- Continue our example(height of men vs women, drug vs placebo)
- As is typical in frequentist statistics, we will assume data is Gaussian-distributed
- We collect some data, 2 lists of heights, one for men, one for women
- $X_{1}$= {$x_{11},x_{12},...,x_{1N}$ }
- $X_{2}$= {$x_{21},x_{22},...,x_{2N}$ }

- We create a test statistic (called __t__)
$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$

- $s_{p}$ = pooled std dev (use unbiased estimates for all s: divide by N-1, not N)
- N = size of each group

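- For reference, the sample mean and the divide-by-N (biased) std dev are given below; for the unbiased $s$ used in $t$, divide by N-1 instead of N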
$$
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_{i} \\
\hat{\sigma} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\hat{\mu})^2}
$$

- Recall the estimate of the mean: it is a sum of random variables, therefore it is also a random variable
- t is also a function of random variables, therefore it is also a random variable
- It can be shown that t is t-distributed
- We will see the t-distribution and other more exotic distributions a lot when studying statistical testing and Bayesian methods

## t-distribution
- looks like a Gaussian with fatter tails

![](https://study.com/cimages/videopreview/o84agbwoq4.jpg)
![](https://cn.bing.com/th?id=OIP.-3zt17eY9pfOiFcewnoiEQHaFj&pid=Api&rs=1)
![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Student_t_pdf.svg/1200px-Student_t_pdf.svg.png)
![](http://www.obg.cuhk.edu.hk/ResearchSupport/StatTools/Pics/t.png)

- PDF
![](https://www.thoughtco.com/thmb/gAsKXrZg1kekAQnZVxb4J94QklA=/768x0/filters:no_upscale():max_bytes(150000):strip_icc()/tdist-56b749523df78c0b135f5be6.jpg)

- we won't use it directly
- 1 parameter: $\nu$ = degrees of freedom (df)
- for our statistical test, $\nu$ = 2N - 2
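
As a quick illustration of the "fatter tails" claim, the sketch below compares the upper-tail area of a t-distribution with that of a standard Gaussian. The group size `N = 10` and the cutoff `2.0` are arbitrary values assumed for the example.

```python
from scipy import stats

N = 10            # assumed size of each group
df = 2 * N - 2    # degrees of freedom for the two-sample test

cutoff = 2.0
tail_t = stats.t.sf(cutoff, df=df)    # P(T > cutoff) under the t-distribution
tail_norm = stats.norm.sf(cutoff)     # P(Z > cutoff) under the standard Gaussian

print("df:", df)
print("t tail area:       ", tail_t)
print("Gaussian tail area:", tail_norm)
```

The t tail area is larger than the Gaussian one, and the gap shrinks as the degrees of freedom grow.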

## Test Statistic
- If mean(X1) = mean(X2) -> t = 0
    - falls in center of t-distribution
- If mean(X1) >> mean(X2) -> t= large
    - falls in right tail
- If mean(X1) << mean(X2) -> t = large negative
    - falls in left tail
    
- symmetry -> doesn't matter if we call men = 1, women = 2 or vice versa

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$


## Area under t-distribution
- Use scipy.stats.t.cdf
- If in left tail -> CDF close to 0
- If in right tail -> CDF close to 1
- For a significance level $\alpha$ = 0.05 (two-sided) with, for example, df = 4, the critical region is t < -2.776 or t > 2.776
- we call that a statistically significant difference
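
The critical value can be looked up with the inverse CDF rather than a table. This is a minimal sketch; `df = 4` is an assumption (it corresponds to 3 samples per group), and `is_significant` is a hypothetical helper name.

```python
from scipy import stats

alpha = 0.05
df = 4  # assumed: 2N - 2 with N = 3 per group

# two-sided test: put alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df=df)
print("critical value:", t_crit)

def is_significant(t, df, alpha=0.05):
    """True if t falls in either tail beyond the two-sided critical value."""
    return abs(t) > stats.t.ppf(1 - alpha / 2, df=df)

print(is_significant(3.0, df), is_significant(1.0, df))
```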



# p-values

## A/B testing Cont
- Previously: I showed you an algorithm to determine whether the difference between 2 groups was statistically significant or not
- Algorithm is done, but we need some more terminology
- In stats, we are looking for one number < significance level($\alpha$)
- Should be small no matter which tail end we're on

## p-value
- What's this small number? it's called a p-value
- Much controversy
- People still discussing it today
- Our goal: get to the Bayesian way of doing things
- p-value definition
    - the probability of obtaining a result equal to or more extreme than what was actually observed, when the null hypothesis is true
    

- If: average height of men == average height of women (null hypothesis)
- Then: p-value is the probability of observing the difference we measured(or any larger difference)
- In other words: If t $\propto$ mean(X1) - mean(X2) is very large, p-value should be very small


- we want to keep using our significance level $\alpha$
- if p-value < $\alpha$:
    - difference is statistically significant
    - we reject the null hypothesis
- otherwise:
    - we cannot reject the null hypothesis
    - this does not mean the null hypothesis is true, it just means we can't reject it with the data we collected

- For our example, we want p-value < 5% = 0.05
- But the answer we got before(for statistical significance) was:
- Area < 0.025 or Area > 0.975
- If we get Area >0.975, take 1- Area < 0.025
- now we have a small number < 0.025
- multiply both sides by 2
- 2 x( small number) < 0.05
- p-value = 2 x(small number)
- multiply by 2 only for 2-sided test
- one sided test -> we are only checking if X1 > X2, don't multiply by 2
- this means the one sided test has more power than the two sided test
- it doesn't require a test statistic to be as extreme in order to be significant 
- in general:
    - the more assumptions you make, the more power your test has
    - opposite also true: fewer assumptions, less power
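
The area-to-p-value steps above can be sketched as follows. The function name `p_values_from_t` is hypothetical, and the one-sided value returned here is the tail area in the direction that was actually observed.

```python
from scipy import stats

def p_values_from_t(t, df):
    """Turn a t statistic into (one-sided, two-sided) p-values."""
    area = stats.t.cdf(t, df=df)     # area to the left of t
    small = min(area, 1 - area)      # "small number" regardless of which tail
    return small, 2 * small          # multiply by 2 only for the two-sided test

one_sided, two_sided = p_values_from_t(2.5, df=18)
print("one-sided p:", one_sided, "two-sided p:", two_sided)
```

Because the two-sided p-value is exactly double, a statistic that is significant one-sided can fail to be significant two-sided at the same $\alpha$.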


## Testing Characteristics
- we already know if mean(X1) >> mean(X2) or vice versa, t will be larger
- what about $s_{p}$ and N?
- $t \propto \sqrt{N}/s_{p}$
- bigger N -> bigger t (smaller p-value)
- bigger $s_{p}$ -> smaller t (bigger p-value) 

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$


- same situation as with confidence interval
- t grows slowly (square root) with N
- t decreases faster (inversely proportional) with $s_{p}$

## small sample sizes?
- t depends on N - if N is small, t is smaller, p-value is bigger
- since statistical significance is a function of N, it already takes sample size into account
- not correct to say small value of N makes a finding false
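
A small numerical check of the scaling $t \propto \sqrt{N}/s_{p}$, holding the difference in means fixed. The helper `t_stat` and the input values are assumptions for illustration only.

```python
import numpy as np

def t_stat(mean_diff, s_p, N):
    """Equal-size two-sample t statistic from summary numbers."""
    return mean_diff / (s_p * np.sqrt(2.0 / N))

base = t_stat(mean_diff=1.0, s_p=1.0, N=10)
more_data = t_stat(mean_diff=1.0, s_p=1.0, N=40)     # 4x the data
more_noise = t_stat(mean_diff=1.0, s_p=2.0, N=10)    # 2x the std dev

print("4x N   -> t ratio:", more_data / base)    # grows with sqrt(N)
print("2x s_p -> t ratio:", more_noise / base)   # shrinks as 1/s_p
```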

## Pooled standard deviation
- recall: N is the size of each group
- what if each group is of a different size? take a weighted combination

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{ \frac{1}{n_{1}}+\frac{1}{n_{2}}  }}\\
s_{p} = \sqrt{\frac{(n_{1}-1)s_{X_{1}}^2+ (n_{2}-1)s_{X_{2}}^2}{n_{1}+n_{2}-2} }
$$
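
A minimal sketch of the unequal-size formula, checked against `scipy.stats.ttest_ind` (which pools the variances by default). The simulated data, seed, and group sizes are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(2.0, 1.0, size=12)   # group 1: 12 samples
x2 = rng.normal(0.0, 1.0, size=20)   # group 2: 20 samples

n1, n2 = len(x1), len(x2)
# weighted (pooled) standard deviation using unbiased variances
s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
t = (x1.mean() - x2.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df=df)    # two-sided p-value

t_ref, p_ref = stats.ttest_ind(x1, x2)
print("manual:", t, p)
print("scipy: ", t_ref, p_ref)
```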

## Another assumption
- not obvious: we've assumed that the standard deviation of both groups is the same
- what if they're different? use Welch's t-test
- important: the steps are the same: find t -> find df -> find p-value

$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{\bar{\Delta}} }\\
s_{\bar{\Delta}} = \sqrt{\frac{s_{1}^2}{n_{1}}+ \frac{s_{2}^2}{n_{2}} }\\
d.f.= \frac{(s_{1}^2/n_{1}+s_{2}^2/n_{2})^2 }{(s_{1}^2/n_{1})^2/(n_{1}-1)+ (s_{2}^2/n_{2})^2/(n_{2}-1) }
$$

## More Assumptions
- we assumed data was gaussian
- we will see what test goes with click data later
- what if we don't want to assume a distribution?
- use nonparametric tests/distribution-free tests
- some popular ones are
    - Kolmogorov-Smirnov test
    - Kruskal-Wallis test
    - Mann-Whitney U test
- API is the same
- fewer assumptions -> less power -> need more extreme difference for statistically significant p-value
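
To illustrate "API is the same", here is a sketch that runs the three tests named above on the same pair of samples using SciPy. The simulated data and seed are assumptions; each call returns a statistic and a p-value, just like `ttest_ind`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(2.0, 1.0, size=30)
b = rng.normal(0.0, 1.0, size=30)

u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")  # Mann-Whitney U
ks_stat, p_ks = stats.ks_2samp(a, b)                               # Kolmogorov-Smirnov
h_stat, p_kw = stats.kruskal(a, b)                                 # Kruskal-Wallis

print("Mann-Whitney U:     ", u_stat, p_mw)
print("Kolmogorov-Smirnov: ", ks_stat, p_ks)
print("Kruskal-Wallis:     ", h_stat, p_kw)
```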

## 1-sided vs 2 sided
- we did not make the 1 sided assumption
- effect of 1 sided test on p-value: it's easier to show significance because we don't multiply area by 2
- sometimes you don't want to do 1 sided test
- Drug testing- you want to test if drug is better, you also want to test if drug is worse
- But if you have an effective drug, and you want to test if a new drug is better, you can use a 1 sided test
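
A sketch of the one-sided variant in SciPy. It assumes a SciPy version where `ttest_ind` accepts the `alternative` argument (added in SciPy 1.6, to the best of my knowledge); the simulated drug data are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
new_drug = rng.normal(1.0, 1.0, size=25)   # assumed scores under the new drug
old_drug = rng.normal(0.0, 1.0, size=25)   # assumed scores under the existing drug

t, p_two = stats.ttest_ind(new_drug, old_drug)                            # two-sided
_, p_one = stats.ttest_ind(new_drug, old_drug, alternative="greater")     # H1: new > old

print("t:", t)
print("two-sided p:", p_two)
print("one-sided p:", p_one)
```

When t is positive, the one-sided p-value is half the two-sided one, which is the "don't multiply by 2" effect described above.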

## Summary
- generate a test statistic whose distribution we know
- look to see if it's at the extreme values of the distribution(statistical significance)
- if statistically significant, reject the null hypothesis

## In Code

```python
import numpy as np
from scipy import stats

# generate data
N = 10
a = np.random.randn(N) + 2 # mean 2, variance 1
b = np.random.randn(N) # mean 0, variance 1
```

```python
a,b

(array([0.47644157, 2.65279007, 0.66894388, 1.8490248 , 1.60699778,
        3.02461137, 1.78354336, 0.17284328, 2.68667113, 1.66448736]),
 array([ 0.78296108, -0.14376871, -0.87388677, -0.63003727,  1.1729029 ,
         0.02082976, -0.5373537 , -0.56871037,  0.10464732,  0.47919825]))
```         

```python

# ddof ("Delta Degrees of Freedom"): the divisor used in the calculation is N - ddof,
# where N is the number of elements. By default ddof is zero.
# roll your own t-test:
var_a = a.var(ddof=1) # unbiased estimator, divide by N-1 instead of N
var_b = b.var(ddof=1)
```

```python
var_a,var_b
(0.9500324342271126, 0.4466096488668789)
```


$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{p}\sqrt{2/N}}\\
s_{p} = \sqrt{\frac{s_{1}^2+s_{2}^2}{2}}
$$
```python
s = np.sqrt( (var_a + var_b) / 2 ) # pooled standard deviation (equal group sizes)
t = (a.mean() - b.mean()) / (s * np.sqrt(2.0/N)) # t-statistic
```

```python
df = 2*N - 2 # degrees of freedom
p = 1 - stats.t.cdf(np.abs(t), df=df) # one-sided test p-value
print("t:\t", t, "p:\t", 2*p) # two-sided test p-value
```


```python
# built-in t-test:
t2, p2 = stats.ttest_ind(a, b)
print("t2:\t", t2, "p2:\t", p2)
```

- stats.ttest_ind
- We can use this test, if we observe two independent samples from the same or different population, e.g. exam scores of boys and girls or of two ethnic groups. The test measures whether the average (expected) value differs significantly across samples. If we observe a large p-value, for example larger than 0.05 or 0.1, then we cannot reject the null hypothesis of identical average scores. If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages

# t-test Exercise


- advertisement_clicks.csv (the file name used by the code below)
```text
advertisement_id,action
B,1
B,1
A,0
B,0
A,1
A,0
B,0
```
- you should check how many distinct advertisements there are(to decide whether or not to use the Bonferroni correction)
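
A minimal sketch of that check. It builds a tiny DataFrame from the sample rows shown above instead of reading the CSV, and applies the Bonferroni correction by dividing $\alpha$ by the number of pairwise comparisons; the variable names are hypothetical.

```python
import pandas as pd

# the seven sample rows from the excerpt above
df = pd.DataFrame({
    "advertisement_id": ["B", "B", "A", "B", "A", "A", "B"],
    "action":           [1,   1,   0,   0,   1,   0,   0],
})

n_ads = df["advertisement_id"].nunique()
n_comparisons = n_ads * (n_ads - 1) // 2    # number of pairwise tests

alpha = 0.05
alpha_per_test = alpha / n_comparisons      # Bonferroni-corrected threshold

print("distinct ads:", n_ads)
print("pairwise comparisons:", n_comparisons)
print("per-test alpha:", alpha_per_test)
```

With only two advertisements there is a single comparison, so the corrected threshold equals the original $\alpha$.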

## What to do
- use the t-test to determine if one advertisement is better(in a statistically significant sense) than another


## Key point
- It doesn't matter where the data came from
- e.g. it could be news article headlines
- what would we do?
- simply replace the header with news_article_id,action
- all data is the same
    - different landing page/website designs/logos/
    

## How do I collect this data?
- That's a question only you can answer
- e.g if you work at a company and your data is stored on Hadoop, then you'd write a script to get the data off Hadoop(or run your analysis on those files directly)
- MySQLdb
- SaaS and CSV


## In Code

```python
import numpy as np
import pandas as pd
from scipy import stats
```

```python
# get data
df = pd.read_csv('advertisement_clicks.csv')
a = df[df['advertisement_id'] == 'A']
b = df[df['advertisement_id'] == 'B']
a = a['action']
b = b['action']
```

```python
# built-in t-test:
t, p = stats.ttest_ind(a, b)
print("t:\t", t, "p:\t", p)

# welch's t-test:
t, p = stats.ttest_ind(a, b, equal_var=False)
print("Welch's t-test:")
print("t:\t", t, "p:\t", p)

# recorded output from the original run:
# a.mean: 0.304
# b.mean: 0.372
# t:	 -3.2211732138019786 p:	 0.0012971905467125246
```


$$
t = \frac{\bar{X}_{1}- \bar{X}_{2}}{s_{\bar{\Delta}} }\\
s_{\bar{\Delta}} = \sqrt{\frac{s_{1}^2}{n_{1}}+ \frac{s_{2}^2}{n_{2}} }\\
d.f.= \frac{(s_{1}^2/n_{1}+s_{2}^2/n_{2})^2 }{(s_{1}^2/n_{1})^2/(n_{1}-1)+ (s_{2}^2/n_{2})^2/(n_{2}-1) }
$$

```python
# welch's t-test manual:

N1 = len(a)
s1_sq = a.var(ddof=1)  # unbiased variance (ddof=1 is already the pandas default)
N2 = len(b)
s2_sq = b.var(ddof=1)
t = (a.mean() - b.mean()) / np.sqrt(s1_sq / N1 + s2_sq / N2)

nu1 = N1 - 1
nu2 = N2 - 1
df = (s1_sq / N1 + s2_sq / N2)**2 / ( (s1_sq*s1_sq) / (N1*N1 * nu1) + (s2_sq*s2_sq) / (N2*N2 * nu2) )
p = (1 - stats.t.cdf(np.abs(t), df=df))*2
print("Manual Welch t-test")
print("t:\t", t, "p:\t", p)
```

## [How to Code the Student’s t-Test from Scratch in Python](https://machinelearningmastery.com/how-to-code-the-students-t-test-from-scratch-in-python/)

```python
t = observed difference between sample means / standard error of the difference between the means
or t = (mean(X1) - mean(X2)) / sed

sed = sqrt(se1^2 + se2^2)
 	
se = std / sqrt(n)

```



```python
# calculate means
mean1, mean2 = mean(data1), mean(data2)

# calculate sample standard deviations
std1, std2 = std(data1, ddof=1), std(data2, ddof=1)

# calculate standard errors
n1, n2 = len(data1), len(data2)
se1, se2 = std1/sqrt(n1), std2/sqrt(n2)

# Alternatively, use the SciPy sem() function (from scipy.stats import sem),
# which computes the standard error directly (ddof=1 by default)
# calculate standard errors
se1, se2 = sem(data1), sem(data2) 

# standard error on the difference between the samples
sed = sqrt(se1**2.0 + se2**2.0)

# calculate the t statistic
t_stat = (mean1 - mean2) / sed

```
__The number of degrees of freedom for the test is calculated as the sum of the observations in both samples, minus two__
```python
# degrees of freedom
df = n1 + n2 - 2
```

__The critical value can be calculated using the percent point function (PPF) for a given significance level, such as 0.05 (95% confidence)__
```python
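# calculate the critical value
from scipy import stats
alpha = 0.05
# note: ppf(1 - alpha) is the one-sided critical value;
# a two-sided test at level alpha would use ppf(1 - alpha / 2)
cv = stats.t.ppf(1.0 - alpha, df)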
```

__The p-value can be calculated using the cumulative distribution function on the t-distribution, again in SciPy__
```python
# calculate the p-value
p = (1 - stats.t.cdf(abs(t_stat), df)) * 2
```

__Here, we assume a two-tailed distribution, where the rejection of the null hypothesis could be interpreted as the first mean is either smaller or larger than the second mean__


```python
# t-test for independent samples
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from scipy.stats import sem
from scipy.stats import t

# function for calculating the t-test for two independent samples
def independent_ttest(data1, data2, alpha):
	# calculate means
	mean1, mean2 = mean(data1), mean(data2)
	# calculate standard errors
	se1, se2 = sem(data1), sem(data2)
	# standard error on the difference between the samples
	sed = sqrt(se1**2.0 + se2**2.0)
	# calculate the t statistic
	t_stat = (mean1 - mean2) / sed
	# degrees of freedom
	df = len(data1) + len(data2) - 2
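	# calculate the critical value
	# note: ppf(1 - alpha) is the one-sided critical value; a two-sided test
	# at level alpha would use ppf(1 - alpha / 2)
	cv = t.ppf(1.0 - alpha, df)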
	# calculate the p-value
	p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
	# return everything
	return t_stat, df, cv, p

# seed the random number generator
seed(1)
# generate two independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = independent_ttest(data1, data2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))
# interpret via critical value
if abs(t_stat) <= cv:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')
# interpret via p-value
if p > alpha:
	print('Fail to reject the null hypothesis that the means are equal.')
else:
	print('Reject the null hypothesis that the means are equal.')
t=-2.262, df=198, cv=1.653, p=0.025
Reject the null hypothesis that the means are equal.
Reject the null hypothesis that the means are equal.
```

# [sample-one-tailed-t-test-with-numpy-scipy](https://stackoverflow.com/questions/15984221/how-to-perform-two-sample-one-tailed-t-test-with-numpy-scipy)

```python
First let's formulate our investigative question properly. The data we are investigating is

A = np.array([0.19826790, 1.36836629, 1.37950911, 1.46951540, 1.48197798, 0.07532846])
B = np.array([0.6383447, 0.5271385, 1.7721380, 1.7817880])

with the sample means

A.mean() = 0.99549419
B.mean() = 1.1798523

I assume that since the mean of B is obviously greater than the mean of A, you would like to check if this result is statistically significant.

So we have the Null Hypothesis

H0: A >= B

that we would like to reject in favor of the Alternative Hypothesis

H1: B > A

Now when you call scipy.stats.ttest_ind(x, y), this makes a Hypothesis Test on the value of x.mean()-y.mean(), which means that in order to get positive values throughout the calculation (which simplifies all considerations) we have to call

stats.ttest_ind(B,A)

instead of stats.ttest_ind(A,B). We get as an answer

    t-value = 0.42210654140239207
    p-value = 0.68406235191764142

and since according to the documentation this is the output for a two-tailed t-test we must divide the p by 2 for our one-tailed test. So depending on the Significance Level alpha you have chosen you need

p/2 < alpha

in order to reject the Null Hypothesis H0. For alpha=0.05 this is clearly not the case so you cannot reject H0.

An alternative way to decide if you reject H0 without having to do any algebra on t or p is by looking at the t-value and comparing it with the critical t-value t_crit at the desired level of confidence (e.g. 95%) for the number of degrees of freedom df that applies to your problem. Since we have

df = sample_size_1 + sample_size_2 - 2 = 8

we get from a statistical table like this one that

t_crit(df=8, confidence_level=95%) = 1.860

We clearly have

t < t_crit

so we obtain again the same result, namely that we cannot reject H0.

```

# A/B test for Click-Through Rates(Chi-Square Test)

## Test for Click-Through Rates
- you know one test, let's do another one
- This works for click through rates(not Gaussian but Bernoulli)
- works for any categorical variable where we count things

|[]()|Click|No Click|
|--|--|--|
|Advertisement A|36|14|
|Advertisement B|30|25|

## Chi-Square Test Statistic
- $\chi^2$ test statistic
- Always positive
- Like t-distribution: 1 parameter = degree of freedom
- Like t, also has location/scale (available in SciPy): default loc = 0, scale = 1



$$
\chi^2 = \sum_{i}^{}\frac{(\text{observed}_{i} -\text{expected}_{i})^2}{\text{expected}_{i}}
$$

- i = every cell of table


|[]()|Click|No Click|Click + No Click|
|--|--|--|--|
|Advertisement A|36|14|50|
|Advertisement B|30|25|55|
|Ad A+ Ad B|66|39|105|

- i = (Ad A,Click)
- $\text{observed}_{i}$ = 36
- $\text{expected}_{i}$ = (number of times Ad A is shown) $\times$ p(click) = $50 \times (66/105) = 31.429$

- i = (Ad A,No Click)
- $\text{observed}_{i}$ = 14
- $\text{expected}_{i}$ = (number of times Ad A is shown) $\times$ p(no click) = $50 \times (39/105) = 18.571$ = (row 1 total) $\times$ (col 2 total) / N

- $\chi^2$ = $(36-31.429)^2/31.429+(14-18.571)^2/18.571+(30-34.571)^2/34.571+(25-20.429)^2/20.429=3.418$
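
The cell-by-cell arithmetic above can be reproduced with NumPy. This is a sketch using the table from this section; the expected counts come from (row total) x (column total) / N, and the variable names are assumptions.

```python
import numpy as np

# rows: Ad A, Ad B; columns: click, no click
T = np.array([[36, 14],
              [30, 25]], dtype=float)

row_totals = T.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = T.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / T.sum()

chi2_stat = ((T - expected) ** 2 / expected).sum()

print("expected counts:\n", expected)
print("chi-square statistic:", chi2_stat)
```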

## shortcut
- only works for 2x2 table

$$
\chi^2 = \frac{(ad-bc)^2(a+b+c+d)}{(a+b)(c+d)(a+c)(b+d)}
$$

## What's next?
- Distribution is >= 0
- Extreme is when observed far away from expected
- so $\chi^2$ is large
- so 1- CDF($\chi^2$) gives us a small number
- this is the p-value
- as is typical, if p-value < $\alpha$ , then Ad A and Ad B are significantly different
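
A sketch of these steps for the example table: compute $\chi^2$ with the 2x2 shortcut, take the upper-tail area with 1 degree of freedom, and compare against `chi2_contingency` with the correction turned off. Variable names are assumptions.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

T = np.array([[36, 14],
              [30, 25]])
a, b = T[0]
c, d = T[1]

# 2x2 shortcut formula
chi2_stat = (a * d - b * c) ** 2 * (a + b + c + d) / ((a + b) * (c + d) * (a + c) * (b + d))

# p-value = 1 - CDF; sf() computes the same upper-tail area
p_value = chi2.sf(chi2_stat, df=1)

ref_stat, ref_p, ref_dof, _ = chi2_contingency(T, correction=False)
print("manual:", chi2_stat, p_value)
print("scipy: ", ref_stat, ref_p, "dof:", ref_dof)
```

For this table the p-value lands slightly above 0.05, so at $\alpha$ = 0.05 the two ads would not be declared significantly different.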

```python
# T is table
scipy.stats.chi2_contingency(T,correction=False)
```

- what's this correction?
- The chi-square test statistic only asymptotically approaches the chi-square distribution
- (it matches exactly only as N -> $\infty$)
- same situation as the CLT (central limit theorem)
- same situation as the distribution when estimating $\mu$
- Yates' correction (set correction=True in SciPy)
- Fisher's exact test (different test, same data)
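
The three options can be compared side by side on the example table. This is a sketch using `scipy.stats.chi2_contingency` and `scipy.stats.fisher_exact`; the variable names are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

T = np.array([[36, 14],
              [30, 25]])

_, p_plain, _, _ = chi2_contingency(T, correction=False)   # no correction
_, p_yates, _, _ = chi2_contingency(T, correction=True)    # Yates' continuity correction
odds_ratio, p_fisher = fisher_exact(T)                     # exact test, two-sided by default

print("chi-square (no correction):", p_plain)
print("chi-square (Yates):        ", p_yates)
print("Fisher's exact:            ", p_fisher)
```

Yates' correction shrinks the statistic, so its p-value is larger (more conservative) than the uncorrected one.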



## In Code

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2, chi2_contingency
```

```python
class DataGenerator:
  def __init__(self, p1, p2):
    # p1,p2 click probability of group1,group2
    self.p1 = p1
    self.p2 = p2
    
    
  def next(self):
    # return click(1),no click(0)
    click1 = 1 if (np.random.random() < self.p1) else 0
    click2 = 1 if (np.random.random() < self.p2) else 0
    return click1, click2
  
```

```python

# contingency table
#        click       no click
#------------------------------
# ad A |   a            b
# ad B |   c            d
#
# chi^2 = (ad - bc)^2 (a + b + c + d) / [ (a + b)(c + d)(a + c)(b + d)]
# degrees of freedom = (#cols - 1) x (#rows - 1) = (2 - 1)(2 - 1) = 1

# short example

# T = np.array([[36, 14], [30, 25]])
# c2 = np.linalg.det(T)**2 * T.sum() / ( T[0].sum()*T[1].sum()*T[:,0].sum()*T[:,1].sum() )
# p_value = 1 - chi2.cdf(x=c2, df=1)

# equivalent:
# (36-31.429)**2/31.429+(14-18.571)**2/18.571 + (30-34.571)**2/34.571 + (25-20.429)**2/20.429

```

- only works for 2x2 table

$$
\chi^2 = \frac{(ad-bc)^2(a+b+c+d)}{(a+b)(c+d)(a+c)(b+d)}
$$
```python
def get_p_value(T):
  # same as scipy.stats.chi2_contingency(T, correction=False)
  det = T[0,0]*T[1,1] - T[0,1]*T[1,0]
  c2 = float(det) / T[0].sum() * det / T[1].sum() * T.sum() / T[:,0].sum() / T[:,1].sum()
  p = 1 - chi2.cdf(x=c2, df=1)
  return p
```

```python
def run_experiment(p1, p2, N):
  data = DataGenerator(p1, p2)
  p_values = np.empty(N)
  T = np.zeros((2, 2)).astype(np.float32)
  for i in range(N):
    c1, c2 = data.next()
    T[0,c1] += 1
    T[1,c2] += 1
    # ignore the first 10 values
    if i < 10:
      p_values[i] = np.nan  # not enough data yet; NaN is skipped by plt.plot
    else:
      p_values[i] = get_p_value(T)
  plt.plot(p_values)
  plt.plot(np.ones(N)*0.05)
  plt.show()

run_experiment(0.1, 0.11, 20000)

```

# Reference
[t_distribution calc](http://www.dmbru.dentistry.ubc.ca/Calculating_Companion/applets/t_distribution/t_distribution.php)
![](http://kisi.deu.edu.tr/joshua.cowley/StudentTTable.png)