<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Statistical Inference

---

# Housekeeping

As mentioned earlier...

- If you wish to leave early - after the main slides - then I will diplomatically turn a blind eye to that.

- We have a catch up session on Monday after the lightning talks, so I expect we will again cover these slides (and exercises) then.

- If you wish to stay on until 9pm, that’s ok too!

## Learning Objectives / Agenda

#### How much does my data actually tell me about the world?

- Define a confidence interval and a p-value
- Understand the theory of hypothesis testing
- Know how to perform hypothesis tests and how to calculate confidence intervals and p-values using Python
- Articulate the main considerations of study design and the problem with p-values
- Short examples of statistical testing

## How much does my data actually tell me about the world?

Imagine you want to know the average height of people.

You sample 100 people at random and measure their heights. The mean and standard deviation of these heights are 1.5m and 0.1m.

**How confident are you that people are 1.5m tall on average?**

Confidence intervals give you a tool to measure that

They're "intervals" because your confidence is tied to a **range** of values

You'd report something like *"based on my sample, people are on average between 1.3m and 1.7m tall, with a 95% confidence."*

Where do those numbers come from?

```python
from scipy import stats

stats.norm.interval(0.95, loc=1.5, scale=0.1)
```

```
(1.3040036015459946, 1.6959963984540054)
```

What does this interval mean exactly?

Specifically, this says:

If you repeated the whole experiment 100 times (each time drawing a fresh sample of people and computing a 95% confidence interval from it),

then about 95 of those intervals would contain the **true** population mean.

http://rpsychologist.com/d3/CI/
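The "repeat the experiment many times" reading can be checked with a small simulation. This is only a sketch: the population mean and standard deviation are made-up values chosen to match the height example, and the interval is built from the sample's standard error with a t-distribution, which is one common way to do it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_mean, true_std = 1.5, 0.1   # hypothetical population, in metres
n, n_experiments = 100, 1000

covered = 0
for _ in range(n_experiments):
    # Draw a fresh sample and build a 95% CI for the mean from it
    sample = rng.normal(true_mean, true_std, n)
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    low, high = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=standard_error)
    covered += low <= true_mean <= high

coverage = covered / n_experiments
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The printed share should land close to 95%, but not exactly on it, because the simulation is itself random.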

What does changing the 95% to 90% or 99% do?

What about a 10% CI?

```python
from scipy import stats

print("10% CI:", stats.norm.interval(0.10, loc=1.5, scale=0.1))
print("90% CI:", stats.norm.interval(0.9, loc=1.5, scale=0.1))
print("99% CI:", stats.norm.interval(0.99, loc=1.5, scale=0.1))
print("99.999% CI:", stats.norm.interval(0.99999, loc=1.5, scale=0.1))
```

```
10% CI: (1.4874338653144925, 1.5125661346855075)
90% CI: (1.3355146373048528, 1.6644853626951472)
99% CI: (1.24241706964511, 1.75758293035489)
99.999% CI: (1.0582826586529994, 1.9417173413467606)
```
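One caveat about the calls above: they pass the sample standard deviation (0.1) as `scale`, so the ranges describe where individual heights tend to fall. For a confidence interval on the *mean*, the usual choice of `scale` is the standard error, i.e. the standard deviation divided by the square root of the sample size. A minimal sketch, assuming the same 100 people as in the example:

```python
import math
from scipy import stats

n = 100            # sample size from the height example
sample_mean = 1.5  # metres
sample_std = 0.1   # metres

# Standard error of the mean: how much the sample mean itself varies
standard_error = sample_std / math.sqrt(n)

# Normal-approximation interval for the population mean
ci_normal = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

# t-based interval, which accounts for estimating the std from the sample
ci_t = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

print("95% CI (normal):", ci_normal)
print("95% CI (t):     ", ci_t)
```

With 100 people the interval for the mean is roughly 1.48m to 1.52m, ten times narrower than the range for individuals, and the t-based version is only slightly wider than the normal one.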


## How sure can I be of my findings?

In doing science, we always want to err on the side of being sceptical.

If you measure a difference between things, or an effect of X on Y, you start by assuming it's due to chance.

Then you have tools to try and suggest otherwise.

### Example

I want to find out if there's a significant height difference between horse jockeys and players from the NBA.

The way we frame this in hypothesis testing is we have **two** hypotheses.

The **null** hypothesis $H_0$, which assumes there is **no** difference (or no effect of X on Y)

The **alternate** hypothesis $H_1$, which assumes there **is** a difference (or an effect)

What are my hypotheses in this case?

$H_0$: there is no difference between jockeys and basketball players.

$H_1$: there **is** such a difference.

Then we need to decide on a **significance level**.

i.e. "how unlikely must it be that my findings arose purely by chance before I trust them?"

A typical choice is 5% (i.e. 0.05)

For comparing the means of two groups we can use a t-test.

The t-test (also called Student's t-test) compares two averages (means) and tells you whether they differ by more than you'd expect from chance alone. It also tells you how significant the difference is; in other words, it indicates how likely a difference that large would be if only chance were at work.

Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. 

It was developed by William Sealy Gosset under the pseudonym Student. 

![](assets/images/tdist.png)

The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean.
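The heavier tails can be made concrete by comparing tail probabilities. The sketch below asks how likely a value more than 3 units from the centre is under the standard normal and under t-distributions with a few arbitrarily chosen degrees of freedom:

```python
from scipy import stats

threshold = 3  # distance from the centre, in standard units

# Two-sided tail probability under the standard normal
normal_tail = 2 * stats.norm.sf(threshold)
print(f"normal: P(|Z| > {threshold}) = {normal_tail:.4f}")

# Same tail probability under t-distributions with increasing degrees of freedom
t_tails = {}
for dof in (2, 5, 30, 1000):
    t_tails[dof] = 2 * stats.t.sf(threshold, dof)
    print(f"t, {dof:>4} degrees of freedom: P(|T| > {threshold}) = {t_tails[dof]:.4f}")
```

The tail probability shrinks towards the normal value as the degrees of freedom grow, which is why the distinction matters most for small samples.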

https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind

```python
import pandas as pd
import numpy as np

df = pd.DataFrame()

np.random.seed(42)

df["jockeys"] = np.random.normal(150, 10, 100)
df["jockeys_2"] = np.random.normal(150, 10, 100)
df["basketball_players"] = np.random.normal(190, 10, 100)

df.describe()
```

| | jockeys | jockeys_2 | basketball_players |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 148.961535 | 150.223046 | 190.648963 |
| std | 9.081684 | 9.53669 | 10.842829 |
| min | 123.802549 | 130.812288 | 157.587327 |
| 25% | 143.990943 | 141.943395 | 183.445565 |
| 50% | 148.730437 | 150.841072 | 190.976957 |
| 75% | 154.059521 | 155.381704 | 197.044374 |
| max | 168.522782 | 177.201692 | 228.527315 |


```python
from scipy import stats

t_statistic, p_value = stats.ttest_ind(df["jockeys"], df["basketball_players"])

print(p_value)
```

```
2.452490316264101e-74
```


That's a very small number. It means that if there were really no difference between the groups, a gap this large would almost never arise by chance.
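To make the test less of a black box, here is a sketch of the calculation behind `ttest_ind` in its default equal-variance form: the difference in means divided by its standard error, compared against a t-distribution. The two groups are freshly simulated stand-ins (hypothetical heights in cm), so the numbers will not match the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(150, 10, 100)   # hypothetical jockey heights, cm
group_b = rng.normal(190, 10, 100)   # hypothetical basketball player heights, cm

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t-statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t-distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), n_a + n_b - 2)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The manual and library values should agree, which shows that the p-value is just a tail area of the t-distribution beyond the observed statistic.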

Let's try with another random set of jockeys.

```python
t_statistic_2, p_value_2 = stats.ttest_ind(df["jockeys"], df["jockeys_2"])

print(p_value_2)
```

```
0.3392652865361483
```

In the first case, the p-value was tiny.

That means that **assuming the null hypothesis**, i.e. "there is no difference between the groups" (which we always do)...

it would be **extremely unlikely** to get two samples with such different means **purely by chance**.

Therefore there **is** a significant difference between the groups, and we **reject the null hypothesis**.

In the second case, the p-value was 0.34, meaning that if there were no true difference, we'd still see a difference at least this large about 34% of the time.

That's not enough evidence to conclude a difference, so we **fail to reject the null hypothesis**.

Important wording!

Case 1: *"we reject the null hypothesis"* and **not** *"we proved a difference"* or *"we proved the alternate hypothesis"*

Case 2: *"we fail to reject the null hypothesis"* and **not** *"we proved there is no difference"*

Remember, we're always cautious about our findings

## Errors

#### Type I

False positives, i.e. concluding there is an effect/difference when there isn't one

![](assets/images/xkcd_jelly_beans_2.png)

![](assets/images/xkcd_jelly_beans_3.png)
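A quick simulation shows how often Type I errors occur when the null hypothesis is actually true. In this sketch both groups are drawn from the same made-up population, so every "significant" result is a false positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_tests = 2000

false_positives = 0
for _ in range(n_tests):
    # Both groups come from the same population, so the null hypothesis is true
    group_a = rng.normal(150, 10, 100)
    group_b = rng.normal(150, 10, 100)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    false_positives += p_value < alpha

false_positive_rate = false_positives / n_tests
print(f"False positive rate: {false_positive_rate:.3f}")
```

The rate should come out near the chosen significance level of 0.05: the significance level is, by construction, the false positive rate you agree to tolerate.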

#### Type II

False negatives, i.e. concluding there is no effect/difference when there is one

Remember the "boy who cried wolf":

- first the boy claimed there was a wolf when there wasn't one (Type I)

- then there was no response from the villagers when there actually was a wolf (Type II)

What's worse? Type I or Type II?

It depends...!

In law, a false positive (jailing an innocent person) is worse than a false negative (a guilty person goes free)

In medicine, a false negative (missing a diagnosis) is worse than a false positive (sending healthy people for follow ups)

Always think of the context when thinking about the cost of false positives and false negatives.

## The Problem with P-Values

What are some reasons it may not be good to rely on p-values?

- the 5% cutoff is arbitrary

- the more hypotheses you test, the higher the chance that at least one looks significant purely by chance, so 5% gets worse as a cutoff
    - one solution is the [Bonferroni correction](http://www.statsmakemecry.com/smmctheblog/bonferroni-correction-in-regression-fun-to-say-important-to.html)

- it is unintuitive even to scientists: [http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values](http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/)
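The multiple-testing point can be sketched in a few lines. The p-values below are invented for illustration; the Bonferroni correction simply divides the significance level by the number of tests.

```python
# Hypothetical p-values from testing 5 separate hypotheses
p_values = [0.003, 0.02, 0.04, 0.30, 0.75]
alpha = 0.05

# Bonferroni: divide the significance level by the number of tests
adjusted_alpha = alpha / len(p_values)

for p in p_values:
    naive = p < alpha
    corrected = p < adjusted_alpha
    print(f"p = {p:<5}  significant at {alpha}: {naive!s:<5}  after Bonferroni: {corrected}")

# Chance of at least one false positive across m independent tests at level alpha
m = 20
family_wise_error = 1 - (1 - alpha) ** m
print(f"With {m} tests, P(at least one false positive) = {family_wise_error:.2f}")
```

With 20 independent tests at the 5% level, the chance of at least one false positive is about 64%, which is the joke behind the jelly bean comic.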

*"If you torture the data long enough, it will confess."* - Ronald Coase

## Study Design

#### Elements of good study design

Your stakeholder wants to know if it's better to sell cars in the morning auction or the afternoon auction.

- What are your null and alternate hypotheses?
- If you could design the auction in a way to test this, what would you do?
    - how would you design the two auctions?
    - what would you be measuring?

- **control** for variables you're not interested in testing
    - e.g. make sure you don't sell Fiestas in the morning and Porsches in the afternoon

- make sure your samples are **representative**
    - remember the different sampling biases?
    - are the samples large enough for you to see the desired effect?

### Power

While statistical significance is the term you’ll hear most often, many people forget about statistical power. 

Where the significance level is the probability of seeing an effect when none exists, power is the probability of seeing an effect when one actually exists.

So when power is low, there is a big chance that a real winner goes unrecognized.

Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. 

If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down.

<img src="assets/images/power_effect.png" style="width:50%" />

Know that four quantities are tied together in any test of statistical significance; fix any three and the fourth is determined:

- the effect size
- the sample size (N)
- the alpha significance criterion (α)
- statistical power, or the chosen or implied beta (β)

However, for practical purposes, the main thing to know is that 80% power is the conventional target for testing tools. To reach that level you need a large sample size, a large effect size, or a longer-running test (which collects more data).
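Power can be estimated by simulation: generate many experiments in which a real effect exists and count how often the test detects it. This is a sketch under made-up assumptions (a true difference of half a standard deviation, normally distributed data, a two-sample t-test); `estimate_power` is a hypothetical helper, not a library function.

```python
import numpy as np
from scipy import stats

def estimate_power(effect, std, n, alpha=0.05, n_simulations=1000, seed=0):
    """Fraction of simulated experiments in which a real effect is detected."""
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_simulations):
        control = rng.normal(0, std, n)
        treatment = rng.normal(effect, std, n)  # a true difference exists
        t_stat, p_value = stats.ttest_ind(control, treatment)
        detections += p_value < alpha
    return detections / n_simulations

# Hypothetical effect of half a standard deviation (5 units with std 10)
power_by_n = {n: estimate_power(effect=5, std=10, n=n) for n in (10, 30, 64, 150)}
for n, power in power_by_n.items():
    print(f"n = {n:>3} per group: estimated power = {power:.2f}")
```

For an effect of this size, power should climb with the sample size and reach roughly 80% at around 64 per group. One minus the estimated power is the Type II error rate.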

<a id="statistical-tests"></a>
## Statistical Tests

<img src="assets/images/testing_1.png" style="width:100%" />

## Practical example: A/B testing

In web analytics, A/B testing (bucket tests or split-run testing) is a randomized experiment with two variants, A and B.

- It includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics. 

- A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

- Multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. 

- Simple A/B tests are not valid for observational, quasi-experimental or other non-experimental situations, as is common with survey data, offline data, and other, more complex phenomena.
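As a short illustration, an A/B test on conversion rates can be framed as a test of independence on a 2x2 table. The visitor counts below are invented, and the chi-squared test is only one reasonable choice here (a two-proportion z-test is another).

```python
from scipy import stats

# Hypothetical results: [conversions, non-conversions] per variant
variant_a = [120, 880]   # 12.0% of 1000 visitors converted
variant_b = [165, 835]   # 16.5% of 1000 visitors converted

# Null hypothesis: the conversion rate does not depend on the variant
chi2, p_value, dof, expected = stats.chi2_contingency([variant_a, variant_b])

alpha = 0.05
print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis of equal conversion rates")
else:
    print("Fail to reject the null hypothesis of equal conversion rates")
```

The wording of the conclusion follows the same convention as before: the test either rejects or fails to reject the null hypothesis, and never proves that one variant is better.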