# Statistical Inference: Intro to Hypothesis Testing

By Carl Shan with credit to [Cyrille Rossant](https://cyrille.rossant.net/) and [Chris Ketelsen](https://www.cs.colorado.edu/~ketelsen/index.html) for inspiration and examples.

# Instructions

This notebook introduces the idea of hypothesis testing to you through simulation, examples and exercises.

Please go through this notebook, completing the exercises as you come across them.

Here are the major sections of this notebook:

1. [Start Here: You're at the Carnival](#Start-Here:-You're-at-the-Carnival)
2. [Intro to Hypothesis Testing](#Introduction-to-Hypothesis-Testing)
3. [Hypothesis Testing: Understanding Through Statistics](#Hypothesis-Testing:-Understanding-through-Statistics)
4. [The Z-Test](#A-new-idea:-The-Z-Test)
5. [More Programming and Probability Exercises](#More-Programming-and-Probability-Exercises)

## Start Here: You're at the Carnival

> Imagine you're at the carnival.
> 
> ![Carnival](https://today.uic.edu/files/2017/10/IMG_0513-1000x667.jpg)
>
> A carnie waves to you, drawing you to their booth. 
>
> They tell you the following: you get to play a coin-flip game for \$10. If the carnie flips a `tails`, you win \$25. Otherwise, if they flip `heads`, you don't win anything.
>
> Each time you play this game, you have to pay \$10.




**Exercise 1**: Is this a good bet for you? Why or why not? Write your answer below.

If you are unsure, you can also simulate playing this game and calculate your average profit/loss over many runs.

In [None]:
### YOUR ANSWER OR CODE HERE ###








> Now, say you decide to play this game.
> 
> You hand over \$10 and the carnie flips the coin.
> 
> The coin flips `heads`.
> 
> "No problem," you say to yourself under your breath. You fork over another \$10. `heads` again. That's not good. You're out \$20.
>
> You furrow your brow and frown. Something doesn't seem right here.
> 
> Suspicious, you declare that you won't play another round until you have had a chance to inspect the coin.
> 
> The carnie smiles slyly, handing over the coin.
> 
> You take it, weighing it in your hand. You can't tell for sure, but you think there's some *small, nearly imperceptible* difference in weight from coins you have handled in the past.
> 
> You decide to conduct a test. You quickly flip the coin 100 times in a row.
> You see the following data.
> ```python
       ['H', 'T', 'H', 'T', 'T', 'T', 'H', 'H', 'T', 'T', 'H', 'H', 'H',
       'H', 'H', 'T', 'T', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'H',
       'H', 'H', 'H', 'H', 'H', 'H', 'T', 'H', 'H', 'H', 'T', 'H', 'H',
       'H', 'H', 'H', 'H', 'H', 'T', 'T', 'H', 'T', 'T', 'H', 'H', 'T',
       'T', 'H', 'T', 'T', 'H', 'H', 'H', 'H', 'H', 'T', 'H', 'H', 'T',
       'T', 'T', 'H', 'H', 'H', 'H', 'T', 'T', 'H', 'T', 'H', 'H', 'H',
       'H', 'H', 'T', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H',
       'H', 'T', 'T', 'H', 'T', 'H', 'H', 'H', 'H']
```

**Exercise 2**: Copy the `list` above into the code cell below. Write some code that counts the total number of heads that occurred.

In [None]:
### YOUR CODE HERE ###








Hmmm, okay. So you do seem to get a suspiciously large number of heads. But is it a rigged game? Or did this come up by chance?

How can you tell, and what evidence do you have?

Fortunately, we have various statistical methods that we can use to answer questions like this. 

### Exercise 3

1. In the cell below, describe if you think this data indicates that the coin is rigged. Why or why not?
2. How many `heads` would have to come up in `100` flips for you to be *certain* that the coin is rigged? 
3. How many `heads` would have to come up in `100` flips for you to be *pretty confident* that the coin is rigged? 
4. How many `heads` would have to come up in `100` flips for you to be *somewhat confident* that the coin is rigged? 
5. How did you come up with the number of `heads` you require for questions `2` through `4`?

In [None]:
"""
Write your responses here:

1. Do you think this data indicates that the coin is rigged? Why or why not?




2. How many heads would have to come up in 100 flips for you to be certain that the coin is rigged?




3. How many heads would have to come up in 100 flips for you to be pretty confident that the coin is rigged?




4. How many heads would have to come up in 100 flips for you to be somewhat confident that the coin is rigged?




5. How did you come up with the number of heads you require for questions 2 through 4?



"""


# Introduction to Hypothesis Testing


## What you will learn
Alright, so you came up with some numbers in the previous exercises. Now we're going to make things more precise. 

You're going to learn a statistical method to collect data and figure out if the coin is fair or unfair to *any degree of certainty you choose*.


## What is hypothesis testing?
So you have some data. And what you're wondering is which of the two situations you're in:

    Hypothesis 1. The coin is fair. You're not being ripped off. The data just happened to end up the way that it is due to pure, random chance.
    Hypothesis 2. The coin is unfair. The data is that way because the coin is biased.
    

In statistics, we call `Hypothesis 1` the **null** hypothesis because it represents *no strange phenomena occurring*. Usually the **null hypothesis** is the one we suspect is false and are looking for evidence against.

![Hypothesis testing](http://statisticslectures.com/images/null1.gif)

In this example, `Hypothesis 2` is what we are actually suspicious of. We call it our **alternative hypothesis**.


## Which hypothesis is true: null or alternative?

So which situation are you more likely to be in: are you being ripped off with an unfair coin (**alternative hypothesis**) or did the coin just generate the data by chance (**null hypothesis**)?

Statisticians have invented a way to help you figure things out. It's called `Statistical Hypothesis Testing`.

It helps you decide which hypothesis is better supported by your data.

Read on to learn how it works.

## Statistical Hypothesis Testing: How it works

Statistical hypothesis testing's approach is to answer the following: IF the coin IS fair, HOW LIKELY is it to have created this data?

If even a fair coin is pretty likely to have created the data we saw above, then we shrug our shoulders and say, "Well I guess I can't really conclude that the coin is unfair."

But if there's only a tiny chance (e.g., `0.0000000000001%`) of a fair coin producing the data we saw, then we say "Ah ha! It's very unlikely Hypothesis 1 is true, so it's much more likely for Hypothesis 2 to be correct."

Alright, we've more clearly defined our problem. 

    Original question: Which hypothesis is right? Hypothesis 1: coin is fair or Hypothesis 2: coin is unfair?

    New question: how likely is it that a fair coin could have produced this data?

So we now have to figure this out.
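To make the new question concrete on a smaller case, here is a minimal sketch that computes the exact chance of a fair coin producing at least a given number of heads. The function name `prob_at_least_heads` is made up for this illustration, and the example deliberately uses 10 flips rather than the carnival data, so the exercises below are still yours to work through.

```python
from math import comb

def prob_at_least_heads(k, n, p=0.5):
    """Probability of getting k or more heads in n independent flips,
    where each flip lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a fair coin shows 9 or more heads in 10 flips (roughly 1%)
print(prob_at_least_heads(9, 10))
```

A result this small is the kind of evidence that would make us doubt the coin is fair. The simulations in the exercises estimate the same kind of probability by counting instead of by formula.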

### Modeling Coin Flips

Let's write some code to help us understand how to perform hypothesis testing.

**Exercise 4**: Write some code that simulates a coin flip. Use this code to simulate flipping the coin 100 times and calculate the number of heads that show up in those 100 flips.

In [None]:
### YOUR CODE HERE








**Exercise 5:** Call what you did in **Exercise 4** a "trial". Now, write some code that performs this trial 1000 times. In other words, your code should simulate the following 1000 times:
* flip 100 coins
* count the number of heads

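Then plot a histogram of the number of heads that you calculated. (Remember to run `%matplotlib inline` and `import matplotlib.pyplot as plt` if you want to plot in the notebook. The older `%pylab inline` also works but is discouraged.)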

In [None]:
### YOUR CODE HERE









**Exercise 6**:
Answer the following questions: 
* What does the shape of the histogram remind you of? Is it similar to anything you've seen before?
* Write some code that calculates the following: What % of the time does the number of heads that turns up EQUAL OR EXCEED the suspicious number of heads that the carnival's coin generated?

**NOTE**: The above proportion is called the `p-value` in statistics (it's called `p-value` because it means `probability value`).

In [None]:
### YOUR ANSWER AND CODE HERE








**Exercise 7:** Given how rarely your simulation produced as many heads as the carnival's coin did, which hypothesis do you think is more likely? Why?

     Hypothesis 1. The coin is fair. You're not being ripped off. The data just happened to end up the way that it is due to pure, random chance.
     Hypothesis 2. The coin is unfair. The data is that way because the coin is biased.

In [None]:
### YOUR ANSWER AND EXPLANATION HERE







***

## Hypothesis Testing: Understanding through Statistics

(Make sure you've completed the above exercises before moving forward into this section.)

We're now going to break down the code you wrote above into statistical terms to help you understand.

### The first statistical idea: `alpha`

Based on your simulation, the probability of the suspicious data happening if the coin was really fair should be some small percentage. 

Let's call it `X`.

Let's also come up with another idea called our `risk tolerance`. It should be a number between `0` and `1`. The idea behind the `risk tolerance` is that if `X` falls beneath our risk tolerance threshold, we are comfortable claiming that the coin is rigged.

If your `risk tolerance` is `0.10` or `10%`, that means you will only claim the coin is rigged IF a fair coin would generate data this extreme `< 10%` of the time.

Call this `risk tolerance` threshold `alpha`. (The Greek letter `alpha` is $\alpha$.)

So, if `X < alpha`, then we say the coin is rigged. 

Else, we say that we don't have sufficient evidence to claim that the coin is rigged.

For example, if `X` is really low (e.g., `0.0001%`), then with a fair coin we would see data as extreme as ours only `0.0001%` of the time. 

So if our `alpha` is `5%` or `0.05`, then we should claim the coin is rigged!

![alpha](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2014/01/p-value1.jpg)

### Seeing `alpha` in real life

The threshold that we pick for saying that the game is rigged is known in statistics as `alpha`. (It's also known as the `significance level`.) You can think of it as our "bar for rejecting Hypothesis 1". If the chance of the observed data showing up under Hypothesis 1 is lower than `alpha`, then we will say that "the data leads us to believe that Hypothesis 1 is false."

**What does it mean to pick a lower or higher `alpha`?**

The smaller you pick your `risk threshold` of `alpha` to be, the less likely you are to say that things are "rigged", because the harder it is for `X` to fall below the bar.

So `alpha` is really up to you!

There is no *right* `alpha` to pick. It depends on how risk tolerant you are.
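As a small illustration of how the choice of `alpha` changes the conclusion, the sketch below applies the `X < alpha` rule to one made-up value of `X` under three different thresholds. The function name `is_evidence_of_rigging` and the value `0.03` are assumptions for this example only; they are not computed from the carnival data.

```python
def is_evidence_of_rigging(x, alpha):
    """Return True when x, the probability of seeing data this extreme
    with a fair coin, falls below the chosen threshold alpha."""
    return x < alpha

x = 0.03  # hypothetical probability, chosen only for illustration

for alpha in (0.10, 0.05, 0.01):
    print(alpha, is_evidence_of_rigging(x, alpha))
```

The same evidence leads to "rigged" at thresholds of `0.10` and `0.05` but not at `0.01`, which is why the threshold should be chosen before looking at the data.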


**NOTE**: In statistics we have multiple names for the same idea. `alpha` is also known as the `significance level`.

**What do people typically choose as their `alpha`?**

In many fields of social science research, `alpha` is usually set to `0.05`. In other words, the data researchers get needs to be so rare that it would occur less than 5% of the time under the null hypothesis before they can declare that something funky is going on and reject the null hypothesis.




### Example to help you understand

Imagine you work in college admissions.

Two personal essays written by two students who attend the same school show up on your desk. 

As you read them, you notice something suspicious.

![Admissions Officer](https://ak0.picdn.net/shutterstock/videos/3024700/thumb/1.jpg)

Uh oh.

The two essays share the same theme, essay structure and many of the same words! In fact, you find 4 different sentences in the two essays that are **exactly the same**.

That's unusual.

Given your background, you know that this doesn't happen often. 

You suspect that there might be some plagiarism going on.

But at the same time, you don't want to falsely accuse any students of plagiarism. Maybe this just happened by chance. It's certainly possible when there are tens of thousands of students applying to college each year.

So you run the two papers through an online plagiarism detector that prints out the probability that the essays are plagiarized.

You cross your fingers and say to yourself, "unless the online plagiarism software says it's 98% confident that there's plagiarism going on, I'm not going to reject these two students. After all, I only want to act if I'm sure."

**\*\*BEEP\*\***

Your computer signals that it's done crunching the data.

You read the printout on the screen.

**"THERE IS A 97% CHANCE THAT ONE OF THE TWO ESSAYS PLAGIARIZED THE OTHER"**

You sigh.

**Exercise 8**: In the above example, what is the `null hypothesis`, `alternative hypothesis` and `alpha`?

If you were the admissions officer, would you reject the students from your college for plagiarism given the evidence and the `alpha` level? Why or why not?

In [None]:
### YOUR ANSWER AND EXPLANATION HERE








**Exercise 9**: Simulating `alpha` (aka the `significance level`)

Let's go back to the carnival example with coin flips and see if we can use our newfound knowledge of `alpha` values.

Your exercise: write a function called `is_rigged()` that takes in the following inputs:

* a parameter called `flips` which will be a list of "H" and "T".
* a threshold parameter called `alpha`

Your function should do the following:

* return `True` if, across many simulated trials with a fair coin, the number of heads in `flips` (or more) comes up in less than an `alpha` fraction of the trials. 
* return `False` otherwise.


**Example**

* For example, if you get 58 heads, how many times does 58 heads or more come up with a fair coin in 1000 trials?
* If the fraction of trials in which 58 heads or more came up is less than `alpha`, then you should return `True` because you believe things are rigged.
* But if the fraction of trials in which 58 or more heads came up is `alpha` or more, then you can't be sure it's rigged. After all, maybe it was just a fluke.

In other words, your function should be your best guess as to whether the game is rigged or not.

In [None]:
#### YOUR CODE HERE















## A new idea: The Z-Test

In the examples you've done above, you were able to simulate a coin flip a bunch of times.

What if you can't easily simulate it?

In statistics, we often don't run a lot of simulations and compare our data with how the simulations went.

Instead, we can figure out mathematically how likely our data is, based on the number of flips and the chance of heads with a fair coin.

In the above coin-flipping example at the Carnival, we had two hypotheses: 
* H1 - the coin is fair 
* H2 - the coin is unfair.

We typically call `H1` the `null hypothesis` because it represents a "normal", "no change" or `null` scenario.

We call `H2` the `alternative hypothesis` because it's what we believe to be true if the `null hypothesis` is false.

The percentage of times that we flipped a fair coin and got AS MANY or MORE heads than our data is called the `p-value`.

If the `p-value` is small and less than our `alpha` threshold, then it's unlikely our data was generated under the scenario of the `null hypothesis`. Thus we would reject the `null hypothesis` in favor of the `alternative hypothesis`.


**... I'm still a bit confused about all this stuff. `alpha`, and `p-value` and `hypothesis` ...**

If you are still confused and looking to learn more, you can use one of the resources below:

* [11 min video by Khan Academy](https://www.youtube.com/watch?v=-FtlH4svqx4)
* [Short article](https://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/)


**Okay. I'm a bit less confused now.**

Great, let's keep on going!

To get the `p-value` without simulating, we first need to calculate something called the `z-score`, which is what you'll do in this example.

### Understanding how to calculate z-score

Building on our example with coin flips, let's suppose that after `trials=100` flips, we get `heads=61` heads. We choose an `alpha` level of `0.05`: is the coin fair or not? Our null hypothesis is: the coin is fair, so `chance=0.5`.

Now it's time to calculate the z-score.

The equation for the z-score is below:

$$ zScore = \frac{numHeads - numExpectedHeads}{standardDeviation} $$

Where the quantities are computed in the following way:

> $numHeads$ is the number of heads you got.
> 
> $numExpectedHeads$ is the number of heads you'd expect to get in 100 flips with a fair coin.
> 
> $standardDeviation = \sqrt{chanceOfHeadsIfFair \times (1 - chanceOfHeadsIfFair) \times numTrials}$

**NOTE:** I know that the above formula for standard deviation is different from ones you may have learned in other math classes. Don't worry: it can be mathematically proven that the standard deviation for something like a series of coin tosses is equivalent to the above. 

The $standardDeviation$ equation above tells you roughly how far the number of heads typically lands from $numExpectedHeads$.

If you want to investigate, the formula above is the [standard deviation of the Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution#Variance).
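For the curious, here is a sketch of where that formula comes from. It writes $p$ for $chanceOfHeadsIfFair$ and $n$ for $numTrials$ (shorthand introduced only for this derivation) and counts each flip as 1 for heads and 0 for tails:

$$
\begin{aligned}
\text{mean of one flip} &= p \\
\text{variance of one flip} &= p\,(1 - p)^2 + (1 - p)\,(0 - p)^2 = p\,(1 - p) \\
\text{variance of } n \text{ independent flips} &= n\,p\,(1 - p) \\
standardDeviation &= \sqrt{p \times (1 - p) \times n}
\end{aligned}
$$

The third line relies on the flips being independent, so their variances add. As a quick check, 400 flips of a fair coin give $\sqrt{0.5 \times 0.5 \times 400} = 10$.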

**Exercise 10** Using the formula and variables defined above, calculate the z-score of the above example using the variables I define for you in the cell.

In [None]:
#### YOUR CODE HERE

numTrials = 100  # number of coin flips
numHeads = 61  # number of heads
numExpectedHeads = _____ # what is the expected number of heads you'd get in 100 flips?
chanceOfHeadsIfFair = .5  # null-hypothesis of fair coin







## What does the z-score mean?

The z-score represents *the number of standard deviations* your data is away from the average outcome, given the null hypothesis of a fair coin.

The more standard deviations away you are, the less likely that a fair coin produced your data. And the more likely that something else is going on.
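Here is a minimal sketch of that idea in code, using a different example from the exercise: 115 heads in 200 flips. The function name `z_score` and its parameter names are made up for this illustration, and it assumes the binomial standard deviation formula given earlier in this section.

```python
from math import sqrt

def z_score(num_heads, num_trials, chance_of_heads=0.5):
    """Number of standard deviations between the observed number of heads
    and the number expected under the null hypothesis."""
    expected_heads = chance_of_heads * num_trials
    standard_deviation = sqrt(chance_of_heads * (1 - chance_of_heads) * num_trials)
    return (num_heads - expected_heads) / standard_deviation

# 115 heads in 200 flips is 15 heads above the 100 we'd expect
print(z_score(115, 200))
```

Here the result is a little over 2, meaning the observed count sits about two standard deviations above what a fair coin would typically produce.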

If you would like to better understand exactly why the z-score is calculated using the equation above, you can use the resources below:

* [Watch this 7 min Khan Academy video](https://www.youtube.com/watch?v=Wp2nVIzBsE8)
* [Watch this 10 min video combining z-scores and p-values](https://www.youtube.com/watch?v=mai23vW8uFM)

## Calculating the `p-value` using the `z-score`

Now using this `z-score`, we can calculate how "extreme" it is. In other words, how likely we were to get this `z-score` if the coin was truly fair.

You're going to calculate the `p-value` given this `z-score`.

### Carl, what are `z-score` and `p-value` again?

Remember the `z-score` tells you *how many standard deviations* the data you collected is from what you would expect.

The larger the `z-score` is, the lower the chance that it was generated purely randomly and the more likely that some interference or other process is happening.

### Okay, what about `p-value`?

`p-value` is the *probability that you would have gotten data at least this extreme under normal circumstances* (that is, if the null hypothesis were true). 

Therefore the larger the `z-score` the lower the `p-value`.


### Alright, I'm ready.

Here's how you're going to do it.

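There's a function in the `scipy.stats` module called `stats.norm.cdf()` that takes a `z-score` as input and returns the **probability that your data was NOT as extreme as this z-score**. More precisely, it returns the probability that a value drawn from a standard normal (bell curve) distribution is less than or equal to the `z-score`.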
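To get a feel for what this kind of cumulative probability looks like, here is a small sketch that rebuilds the standard normal CDF from Python's built-in `math.erf`. The helper name `normal_cdf` is made up for this illustration. It computes the same quantity that `stats.norm.cdf()` returns, without needing SciPy installed.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Probability that a standard normal value is less than or equal to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# The probability climbs from near 0 toward 1 as z increases
for z in (-1.0, 0.0, 1.0, 1.96):
    print(z, round(normal_cdf(z), 3))
```

A `z-score` of `0` gives exactly `0.5`: half of all outcomes fall at or below the average. Larger z-scores push the returned probability closer to `1`.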

Now, this isn't quite what you want. But with a little bit of algebra, you can do some basic math on the output of `stats.norm.cdf()` to return the `p-value`.

Remember, the `p-value` is the **probability that your data is AS or MORE extreme than the z-score**.

Do some thinking, some basic math, and write some code below that calculates the `p-value` of the `z-score` you calculated above.

In [None]:
# You may need to `pip install scipy` in your Terminal
import scipy.stats as stats

### YOUR CODE HERE
# you will need to use `stats.norm.cdf()` and give this function the z-score that you calculated above.







***

## Additional Reading

If you still find yourself confused and wanting a review, you should read this article on [The Basics of Hypothesis Testing](http://20bits.com/article/hypothesis-testing-the-basics).

You can also see if you can complete the short problems on this [Khan Academy page](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/a/p-value-conclusions).

## More Programming and Probability Exercises

Once you have solved the exercises above, you can work on the ones below. It will help you practice your programming skills, although the exercises below will not directly cover hypothesis testing.

You can solve the questions below either by writing a program that simulates the scenario OR by mathematically computing it by hand.

**Exercise 1** 

In six coin tosses, what is the probability of having a different side come up with each throw, that is, that you never get two tails or two heads in a row?

In [None]:
### YOUR CODE AND/OR ANSWER HERE





**Exercise 2. Coin Tossing Game**

A famous coin tossing game called the [`St. Petersburg Paradox`](https://en.wikipedia.org/wiki/St._Petersburg_paradox) has the following rules: 

The player tosses a coin repeatedly until a tail appears or tosses it a maximum of 1000 times if no tail appears. 

The initial stake starts at 2 dollars and is doubled every time heads appears. The first time tails appears, the game ends and the player wins whatever is in the pot. 

Thus the player wins 2 dollars if tails appears on the first toss, 4 dollars if heads appears on the first toss and tails on the second, 8 dollars if heads appears on the first two tosses and tails on the third, and so on. 

Mathematically, the player wins $2^k$ dollars, where $k$ equals the number of tosses until the first tail. 
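If you choose to simulate, a single play of the game could be sketched as follows. The function name `play_st_petersburg` is made up for this illustration. The rules above do not say what the payout is when no tail appears within the maximum number of tosses, so paying out the current pot in that case is an assumption.

```python
import random

def play_st_petersburg(max_tosses=1000):
    """Simulate one play and return the amount the player wins."""
    pot = 2  # the initial stake
    for _ in range(max_tosses):
        if random.random() < 0.5:  # tails: the game ends and the pot is paid out
            return pot
        pot *= 2                   # heads: the pot doubles and the game continues
    # Assumption: if no tail ever appears, the player receives the current pot
    return pot

print(play_st_petersburg())
```

Running this once tells you very little. The interesting behavior shows up when you average the payout over many plays and compare it with the cost of entry.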

If it costs \$15 to play this game, should you expect to make money in the long run?

In [None]:
### YOUR CODE AND/OR ANSWER HERE





**Exercise 3**

Randomly select three distinct integers $a, b, c$ from the set of numbers $\{1, 2, 3, 4, 5, 6, 7\}$.


What is the probability that $a + b > c$?

# Advanced 

### Chuck-a-luck
***

The game Chuck-a-luck is played by rolling 3 dice and betting on a number between 1 and 6. You win your bet multiplied by the number of times your chosen number appears on the three dice. You lose your bet if your number doesn't appear at all. For example:

* If you bet $\$1$ on $5$ and you roll $\{3,~4,~5\}$ then you get to keep your $\$1$ plus you win another $\$1$.
* If you bet $\$1$ on $5$ and roll $\{4,~5,~5\}$ you get to keep your $\$1$ plus you win $\$2$.
* On the other hand, if you bet $\$1$ on $5$ and roll $\{2,~3,~4\}$ then you lose your $\$1$.

A quick look at this game may make it appear reasonably fair. Since you roll 3 dice, there seems to be a probability of $\frac{1}{2}$ that your chosen number appears, so the odds should be in your favor.

**Part A**: Let's write a function called `chuck_a_trial` that takes as its sole required parameter the integer `my_number` that you bet on and returns your winnings or losses. To make things simple, we'll assume that we always bet $\$1$ on every roll.

In [None]:
### YOUR CODE HERE






**Part B**: Write a function `chuck_a_luck_simulator` that takes the integer `my_number` that you bet on and runs many simulations of `chuck_a_trial` and computes your average winnings over all of the trials. 

To control the number of trials in your simulation, add an optional parameter `num_trials` initialized to `1000`.

In [None]:
### YOUR CODE HERE






**Part C**: Based on your simulation above, how fair or unfair is this game? Or said another way, how much do you expect to win/lose if you play this game for a very long time? Why?

In [None]:
### YOUR ANSWER AND EXPLANATION HERE






***

### Simulating Roulette
*** 

A Las Vegas roulette board contains 38 numbers $\{0, 00, 1, 2, \ldots, 36\}$. Of the non-zero numbers, 18 are red and 18 are black. You can place bets on various number/color combinations and each type of bet pays out at a different rate.  For example: 

- If you bet $\$1$ on red (or black) and win, you win $\$1$ and get your original $\$1$ back. 
- If you bet $\$1$ on any particular number and win, you win $\$35$ and get your original $\$1$ back. 
- If you bet $\$1$ on the first dozen (1-12), or second dozen (13-24), or third dozen (25-36) nonzero numbers and win, you win $\$2$ and get your original $\$1$ back. 

![alt text](https://www.lasvegasdirect.com/wp-content/uploads/2016/09/American-Roulette-Table.gif)


If you would like to better understand the game of Roulette, you can [watch this 8-min video](https://www.youtube.com/watch?v=NXUpW2QdN08).


It seems like there are so many ways to win!  In reality, some very careful probability theory was done by the game designers to ensure that there is not much difference in any particular payout.  We'll explore roulette both by simulation and by hand in this exercise. 


You can also see this website for expected payouts of each Roulette combination:

* [Roulette payouts](https://www.roulettesites.org/rules/odds/)


The following function simulates the spin of a Las Vegas roulette board.  

In [None]:
import numpy as np

def spin_roulette():
    """
        Arguments:
        
            This function takes in no arguments. 
        
        Return:
        
            It returns a string.
            
            The string will be a combination of a number and color representing Roulette combinations.

        Examples:

            >> spin_roulette()
            '1R'

            >> spin_roulette()
            '13B'
    """
    numbers = np.array(["0", "00"] + [str(num) for num in range(1, 37)])
    red = [str(num) for num in [1,3,5,7,9,12,14,16,18,19,21,23,25,27,30,32,34,36]] 
    black = [str(num) for num in [2,4,6,8,10,11,13,15,17,20,22,24,26,28,29,31,33,35]]
    green = ["0", "00"]
    number = np.random.choice(numbers)
    color = "R" if number in red else "B" if number in black else "G"
    
    return number + color



In [None]:
# Let's make sure it works. Run this cell a few times.
spin_roulette()
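Since `spin_roulette()` returns the number and the color glued together in one string, it can help to split them apart before checking a bet. The helper below is one possible way to do that. The name `parse_spin` is made up for this illustration, and it assumes the color is always the final character, as in the docstring examples above.

```python
def parse_spin(result):
    """Split a spin result such as '13B' into its number and color parts."""
    number = result[:-1]  # everything except the last character, e.g. '13' or '00'
    color = result[-1]    # 'R', 'B', or 'G'
    return number, color

number, color = parse_spin("13B")
print(number, color)
print(parse_spin("00G"))
```

The number stays a string here because `"0"` and `"00"` are different pockets on the wheel and would collapse into the same value if converted to integers.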

**Part A**: Write some code that estimates the expected winnings by betting on red (or black). 

In other words, how much do you win or lose per \$1 bet, on average, over many trials if you consistently bet on red (or black)?

You will need to estimate the probability that red (or black) shows up.

Your answer should be close to the payouts on [this website](https://www.roulettesites.org/rules/odds/).

In [None]:
### YOUR CODE HERE







**Part B**: Write a function that estimates the expected winnings by betting on a particular number.

If you write the function correctly, over 100000 or so trials the winnings should settle to be around `-2/38`, or about `-0.053`.

The designers of roulette have intentionally fixed the payouts and probability of the game such that any bet, in the long run, will lose the players money (hence the expected winnings being negative).

In [None]:
### YOUR CODE HERE







**Part C**: Write a function that estimates the expected winnings by betting on the first dozen nonzero numbers.

Test your function to see if it works correctly. Over 100000 trials, it should also be close to `-2/38`.

In [None]:
### YOUR CODE HERE





