In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
%matplotlib inline

# Module 3.1 Part 4: Sampling

In this lecture guide, you'll learn how probability can be used to analyze how individuals are sampled from a population.

5 videos make up this notebook, for a total run time of 26:18.

1. [Sampling](#section1) *1 video, total runtime 6:48*
2. [Distributions](#section2) *2 videos, total runtime 9:13*
3. [Simulations](#section3) *1 video, total runtime 2:53*
4. [Statistics](#section4) *1 video, total runtime 7:25*

Textbook readings: [Chapter 10: Sampling and Empirical Distributions](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)

<a id='section1'></a>
## 1. Sampling

Let's start with a warm-up probability question.

A population has 100 people, including Mo and Jo. We sample two people from the population at random, without replacement. Each person is equally likely to be included in the sample.

What's the probability that both Mo and Jo are in the sample? In other words, find P(both Mo and Jo are in the sample). 

<details>
    <summary>Solution</summary>
    There are two different ways that both Mo and Jo can be in the sample: (1) we draw Mo, then Jo or (2) we draw Jo, then Mo. <br>
    
    P(first Mo, then Jo) = (1/100) * (1/99)
    P(first Jo, then Mo) = (1/100) * (1/99)
    
   Since the two events are mutually exclusive, we can use the addition rule to find the probability that either happens. <br>
    
    P(both Mo and Jo are in the sample) = P(first Mo, then Jo) + P(first Jo, then Mo) 
    P(both Mo and Jo are in the sample) = (1/100) * (1/99) + (1/100) * (1/99)  
    P(both Mo and Jo are in the sample) = 2/9900 ≈ 0.0002
</details>
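
As a sanity check on the solution above, the sketch below enumerates every ordered way to draw two of the 100 people and counts the draws that contain both Mo and Jo. Labeling Mo as 0 and Jo as 1 is an arbitrary choice made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

population = range(100)   # people labeled 0..99
mo, jo = 0, 1             # arbitrary labels standing in for Mo and Jo

# Every ordered pair of distinct people is an equally likely outcome
# when two people are sampled without replacement.
draws = list(permutations(population, 2))

# Keep the draws made up of exactly Mo and Jo, in either order.
favorable = [d for d in draws if set(d) == {mo, jo}]

prob_both = Fraction(len(favorable), len(draws))
print(prob_both, float(prob_both))
```

The exact value is 2/9900 = 1/4950, which rounds to 0.0002.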

In the video below, you'll learn about the different ways that a population can be sampled. 

In [None]:
YouTubeVideo("YUA7fcT9sXU")

Run the cell below to load the top movies dataset.

In [None]:
top_movies = Table.read_table('https://www.inferentialthinking.com/data/top_movies.csv')
top_movies

In the cell below, write code that takes a uniform random sample from `top_movies` of 10 movies that grossed more than $400,000,000 after adjusting for inflation.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    top_movies.where("Gross (Adjusted)", are.above(400000000)).sample(10)
</details>

Suppose we create a sample by taking all movies released after 2010. What kind of sample is this?

<details>
    <summary>Solution</summary>
    A deterministic sample
</details>
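
To make the distinction concrete, here is a minimal sketch contrasting a deterministic sample with a uniform random sample. The titles and years are made-up placeholder data, not rows from `top_movies`, and the sketch uses Python's standard `random` module rather than the `datascience` table methods.

```python
import random

# Hypothetical (title, year) pairs used only for illustration.
movies = [("A", 2005), ("B", 2011), ("C", 2013),
          ("D", 1999), ("E", 2015), ("F", 2008)]

# Deterministic sample: membership is fixed by a rule, with no chance involved.
after_2010 = [m for m in movies if m[1] > 2010]

# Uniform random sample without replacement: every movie is equally likely to be chosen.
random_three = random.sample(movies, 3)

print(after_2010)    # the same rows every time the cell runs
print(random_three)  # typically changes from run to run
```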

<a id='section2'></a>
## 2. Distributions

In this video, you'll learn about probability distributions. A probability distribution describes all the possible outcomes of
an event and the chance of each one.

### Probability and Empirical Distributions

In [None]:
YouTubeVideo("f7z8QSovv10")

Under what circumstances will an empirical distribution look like the probability distribution?

<details>
    <summary>Solution</summary>
    When the empirical distribution is generated through many, many independent repetitions of an experiment.
</details>
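
This idea can be illustrated with a small simulation: roll a fair die more and more times and compare the empirical proportions with the probability of each face. The six-sided die, the seed, and the numbers of rolls are choices made for this sketch, not values taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed fixed so the sketch is repeatable
faces = np.arange(1, 7)          # a fair six-sided die (assumed for illustration)
probability = 1 / 6              # probability distribution: each face has chance 1/6

max_gaps = {}
for n in (10, 1_000, 100_000):
    rolls = rng.choice(faces, size=n)
    # Empirical distribution: proportion of rolls showing each face.
    empirical = np.array([np.mean(rolls == f) for f in faces])
    # Largest difference between an empirical proportion and 1/6.
    max_gaps[n] = np.max(np.abs(empirical - probability))
    print(n, np.round(empirical, 3), round(max_gaps[n], 4))
```

With more repetitions, the largest gap between the empirical proportions and 1/6 typically shrinks.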

### Large Random Samples

In [None]:
YouTubeVideo("z6tlWBbhEGM")

<a id='section3'></a>
## 3. Simulations

When we don't know the probability distribution of an event, we can simulate many trials to generate an empirical distribution that approximates the probability distribution.

In [None]:
YouTubeVideo("bZAH45VowH0")

In the cell below, simulate rolling 20 fair 10-sided dice and return the proportion of rolls that show a "7".

In [None]:
dice = np.arange(1, 11)  # faces 1 through 10; np.arange excludes the stop value
...

<details>
    <summary>Solution</summary>
    
    np.count_nonzero(np.random.choice(dice, 20) == 7) / 20
</details>

<a id='section4'></a>
## 4. Statistics

We often want to learn about a fixed, unknown parameter of a population but don't have the ability to take a census.
Instead, we can estimate the parameter with a statistic computed from a random sample, and use simulation to see how that statistic varies from sample to sample.
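
As a minimal sketch of the difference between a parameter and a statistic, the code below builds a made-up population, treats its mean as the parameter, and computes the mean of one random sample as the statistic. The population values, sample size, and seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed fixed so the sketch is repeatable

# Hypothetical population: 10,000 integer values between 0 and 99.
population = rng.integers(0, 100, size=10_000)
parameter = population.mean()     # fixed, and usually unknown in practice

# A random sample of 200 individuals, drawn without replacement.
sample = rng.choice(population, size=200, replace=False)
statistic = sample.mean()         # computed from the sample; varies by sample

print(parameter, statistic)
```

Here the parameter can be computed only because the whole population is available. In practice, the statistic is all we observe.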

In [None]:
YouTubeVideo("6FLXlYSa8NY")

Write code that repeats the process of rolling 20 fair 10-sided dice 1000 times. Keep track of the proportion of rolls that show a "7" in each trial in the array `prop_sevens`.

In [None]:
dice = np.arange(1, 11)
trials = 1000
prop_sevens = ...

for i in ...:
   prop_sevens = ...

prop_sevens

<details>
    <summary>Solution</summary>
    
    dice = np.arange(1, 11)
    trials = 1000
    prop_sevens = make_array()

    for i in np.arange(trials):
       prop_sevens = np.append(prop_sevens, np.count_nonzero(np.random.choice(dice, 20) == 7) / 20)

    prop_sevens
</details>

Run the cell below to display the empirical distribution of your statistic. 

In [None]:
Table().with_column("Proportion of Sevens", prop_sevens).hist(bins = np.arange(0, 1.01, 0.075))
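
As a cross-check on the loop in the solution above, the same experiment can be sketched without a loop by generating all 1000 trials at once as a 1000-by-20 array. This version uses NumPy's `default_rng` generator instead of `np.random.choice`, and the names `rolls` and `prop_sevens_vec` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed fixed so the sketch is repeatable
trials, rolls_per_trial = 1000, 20

# Each row is one trial of 20 rolls; integers(1, 11) draws from 1 through 10.
rolls = rng.integers(1, 11, size=(trials, rolls_per_trial))

# Proportion of sevens in each trial: the mean of a boolean row.
prop_sevens_vec = (rolls == 7).mean(axis=1)

print(prop_sevens_vec[:5], prop_sevens_vec.mean())
```

Because a fair 10-sided die shows a 7 with probability 0.1, the average of the 1000 proportions should land close to 0.1.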