### Sampling Distributions Introduction

To get more comfortable with the idea of sampling distributions, let's practice in Python.

Below is an array that represents the students we saw in the previous videos, where 1 represents the students that drink coffee, and 0 represents the students that do not drink coffee.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])


<br>

`1.` Find the proportion of students who drink coffee in the above array. Store this value in a variable **p**.

In [None]:
p = (students == 1).mean()
p

<br>

`2.` Use numpy's **random.choice** to simulate 5 draws from the `students` array.  What proportion of your sample drinks coffee?

In [None]:
sample = np.random.choice(students,5)
p_sample = sample.mean()
p_sample

<br>

`3.` Repeat the above to obtain 10,000 additional proportions, where each sample is of size 5.  Store these in a variable called `sample_props`.

In [None]:
sample_props = []
for i in range(10000):
    s = np.random.choice(students,5)
    mean = s.mean()
    sample_props.append(mean)

sample_props = np.array(sample_props)
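
The loop above can also be expressed without an explicit Python loop. The following is a minimal sketch, assuming `np.random.choice` is given a 2-D `size` so that each row is one sample of 5 drawn with replacement (its default behavior). The name `sample_props_vec` is introduced here only for illustration, and the individual values are not guaranteed to match the loop's output.

```python
import numpy as np

np.random.seed(42)
students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Draw 10,000 samples of size 5 in one call: one row per sample.
draws = np.random.choice(students, size=(10000, 5))

# The mean of each row is that sample's proportion of coffee drinkers.
sample_props_vec = draws.mean(axis=1)

print(sample_props_vec.shape)
print(sample_props_vec.mean())
```

Either form yields 10,000 proportions; the vectorized one trades a little readability for speed.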

<br>

`4.` What is the mean of all 10,000 of these proportions?  This is often called **the mean of the sampling distribution**.

In [None]:
sample_props.mean()

<br>

`5.` What are the variance and standard deviation for the original 21 data values?

In [None]:
students.std(), students.var()  # (standard deviation, variance) of the original data

<br>

`6.` What are the variance and standard deviation for the 10,000 proportions you created?

In [None]:
sample_props.std(), sample_props.var()  # (standard deviation, variance) of the 10,000 proportions

<br>

`7.` Compute p(1-p).  Which of your answers does this most closely match?

In [None]:
p*(1-p) #The variance of the original data

<br>

`8.` Compute p(1-p)/n, where n is the sample size.  Which of your answers does this most closely match?

In [None]:
p*(1-p)/5    # The variance of the sample mean of size 5 (sample_props.var())
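
A short derivation may help explain why p(1-p)/n lines up with the simulated variance. It assumes the draws are independent, which holds here because `np.random.choice` samples with replacement by default. Write each draw as $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$, and let $\hat{p}$ be the sample proportion (notation introduced here for illustration):

$$\operatorname{Var}(X_i) = E[X_i^2] - \left(E[X_i]\right)^2 = p - p^2 = p(1-p)$$

$$\operatorname{Var}(\hat{p}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{p(1-p)}{n}$$

For this array, $p = 15/21 \approx 0.714$, so $p(1-p) \approx 0.204$ and, with $n = 5$, $p(1-p)/n \approx 0.0408$.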


<br>

`9.` Notice that your answer to `8.` is commonly called the **variance of the sampling distribution**.  If you changed the sample size from 5 to 20, what would this do to the variance of the sampling distribution?  Simulate and calculate the new answers to `6.` and `8.` to check that the consistency you found before still holds.

In [None]:
sample_props_20 = []
for i in range(10000):
    s = np.random.choice(students,20)
    mean = s.mean()
    sample_props_20.append(mean)

sample_props_20 = np.array(sample_props_20)

In [None]:
sample_props_20.var() , p*(1-p)/20
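
The same check can be repeated for several sample sizes at once. This is a sketch only: the helper `simulate_prop_variance`, the extra sample size of 80, and the use of `np.random.default_rng` (instead of the notebook's `np.random.seed`) are assumptions made for illustration.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])
p = students.mean()

def simulate_prop_variance(data, n, reps=10000, seed=42):
    """Variance of `reps` sample proportions, each from n draws with replacement."""
    rng = np.random.default_rng(seed)
    draws = rng.choice(data, size=(reps, n))  # one row per sample
    return draws.mean(axis=1).var()

# Compare the simulated variance with p(1-p)/n for each sample size.
results = {}
for n in (5, 20, 80):
    results[n] = (simulate_prop_variance(students, n), p * (1 - p) / n)
    print(n, results[n])
```

In each pair, the simulated variance should sit close to p(1-p)/n, and both should shrink as the sample size grows.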


<br>

`10.` Finally, plot a histogram of the 10,000 proportions for samples of size 5 and of the 10,000 proportions for samples of size 20.  Each of these distributions is a sampling distribution: one for proportions with a sample size of 5, and the other for proportions with a sample size of 20.

In [None]:

plt.hist(sample_props, label='size = 5')
plt.hist(sample_props_20, alpha=0.6, label='size = 20')
plt.title('Histogram of the 10,000 draws')
plt.legend()
plt.show()