# Distributions Homework
In this homework we will explore how distributions of data impact our ability to draw conclusions about observations. From weighted coins to polling responses, we will determine how much we can take away from our data.  

**Outline:**  
1) Generate distributions of data  
2) Produce sampling distributions  
3) Look at how any frequency distribution can produce a sampling distribution  
4) Review concepts of sampling distributions  

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from ipywidgets import *
from questioner import question

## Generating Distributions
To start, let's consider a probabilistic distribution. While the example uses the flip of a coin, you will be using the roll of a die. Below, we simulate flipping the coin once using <code>np.random.choice</code>. A coin flip has two possible outcomes: a 0 signifying tails or a 1 signifying heads.

In [None]:
coin_choices = [0, 1]
coin_flip = np.random.choice(coin_choices)
coin_flip
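
Because each call draws a fresh random value, rerunning a cell like the one above can change the result. If you ever need the same draws every time (for example, to compare output with a classmate), one option is a seeded generator. The sketch below is an aside rather than part of the homework; the seed value 42 and the names <code>rng_a</code> and <code>rng_b</code> are arbitrary choices for illustration.

```python
import numpy as np

coin_choices = [0, 1]

# Two generators created with the same seed (42 is an arbitrary choice)
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

flips_a = rng_a.choice(coin_choices, size=10)
flips_b = rng_b.choice(coin_choices, size=10)

# Same seed, same sequence of flips
same_flips = np.array_equal(flips_a, flips_b)
print(same_flips)
```

Without a seed, each run starts from a different state, which is why repeated runs usually disagree.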

In the cell below, reproduce the roll of a die: what happens when you repeatedly run the cell?

In [None]:
##Your Answer Here
die_choices = [...]
die_roll = ...
die_roll

By running the cell multiple times, you should see different outcomes: sometimes a 1, other times a 2, etc. Instead of rerunning the cell, we can draw many outcomes in a single call by adding a parameter to our previous function call: <code>size</code>. Below, we simulate flipping a coin 30 times.

In [None]:
coin_flips = np.random.choice(coin_choices, size=30)
coin_flips
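
The introduction mentions weighted coins. <code>np.random.choice</code> accepts an optional <code>p</code> argument giving the probability of each outcome, so a weighted coin is a small change to the call above. The sketch below assumes a hypothetical coin that lands heads 70% of the time; the name <code>weighted_flips</code> and the 0.7 weight are illustrative and not part of the homework.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the example is repeatable

coin_choices = [0, 1]
# Hypothetical weighted coin: P(tails) = 0.3, P(heads) = 0.7
weighted_flips = np.random.choice(coin_choices, size=10000, p=[0.3, 0.7])

# Heads is coded as 1, so the mean is the proportion of heads
proportion_heads = np.mean(weighted_flips)
print(proportion_heads)
```

With this many flips, the proportion of heads should land close to the weight passed in <code>p</code>.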

In the cell below, reproduce the roll of a die 100 times:

In [None]:
## Your Answer Here
dice_rolls = ...
dice_rolls

To better understand our data, we will want to visualize it. The <code>plt.hist</code> function allows you to produce a histogram of the data we generated. What do you notice about the ratio of heads to tails?

**NOTE**: the histogram function also outputs the bin counts and bin edges as arrays, which are outside the scope of this course. Just focus on the histogram.

In [None]:
plt.hist(coin_flips)
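
To put numbers on what the histogram shows, you could count each outcome directly. This is a minimal sketch that assumes a fresh set of 30 flips, so the counts will differ from run to run. <code>np.unique</code> with <code>return_counts=True</code> returns the distinct values and how often each occurs; the names <code>n_heads</code> and <code>n_tails</code> are introduced here for illustration.

```python
import numpy as np

coin_choices = [0, 1]
coin_flips = np.random.choice(coin_choices, size=30)

# Distinct outcomes and the number of times each one appears
values, counts = np.unique(coin_flips, return_counts=True)
for value, count in zip(values, counts):
    print(value, count)

# Heads is coded as 1, so summing the flips counts the heads
n_heads = int(np.sum(coin_flips))
n_tails = len(coin_flips) - n_heads
print(n_heads, n_tails)
```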

In the cell below, reproduce the histogram above with dice rolls:

In [None]:
## Your Code Here:
...

## Producing a Sampling Distribution
We just produced one sample. To be able to draw conclusions about what we observe in the sample, we need to create a *sampling distribution*. By repeatedly generating samples from the same underlying population and computing a statistic on each one, we build up the distribution of that statistic. Statisticians traditionally use the mean as the statistic of interest (it has useful mathematical properties outside the scope of this homework). Let's explore how to produce a sampling distribution.

To begin, we need to repeatedly generate samples. In the example, we need to repeat the process of flipping 30 coins 10,000 times. To achieve this, we can use a <code>for</code> loop. Every time the loop repeats, we need to follow several steps:  
1) Reproduce the sample of coin flips like in the section above.  
2) Calculate the mean using <code>np.mean</code>.  
3) Add the mean to a list outside of the for loop with <code>list.append(mean)</code>.

In [None]:
sample_coin_means = []
for _ in range(10000): #Process to repeat 10,000 times
    sample_coin_flips = np.random.choice(coin_choices, size=30) ##Draw one sample of 30 coin flips
    sample_flips_average = np.mean(sample_coin_flips) #Take the average of the coin flips
    sample_coin_means.append(sample_flips_average) #Add average to a list
sample_coin_means
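
The loop above makes each step explicit. As an aside, the same kind of sampling distribution can be drawn in one call by asking for a 2-D array with one row per sample and taking the mean of each row. This is a sketch of an alternative, not a replacement for the loop you are asked to write; the name <code>all_flips</code> is introduced here for illustration.

```python
import numpy as np

coin_choices = [0, 1]

# 10,000 samples (rows), each containing 30 coin flips (columns)
all_flips = np.random.choice(coin_choices, size=(10000, 30))

# The mean of each row gives one sample mean per sample
sample_coin_means = all_flips.mean(axis=1)

print(sample_coin_means.shape)
print(sample_coin_means[:5])
```

The result is a NumPy array rather than a list, but it can be passed to <code>plt.hist</code> in the same way.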

Below, follow the above steps, but with dice rolls rather than coin flips.

In [None]:
##Your Code Here
sample_die_means = ...
for _ in range(10000): #Process to repeat 10,000 times
    sample_die_rolls = ... ##Draw one sample of 30 die rolls
    sample_die_average = ... #Take the average of the die rolls
    sample_die_means.append(sample_die_average) #Add average to a list
sample_die_means

Now that we have our distributions, we can plot them using the <code>plt.hist()</code> function from above. Below, a histogram of the coin means:

In [None]:
plt.hist(sample_coin_means)

Using the cell above as a guide, plot a histogram of the sample means for the dice.

In [None]:
##Your Code Here
...

What do you notice about the distribution of the sample means?  

*Your answer here*


## Sampling Distribution Universality
Regardless of the underlying frequency distribution, we can generate a sampling distribution that has the same properties as any other sampling distribution. To begin, let's look at several distributions (you don't need to know how each distribution is generated; just observe its shape and the shape of the sampling distribution it produces). The code produces each distribution using a function you may not be familiar with, but that is the point: even if we don't understand how our underlying frequency distribution is generated, we can still follow the same principles.

The following distribution follows a sine function. Notice what happens when we produce a sampling distribution from this frequency function.

In [None]:
x = np.arange(1,10, step=.01)
plt.hist(np.sin(x))

Now, we can produce a sampling distribution for this frequency function. Notice how the code follows a similar procedure as above.

In [None]:
sample_dist1_means = []

for _ in range(10000):  
    sample_dist1 = np.sin(np.random.uniform(1, 10, size=30)) #Draw 30 values uniformly between 1 and 10, then apply sine
    sample_dist1_average = np.mean(sample_dist1)
    sample_dist1_means.append(sample_dist1_average)
plt.hist(sample_dist1_means)
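
One way to sanity-check this sampling distribution is to compare its center with the population mean. For values drawn uniformly between 1 and 10, the average of sin(x) is (cos(1) - cos(10)) / 9, which is about 0.153. The sketch below assumes the samples are drawn with <code>np.random.uniform(1, 10, size=30)</code>; the names <code>population_mean</code> and <code>center</code> are introduced for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed so the example is repeatable

sample_means = []
for _ in range(10000):
    sample = np.sin(np.random.uniform(1, 10, size=30))
    sample_means.append(np.mean(sample))

# Average of sin(x) for x uniform between 1 and 10:
# the integral of sin from 1 to 10, divided by the interval length 9
population_mean = (np.cos(1) - np.cos(10)) / 9
center = np.mean(sample_means)

print(population_mean, center)
```

The two printed values should be close, even though the histogram of sin(x) itself looks nothing like a bell curve.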

Let's look at another example! In this example, we will use a quadratic function to produce our frequency distribution.

In [None]:
x = np.arange(1,10, step=.01)
plt.hist(x**2)

Let's create a sampling distribution from this frequency distribution:

In [None]:
sample_dist2_means = []
for _ in range(10000):  
    sample_dist2 = np.random.uniform(1, 10, size=30)**2 #Draw 30 values uniformly between 1 and 10, then square them
    sample_dist2_average = np.mean(sample_dist2)
    sample_dist2_means.append(sample_dist2_average)
plt.hist(sample_dist2_means)
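
If a sampling distribution is roughly bell-shaped, about 68% of the sample means should land within one standard deviation of their average. The sketch below checks that for the squared-uniform example, assuming samples drawn with <code>np.random.uniform(1, 10, size=30)**2</code>; the name <code>fraction_within_one_sd</code> is introduced for illustration.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed so the example is repeatable

sample_means = []
for _ in range(10000):
    sample = np.random.uniform(1, 10, size=30)**2
    sample_means.append(np.mean(sample))
sample_means = np.array(sample_means)

avg = np.mean(sample_means)
std_dev = np.std(sample_means)

# Share of sample means no more than one standard deviation from the average
fraction_within_one_sd = np.mean(np.abs(sample_means - avg) <= std_dev)
print(fraction_within_one_sd)
```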

What conclusions can you draw about how you can generate sampling distributions?

*YOUR ANSWER HERE*

## The Importance of Sample Size
What impact does the sample size have on the sampling distribution? Recall that the sample size refers to the size of each individual sample, *not* how many samples you have in total. In the cell below, try several different sample sizes to see what effect they have.

In [None]:
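def sample_interact(sample_size):
    # Build a sampling distribution of the mean for the chosen sample size
    sample_means = []
    for _ in range(10000):
        sample = np.random.choice(die_choices, size=sample_size)
        sample_mean = np.mean(sample)
        sample_means.append(sample_mean)
    plt.hist(sample_means, bins=30)
    plt.xlim(3,4)
    std_dev = np.std(sample_means)
    avg = np.mean(sample_means)
    # Positions 1, 2, and 3 standard deviations on either side of the mean
    std_devs = np.array([(avg - i*std_dev, avg + i*std_dev) for i in range(1,4)]).flatten()
    for index, position in np.ndenumerate(std_devs):
        index = index[0]
        # Red marks 1 standard deviation, blue marks 2, yellow marks 3
        color = 'y' if index > 3 else 'b' if index > 1 else 'r'
        plt.axvline(position, color=color)
# continuous_update is a slider option, so it is set on the slider: the plot redraws only when the slider is released
interact(sample_interact, sample_size=IntSlider(min=10, max=500, value=255, continuous_update=False))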
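
The interactive plot lets you watch how tightly the sample means cluster as the sample size changes. Here is a non-interactive sketch of the same comparison, using a fair coin so it runs without your die code. For a fair coin the population standard deviation is 0.5, and the spread of the sample means is expected to be close to 0.5 / sqrt(n). The sample sizes 10, 40, and 160 and the helper name <code>spread_of_means</code> are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed so the example is repeatable

def spread_of_means(sample_size, n_samples=10000):
    """Standard deviation of the sample means for fair-coin samples."""
    flips = np.random.choice([0, 1], size=(n_samples, sample_size))
    return np.std(flips.mean(axis=1))

spreads = {}
for n in (10, 40, 160):
    spreads[n] = spread_of_means(n)
    # Simulated spread next to the theoretical value 0.5 / sqrt(n)
    print(n, round(spreads[n], 4), round(0.5 / np.sqrt(n), 4))
```

Each sample size is four times the previous one, which makes it easy to compare how the spread changes from one row to the next.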

Now, we can use the above interactive to help us answer some questions:

**QUESTION 1:** Decreasing the sample size, while holding the confidence level the same, will
do what to the length of your confidence interval?

In [None]:
question('make it bigger')
question('make it smaller')
question('it will stay the same')
question('cannot be determined from the given information')

**QUESTION 2:** Decreasing the confidence level, while holding the sample size the same, will
do what to the length of your confidence interval?

In [None]:
question('make it bigger')
question('make it smaller')
question('it will stay the same')
question('cannot be determined from the given information')

**QUESTION 3:** A 95% confidence interval for the mean number of televisions per American household
is (Lower limit: 1.15, Upper limit 4.20). For each of the following statements about the
above confidence interval, select the option if it is true, and leave it blank if it is false.

In [None]:
question('We are 95% confident that the true mean number of televisions per American household is between 1.15 and 4.20.')
question('95% of all American households have between 1.15 and 4.20 televisions.')
question('Of 100 intervals calculated the same way (95%), we expect 95 of them to capture the population mean.')
question('Of 100 intervals calculated the same way (95%), we expect 100 of them to capture the sample mean.')

For the following questions, we will be using the below scenario:  
*Researchers are concerned about the impact of students working while they are enrolled in classes, and they’d like to know if students work too much and therefore are spending less time on their classes than they should be. First, the researchers need to find out, on average, how many hours a week students are working. They know from previous studies
that the standard deviation of this variable is about 5 hours.*

**QUESTION 4:** A survey of 225 students provides a sample mean of 7.10 hours worked. What is a 95% confidence interval based on this sample?

In [None]:
## YOUR ANSWER HERE
lower_bound = 7.10 - 1.96 * (5/np.sqrt(225))
upper_bound = ...
lower_bound, upper_bound
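
The template above is built on the usual recipe for a confidence interval when the population standard deviation is known: the sample mean plus or minus z times sigma / sqrt(n), with z = 1.96 for 95% confidence. Below is a minimal sketch of that recipe as a reusable function, using made-up numbers (mean 50, standard deviation 10, sample size 100) rather than the survey values; the function name <code>z_interval</code> is hypothetical.

```python
import math

def z_interval(sample_mean, sigma, n, z=1.96):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    margin = z * sigma / math.sqrt(n)  # margin of error
    return sample_mean - margin, sample_mean + margin

# Made-up example values, not the survey data
lower, upper = z_interval(sample_mean=50, sigma=10, n=100)
print(round(lower, 2), round(upper, 2))
```

The interval is symmetric around the sample mean, so the lower bound subtracts the margin of error and the upper bound adds it.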

**QUESTION 5:** Suppose that this confidence interval was (6.82, 7.38). Which of these is a
valid interpretation of this confidence interval?

In [None]:
question('There is a 95% probability that a randomly selected student worked between 6.82 and 7.38 hours.')
question('We are 95% confident that the average number of hours worked by students in our sample is between 6.82 and 7.38 hours.')
question('We are 95% confident that the interval between 6.82 and 7.38 hours contains the average number of hours worked by all UF students.')

**QUESTION 6:** We have 95% confidence in our interval, instead of 100% confidence, because we need to account for the fact that:

In [None]:
question('the sample may not be truly random.')
question('we have a sample, and not the whole population.')
question('the distribution of hours worked may be skewed')
question('all of the above')

**QUESTION 7:** The researchers are not satisfied with their confidence interval and want to do another study to find a shorter confidence interval. Which of the following best describes what they should do?

In [None]:
question('They should increase their confidence level and increase their sample size.')
question('They should increase their confidence level or decrease their sample size.')
question('They should decrease their confidence level or increase their sample size.')
question('They should decrease their confidence level and decrease their sample size.')

## Saving Your Notebook
Now that you've finished the homework, we need to save it! To do this, click <code>File</code> $\rightarrow$ <code>Download as</code> $\rightarrow$ <code>PDF via Chrome</code>.