# Homework 8: Central Limit Theorem, Normal and t-distributions

**Reading**: 
* [Estimation](https://www.inferentialthinking.com/chapters/13/Estimation.html)
* [Why the mean matters](https://www.inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html)

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

As usual, **run the cell below** to import the modules.

In [1]:
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)

from scipy import stats
from scipy.stats import norm
from scipy.stats import t

## 1. The CLT and Book Reviews

On homework 7 your friend recommended the book *Data Feminism* by D'Ignazio and Klein, so you decided to look at reviews for the book on *goodreads* and *Amazon* just to be sure that it's worth buying. 

In this section you will again investigate reviews for a recommended book. Let's say that on Amazon, the book only has 80% positive reviews. On Goodreads, it has 91% positive reviews. You decide to investigate a bit further by looking at the percentage of positive reviews for the book on 5 different websites that you know of, and you collect these positive review percentages in a file called `book_reviews.csv`. Run the cell below to load the file into a table.

In [2]:
book_reviews = Table.read_table("book_reviews.csv") 
book_reviews

Positive Review Percentage
80
91
78
50
85


**Question 1.1**. Calculate the average percentage of positive reviews from your sample and assign it to `initial_sample_mean`.

<!--
BEGIN QUESTION
name: q1_1
manual: false
-->

In [3]:
initial_sample_mean = np.mean(book_reviews.column(0))
initial_sample_mean

76.799999999999997

You've calculated the average percentage of positive reviews from your sample, so now you want to do some inference using this information. 

**Question 1.2**. First, simulate 10,000 bootstrap resamples of the positive review percentages. For each bootstrap resample, calculate the resample mean and store the resampled means in an array called `resample_positive_percentages`. A histogram is then plotted of the resampled means.

<!--
BEGIN QUESTION
name: q1_2
manual: false
-->

In [None]:
resample_positive_percentages = make_array()

for i in np.arange(10000):
    resample = ...
    resample_avg_positive = ...
    resample_positive_percentages = ...
    
# Do NOT change these lines.
(Table().with_column("Average % of Positive Reviews in Resamples", 
                     resample_positive_percentages).hist("Average % of Positive Reviews in Resamples"))

**Question 1.3**. What is the shape of the empirical distribution of the average percentage of positive reviews based on our original sample? At what value is the distribution centered? Assign your answer to the variable `initial_sample_mean_distribution`. Your answer should be either `1`, `2`, `3`, or `4`, corresponding to the following choices:

*Hint: Look at the histogram in Question 1.2. Run the cell that generated the histogram a few times to check your intuition.*

1. The distribution is approximately normal because of the Central Limit Theorem, and it is centered at the original sample mean.
2. The distribution is not necessarily normal because the Central Limit Theorem may not apply, and it is not centered at the original sample mean.
3. The distribution is approximately normal because of the Central Limit Theorem, but it is not centered at the original sample mean.
4. The distribution is not necessarily normal because the Central Limit Theorem may not apply, and it is centered at the original sample mean.

<!--
BEGIN QUESTION
name: q1_3
manual: false
-->

In [None]:
initial_sample_mean_distribution = ...

According to the Central Limit Theorem, the probability distribution of the sum or average of a *large random sample* drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn. 
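
The statement above can be checked with a small simulation. The sketch below is only an illustration: it uses a hypothetical right-skewed population (exponential, not the book-review data), plain NumPy rather than the `datascience` library, and a hand-written `skewness` helper. It compares how lopsided the distribution of sample means is for a small and a large sample size.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly right-skewed population (NOT the book-review data).
population = rng.exponential(scale=10.0, size=100_000)

def skewness(values):
    """Third standardized moment: close to 0 for a symmetric distribution."""
    centered = values - np.mean(values)
    return np.mean(centered ** 3) / np.std(values) ** 3

def sample_means(sample_size, repetitions=5_000):
    """Means of `repetitions` random samples drawn with replacement."""
    draws = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return draws.mean(axis=1)

means_small = sample_means(5)     # small samples
means_large = sample_means(200)   # large samples

print("population skewness:      ", round(skewness(population), 2))
print("skewness of means, n=5:   ", round(skewness(means_small), 2))
print("skewness of means, n=200: ", round(skewness(means_large), 2))
```

Under these assumptions, the means of the larger samples should be much closer to symmetric than the means of the smaller samples, which is the behavior the theorem describes for *large* samples.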

**Question 1.4**. Note the statement about the sample being large and random. Is this sample large and random? Give a brief explanation.

<!--
BEGIN QUESTION
name: q1_4
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

Though you have an estimate of the true percentage of positive reviews (the sample mean), you want to measure how variable this estimate is. 

**Question 1.5**. Find the standard deviation of your resampled average positive review percentages, which you stored in `resample_positive_percentages`, and assign the result to `resampled_means_variability`.

<!--
BEGIN QUESTION
name: q4_5
manual: false
-->

In [None]:
resampled_means_variability = ...
resampled_means_variability

This estimate is pretty variable! To make the estimate less variable, let's say you found a way to randomly sample reputable marketplaces from across the web which sell this book. Let's say that there are 43 of these marketplaces. The percentages of positive reviews are loaded into the table `more_book_reviews`.

In [None]:
# Just run this cell
more_book_reviews = Table.read_table("more_book_reviews.csv")
more_book_reviews

In the next question, we'll test an important result of the Central Limit Theorem. According to the CLT, the standard deviation of all possible sample means can be calculated using the following formula:

$$
\text{SD of all possible sample means} = \dfrac{\text{Population SD}}{\sqrt{\text{sample size}}}
$$

This formula gives us another way to approximate the SD of the sample means other than calculating it empirically. We can test how well this formula works by calculating the SD of sample means for different sample sizes.

The following code calculates the SD of sample means using the CLT and empirically for a range of sample sizes. Then, it plots a scatter plot comparing the SD of the sample means calculated with both methods. Each point corresponds to a different sample size. 

In [None]:
# Just run this cell. It's not necessary for you to read this code, but you can do 99% of this on your own!
# Note: this cell might take a bit to run.

def empirical_sample_mean_sd(n):
    sample_means = make_array()
    for i in np.arange(500):
        sample = more_book_reviews.sample(n).column('Positive Review Percentage') 
        sample_mean = np.mean(sample) 
        sample_means = np.append(sample_means, sample_mean) 
    return np.std(sample_means)

def predict_sample_mean_sd(n):
    return np.std(more_book_reviews.column(0)) / (n**0.5)

sd_table = Table().with_column('Sample Size', np.arange(1,44))
predicted = sd_table.apply(predict_sample_mean_sd, 'Sample Size')
empirical = sd_table.apply(empirical_sample_mean_sd, 'Sample Size')
sd_table = sd_table.with_columns('Predicted SD', predicted, 'Empirical SD', empirical)
sd_table.scatter('Sample Size')
plots.ylabel("SD of Sample Mean");

**Question 1.6**. Assign the numbers corresponding to all true statements to an array called `sample_mean_sd_statements`. 
1. The SD of the sample means is proportional to the square root of the sample size.
2. The SD of the sample means is proportional to 1 divided by the square root of the sample size. 
3. The empirical and predicted sample mean SDs will be approximately equal to each other when the sample size is large.
4. The empirical and predicted sample mean SDs will be exactly equal to each other when the sample size is large.
5. The plot above shows that the size of the population affects the SD of the sample means.
6. The plot above shows that as our sample size increases, our estimate for the true percentage of positive reviews becomes more accurate. 

<!--
BEGIN QUESTION
name: q1_6
manual: false
-->

In [None]:
sample_mean_sd_statements = ...

The code below now uses the `more_book_reviews` table to construct a 95% confidence interval. Run the cell below.

In [None]:
sample_means = make_array()

for i in np.arange(5000):
    sample_means = np.append(sample_means, np.mean(more_book_reviews.sample().column('Positive Review Percentage')))

left_end = percentile(2.5, sample_means)
right_end = percentile(97.5, sample_means)
    
Table().with_column("Mean Percentages of Positive reviews", 
                     sample_means).hist()
plots.plot([left_end,right_end], [0, 0], color='gold', lw=10);

print("We predict with 95% confidence that the true mean percentage of positive reviews for this book is between", 
      np.round(left_end,1),
      'and',
      np.round(right_end,1),
      'percent'
     )

## 2. Back to Data Feminism Book Reviews

On homework 7 you investigated reviews for *Data Feminism* by D'Ignazio and Klein from *goodreads*, where the book had 861 reviews, with 756 of them at 4 stars or above (out of 5 stars). You constructed a 90% confidence interval for the true proportion of all reviews of the book that are 4 stars or above using code similar to the code shown below. Run the cell below.

In [None]:
resample_star_props = make_array()

for i in np.arange(10000):
    resample_star_props = np.append(resample_star_props, sample_proportions(861, make_array(756/861, 1 - 756/861)).item(0))

left_end_df = percentile(5, resample_star_props)
right_end_df = percentile(95, resample_star_props)
    
Table().with_column("Proportion of Star Ratings at 4 or Above in Resamples", 
                     resample_star_props).hist("Proportion of Star Ratings at 4 or Above in Resamples", bins = 15)
plots.plot([left_end_df,right_end_df], [0, 0], color='gold', lw=10);

print("We predict with 90% confidence that the true proportion of 4 star or higher reviews for Data Feminism is between", 
      np.round(left_end_df,3),
      'and',
      np.round(right_end_df,3)
     )

**Question 2.1.** Determine the mean and standard deviation of the 10000 simulated values. Assign these values to `mean_simulation` and `sd_simulation`.

In [None]:
mean_simulation = ...
sd_simulation = ...

mean_simulation, sd_simulation

The normal distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is: 

![Normal_Distribution](https://github.com/JA-McLean/STOR120/blob/880f113bfb44fdc94399de8f4a2b991736d8b2fd/Images/normal_dist.jpg?raw=true)

The parameter &mu; is the mean of the distribution, while the parameter &sigma; is its standard deviation. 

Normal distributions are important in statistics partly due to the central limit theorem. It states that, under some conditions, the distribution of the average of many independent samples approaches a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.

To construct a confidence interval for the true proportion of all reviews of the book that are 4 stars or above (out of 5 stars), we could use a normal distribution centered at the sample proportion $\hat{p}$ and with a standard deviation:


$$
\text{SD of all possible sample proportions} = \sqrt{\dfrac{\hat{p}\,(1 - \hat{p})}{\text{sample size}}}
$$
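
As a quick worked illustration of this formula, the sketch below uses made-up counts (240 positive outcomes out of 400, not the *Data Feminism* numbers) and compares the formula with the SD of simulated sample proportions. The simulation uses NumPy's binomial sampler as a stand-in for `sample_proportions`; all variable names here are hypothetical.

```python
import math
import numpy as np

# Hypothetical numbers for illustration only.
sample_size = 400
successes = 240
p_hat = successes / sample_size   # 0.6

# SD of all possible sample proportions, from the formula above.
sd_formula = math.sqrt(p_hat * (1 - p_hat) / sample_size)

# Empirical check: simulate many sample proportions and take their SD.
rng = np.random.default_rng(1)
simulated_props = rng.binomial(sample_size, p_hat, size=20_000) / sample_size
sd_simulated = np.std(simulated_props)

print("formula:  ", round(sd_formula, 4))   # about 0.0245
print("simulated:", round(sd_simulated, 4))
```

The two values should agree closely, though not exactly, because the simulated SD is itself an estimate.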

**Question 2.2.** Calculate the mean and standard deviation for the normal distribution that can be used to approximate the empirical bootstrap distribution constructed above to determine the 90% confidence interval. Assign these values to `mean_normal` and `sd_normal`.

In [None]:
mean_normal = ...
sd_normal = ...

mean_normal, sd_normal

Run the cell below to again plot the empirical bootstrap distribution constructed above to determine the 90% confidence interval and also the plot of the normal distribution.

In [None]:
Table().with_column("Proportion of Star Ratings at 4 or Above in Resamples", 
                     resample_star_props).hist("Proportion of Star Ratings at 4 or Above in Resamples", bins = 15)
plots.plot([left_end_df,right_end_df], [0, 0], color='gold', lw=10);

x = np.arange(min(resample_star_props),max(resample_star_props), 0.0001)
plots.plot(x, norm.pdf(x, mean_normal, sd_normal));

For the standard normal distribution shown below, with a mean of 0 and standard deviation of 1, there exists some value `z_star` such that &plusmn; `z_star` bounds the middle C * 100% of the area under the normal distribution.

![normal_dist_C_percent](https://github.com/JA-McLean/STOR120/blob/master/Images/normal_dist_C_percent.png?raw=true)

For example, in the distribution below, there is some value of `z_star` that bounds the middle 80% of the area under the standard normal distribution. To determine this value, run the cell below. The code will determine the value that separates the bottom 10% of the area under the curve from the top 90%, as well as the value that separates the bottom 90% from the top 10%. Since the normal distribution is symmetric, these two values have the same absolute value.

![normal_dist_C_percent](https://github.com/JA-McLean/STOR120/blob/master/Images/normal_dist_80_percent.png?raw=true)

In [None]:
# run this cell
norm.ppf(0.1), norm.ppf(0.90)
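
The two quantiles from the cell above can be sanity-checked by going in the other direction: start from the cutoff and compute the area between its negative and positive values. The sketch below is one way to do that. It uses `statistics.NormalDist` from the Python standard library in place of `scipy.stats.norm` (its `inv_cdf` method plays the role of `norm.ppf`), and the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Cutoff with 90% of the area below it (same quantile as norm.ppf(0.90)).
z_80 = standard_normal.inv_cdf(0.90)

# Area between -z_80 and +z_80 under the standard normal curve.
middle_area = standard_normal.cdf(z_80) - standard_normal.cdf(-z_80)

print(round(z_80, 4))         # 1.2816
print(round(middle_area, 4))  # 0.8
```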

**Question 2.3.** From the above example you can see that if we go approximately 1.28 standard deviations above and below the mean of the distribution, we would bound the middle 80% of the area under the distribution. How many standard deviations above and below the mean would contain the middle 90% of the area under the standard normal distribution? Assign this value to `z_star`.

In [None]:
z_star = ...
z_star

Below is a normal distribution with a mean of `mean_normal` and a standard deviation of `sd_normal` with the middle 90% of the area under the distribution shaded to represent the 90% confidence interval. 

![normal_dist_C_percent](https://github.com/JA-McLean/STOR120/blob/master/Images/normal_dist_df.png?raw=true)

To determine the bounds of this confidence interval, you will need to use `mean_normal` and go `z_star` standard deviations (`sd_normal`) above and below `mean_normal`.

**Question 2.4.** Use `mean_normal`, `z_star`, and `sd_normal` to find the 90% confidence interval for the true proportion of 4 star or higher reviews for *Data Feminism*. Assign values to `left_end_df_norm` and `right_end_df_norm`.

In [None]:
left_end_df_norm = ...
right_end_df_norm = ...

left_end_df_norm, right_end_df_norm

How does this compare with the confidence interval constructed earlier with the bootstrapping process?

## 3. t-distribution and Book Reviews

Student's t-distribution (or simply the t-distribution) is a probability distribution that is used when estimating the mean of a population in situations where the **sample size is small** and the **population's standard deviation is unknown**. It was developed by English statistician William Sealy Gosset, who first published it in 1908 in the scientific journal *Biometrika* under the pseudonym "Student" because his employer preferred staff to use pen names when publishing scientific papers. While working at the Guinness Brewery in Dublin, Ireland, Gosset was interested in the problems of small samples, for example the chemical properties of barley measured from small samples.

![t_Distribution](https://github.com/JA-McLean/STOR120/blob/master/Images/t_dist.jpg?raw=true)

The t-distribution is symmetric and bell-shaped, like the normal distribution. However, the t-distribution has heavier tails, meaning that it is more prone to producing values that fall far from its mean. If we take a sample of n observations from a distribution, then the t-distribution with n-1 degrees of freedom can be defined as the distribution of the location of the sample mean relative to the true mean, divided by the sample standard deviation, after multiplying by the standardizing term sqrt(n). In this way, the t-distribution can be used to construct a confidence interval for the true mean.
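
The definition above can be made concrete with a simulation. The sketch below is an illustration under stated assumptions: a hypothetical normal population with a known mean, samples of size 5, and plain NumPy. It builds the statistic described in the paragraph and checks how often it lands beyond &plusmn;1.96, the cutoffs that leave about 5% of the area in the tails of a standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: normal population with a known mean, sampled n at a time.
true_mean, true_sd, n, repetitions = 50.0, 10.0, 5, 20_000

samples = rng.normal(true_mean, true_sd, size=(repetitions, n))
sample_means = samples.mean(axis=1)
sample_sds = samples.std(axis=1, ddof=1)   # sample SD (divides by n - 1)

# Location of the sample mean relative to the true mean, divided by the
# sample SD, then multiplied by the standardizing term sqrt(n).
t_statistics = (sample_means - true_mean) / sample_sds * np.sqrt(n)

# Share of simulated statistics that fall beyond +/- 1.96.
share_beyond = np.mean(np.abs(t_statistics) > 1.96)
print("share of t-statistics beyond +/-1.96:", round(share_beyond, 3))
```

With samples this small, the share should come out noticeably larger than 5%, which is what "heavier tails" means in practice.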

The t-distribution plays a role in a number of widely used statistical analyses, including Student's t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. 

**Question 3.1.** Determine the mean and standard deviation of the values from the simulation in Section 1 of this homework used to construct a 95% confidence interval for the true mean percentage of positive reviews for the (unnamed) book. Assign these values to `mean_simulation2` and `sd_simulation2`.

In [None]:
mean_simulation2 = ...
sd_simulation2 = ...

mean_simulation2, sd_simulation2

To construct a confidence interval for the true mean percentage of positive reviews for the (unnamed) book, we could use a t-distribution centered at the sample mean and with a standard deviation:


$$
\text{SD of all possible sample mean percentages} = \dfrac{s}{\sqrt{\text{sample size}}}
$$

where `s` is the standard deviation of the sample.

**Question 3.2.** Calculate the mean and standard deviation for the t-distribution that can be used to approximate the empirical bootstrap distribution constructed above to determine the 95% confidence interval for the true mean percentage of positive reviews for the (unnamed) book. Assign these values to `mean_t` and `sd_t`.

In [None]:
mean_t = ...
sd_t = ...

mean_t, sd_t

As shown in the image above with William Gosset, the shape of the t-distribution, unlike the normal distribution, depends on the degrees of freedom, which in this situation is the sample size minus one. The larger the sample size, the more the t-distribution looks like the normal distribution. For smaller sample sizes the tails are heavier, since there is more uncertainty in using the standard deviation of the sample as a prediction for the standard deviation of the population.

For the t-distribution shown below, with a mean of 0, standard deviation of 1, and sample size of 25, there exists some value `t_star` such that &plusmn; `t_star` bounds the middle C * 100% of the area under the t-distribution.
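
One way to see this dependence on the degrees of freedom is to compare the same quantile across several t-distributions. The sketch below assumes SciPy is available (as imported at the top of this notebook) and uses arbitrary, illustrative degrees of freedom. It looks at the cutoff with 90% of the area below it, which bounds the middle 80% of each distribution.

```python
from scipy.stats import norm, t

# Cutoff with 90% of the area below it under the standard normal curve.
normal_cutoff = norm.ppf(0.90)

# Same cutoff for t-distributions with illustrative degrees of freedom.
t_cutoffs = {df: t.ppf(0.90, df) for df in (2, 10, 24, 1000)}

for df, cutoff in t_cutoffs.items():
    print("df =", df, "cutoff =", round(cutoff, 4))
print("normal cutoff =", round(normal_cutoff, 4))
```

The cutoffs should shrink toward the normal value as the degrees of freedom grow, reflecting the lighter tails of the t-distribution for larger samples.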

![t_dist_C_percent](https://github.com/JA-McLean/STOR120/blob/master/Images/t_dist_C_percent.png?raw=true)

For example, in the distribution below, there is some value of `t_star` that bounds the middle 80% of the area under the t-distribution. To determine this value, run the cell below. The code will determine the value that separates the bottom 10% of the area under the curve from the top 90%, as well as the value that separates the bottom 90% from the top 10%. Since the t-distribution is symmetric, these two values have the same absolute value.

![t_dist_C_percent](https://github.com/JA-McLean/STOR120/blob/master/Images/t_dist_80_percent.png?raw=true)




In [None]:
# run this cell
t.ppf(0.1, 25 - 1), t.ppf(0.90, 25 - 1)

**Question 3.3.** How many standard deviations above and below the mean would contain the middle 95% of the area under the t distribution used to approximate the empirical bootstrap distribution constructed above to determine the 95% confidence interval for the true mean percentage of positive reviews for the (unnamed) book? Assign this value to `t_star`.

In [None]:
t_star = ...
t_star

**Question 3.4.** Use `mean_t`, `t_star`, and `sd_t` and the t distribution to find the 95% confidence interval for the true mean percentage of positive reviews for the (unnamed) book. To determine the bounds of this confidence interval, you will need to use `mean_t` and go `t_star` standard deviations (`sd_t`) above and below `mean_t`.

Assign values to `left_end_t` and `right_end_t`.

In [None]:
left_end_t = ...
right_end_t = ...

left_end_t, right_end_t

How does this compare with the confidence interval constructed earlier with the bootstrapping process?

Congratulations! You are done with homework 8.