In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

### CE 93: Lab Assignment 08

You must submit the lab to Gradescope by the due date. You will submit the zip file produced by running the final cell of the assignment.

## About this Lab
The objective of this assignment is to work with confidence intervals using both the methods we learned in lecture and simulation.

## Instructions 
**Run the first cell, Initialize Otter**, to import the autograder and submission exporter.

Throughout the assignment, replace `...` with your answers. We use `...` as a placeholder, and these should be deleted and replaced with your answers.

Any part listed as a "<font color='red'>**Question**</font>" should be answered to receive credit.

**Please save your work after every question!**

To read the documentation on a Python function, you can type `help()` and add the function name between parentheses.

**Run the cell below**, to import the required modules.

In [None]:
# Please run this cell, and do not modify the contents
import math
import numpy as np
import scipy
import pandas as pd
import statistics as stats
import cmath
import re
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import hashlib
import ipywidgets as widgets
from ipywidgets import FileUpload
from IPython.display import display
from PIL import Image
import os
import resources
from scipy.stats import *                    
import random
import statsmodels.graphics.gofplots as sm

def get_hash(num):
    """Helper function for assessing correctness"""
    return hashlib.md5(str(num).encode()).hexdigest()

## About Lab 08

In this lab, we will learn how to use `Python` to make interval estimates of distribution parameters based on observed data. For an unknown parameter $\theta$, a point estimate $\hat{\theta}$ is a single-value estimate of the parameter. 

An interval estimate of the parameter $\theta$, usually denoted $(\hat{\theta}_{lower},\hat{\theta}_{upper})$, is an interval constructed so that it contains the true value of the parameter with $100(1-\alpha)$% confidence, where $\alpha$ is the significance level (how willing you are to be wrong). The most common value of $\alpha$ is $0.05$, which corresponds to a 95% confidence level. This means that if you were to take many samples and compute a confidence interval for each sample, you would expect about 95% of the intervals to contain the true population parameter and about 5% to miss it.
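
The "many samples" interpretation above can be checked with a short simulation. The sketch below is only an illustration: the population mean, standard deviation, and sample size are made-up values, and it uses the known-$\sigma$ interval $\overline{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ that is introduced next.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical population parameters, chosen only for illustration
mu_true, sigma, n = 50.0, 8.0, 25
alpha = 0.05
n_repeats = 2000

rng = np.random.default_rng(0)

# Half-width of the two-sided interval when sigma is known
half_width = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)

# Draw n_repeats samples of size n and compute the mean of each sample
sample_means = rng.normal(mu_true, sigma, size=(n_repeats, n)).mean(axis=1)

# An interval contains mu_true when the sample mean is within half_width of it
covered = np.abs(sample_means - mu_true) <= half_width
coverage = covered.mean()

print(f'Fraction of intervals containing the true mean: {coverage:.3f}')
```

The printed fraction should be close to, but not exactly, 0.95, because the coverage is itself estimated from a finite number of repetitions.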

In the lecture, we saw that a two-sided confidence interval for the population mean can be given by the expression:

$$\overline{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$$

where $ \overline{x}$ is the sample mean, $\sigma$ is the population standard deviation, $n$ is the sample size, and $z_{\alpha /2}$ is the multiplier corresponding to the critical values of the standard normal $z$-distribution. 

The above equation is only applicable when:
* The underlying population is normal and the population standard deviation $\sigma$ is known

or
* The sample size $n$ is large enough, even if the population standard deviation is unknown and the samples come from an unknown distribution. In this case, the sample standard deviation $s$ is used instead of $\sigma$.

When the sample size $n$ is not large enough and the population standard deviation is unknown, Student's $t$-distribution should be used. The $t$-distribution can be used only if the population follows a normal distribution. In this case, a two-sided confidence interval for the population mean can be given by the expression:

$$\overline{x} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}}$$

The critical values ($z_{\alpha/2}$ or $t_{\alpha/2}$) can be easily found in `Python` using the `.ppf()` method of common distributions. For example, `norm.ppf(0.95)` returns $z_{0.05}$. In general, `norm.ppf(1-p)` returns $z_{p}$, where $p$ is the area/probability to the right of $z_{p}$. If you are interested in a one-sided confidence interval, $p = \alpha$. Otherwise, for a two-sided confidence interval, $p = \alpha/2$.

For the $t$-distribution, `t.ppf(1-p, df=n-1)` returns $t_{p}$. Note that you have to specify the degrees of freedom `df` in this case, which is equal to the sample size minus one: $n-1$. If you are interested in a one-sided confidence interval, $p = \alpha$. Otherwise, for a two-sided confidence interval, $p = \alpha/2$.
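
As a quick illustration of the `.ppf()` calls described above, the sketch below uses a hypothetical case (a 90% confidence level and a sample size of 20) that is different from the cases in the questions. The variable names are ours and are not used elsewhere in the lab.

```python
from scipy.stats import norm, t

# Hypothetical case: 90% confidence level and sample size n = 20
alpha = 0.10
n = 20

# Two-sided interval: the area to the right of the critical value is alpha/2
z_two_sided = norm.ppf(1 - alpha / 2)
t_two_sided = t.ppf(1 - alpha / 2, df=n - 1)

# One-sided interval: the area to the right of the critical value is alpha
z_one_sided = norm.ppf(1 - alpha)
t_one_sided = t.ppf(1 - alpha, df=n - 1)

print(f'Two-sided 90%: z = {z_two_sided:.3f}, t = {t_two_sided:.3f}')
print(f'One-sided 90%: z = {z_one_sided:.3f}, t = {t_one_sided:.3f}')
```

For the same tail area, the $t$ multiplier is larger than the $z$ multiplier, and the gap shrinks as the degrees of freedom increase.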

Let's practice obtaining these confidence multipliers using Python.

<font color='red'>**Question 1.0.**</font> Using `Python`, find the critical value (i.e. confidence multiplier) for each of the following cases. In all of these cases, assume the population standard deviation is unknown and that the samples come from a normal random variable. Use the appropriate distribution in each case ($z$- or $t$-distribution). Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (0.7 pts)

1. A sample size of 50 and a two-sided 95% confidence interval. Assign your answer to `q1_0`.
2. A sample size of 10 and a two-sided 95% confidence interval. Assign your answer to `q1_1`.
3. A sample size of 50 and a one-sided 95% confidence interval. Assign your answer to `q1_2`.
4. A sample size of 15 and a one-sided 95% confidence interval. Assign your answer to `q1_3`.

In [None]:
# ANSWER CELL

# n = 50 and two-sided 95%
q1_0 = ...
print(f'n = 50 and two-sided 95% critical value: {q1_0:.3f}' if not isinstance(q1_0, type(Ellipsis)) else None)

# n = 10 and two-sided 95%
q1_1 = ...
print(f'n = 10 and two-sided 95% critical value: {q1_1:.3f}' if not isinstance(q1_1, type(Ellipsis)) else None)

# n = 50 and one-sided 95%
q1_2 = ...
print(f'n = 50 and one-sided 95% critical value: {q1_2:.3f}' if not isinstance(q1_2, type(Ellipsis)) else None)

# n = 15 and one-sided 95%
q1_3 = ...
print(f'n = 15 and one-sided 95% critical value: {q1_3:.3f}' if not isinstance(q1_3, type(Ellipsis)) else None)

In [None]:
grader.check("q1_0")

## Monthly Average UV Irradiance

We will work with a data set of monthly average UV irradiance (in $mW/m^2$) in California. The data were collected every month for the years 2005-2015 (source: https://ephtracking.cdc.gov/DataExplorer/#/).

**UV irradiance** is the radiant power arriving at a surface per unit area. It is an important environmental factor that reflects exposure to sunlight and UV. The data were measured at noon, when the dose of UV irradiance is usually highest. The data set represents environmental exposures per unit area and does not directly account for personal exposures at an individual level.

<center><figure>
  <img src="https://www.sciencefacts.net/wp-content/uploads/2024/01/Irradiance.jpg" style="width:50%">
    <figcaption style="text-align:center"><strong> <br> UV Irradiance: </strong> <a href="https://www.sciencefacts.net/wp-content/uploads/2024/01/Irradiance.jpg">(https://www.sciencefacts.net/)</a></figcaption>   
</figure></center>

### Load the data

Let's load the provided data set `UV_irradiance.csv`. It has three features:

|Feature|Units|Description|
|:-|:-|:-|
|Year | N/A| The year in which the measurement was made|
|Month | N/A| Number corresponding to the month of the measurement (e.g., 1 for January)|
|Monthly average UV irradiance at noon| $mW/m^2$ | Average UV irradiance values for each month of a year in California|

We will load the data using the pandas `read_csv()` function.

Run the cell below, which reads the data and saves it as a variable named `df`.

In [None]:
# read a .csv file as a DataFrame
df = pd.read_csv('resources/UV_irradiance.csv')

# returns the first 5 rows of the data set by default
df.head()

### Create Variables from the DataFrame

We want to generate a data vector for year, month, and monthly average UV irradiance at noon.

<font color='red'>**Question 2.0.**</font> Create a separate variable for each column in the DataFrame. You can refer to previous labs to answer this question. (0.3 pts)
- Create a variable `year` for the year the measurement was taken
- Create a variable `month` for the month the measurement was taken
- Create a variable `UV` for the irradiance measurements

In [None]:
# ANSWER CELL
# create variables for year, month, UV

year = ...
month = ...
UV = ...

In [None]:
grader.check("q2_0")

The data set has a total of 132 UV measurements, which correspond to monthly measurements for 11 years: 
* $11 \text{ years} \times 12 \text{ months/year} = 132 \text{ measurements}$ 

<font color='red'>**Question 2.1.**</font> If we want to plot the change in average UV irradiance with time over the 132 months, what graphical method should we use? Assign your answer to the variable `q2_1` as a string. (0.25 pts)

**A.** Scatterplot \
**B.** Histogram \
**C.** Boxplot \
**D.** Line Graph \
**E.** Bar chart \
**F.** None of the above

Your answer should be a string, e.g., `"A"`, `"B"`, etc.\
Remember to put quotes around your answer choice.

In [None]:
# ANSWER CELL
q2_1 = ...
q2_1

In [None]:
grader.check("q2_1")

### Winter UV Irradiance

We will first examine the average UV irradiance for the winter months (months = 12, 1, and 2). Run the code cell below to create a data vector with average UV irradiance during winter.

In [None]:
# Run the code cell below to create a data vector with average UV irradiance during winter

# return UV values only for months = 12, 1, and 2
winter_UV = UV[(month==12)|(month==1)|(month==2)]
print(f'There are {len(winter_UV)} measurements during winter months.')

In real-life applications, we might not always have a large enough sample. So, to make our analysis more interesting, we will select a sample of size 10 from the winter measurements.

### Random Sampling

Next, we will select a random subsample of size $n=10$ from the full sample of the average UV irradiance during the winter months (which has a size of $33$).

We can select a random sample using `random.choices(sequence, k=k)`, where `sequence` is the data set we want to sample from and `k` is the sample size (`k` must be passed as a keyword argument). Note that `random.choices()` samples **with replacement**, so the same observation can be selected more than once. The `random.choices()` function does not directly work with DataFrame inputs. So we will convert our data to a `list` and then take a random sample from it.

We will specify the random seed at the beginning of the code so that everyone gets the same sample.
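
To see why the seed matters, here is a minimal sketch with made-up values (not the UV data). Resetting the seed before each call makes `random.choices()` return the same sample.

```python
import random

# Hypothetical measurements, used only to illustrate the call
values = [3.1, 4.7, 5.2, 6.0, 7.4]

# Same seed before each draw, so both draws are identical
random.seed(0)
first_draw = random.choices(values, k=3)

random.seed(0)
second_draw = random.choices(values, k=3)

print(f'First draw:  {first_draw}')
print(f'Second draw: {second_draw}')
print(f'Identical draws: {first_draw == second_draw}')
```

Without the second `random.seed(0)` call, the two draws would generally differ.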

Run the code cell below to take a random sample of size $10$ from the full sample of the average UV irradiance during the winter months.

In [None]:
#set the random seed equal to 99
random.seed(99)

# select a random sample
winter_UV_sample = random.choices(list(winter_UV), k=10)

print(f'The selected sample is: {winter_UV_sample} mW/m^2')

### Confidence Intervals for Mean

Next, we want to create confidence intervals for the average UV irradiance during the **winter** months using the sample of size 10 (saved as `winter_UV_sample`). In this section, we will create these intervals using the same methods we discussed in the lecture.

If we only have the sample standard deviation $s$, a two-sided confidence interval for the population mean can be obtained by the expression:

$$\overline{x} \pm c \dfrac{s}{\sqrt{n}}$$

where $c$ is the critical value:
* $z_{\alpha/2}$ if the sample is large enough
* $t_{\alpha/2}$ if the sample is small and the population follows a normal distribution

So first, we need to obtain our point estimates:
* sample mean, $\overline{x}$
* sample standard deviation, $s$
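
The formula above can be sketched as a small helper function. This is only an illustration under stated assumptions: `mean_confidence_interval` is a hypothetical helper (it is not part of the lab or of any library), and the measurements are made up.

```python
import numpy as np
from scipy.stats import norm, t

def mean_confidence_interval(data, alpha=0.05, use_t=True):
    """Two-sided 100(1-alpha)% interval for the mean based on s (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    x_bar = data.mean()
    s = data.std(ddof=1)  # sample standard deviation

    # Critical value c: t_{alpha/2} with n-1 degrees of freedom, or z_{alpha/2}
    if use_t:
        c = t.ppf(1 - alpha / 2, df=n - 1)
    else:
        c = norm.ppf(1 - alpha / 2)

    half_width = c * s / np.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Made-up concrete strength measurements (MPa)
strength = [31.2, 29.8, 33.5, 30.1, 32.4, 28.9, 31.7, 30.6]

lower_t, upper_t = mean_confidence_interval(strength, use_t=True)
lower_z, upper_z = mean_confidence_interval(strength, use_t=False)

print(f'95% interval using t: ({lower_t:.3f}, {upper_t:.3f}) MPa')
print(f'95% interval using z: ({lower_z:.3f}, {upper_z:.3f}) MPa')
```

Both intervals are centered on the same sample mean; only the multiplier $c$ differs between them.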

<font color='red'>**Question 3.0.**</font> Calculate the sample mean and **sample** standard deviation of the sample of the average UV irradiance during the winter months (data saved as `winter_UV_sample`). Assign your answers to `mean_winter_UV` and `stdev_winter_UV`. Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (0.5 pts)

*Note:* We saw in Lab 02 that `np.std()` takes an optional parameter `ddof`: "Delta Degrees of Freedom". By default, this is 0, which returns the population standard deviation. To get the sample standard deviation, you need to specify `ddof=1`. *Just a heads up, remember this for later questions and labs; we won't nag you about it again.*
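
As a reminder of how `ddof` changes the result, here is a minimal sketch with a small made-up data set (not the UV data).

```python
import numpy as np

# Made-up data with mean 5 and sum of squared deviations 32
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Default ddof=0 divides by n (population standard deviation)
pop_sd = np.std(data)

# ddof=1 divides by n-1 (sample standard deviation)
sample_sd = np.std(data, ddof=1)

print(f'ddof=0: {pop_sd:.4f}')
print(f'ddof=1: {sample_sd:.4f}')
```

The `ddof=1` value is always the larger of the two, and the difference is most noticeable for small samples.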

In [None]:
# ANSWER CELL

# Get sample mean
mean_winter_UV = ...
print(f'Sample Mean of UV irradiance during winter: {mean_winter_UV:.3f} mW/m^2' if not isinstance(mean_winter_UV, type(Ellipsis)) else None)

# Get sample stdev
stdev_winter_UV = ...
print(f'Sample Standard Deviation of UV irradiance during winter: {stdev_winter_UV:.3f} mW/m^2' if not isinstance(stdev_winter_UV, type(Ellipsis)) else None)

In [None]:
grader.check("q3_0")

### $z$-Statistic Confidence Intervals

<font color='red'>**Question 3.1.**</font> Using the $z$-statistic, what is a two-sided 95% confidence interval for the mean of the average UV irradiance during the winter? Assign the lower estimate of the confidence interval to `q3_1_lower` and the upper estimate to `q3_1_upper`. Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (0.5 pts)

In [None]:
# ANSWER CELL

q3_1_lower = ...
q3_1_upper = ...

print(f'95% confidence interval using z: ({q3_1_lower:.3f}, {q3_1_upper:.3f}) mW/m^2' if not isinstance(q3_1_lower, type(Ellipsis)) and not isinstance(q3_1_upper, type(Ellipsis)) else None)

In [None]:
grader.check("q3_1")

### $t$-Statistic Confidence Intervals

<font color='red'>**Question 3.2.**</font> Using the $t$-statistic, what is a two-sided 95% confidence interval for the mean of the average UV irradiance during the winter? Assign the lower estimate of the confidence interval to `q3_2_lower` and the upper estimate to `q3_2_upper`. Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (0.5 pts)

In [None]:
# ANSWER CELL

q3_2_lower = ...
q3_2_upper = ...

print(f'95% confidence interval using t: ({q3_2_lower:.3f}, {q3_2_upper:.3f}) mW/m^2' if not isinstance(q3_2_lower, type(Ellipsis)) and not isinstance(q3_2_upper, type(Ellipsis)) else None)

In [None]:
grader.check("q3.2")

### Interpretation of Results

<font color='red'>**Question 3.3.**</font> Compare your confidence intervals from the $z$- and $t$-statistics. What can you say about the confidence intervals in this case? Assign ALL that apply to the variable `q3_3`. (0.75 pts)

**A.** It is acceptable to use the $z$-statistic \
**B.** It is not acceptable to use the $z$-statistic \
**C.** It is acceptable to use the $t$-statistic \
**D.** It is acceptable to use the $t$-statistic only if the population is normal \
**E.** It is acceptable to use the $t$-statistic only if the sample mean is normal \
**F.** The confidence interval based on the $t$-statistic is wider than that based on the $z$-statistic \
**G.** The confidence interval based on the $t$-statistic is narrower than that based on the $z$-statistic \
**H.** The difference between the two confidence intervals ($z$ vs. $t$) is due to the confidence level \
**I.** The difference between the two confidence intervals ($z$ vs. $t$) is due to the confidence multiplier

Answer in the next cell. Add each selected choice as a string and separate the answer choices with commas. For example, if you want to select `"A"` and `"B"`, your answer should be `"A", "B"`.\
Assign your answer to the given variable.
Remember to put quotes around each answer choice.

In [None]:
# ANSWER CELL
q3_3 = ...
q3_3

In [None]:
grader.check("q3_3")

<font color='red'>**Question 3.4.**</font> If we were using all the winter measurements in `winter_UV` (which would be a sample of size 33), and still computing a two-sided 95% confidence interval, which of the following would be True? Assign ALL that apply to the variable `q3_4`. (0.75 pts)

**A.** It would be acceptable to use the $z$-statistic \
**B.** It would not be acceptable to use the $z$-statistic \
**C.** The confidence interval based on the $z$-statistic would be exact \
**D.** The confidence interval based on the $z$-statistic would be an approximation \
**E.** The confidence interval (based on $z$ or $t$) would likely be narrower compared to that based on the sample of size 10 \
**F.** The confidence interval (based on $z$ or $t$) would likely be wider compared to that based on the sample of size 10 \
**G.** If using $z$, the difference in interval width compared to a sample of size 10 would be due to the confidence multiplier \
**H.** If using $z$, the difference in interval width compared to a sample of size 10 would be due to $\sqrt{n}$ in the equation \
**I.** If using $t$, the difference in interval width compared to a sample of size 10 would be due to the confidence multiplier \
**J.** If using $t$, the difference in interval width compared to a sample of size 10 would be due to $\sqrt{n}$ in the equation

Answer in the next cell. Add each selected choice as a string and separate the answer choices with commas. For example, if you want to select `"A"` and `"B"`, your answer should be `"A", "B"`.\
Assign your answer to the given variable.
Remember to put quotes around each answer choice.

In [None]:
# ANSWER CELL
q3_4 = ...
q3_4

In [None]:
grader.check("q3_4")

### Data Histogram

In the lecture, we mentioned that when the population standard deviation is unknown and the sample size is small, the $t$-statistic is fully applicable **only if the population follows a normal distribution**. Estimating confidence intervals for populations that do not follow a normal distribution and small samples requires careful consideration.

Let's assume that `winter_UV` represents the population. The confidence intervals you computed above using the sample of size 10 require that the population be normally distributed (in this case, `winter_UV` be normally distributed). Let's see if this is actually the case. If the assumption of normal distribution is not valid, we cannot use a $t$-distribution to get confidence intervals.

<font color='red'>**Question 4.0.**</font> In the code cell below, plot a frequency histogram of `winter_UV` with `bins=7` and assign it to the variable `histogram`. (0.25 pts)

In [None]:
# ANSWER CELL

# Do not modify this line for grading purposes
import matplotlib.pyplot as plt

# create figure and axes
fig_1, ax_1 = plt.subplots(nrows=1, ncols=1, figsize=(4,3))

# Edit the code below to plot a frequency histogram of winter_UV (only edit where you have ...)

# Plot frequency histogram. Assign the plot to the variable histogram.
histogram = ...

# Label the axes
...

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
grader.check("q4.0")

It should be obvious that the data in this case do not come from a normal distribution. So the confidence intervals we computed above are not accurate.

### Q-Q Plot

The histogram can be used to judge, in a rough way, if the data plausibly come from a normal distribution. A better graphical way to judge whether a theoretical distribution is a good fit is a Quantile-Quantile (Q-Q) plot. Recall that quantiles divide the observations in a sample into intervals with equal probabilities. We can choose any number of intervals. For example, we can divide the observations into four equal parts, each with 25% of the data (the dividing values are commonly called quartiles). Or we can divide the observations into one hundred equal parts, each with 1% of the data (the dividing values are commonly called percentiles).

A Q-Q plot (Quantile-Quantile plot) is commonly used to assess whether sample data follow a specific theoretical distribution (in most cases normal distribution). A Q-Q plot compares the observed quantiles of the sample data and theoretical quantiles if the data followed a specific theoretical distribution. If the data fall on a straight line (45$^\circ$ angle), this suggests that the observed quantiles of the sample data follow those of the selected theoretical distribution, and hence, the assumed distribution would be reasonable. Otherwise, if the data do not fall on a straight line, the assumed distribution would not be a good fit.
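
To make the idea concrete, the sketch below computes the two sets of quantiles by hand for a made-up normal sample, without plotting. The plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here; plotting libraries may use a slightly different one.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 200 values generated from a normal distribution
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=200)
n = sample.size

# Observed quantiles are simply the sorted data
observed_q = np.sort(sample)

# Probabilities assigned to each sorted observation (assumed convention)
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles of a normal distribution fitted to the sample
theoretical_q = norm.ppf(probs, loc=sample.mean(), scale=sample.std(ddof=1))

# If the fit is good, the points (theoretical_q, observed_q) lie near the 45-degree line
r = np.corrcoef(theoretical_q, observed_q)[0, 1]
print(f'Correlation between theoretical and observed quantiles: {r:.4f}')
```

A correlation close to 1 is consistent with the points falling near a straight line; it is a rough numerical summary and not a replacement for looking at the plot.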

The following figures compare Q-Q plots for data that follow a normal distribution versus data that do not follow a normal distribution.

<br>

<center><figure>
  <img src="https://www.reneshbedre.com/assets/posts/qq/qq_compare.webp?ezimgfmt=rs:900x490/rscb2/ngcb2/notWebP" style="width:70%">
    <figcaption style="text-align:center"><strong> <br> Example Q-Q Plots: Good fit (left) versus poor fit (right) </strong> <a href="https://www.reneshbedre.com/assets/posts/qq/qq_compare.webp?ezimgfmt=rs:900x490/rscb2/ngcb2/notWebP">(https://www.reneshbedre.com/)</a></figcaption>   
</figure></center>

To create a Q-Q plot, we can use the `qqplot()` function from the `statsmodels` library. You can read more about it in the documentation [here](https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html). The default theoretical distribution to compare the observations to is the standard normal distribution $Z \sim N(0, 1)$. The most important input parameters are:

`qqplot(data, loc=0, scale=1, line=None, ax=None)`

where:
* `data`: a sequence with the observations
* `loc`: location parameter, which is the mean for a normal distribution. Default is 0.
* `scale`: scale parameter, which is the standard deviation for a normal distribution. Default is 1.
* `line`: the reference line to which the data is compared. Generally, you want to use `line='45'`
* `ax`: Axes object. If given, the Q-Q plot is created and plotted on the Axes object.

Below is a detailed example of how to create a Q-Q plot in Python using `qqplot()`. Instead of using actual observations, we create random data sets for illustration. The code generates one random sample from a normal distribution and a second from a lognormal distribution. Then, the code generates Q-Q plots for both data sets to check the fit relative to a theoretical normal distribution.

Read then run the code below.

In [None]:
# set the seed number
np.random.seed(4)

# create figure and axes
fig_2, ax_2 = plt.subplots(nrows=2, ncols=2, figsize=(6,6))

# generate random data from normal distribution
norm_data = norm.rvs(loc=250, scale=50, size=200)

# generate random data from lognormal distribution
lognorm_data = lognorm.rvs(s=0.25, scale=np.exp(2), size=200)

# Plot a histogram of the normal data
ax_2[0, 0].hist(norm_data, bins=12, ec='k')

# Create Q-Q plot of the normal data and show it on ax_2[0,1]
# We will specify loc and scale to be the mean and standard deviation of the observed data
sm.qqplot(norm_data, loc=np.mean(norm_data), scale=np.std(norm_data, ddof=1), line='45', ax=ax_2[0,1])

# Plot a histogram of the lognormal data
ax_2[1, 0].hist(lognorm_data , bins=12, ec='k')

# Create Q-Q plot of the lognormal data and show it on ax_2[1,1]
# We will specify loc and scale to be the mean and standard deviation of the observed data
sm.qqplot(lognorm_data, loc=np.mean(lognorm_data), scale=np.std(lognorm_data, ddof=1), line='45', ax=ax_2[1,1])

# Label the axes
ax_2[0,0].set(title = 'Normal Data Histogram',
         ylabel = 'Frequency',
         xlabel = 'Data')

ax_2[0,1].set(title = 'Normal Data Q-Q Plot')
    
ax_2[1,0].set(title = 'Lognormal Data Histogram',
         ylabel = 'Frequency',
         xlabel = 'Data')

ax_2[1,1].set(title = 'Lognormal Data Q-Q Plot')

# Display the plot
plt.tight_layout()
plt.show()

In a Q-Q plot, the x-axis displays the theoretical quantiles, which represent where your data would be if they were normally distributed. The y-axis displays your actual data. This means that if the data values fall along a roughly straight line at a 45-degree angle, then it is plausible that the data are normally distributed.

We can see in the first Q-Q plot above (upper right) that the data values tend to closely follow the 45-degree line, which means the data are likely normally distributed. This shouldn't be surprising since we generated the first data values using the `norm.rvs()` function.

In the second Q-Q plot above (lower right), the data values do not follow the 45-degree line, particularly at low and high values, which is an indication that the data do not follow a normal distribution. Again, this shouldn't be surprising since we generated the second set of data values using the `lognorm.rvs()` function.

### Data Q-Q Plot

Let's create a Q-Q plot using `winter_UV` to better assess if a normal distribution is a good fit. 

<font color='red'>**Question 4.1.**</font> In the code cell below, create a Q-Q plot of `winter_UV` against a theoretical normal distribution. The parameters of the distribution (`loc` and `scale`) should be estimated from the available sample. (0.5 pts)

In [None]:
# ANSWER CELL

# Do not modify these lines for grading purposes
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import statsmodels.graphics.gofplots as sm

# create figure and axes
fig_3, ax_3 = plt.subplots(nrows=1, ncols=1, figsize=(3,3))

# Edit the code below to create a Q-Q plot of winter_UV (only edit where you have ...)

# Create Q-Q plot
...

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
grader.check("q4.1")

So now we have further evidence that the irradiance data likely do not come from a normal population. Hence, the equations we discussed in the lecture for confidence intervals are not applicable.

This is the case because to derive the confidence interval equations, we had to determine the distribution of the estimator (in this case distribution of sample mean) and then use the corresponding table of that distribution to obtain the confidence multiplier.

For example, if we have a large enough sample, then by the Central Limit Theorem, $\overline{X}$ is approximately normally distributed, and we can use the $z$-distribution. If we have a small sample from a normal population and the population standard deviation is unknown, $\dfrac{\overline{X}-\mu}{s/\sqrt{n}} \sim t$ (with $n-1$ degrees of freedom), and we can use the $t$-distribution.
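
The effect of sample size on the distribution of $\overline{X}$ can be illustrated by simulation. The sketch below is an illustration only: it assumes a strongly right-skewed (exponential) population, which is our choice and not the UV data, and compares the skewness of the simulated sample means for a small and a larger sample size.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
n_repeats = 20000

def skew_of_sample_means(n):
    """Skewness of n_repeats simulated sample means, each from a sample of size n."""
    samples = rng.exponential(scale=1.0, size=(n_repeats, n))
    return skew(samples.mean(axis=1))

skew_small = skew_of_sample_means(5)
skew_large = skew_of_sample_means(50)

print(f'Skewness of sample means, n = 5:  {skew_small:.3f}')
print(f'Skewness of sample means, n = 50: {skew_large:.3f}')
```

A normal distribution has zero skewness, so the smaller value for $n = 50$ is consistent with the sample mean becoming closer to normal as $n$ grows.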

If we have a small sample and the population distribution is not normal, we cannot determine the distribution of $\overline{X}$, and hence, we cannot use any of the equations we derived.

Fortunately, we can use simulation to generate confidence intervals for any estimate without assuming a particular distribution for the data.

### Bootstrapping 

Confidence intervals can be obtained through simulation by "bootstrapping" our data. We treat our data as the population and draw **many many** samples from our data to get a confidence interval. To compare the confidence intervals based on bootstrapping with what you computed previously, we will use the sample of size 10, which is saved as `winter_UV_sample`.

There are different bootstrapping approaches, but here are the steps to the most common approach:

1. Draw a sample of the same size as the data **with replacement**. *If our sample has 10 observations, we will draw a random sample of size 10 with replacement.*
> Because we are sampling **with replacement**, the same data point may appear in the sample more than once. *Therefore, the samples will not be exactly the same as the data, even if both have the same size*.
2. Compute the mean (i.e., statistic) of this new sample.
3. Repeat the process many many times, each time obtaining a new sample and hence a new mean.
4. Once the resampling is done, we will have many simulated values of the sample mean (5000 in this lab).
5. Finally, we end up with a simulated distribution for the sample mean (i.e., statistic).
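
The steps above can be sketched in a vectorized form with NumPy. This is a minimal sketch under stated assumptions: the data values are made up, and it uses NumPy's random generator instead of `random.choices()`, so its output will not match the lab's seeded results.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample of 8 observations (not the UV data)
data = np.array([12.1, 9.8, 15.3, 11.0, 10.4, 18.9, 9.5, 13.2])
n_boot = 5000

# Steps 1 and 3: each row is one bootstrap sample of the same size as data,
# drawn with replacement
boot_samples = rng.choice(data, size=(n_boot, data.size), replace=True)

# Step 2: compute the mean of every bootstrap sample
boot_means = boot_samples.mean(axis=1)

# Steps 4 and 5: boot_means holds the simulated distribution of the sample mean
print(f'Original sample mean:    {data.mean():.3f}')
print(f'Mean of bootstrap means: {boot_means.mean():.3f}')
print(f'Std of bootstrap means:  {boot_means.std(ddof=1):.3f}')
```

Drawing all the samples at once as a two-dimensional array avoids an explicit loop; a loop gives the same kind of result and can be easier to read.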

<br>

<center><figure>
  <img src="https://blogs.sas.com/content/iml/files/2018/12/bootstrapSummary.png" style="width:100%">
    <figcaption style="text-align:center"><strong> <br> Bootstrapping: </strong> <a href="https://blogs.sas.com/content/iml/files/2018/12/bootstrapSummary.png">(https://blogs.sas.com/)</a></figcaption>   
</figure></center>


Let's first generate a single bootstrap sample from our data.

Read then run the code cell below multiple times and check the output.

In [None]:
# Run the code cell below to select a single bootstrap sample

# print original data
print(f'Original sample: \t {list(winter_UV_sample)} mW/m^2')

# select a random sample of the same size as the data and with replacement
winter_UV_bootstrap = random.choices(list(winter_UV_sample), k=len(winter_UV_sample))

# print the bootstrap sample
print(f'Single bootstrap sample: {list(winter_UV_bootstrap)} mW/m^2') 

# print mean of the bootstrap sample
print(f'\nMean of Single bootstrap sample: {np.mean(winter_UV_bootstrap):.3f} mW/m^2') 

It should be evident that in the bootstrap sample, some of the same values appear multiple times, and some values from the original data do not appear at all. Because this is a random sample and we are not specifying the seed number at the beginning of the code cell, every time you rerun it, you will get a different bootstrap sample.

By taking the mean of the bootstrap sample, we now have one simulated value of the sample mean. If we take more bootstrap samples and get the mean of each sample, we can obtain the distribution of the sample mean.

Now, let's select a total of **5000** bootstrap samples and calculate the mean of each sample. 

Run the code cell below. Note that here we are specifying `random.seed(99)`.

In [None]:
#set the random seed equal to 99
random.seed(99)

# specify the total number of samples to create
n_samples = 5000

# create an empty list to collect the mean of each sample (np.append below returns an array)
bootstrap_means = []

# loop through a total of n_samples times
for i in range(n_samples):
    
    # select a random sample of the same size as the data and with replacement
    winter_UV_bootstrap = random.choices(list(winter_UV_sample), k=len(winter_UV_sample))
    
    # calculate the sample mean of the bootstrapped sample
    winter_UV_bootstrap_mean = np.mean(winter_UV_bootstrap)
    
    # append the mean value to save all the means
    bootstrap_means = np.append(bootstrap_means, winter_UV_bootstrap_mean)

# print a few bootstrapped means
print(f'Sample Bootstrapped Means: [{bootstrap_means[0]}, {bootstrap_means[1]}, ..., {bootstrap_means[-1]}] mW/m^2')

<font color='red'>**Question 5.0.**</font> In the code cell below, plot a frequency histogram of `bootstrap_means`. Follow these steps: (0.5 pts)

1. Plot a frequency histogram of `bootstrap_means` with `bins=15` and assign it to the variable `histogram_2`.
2. Set the x-axis label to `'Bootstrap Means ($mW/m^2$)'` and the y-axis label to `'Frequency'`.

In [None]:
# ANSWER CELL

# Do not modify this line for grading purposes
import matplotlib.pyplot as plt

# create figure and axes
fig_4, ax_4 = plt.subplots(nrows=1, ncols=1, figsize=(4,3))

# Edit the code below to plot a frequency histogram of bootstrap_means (only edit where you have ...)

# Plot frequency histogram. Assign the plot to the variable histogram_2.
histogram_2 = ...

# Label the axes
...

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
grader.check("q5.0")

Now that we have a simulated distribution of the sample mean, we can create a confidence interval without having to make any assumptions about its distribution. Keep in mind that the histogram you created above is for the bootstrapped sample means. So the above is the simulated distribution of $\overline{X}$.

### Confidence Intervals by Bootstrapping

We can then create a confidence interval as follows:

* If we want a two-sided 95% confidence interval, we need to find the middle range of values that includes 95% of the bootstrap means
* To leave the same fraction of values in each tail, we can find the 2.5th percentile and the 97.5th percentile of the bootstrap means
* The interval between the 2.5th percentile and the 97.5th percentile would thus correspond to the middle 95% of the bootstrap means
* This would be our 95% confidence interval!
* If we want a different confidence level, simply find the lower and upper percentiles such that 100(1-$\alpha$)% of the bootstrap means are in between these percentiles. For a 99% confidence level, these would be the 0.5 and 99.5 percentiles.
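
The percentile rule in the list above can be written as a small function. This is a sketch only: `percentile_interval` is a hypothetical helper name, and the input is a made-up array of evenly spaced values, chosen so the percentiles are easy to check by hand.

```python
import numpy as np

def percentile_interval(values, confidence=0.95):
    """Middle `confidence` fraction of `values` (hypothetical helper)."""
    alpha = 1 - confidence
    lower_pct = 100 * alpha / 2          # e.g., 0.5 for 99% confidence
    upper_pct = 100 * (1 - alpha / 2)    # e.g., 99.5 for 99% confidence
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return lower, upper

# Made-up "simulated statistic" values: the integers 0, 1, ..., 1000
simulated = np.arange(1001)

lower_99, upper_99 = percentile_interval(simulated, confidence=0.99)
print(f'99% percentile interval: ({lower_99:.1f}, {upper_99:.1f})')
```

For these evenly spaced values, the 0.5th and 99.5th percentiles fall at about 5 and 995, which matches the percentiles stated above for a 99% confidence level.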

Run the code cell below to see how we generate this interval.

In [None]:
# create figure and axes
fig_5, ax_5 = plt.subplots(nrows=1, ncols=1, figsize=(4,3))

# get the 2.5 and 97.5 percentiles of the bootstrapped means
low, upper = np.percentile(bootstrap_means, [2.5, 97.5])

# plot a histogram of the bootstrapped means
ax_5.hist(bootstrap_means, bins=15, ec='k')

# plot red vertical lines at the 2.5 and 97.5 percentiles
ax_5.vlines(low, 0, 1000, 'r')
ax_5.vlines(upper, 0, 1000, 'r')

# Label the axes
ax_5.set(title = 'Histogram of Bootstrap Means of Winter UV',
         ylabel = 'Frequency',
         xlabel = 'Bootstrap Means ($mW/m^2$)',
         ylim=(0, 1000)) 

# Display the plot
plt.tight_layout()
plt.show()

# Print the intervals
print(f'95% confidence interval using Bootstrapping: \t ({low:.3f}, {upper:.3f}) mW/m^2')
print(f'95% confidence interval using z: \t \t ({q3_1_lower:.3f}, {q3_1_upper:.3f}) mW/m^2' if not isinstance(q3_1_lower, type(Ellipsis)) and not isinstance(q3_1_upper, type(Ellipsis)) else None)
print(f'95% confidence interval using t: \t \t ({q3_2_lower:.3f}, {q3_2_upper:.3f}) mW/m^2' if not isinstance(q3_2_lower, type(Ellipsis)) and not isinstance(q3_2_upper, type(Ellipsis)) else None)

In the plot above, the values to the left of the left red line correspond to the lower 2.5% of the bootstrapped means, and the values to the right of the right red line correspond to the upper 2.5% of the bootstrapped means.

Thus, the values in between the two red lines correspond to the middle 95% of the bootstrapped means. This is our 95% confidence interval based on simulation. We didn't have to use mathematical equations to construct this confidence interval. Instead, we used the simulated sample means from bootstrapping, which does not require assuming a particular distribution for the population.

### 90% Bootstrapped Confidence Interval

<font color='red'>**Question 6.0.**</font> Using the bootstrapped sample means created above (saved as `bootstrap_means`), what is a two-sided 90% confidence interval for the mean of the average UV irradiance during the winter? Assign the lower estimate of the confidence interval to `q6_0_lower` and the upper estimate to `q6_0_upper`. Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (1.0 pt)

*Hint: All you need to do is calculate the correct percentiles based on the desired confidence level.* 

In [None]:
# ANSWER CELL

# get lower and upper estimates of the 90% confidence interval
q6_0_lower, q6_0_upper = ...

# Print the interval
print(f'90% confidence interval using Bootstrapping: ({q6_0_lower:.3f}, {q6_0_upper:.3f}) mW/m^2' if not isinstance(q6_0_lower, type(Ellipsis)) and not isinstance(q6_0_upper, type(Ellipsis)) else None)

In [None]:
grader.check("q6.0")

Bootstrapping is powerful, but it's not magic: it can only work with the information available in the original sample. If the sample is not representative of the whole population, then bootstrapping will not be very accurate. Also, you might get a slightly different confidence interval each time you run the procedure, since it is based on random sampling (unless you fix the random seed).
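The run-to-run variation mentioned above can be illustrated with a short sketch. It uses synthetic data and NumPy's seeded generator rather than the lab's `random.seed` and `random.choices` calls; `demo_bootstrap_ci` is a hypothetical helper written only for this example.

```python
import numpy as np

def demo_bootstrap_ci(sample, seed, n_resamples=2000):
    """95% percentile interval for the mean (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    means = [
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(n_resamples)
    ]
    return np.percentile(means, [2.5, 97.5])

# Synthetic stand-in sample (NOT the lab's UV measurements)
demo_sample = np.random.default_rng(1).normal(100, 15, size=40)

ci_a = demo_bootstrap_ci(demo_sample, seed=10)
ci_b = demo_bootstrap_ci(demo_sample, seed=20)        # different seed
ci_a_again = demo_bootstrap_ci(demo_sample, seed=10)  # same seed as ci_a

print('seed 10:      ', ci_a)
print('seed 20:      ', ci_b)
print('seed 10 again:', ci_a_again)
```

Different seeds give slightly different endpoints, while repeating a seed reproduces the interval exactly. This is why the answer cells in this lab fix the seed.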

### Confidence Interval for Median

The nice thing about bootstrapping is that we do not need to make any assumptions about the distribution of the data. In principle, we can generate confidence intervals for **any** parameter, regardless of the distribution of the data or of the bootstrapped estimate.

We discussed the Central Limit Theorem, which says that the sum or average of many independent random variables tends toward a normal distribution. That's why our confidence intervals for the mean were generally based on a normal distribution. What if we want a confidence interval for the **median**? We do not have a simple formula for the distribution of the sample median! That's when bootstrapping can be extremely helpful.
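To illustrate that the same resampling recipe works for statistics other than the mean, here is a minimal sketch that bootstraps the sample standard deviation of a synthetic, skewed data set. The helper `bootstrap_statistic` and the variable names are hypothetical, and the sketch uses NumPy's generator rather than the `random.choices` approach used in this lab.

```python
import numpy as np

def bootstrap_statistic(sample, statistic, n_resamples=2000, seed=0):
    """Apply `statistic` to many resamples drawn with replacement (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    return np.array([
        statistic(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(n_resamples)
    ])

# Synthetic right-skewed data (NOT the lab's UV measurements)
demo_sample = np.random.default_rng(2).exponential(scale=10.0, size=60)

# Any function that maps a sample to a number can be plugged in here
boot_sds = bootstrap_statistic(demo_sample, lambda x: np.std(x, ddof=1))
sd_lower, sd_upper = np.percentile(boot_sds, [2.5, 97.5])
print(f'95% bootstrap interval for the standard deviation: ({sd_lower:.2f}, {sd_upper:.2f})')
```

Only the function passed as `statistic` changes from one parameter to the next; the resampling and percentile steps stay the same.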

<font color='red'>**Question 7.0.**</font> Select a total of 5000 bootstrap samples and obtain the median of each sample. Use appropriate Python expressions that return the median of a data set. (0.5 pts)

In [None]:
# ANSWER CELL

# set the random seed equal to 99
random.seed(99) # DO NOT CHANGE OR REMOVE THIS LINE

# Edit the code below to obtain medians (only edit where you have ...)

# specify the total number of samples to create
n_samples = ...

# create an empty list to save the median of each sample (np.append below turns it into an array)
bootstrap_medians = []

# loop through a total of n_samples times
for i in range(n_samples):
    
    # select a random sample of the same size as the data and with replacement
    winter_UV_bootstrap = random.choices(list(winter_UV_sample), k=len(winter_UV_sample))
    
    # calculate the sample median of the bootstrapped sample
    winter_UV_bootstrap_median = ...
    
    # append the median value to save all the medians
    bootstrap_medians = np.append(bootstrap_medians, winter_UV_bootstrap_median)
    
# print a few bootstrapped medians
print(f'Sample Bootstrapped Medians: [{bootstrap_medians[0]}, {bootstrap_medians[1]}, ..., {bootstrap_medians[-1]}] mW/m^2')

In [None]:
grader.check("q7.0")

<font color='red'>**Question 7.1.**</font> Calculate a 95% confidence interval for the median of the UV irradiance during the winter months and plot the results along with the histogram of `bootstrap_medians`. Follow these steps: (1.0 pt)

1. Calculate a two-sided 95% confidence interval for the median. Assign the lower estimate of the confidence interval to `q7_1_lower` and the upper estimate to `q7_1_upper`. 
2. Plot a frequency histogram of `bootstrap_medians` with `bins=11` and assign it to the variable `histogram_3`.
3. Plot vertical red lines at the confidence interval bounds extending from 0 to 1500.
4. Set the x-axis label to `'Bootstrap Medians ($mW/m^2$)'` and the y-axis label to `'Frequency'`.

In [None]:
# ANSWER CELL

# Do not modify these lines for grading purposes
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# create figure and axes
fig_6, ax_6 = plt.subplots(nrows=1, ncols=1, figsize=(4,2.5))

# Edit the code below to plot a frequency histogram of bootstrap_medians (only edit where you have ...)

# get lower and upper estimates of the 95% confidence interval for the median
q7_1_lower, q7_1_upper = ...

# Plot frequency histogram. Assign the plot to the variable histogram_3.
histogram_3 = ...

# plot red vertical lines at the confidence interval estimates
...
...

# Label the axes
...

# Display the plot
plt.tight_layout()
plt.show()

# Print the interval
print(f'95% confidence interval for Median using Bootstrapping: ({q7_1_lower:.0f}, {q7_1_upper:.0f}) mW/m^2' if not isinstance(q7_1_lower, type(Ellipsis)) and not isinstance(q7_1_upper, type(Ellipsis)) else None)

In [None]:
grader.check("q7.1")

### Summer versus Winter UV Irradiance

So far, we have focused on the winter months only. Next, we will compare the winter and summer months (months = 6, 7, and 8). Run the code cell below to create a data vector with average UV irradiance for the summer months.

In [None]:
# Run this cell to create a data vector with average UV irradiance for the summer months

# return UV values only for months = 6, 7, and 8
summer_UV = UV[(month==6)|(month==7)|(month==8)]
print(f'There are {len(summer_UV)} measurements during summer months.')

We are interested in comparing the average UV irradiance at noon for the winter months (December/January/February: 12, 1, 2) and the summer months (June/July/August: 6, 7, 8).

Recall that we created two data vectors, one with average UV irradiance of the winter months (`winter_UV`) and another for the average UV irradiance of the summer months (`summer_UV`).
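The month-based selection that produces such vectors is a boolean mask. As a small sketch, assuming the data behave like NumPy arrays (the arrays `demo_month` and `demo_uv` below are invented for illustration), chaining `==` comparisons with `|` is equivalent to a single `np.isin` call:

```python
import numpy as np

# Invented example data (NOT the lab's measurements)
demo_month = np.array([12, 1, 2, 6, 7, 8, 3, 6, 11])
demo_uv = np.array([30.0, 35.0, 40.0, 210.0, 220.0, 205.0, 90.0, 215.0, 50.0])

# Mask built by chaining comparisons with | (element-wise OR)
mask_or = (demo_month == 6) | (demo_month == 7) | (demo_month == 8)

# Equivalent mask built with np.isin
mask_isin = np.isin(demo_month, [6, 7, 8])

demo_summer = demo_uv[mask_isin]
print(demo_summer)
```

Either form works; `np.isin` tends to read more cleanly when the list of months grows.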

<font color='red'>**Question 8.0.**</font> Create two boxplots, one for the average UV irradiance during the winter and another for the average UV irradiance during the summer. Create both boxplots **in the same plot** (refer to Lab 02). Follow these steps: (0.5 pts)

1. Create a boxplot of `winter_UV` and `summer_UV` (in this order) and assign it to variable `q8_0`
2. Set the y-axis label equal to `'UV Irradiance ($mW/m^2$)'`
3. Set the x-axis tick labels equal to `['Winter', 'Summer']`.

In [None]:
# ANSWER CELL

# Do not modify this line for grading purposes
import matplotlib.pyplot as plt

# create figure and axes
fig_7, ax_7 = plt.subplots(nrows=1, ncols=1, figsize=(4,3))

# Edit the code below to create a boxplot of winter_UV and summer_UV (only edit where you have ...)

# Create boxplot and assign it to variable q8_0
q8_0 = ...

# set xticklabels and ylabel
...

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
grader.check("q8.0")

<font color='red'>**Question 8.1.**</font> What can you tell about the average UV irradiance during the winter and the summer months based on your plot? Assign ALL that apply to the variable `q8_1`. (0.5 pts)

**A.** The UV irradiance during the winter is skewed to the right \
**B.** The median of the UV irradiance is greater for the summer months than the winter months \
**C.** The UV irradiance during the winter has outliers \
**D.** The UV irradiance during the winter has a symmetric distribution \
**E.** The UV irradiance during the summer is skewed to the left \
**F.** The UV irradiance during the summer has outliers \
**G.** The interquartile range of the UV irradiance is greater for the winter months than the summer months

Answer in the next cell. Add each selected choice as a string and separate the choices with commas. For example, if you want to select `"A"` and `"B"`, your answer should be `"A", "B"`.\
Assign your answer to the given variable.
Remember to put quotes around each answer choice.
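The boxplot features named in the choices above (median, interquartile range, outliers) can also be checked numerically. The sketch below does this for a small invented data set, assuming matplotlib's default whisker setting (`whis=1.5`), under which points beyond $1.5 \times IQR$ from the quartiles are drawn as outliers.

```python
import numpy as np

# Invented data with one unusually large value (NOT the lab's measurements)
demo_data = np.array([2.0, 3.0, 3.5, 4.0, 4.2, 4.8, 5.0, 5.5, 6.0, 15.0])

# Quartiles and interquartile range
q1, med, q3 = np.percentile(demo_data, [25, 50, 75])
iqr = q3 - q1

# Fences used to flag outliers under the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = demo_data[(demo_data < lower_fence) | (demo_data > upper_fence)]

print(f'Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}')
print(f'Outliers: {outliers}')
```

Running the same calculation on each data vector gives numbers to compare against what the boxplots show visually.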

In [None]:
# ANSWER CELL
q8_1 = ...
q8_1

In [None]:
grader.check("q8_1")

### Confidence Interval for Difference in Means

If we have two large enough samples of sizes $n_X$ and $n_Y$ with sample means $\overline{x}$ and $\overline{y}$ and sample standard deviations $s_X$ and $s_Y$, a two-sided $100(1-\alpha)\%$ confidence interval for the difference in population means $\mu_X-\mu_Y$ can be obtained using the following equation:

$$\overline{x}-\overline{y}\pm z_{\alpha/2}\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}$$

where:
* $\overline{x}-\overline{y}$ is a point estimate for $\mu_X-\mu_Y$
* $\sqrt{s_X^2 / n_X + s_Y^2/n_Y}$ is the standard error of the estimator
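As a worked illustration of the formula, the sketch below plugs in invented summary statistics (not the UV data) for a 95% confidence level. The critical value comes from the standard library's `statistics.NormalDist`; `scipy.stats.norm.ppf` should give the same number. All variable names here are hypothetical.

```python
import math
from statistics import NormalDist

# Invented summary statistics for two samples (NOT the lab's UV data)
x_bar, s_x, n_x = 10.0, 2.0, 40
y_bar, s_y, n_y = 8.0, 1.5, 50

# Critical value z_{alpha/2} for a two-sided 95% interval
confidence = 0.95
alpha = 1 - confidence
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Point estimate and standard error of the difference in means
point_estimate = x_bar - y_bar
std_error = math.sqrt(s_x**2 / n_x + s_y**2 / n_y)

# Confidence interval: point estimate +/- z * standard error
ci_lower = point_estimate - z_crit * std_error
ci_upper = point_estimate + z_crit * std_error
print(f'z = {z_crit:.3f}, SE = {std_error:.3f}')
print(f'95% CI for the difference in means: ({ci_lower:.3f}, {ci_upper:.3f})')
```

Because this interval does not contain zero, the invented data would suggest that the two population means differ at the 95% confidence level.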

We have 33 measurements of the winter UV irradiance and 33 measurements of the summer UV irradiance. Assume that $Y$ represents the winter UV sample and $X$ represents the summer UV sample. We want to construct a confidence interval on the difference in the means: $\mu_X-\mu_Y$ (population mean of summer UV irradiance minus population mean of winter UV irradiance).

<font color='red'>**Question 8.2.**</font> Based on the winter UV irradiance (saved as `winter_UV`) and the summer UV irradiance (saved as `summer_UV`), calculate a point estimate for the difference in the means: $\mu_X-\mu_Y$ and assign it to the variable `mean_difference_UV`. Also, calculate the standard error of the estimator of $\mu_X-\mu_Y$ and assign it to the variable `se_difference_UV`. Do not just manually type the numeric answers. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. Assume both `winter_UV` and `summer_UV` represent samples. (0.5 pts)

In [None]:
# ANSWER CELL

# get point estimate for difference in means
mean_difference_UV = ...
print(f'Point estimate for difference in means: {mean_difference_UV:.3f} mW/m^2' if not isinstance(mean_difference_UV, type(Ellipsis)) else None)

# get standard error of the estimator of the difference in means
se_difference_UV = ...
print(f'Estimator Standard Error: {se_difference_UV:.3f} mW/m^2' if not isinstance(se_difference_UV, type(Ellipsis)) else None)

In [None]:
grader.check("q8.2")

<font color='red'>**Question 8.3.**</font> Calculate a two-sided 99% confidence interval for the difference in population means $\mu_X-\mu_Y$. Assign the lower estimate of the confidence interval to `q8_3_lower` and the upper estimate to `q8_3_upper`. Do not just manually type the numeric answer. Use Python expressions that return the desired answer and assign the expression to the corresponding variable. (0.5 pts)

In [None]:
# ANSWER CELL

q8_3_lower = ...
q8_3_upper = ...

print(f'99% confidence interval for difference in population means: ({q8_3_lower:.1f}, {q8_3_upper:.1f}) mW/m^2' if not isinstance(q8_3_lower, type(Ellipsis)) and not isinstance(q8_3_upper, type(Ellipsis)) else None)

In [None]:
grader.check("q8_3")

### You're done with this Lab!

**Important submission information:** After completing the assignment, click on the Save icon from the Tool Bar &nbsp;<i class="fa fa-save" style="font-size:16px;"></i>&nbsp;. After saving your notebook, **run the cell with** `grader.check_all()` and confirm that you pass the same tests as in the notebook. Then, **run the final cell** `grader.export()` and click the link to download the zip file. Finally, go to Gradescope and submit the zip file to the corresponding assignment. 

**Once you have submitted, stay on the Gradescope page to confirm that you pass the same tests as in the notebook.**

In [None]:
%matplotlib inline
img = mpimg.imread('resources/animal.jpg')
imgplot = plt.imshow(img)
imgplot.axes.get_xaxis().set_visible(False)
imgplot.axes.get_yaxis().set_visible(False)
print("Congratulations on finishing this lab!")
plt.show()

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Make sure you submit the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)