# STAT 201 - Statistical Inference for Data Science


## Lecture 7: Confidence Intervals (of means and proportions) based on the assumption of Normality or the Central Limit Theorem

**Attribution:** these slides are adapted from [Rodolfo Lourenzutti's slides](https://github.com/UBC-STAT/stat-201/blob/website/lec/stat201-week7.pdf)

## Reminders


- Worksheet 7 and tutorial 7 are due on Saturday, October 28 
- Project Proposal due on November 4th  
    - Instructions under "Deliverable 2: Project Proposal" on the [course website](https://ubc-stat.github.io/stat-201/group-project.html) 




## Re-cap

#### Module 1 - Introduction to Statistical Inference and Sampling
- Understand the terminology of statistical inference (population, sample, sampling, point estimate (sample statistic), sample distribution, sampling distribution, standard error)

#### Module 2 - Populations and Sampling
- Sampling methodology, Sampling distribution

#### Module 3 - Bootstrapping and its Relationship to the Sampling Distribution
- Bootstrap distribution

#### Module 4 - Confidence Intervals via bootstrapping
- Bootstrap confidence interval

#### Module 5 - Hypothesis Testing via simulation/randomization 

#### Module 6 - Midterm  1

## Statistical Inference

1. **Point estimation**: we estimate an unknown parameter using a *single number* calculated from sample data.
2. **Interval estimation**: we estimate an unknown parameter using an *interval of values* that plausibly contains the true parameter value (and state how confident we are that the interval captures the true value).
3. **Hypothesis testing**: we make a statement about the value of an unknown population parameter, and we check whether or not the data obtained from the sample provide evidence against this claim.

## Review: Bootstrapping
<img src="https://miro.medium.com/max/1575/1*SgeDm_wb2QNSF0CSYVmhuw.jpeg" width=2000> 

*image source: [Towards Data Science](https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307)*


## Today: Traditional methods
<img src="https://miro.medium.com/max/1575/1*QpDvUNXHTSDbgwXP2kVBsA.jpeg" width=2000> 

*image source: [Towards Data Science](https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307)*

<img src="img/stat201-week7.png" width=1000>

## Sampling Scenarios

| Scenario | Population parameter                | Symbol     | Point estimate | Symbol(s)|
|----------|:------------------------------------|:------------|:--------------|:---------|
| 1	       | Population proportion               |  $p$  <img width=100/>| Sample proportion | $\hat{p}$ | 	 
| 2	       | Population mean                     | $\mu$      | Sample mean | $\bar{x}$ | 	 
| 3	       | Difference in population proportions| $p_1 - p_2$| Difference in sample proportions | $\hat{p}_1 - \hat{p}_2$ | 	 
| 4	       | Difference in population means      | $\mu_1 - \mu_2$ | Difference in sample means	| $\bar{x}_1- \bar{x}_2$      | 	

TABLE 7.5: Scenarios of sampling for inference from [Modern Dive](https://moderndive.com/7-sampling.html#sampling-conclusion-central-limit-theorem) by Kim & McConville

## Normal (Gaussian) distribution

A distribution defined by two values: 

1. the mean $\mu$ ("mu") and 
2. the standard deviation $\sigma$ ("sigma") 

A normal distribution is: 
- Unimodal (one peak) and bell-shaped;
- Symmetric around the mean, $\mu$

<img src="https://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod1/8/normal_curve.jpg" width=600> 


In [None]:
IRdisplay::display_html('<iframe src="https://www.zoology.ubc.ca/~whitlock/Kingfisher/SamplingNormal.htm" width="1000" height="550"></iframe>')

## Location & spread
- The mean $\mu$ controls the location of the center of the distribution.
- The standard deviation $\sigma$ controls the spread of the curve (wider or narrower).
- A normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$ has a special name: the **standard normal distribution**.

<img src="img/stat201-week7_Page_10.png" width=1800>

## 68-95-99.7% rule (Empirical rule) 

If a variable follows a Normal distribution, there are three rules of thumb we can use:

- approximately 68% of values will lie within $\pm1$ standard deviation of the mean 
- approximately 95% of values will lie within $\pm2$ standard deviations of the mean 
- approximately 99.7% of values will lie within $\pm3$ standard deviations of the mean
<p align="center">
    <img src="https://andymath.com/wp-content/uploads/2019/12/empirical-rule-normdist2.jpg" width=1000>
</p>

*image source: [Andymath.com](https://andymath.com/normal-distribution-empirical-rule/)*
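
The three percentages in the rule can be checked directly with `pnorm()`. The snippet below is a small sketch using the standard Normal distribution; the variable names `k` and `within_k_sd` are our own. The same probabilities hold for any Normal distribution because they depend only on the number of standard deviations from the mean.

```r
# Probability that a Normal variable falls within k standard deviations
# of its mean, for k = 1, 2, 3 (computed on the standard Normal scale).
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
# [1] 0.6827 0.9545 0.9973
```

Note that the exact multiplier for 95% is about 1.96 rather than 2, which is why the rule is only a rule of thumb.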

## Functions in R for the Normal distribution

- We can use software packages to get the desired probability or quantiles.

- Probabilities: e.g., $P(X \le 0)$
> pnorm(0, mu, sigma)

- Quantiles: e.g., the value $x$ such that $P(X \le x) = 0.25$ 
> qnorm(0.25, mu, sigma)

<div style="display:flex">
     <div style="flex:1;padding-right:10px;">
          <img src="img/pnorm.png" width="800"/>
     </div>
     <div style="flex:1;padding-left:10px;">
          <img src="img/qnorm.png" width="800"/>
     </div>
</div>



*image source: [Visual guide to pnorm, dnorm, qnorm, and rnorm functions in R](https://diggingdeeperwithstats.wordpress.com/2021/05/21/visual-guide-to-pnorm-dnorm-qnorm-and-rnorm-functions-in-r/)*

```
pnorm(0)
pnorm(-2, 0, 1)
qnorm(0.84)
```
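
To see how `pnorm()` and `qnorm()` relate to each other, here is a short sketch with made-up parameter values (`mu`, `sigma`, and `x` are illustrative, not from the lecture). It shows that supplying `mean` and `sd` is equivalent to standardizing first, and that `qnorm()` reverses `pnorm()`.

```r
# Illustrative values only: a Normal distribution with mean 50 and sd 5.
mu <- 50
sigma <- 5
x <- 57

# Lower-tail probability P(X <= x), computed two equivalent ways.
p_direct <- pnorm(x, mean = mu, sd = sigma)
p_standardized <- pnorm((x - mu) / sigma)   # z-score on the standard Normal
all.equal(p_direct, p_standardized)

# qnorm() is the inverse of pnorm(): it recovers x from the probability.
x_recovered <- qnorm(p_direct, mean = mu, sd = sigma)
x_recovered

# Upper-tail probability P(X > x) via the lower.tail argument.
p_upper <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
p_upper
```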

## Clicker Question 1

Suppose the lengths of fish in a certain lake follow approximately a Normal distribution with mean $\mu = 100$ mm and standard deviation $\sigma = 10$ mm.

Approximately, what percentage of fish have lengths below 90 mm? (Use the 68-95-99.7 rule)

A. 2.5% 

B. 5%

C. 16%

D. 68%

In [None]:
pnorm(..., mean = ..., sd = ...)

In [None]:
pnorm(90, mean = 100, sd = 10)


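Answer: C. A length of 90 mm is one standard deviation below the mean, so about $(100\% - 68\%)/2 = 16\%$ of fish are shorter than that.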

## Clicker Question 2

Suppose the lengths of fish in a certain lake follow approximately a Normal distribution with mean $\mu = 100$ mm and standard deviation $\sigma = 10$ mm.

What length is exceeded by only 2.5\% of the fish? 

A. 100 mm

B. 110 mm

C. 120 mm

D. 130 mm

In [None]:
qnorm(..., mean = 100, sd = 10)


In [None]:
qnorm(.975, mean = 100, sd = 10)


## Recall: Sampling Distributions of Proportions

The sampling distribution of proportions is the distribution of the sample proportions of all possible random samples of size $n$ that can be obtained from a population.



For sufficiently large samples (such that $n\times p\ge 10$ and $n \times (1-p)\ge 10$), and provided the other necessary conditions hold, the sampling distribution of $\hat{p}$ is approximately Normal with
- mean $p$ and
- standard error $\sqrt{\frac{p \times (1-p)}{n}}$

The larger the sample size $n$, the better the Normal approximation.
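
The mean and standard error stated above can be checked by simulation. The sketch below assumes a hypothetical population proportion `p = 0.3` and sample size `n = 100` (both chosen only for illustration) and draws many sample proportions with `rbinom()`.

```r
set.seed(201)  # for reproducibility

p <- 0.3       # assumed population proportion (illustrative)
n <- 100       # sample size; n*p = 30 and n*(1-p) = 70 are both >= 10
reps <- 10000  # number of simulated samples

# Each draw is the number of successes in a sample of size n;
# dividing by n gives the sample proportion p-hat.
p_hats <- rbinom(reps, size = n, prob = p) / n

# Compare the simulation with the theoretical mean and standard error.
c(simulated_mean = mean(p_hats), theoretical_mean = p)
c(simulated_se = sd(p_hats), theoretical_se = sqrt(p * (1 - p) / n))
```

The simulated values should land close to the theoretical ones, with small differences due to simulation noise.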

## Recall: Sampling Distribution of Means
The sampling distribution of means is the distribution of the means of all the possible random samples of size $n$ that could be selected from a population.



Let $x_1, x_2, \ldots, x_n$ be independent values in a random sample from a population with mean $\mu$ and standard deviation $\sigma$. The Central Limit Theorem (CLT) states that, for large sample sizes $n$, the sampling distribution of the sample mean is approximately Normal, regardless of the shape of the population distribution, with
- mean $\mu$ and 
- standard error $\frac{\sigma}{\sqrt{n}}$




Note: If the population distribution is Normal, the sample mean $\bar{X}$ follows the Normal model with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ regardless of the sample size. 
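
A quick simulation can make the CLT statement concrete. This sketch assumes a strongly right-skewed population, an Exponential distribution with rate 1 (so $\mu = 1$ and $\sigma = 1$), and a sample size of 50; these choices are ours, for illustration only.

```r
set.seed(201)  # for reproducibility

n <- 50        # sample size (illustrative)
reps <- 10000  # number of simulated samples

# Population: Exponential(rate = 1), which is right-skewed with
# mean 1 and standard deviation 1.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means should be centred near mu = 1 with
# standard error near sigma / sqrt(n).
c(simulated_mean = mean(sample_means), clt_mean = 1)
c(simulated_se = sd(sample_means), clt_se = 1 / sqrt(n))

# If the Normal approximation is reasonable, roughly 95% of the sample
# means fall within 2 standard errors of mu.
prop_within_2se <- mean(abs(sample_means - 1) <= 2 / sqrt(n))
prop_within_2se
```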

## Central Limit Theorem - Explain using an applet


In [None]:
IRdisplay::display_html('<iframe src="https://www.zoology.ubc.ca/~whitlock/Kingfisher/CLT.htm" width="800" height="550"></iframe>')

## Assumptions & conditions
- The sample is randomly drawn from the population. 
- The sample values are independent. In general, when sampling without replacement, a sample larger than 10% of the population size severely violates independence.
- The sample size must be large enough.
    - For proportions: 
        - check $n\times p \ge 10$ and $n\times(1-p) \ge 10$.
    - For means: 
        - there is no universal guideline, and we might need a large sample size. Usually, however, sample sizes larger than 30 are enough to get a reasonable approximation (but it is not guaranteed).

## Recall 
The **standard error** is the standard deviation of a point estimate's sampling distribution.

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$ 

- $\sigma_{\hat{p}}$: standard deviation of sample proportions (i.e. standard error)
- $p$: population proportion

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$ 

- $\sigma_{\bar{x}}$: standard deviation of sample means (i.e. standard error)
- $\sigma$: standard deviation of population

### In reality: 

- We don't know the value of the population proportion ($p$), so we estimate $p$ using the sample proportion ($\hat{p}$). Thus we estimate $\sigma_{\hat{p}}$ (the variability of $\hat{p}$) using:

$$\sigma_{\hat{p}} \approx \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}$$

- We don't know the value of the population standard deviation ($\sigma$), so we estimate $\sigma$ using the sample standard deviation ($s$). Thus we estimate $\sigma_{\bar{x}}$ using:

$$\sigma_{\bar{x}} \approx \frac{s}{\sqrt{n}}$$
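
As a sketch of how these plug-in estimates are computed in R: the data below are simulated or made up purely for illustration (`heights`, `successes`, and the sample sizes are assumptions, not course data).

```r
set.seed(7)  # for reproducibility

# --- Estimated standard error of a sample mean ---
# Simulated sample of 40 values standing in for real measurements.
heights <- rnorm(40, mean = 170, sd = 8)
se_mean <- sd(heights) / sqrt(length(heights))   # s / sqrt(n)
se_mean

# --- Estimated standard error of a sample proportion ---
# Hypothetical counts: 18 "successes" out of 60 observations.
successes <- 18
n <- 60
p_hat <- successes / n
se_prop <- sqrt(p_hat * (1 - p_hat) / n)         # sqrt(p-hat (1 - p-hat) / n)
se_prop
```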


## Idea behind confidence intervals 
- By the 68-95-99.7 rule, approximately 95% of sample proportions $\hat{p}$ will fall within $2 \times \sigma_{\hat{p}}$ of $p$, i.e. in the interval $p \pm 2 \times \sigma_{\hat{p}}$ 

<p align="center"> <img src=img/sampling_dist_p.png width=1200>
</p>

<p align="center"> <img src="https://faculty.elgin.edu/dkernler/statistics/ch09/images/sample-proportions.jpg" width=1000>
</p>
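
One way to spell out the step from this fact to an interval for $p$ is sketched below. It treats $\sigma_{\hat{p}}$ as known and uses the rule-of-thumb multiplier 2:

$$P\left(p - 2\sigma_{\hat{p}} \le \hat{p} \le p + 2\sigma_{\hat{p}}\right) \approx 0.95 \quad\Longleftrightarrow\quad P\left(\hat{p} - 2\sigma_{\hat{p}} \le p \le \hat{p} + 2\sigma_{\hat{p}}\right) \approx 0.95$$

Both statements describe the same event, $|\hat{p} - p| \le 2\sigma_{\hat{p}}$. So if $\hat{p}$ lands within $2\sigma_{\hat{p}}$ of $p$ in about 95% of samples, then the interval $\hat{p} \pm 2\sigma_{\hat{p}}$ captures $p$ in about 95% of samples.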

## Confidence Intervals


The confidence intervals for population parameters that we will learn about today take the following form:

<p style="text-align: center;"> statistic $\pm$ margin of error </p>
<p style="text-align: center;"> statistic $\pm$ critical value $\times$ standard error </p>

## Confidence Intervals: proportions

Formula for a confidence interval (CI) for the population proportion $p$:

$$\hat{p} \pm z^* \times \sigma_{\hat{p}}$$

$$\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $z^*$ is called the critical value.

## Confidence Intervals: proportions
<img src="img/stat201-week7_Page_21.png" width=1000>

## Finding the critical value
- The critical value depends on the confidence level you choose 

- $z^*$ is the value such that the upper tail area (the area to the right of $z^*$) under the standard normal curve equals $\frac{1 - C}{2}$ (where $C$ is the confidence level).

<img src="http://www.stat.yale.edu/Courses/1997-98/101/confdiag.gif" width=800> 

## Finding $z^*$ in R

Recall:
- Quantiles: e.g., the value $x$ such that $P(X \le x) = 0.25$ 
> qnorm(0.25, mu, sigma)


For a 95% confidence level, the $z^*$ is...

In [None]:
qnorm(...)

In [None]:
qnorm(0.975)
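
The same idea extends to other confidence levels. An upper tail area of $\frac{1 - C}{2}$ means $z^*$ is the $1 - \frac{1 - C}{2}$ quantile of the standard Normal distribution. A small sketch (the vector name `conf_levels` is our own):

```r
# Critical values z* for a few common confidence levels C.
conf_levels <- c(0.90, 0.95, 0.99)

# Upper tail area (1 - C) / 2 corresponds to the 1 - (1 - C) / 2 quantile.
z_star <- qnorm(1 - (1 - conf_levels) / 2)
round(z_star, 3)
# [1] 1.645 1.960 2.576
```

Higher confidence levels give larger critical values, and therefore wider intervals.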

## Confidence Intervals: means

Formula for a confidence interval (CI) for the population mean $\mu$:

$$\bar{x} \pm z^* \times \sigma_{\bar{x}}$$

$$\bar{x} \pm z^* \times \frac{s}{\sqrt{n}}$$

where $z^*$ is the z-score such that the upper tail area under the standard Normal curve is equal to $\frac{1 - C}{2}$ (where $C$ is the confidence level).

*Note: we could actually get a better approximation using the t-distribution, which we will learn about in the next module. However, for large $n$, the Normal and $t$ distributions are quite close*
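
Here is a minimal sketch of the calculation for a mean, using a simulated sample in place of real data (the sample `x`, its size, and the Exponential population are all assumptions for illustration). It also computes the interval with a $t$ critical value via `qt()`, purely as a preview of the comparison mentioned in the note.

```r
set.seed(2023)  # for reproducibility

# Simulated sample of size 60 from a skewed population with mean 5.
x <- rexp(60, rate = 1 / 5)

n <- length(x)
x_bar <- mean(x)
se_hat <- sd(x) / sqrt(n)       # estimated standard error s / sqrt(n)

# 95% confidence interval using the Normal critical value z*.
z_star <- qnorm(0.975)
ci_z <- x_bar + c(-1, 1) * z_star * se_hat
ci_z

# Same interval with a t critical value (n - 1 degrees of freedom).
t_star <- qt(0.975, df = n - 1)
ci_t <- x_bar + c(-1, 1) * t_star * se_hat
ci_t

# The t critical value is slightly larger, so the t interval is slightly wider.
c(z_star = z_star, t_star = t_star)
```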

## Confidence Intervals: means
<img src="img/stat201-week7_Page_18.png" width=1000>

## Confidence Intervals: means
<img src="img/stat201-week7_Page_19.png" width=1000>

## Assumptions & conditions for constructing confidence intervals
- The sample is randomly drawn from the population. 
- The sample values are independent. In general, when sampling without replacement, a sample larger than 10% of the population size severely violates independence.
- The sample size must be large enough. 
    - For proportions: Check $n\times \hat{p} \ge 10$ and $n\times(1-\hat{p}) \ge 10$ (Remember we don't know $p$!)
    - For means: Usually, sample sizes larger than 30 are enough to get a reasonable approximation (but it is not guaranteed).
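
The sample-size condition for proportions is easy to check in code. The helper below is hypothetical (the name `check_success_failure` and its arguments are ours, not part of the course materials), and the counts are made up for illustration.

```r
# Hypothetical helper: checks n * p-hat >= 10 and n * (1 - p-hat) >= 10.
# Since p-hat = successes / n, these are just the observed counts of
# successes and failures.
check_success_failure <- function(successes, n, threshold = 10) {
  c(successes = successes, failures = n - successes) >= threshold
}

# Both conditions hold: 45 successes and 105 failures.
check_success_failure(successes = 45, n = 150)

# The first condition fails: only 4 successes.
check_success_failure(successes = 4, n = 150)
```

When either check returns `FALSE`, the Normal approximation may be poor, and a bootstrap interval is one alternative already covered in this course.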

## Clicker question 3 

A manufacturer is producing a chocolate bar. The factory manager wants to find out about the true proportion of the bars that are overweight (i.e. the weight is larger than the labeled weight). They take a random sample of 100 bars and find that 20 out of 100 are overweight. 

A 95% confidence interval for the true proportion of the bars that are overweight is:

A. $0.20 \pm 0.04$

B. $0.20 \pm 0.08$

C. $100 \pm 0.04$

D. $100 \pm 0.08$
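
The calculation for this question can be sketched in R using the proportion formula from earlier in the lecture. Variable names such as `overweight` and `se_hat` are our own.

```r
n <- 100          # bars sampled
overweight <- 20  # bars found to be overweight

p_hat <- overweight / n                    # sample proportion

# Large-sample check: both counts should be at least 10.
c(n * p_hat, n * (1 - p_hat)) >= 10

se_hat <- sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error (0.04)
z_star <- qnorm(0.975)                     # critical value for 95% confidence
margin <- z_star * se_hat                  # margin of error, about 0.078

ci <- p_hat + c(-1, 1) * margin
round(c(p_hat = p_hat, margin = margin, lower = ci[1], upper = ci[2]), 3)
```

The margin of error is about 0.078, which rounds to 0.08, so the interval matches option B.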

## Confidence interval interpretation


In [None]:
IRdisplay::display_html('<iframe src="https://www.zoology.ubc.ca/~whitlock/Kingfisher/CIMean.htm" width="800" height="550"></iframe>')
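
The "95%" in a 95% confidence interval describes the procedure: across many repeated samples, about 95% of the intervals built this way capture the true parameter. The simulation below is a sketch of that idea under assumed values (a Normal population with $\mu = 100$, $\sigma = 10$, and samples of size 50, all chosen for illustration).

```r
set.seed(42)  # for reproducibility

mu <- 100     # true population mean (known here only because we simulate)
sigma <- 10   # true population standard deviation
n <- 50       # sample size
reps <- 5000  # number of repeated samples

z_star <- qnorm(0.975)  # critical value for 95% confidence

# For each simulated sample, build the interval x-bar +/- z* s / sqrt(n)
# and record whether it contains the true mean.
covers <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  margin <- z_star * sd(x) / sqrt(n)
  (mean(x) - margin <= mu) && (mu <= mean(x) + margin)
})

# Proportion of intervals that captured mu; this should be close to 0.95.
coverage <- mean(covers)
coverage
```

Any single interval either contains $\mu$ or it does not; the confidence level is the long-run capture rate of the method.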

## Confidence Interval (CI) for the population mean $\mu$
$\left[\bar{x}-z_{1-\alpha/2}^* \dfrac{s}{\sqrt{n}} , \; \bar{x} + z_{1-\alpha/2}^* \dfrac{s}{\sqrt{n}}\right]$
<br />

#### $\bar{x}\pm z_{1-\alpha/2}^* \dfrac{s}{\sqrt{n}}$, where $\bar{x}$ is the point estimate for $\mu$ and $z_{1-\alpha/2}^* \dfrac{s}{\sqrt{n}}$ is the margin of error
<br />
Here $1 - \alpha$ is the confidence level $C$, so $z_{1-\alpha/2}^*$ is the $1 - \alpha/2$ quantile of the standard Normal distribution (upper tail area $\frac{1 - C}{2}$).

A confidence interval takes the form 

### $\qquad \Rightarrow \qquad$ point estimate $\pm$ margin of error

## Confidence Intervals: Difference in means

<img src="img/stat201-week7_Page_22.png" width=1000>  

## Confidence Intervals: Difference in means

<img src="img/stat201-week7_Page_23.png" width=1000> 

## Confidence Intervals: Difference in proportions

<img src="img/stat201-week7_Page_24.png" width=1000> 

## Confidence Intervals: Difference in proportions

<img src="img/stat201-week7_Page_25.png" width=1000> 

## Get started on worksheet 7!
- Navigate to Canvas, open `worksheet_07`

If you get stuck:

- Discuss with your neighbours, the TAs & Instructors, and consult the textbook reading to help you get unstuck when needed!
- Ask your group mates if you need help with any questions
- The TAs and I will walk around and answer questions

## What did you learn today?

-  
- 
- 

## Take a break! 

Come back at 3:00 PM 

<!--- Summer term --->