In the first exercise of Unit 4 (Sampling from a Population), the goal was to compare point estimates of mean weight, $\overline{x}_{weight}$, to the population parameter, $\mu_{weight}$. In this exercise, instead of calculating point estimates from samples, we will calculate interval estimates, i.e. **confidence intervals**. A confidence interval gives a plausible range of values for a parameter and is centered at the point estimate. 

**Goal:** Observe the effect of sampling by treating the 13,572 individuals in `yrbss` as our target population, and drawing random samples. How do interval estimates of mean weight, $(\overline{x}_{weight} - m, \text{ } \overline{x}_{weight} + m)$, calculated from random samples compare to the population parameter, $\mu_{weight}$?


The following information about the YRBSS data is copied from the previous exercise...

This exercise uses data from the Youth Risk Behavior Surveillance System (YRBSS), a yearly survey conducted by the US Centers for Disease Control to measure health-related activity in high-school aged youth. The dataset `yrbss` in the `oibiostat` package contains responses from the 13,572 participants in 2013 for a subset of the variables included in the complete survey data.


Variables in the dataset include:

  - age: age in years

  - gender: gender of participant, recorded as either female or male
  
  - grade: grade in high school (9-12)
  
  - height: height, in meters (1 m = 3.28 ft)
  
  - weight: weight, in kilograms (1 kg = 2.2 lbs)
  
  - helmet.12m: frequency that the student wore a helmet while biking in the last 12 months, either always, most of time, sometimes, rarely, never, or did not ride
  
  - physically.active.7d: number of days physically active for 60+ minutes in the last 7 days
  
  - strength.training.7d: number of days of strength training in the last 7 days

  
The CDC used the responses from the 13,572 students to estimate the health behaviors of the target population: the 21.2 million high school aged students in the US in 2013. 


Recall from the previous exercise that we can take a random sample of size 10 from `yrbss` as follows:

In [2]:
require(oibiostat)
data(yrbss)
require(dplyr)
set.seed(5011) 
yrbss.sample = sample_n(yrbss, size=10)
yrbss.sample

Unnamed: 0,age,gender,grade,hispanic,race,height,weight,helmet.12m,text.while.driving.30d,physically.active.7d,hours.tv.per.school.day,strength.training.7d,school.night.hours.sleep
7516,15,female,9,not,White,1.7,65.77,never,0,2,<1,0,8
10731,16,female,11,hispanic,,1.57,60.78,never,,5,do not watch,5,6
3541,17,female,10,not,White,1.65,72.58,did not ride,did not drive,0,,0,<5
11356,16,female,11,hispanic,Asian,1.7,66.68,never,,1,3,3,6
11849,17,female,12,not,White,1.68,46.27,sometimes,1-2,7,2,7,
6411,17,female,12,not,,1.52,68.04,never,30,0,5+,0,7
10405,17,female,11,not,Black or African American,1.75,93.9,did not ride,did not drive,0,<1,0,6
11864,18,male,12,not,White,1.47,63.5,never,20-29,7,2,5,6
3664,18,male,12,not,White,1.7,75.75,did not ride,0,0,1,2,6
2881,14,female,9,not,White,1.68,68.04,never,0,4,2,5,8


#### Problem 1:

A confidence interval is calculated from four quantities: the sample mean $\overline{x}$, the sample standard deviation $s$, the sample size $n$, and $t^{\star}$.

a) Calculate $\overline{x}_{weight}$ and $s_{weight}$, the mean and standard deviation of weight in the sample. 


  b) For a 95% confidence interval, $t^{\star}$ is the point on a $t$ distribution with $n - 1$ degrees of freedom that has area 0.975 to the left. Calculate the value of $t^\star$ for a 95% confidence interval that has degrees of freedom $10 - 1 = 9$. 



c) Calculate a 95% confidence interval for the true mean weight and interpret this interval.

$$\left( \overline{x} - \dfrac{s}{\sqrt{n}} t^{\star}, \text{ } \overline{x} + \dfrac{s}{\sqrt{n}} t^{\star} \right)$$
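The formula above can be sketched in a few lines of base R. To leave the weight calculation as an exercise, this illustration uses the ten `height` values printed in the sample table instead; the variable names (`x.bar`, `s`, `t.star`, `m`, `ci`) are chosen here for illustration and are not part of the original exercise.

```r
# Heights (in meters) of the 10 sampled students, copied from the sample table
height <- c(1.70, 1.57, 1.65, 1.70, 1.68, 1.52, 1.75, 1.47, 1.70, 1.68)

n      <- length(height)          # sample size
x.bar  <- mean(height)            # sample mean
s      <- sd(height)              # sample standard deviation
t.star <- qt(0.975, df = n - 1)   # point with area 0.975 to the left, n - 1 df

m  <- t.star * s / sqrt(n)        # margin of error
ci <- c(lower = x.bar - m, upper = x.bar + m)
ci
```

The same steps should carry over to `weight` by swapping in that column, provided any missing values are removed before computing `n`.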


#### Problem 2:

In general, for a confidence interval of $(1 - \alpha)(100)\%$, $t^{\star}$ is the point on a $t$ distribution with $n - 1$ degrees of freedom that has area $1 - (\alpha / 2)$ to the left. For a 95% confidence interval, $\alpha = 0.05$; $t^\star$ is the point on a $t$ distribution with area $1 - (0.05/2) = 0.975$ to the left.
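As a rough illustration of this rule, the sketch below computes $t^{\star}$ for several confidence levels at once. The degrees of freedom (`df = 24`, as for a hypothetical sample of size 25) are an assumption made for the example and deliberately differ from the $n - 1 = 9$ used in this exercise.

```r
conf.levels <- c(0.90, 0.95, 0.99)
alpha       <- 1 - conf.levels

# Hypothetical degrees of freedom: a sample of size 25 (not the sample above)
df <- 24

# t.star has area 1 - alpha/2 to its left on a t distribution with df degrees of freedom
t.star <- qt(1 - alpha / 2, df = df)

data.frame(conf.level = conf.levels, alpha = alpha, t.star = t.star)
```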

a) Calculate a 90% confidence interval based on the sample weights.

b) Calculate a 99% confidence interval based on the sample weights.

c) Compare the 95% confidence interval calculated in part c) of Problem 1 to the 90% and 99% confidence intervals. Describe the relationship between confidence level and the width of the interval. 

d) Which of the intervals calculated (90%, 95%, 99%) do you find to be the most informative as an estimate of the mean weight of high-school age students in the US? Explain your answer.

#### Problem 3: 

The `t.test()` command can be used to calculate confidence intervals in `R`. For example, the command to calculate a 95% confidence interval for `age` is:

In [None]:
# t.test() drops missing values on its own, so no na.rm argument is needed
t.test(yrbss.sample$age, conf.level = 0.95)$conf.int

# OR using dplyr
yrbss.sample %>%
    summarize(lower=t.test(age, conf.level = 0.95)$conf.int[1], 
              upper=t.test(age, conf.level = 0.95)$conf.int[2])

a) Calculate a 95% confidence interval for weight using `t.test()`.


b) Examine the effect of larger sample sizes on the confidence interval by re-running the code for sample sizes of 25, 100, and 1000. Use the last four digits of your ID in the `set.seed()` function. Describe your observations.
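One possible way to organize such a comparison is a small helper that returns the interval and its width for a given sample size. The sketch below is self-contained and runs on a simulated stand-in population rather than `yrbss`; the population parameters, the seed, and the helper name `ci.for.size` are all hypothetical choices made for illustration.

```r
# Hypothetical stand-in population (NOT the yrbss data): 10,000 simulated weights in kg
set.seed(1234)
population <- rnorm(10000, mean = 68, sd = 17)

# Draw one random sample of size n; return its confidence interval and width
ci.for.size <- function(n, conf.level = 0.95) {
  x  <- sample(population, size = n)
  ci <- t.test(x, conf.level = conf.level)$conf.int
  c(n = n, lower = ci[1], upper = ci[2], width = ci[2] - ci[1])
}

# One row per sample size
results <- t(sapply(c(25, 100, 1000), ci.for.size))
results
```

For the exercise itself, the same idea would apply with `sample_n(yrbss, size = n)` and the `weight` column in place of the simulated population.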

#### Problem 4:

The method illustrated for computing an $x$% confidence interval will produce an interval that, on average, contains the true population mean $x$ times out of 100. 
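Stated a little more formally, with $x = (1 - \alpha)(100)$, the claim can be sketched as follows. This assumes a random sample and either an approximately normal population or a sample size large enough for the $t$ approximation to be reasonable:

$$P\left( \overline{X} - \dfrac{S}{\sqrt{n}} t^{\star} \le \mu \le \overline{X} + \dfrac{S}{\sqrt{n}} t^{\star} \right) \approx 1 - \alpha$$

The capital letters $\overline{X}$ and $S$ are notation introduced here, not used elsewhere in this exercise, to mark the sample mean and standard deviation as quantities that vary from sample to sample, while $\mu$ stays fixed.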

a) Calculate the population mean weight, $\mu_{weight}$ for `yrbss`.


b) Does the 95% confidence interval you calculated in part b) of Problem 3 for sample size 100 contain $\mu_{weight}$?

c) Draw your 95% confidence interval from part b) of Problem 3 on the board.

#### Problem 5: 

Run the following code chunk to take 1,000 random samples of size 100 from `yrbss`. For each sample, the code calculates mean weight for participants in the sample and stores the value in `sample.means`. The code also calculates the margin of error $m$ according to the defined `conf.level`, and stores it in `m`. The variable `contains.mu` records `TRUE` if the interval contains $\mu_{weight}$ and `FALSE` otherwise.

In [None]:
require(mosaic)

#set parameters
sample.size = 100
replicates = 1000
conf.level = 0.95

#set seed
set.seed(2018)

#calculate sample means and margins of error
confInt = do(replicates)*{
  yrbss.sample = sample_n(yrbss, size=sample.size)
  
  #drop missing weights so that n matches the values used in mean() and sd()
  sample.weights = yrbss.sample$weight[!is.na(yrbss.sample$weight)]
  n.obs = length(sample.weights)
  
  sample.df = n.obs - 1
  t.star = qt(1 - (1 - conf.level)/2, df = sample.df)
  
  sample.means = mean(sample.weights)
  m = t.star * (sd(sample.weights) / sqrt(n.obs))
    
  c(ci.lb = sample.means - m, ci.ub = sample.means + m)
}

head(confInt)

#does the confidence interval contain mu?
mu = mean(yrbss$weight,na.rm=TRUE)
contains.mu = (confInt$ci.lb < mu) & (confInt$ci.ub > mu) 
table(contains.mu)

a) How many intervals contained the population mean $\mu_{weight}$? 

b) What happens when `conf.level` is changed? Test out 0.90 and 0.99.

c) From what you have observed in this exercise about the relationship between an interval estimate for the mean ($\overline{x} \pm m$) and the population mean ($\mu$), evaluate the following statement:
  
*"The 95% confidence interval as calculated from the 13,572 sampled high school students in the 2013 YRBSS survey is (67.61, 68.20) kg. It is possible to definitively conclude that the interval contains the mean weight of the 21.2 million high school aged students in the US in 2013."*