### Unit 4: Exercise 3 (Hypothesis Testing)

**Reading for this exercise**     

- *OI Biostat*: Sections 4.3, 4.4


The previous exercises in Unit 4 discussed how to calculate point estimates and interval estimates for a population mean ($\mu$) from a sample. In the simulation of repeated sampling from a population with the `yrbss` dataset, we observed that in most cases, estimates of $\mu$ calculated from samples are fairly accurate. However, it is possible that, by random chance, a sample produces an interval estimate that does not contain $\mu$.

How likely is it to observe a particular sample mean $\overline{x}$ if the population mean is assumed to be $\mu$? **Hypothesis testing** is a method for calculating the probability of observing a result at least as extreme as the one in the sample, assuming that a null hypothesis is true. At the end of this exercise, we will examine the relationship between hypothesis tests and confidence intervals.


*Example: Do American adults tend to be overweight?*

Body mass index (BMI) is an approximate scale used to estimate body fat; it is calculated from weight, adjusted for height. According to the World Health Organization (WHO), the normal range for BMI is between 18.5 and 24.99. Individuals with BMI of 25 or greater are classified as overweight, while individuals with BMI of 30 or greater are classified as obese.

We will investigate this question using data from the National Health and Nutrition Examination Survey (NHANES), a survey conducted annually by the US Centers for Disease Control and Prevention (CDC). The complete `NHANES` dataset contains 10,000 observations, which will be our artificial target population. BMI information is only available for survey participants who are age 21 or older.

There are two possible approaches: 

- Calculating a confidence interval for the population mean BMI ($\mu_{BMI}$)

- Conducting a formal hypothesis test 

Run the following code chunk to draw a random sample of size 200 from `NHANES` and store the individuals in the sample who are age 21 or older in `nhanes.samp.adult`.

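```r
# load dplyr (for sample_n, filter, and %>%) and the dataset
library(dplyr)
library(NHANES)

# draw a random sample of 200 rows; the seed makes the draw reproducible
set.seed(5011)
nhanes.samp = sample_n(NHANES, size = 200)

# keep only the adults (age 21 or older) in the sample
nhanes.samp.adult = nhanes.samp %>%
    filter(Age >= 21)
head(nhanes.samp.adult)
```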

| ID | SurveyYr | Gender | Age | AgeDecade | AgeMonths | Race1 | Race3 | Education | MaritalStatus | ⋯ | RegularMarij | AgeRegMarij | HardDrugs | SexEver | SexAge | SexNumPartnLife | SexNumPartYear | SameSex | SexOrientation | PregnantNow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63147 | 2011_12 | male | 41 | 40-49 |  | White | White | Some College | Married | ⋯ | No |  | Yes | Yes | 15.0 | 50.0 | 1.0 | No | Heterosexual |  |
| 57165 | 2009_10 | male | 48 | 40-49 | 586.0 | Black |  | High School | Married | ⋯ | Yes | 17.0 | No | Yes | 17.0 | 81.0 | 10.0 | No | Heterosexual |  |
| 69465 | 2011_12 | female | 50 | 50-59 |  | White | White | College Grad | Divorced | ⋯ | No |  | No | Yes | 17.0 | 4.0 | 1.0 | No | Heterosexual |  |
| 57313 | 2009_10 | female | 74 | 70+ | 889.0 | White |  | College Grad | Widowed | ⋯ |  |  |  |  |  |  |  |  |  |  |
| 56047 | 2009_10 | female | 27 | 20-29 | 329.0 | White |  | 9 - 11th Grade | NeverMarried | ⋯ | Yes | 16.0 | No | Yes | 13.0 | 10.0 | 3.0 | No | Heterosexual | No |
| 57056 | 2009_10 | male | 26 | 20-29 | 316.0 | Mexican |  | High School | NeverMarried | ⋯ | Yes | 14.0 | Yes | Yes | 15.0 | 8.0 | 1.0 | No | Heterosexual |  |


#### Problem 1:

a) Using numerical and graphical summaries, explore and describe the distribution of BMI in `nhanes.samp.adult`. From the data in the sample, does it seem like the population mean BMI will be outside the BMI range defined as normal (18.5 - 24.99)?

b) Calculate a 95% confidence interval for the population mean BMI using `nhanes.samp.adult`. Does the interval suggest that the population mean BMI is outside the normal range?
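
The interval in part b) can be built from the sample mean, the standard error, and a critical value. The following is a minimal sketch on a small made-up vector, assuming a $t$-based interval; `bmi.toy` and its values are hypothetical and are not taken from `NHANES`.

```r
# Hypothetical BMI values for illustration only; not drawn from NHANES
bmi.toy <- c(22.1, 27.4, 31.0, 24.8, 29.3, 26.5, 35.2, 23.9, 28.7, 30.1)

# Numerical summaries
summary(bmi.toy)
sd(bmi.toy)

# Pieces of the t-based 95% confidence interval
n      <- length(bmi.toy)            # sample size
x.bar  <- mean(bmi.toy)              # point estimate of the mean
se     <- sd(bmi.toy) / sqrt(n)      # standard error of the mean
t.star <- qt(0.975, df = n - 1)      # critical value for 95% confidence

# Interval: point estimate plus or minus the margin of error
ci <- c(lower = x.bar - t.star * se, upper = x.bar + t.star * se)
ci
```

With the real sample, the same steps would apply to `nhanes.samp.adult$BMI`. If that column contains missing values, `mean()` and `sd()` need `na.rm = TRUE`, and `n` should count only the non-missing values. `hist()` and `boxplot()` are common base R starting points for the graphical summaries.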

#### Problem 2:

Conduct a hypothesis test.

a) Formulate null and alternative hypotheses. The symbol $\mu$ denotes a population mean, while $\mu_0$ refers to the numeric value specified by the null hypothesis. 

b) Specify a significance level, $\alpha$.

c) Calculate the test statistic. 
    
  $$t = \dfrac{\overline{x} - \mu_0}{s/ \sqrt{n}} $$

d) Calculate the $p$-value.

e) Draw a conclusion.
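
Steps c) and d) can be sketched in a few lines of R. The example below uses made-up values and a placeholder null value; `bmi.toy`, `mu.0 = 25`, and `alpha = 0.05` are assumptions chosen only for illustration, not the intended answers to this problem.

```r
# Hypothetical BMI values for illustration only; not drawn from NHANES
bmi.toy <- c(22.1, 27.4, 31.0, 24.8, 29.3, 26.5, 35.2, 23.9, 28.7, 30.1)

mu.0  <- 25     # assumed null value, for illustration
alpha <- 0.05   # assumed significance level, for illustration

# Test statistic: t = (x.bar - mu.0) / (s / sqrt(n))
n      <- length(bmi.toy)
t.stat <- (mean(bmi.toy) - mu.0) / (sd(bmi.toy) / sqrt(n))

# Two-sided p-value: probability of a t value at least this far from 0
p.two.sided <- 2 * pt(abs(t.stat), df = n - 1, lower.tail = FALSE)

# One-sided p-value for the alternative mu > mu.0
p.upper <- pt(t.stat, df = n - 1, lower.tail = FALSE)

# Compare the p-value to alpha; TRUE means reject the null hypothesis
p.two.sided < alpha
```

Whether the one-sided or the two-sided $p$-value is appropriate depends on the alternative hypothesis chosen in part a).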

#### Problem 3:

Is mean body temperature really 98.6°F? Conduct a hypothesis test to evaluate this claim using data from 130 healthy volunteers who participated in a vaccine study. The data are in the file `body_temperatures.Rdata` (see the code below for loading the data).

a) Choose whether to conduct a one-sided or two-sided test. Formulate null and alternative hypotheses.

b) Specify a significance level, $\alpha$.

c) Calculate the test statistic.
 
d) Calculate the $p$-value.

e) Confirm your calculations in parts c) and d) using `t.test()`.

f) Draw a conclusion.
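
For part e), `t.test()` returns a list whose components can be compared directly with hand calculations. This is a minimal sketch, assuming a numeric vector of temperatures; `temps.toy` is hypothetical and stands in for the variable stored in `body_temperatures.Rdata`, whose name is not given here.

```r
# Hypothetical temperatures (degrees F) for illustration only
temps.toy <- c(98.2, 97.9, 98.6, 98.4, 98.0, 98.8, 97.6, 98.3, 98.1, 98.5)
mu.0 <- 98.6   # null value from the claim being tested

# Hand calculation of the test statistic and two-sided p-value
n        <- length(temps.toy)
t.manual <- (mean(temps.toy) - mu.0) / (sd(temps.toy) / sqrt(n))
p.manual <- 2 * pt(abs(t.manual), df = n - 1, lower.tail = FALSE)

# Same test via t.test(); alternative can be "two.sided", "less", or "greater"
result <- t.test(temps.toy, mu = mu.0, alternative = "two.sided")

result$statistic   # t statistic
result$parameter   # degrees of freedom
result$p.value     # p-value

# Each comparison is TRUE when the two approaches agree up to numerical tolerance
all.equal(t.manual, unname(result$statistic))
all.equal(p.manual, result$p.value)
```

The sketch uses a two-sided alternative only as an example; the `alternative` argument should match the hypotheses chosen in part a).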


```r
load("body_temperatures.Rdata")
```
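
The names of the objects stored in the file are not listed here, but `load()` invisibly returns them as a character vector, which makes it easy to find out what was restored. The sketch below demonstrates this on a stand-in file written to a temporary location; `toy.temps`, its `temperature` column, and `toy.path` are hypothetical and do not describe the contents of `body_temperatures.Rdata`.

```r
# Build a stand-in .Rdata file in a temporary location (hypothetical contents)
toy.temps <- data.frame(temperature = c(98.2, 97.9, 98.6))
toy.path  <- tempfile(fileext = ".Rdata")
save(toy.temps, file = toy.path)
rm(toy.temps)   # remove the object so that load() has something to restore

# load() invisibly returns the names of the objects it restored
loaded.names <- load(toy.path)
loaded.names

# Inspect the structure of the first restored object
str(get(loaded.names[1]))
```

Applying the same pattern to `body_temperatures.Rdata` would show the name of the data object and the variables it contains before any calculations are attempted.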