# Unit #6 Code


## Question #1

A response time is normally distributed with standard deviation of 25 milliseconds. A new system has been installed, and we wish to estimate the true average response time, $\mu$, for the new environment. 

Assuming that the response times are still normally distributed with $\sigma = 25$, what sample size is necessary to ensure that the resulting 95% confidence interval has a width of (at most) 10 milliseconds?

In [16]:
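sigma = 25; alpha = 0.05; width = 10
cv = qnorm(1 - alpha/2)        # two-sided critical value, about 1.96
n = (2*cv*sigma/width)^2       # solve 2*cv*sigma/sqrt(n) <= width for n
n
ceiling(n)
# n is about 96.04, so we round up: n = 97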
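
The formula in the code follows from the width of a z-interval when $\sigma$ is known. As a short derivation, writing $w$ for the maximum allowed width (a symbol introduced here, not in the original question):

$$
2\, z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}} \le w
\quad\Longrightarrow\quad
n \ge \left(\frac{2\, z_{\alpha/2}\, \sigma}{w}\right)^2
= \left(\frac{2\,(1.96)(25)}{10}\right)^2 \approx 96.04 .
$$

Since $n$ must be an integer, the smallest sample size that satisfies the inequality is $n = 97$.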

## Question #2

The EPA considers indoor radon levels above 4 picocuries per liter (pCi/L) of air to be high enough to warrant amelioration efforts. Tests in a sample of 200 homes found 127 of these sampled households to have indoor radon levels above 4 pCi/L. Calculate the 99% confidence interval for the proportion of homes with indoor radon levels above 4 pCi/L.

In [31]:
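n = 200; p = 127/n                 # sample size and sample proportion
z = qnorm(1-0.01/2)                # critical value for 99% confidence
se = sqrt(p*(1-p)/n)               # standard error of the sample proportion

ci = c(p - z*se, p + z*se)
ci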
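
As a rough cross-check, the hand-computed (Wald) interval can be compared with the score interval that base R's `prop.test` returns when the continuity correction is switched off. This is only a sketch: the names `x`, `p_hat`, `wald`, and `wilson` are introduced here for illustration and are not part of the original solution.

```r
# Wald (normal-approximation) interval, computed as in the cell above
n <- 200; x <- 127
p_hat <- x / n
z <- qnorm(1 - 0.01 / 2)
se <- sqrt(p_hat * (1 - p_hat) / n)
wald <- c(p_hat - z * se, p_hat + z * se)

# Score (Wilson) interval from prop.test, without continuity correction
wilson <- as.numeric(prop.test(x, n, conf.level = 0.99, correct = FALSE)$conf.int)

# Rule-of-thumb check for the normal approximation: counts of successes and failures
c(successes = x, failures = n - x)

rbind(wald, wilson)
```

With 127 successes and 73 failures the normal approximation is reasonable, so the two intervals should be close to each other.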

## Question #3

(a) Read in the dataset from this link: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv.

This dataset is related to a red Portuguese "Vinho Verde" wine. 

In [17]:
site = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = read.table(site, sep = ";", header = TRUE)
head(wine)

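| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.0 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |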


(b) Calculate a 90% confidence interval for the mean pH level.

In [32]:
alpha = 0.1; n = length(wine$pH)
z = qnorm(1-alpha/2)     # normal critical value; n is large, so s stands in for sigma
x_bar = mean(wine$pH); s = sd(wine$pH); se = s/sqrt(n)
ci = c(x_bar - z*se, x_bar + z*se)
ci
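
Because $\sigma$ is estimated by $s$ here, a t-based interval is the textbook alternative. The sketch below suggests how little the choice matters at this sample size. To stay runnable without a download it uses `ph_demo`, a hypothetical simulated stand-in for `wine$pH` whose mean and standard deviation are arbitrary illustrative values, not estimates from the real data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
# Hypothetical stand-in for wine$pH: 1599 simulated values (the red-wine file has 1599 rows)
ph_demo <- rnorm(1599, mean = 3.3, sd = 0.15)

alpha <- 0.10
n <- length(ph_demo)
x_bar <- mean(ph_demo)
se <- sd(ph_demo) / sqrt(n)

# Same standard error, different critical values
ci_z <- x_bar + c(-1, 1) * qnorm(1 - alpha / 2) * se
ci_t <- x_bar + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se

# t.test() reports the same t-based interval
ci_builtin <- as.numeric(t.test(ph_demo, conf.level = 1 - alpha)$conf.int)

rbind(ci_z, ci_t, ci_builtin)
```

On the real data, replacing `ph_demo` with `wine$pH` would give the corresponding comparison.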

## Question #4

In this example, we will construct a simulation to verify the “coverage properties” of a confidence interval for the mean of a normal distribution.

(a) Simulate a matrix with m = 1000 rows and n = 100 columns, where each entry is a random number from the population $N(0, 1)$. Interpret each row as a sample from the population.


In [10]:
m = 1000; n = 100
x = t(replicate(m, expr = rnorm(n)))
dim(x)

(b) Suppose that we didn’t know the mean, μ, of the population and wanted to estimate it using a confidence interval. For each sample, calculate the 95% confidence interval for the mean, μ.

In [31]:
xbar = rowMeans(x)
z = qnorm(1-0.05/2)
# sigma = 1 is known, so the standard error of each sample mean is 1/sqrt(n)
ci = cbind(xbar - z/sqrt(n), xbar + z/sqrt(n))


(c) Why would we use a confidence interval instead of just reporting the sample mean $\bar{x}$?

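The sample mean $\bar{x}$ is a single point estimate and says nothing about its own precision. A confidence interval gives a range of plausible values for the true population mean, so it also conveys how much uncertainty sampling variability introduces.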

(d) Interpret the confidence interval for the first sample (i.e., the first row of the matrix).

In [28]:
ci[1,]

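We are 95% confident that the interval (-0.307, 0.084) covers the true mean μ: the procedure that produced it captures μ in about 95% of repeated samples. Here we happen to know that μ = 0, and this interval does contain it.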

(e) Justify why, in part (b), you can use critical values from the normal distribution or critical values from the t distribution.

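In part (b), the samples come from a normal population and $\sigma = 1$ is treated as known, so the sample mean is exactly normally distributed and normal critical values are appropriate. If $\sigma$ were instead estimated by the sample standard deviation $s$, the t distribution with $n - 1 = 99$ degrees of freedom would be the right choice, but with $n = 100$ its critical values are nearly identical to the normal ones (about 1.98 versus 1.96), so either gives essentially the same interval.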
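
To illustrate how close the two choices are, the following sketch repeats the simulation and compares the known-$\sigma$ z-interval with a t-interval that estimates $\sigma$ from each row. The seed and the names `s`, `z_crit`, `t_crit`, `cover_z`, and `cover_t` are assumptions made for this example; the matrix is regenerated here so the block runs on its own.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
m <- 1000; n <- 100
x <- matrix(rnorm(m * n), nrow = m)   # each row is one sample from N(0, 1)

xbar <- rowMeans(x)
s <- apply(x, 1, sd)                   # per-sample standard deviation

z_crit <- qnorm(0.975)
t_crit <- qt(0.975, df = n - 1)

# An interval covers mu = 0 exactly when |xbar| is within its half-width
cover_z <- mean(abs(xbar) <= z_crit * 1 / sqrt(n))   # sigma = 1 known
cover_t <- mean(abs(xbar) <= t_crit * s / sqrt(n))   # sigma estimated by s

c(z_crit = z_crit, t_crit = t_crit, cover_z = cover_z, cover_t = cover_t)
```

Both coverage proportions should land near 0.95, which is consistent with the argument that either set of critical values is acceptable at this sample size.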

(f) Calculate the proportion of confidence intervals that cover the true μ. Does it match what theory suggests? If it deviates from what theory suggests, explain why.

In [30]:
# an interval misses mu = 0 if its lower bound is above 0 or its upper bound is below 0
cl = 1 - sum(ci[,1] > 0 | ci[,2] < 0)/m; print(cl)

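[1] 0.948

The observed coverage, 0.948, is very close to the nominal 0.95, as theory suggests. The small gap is what we would expect from simulation variability with only m = 1000 samples.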
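
To gauge how far a simulated coverage proportion can reasonably stray from 0.95, we can treat each interval as a Bernoulli trial that covers μ with probability 0.95. Writing $\hat{c}$ for the simulated coverage (a symbol introduced here for illustration), its standard error over $m = 1000$ intervals is

$$
\mathrm{SE}(\hat{c}) = \sqrt{\frac{0.95\,(1 - 0.95)}{1000}} \approx 0.0069 .
$$

The observed value 0.948 differs from 0.95 by 0.002, well within one standard error.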


## Question #5

The authors of the article “Adjuvant Radiotherapy and Chemotherapy in Node-Positive Premenopausal Women with Breast Cancer” (New Engl. J. of Med., 1997: 956–962) reported on the results of an experiment designed to compare treating cancer patients with chemotherapy only to treatment with a combination of chemotherapy and radiation.

Of the 154 individuals who received the chemotherapy-only treatment, 76 survived at least 15 years, whereas 98 of the 164 patients who received the hybrid treatment survived at least that long.

What is the 99% confidence interval for this difference in proportions?


In [25]:
n1 = 154; n2 = 164                          # chemotherapy only; chemotherapy plus radiation
p1 = 76/n1; p2 = 98/n2; z = qnorm(1-0.01/2)
se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)      # unpooled standard error of p1 - p2
est = p1 - p2

ci = c(est - z*se, est + z*se)
ci
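
One way to sanity-check this interval is against base R's `prop.test`, which for two groups without continuity correction reports a confidence interval built from the same unpooled standard error. This is a sketch; the vector names `x`, `n`, `p`, `wald`, and `builtin` are introduced here and are not part of the original solution.

```r
x <- c(76, 98)      # survivors: chemotherapy only, chemotherapy plus radiation
n <- c(154, 164)    # group sizes
p <- x / n
z <- qnorm(1 - 0.01 / 2)

# Hand-computed interval for p1 - p2
se <- sqrt(sum(p * (1 - p) / n))
wald <- (p[1] - p[2]) + c(-1, 1) * z * se

# Built-in interval, with the continuity correction turned off
builtin <- as.numeric(prop.test(x, n, conf.level = 0.99, correct = FALSE)$conf.int)

rbind(wald, builtin)

# Does the interval contain 0 (no difference between treatments)?
wald[1] < 0 && wald[2] > 0
```

If the interval contains 0, the data do not rule out equal 15-year survival proportions at the 99% confidence level.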