2_SLR_Significance_Confidence_fkingfisher.Rmd

---
title: "Test Model for Significance, Generate Confidence and Prediction Intervals"
author: " Forest Kingfisher"
date: "`r Sys.Date()`"
output: html_document
---
## Exercise 2.12, taken fromMontgomery, Douglas C., et al. <u>Introduction to Linear Regression Analysis</u>, John Wiley & Sons, Incorporated, 2012.

The number of pounds of steam used per month at a plant is thought to be related to the average monthly ambient temperature. The past year’s usages and temperatures follow.

```{r, echo=TRUE}
#Import steam data by month
dat212 <- read.csv("https://raw.githubusercontent.com/forestwhite/Regression/main/data/212Table.csv")
names(dat212)[names(dat212) == "ï..Month"] <- "Month"
dat212
```

**b. Test for significance of regression.**

There are two tests we can use to test the significance of the whole model, the t and F tests. To apply the t test, we can use summary data from the linear model or calculate the t statistic manually and compare it to the t critical value for the degrees of freedom and 95% confidence target:

 * t_critical value (alpha = 0.05, df = n-2 = 10): 2.228139, 
 * t_statistic for our data: 272.255
 
t_statistic > t_critical, therefore we <u>reject H_0</u> and conclude that <u>number of pounds of steam used per month is significantly correlated to the average monthly ambient temperature</u>.

```{r, echo=TRUE}
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)

# calculate t-statistic for our data set using the linear model
lm212<-lm(dat212$Usage.l000~dat212$Temperature)
summary(lm212)$coefficients[2 , 3] #t-statistic

# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
yhat_i
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_error
variance_y<-SS_error/10
variance_y
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)
SSxx

# calculate t-statistic for our data set manually
t_stat<-9.20847/(sqrt(variance_y/SSxx))
t_stat
```
To apply the F test, we use the F value calculated in the linear model or calculate the F statistic for our model manually and compare it to the critical F value given our degrees of freedom:

 * F_critical value (alpha = 0.05, df = n-2 = 10): 4.964603 
 * F_statistic for our data: 74122.78
 
F_statistic > F_critical, therefore we <u>reject H_0</u> and conclude that <u>number of pounds of steam used per month is significantly correlated to the average monthly ambient temperature</u>.

```{r, echo=TRUE}
# calculate one-sided F critical value for alpha=0.05, degrees of freedom=n-2=10
qf(p=.05, df1=1, df2=10, lower.tail=FALSE)

# calculate F-statistic for our data set using the linear model
lm212<-lm(Usage.l000~Temperature, dat=dat212) #IMPORTANT: Do not use $ notation here or predict() cant match column names later
summary(lm212)$f[1]

# calculate y_hat estimate, sum of squares error, and sum of squares regression (response)
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_regression<-sum((yhat_i-mean(dat212$Usage.l000))^2)

# calculate F-statistic for our data set manually
f_stat<-(SS_regression/1)/(SS_error/10)
f_stat
```

**c. Plant management believes that an increase in average ambient temperature of 1 degree will increase average monthly steam consumption by 10,000 lb. Do the data support this statement?**

This question suggests Beta_1 is 10 for the population, and asks us to test this against the Beta_1_hat we get from the data. The t test statistics are:

 * t_critical value (alpha = 0.05, df = n-2 = 10): 2.228139
 * t_statistic for Beta_1 = 10: -23.40216
 
|t_statistic| > t_critical, therefore we <u>reject H_0</u> and conclude that <u>the slope is not 10</u>.

```{r, echo=TRUE}
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)

#store least squares fit and hypothesized coefficients
beta_1_hat<-9.20847
beta_1_0<-10

# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)

# calculate t-statistic for null hypothesis Beta_1 = 10
t_stat<-(beta_1_hat-beta_1_0)/(sqrt(variance_y/SSxx))
t_stat
```

**d. Construct a 99% confidence and prediction intervals on steam usage in a month with average ambient temperature of 58°.**

 * The Confidence Interval(E[y|58]) = (525.594,529.9244)
 * The Prediction Interval(y_i,new) = (521.2238,534.2945)

```{r, echo=TRUE}
# calculate critical t value for 99% confidence and 10 degrees of freedom
t_99<-qt(p=.01/2, df=10, lower.tail=FALSE)

#calculate the linear fit model value for 58°
yhat_58<-9.20847*58-6.33209

#calculate the sample variance of y
variance_y<-SS_error/10

#mean squared error is equal to variance of y, manually calculate confidence interval
lower_confidence <- yhat_58-t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_confidence <- yhat_58+t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_confidence
upper_confidence

#use R to calculate confidence from the linear model
predict(lm212,data.frame(Temperature=58),level=0.99, interval="confidence")

#mean squared error is equal to variance of y, manually calculate prediction interval
lower_prediction <- yhat_58-t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_prediction <- yhat_58+t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_prediction
upper_prediction
predict(lm212,data.frame(Temperature=58),level=0.99, interval="prediction")
```

**Plot the data adding a fitted regression line, 95% confidence interval, and 95% prediction interval. **

```{r, echo=TRUE}
# generate confidence intervals
newx<-seq(min(dat212$Temperature),max(dat212$Temperature),4.4)
conf<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="confidence")
pred<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="prediction")

# plot the linear model
par(mar = c(5,5, 5, 3)) #adjust plot margins
plot(dat212$Usage.l000~dat212$Temperature, col="steelblue", xlab = "average monthly ambient temperature, Fahrenheit",ylab = "1000s of lbs of steam",main="Pounds of steam used vs. \n average monthly ambient temperature")
abline(lm212, col="steelblue")

lines(newx,conf[,2], col="forestgreen", lty = 'dashed')
lines(newx,conf[,3], col="forestgreen", lty = 'dashed')
lines(newx,pred[,2], col="firebrick2", lty = 'dashed')
lines(newx,pred[,3], col="firebrick2", lty = 'dashed')
```

### Complete Code

The complete R code used in this analysis is as follows:

```{r, eval=FALSE}
#Import steam data by month
dat212 <- read.csv("https://raw.githubusercontent.com/forestwhite/Regression/main/data/212Table.csv")
names(dat212)[names(dat212) == "ï..Month"] <- "Month"
dat212
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)

# calculate t-statistic for our data set using the linear model
lm212<-lm(dat212$Usage.l000~dat212$Temperature)
lm212
summary(lm212)$coefficients[2 , 3] #t-statistic

# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)

# calculate t-statistic for our data set manually
t_stat<-9.20847/(sqrt(variance_y/SSxx))
t_stat

# calculate one-sided F critical value for alpha=0.05, degrees of freedom=n-2=10
qf(p=.05, df1=1, df=10, lower.tail=FALSE)

# calculate t-statistic for our data set using the linear model
lm212<-lm(Usage.l000~Temperature, dat=dat212) #IMPORTANT: Do not use $ notation here or predict() cant match column names later
summary(lm212)$f[1]

# calculate y_hat estimate, sum of squares error, and sum of squares regression (response)
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_regression<-sum((yhat_i-mean(dat212$Usage.l000))^2)

# calculate F-statistic for our data set manually
f_stat<-(SS_regression/1)/(SS_error/10)
f_stat

# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)

beta_1_hat<-9.20847
beta_1_0<-10

# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)

# calculate t-statistic for null hypothesis Beta_1 = 10
t_stat<-(beta_1_hat-beta_1_0)/(sqrt(variance_y/SSxx))
t_stat

# calculate critical t value for 99% confidence and 10 degrees of freedom
t_99<-qt(p=.01/2, df=10, lower.tail=FALSE)

#calculate the linear fit model value for 58°
yhat_58<-9.20847*58-6.33209

#calculate the sample variance of y
variance_y<-SS_error/10

#mean squared error is equal to variance of y, manually calculate confidence interval
lower_confidence <- yhat_58-t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_confidence <- yhat_58+t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_confidence
upper_confidence

#use R to calculate confidence from the linear model
predict(lm212,data.frame(Temperature=58),level=0.99, interval="confidence")

#mean squared error is equal to variance of y, manually calculate prediction interval
lower_prediction <- yhat_58-t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_prediction <- yhat_58+t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_prediction
upper_prediction
predict(lm212,data.frame(Temperature=58),level=0.99, interval="prediction")

# generate confidence intervals
newx<-seq(min(dat212$Temperature),max(dat212$Temperature),4.4)
conf<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="confidence")
pred<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="prediction")

# plot the linear model
par(mar = c(5,5, 5, 3)) #adjust plot margins
plot(dat212$Usage.l000~dat212$Temperature, col="steelblue", xlab = "average monthly ambient temperature, Fahrenheit",ylab = "1000s of lbs of steam",main="Pounds of steam used vs. \n average monthly ambient temperature")
abline(lm212, col="steelblue")

lines(newx,conf[,2], col="forestgreen", lty = 'dashed')
lines(newx,conf[,3], col="forestgreen", lty = 'dashed')
lines(newx,pred[,2], col="firebrick2", lty = 'dashed')
lines(newx,pred[,3], col="firebrick2", lty = 'dashed')
```