-
Notifications
You must be signed in to change notification settings - Fork 0
/
2_SLR_Significance_Confidence_fkingfisher.Rmd
251 lines (192 loc) · 10.7 KB
/
2_SLR_Significance_Confidence_fkingfisher.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "Test Model for Significance, Generate Confidence and Prediction Intervals"
author: " Forest Kingfisher"
date: "`r Sys.Date()`"
output: html_document
---
## Exercise 2.12, taken fromMontgomery, Douglas C., et al. <u>Introduction to Linear Regression Analysis</u>, John Wiley & Sons, Incorporated, 2012.
The number of pounds of steam used per month at a plant is thought to be related to the average monthly ambient temperature. The past year’s usages and temperatures follow.
```{r, echo=TRUE}
#Import steam data by month
dat212 <- read.csv("https://raw.githubusercontent.com/forestwhite/Regression/main/data/212Table.csv")
names(dat212)[names(dat212) == "ï..Month"] <- "Month"
dat212
```
**b. Test for significance of regression.**
There are two tests we can use to test the significance of the whole model, the t and F tests. To apply the t test, we can use summary data from the linear model or calculate the t statistic manually and compare it to the t critical value for the degrees of freedom and 95% confidence target:
* t_critical value (alpha = 0.05, df = n-2 = 10): 2.228139,
* t_statistic for our data: 272.255
t_statistic > t_critical, therefore we <u>reject H_0</u> and conclude that <u>number of pounds of steam used per month is significantly correlated to the average monthly ambient temperature</u>.
```{r, echo=TRUE}
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)
# calculate t-statistic for our data set using the linear model
lm212<-lm(dat212$Usage.l000~dat212$Temperature)
summary(lm212)$coefficients[2 , 3] #t-statistic
# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
yhat_i
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_error
variance_y<-SS_error/10
variance_y
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)
SSxx
# calculate t-statistic for our data set manually
t_stat<-9.20847/(sqrt(variance_y/SSxx))
t_stat
```
To apply the F test, we use the F value calculated in the linear model or calculate the F statistic for our model manually and compare it to the critical F value given our degrees of freedom:
* F_critical value (alpha = 0.05, df = n-2 = 10): 4.964603
* F_statistic for our data: 74122.78
F_statistic > F_critical, therefore we <u>reject H_0</u> and conclude that <u>number of pounds of steam used per month is significantly correlated to the average monthly ambient temperature</u>.
```{r, echo=TRUE}
# calculate one-sided F critical value for alpha=0.05, degrees of freedom=n-2=10
qf(p=.05, df1=1, df2=10, lower.tail=FALSE)
# calculate F-statistic for our data set using the linear model
lm212<-lm(Usage.l000~Temperature, dat=dat212) #IMPORTANT: Do not use $ notation here or predict() cant match column names later
summary(lm212)$f[1]
# calculate y_hat estimate, sum of squares error, and sum of squares regression (response)
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_regression<-sum((yhat_i-mean(dat212$Usage.l000))^2)
# calculate F-statistic for our data set manually
f_stat<-(SS_regression/1)/(SS_error/10)
f_stat
```
**c. Plant management believes that an increase in average ambient temperature of 1 degree will increase average monthly steam consumption by 10,000 lb. Do the data support this statement?**
This question suggests Beta_1 is 10 for the population, and asks us to test this against the Beta_1_hat we get from the data. The t test statistics are:
* t_critical value (alpha = 0.05, df = n-2 = 10): 2.228139
* t_statistic for Beta_1 = 10: -23.40216
|t_statistic| > t_critical, therefore we <u>reject H_0</u> and conclude that <u>the slope is not 10</u>.
```{r, echo=TRUE}
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)
#store least squares fit and hypothesized coefficients
beta_1_hat<-9.20847
beta_1_0<-10
# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)
# calculate t-statistic for null hypothesis Beta_1 = 10
t_stat<-(beta_1_hat-beta_1_0)/(sqrt(variance_y/SSxx))
t_stat
```
**d. Construct a 99% confidence and prediction intervals on steam usage in a month with average ambient temperature of 58°.**
* The Confidence Interval(E[y|58]) = (525.594,529.9244)
* The Prediction Interval(y_i,new) = (521.2238,534.2945)
```{r, echo=TRUE}
# calculate critical t value for 99% confidence and 10 degrees of freedom
t_99<-qt(p=.01/2, df=10, lower.tail=FALSE)
#calculate the linear fit model value for 58°
yhat_58<-9.20847*58-6.33209
#calculate the sample variance of y
variance_y<-SS_error/10
#mean squared error is equal to variance of y, manually calculate confidence interval
lower_confidence <- yhat_58-t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_confidence <- yhat_58+t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_confidence
upper_confidence
#use R to calculate confidence from the linear model
predict(lm212,data.frame(Temperature=58),level=0.99, interval="confidence")
#mean squared error is equal to variance of y, manually calculate prediction interval
lower_prediction <- yhat_58-t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_prediction <- yhat_58+t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_prediction
upper_prediction
predict(lm212,data.frame(Temperature=58),level=0.99, interval="prediction")
```
**Plot the data adding a fitted regression line, 95% confidence interval, and 95% prediction interval. **
```{r, echo=TRUE}
# generate confidence intervals
newx<-seq(min(dat212$Temperature),max(dat212$Temperature),4.4)
conf<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="confidence")
pred<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="prediction")
# plot the linear model
par(mar = c(5,5, 5, 3)) #adjust plot margins
plot(dat212$Usage.l000~dat212$Temperature, col="steelblue", xlab = "average monthly ambient temperature, Fahrenheit",ylab = "1000s of lbs of steam",main="Pounds of steam used vs. \n average monthly ambient temperature")
abline(lm212, col="steelblue")
lines(newx,conf[,2], col="forestgreen", lty = 'dashed')
lines(newx,conf[,3], col="forestgreen", lty = 'dashed')
lines(newx,pred[,2], col="firebrick2", lty = 'dashed')
lines(newx,pred[,3], col="firebrick2", lty = 'dashed')
```
### Complete Code
The complete R code used in this analysis is as follows:
```{r, eval=FALSE}
#Import steam data by month
dat212 <- read.csv("https://raw.githubusercontent.com/forestwhite/Regression/main/data/212Table.csv")
names(dat212)[names(dat212) == "ï..Month"] <- "Month"
dat212
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)
# calculate t-statistic for our data set using the linear model
lm212<-lm(dat212$Usage.l000~dat212$Temperature)
lm212
summary(lm212)$coefficients[2 , 3] #t-statistic
# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)
# calculate t-statistic for our data set manually
t_stat<-9.20847/(sqrt(variance_y/SSxx))
t_stat
# calculate one-sided F critical value for alpha=0.05, degrees of freedom=n-2=10
qf(p=.05, df1=1, df=10, lower.tail=FALSE)
# calculate t-statistic for our data set using the linear model
lm212<-lm(Usage.l000~Temperature, dat=dat212) #IMPORTANT: Do not use $ notation here or predict() cant match column names later
summary(lm212)$f[1]
# calculate y_hat estimate, sum of squares error, and sum of squares regression (response)
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
SS_regression<-sum((yhat_i-mean(dat212$Usage.l000))^2)
# calculate F-statistic for our data set manually
f_stat<-(SS_regression/1)/(SS_error/10)
f_stat
# calculate two-tail t critical value for alpha=0.05, degrees of freedom=n-2=10
qt(p=.05/2, df=10, lower.tail=FALSE)
beta_1_hat<-9.20847
beta_1_0<-10
# calculate y_hat estimate, sum of squares error, variance, and sum of squares x^2
yhat_i<-9.20847*dat212$Temperature-6.33209
SS_error<-sum((dat212$Usage.l000-yhat_i)^2)
variance_y<-SS_error/10
SSxx<-sum((dat212$Temperature-mean(dat212$Temperature))^2)
# calculate t-statistic for null hypothesis Beta_1 = 10
t_stat<-(beta_1_hat-beta_1_0)/(sqrt(variance_y/SSxx))
t_stat
# calculate critical t value for 99% confidence and 10 degrees of freedom
t_99<-qt(p=.01/2, df=10, lower.tail=FALSE)
#calculate the linear fit model value for 58°
yhat_58<-9.20847*58-6.33209
#calculate the sample variance of y
variance_y<-SS_error/10
#mean squared error is equal to variance of y, manually calculate confidence interval
lower_confidence <- yhat_58-t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_confidence <- yhat_58+t_99*(sqrt(variance_y*(1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_confidence
upper_confidence
#use R to calculate confidence from the linear model
predict(lm212,data.frame(Temperature=58),level=0.99, interval="confidence")
#mean squared error is equal to variance of y, manually calculate prediction interval
lower_prediction <- yhat_58-t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
upper_prediction <- yhat_58+t_99*(sqrt(variance_y*(1+1/12+(58-mean(dat212$Temperature))^2/SSxx)))
lower_prediction
upper_prediction
predict(lm212,data.frame(Temperature=58),level=0.99, interval="prediction")
# generate confidence intervals
newx<-seq(min(dat212$Temperature),max(dat212$Temperature),4.4)
conf<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="confidence")
pred<-predict(lm212,data.frame(Temperature=newx),level=0.95,interval="prediction")
# plot the linear model
par(mar = c(5,5, 5, 3)) #adjust plot margins
plot(dat212$Usage.l000~dat212$Temperature, col="steelblue", xlab = "average monthly ambient temperature, Fahrenheit",ylab = "1000s of lbs of steam",main="Pounds of steam used vs. \n average monthly ambient temperature")
abline(lm212, col="steelblue")
lines(newx,conf[,2], col="forestgreen", lty = 'dashed')
lines(newx,conf[,3], col="forestgreen", lty = 'dashed')
lines(newx,pred[,2], col="firebrick2", lty = 'dashed')
lines(newx,pred[,3], col="firebrick2", lty = 'dashed')
```