---
title: "Final exam"
author: "Utkarsha Vidhale"
date: "Nov 27th, 2020."
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: yeti
    highlight: pygments
---
```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
echo = TRUE,
fig.align = 'center' ,
out.width="50%",
fig.height = 5,
fig.width = 7
)
```
# Instructions
*Some standard writing considerations:* Replace comments that instruct you to put code with your own code. Ensure your plots and output are visible and readable. Ensure you've typed up an explanation of your answers wherever required.
* *Format*: delete **all** comments and replace with your answers and code.
* *Name*: do not forget to put your name on the exam under the 'author' heading.
* *Submission*: your submission must consist of your copy of this Markdown document *and* a knitted `html` file. Any other type of submission will receive no credit and no opportunity for a re-submission.
* *Issues*: In case of extreme / unusual difficulties related to the pandemic or health otherwise, please contact the instructor as soon as possible to try to reschedule the interrupted exam or otherwise resolve the issue. Late submissions are not accepted otherwise.
* This is a 2-hour exam. You may take about *3 hours* to complete it. These three hours need to be *consecutive*, but can be chosen at your convenience between 14:00hrs Central Time on Friday 27th November and *two hours prior to your scheduled* oral exam time.
* Exam must be submitted on Google classroom.
* You must sign up for the virtual meeting via the link posted on Campuswire post #204.
## Honor code
> I pledge on this day, 12/4/2020, not to use any book, website, or other online source to solve the midterm problems.
I will not use any additional technology other than my local installation of R (Python if applicable) and RStudio (or similar) that I'm using to knit Markdown documents.
I will not give or receive information to or from any other persons during this midterm.
This document was edited and PDF knitted by me alone.
*UTKARSHA VIDHALE*
* Exam download time: *10:45 AM*
* Exam finish time: *12:50 PM*
--------
## Grading and rubric
Here is how you will be graded on the final exam:
* 100 points total, broken up as follows:
    * 80 points for complete solutions to the given problems submitted via Google Classroom;
        * Problems 1-3: 30 points (10 each);
        * Problems 4-5: 30 points (15 each);
        * Problem 6: 20 points;
    * 20 points for the virtual exam "in-person" focusing on one of those problems (that I will choose at random):
        * 5 points for knowledge of problem: being able to complete the solution "live";
        * 5 points for completeness: writing up the details so that the statistics behind the solution make complete sense;
        * 5 points for clarity of exposition and oral explanation;
        * 5 points for being able to answer a related follow-up question (e.g., a more in-depth explanation of a definition or a concept that the problem is about).
----
# Exam problems
## Problem 1 - Hypothesis tests for regression coefficients
This question is about the `Advertising` data set which we've seen extensively in the lectures on regression.
Consider the following table output of the `lm` function:
| | Coefficient | Std. error | t-statistic | p-value |
| ---------- | ---------- | ---------- | ---------- | ---------- |
| Intercept | 2.939 | 0.3119 | 9.42 | <0.0001 |
| TV | 0.046 | 0.0014 | 32.81 | <0.0001 |
| radio | 0.189 | 0.0086 | 21.89 | <0.0001 |
| newspaper | -0.001 | 0.0059 | -0.18 | 0.8599 |
*Table: For the `Advertising` data, least squares coefficient estimates of the multiple linear regression of number of units sold on `radio`, `TV`, and `newspaper` advertising budgets.*
Describe the null hypotheses to which the p-values given in the table above correspond.
__Answer:__
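Each p-value corresponds to its own null hypothesis: that the coefficient of that term is 0, given that the other predictors are in the model. For `TV`, `radio`, and `newspaper`, this means that spending on that medium has no association with `sales` once the other two media are accounted for. For the intercept, it means that expected `sales` are 0 when all three budgets are 0.
If all three slope hypotheses held at once, the model would reduce to: `Sales` = `TV`$*$ 0 + `Radio`$*$ 0 + `Newspaper`$*$ 0 + `Intercept`
The p-value for TV, p < 0.0001, shows that we can reject the null hypothesis and conclude that TV has an impact on Sales, when holding Radio and Newspaper advertising constant.
The p-value for Radio, p < 0.0001, shows that we can reject the null hypothesis and conclude that Radio has an impact on Sales, when holding TV and Newspaper advertising constant.
The p-value for Newspaper, p = 0.8599, shows that we cannot reject the null hypothesis: there is no evidence that Newspaper has an impact on Sales, when holding TV and Radio advertising constant.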
Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of `sales`, `TV`, `radio`, and `newspaper`, rather than in terms of the coefficients of the linear model.
__Answer:__
The p-values of `TV` and `radio` are both `<0.0001`, well below `0.05`. This is strong evidence against the null hypothesis for `TV` and for `radio`.
That is, when `radio` and `newspaper` advertising are held constant, `TV` advertising shows an impact on sales. And when `TV` and `newspaper` advertising are held constant, `radio` advertising shows an impact on sales.
The p-value for `newspaper` is `0.8599`, well above `0.05`. This means there is not enough evidence to reject the null hypothesis for `newspaper`.
That is, when `radio` and `TV` advertising are held constant, `newspaper` advertising does not show any detectable impact on sales.
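As a rough check on how the table's columns relate, the sketch below recomputes each t-statistic as coefficient divided by standard error and turns it into a two-sided p-value. It assumes 196 residual degrees of freedom (200 observations minus 3 predictors and the intercept); that figure is not stated in the table, and the names `coef_tab` and `resid_df` are only illustrative. Because the inputs are the rounded values printed above, the results differ slightly from the table.

```{r}
# Values copied from the table above (rounded as printed there).
coef_tab <- data.frame(
  term     = c("Intercept", "TV", "radio", "newspaper"),
  estimate = c(2.939, 0.046, 0.189, -0.001),
  std_err  = c(0.3119, 0.0014, 0.0086, 0.0059)
)

# Assumption: n = 200 observations and 3 predictors, so 200 - 3 - 1 = 196 residual df.
resid_df <- 196

# t-statistic = estimate / standard error; two-sided p-value from the t distribution.
coef_tab$t_stat  <- coef_tab$estimate / coef_tab$std_err
coef_tab$p_value <- 2 * pt(-abs(coef_tab$t_stat), df = resid_df)
coef_tab
```

Only the `newspaper` row ends up with a p-value above 0.05, which matches the conclusion drawn from the table.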
## Problem 2 - a regression puzzle
Suppose we have a data set with five predictors, $X_1 =$ GPA, $X_2 =$ IQ, $X_3 =$ Gender (0 for Male and 1 for not Male), $X_4 =$ Interaction between GPA and IQ, and $X_5 =$ Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get:
$$ \hat\beta_0 = 50, \quad \hat\beta_1 = 20, \quad \hat\beta_2 = 0.07, \quad \hat\beta_3 = 35, \quad \hat\beta_4 = 0.01, \quad \hat\beta_5 = -10.$$
* Which answer is correct, and why?
*($\rightarrow$ since we spent little time on interaction terms, make sure you can explain your answer to a) and b), and do your best for c) and d) as much as you are able.)*
a) For a fixed value of IQ and GPA, males earn more on average than others.
__Answer:__ Incorrect
The equation for salary:
$Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA \times IQ) - 10(GPA \times Gender)$
For non-males (`Gender` = 1),
$Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA \times IQ) - 10GPA$
For males (`Gender` = 0),
$Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA \times IQ)$
Subtracting the second equation from the first, the difference (non-male minus male) is $35 - 10GPA$, which is zero at `GPA` $= 3.5$. Males earn more on average only when `GPA` $> 3.5$, so the statement does not hold for every fixed value of GPA.
b) For a fixed value of IQ and GPA, non-males earn more on average than males.
__Answer:__ Incorrect
The equation for salary:
$Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA \times IQ) - 10(GPA \times Gender)$
For non-males (`Gender` = 1),
$Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA \times IQ) - 10GPA$
For males (`Gender` = 0),
$Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA \times IQ)$
Subtracting the second equation from the first, the difference (non-male minus male) is $35 - 10GPA$, which is zero at `GPA` $= 3.5$. Non-males earn more on average only when `GPA` $< 3.5$, so the statement does not hold for every fixed value of GPA.
c) For a fixed value of IQ and GPA, males earn more on average than non-males provided that the GPA is high enough.
__Answer:__ Correct
The equation for salary:
$Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA \times IQ) - 10(GPA \times Gender)$
For non-males (`Gender` = 1),
$Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA \times IQ) - 10GPA$
For males (`Gender` = 0),
$Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA \times IQ)$
Subtracting the second equation from the first, the difference (non-male minus male) is $35 - 10GPA$. The two average salaries are equal at `GPA` $= 3.5$, and the difference is negative for `GPA` $> 3.5$, so males earn more on average once the GPA is high enough.
d) For a fixed value of IQ and GPA, non-males earn more on average than males provided that the GPA is high enough.
__Answer:__ Incorrect
The equation for salary:
$Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA \times IQ) - 10(GPA \times Gender)$
For non-males (`Gender` = 1),
$Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA \times IQ) - 10GPA$
For males (`Gender` = 0),
$Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA \times IQ)$
Subtracting the second equation from the first, the difference (non-male minus male) is $35 - 10GPA$. It is positive only for `GPA` $< 3.5$, so non-males earn more on average when the GPA is low enough, not when it is high enough.
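The comparison in parts a) to d) can also be checked numerically. The sketch below wraps the fitted equation in a function and tabulates the gap between the two groups at a few GPA values. The function name `predict_salary` and the grid of GPA values are illustrative choices, not part of the exam.

```{r}
# Fitted coefficients quoted in the problem statement.
b0 <- 50; b1 <- 20; b2 <- 0.07; b3 <- 35; b4 <- 0.01; b5 <- -10

# Predicted starting salary in thousands of dollars; gender is 0 (male) or 1 (non-male).
predict_salary <- function(gpa, iq, gender) {
  b0 + b1 * gpa + b2 * iq + b3 * gender + b4 * gpa * iq + b5 * gpa * gender
}

gpa_grid <- seq(2.5, 4.0, by = 0.5)
iq_fixed <- 110  # any fixed IQ gives the same gap, since the IQ terms cancel

# Gap = non-male minus male prediction; algebraically this is 35 - 10 * GPA.
salary_gap <- predict_salary(gpa_grid, iq_fixed, 1) - predict_salary(gpa_grid, iq_fixed, 0)
data.frame(GPA = gpa_grid, gap_nonmale_minus_male = salary_gap)
```

The gap changes sign at a GPA of 3.5, which is the threshold used in the answers above.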
* Predict the salary of a non-male with IQ of 110 and a GPA of 4.0.
__Answer:__ Salary of a non-male with IQ of 110 and a GPA of 4.0 = 137.1 thousand dollars (137,100 dollars)
The equation for salary:
Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)
For `Gender` = Female (1), `IQ` = 110, `GPA` = 4.0,
Salary = 50 +20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) -10(GPA)
Salary = 50 + 20(4.0) + 0.07(110) +35 + 0.01(4.0 * 110) - 10(4.0)
```{r}
GPA = 4.0
IQ = 110
# Gender = 1 (non-male): adds the 35 offset and the -10 * GPA interaction term
Salary = (50 + (20*GPA) + (0.07*IQ) + 35 + (0.01*GPA*IQ) - (10*GPA))
Salary
```
Salary of a non-male with IQ of 110 and a GPA of 4.0 = `r round(Salary,1)` thousand dollars
* True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
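__Answer:__ False.
The size of a coefficient depends on the units and scale of its predictor. The product GPA $\times$ IQ takes values in the hundreds, so even a coefficient of 0.01 can shift the predicted salary by several thousand dollars (for example, $0.01 \times 4.0 \times 110 = 4.4$ thousand).
To determine whether the interaction term is statistically significant, we need to test the hypothesis $H_0: \beta_4 = 0$, that is, look at the standard error of $\hat\beta_4$ and the p-value of the associated t-statistic.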
## Problem 3 - Training and testing RSS
This problem is about comparing linear (restrictive) and nonlinear (flexible) models using the training and testing residual sum of squares.
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y=\beta_0+\beta_1X+\beta_2X^2+\beta_3X^3+\epsilon$.
* Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0+\beta_1X+\epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
__Answer:__ We would expect the training RSS of the cubic regression to be lower than (or at worst equal to) that of the linear regression. The linear model is a special case of the cubic model (with $\beta_2 = \beta_3 = 0$), so least squares on the cubic model can always match the linear fit, and it will usually improve on it by following the noise in the training data.
This holds even though the true relationship is linear: the extra terms add no real signal, they merely fit noise.
* Answer the same question as above, but using test rather than training RSS.
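__Answer:__ We would expect the test RSS to be lower for the linear regression. Because the true relationship is linear, the cubic model has no bias advantage, and its extra flexibility only overfits the training noise, which adds variance and tends to increase the error on new data. For any one particular test set the ordering could still go either way by chance.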
* Suppose that the true relationship between $X$ and $Y$ is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
__Answer:__ The cubic regression has a lower training RSS than the linear fit because of its higher flexibility: no matter what the underlying true relationship is, the more flexible model will follow the training points more closely and reduce the training RSS.
* Answer the same question as above, but using test rather than training RSS.
__Answer:__ There is not enough information to tell which test RSS would be lower, because the problem statement says we do not know "how far it is from linear". If the true relationship is close to linear, the linear regression test RSS could be lower than the cubic regression test RSS. If it is far from linear, the cubic regression test RSS could be lower. This is due to the bias-variance tradeoff: it is not clear in advance which level of flexibility will fit new data better.
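The training-versus-test behaviour discussed in this problem can be illustrated with a small simulation. This is a sketch under assumed settings, not part of the exam: the true relationship is taken to be linear ($Y = 1 + 2X + \epsilon$ with standard normal noise), and the seed, sample sizes, and helper names `simulate_xy` and `rss` are arbitrary choices.

```{r}
set.seed(514)  # arbitrary seed, chosen only for reproducibility

# Hypothetical data-generating process: the truth is linear, as in the first two parts.
simulate_xy <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n))
}
sim_train <- simulate_xy(100)
sim_test  <- simulate_xy(100)

fit_linear <- lm(y ~ x, data = sim_train)
fit_cubic  <- lm(y ~ poly(x, 3), data = sim_train)

# Residual sum of squares of a fitted model on a given data set.
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

rbind(
  train = c(linear = rss(fit_linear, sim_train), cubic = rss(fit_cubic, sim_train)),
  test  = c(linear = rss(fit_linear, sim_test),  cubic = rss(fit_cubic, sim_test))
)
```

The training RSS of the cubic fit can never exceed that of the linear fit, because the models are nested. The ordering of the two test RSS values depends on the simulated sample, so it is worth rerunning with different seeds before drawing conclusions.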
## Problem 4 - Exploring predictors
This question involves the `Carseats` data set (from the library `ISLR`, which we've used several times in lectures/homework recently).
Run these commands to load the dataset.
```{r, warning=FALSE, message=FALSE}
require(ISLR)
data(Carseats)
library(dplyr)
```
Before you answer the following, make sure that the missing values have been removed from the data.
```{r}
# total number of missing values across all columns (0 means nothing to remove)
sum(is.na(Carseats))
```
(a) Which of the predictors are quantitative (numeric variables), and which are qualitative (categorical variables)?
```{r}
s=summary(Carseats)
s
```
__Answer:__ Qualitative Predictors : `ShelveLoc`, `Urban`, `US`.
Quantitative Predictors : `Sales`, `CompPrice`, `Income`, `Advertising`, `Population`, `Price`, `Age`, `Education`
(b) What is the range of each quantitative predictor? You can answer this using the `range()` function.
```{r}
sale=range(Carseats$Sales)
comp=range(Carseats$CompPrice)
inc=range(Carseats$Income)
price=range(Carseats$Price)
pop=range(Carseats$Population)
adv=range(Carseats$Advertising)
```
__Answer:__
Range of Quantitative Predictors:

* Sales = `r sale`
* CompPrice = `r comp`
* Income = `r inc`
* Price = `r price`
* Population = `r pop`
* Advertising = `r adv`
(c) What is the mean and standard deviation of each quantitative predictor?
```{r}
sale.sd = sd(Carseats$Sales)
comp.sd = sd(Carseats$CompPrice)
inc.sd = sd(Carseats$Income)
price.sd = sd(Carseats$Price)
pop.sd = sd(Carseats$Population)
adv.sd = sd(Carseats$Advertising)
```
__Answer:__

* Sales - `r s[4]`, Standard deviation: `r sale.sd`
* CompPrice - `r s[10]`, Standard deviation: `r comp.sd`
* Income - `r s[16]`, Standard deviation: `r inc.sd`
* Price - `r s[34]`, Standard deviation: `r price.sd`
* Population - `r s[28]`, Standard deviation: `r pop.sd`
* Advertising - `r s[22]`, Standard deviation: `r adv.sd`
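The per-variable calls in parts (b) and (c) can also be collapsed into a single pass over every numeric column, which avoids indexing into the `summary()` table by position and covers all numeric variables. The helper name `numeric_summary` is hypothetical, and the `requireNamespace()` guard is only there so the chunk does nothing when `ISLR` is not installed.

```{r}
# Hypothetical helper: range, mean and standard deviation of every numeric column.
numeric_summary <- function(df) {
  num_cols <- df[sapply(df, is.numeric)]
  t(sapply(num_cols, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

# Applied to Carseats when the ISLR package is available.
if (requireNamespace("ISLR", quietly = TRUE)) {
  round(numeric_summary(ISLR::Carseats), 2)
}
```

The same function can be applied to any subset of rows, for example a data frame with some observations removed.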
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
```{r}
# negative indices drop rows 10 through 85 and keep everything else
SubSet = Carseats[-(10:85),]
Sum= summary(SubSet)
Sum
```
```{r}
sub_sale=range(SubSet$Sales)
sub_comp=range(SubSet$CompPrice)
sub_inc=range(SubSet$Income)
sub_price=range(SubSet$Price)
sub_pop=range(SubSet$Population)
sub_adv=range(SubSet$Advertising)
sub_sale.sd = sd(SubSet$Sales)
sub_comp.sd = sd(SubSet$CompPrice)
sub_inc.sd = sd(SubSet$Income)
sub_price.sd = sd(SubSet$Price)
sub_pop.sd = sd(SubSet$Population)
sub_adv.sd = sd(SubSet$Advertising)
```
__Answer:__
Mean, Standard Deviation and Range of each predictor in the subset:

* Sales - `r Sum[4]`, Standard deviation: `r sub_sale.sd`, Range: `r sub_sale`
* CompPrice - `r Sum[10]`, Standard deviation: `r sub_comp.sd`, Range: `r sub_comp`
* Income - `r Sum[16]`, Standard deviation: `r sub_inc.sd`, Range: `r sub_inc`
* Price - `r Sum[34]`, Standard deviation: `r sub_price.sd`, Range: `r sub_price`
* Population - `r Sum[28]`, Standard deviation: `r sub_pop.sd`, Range: `r sub_pop`
* Advertising - `r Sum[22]`, Standard deviation: `r sub_adv.sd`, Range: `r sub_adv`
(e) Using the full data set, investigate the predictors *graphically*, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
```{r}
subset_1= Carseats %>% select(1:6, 8, 9)
#summary(subset_1)
pairs(subset_1)
#cor(subset_1)
```
__Answer:__ From the above plots we can observe a roughly linear, decreasing relation between `Sales` and `Price`, as well as a roughly linear, increasing relation between `CompPrice` and `Price`. The remaining pairs show no clear pattern.
(f) Suppose that we wish to predict sales of car seats (in each location) (that is, the random variable `Sales`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `Sales`? Justify your answer.
__Answer:__ Besides the location variables `Urban` and `US` (which are qualitative and not shown in the scatterplot matrix), `Price` can be useful to predict `Sales`, because the plots show a roughly linear relationship between `Price` and `Sales`.
## Problem 5 - Splitting the sample into training and testing
```{r}
library(psych)
library(car)
```
For this problem, continue using the `Carseats` data set.
a) Construct a scatterplot of `Sales` vs `Price`.
```{r}
scatterplot(Sales ~ Price, data = Carseats,
            pch = 19,
            main = "Scatterplot of Sales of Car Seats vs. Price",
            xlab = "Price",
            ylab = "Sales")
```
__Answer:__ type your answer here.
b) Use the `sample()` command to construct `train`, a vector of observation indexes to be used for the purpose of training your model. This will partition the data set into the *training* set and the *testing* set.
- Describe what the `sample()` function as used above actually does.
```{r}
## 75% of the sample size
smp_size <- floor(0.75 * nrow(Carseats))
dim(Carseats)
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(Carseats)), size = smp_size)
train <- Carseats[train_ind, ]
test <- Carseats[-train_ind, ]
dim(train)
dim(test)
```
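__Answer:__ The `sample()` function draws `smp_size` row indices at random from `1` to `nrow(Carseats)`, without replacement, so no observation is picked twice. With 400 observations and a 75% split, 300 rows go into the training set and the remaining 100 rows form the testing set.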
c) Validate the partition you obtained for the data. Do you see any issues?
* *Hint*: this problem is *not* asking you to balance the training data set. It is instead asking you to determine *whether* balancing might be required. To determine that, you should use a hypothesis test! Which test? In `R`, there is *one function* you need to call to get the output for the comparison of two samples. Explain the answer; justify your conclusion.
```{r}
#table(Carseats)
```
__Answer:__ type your answer here.
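One possible reading of the hint is a two-sample t-test via `t.test()`, comparing a variable such as `Sales` between the training rows and the testing rows. The sketch below is only an illustration of that idea, not the submitted answer: the helper name `compare_partition` is hypothetical, and the choice of `Sales` as the variable to compare is an assumption.

```{r}
# Hypothetical helper: Welch two-sample t-test comparing one numeric variable
# between the training rows and the remaining (testing) rows of a data frame.
compare_partition <- function(df, train_rows, variable) {
  t.test(df[train_rows, variable], df[-train_rows, variable])
}

if (requireNamespace("ISLR", quietly = TRUE)) {
  set.seed(123)  # same seed and 75% split as in part b)
  carseats_rows <- sample(seq_len(nrow(ISLR::Carseats)),
                          size = floor(0.75 * nrow(ISLR::Carseats)))
  compare_partition(ISLR::Carseats, carseats_rows, "Sales")
}
```

A large p-value would mean there is no evidence that the two subsets differ in mean `Sales`, so no balancing would be indicated for that variable. A small p-value would flag a partition worth a closer look.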
## Problem 6 - fitting a multiple linear regression model
This question should be answered using the `Carseats` data set.
(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.
```{r}
attach(Carseats)
lm.fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
```
__Answer:__ The function `lm` is used to fit a multiple regression model that predicts `Sales` using `Price`, `Urban`, and `US`.
(b) Provide an interpretation of each coefficient in the model. Be
careful—some of the variables in the model are qualitative!
__Answer:__

* `Price`: The linear regression suggests a relationship between price and sales given the low p-value of the t-statistic. The coefficient states a negative relationship between `Price` and `Sales`: holding `Urban` and `US` fixed, each one-dollar increase in price is associated with a drop in sales of about 0.05 thousand units.
* `UrbanYes`: The linear regression suggests that there isn't a relationship between whether the store is in an urban location and the number of sales, based on the high p-value of the t-statistic.
* `USYes`: The linear regression suggests there is a relationship between whether the store is in the US or not and the amount of sales. The coefficient states a positive relationship between `USYes` and `Sales`: if the store is in the US, sales are higher by approximately 1201 units on average, holding the other predictors fixed.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
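__Answer:__ `Sales` = 13.04 - 0.05 `Price` - 0.02 `UrbanYes` + 1.20 `USYes`, where `UrbanYes` is 1 if the store is in an urban location and 0 otherwise, and `USYes` is 1 if the store is in the US and 0 otherwise.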
(d) For which of the predictors can you reject the null hypothesis $H_0:\beta_j=0$?
__Answer:__ `Price` and `USYes`, based on the p-values of their individual t-statistics. The p-value for `UrbanYes` is high, so its null hypothesis cannot be rejected.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
```{r}
lm.fit2 = lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
```
__Answer:__ The variable `Urban` is dropped because its coefficient is close to zero and its p-value is greater than `0.05`.
(f) How well do the models in (a) and (e) fit the data?
__Answer:__ Based on the residual standard error and $R^2$ of the two regressions, they fit the data about equally well, with the smaller model from (e) having a slightly lower residual standard error.
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
```{r}
confint(lm.fit2)
```
__Answer:__ The output of `confint(lm.fit2)` above gives the 95% confidence interval for each coefficient of the model from (e).
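To show where these intervals come from, the sketch below rebuilds them as estimate plus or minus a t quantile times the standard error and compares the result with `confint()`. The helper name `manual_confint` is hypothetical, and the illustration uses the built-in `mtcars` data so that it runs on its own. The same function could be applied to the fitted model from (e).

```{r}
# Hypothetical helper: t-based confidence intervals computed from the coefficient table.
manual_confint <- function(fit, level = 0.95) {
  coefs  <- summary(fit)$coefficients
  t_crit <- qt(1 - (1 - level) / 2, df = df.residual(fit))
  cbind(
    lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
    upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
  )
}

# Illustration on the built-in mtcars data; the two outputs should agree.
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
manual_confint(demo_fit)
confint(demo_fit)
```

An interval that excludes 0 corresponds to rejecting $H_0:\beta_j=0$ at the 5% level, which ties this part back to part (d).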
# Hints & shortcuts
## Exploring a new data set
One of the following may be helpful as you explore the data set:
```
View(Carseats)
help(Carseats)
str(Carseats)
```
## Loops and selections from data frames
You may want to consider some of the following functions or commands as you write code to solve the exam.
```{r}
tmp_data_set <- mtcars
tmp_col <- tmp_data_set[,1]
tmp_rows <- tmp_data_set[c(1,2,3,5),]
sapply(tmp_data_set[,1:7], max) # applies a function (in this case, `max`) to all of the indicated columns of the data frame
```
--------
# End of final exam, congratulations!