---
title: "Final exam" 
author: "Utkarsha Vidhale"
date: "Nov 27th, 2020." 
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: yeti
    highlight: pygments
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
   echo = TRUE, 
   fig.align = 'center' , 
   out.width="50%", 
   fig.height = 5, 
   fig.width = 7
)
```


#  Instructions

*Some standard writing considerations:*  Replace comments that instruct you to put code with your own code. Ensure your plots and output are visible and readable. Ensure you've typed up an explanation of your answers wherever required. 

* *Format*: delete **all** comments and replace with your answers and code.
* *Name*: do not forget to put your name on the exam under the 'author' heading.
* *Submission*: your submission must consist of your copy of this Markdown document *and* a knitted `html` file. Any other type of submission will receive no credit and no opportunity for a re-submission.
* *Issues*: In case of extreme / unusual difficulties related to the pandemic or health otherwise, please contact the instructor as soon as possible to try to reschedule the interrupted exam or otherwise resolve the issue. Late submissions are not accepted otherwise. 
* This is a 2-hour exam. You may take about *3 hours* to complete it. These three hours need to be *consecutive*, but can be chosen at your convenience between 14:00hrs Central Time on Friday 27th November and *two hours prior to your scheduled* oral exam time. 
* Exam must be submitted on Google classroom. 
* You must sign up for the virtual meeting via the link posted on Campuswire post #204.

  
## Honor code

> I pledge on this day, 12/4/2020, not to use any book, website, or other online source to solve the midterm problems. 
I will not use any additional technology other than my local installation of R (Python if applicable) and RStudio (or similar) that I'm using to knit Markdown documents. 
I will not give or receive information to or from any other persons during this midterm. 
This document was edited and PDF knitted  by me alone.

*UTKARSHA VIDHALE*

   *  Exam download time: *10:45 AM*

   *  Exam finish time: *12:50 PM*


--------

## Grading and rubric

Here is how you will be graded on the final exam:

* 100 points total, broken up as follows:
* 80 points for complete solutions to the given problems submitted via google classroom;
  * Problems 1-3: 30 points (10 each); 
  * Problems 4-5: 30 points (15 each); 
  * Problem 6: 20 points;
* 20 points for the virtual exam "in-person" focusing on one of those problems (that I will choose at random):
  * 5 points for knowledge of problem: being able to complete the solution "live";
  * 5 points for completeness: writing up the details so that the statistics behind the solution makes complete sense;
  * 5 points for clarity of exposition and oral explanation;
  * 5 points for being able to answer a related follow-up question (e.g., a more in-depth explanation of a definition or a concept that the problem is about).

----

# Exam problems 


## Problem 1 - Hypothesis tests for regression coefficients

This question is about the `Advertising` data set which we've seen extensively in the lectures on regression. 

Consider the following table output of the `lm` function: 

|           | Coefficient | Std. error | t-statistic | p-value  |
| --------- | ----------- | ---------- | ----------- | -------- |
| Intercept | 2.939       | 0.3119     | 9.42        | <0.0001  |
| TV        | 0.046       | 0.0014     | 32.81       | <0.0001  |
| radio     | 0.189       | 0.0086     | 21.89       | <0.0001  |
| newspaper | -0.001      | 0.0059     | -0.18       | 0.8599   |

*Table: For the `Advertising` data, least squares coefficient estimates of the multiple linear regression of number of units sold on `radio`, `TV`, and `newspaper` advertising budgets.*


Describe the null hypotheses to which the p-values given in Table above correspond. 

  __Answer:__ 
  
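  Each p-value corresponds to the null hypothesis that the coefficient of that term is 0, given that the other predictors remain in the model. For example, the null hypothesis for `TV` is that `TV` advertising has no association with `sales` once `radio` and `newspaper` are accounted for. The p-value for the intercept tests whether the intercept is 0.
  
  If all three predictor null hypotheses held, the model would reduce to: `Sales` = `TV`$*$ 0 + `Radio`$*$ 0 + `Newspaper`$*$ 0 + `Intercept`
  
  The p-value for TV, p < 0.0001, shows that we can reject the null hypothesis and conclude that TV has an impact on Sales, when holding Radio and Newspaper advertising constant. 
  
  The p-value for Radio, p < 0.0001, shows that we can reject the null hypothesis and conclude that Radio has an impact on Sales, when holding TV and Newspaper advertising constant. 
  
  The p-value for Newspaper, p = 0.8599, shows that we cannot reject the null hypothesis: there is no evidence that Newspaper has an impact on Sales, when holding TV and Radio advertising constant.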
  
  
  


Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of `sales`, `TV`, `radio`, and `newspaper`, rather than in terms of the coefficients of the linear model.

  __Answer:__  
  
  The p-values of `TV` and `radio` are both `<0.0001`, far below the usual `0.05` threshold. This is enough evidence to reject the null hypothesis for `TV` and for `radio`. 
  
  That is, when `radio` and `newspaper` advertising are held constant, spending more on `TV` advertising is associated with a change in `sales`. Likewise, when `TV` and `newspaper` advertising are held constant, spending more on `radio` advertising is associated with a change in `sales`.
  
  The p-value for `newspaper` is `0.8599`, well above `0.05`. There is not enough evidence to reject the null hypothesis for `newspaper`.
  That is, when `radio` and `TV` advertising are held constant, the data show no evidence that `newspaper` advertising has an impact on `sales`.
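
As a rough check of the table, each t-statistic is the coefficient divided by its standard error, and the two-sided p-value follows from the t distribution. The sketch below assumes the `Advertising` data has 200 observations (so 196 residual degrees of freedom), which the table itself does not state. The results differ slightly from the table because the table entries are rounded.

```r
# Coefficients and standard errors copied from the table above
coefs <- c(Intercept = 2.939, TV = 0.046, radio = 0.189, newspaper = -0.001)
ses   <- c(Intercept = 0.3119, TV = 0.0014, radio = 0.0086, newspaper = 0.0059)

# Assumption: n = 200 observations and 4 estimated coefficients
df_resid <- 200 - 4

# t-statistic and two-sided p-value for H0: coefficient = 0
t_stat <- coefs / ses
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)

round(cbind(t_stat, p_val), 4)
```

Only the `newspaper` row should come out with a p-value above `0.05`, in line with the conclusions above.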


## Problem 2 - a regression puzzle

Suppose we have a data set with five predictors, $X_1 =$GPA, $X_2 =$ IQ, $X_3 =$ Gender (0 for Male and 1 for not Male), $X_4$ = Interaction between GPA and IQ, and $X_5$ = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars).

Suppose we use least squares to fit the model, and get:
$$ \hat\beta_0  = 50, \quad \hat\beta_1 =20, \quad \hat\beta_2 = 0.07, \quad \hat\beta_3 = 35, \quad \hat\beta_4 = 0.01, \quad \hat\beta_5 = -10.$$ 

* Which answer is correct, and why? 
*($\rightarrow$ since we spent little time on interaction terms, make sure you can explain your answer to a) and b), and do your best for c) and d) as much as you are able.)*
   a) For a fixed value of IQ and GPA, males earn more on average than others. 
     
     __Answer:__  Incorrect
     
     The equation for salary: 
     $Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)$  
     For non-males (`Gender`= 1), $Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) - 10GPA$  
     For males (`Gender`= 0), 
     $Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA * IQ)$  
     Subtracting the two equations, the non-male salary minus the male salary is $35 - 10GPA$. This is positive when `GPA` $< 3.5$, zero when `GPA` $= 3.5$, and negative when `GPA` $> 3.5$. Males therefore do not earn more at every GPA: below a GPA of 3.5, non-males earn more on average. 
     
   b) For a fixed value of IQ and GPA, non-males earn more on average than males.
   
  __Answer:__ Incorrect 
     
     The equation for salary: 
     $Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)$  
     For non-males (`Gender`= 1), $Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) - 10GPA$  
     For males (`Gender`= 0), 
     $Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA * IQ)$  
     The difference (non-male minus male) is $35 - 10GPA$, which is positive only when `GPA` $< 3.5$. Non-males therefore do not earn more at every GPA: above a GPA of 3.5, males earn more on average. 
     
  c) For a fixed value of IQ and GPA, males earn more on average than non-males provided that the GPA is high enough.
   
  __Answer:__ Correct
     
     The equation for salary: 
     $Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)$  
     For non-males (`Gender`= 1), $Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) - 10GPA$  
     For males (`Gender`= 0), 
     $Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA * IQ)$  
     The difference (non-male minus male) is $35 - 10GPA$, which is negative when `GPA` $> 3.5$. So males earn more on average than non-males once the GPA exceeds 3.5, and the two salaries are equal at a GPA of exactly 3.5. 
     
     
   d) For a fixed value of IQ and GPA, non-males earn more on average than males provided that the GPA is high enough.

  __Answer:__ Incorrect
     
     The equation for salary: 
     $Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)$  
     For non-males (`Gender`= 1), $Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) - 10GPA$  
     For males (`Gender`= 0), 
     $Salary = 50 + 20GPA + 0.07IQ + 0.01(GPA * IQ)$  
     The difference (non-male minus male) is $35 - 10GPA$, which shrinks as the GPA grows. Non-males earn more only when `GPA` $< 3.5$, that is, when the GPA is low enough rather than high enough. 
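
The comparison in a) to d) can be checked numerically. The sketch below codes the fitted equation as a function (`predict_salary` is a hypothetical helper name, and the GPA grid is chosen only for illustration) and evaluates the non-male minus male gap, which should equal $35 - 10GPA$ at every GPA.

```r
# Predicted starting salary (in thousands of dollars) from the fitted coefficients
predict_salary <- function(gpa, iq, gender) {
  50 + 20 * gpa + 0.07 * iq + 35 * gender + 0.01 * gpa * iq - 10 * gpa * gender
}

gpa <- c(2.5, 3.0, 3.5, 4.0)  # illustrative GPA values
iq  <- 110                    # any fixed IQ gives the same gap: the IQ terms cancel

# Non-male (gender = 1) minus male (gender = 0) salary at each GPA
gap <- predict_salary(gpa, iq, 1) - predict_salary(gpa, iq, 0)
data.frame(gpa, gap)
```

The gap changes sign at a GPA of 3.5, which is why only statement c) holds.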
     
  

*  Predict the salary of a non-male with IQ of 110 and a GPA of 4.0.

  __Answer:__  Salary of a non-male with IQ of 110 and a GPA of 4.0 = 137.1 thousand dollars (about 137,100 dollars)
  
     The equation for salary: 
     Salary = 50 + 20GPA + 0.07IQ + 35Gender + 0.01(GPA * IQ) - 10(GPA * Gender)  
     For `Gender` = non-male (1), `IQ` = 110, `GPA` = 4.0,    
     Salary = 50 + 20GPA + 0.07IQ + 35 + 0.01(GPA * IQ) - 10(GPA)
     Salary = 50 + 20(4.0) + 0.07(110) + 35 + 0.01(4.0 * 110) - 10(4.0)

```{r}
GPA = 4.0
IQ = 110
# Gender = 1 (non-male): add the 35 main effect and the -10 * GPA interaction
Salary = 50 + (20 * GPA) + (0.07 * IQ) + 35 + (0.01 * GPA * IQ) - (10 * GPA)
Salary
```
    
    salary of a non-male with IQ of 110 and a GPA of 4.0 = `r round(Salary,1)` thousand dollars

*  True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

  __Answer:__ False. 
  
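  The size of a coefficient depends on the units of the variables it multiplies: the product of GPA and IQ takes large values, so even a coefficient of 0.01 can shift the predicted salary noticeably. To decide whether there is evidence of an interaction effect, we need to test the hypothesis $H_0: \beta_4 = 0$, that is, look at the standard error of $\hat\beta_4$ and the p-value of the corresponding t-statistic.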

## Problem 3 - Training and testing RSS  

This problem is about comparing linear (restrictive) and nonlinear (flexible) models using the training and testing residual sum of squares. 

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y=\beta_0+\beta_1X+\beta_2X^2+\beta_3X^3+\epsilon$. 

* Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0+\beta_1X+\epsilon$.  Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

  __Answer:__ The cubic model contains the linear model as a special case (set $\beta_2 = \beta_3 = 0$), so least squares can never give the cubic fit a larger training RSS than the linear fit. With more flexibility, the fitted curve can follow the training observations more closely, including their noise.  
  Even though the true relationship is linear, we therefore expect the training RSS of the cubic regression to be lower than (or at best equal to) that of the linear regression.
  
* Answer the same question as above, but using test rather than training RSS.

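  __Answer:__ The test RSS depends on the particular test data, so this is a statement about what to expect on average. Since the true relationship is linear, the extra terms of the cubic regression fit noise in the training data (overfitting) without reducing bias, so we expect the cubic regression to have a higher test RSS than the linear regression.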
  
* Suppose that the true relationship between $X$ and $Y$ is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

  __Answer:__  The cubic regression has a lower training RSS than the linear fit because of its higher flexibility: no matter what the true underlying relationship is, the more flexible model follows the training points more closely and reduces the training RSS.
  
* Answer the same question as above, but using test rather than training RSS.
 
  __Answer:__ There is not enough information to tell which test RSS would be lower, because the problem states that we don't know "how far it is from linear". If the true relationship is close to linear, the linear regression could have the lower test RSS; if it is strongly nonlinear, the cubic regression could have the lower test RSS. This is the bias-variance tradeoff: without knowing the true relationship, it is not clear which level of flexibility will generalize better. 
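
A small simulation can illustrate the first two parts. The sketch below assumes a linear truth $Y = 2 + 3X + \epsilon$ with standard normal noise; the coefficients, the seed, and all object names are made up for illustration and are not part of the exam. The training RSS of the cubic fit can never exceed that of the linear fit, while the test RSS comparison varies from one simulated sample to the next.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Assumption: the true relationship is linear, Y = 2 + 3X + noise
n <- 100
sim_train <- data.frame(x = rnorm(n))
sim_train$y <- 2 + 3 * sim_train$x + rnorm(n)
sim_test <- data.frame(x = rnorm(n))
sim_test$y <- 2 + 3 * sim_test$x + rnorm(n)

# Linear and cubic least squares fits on the training data only
fit_linear <- lm(y ~ x, data = sim_train)
fit_cubic  <- lm(y ~ poly(x, 3, raw = TRUE), data = sim_train)

# Residual sum of squares of a fitted model on a given data set
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

c(train_linear = rss(fit_linear, sim_train), train_cubic = rss(fit_cubic, sim_train))
c(test_linear  = rss(fit_linear, sim_test),  test_cubic  = rss(fit_cubic, sim_test))
```

Rerunning with different seeds shows the training ordering is fixed while the test ordering is not.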
  
  

## Problem 4 - Exploring predictors 

This question involves the `Carseats` data set (from the library `ISLR`, which we've used several times in lectures/homework recently). 

Run these commands to load the dataset. 
```{r, warning=FALSE, message=FALSE}
require(ISLR)
data(Carseats)
library(dplyr)
```

Before you answer the following, make sure that the missing values have been removed from the data.

```{r}
# Count missing values across the whole data frame (0 means nothing to remove)
sum(is.na(Carseats))
```

(a) Which of the predictors are quantitative (discrete random variables), and which are qualitative (continuous random variables)?
```{r}
s=summary(Carseats)
s
```


  __Answer:__ Qualitative Predictors : `ShelveLoc`, `Urban`, `US`.  
              Quantitative Predictors : `CompPrice`, `Income`, `Advertising`, `Population`, `Price`, `Age`, `Education`. (`Sales` is also quantitative, but it is the response rather than a predictor.)
  
(b) What is the range of each quantitative predictor? You can answer this using the `range()` function.


```{r}
sale=range(Carseats$Sales)
comp=range(Carseats$CompPrice)
inc=range(Carseats$Income)
price=range(Carseats$Price)
pop=range(Carseats$Population)
adv=range(Carseats$Advertising)
```

  __Answer:__ 
              
    Range of Quantitative Predictors:  
    Sales = `r sale`
    CompPrice = `r comp` 
    Income = `r inc`
    Price = `r price`
    Population = `r pop`
    Advertising = `r adv`

(c) What is the mean and standard deviation of each quantitative predictor?
```{r}
sale.sd = sd(Carseats$Sales)
comp.sd = sd(Carseats$CompPrice)
inc.sd = sd(Carseats$Income)
price.sd = sd(Carseats$Price)
pop.sd = sd(Carseats$Population)
adv.sd = sd(Carseats$Advertising)
```


  __Answer:__ 
  
    Sales         - `r s[4]`  Standard deviation:`r sale.sd`
    CompPrice     - `r s[10]` Standard deviation: `r comp.sd` 
    Income        - `r s[16]` Standard deviation: `r inc.sd`
    Price         - `r s[34]` Standard deviation: `r price.sd`
    Population    - `r s[28]` Standard deviation: `r pop.sd`
    Advertising   - `r s[22]` Standard deviation: `r adv.sd`
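
The per-column calls in (b) and (c) can also be written once for every numeric column. This is a minimal sketch: `summarise_numeric` is a hypothetical helper name, and the built-in `mtcars` data is used as a stand-in (as in the hints at the end of this document) so the block runs without extra packages.

```r
# Hypothetical helper: min, max, mean and sd for every numeric column of a data frame
summarise_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]   # drop factor columns such as ShelveLoc
  t(sapply(num, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  }))
}

# Stand-in data; with ISLR loaded, call summarise_numeric(Carseats) instead
num_summary <- summarise_numeric(mtcars)
round(num_summary, 2)
```

Each row of the result is one variable, so no numeric column is left out by accident.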

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
```{r}
# Negative indexes drop rows 10 through 85 and keep the rest
SubSet = Carseats[-c(10:85),]
Sum= summary(SubSet)
Sum
```
```{r}
sub_sale=range(SubSet$Sales)
sub_comp=range(SubSet$CompPrice)
sub_inc=range(SubSet$Income)
sub_price=range(SubSet$Price)
sub_pop=range(SubSet$Population)
sub_adv=range(SubSet$Advertising)

sub_sale.sd = sd(SubSet$Sales)
sub_comp.sd = sd(SubSet$CompPrice)
sub_inc.sd = sd(SubSet$Income)
sub_price.sd = sd(SubSet$Price)
sub_pop.sd = sd(SubSet$Population)
sub_adv.sd = sd(SubSet$Advertising)

```


  __Answer:__ 
    
    Mean, Standard Deviation and Range of each predictor in the subset:  
    Sales         - `r Sum[4]`, Standard deviation:`r sub_sale.sd`, Range: `r sub_sale`
    CompPrice     - `r Sum[10]`, Standard deviation: `r sub_comp.sd`, Range: `r sub_comp`
    Income        - `r Sum[16]`, Standard deviation: `r sub_inc.sd`, Range: `r sub_inc`
    Price         - `r Sum[34]`, Standard deviation: `r sub_price.sd`, Range: `r sub_price`
    Population    - `r Sum[28]`, Standard deviation: `r sub_pop.sd`, Range: `r sub_pop`
    Advertising   - `r Sum[22]`, Standard deviation: `r sub_adv.sd`, Range: `r sub_adv`
    
    
    
  

(e) Using the full data set, investigate the predictors *graphically*, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
```{r}
subset_1= Carseats %>% select(1:6, 8, 9)
#summary(subset_1)
pairs(subset_1)
#cor(subset_1)
```



  __Answer:__ From the above plots we can observe an approximately linear relationship between `Sales` and `Price` (sales tend to fall as price rises), as well as between `CompPrice` and `Price`.

(f) Suppose that we wish to predict sales of car seats (in each location)  (that is, the random variable `Sales`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `Sales`? Justify your answer


  __Answer:__ Other than `Urban` and `US` (variables related to location), `Price` can be useful to predict `Sales`. It is because some linear relationship can be seen between `Price` and `Sales` in the plots.   





## Problem 5 - Splitting the sample into training and testing

```{r}
library(psych)
library(car)
```


For this problem, continue using the `Carseats` data set.  

a) Construct a scatterplot of `Sales` vs `Price`. 
```{r}
scatterplot(Sales ~ Price, data = Carseats,
            pch = 19,
            main = "Scatterplot of Sales of Car Seats vs. Price",
            xlab = "Price",
            ylab = "Sales")
```


  __Answer:__ type your answer here. 

b) Use the `sample()` command to construct `train`, a vector of observation indexes to be used for the purpose of training your model. This will partition the data set into the *training* set and the *testing* set. 
   - Describe what the `sample()` function as used above actually does.

```{r}
## 75% of the sample size
smp_size <- floor(0.75 * nrow(Carseats))
dim(Carseats)
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(Carseats)), size = smp_size)

train <- Carseats[train_ind, ]
test <- Carseats[-train_ind, ]
dim(train)
dim(test)
```


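  __Answer:__ `sample(seq_len(nrow(Carseats)), size = smp_size)` draws `smp_size` row indexes (75% of the rows) at random from `1` to `nrow(Carseats)`. Sampling is done without replacement, so no row is selected twice. The selected rows form the training set and the remaining rows form the testing set; `set.seed(123)` makes the draw reproducible. 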
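
The behaviour of `sample()` is easier to see on a toy example. The sketch below uses ten made-up row numbers instead of the `Carseats` rows; the object names are chosen for illustration only.

```r
set.seed(123)  # fixes the random draw so the result is reproducible

idx <- seq_len(10)                  # toy "row numbers" 1..10
train_idx <- sample(idx, size = 7)  # 7 indexes drawn without replacement

train_idx                           # 7 distinct indexes in random order
setdiff(idx, train_idx)             # the 3 indexes left over for testing
anyDuplicated(train_idx) == 0       # TRUE: no index is drawn twice
```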
  
c) Validate the partition you obtained for the data. Do you see any issues? 
   * *Hint*: this problem is *not* asking you to balance the training data set. It is instead asking you to determine *whether* balancing might be required. To determine that, you should use a hypothesis test! Which test? In `R`, there is *one function* you need to call to get the output for the comparison of two samples. Explain the answer; justify your conclusion. 


```{r}
#table(Carseats)
```


  __Answer:__ type your answer here. 
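
One possible way to approach the hint, sketched under assumptions: compare the response in the two parts of the partition with a two-sample t-test via `t.test()`. A large p-value gives no evidence that the training and testing means differ. Whether `t.test()` is the function the hint intends is an assumption. The sketch uses the built-in `mtcars` data and its `mpg` column as stand-ins so that it runs without extra packages; for this exam the analogous call would compare `train$Sales` with `test$Sales`.

```r
set.seed(123)  # reproducible partition

dat <- mtcars                                # stand-in for Carseats
n_train <- floor(0.75 * nrow(dat))
train_rows <- sample(seq_len(nrow(dat)), size = n_train)

part_train <- dat[train_rows, ]
part_test  <- dat[-train_rows, ]

# Welch two-sample t-test on the response: H0 is equal means in both parts
split_check <- t.test(part_train$mpg, part_test$mpg)
split_check$p.value
```

The same comparison can be repeated for other numeric variables if the partition should be checked on more than the response.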


## Problem 6 - fitting a multiple linear regression model

This question should be answered using the `Carseats` data set.

(a) Fit a multiple regression model to predict `Sales` using `Price`,
`Urban`, and `US.`
```{r}
attach(Carseats)
lm.fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
```


  __Answer:__ The function `lm` is used to fit a multiple regression model that predicts `Sales` using `Price`, `Urban`, and `US`.

(b) Provide an interpretation of each coefficient in the model. Be
careful—some of the variables in the model are qualitative!


  __Answer:__ 



     Price 
     The low p-value of the t-statistic suggests a relationship between `Price` and `Sales`. The coefficient is negative: holding `Urban` and `US` fixed, each one-unit increase in `Price` is associated with a decrease in `Sales` of about 0.05 (in the units of `Sales`).
     
     UrbanYes 
    The high p-value of the t-statistic gives no evidence of a relationship between whether the store is in an urban location and `Sales`, once `Price` and `US` are accounted for.
    
    USYes 
    The low p-value suggests a relationship between whether the store is in the US and `Sales`. The coefficient is positive: holding `Price` and `Urban` fixed, a store in the US sells approximately 1201 more units on average than a store outside the US. 

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.


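  __Answer:__ `Sales` = 13.04 - 0.05 `Price` - 0.02 `UrbanYes` + 1.20 `USYes`, where `UrbanYes` = 1 if the store is in an urban location (0 otherwise) and `USYes` = 1 if the store is in the US (0 otherwise).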

(d) For which of the predictors can you reject the null hypothesis $H_0:\beta_j=0$? 


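  __Answer:__ `Price` and `USYes`: their individual t-tests have p-values below `0.05`, so $H_0:\beta_j=0$ is rejected for both. The p-value for `UrbanYes` is high, so its null hypothesis is not rejected. (The F-statistic tests all coefficients jointly and does not single out individual predictors.)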

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
```{r}
lm.fit2 = lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
```


  __Answer:__ The variable `Urban` is dropped because its coefficient is close to zero and its p-value is greater than `0.05`. 

(f) How well do the models in (a) and (e) fit the data?


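  __Answer:__ Based on the residual standard error and $R^2$ of the two regressions, both models fit the data similarly. Because the model from (e) is nested in the model from (a), its $R^2$ cannot be higher; it fits slightly better only in the sense of a slightly lower residual standard error, since it uses one fewer parameter. 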

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

```{r}
confint(lm.fit2)
```


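  __Answer:__  The 95% confidence intervals for the intercept and for the coefficients of `Price` and `USYes` are the rows of the `confint()` output above. 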
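
For reference, `confint()` on a linear model computes estimate ± t-quantile × standard error. The sketch below reproduces that by hand on a stand-in model fitted to the built-in `mtcars` data (the model formula is chosen only for illustration); the same steps would apply to `lm.fit2`.

```r
# Stand-in model on built-in data; the same steps apply to lm.fit2
fit <- lm(mpg ~ wt + am, data = mtcars)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))  # 97.5% quantile for a 95% interval

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
confint(fit)  # should agree with manual_ci
```

An interval that excludes 0 corresponds to rejecting $H_0:\beta_j=0$ at the 5% level.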


# Hints & shortcuts

## Exploring a new data set 
One of the following may be helpful as you explore the data set: 
```
View(Carseats)
help(Carseats)
str(Carseats)
```

## Loops and selections from data frames

You may want to consider some of the following functions or commands as you write code to solve the exam. 
```{r}
tmp_data_set <- mtcars
tmp_col <- tmp_data_set[,1]
tmp_rows <- tmp_data_set[c(1,2,3,5),]
sapply(tmp_data_set[,1:7], max) # applies a function (in this case, `max`) to all of the indicated columns of the data frame
```


-------- 

# End of final exam, congratulations!