# Conceptual

### 1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model
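- Each null hypothesis states that the corresponding coefficient is 0, i.e. that the given advertising budget has no association with sales once the other two budgets are held fixed (for the intercept: that expected sales are zero when all three budgets are zero). The `p`-values are `<0.0001` (Intercept), `<0.0001` (TV), `<0.0001` (radio) and `0.8599` (newspaper). There is therefore strong evidence to reject the null hypothesis for the intercept, TV and radio, i.e. both TV and radio spending are associated with sales. There is no evidence that `newspaper` spending is associated with sales when TV and radio are in the model: its `p`-value is high because the estimated coefficient is small relative to its estimated standard error.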

### 2. Carefully explain the differences between the KNN classifier and KNN regression methods
- KNN stands for _K Nearest Neighbours_. It is a method that assigns a response to a test point `x` using the `k` training samples nearest to `x` under some metric (e.g. Euclidean distance). The KNN classifier assigns the majority class among those neighbours (a qualitative response), whereas KNN regression assigns the average response of those neighbours (a quantitative response). KNN is non-parametric: it assumes no parametric form for the regression function or the decision boundary. It suffers from the _curse of dimensionality_, so it tends to perform poorly when the number of predictors is high.
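The difference between the two methods can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def k_nearest(X_train, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(distances)[:k]

def knn_classify(X_train, y_train, x, k):
    """KNN classifier: majority class among the k nearest neighbours."""
    idx = k_nearest(X_train, x, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]

def knn_regress(X_train, y_train, x, k):
    """KNN regression: mean response of the k nearest neighbours."""
    idx = k_nearest(X_train, x, k)
    return float(np.mean(y_train[idx]))

# Toy data, invented for illustration only
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
labels = np.array(["a", "a", "b", "b", "b"])      # qualitative response
values = np.array([1.0, 2.0, 3.0, 4.0, 50.0])     # quantitative response

x = np.array([1.2])
knn_class = knn_classify(X_train, labels, x, k=3)
knn_value = knn_regress(X_train, values, x, k=3)
print(knn_class, knn_value)
```

Both functions share the same neighbour search; only the final aggregation step (vote versus mean) differs.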

### 3. Suppose we have a data set with five predictors, $X_1 = GPA, X_2 = IQ, X_3 = Gender$ (1 for Female and 0 for Male), $X_4$ = Interaction between GPA and IQ, and $X_5$ = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat{β}_0 = 50, \hat{β}_1 = 20, \hat{β}_2 = 0.07, \hat{β}_3 = 35, \hat{β}_4 = 0.01, \hat{β}_5 = −10$. 

(a) Which answer is correct, and why?
- iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
- The fitted model is $\hat{Y} = \hat{β}_0 + \hat{β}_1X_1 + \hat{β}_2X_2 + \hat{β}_3X_3 + \hat{β}_4X_1X_2 + \hat{β}_5X_1X_3$. For males, $X_3 = 0 \implies \hat{Y}_{males} = \hat{β}_0 + \hat{β}_1X_1 + \hat{β}_2X_2 + \hat{β}_4X_1X_2$, so the difference in predicted salary (female - male) is $\hat{β}_3 + \hat{β}_5X_1 = 35 - 10 \cdot GPA$. This is negative when $GPA > 3.5$, so above that GPA males earn more on average.
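As a quick check of the break-even point, the gap between the female and male predictions can be evaluated at a few GPA values. The sketch below uses only $\hat{β}_3 = 35$ and $\hat{β}_5 = -10$ from the question; the function name `female_minus_male` is chosen here for illustration.

```python
# Fitted coefficients as given in the question
b3, b5 = 35.0, -10.0

def female_minus_male(gpa):
    """Predicted salary gap (female - male) in thousands of dollars, at fixed IQ and GPA."""
    return b3 + b5 * gpa

break_even_gpa = -b3 / b5  # GPA at which the predicted gap is zero
gaps = {gpa: female_minus_male(gpa) for gpa in (2.0, 3.0, 3.5, 4.0)}
print(break_even_gpa, gaps)
```

A positive gap means females are predicted to earn more; the sign flips once GPA exceeds the break-even value.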

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

```python
Y = 50 + 20 * 4 + 0.07 * 110 + 35 * 1 + 0.01 * (4 * 110) - 10 * (4 * 1)
print(Y, "k$")
```

```
137.1 k$
```


(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
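- False. The magnitude of a coefficient depends on the units of its predictor (here the product GPA × IQ takes large values), so it says nothing by itself about the strength of evidence. To judge that we need the estimated standard error of $\hat{\beta}_4$ and the resulting p-value, which we are not given. A small coefficient can still be statistically significant if its standard error is smaller still, which would constitute evidence against the null $H_0: \beta_4 = 0$.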
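The point about units can be illustrated with simulated data: multiplying a predictor by 1000 shrinks its coefficient by a factor of 1000 but leaves its t-statistic unchanged. This is a sketch on made-up data; `ols_t_stats` is a helper written for this example, using the standard least squares formulas.

```python
import numpy as np

def ols_t_stats(X, y):
    """Least squares coefficients and t-statistics for design matrix X (intercept column included)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - p)     # estimate of the error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # estimated covariance of the coefficients
    return beta, beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)    # simulated data

X_small = np.column_stack([np.ones(n), x])           # predictor in its original units
X_large = np.column_stack([np.ones(n), 1000.0 * x])  # same predictor, values 1000 times larger

beta_small, t_small = ols_t_stats(X_small, y)
beta_large, t_large = ols_t_stats(X_large, y)
print(beta_small[1], beta_large[1])  # slope shrinks by a factor of 1000
print(t_small[1], t_large[1])        # t-statistic is the same
```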

### 4. I collect a set of data (`n = 100` observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + ϵ$.

(a) Suppose that the true relationship between $X$ and $Y$ is linear, i.e. $Y = β_0 + β_1X + ϵ$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
- We'd expect the training RSS for the cubic regression to be lower than (or, at worst, equal to) the training RSS for the simple linear regression. This is because the cubic model is more flexible and contains the linear model as the special case $\beta_2 = 0, \beta_3 = 0$. The cubic coefficients are found by least squares (i.e. by minimising the sum of squared residuals) over a larger set of candidate fits, so by design they cannot result in a higher training RSS.
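The nesting argument can be checked numerically. The sketch below simulates a truly linear relationship (the coefficients and noise level are arbitrary choices for illustration) and fits both models with `numpy.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)  # simulated data with a linear truth

def training_rss(degree):
    """Training RSS of a least squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, deg=degree)
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

rss_linear = training_rss(1)
rss_cubic = training_rss(3)
print(rss_linear, rss_cubic)
```

Whatever seed is used, the cubic training RSS should not exceed the linear one (up to floating-point error), because the linear fit is one of the candidates the cubic fit optimises over.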

(b) Answer (a) using test rather than training RSS.
- We'd expect the test RSS of the linear regression to be lower than that of the cubic regression. The true underlying relationship is linear, so the intercept and slope capture all the structure there is. The extra terms of the cubic model only fit noise ($\epsilon$) in the training data, which adds variance without reducing bias.

(c) Suppose that the true relationship between $X$ and $Y$ is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
- The training RSS for the cubic model will be lower than for the linear model. This is because it is more flexible and, under least squares, has more freedom to fit the training data and so reduce the RSS. In an extreme case, a polynomial of sufficiently high order could "pass through" all of the training samples, resulting in a training RSS of 0.

(d) Answer (c) using test rather than training RSS.
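- There is not enough information to tell which test RSS will be lower. It depends on how nonlinear the true relationship is: if it is close to linear, the extra variance of the cubic fit may outweigh its lower bias, whereas if it is far from linear, the bias of the linear fit will dominate.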
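A small simulation illustrates the test-set behaviour discussed in (b) and (d). It reports mean squared error on a fresh test set (test RSS divided by the test size), averaged over repeated draws. Everything here is an assumption made for illustration: the two "true" functions, the noise level, and the helper `mean_test_mse`. The cubic truth is deliberately strongly nonlinear; with a truth only slightly off linear, the comparison could go the other way.

```python
import numpy as np

def mean_test_mse(true_f, degree, n_train=100, n_test=1000, reps=300, seed=0):
    """Average test MSE of a degree-`degree` polynomial fit over repeated simulations."""
    rng = np.random.default_rng(seed)  # same seed => same data for every degree
    total = 0.0
    for _ in range(reps):
        x_tr = rng.uniform(-2, 2, size=n_train)
        y_tr = true_f(x_tr) + rng.normal(size=n_train)
        x_te = rng.uniform(-2, 2, size=n_test)
        y_te = true_f(x_te) + rng.normal(size=n_test)
        coefs = np.polyfit(x_tr, y_tr, deg=degree)
        total += np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    return total / reps

def linear_truth(x):
    return 2.0 + 3.0 * x

def cubic_truth(x):
    return 2.0 + 3.0 * x + x ** 3

results = {
    (name, degree): mean_test_mse(f, degree)
    for name, f in (("linear truth", linear_truth), ("cubic truth", cubic_truth))
    for degree in (1, 3)
}
for key, value in results.items():
    print(key, round(value, 4))
```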

### 5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form
$$
\hat{y}_i = x_i \hat{\beta}
$$
where
$$
\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i'=1}^{n} x_{i'}^2}
$$
Show that we can write
$$
\hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'}
$$
What is $a_{i'}$?

Answer:
$$
\hat{y}_i = x_i \hat{\beta} = 
x_i \frac{\sum_{j=1}^{n} x_j y_j}{\sum_{k=1}^{n} x_k^2} = 
\sum_{j=1}^{n} \frac{x_i x_j}{\sum_{k=1}^{n} x_k^2} y_j = 
\sum_{j=1}^{n} a_j y_j
$$
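where (with $j$ playing the role of the index $i'$ in the question, and noting that the weights depend on $i$)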
$$
a_j = \frac{x_i x_j}{\sum_{k=1}^{n} x_k^2}
$$
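The identity says that each fitted value is a linear combination of all the responses. A short numerical check on random data (the data and the matrix name `A` are chosen here for illustration) compares the direct formula $x_i \hat{\beta}$ with the weighted sum:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=8)
y = rng.normal(size=8)

beta_hat = np.sum(x * y) / np.sum(x ** 2)  # least squares slope without an intercept
fitted_direct = x * beta_hat               # y_hat_i = x_i * beta_hat

A = np.outer(x, x) / np.sum(x ** 2)        # A[i, j] = x_i * x_j / sum_k x_k^2
fitted_weighted = A @ y                    # y_hat_i = sum_j A[i, j] * y_j

print(np.allclose(fitted_direct, fitted_weighted))
```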

### 6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar{x}, \bar{y})$
- We know that the simple linear regression line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. From (3.4) we have $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. Substituting $x = \bar{x}$ into the equation of the line we get:
$$
\hat{y}(\bar{x}) = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} \overset{(3.4)}{=} (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 \bar{x} = \bar{y}
$$
which proves that the least squares line passes through $(\bar{x}, \bar{y})$.
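The same fact can be verified numerically using the least squares formulas for the slope and intercept. The data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=50)
y = 1.5 - 0.8 * x + rng.normal(size=50)  # simulated data

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # least squares slope
beta0 = y_bar - beta1 * x_bar                                          # least squares intercept

prediction_at_mean = beta0 + beta1 * x_bar  # value of the fitted line at x = x_bar
print(prediction_at_mean, y_bar)
```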


### 7. It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic (3.17) is equal to the square of the correlation between $X$ and $Y$ (3.18). Prove that this is the case. For simplicity, you may assume that $\bar{x} = \bar{y} = 0$
- For least squares with an intercept, $TSS = \sum (\hat{y}_i - \bar{y})^2 + RSS$, so we know that:
$$
R^2 = 1 - \frac{RSS}{TSS} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \quad \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$
then:
$$
\begin{aligned}
R^2 &= \frac{\sum (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \\
&= \frac{\sum (\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \\
&= \frac{\hat{\beta}_1^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2} \\
&\overset{\text{subst. } \hat{\beta}_1}{=} \frac{\bigl(\sum (x_i - \bar{x}) (y_i - \bar{y})\bigr)^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2} \\
&= \Bigl( \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \Bigr)^2 \\
&= \mathrm{corr}(x, y)^2
\end{aligned}
$$
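As a sanity check on the result, $R^2$ computed as $1 - RSS/TSS$ can be compared with the squared sample correlation on simulated data. The data are invented for this example; `numpy.polyfit` performs the least squares fit and `numpy.corrcoef` gives the correlation.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(size=100)  # simulated data

slope, intercept = np.polyfit(x, y, deg=1)  # simple linear regression by least squares
y_hat = intercept + slope * x

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1.0 - rss / tss

corr = np.corrcoef(x, y)[0, 1]       # sample correlation between x and y
print(r_squared, corr ** 2)
```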