# Chapter 2: Statistical Learning

## Conceptual



**Q1.** Performance of a flexible statistical learning method (F) vs an inflexible method (I)

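- (a) The sample size $n$ is extremely large, the number of predictors $p$ is small: F better than I (with this much data a flexible method can fit $f$ closely without overfitting).
- (b) $n$ small, $p$ large: F worse than I (a flexible method is likely to overfit so few observations).
- (c) Relationship between predictors and response is highly non-linear: F better than I (an inflexible method would have high bias).
- (d) $\sigma^2 = \text{Var}(\epsilon)$ is extremely high: F worse than I (a flexible method would fit the noise, i.e. overfit).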
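
Cases (a)/(c) and (d) can be checked on a toy problem. The sketch below is only an illustration under assumptions that are not part of the exercise: a single predictor on $[0, 1]$, a degree-1 polynomial standing in for the inflexible method, a degree-8 polynomial standing in for the flexible one, and the hypothetical helper `avg_test_mse`, which averages the test MSE over repeated training sets.

```python
import numpy as np

def avg_test_mse(f, n, sigma, degree, n_reps=100, seed=0):
    """Average test MSE of a degree-`degree` polynomial over repeated training sets."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n)   # fixed design on [0, 1] (assumption)
    x_test = np.linspace(0.0, 1.0, 200)
    mses = []
    for _ in range(n_reps):
        # Draw a fresh noisy training set and a fresh noisy test set.
        y_train = f(x_train) + rng.normal(0.0, sigma, n)
        y_test = f(x_test) + rng.normal(0.0, sigma, x_test.size)
        coefs = np.polyfit(x_train, y_train, degree)
        mses.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))
    return float(np.mean(mses))

def nonlinear(x):
    return np.sin(2 * np.pi * x)

def linear(x):
    return 2.0 * x

# Cases (a)/(c): large n, highly non-linear f, low noise.
rigid_ac = avg_test_mse(nonlinear, n=1000, sigma=0.1, degree=1)
flex_ac = avg_test_mse(nonlinear, n=1000, sigma=0.1, degree=8)

# Case (d): very noisy response (here combined with a small n and a linear f).
rigid_d = avg_test_mse(linear, n=20, sigma=2.0, degree=1)
flex_d = avg_test_mse(linear, n=20, sigma=2.0, degree=8)

print(f"(a)/(c) inflexible: {rigid_ac:.3f}, flexible: {flex_ac:.3f}")
print(f"(d)     inflexible: {rigid_d:.3f}, flexible: {flex_d:.3f}")
```

In this setup the flexible fit should have the lower test MSE in the first scenario and the higher one in the second. Case (b) is not covered, since a one-predictor toy cannot represent a large $p$.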



**Q2.** Classification or regression? Inference or prediction? Provide $n$ and $p$.
- (a) Regression. Inference. $n=500, p=3$.
- (b) Classification. Prediction. $n=20, p=13$.
- (c) Regression. Prediction. $n=52$ (number of weeks in 2012), $p=3$ (% change in the US, British and German markets; the % change in USD/Euro is the response).
  


**Q3.** Sketch of the bias, variance, training error, test error and Bayes (irreducible) error curves

$$
E\left(y_0 - \hat{f}(x_0)\right)^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)
$$

![Relation between bias-variance](./fig/chap2-ex3a.png)

- **Training error**: decreases steadily as flexibility increases, because a more flexible model fits the training set more closely.
- **Test error**: U-shaped. It first decreases while the drop in bias dominates, then increases once the model starts to overfit and the variance dominates.
- **Bias**: the more flexible the model, the lower the (squared) bias.
- **Variance**: opposite to bias; it increases with flexibility.
- **Bayes (irreducible) error**: constant, a horizontal line that lower-bounds the expected test error. The flexibility at which the test error curve comes closest to this line is the optimal operating point.
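
The decomposition above can be estimated by simulation. The following is a minimal sketch, not the method used for the figure: it assumes a true function $f(x) = \sin(2\pi x)$, a fixed design of 30 points, noise with $\sigma = 0.3$, and polynomial fits of degree 1, 3 and 7 as the "flexibility" axis. All of these choices, and the helper name `bias2_and_variance`, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Assumed true regression function (not from the exercise).
    return np.sin(2 * np.pi * x)

n, sigma, x0, n_reps = 30, 0.3, 0.25, 500
x_train = np.linspace(0.0, 1.0, n)   # fixed design; only the noise is redrawn

def bias2_and_variance(degree):
    """Monte Carlo estimate of squared bias and variance of f_hat(x0)."""
    preds = np.empty(n_reps)
    for r in range(n_reps):
        y_train = f(x_train) + rng.normal(0.0, sigma, n)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 7)}
for d, (b2, v) in results.items():
    # Expected test MSE at x0 = variance + bias^2 + irreducible error.
    print(f"degree={d}: bias^2={b2:.4f}, variance={v:.4f}, "
          f"expected test MSE={b2 + v + sigma**2:.4f}")
```

Going from degree 1 to degree 7 should trade a large squared bias for a larger variance, while the $\text{Var}(\epsilon) = \sigma^2$ term stays fixed.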

**Q4.** Real-life examples of statistical learning
    
- (a) Classification

    1. Stock market price direction. Prediction. Response: up, down. Inputs: yesterday's price change, the price change two days ago, etc.
    2. Illness classification. Inference. Response: ill, healthy. Inputs: resting heart rate, resting breath rate, mile run time.
    3. Car part replacement. Prediction. Response: needs to be replaced, good. Inputs: age of part, mileage used for, current amperage.

- (b) Regression 

    1. CEO salary. Inference. Response: salary. Predictors: age, industry
    experience, industry, years of education.

    2. Car part replacement. Inference. Response: life of car part. Predictors: age
    of part, mileage used for, current amperage.

    3. Life expectancy. Prediction. Response: age at death. Predictors: current
    age, gender, resting heart rate, resting breath rate, mile run time.

- (c) Cluster analysis

    1. Cancer type clustering. Group tumors to diagnose cancer types more accurately.

    2. Netflix movie recommendations. Recommend movies based on users who have
    watched and rated similar movies.

    3. Marketing survey. Cluster consumers by demographics to see which
    clusters buy which products.

**Q5.** Very flexible vs less flexible

- The advantages of a very flexible approach for regression or classification are a better fit to non-linear relationships and lower bias.
- The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, may follow the noise too closely (overfit), and has higher variance.
- A more flexible approach is preferred to a less flexible one when we are interested in prediction rather than in the interpretability of the results.
- A less flexible approach is preferred to a more flexible one when we are interested in inference and the interpretability of the results.

**Q6.** Parametric vs non-parametric

|              | Parametric methods                                                                        | Non-parametric methods                                                                                       |
|--------------|-------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Definition   | Assume an explicit functional form for $f$ (e.g. linear) and estimate its parameters to obtain $\hat{f}$. | Seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. |
| Advantage    | Simplifies the problem of estimating $f$ because we just need to estimate the parameters. | Avoid assuming a form for $f$, so they have the potential to fit a wide range of shapes of $f$. |
| Disadvantage | The model we choose will usually not match the true form of $f$. | Need a very large number of observations to estimate $f$ accurately. |
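
To make the contrast concrete, the sketch below fits a parametric model (a straight line, two parameters) and a non-parametric one (K-nearest-neighbors regression) to the same simulated data. The data-generating function, the sample sizes, $K = 10$ and the helper names `simulate` and `knn_regress` are all illustrative assumptions, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed non-linear truth; a linear model cannot match this form.
    return np.sin(2 * np.pi * x)

def simulate(n, sigma=0.2):
    x = rng.uniform(0.0, 1.0, n)
    return x, f(x) + rng.normal(0.0, sigma, n)

def knn_regress(x_train, y_train, x_new, k):
    """Predict each new point as the mean response of its k nearest training points."""
    dists = np.abs(x_new[:, None] - x_train[None, :])   # shape (n_new, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

x_train, y_train = simulate(200)
x_test, y_test = simulate(1000)

# Parametric: assume f is linear and estimate slope and intercept only.
slope, intercept = np.polyfit(x_train, y_train, 1)
mse_linear = np.mean((y_test - (slope * x_test + intercept)) ** 2)

# Non-parametric: no functional form assumed, the data determine the shape.
mse_knn = np.mean((y_test - knn_regress(x_train, y_train, x_test, k=10)) ** 2)

print(f"linear test MSE: {mse_linear:.3f}")
print(f"KNN    test MSE: {mse_knn:.3f}")
```

With 200 training points the non-parametric fit should come out ahead here, because the assumed linear form does not match $f$. With far fewer observations the comparison could easily reverse, which is the disadvantage listed in the table.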

**Q7.** K-nearest neighbors



In [7]:
import pandas as pd
import numpy as np

df_dict = {'x1': [0,2,0,0,-1,1],
           'x2': [3,0,1,1,0,1],
           'x3': [0,0,3,2,1,1],
           'y': ['R','R','R','G','G','R']
          }
df = pd.DataFrame(df_dict)
df

|   | x1 | x2 | x3 | y |
|---|----|----|----|---|
| 0 | 0  | 3  | 0  | R |
| 1 | 2  | 0  | 0  | R |
| 2 | 0  | 1  | 3  | R |
| 3 | 0  | 1  | 2  | G |
| 4 | -1 | 0  | 1  | G |
| 5 | 1  | 1  | 1  | R |


- (a) Euclidean distance between each observation and the test point $X_1 = X_2 = X_3 = 0$

In [11]:
dist = (df.values[:,:3] - np.array([0,0,0])) ** 2
dist = dist.sum(axis=1).astype('float')
dist = np.sqrt(dist)
dist

array([ 3.        ,  2.        ,  3.16227766,  2.23606798,  1.41421356,
        1.73205081])

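- (b) $K=1$: From the distances above, the shortest distance is $\sqrt{2} \approx 1.41$, which corresponds to observation 5 (row index 4), a Green point. The prediction is Green.
- (c) $K=3$: the three closest points are observations 5, 6 and 2 (Green, Red, Red), at distances $1.41$, $1.73$ and $2$. Two of the three are Red, so the prediction is Red.
- (d) $K$ needs to be small for a highly non-linear Bayes decision boundary: a small $K$ gives a flexible boundary that can follow it, while a large $K$ averages over more points and produces a smoother, closer-to-linear boundary.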
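
Parts (b) and (c) can be reproduced with a small majority-vote classifier. This is a minimal sketch using the six observations from the exercise; the function name `knn_predict` is hypothetical, and ties in the vote are broken by whichever class is encountered first among the nearest points, which is an arbitrary choice.

```python
import numpy as np
from collections import Counter

# The six training observations (X1, X2, X3) and their labels.
X = np.array([[0, 3, 0],
              [2, 0, 0],
              [0, 1, 3],
              [0, 1, 2],
              [-1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array(['R', 'R', 'R', 'G', 'G', 'R'])

def knn_predict(X, y, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]

x_test = np.zeros(3)                            # test point X1 = X2 = X3 = 0
pred_k1 = knn_predict(X, y, x_test, k=1)
pred_k3 = knn_predict(X, y, x_test, k=3)
print(pred_k1, pred_k3)  # G R
```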