# Summary


## What is statistical learning

- General equation: $Y =f(X)+\epsilon$
    - $X$: Input variables, also called predictors, independent variables, features
    - $Y$: Response, dependent variable
    - $\epsilon$: Random error term (independent of $X$, with $mean(\epsilon) = 0$)
    

In essence, statistical learning refers to a set of approaches for estimating $f$.
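To make the general equation concrete, here is a minimal sketch that simulates data from $Y = f(X) + \epsilon$. The choice of $f$ (a sine curve), the noise level, and the names `true_f` and `simulate` are assumptions made purely for illustration; in a real problem $f$ is unknown.

```python
import math
import random


def true_f(x):
    # Hypothetical "true" function; in practice this is what we try to estimate.
    return math.sin(2 * x)


def simulate(n, noise_sd=0.3, seed=0):
    """Draw n observations (x_i, y_i) with y_i = f(x_i) + eps_i."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 3) for _ in range(n)]
    # eps_i is drawn independently of x_i and has mean zero.
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


xs, ys = simulate(1000)
# Because we know f here, we can recover the errors and check their average.
errors = [y - true_f(x) for x, y in zip(xs, ys)]
mean_error = sum(errors) / len(errors)
print(f"average error: {mean_error:.4f}")
```

The average error should be close to zero, which is the property that lets us drop $\epsilon$ when predicting.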


### Why estimate f?

For prediction and inference

#### Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using:

$\hat{Y} = \hat{f}(X)$
- $\hat{f}$ is the estimate for $f$ (treated as a *black box*)
- $\hat{Y}$ is the prediction for $Y$

The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the **reducible** error and the **irreducible** error. 

One error is **reducible** because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$.

The second is **irreducible** because even with a perfect estimate of $f$ (so that $\hat{Y} = f(X)$), $Y$ still depends on $\epsilon$, which cannot be predicted using $X$.

The focus of this book is on techniques for estimating $f$ with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$. This bound is almost always unknown in practice.
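The split into two errors can be written out as a short derivation. As a simplifying assumption, treat $\hat{f}$ and $X$ as fixed, so that the only randomness comes from $\epsilon$:

$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2$

$= [f(X) - \hat{f}(X)]^2 + 2[f(X) - \hat{f}(X)]E(\epsilon) + E(\epsilon^2)$

$= [f(X) - \hat{f}(X)]^2 + Var(\epsilon)$

The cross term vanishes because $E(\epsilon) = 0$, which also gives $E(\epsilon^2) = Var(\epsilon)$. The first term is the reducible error (it shrinks as $\hat{f}$ gets closer to $f$), and $Var(\epsilon)$ is the irreducible error.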

#### Inference

Our goal with inference is not to make predictions but to understand the relationship between $X$ and $Y$. You might be interested in answering the following questions:

- Which predictors are associated with the response? 
- What is the relationship between the response and each predictor?
- Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### How do we estimate f?

Our goal is to apply a statistical learning method to the observations (training data) in order to estimate the unknown function $f$. We want to find a function $\hat{f}$ such that $Y ≈ \hat{f}(X)$ for any observation $(X,Y)$. There are two types of approaches: **parametric** and **non-parametric**.

#### Parametric methods

Parametric methods involve a two-step model-based approach.

1. Make an assumption about the functional form, or shape, of $f$. For example, we could assume that $f$ is linear: $f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ (2.4)

2. Apply a procedure that uses the training data to estimate the parameters $\beta_0,\beta_1,...,\beta_p$.

Assuming a parametric form for $f$ simplifies the problem of estimating $f$ because it is generally much easier to estimate a set of parameters, such as $\beta_0,\beta_1,...,\beta_p$ in the linear model (2.4), than it is to fit an entirely arbitrary function $f$.
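The two steps can be sketched for the simplest case of a single predictor ($p = 1$). The estimation procedure used here is ordinary least squares, which is one common choice and is assumed for illustration; the function names `fit_linear` and `f_hat` and the toy data are hypothetical.

```python
def fit_linear(xs, ys):
    """Step 2 for the model f(X) = beta_0 + beta_1 * X:
    least squares estimates of beta_0 and beta_1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = fit_linear(xs, ys)


def f_hat(x):
    # The fitted model: estimating f has been reduced to two numbers.
    return beta_0 + beta_1 * x


print(beta_0, beta_1, f_hat(4.0))
```

Once the linear form is assumed, the whole estimate of $f$ is carried by `beta_0` and `beta_1`, which is what makes the parametric problem easier than fitting an arbitrary function.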

The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of $f$. If the chosen model is too far from the true $f$, then our estimate will be poor.

Assuming a more **flexible** parametric form can improve the estimate of $f$, but it requires estimating a greater number of parameters ($\beta_i$). It can also lead to **overfitting**, which means that the estimate follows the errors (noise) too closely.

#### Non-parametric methods

Non-parametric methods do not make explicit assumptions about the functional form of $f$, so they have the potential to accurately fit a wider range of possible shapes for $f$.
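As one example of a non-parametric method, chosen here for illustration and not mentioned in the notes above, k-nearest-neighbors regression predicts the response at a point by averaging the responses of the closest training observations. The function name `knn_predict` and the toy data are hypothetical.

```python
def knn_predict(x0, xs, ys, k=3):
    """Predict the response at x0 as the mean response of the k
    training points whose x values are closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


# Toy training data following a curved (non-linear) pattern.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

pred_k3 = knn_predict(2.1, xs, ys, k=3)  # averages the responses at x = 2, 3, 1
pred_k1 = knn_predict(2.1, xs, ys, k=1)  # copies the response at x = 2
print(pred_k3, pred_k1)
```

No functional form for $f$ is assumed anywhere in the code: the shape of the estimate comes entirely from the training data.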


#### The trade-off between prediction accuracy and model interpretability

![Tradeoff](./2.7.png)
*A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.*


Why would we ever choose to use a more restrictive method instead of a very flexible approach? 

- If we are mainly interested in inference, then restrictive models are much more interpretable.
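- Even if we are only interested in prediction, and the interpretability of the predictive model is simply not of interest, a less flexible method can still predict more accurately because it is less prone to overfitting.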


In the prediction setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! **We will often obtain more accurate predictions using a less flexible method**. This phenomenon, which may seem counterintuitive at first glance, has to do with the **potential for overfitting in highly flexible methods.** 
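The overfitting effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not a result from the book: the true function, the noise level, the sample sizes, and the use of k-nearest-neighbors averaging as the learning method are all choices made for illustration. Smaller `k` means a more flexible fit. Accuracy is measured as the average squared difference between observed and predicted responses, both on the training data and on fresh data from the same process.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    # Mean response of the k training points closest to x0.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    # Average squared gap between observed and predicted responses.
    sq = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
          for x, y in zip(xs_eval, ys_eval)]
    return sum(sq) / len(sq)


def simulate(n, rng, noise_sd=0.5):
    # Hypothetical truth: f(x) = sin(2x), plus mean-zero Gaussian noise.
    xs = [rng.uniform(0, 3) for _ in range(n)]
    ys = [math.sin(2 * x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


rng = random.Random(1)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(1000, rng)

results = {}
for k in (1, 10):
    train_mse = mse(x_train, y_train, x_train, y_train, k)
    test_mse = mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train_mse, test_mse)
    print(f"k={k:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the fit reproduces the training responses, noise included, and the training error is zero. On the fresh data that same fit tends to do worse than the smoother `k=10` fit, although the exact numbers depend on the seed and the noise level.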

#### Supervised vs unsupervised learning

*Supervised:* 
- For each observation of the predictor measurement(s) $x_i, i = 1,...,n$ there is an associated response measurement $y_i$. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). 

*Unsupervised:*
- For every observation $i = 1,...,n$, we observe a vector of measurements $x_i$ but no associated response $y_i$. In this setting, we are in some sense **working blind**; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. One statistical learning tool that we may use is cluster analysis, or clustering.

*Semi-supervised (not in the book)*
- We have a set of $n$ observations. For $m$ of the observations, where $m < n$, we have both predictor measurements and a response measurement. For the remaining $n − m$ observations, we have predictor measurements but no response measurement.

#### Regression vs classification problem

We tend to refer to problems with a **quantitative** response as **regression** problems, while those involving a **qualitative** response are often referred to as **classification** problems.

### Assessing model accuracy

There is no free lunch in statistics: no one method dominates all others over all possible data sets. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

#### Measuring the quality of the fit

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. In the regression setting, the most commonly used measure is the **mean squared error (MSE)**, given by:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$

The MSE will be small if the predicted responses are very close to the true responses, and will be large if, for some of the observations, the predicted and true responses differ substantially.
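The formula translates directly into code. The following is a minimal sketch; the function name `mse` and the three toy observations are made up for illustration, with `y_hat` standing in for the values $\hat{f}(x_i)$.

```python
def mse(ys, preds):
    """Mean squared error between observed and predicted responses."""
    n = len(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / n


y_obs = [3.0, 5.0, 7.0]   # observed responses y_i
y_hat = [2.5, 5.0, 8.0]   # predictions f_hat(x_i)

# Squared errors are 0.25, 0.0 and 1.0, so the MSE is 1.25 / 3.
value = mse(y_obs, y_hat)
print(value)
```

The third observation contributes most of the total, which shows how a single large miss dominates the MSE because errors are squared before averaging.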