# Conceptual Foundations of Statistical Learning

*Based on ISL*

## What Is Statistical Learning?

Broadly speaking...

> **Supervised** statistical learning: build a statistical model to predict or estimate an *output* based on one or more *inputs*

> **Unsupervised** statistical learning: there are inputs but no supervised output; nevertheless we can learn relationships and structure from such data

Suppose the relationship between input $X$ and output $Y$ can be written as

$$Y = f(X) + \epsilon.$$

Where

- $X$... input, predictor, feature, independent variable
- $Y$... output, response, dependent variable
- $f$... systematic information that $X$ provides about $Y$ (fixed but unknown)
- $\epsilon$... random error term with mean zero
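
To make the setup concrete, the following minimal sketch simulates data from $Y = f(X) + \epsilon$. The particular $f$ (a sine curve), the noise level `sigma` and the sample size are assumptions chosen purely for illustration; in practice $f$ is unknown.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def f(x):
    """Hypothetical 'true' systematic part; unknown in a real problem."""
    return math.sin(2 * x)


n = 1000
sigma = 0.3  # assumed standard deviation of the error term

x = [random.uniform(0, 3) for _ in range(n)]
eps = [random.gauss(0, sigma) for _ in range(n)]  # mean-zero random error
y = [f(xi) + ei for xi, ei in zip(x, eps)]        # Y = f(X) + eps

mean_eps = sum(eps) / n
print(f"sample mean of eps: {mean_eps:.3f}")  # close to zero, not exactly zero
```

Only `x` and `y` would be observed in a real data set; `f` and `eps` are visible here only because the data are simulated.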

---

> Statistical learning refers to a set of approaches for estimating $f$.

## Why Estimate $f$?

### Prediction

We have input $X$ and wish to *predict* output $Y$.

Assuming we can approximate $f$ by some estimate $\hat{f}$, we obtain a prediction $\hat{Y}$ of $Y$ using

$$\hat{Y} = \hat{f}(X).$$

We can treat $\hat{f}$ as a *black box*, provided that it yields sufficiently accurate predictions for $Y$.

Accuracy of prediction $\hat{Y}$ depends on:

#### Reducible error
How well does $\hat{f}$ approximate $f$?

Can be decreased by improving the model.

#### Irreducible error
How large is the error term $\epsilon$?

Represents fluctuations in $Y$ that $X$ cannot explain, so no choice of $\hat{f}$ can remove them.
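
The split into the two error sources can be written down explicitly. As a sketch, treat $\hat{f}$ and $X$ as fixed, so that the only randomness comes from $\epsilon$ (mean zero, as assumed above). Expanding the square and using $E(\epsilon) = 0$ to drop the cross term gives

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \textrm{Var}(\epsilon).$$

The first term is the reducible error and the second is the irreducible error, which is a lower bound on the expected squared prediction error of any $\hat{f}$.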

### Inference

*Understand* how $Y$ is affected as $X$ changes.

We need to estimate $f$, but we do not necessarily want to make predictions.

We analyze properties of $\hat{f}$ itself to improve our understanding.

Typical questions:
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between $Y$ and each predictor be described using a simple, interpretable model?

In some fields of application, interpretable (transparent) *white box* models are preferred.
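
As a minimal sketch of inference with a simple, interpretable model, the code below fits a separate least-squares line for each of two predictors and inspects the slopes. The data are simulated and the names `x1`, `x2` and `simple_linear_fit` are hypothetical; by construction only `x1` influences the response, and the two predictors are independent, so the one-predictor fits are not distorted by each other.

```python
import random

random.seed(1)

n = 500
# Two hypothetical predictors; only x1 influences y in this simulation.
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 + 2.0 * a + random.gauss(0, 1) for a in x1]


def simple_linear_fit(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept1, slope1 = simple_linear_fit(x1, y)
intercept2, slope2 = simple_linear_fit(x2, y)

# The slope answers "how does y change per unit change in the predictor?"
print(f"slope for x1: {slope1:.2f}")  # near the simulated value 2
print(f"slope for x2: {slope2:.2f}")  # near 0: no association by construction
```

Here the object of interest is the fitted slope itself, not any individual prediction.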

## How Do We Estimate $f$?

*Very interesting but out of scope.*

## Problem & Model Categories

Recall our very general framing of the problem as

$$Y = f(X) + \epsilon.$$

### Supervised vs. Unsupervised

- *Supervised*: for each observation $x_i$ there is an associated response $y_i$
- *Unsupervised*: no associated response is available, for whatever reason (typically the more challenging setting)

Possible examples for $(X, Y)$ in a *supervised* setting:
- $X$: picture, $Y$: object on that picture
- $X$: tweet, $Y$: sentiment of that tweet
- $X$: macroeconomic time series, $Y$: prediction of GDP next year
- $X$: bank records of a client, $Y$: probability of default

Possible examples for $X$ (and no $Y$) in an *unsupervised* setting:
- $X$: playlists (songs, videos, ...) of users; understand user taste to make recommendations
- $X$: document; identify similar documents
- $X$: monitoring data on a production site; identify outliers to detect failures early
- $X$: customer data; segment customers into different groups for e.g. marketing purposes

### Regression vs. Classification

- Response variables $Y$ (or $y_i$) can be characterized as either *quantitative* or *qualitative* (or *categorical*).
- A quantitative variable takes on numerical values, whereas a qualitative variable assumes a value in one of $K$ different *classes*.

- *Regression*: quantitative response
- *Classification*: qualitative response

Note that a quantitative variable can be categorized (quantized) if desired, which turns a regression problem into a classification problem.
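
A minimal sketch of such a quantization, mapping a quantitative default probability to one of $K = 3$ classes. The cut points and class labels are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical cut points and class labels (K = 3 classes).
thresholds = [0.05, 0.20]
labels = ["low", "medium", "high"]


def quantize(p):
    """Map a quantitative value to the class whose interval contains it."""
    return labels[bisect_right(thresholds, p)]


default_probabilities = [0.01, 0.05, 0.12, 0.35]
risk_classes = [quantize(p) for p in default_probabilities]
print(risk_classes)  # ['low', 'medium', 'medium', 'high']
```

With `bisect_right`, a value exactly on a cut point falls into the higher class; that boundary convention is a design choice, not something the data dictate.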

With regard to the examples above we observe:
- $Y$: object on that picture → qualitative → classification
- $Y$: sentiment of that tweet → qualitative or quantitative → classification or regression
- $Y$: prediction of GDP next year → quantitative → regression
- $Y$: probability of default → quantitative → regression

### Model-Based vs. Instance-Based

- *Instance*-based: keep the historical data and predict by looking up similar past observations
- *Model*-based: derive a model from the data and predict from the model alone
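
The contrast can be sketched with two small predictors on the same simulated data. k-nearest-neighbor averaging stands in for the instance-based approach and a least-squares line for the model-based one; both choices, the data and all function names are assumptions made for illustration.

```python
import random

random.seed(2)

# Hypothetical training data from y = 1 + 0.5 x + noise.
x_train = [random.uniform(0, 10) for _ in range(200)]
y_train = [1.0 + 0.5 * x + random.gauss(0, 0.5) for x in x_train]


def knn_predict(x0, k=5):
    """Instance-based: average the responses of the k nearest stored points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def fit_line(x, y):
    """Model-based: compress the data into two parameters."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(x_train, y_train)


def line_predict(x0):
    """After fitting, the training data are no longer needed."""
    return intercept + slope * x0


x0 = 4.0
print(f"instance-based: {knn_predict(x0):.2f}, model-based: {line_predict(x0):.2f}")
```

Note that `knn_predict` must keep the full training set around at prediction time, whereas `line_predict` needs only the two fitted numbers.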

## Assessing Model Accuracy

> Which model works best?

*There is no free lunch in statistics*: no one method dominates all others over all possible data sets.

We need a way to assess model performance in order to compare different approaches.

### Measuring the Quality of Fit

*Depending on the problem, other measures may be more appropriate!*

#### Regression

Mean squared error (MSE):

$$\textrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2.$$
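
The formula translates directly into code. A minimal sketch with made-up numbers, where `y_pred` plays the role of the values $\hat{f}(x_i)$:

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between responses and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]   # observed responses y_i (made up)
y_pred = [2.5, 0.0, 2.0, 8.0]    # predictions f_hat(x_i) (made up)

mse = mean_squared_error(y_true, y_pred)
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```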

#### Classification

Error rate (one minus the accuracy):


$$\frac{1}{n}\sum_{i=1}^{n}\textrm{I}(y_i \neq \hat{y}_i).$$


(Proportion of mistakes that are made.)
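
A minimal sketch of the same computation in code, using made-up class labels. The indicator $\textrm{I}(y_i \neq \hat{y}_i)$ becomes a boolean comparison, and the accuracy is one minus the error rate.

```python
def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted class is wrong."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # True counts as 1 and False as 0 when summed.
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


y_true = ["cat", "dog", "dog", "cat", "bird"]   # made-up true classes
y_pred = ["cat", "dog", "cat", "cat", "dog"]    # made-up predicted classes

err = error_rate(y_true, y_pred)   # 2 of 5 predictions are wrong
accuracy = 1 - err
print(f"error rate: {err:.1f}, accuracy: {accuracy:.1f}")
```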

#### Training vs. Test Error

> Training error: Are we able to learn anything at all?

> Test error: Does the model generalize? How well will it work on unseen data?

<a href="http://faculty.marshall.usc.edu/gareth-james/ISL/">
<img src="../images/isl/isl_2.9_train_vs_test_error.png" alt="Train vs. Test Error" width=800>
</a>

Test error in red, train error in blue.
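
The gap between training and test error can be reproduced in a small simulation. The sketch below uses k-nearest-neighbor averaging, where a smaller `k` means a more flexible fit; this is an illustrative stand-in and not necessarily the method behind the figure. The true function, noise level and sample sizes are assumptions.

```python
import math
import random

random.seed(3)


def sample(n):
    """Draw n points from the hypothetical model y = sin(x) + noise."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys


x_train, y_train = sample(100)
x_test, y_test = sample(1000)   # unseen data from the same process


def knn_predict(x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(xs, ys, k):
    return sum((y - knn_predict(x, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)


results = {}
for k in (50, 10, 1):   # from rigid to very flexible
    results[k] = (mse(x_train, y_train, k), mse(x_test, y_test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE is zero, yet the test MSE is not: a low training error alone says little about generalization.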

#### Cross-validation

> Estimate the test error by repeatedly fitting the model on different subsets of the data and evaluating it on the part held out each time
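
A minimal sketch of k-fold cross-validation, written from scratch so that every step is visible. The data are simulated, the model is k-nearest-neighbor averaging, and the names `cross_val_mse`, `k_neighbors` and `n_folds` are hypothetical; all of these are assumptions for illustration.

```python
import math
import random

random.seed(4)

# Hypothetical data set: y = sin(x) + noise.
xs = [random.uniform(0, 6) for _ in range(200)]
data = [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]


def knn_predict(train, x0, k):
    """Average the responses of the k points in `train` nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def cross_val_mse(data, k_neighbors, n_folds=5):
    """Average held-out MSE over n_folds disjoint validation folds."""
    shuffled = data[:]
    random.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, validation in enumerate(folds):
        # Fit on all folds except the i-th, evaluate on the i-th.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        sq_errors = [(y - knn_predict(train, x, k_neighbors)) ** 2
                     for x, y in validation]
        fold_errors.append(sum(sq_errors) / len(sq_errors))
    return sum(fold_errors) / n_folds


cv_scores = {k: cross_val_mse(data, k) for k in (1, 10, 50)}
best_k = min(cv_scores, key=cv_scores.get)
print(cv_scores, "-> lowest estimated test error at k =", best_k)
```

Every observation is used for validation exactly once, so the estimate uses all the data without ever scoring a point that the model was fitted on.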
