# An Introduction to Statistical Learning in R

## Introduction to Statistical Learning

What is Statistical Learning?
$\hat{=}$ a set of tools for understanding data

$\Rightarrow$ **supervised learning**
statistical models for predicting/ estimating an output based on one or more inputs

$\Rightarrow$ **unsupervised learning**
inputs but no supervising outputs $\rightarrow$ instead analysis of the relationships and structure of the data (e.g. *cluster analysis*)

Think of a model $Y=f(X)+\epsilon$ where $Y$ is the output/ response, $X$ are the input variables and $\epsilon$ is a random error term (independent of $X$, with mean zero). The general idea is to find a function $f(\cdot)$ representing the systematic information that $X$ provides about $Y$.
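
A minimal sketch of this model in R. The function `f`, the noise level and the sample size are assumptions chosen purely for illustration; in practice $f$ is unknown.

```r
# Simulate data from Y = f(X) + epsilon with a known (hypothetical) f
set.seed(1)                                # reproducibility
n   <- 200
f   <- function(x) sin(2 * x) + 0.5 * x    # assumed "true" systematic part
x   <- runif(n, min = 0, max = 3)
eps <- rnorm(n, mean = 0, sd = 0.3)        # random error, independent of x
y   <- f(x) + eps

mean(eps)                                  # close to zero: the noise averages out

# Data points scatter randomly around the systematic part f
plot(x, y, col = "grey40")
curve(f, from = 0, to = 3, add = TRUE, lwd = 2)
```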

### Reasons to estimate $f(\cdot)$

#### Prediction
using observed inputs $X$ and responses $Y$ to predict $\hat{Y}$ for new $X$ where we cannot observe the response.

$\hat{Y}=\hat{f}(X)$

here $\hat{f}$ is often treated as a black box: we are only interested in finding the best
prediction, not in the exact form of $\hat{f}$. For fixed $\hat{f}$ and $X$:

$E(Y-\hat{Y})^2=E(f(X)+\epsilon-\hat{f}(X))^2=(f(X)-\hat{f}(X))^2+Var(\epsilon)$

Thus, we can distinguish between a reducible error $(f(X)-\hat{f}(X))^2$ and an irreducible error $Var(\epsilon)$.

Note: the irreducible error $Var(\epsilon)$ is larger than zero and cannot be removed by a better $\hat{f}$ $\rightarrow$ $\epsilon$ may
contain unmeasured/ unobserved variables
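
The intermediate step of the decomposition above, sketched under the assumptions that $\hat{f}$ and $X$ are fixed and $E(\epsilon)=0$ (so the cross term vanishes and $E(\epsilon^2)=Var(\epsilon)$):

$$
\begin{aligned}
E(Y-\hat{Y})^2 &= E\big[(f(X)-\hat{f}(X))+\epsilon\big]^2 \\
&= (f(X)-\hat{f}(X))^2 + 2\,(f(X)-\hat{f}(X))\,E(\epsilon) + E(\epsilon^2) \\
&= \underbrace{(f(X)-\hat{f}(X))^2}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}}
\end{aligned}
$$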
    
#### Inference
understanding the relationship between $X$ and $Y$ $\Rightarrow$ $\hat{f}(\cdot)$ cannot be
treated as a black box.
 - Which predictors are associated with the response?
 - What is the relationship between the response and each predictor?
 - Can the relationship be adequately summarized using linear models?

### How to estimate $f(\cdot)$
    
Idea: using training data to fit a model $\hat{f}(X)$ such that
      $Y \approx \hat{f}(X) \qquad \forall (X,Y)$
      
$\Rightarrow$ there are two major approaches in statistical learning for estimating $f(\cdot)$
       
* parametric methods
* non-parametric methods

### Parametric Methods

are in general two-step, model-based approaches

 1. make assumptions about the functional form 
    $f(X)=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
 
 2. use training data to fit the model, i.e. estimate $\beta_0, \dots, \beta_p$ such that
    $Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
    (often estimated by least squares)

    Trade-off: flexibility vs. overfitting
    
Note: overfitting $\hat{=}$ complex models which follow the errors/ noise too closely
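
The two steps can be sketched in R with `lm()`, which fits a linear model by least squares. The simulated data and the "true" coefficients are assumptions made up for this example.

```r
# Parametric approach: assume f is linear, then estimate the coefficients
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical true coefficients: beta0 = 1, beta1 = 2, beta2 = -0.5
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)
train <- data.frame(y, x1, x2)

# Step 1 is the formula (linear in x1 and x2), step 2 the least-squares fit
fit <- lm(y ~ x1 + x2, data = train)
coef(fit)                                # estimates of beta0, beta1, beta2

# Prediction for a new observation whose response is not observed
new_obs <- data.frame(x1 = 0.5, x2 = -1)
predict(fit, newdata = new_obs)
```

The estimation problem is reduced to three numbers; the price is that the fit can only be as good as the assumed linear form.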

### Non-parametric Methods

estimating $f(\cdot)$ without explicit assumptions about the functional form of $f$.

$\Rightarrow$ seek an estimate of $f(\cdot)$ that gets as close to the data points as possible without being too rough or wiggly

methods:

 * Polynomial Regression
 * Step functions
 * Regression Splines
 
Note:
 - many more data points are needed for an accurate estimate of $f$
 - again the problem of overfitting
 
Trade-off: Prediction Accuracy vs. Interpretability
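
A hedged sketch of the three methods listed above, each fitted with `lm()` on simulated data. The data-generating function, the polynomial degree, the cut points and the spline degrees of freedom are arbitrary choices for illustration, not recommendations.

```r
library(splines)   # ships with R; provides bs() for regression splines

set.seed(1)
n <- 200
x <- runif(n, 0, 3)
y <- sin(2 * x) + 0.5 * x + rnorm(n, sd = 0.3)   # hypothetical data
d <- data.frame(x, y)

fit_poly   <- lm(y ~ poly(x, 4), data = d)            # degree-4 polynomial
fit_step   <- lm(y ~ cut(x, breaks = 0:3), data = d)  # piecewise constant on 3 bins
fit_spline <- lm(y ~ bs(x, df = 6), data = d)         # cubic regression spline

# Training MSE of each fit
mse <- sapply(list(poly = fit_poly, step = fit_step, spline = fit_spline),
              function(m) mean(residuals(m)^2))
mse
```

A low training MSE alone says little here: raising the degree or the degrees of freedom always pushes it down, which is exactly the overfitting problem noted above.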

### Why would we ever choose a more restrictive method instead of a very flexible approach?

restrictive models are much more interpretable $\Rightarrow$ better for inference

Note: even for prediction, more flexible methods are NOT always more accurate, because they can overfit the training data!

### Regression vs Classification Problems

The response variable can be characterized as either:

 * quantitative $\Rightarrow$ Regression Problem
 
 * qualitative (categorical) $\Rightarrow$ Classification Problem
   

### Assessing Model Accuracy

 * Mean squared error (MSE) 
   (training MSE/ error)
   
   $MSE = \frac{1}{n} \sum_{i=1}^n (y_i-\hat{f}(x_i))^2$
   
 * Average test error
   (test MSE)
   
   $Ave(y_0-\hat{f}(x_0))^2 \qquad (x_0,y_0) \text{ test observations}$
   
Note: with increasing model flexibility the training MSE keeps decreasing, while the test MSE is typically U-shaped
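
The behaviour of training and test MSE can be checked with a small simulation. This is only a sketch: the true function `f`, the sample sizes and the use of polynomial degree as the measure of flexibility are assumptions for this example.

```r
set.seed(123)
f <- function(x) sin(2 * x) + 0.5 * x          # hypothetical true f
sim <- function(n) {
  x <- runif(n, 0, 3)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}
train <- sim(50)      # small training set
test  <- sim(1000)    # independent test observations (x_0, y_0)

degrees <- 1:10       # flexibility = degree of the polynomial
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
}))
rownames(mse) <- degrees
round(mse, 3)

# Training MSE falls with every added degree; test MSE need not
matplot(degrees, mse, type = "b", pch = 1:2, xlab = "degree", ylab = "MSE")
```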

**Assessing Model Accuracy for Classification Problems**

 * training error rate: $\frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i)$
 
 * test error rate: $Ave (I(y_0 \neq \hat{y}_0)) \qquad (x_0,y_0) \text{ test observations}$
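
Both error rates are simply the share of misclassified observations. A minimal sketch, assuming simulated two-class data and logistic regression (`glm()` with `family = binomial`) as a stand-in classifier; any other classifier could take its place.

```r
set.seed(7)
sim <- function(n) {
  x <- rnorm(n)
  # Hypothetical model: P(Y = "yes") follows a logistic curve in x
  y <- factor(ifelse(runif(n) < plogis(2 * x), "yes", "no"))
  data.frame(x, y)
}
train <- sim(200)
test  <- sim(1000)

fit <- glm(y ~ x, data = train, family = binomial)

classify <- function(newdata) {
  p <- predict(fit, newdata = newdata, type = "response")  # estimated P(Y = "yes")
  ifelse(p > 0.5, "yes", "no")
}

train_error <- mean(train$y != classify(train))  # (1/n) * sum of I(y_i != yhat_i)
test_error  <- mean(test$y  != classify(test))   # Ave(I(y_0 != yhat_0))
c(train = train_error, test = test_error)
```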

## References for this and the following Chapters

*James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.*

*Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85.*