**General Machine Learning Theory**

**Machine learning = science of programming computers so they can learn from data.**

# Statistical learning

**Statistical learning = a set of approaches for estimating f.**

$$\large Y = f(X) + \epsilon$$ 

where $\epsilon$ is a random error term that is independent of $X$ and has mean zero, and $f$ represents the systematic information that $X$ provides about $Y$.

**X:** input variables (predictors/ features/ independent variables)

**Y:** output variable (response/ dependent variable)

Main reasons for estimating f?
- **Prediction**: 

    - When a set of inputs $X$ is readily available, but the output $Y$ cannot easily be obtained.
    - Since $\epsilon$ averages to zero, Y is predicted using:

    $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is the estimate for $f$ and $\hat{Y}$ is the resulting prediction. 
    - Here, we are not concerned with the exact form of $\hat{f}$ (it's a black box).
    - There are two types of errors: 
        - **the reducible error** ($\hat{f}$ is not a perfect estimate of $f$) 
        - **the irreducible error** ($Y$ also depends on $\epsilon$, which cannot be predicted from $X$, so $Var(\epsilon) > 0$). This error places an upper bound on the accuracy of our predictions for $Y$.


- **Inference**:
    - When trying to understand the association between $Y$ and $X_{1}, X_{2}, X_{3}, \dots$ Here, unlike in prediction, the exact form of $\hat{f}$ matters.
    
There are two types of statistical learning methods:

- **Parametric**: these methods reduce the problem of estimating $f$ down to one of estimating a set of parameters. How? First you make an assumption about the functional form of $f$ (e.g. that it is linear), then you fit (or train) the chosen model (e.g. with the ordinary least squares method).


- **Non-parametric**: these methods do not make explicit assumptions about the functional form of $f$. Instead they seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. Disadvantage: since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$.
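To make the contrast concrete, here is a minimal sketch using only the Python standard library. The data-generating function ($f(x) = 1 + 2x$), the noise level, and the helper names `fit_linear` and `knn_predict` are all assumptions made up for illustration. The parametric approach estimates two numbers; the non-parametric one (k-nearest-neighbour averaging, one of many possible choices) keeps the whole training set and averages nearby responses.

```python
import random

def fit_linear(xs, ys):
    """Parametric: assume f(x) = b0 + b1 * x and estimate b0, b1 by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

def knn_predict(xs, ys, x0, k=5):
    """Non-parametric: average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Simulated data; the "true" f is an assumption chosen for this example.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

b0, b1 = fit_linear(xs, ys)
linear_pred = b0 + b1 * 5.0                # prediction at x0 = 5 from two parameters
knn_pred = knn_predict(xs, ys, 5.0, k=5)   # prediction at x0 = 5 from 5 neighbours

print(f"OLS: b0={b0:.2f}, b1={b1:.2f}, prediction at 5: {linear_pred:.2f}")
print(f"KNN prediction at 5: {knn_pred:.2f}")
```

Because the simulated relationship really is linear, the parametric fit should recover it well from few parameters; the neighbour average makes no such assumption and relies on having enough points near `x0`.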

Difference between Regression and Classification problems:

- **Regression** problems have a quantitative response.
- **Classification** problems have a qualitative response (logistic regression, despite its name, falls into this category!)

## Model accuracy

### Regression

For regression, the most commonly used measure of quality of fit is the mean squared error (MSE). 

$$\large MSE = \frac{1}{n} \sum\limits_{i=1}^n (y_{i} - \hat{f}(x_{i}))^2$$

Here, the MSE is computed using the training data, so it is referred to as the training MSE. Since we are interested in the accuracy of the predictions on data the model has not seen, we should instead compute the test MSE (the average squared prediction error over previously unseen test observations $(x_{0}, y_{0})$):

$$\large Ave (y_{0} - \hat{f}(x_{0}))^2$$

Note: there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE!
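The gap between training and test MSE can be seen in a small simulation. This is only a sketch under stated assumptions: the true function ($\sin x$), the noise level, and the use of k-nearest-neighbour averaging as the fitting method are illustrative choices, and `knn_predict`, `mse` and `simulate` are hypothetical helper names.

```python
import math
import random

def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a KNN model fitted on `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def simulate(rng, n):
    """Draw n observations from y = sin(x) + noise (an assumed toy model)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        data.append((x, math.sin(x) + rng.gauss(0, 0.5)))
    return data

rng = random.Random(1)
train = simulate(rng, 100)
test = simulate(rng, 1000)

# For each k: (training MSE, test MSE)
results = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 10)}
for k, (train_mse, test_mse) in results.items():
    print(f"k={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With `k=1` every training point is its own nearest neighbour, so the training MSE is zero, yet that very flexible fit would be expected to do worse on the test set than `k=10` in this setup.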

The expected test MSE for a given value $x_{0}$ can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_{0})$, the squared bias of $\hat{f}(x_{0})$, and the variance of the error term $\epsilon$. 
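Written out in the notation used above, where the expectation is taken over repeated training sets and the noise in $y_{0}$, the decomposition reads:

$$\large E\left(y_{0} - \hat{f}(x_{0})\right)^2 = Var(\hat{f}(x_{0})) + \left[Bias(\hat{f}(x_{0}))\right]^2 + Var(\epsilon)$$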

   Since $Var(\hat{f}(x_{0}))$ and $Bias(\hat{f}(x_{0}))^2$ are both non-negative, the expected test MSE can never lie below $Var(\epsilon)$, the irreducible error.
    
   **BIAS-VARIANCE TRADE-OFF**
   - **The variance = amount by which $\hat{f}(x_{0})$ would change if we estimated it using a different training data set.**
   - **Bias = error that is introduced by approximating a real life problem by a much simpler model.**
   - In general, the more flexible the model, the higher the variance and the lower the bias. The bias tends to initially decrease more than the variance increases. That is why the test MSE usually decreases at first (until the variance starts to significantly increase).
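The trade-off can be estimated numerically by refitting the same method on many simulated training sets and looking at how the prediction at one point $x_{0}$ behaves. The sketch below assumes a toy truth $f(x) = \sin x$, Gaussian noise, and KNN averaging as the model, with small `k` standing for a flexible fit and large `k` for a rigid one; all names and settings are illustrative assumptions.

```python
import math
import random
import statistics

def f(x):
    """Assumed true regression function for this simulation."""
    return math.sin(x)

def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def bias_variance(k, x0=1.5, n_train=50, n_repeats=300, seed=0):
    """Estimate squared bias and variance of the KNN prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        # A fresh training set on every repeat.
        train = []
        for _ in range(n_train):
            x = rng.uniform(0, 6)
            train.append((x, f(x) + rng.gauss(0, 0.5)))
        preds.append(knn_predict(train, x0, k))
    bias_sq = (statistics.mean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

flexible = bias_variance(k=1)   # very flexible: follows individual points
rigid = bias_variance(k=40)     # very rigid: averages most of the data

print(f"k=1   squared bias={flexible[0]:.3f}  variance={flexible[1]:.3f}")
print(f"k=40  squared bias={rigid[0]:.3f}  variance={rigid[1]:.3f}")
```

In this setup the flexible fit should show the larger variance and the rigid fit the larger squared bias, which is the pattern described in the list above.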

### Classification

The **bias-variance trade-off** transfers over to the classification setting. However, since $y_{i}$ is no longer quantitative, the accuracy of the estimate $\hat{f}$ is most commonly quantified by the training error rate, the proportion of mistakes made when $\hat{f}$ is applied to the training observations:

$$\large \frac{1}{n} \sum\limits_{i=1}^n I(y_{i} \neq \hat{y_{i}})$$

where $\hat{y_{i}}$ is the predicted class label for the ith observation and $I(y_{i} \neq \hat{y_{i}})$ is an indicator variable that equals 1 if $y_{i} \neq \hat{y_{i}}$, and 0 if $y_{i} = \hat{y_{i}}$.
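As a small worked example, the error rate is just the fraction of labels that differ. The labels below are made up for illustration, and `error_rate` is a hypothetical helper name.

```python
def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted label differs from the true label."""
    mistakes = sum(yt != yp for yt, yp in zip(y_true, y_pred))  # sum of the indicators
    return mistakes / len(y_true)

y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "spam"]

rate = error_rate(y_true, y_pred)  # 2 of the 5 predictions are wrong
print(rate)
```

The same function gives the test error rate when it is applied to observations that were not used for fitting.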

The test error rate is:

$$\large Ave(I(y_{0} \neq \hat{y_{0}}))$$

where $\hat{y_{0}}$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_{0}$.

This test error is minimized on average by **the Bayes Classifier**, which assigns each observation to the most likely class, given its predictor values. The classifier always chooses the class $j$ for which $P(Y=j | X=x_{0})$ is the largest.

The Bayes classifier produces the lowest possible test error rate. The Bayes error rate is:

$$\large 1 - E\left(\max_j P(Y = j | X)\right)$$
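On a toy population where the conditional probabilities are known by construction, both the Bayes classifier and the Bayes error rate can be computed directly. The probabilities and category names below are invented purely for illustration.

```python
# Assumed toy population: X takes three values, Y takes two classes.
p_x = {"low": 0.5, "mid": 0.3, "high": 0.2}   # P(X = x)
p_y_given_x = {                               # P(Y = j | X = x)
    "low": {"A": 0.9, "B": 0.1},
    "mid": {"A": 0.6, "B": 0.4},
    "high": {"A": 0.2, "B": 0.8},
}

def bayes_classify(x):
    """Assign x to the class with the largest conditional probability."""
    probs = p_y_given_x[x]
    return max(probs, key=probs.get)

# Bayes error rate: 1 - E(max_j P(Y = j | X)), with the expectation taken over X.
expected_max = sum(p_x[x] * max(p_y_given_x[x].values()) for x in p_x)
bayes_error = 1 - expected_max   # 1 - (0.45 + 0.18 + 0.16), approximately 0.21

print(bayes_classify("high"), bayes_error)
```

Even the best possible classifier is wrong about 21% of the time here, because the classes overlap for every value of X.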

In practice, we do not know the conditional distribution of $Y$ given $X$ for real data, so the Bayes classifier cannot be computed. It serves as an unattainable gold standard that we aim to approximate as well as possible.
Many methods therefore estimate the conditional distribution of $Y$ given $X$ and then assign a given observation to the class with the highest estimated probability. One such method is **the KNN Classifier**, which estimates $P(Y=j | X=x_{0})$ as the fraction of the $K$ training points nearest to $x_{0}$ whose response equals $j$.
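A minimal sketch of a KNN classifier on one predictor, using only the standard library. The two Gaussian classes, the choice `k=15`, and the helper name `knn_classify` are assumptions for illustration.

```python
import random
from collections import Counter

def knn_classify(train, x0, k):
    """Estimate class probabilities at x0 from the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    counts = Counter(label for _, label in nearest)
    # Estimated P(Y = j | X = x0): fraction of the k neighbours in class j.
    probs = {label: count / k for label, count in counts.items()}
    return max(probs, key=probs.get), probs

# Simulated training data: class A centred at 0, class B centred at 3.
rng = random.Random(0)
train = [(rng.gauss(0, 1), "A") for _ in range(100)]
train += [(rng.gauss(3, 1), "B") for _ in range(100)]

label_low, probs_low = knn_classify(train, -1.0, k=15)
label_high, probs_high = knn_classify(train, 4.0, k=15)

print(label_low, probs_low)
print(label_high, probs_high)
```

The choice of `k` plays the role of the flexibility knob: a small `k` gives a flexible, high-variance decision boundary, while a large `k` gives a smoother, higher-bias one.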

### Cross Validation

In the absence of a very large designated test set that can be used to directly estimate the test error rate, a number of techniques can be used to estimate this quantity using the available training data. 

- **Validation set approach:**

Method: randomly divide the available set of observations into two parts, a **training set** and a **validation set** (or **hold-out set**). The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate provides an estimate of the test error rate.

Drawbacks:
1. Highly variable validation estimate of the test error rate (depending on which observations are included in the training/validation sets).
2. Only a subset of the observations (those included in the training set) is used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, the validation set error rate may overestimate the test error rate for the model fit on the entire data set.
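The validation set approach, and the variability described in the first drawback, can be sketched as follows. The simulated linear data, the 50/50 split, and the helper names `fit_linear` and `validation_mse` are assumptions for illustration; a regression model is used, so the error measure is the validation MSE.

```python
import random

def fit_linear(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

def validation_mse(data, seed, train_fraction=0.5):
    """Randomly split the data, fit on one part, and score on the held-out part."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, valid = shuffled[:cut], shuffled[cut:]
    b0, b1 = fit_linear([x for x, _ in train], [y for _, y in train])
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in valid) / len(valid)

# Simulated data set (assumed truth: y = 1 + 2x + noise with variance 1).
data_rng = random.Random(42)
data = []
for _ in range(100):
    x = data_rng.uniform(0, 10)
    data.append((x, 1.0 + 2.0 * x + data_rng.gauss(0, 1)))

# Ten different random splits give ten different estimates of the test MSE.
estimates = [validation_mse(data, seed) for seed in range(10)]
print([round(e, 3) for e in estimates])
print("spread:", round(max(estimates) - min(estimates), 3))
```

The spread between the ten estimates comes only from which observations land in each half, which is the variability that the resampling schemes listed next are meant to reduce.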

- **Leave one out approach:**

- **K-fold cross validation:**