# Assessing Model Accuracy 

## Measuring the Quality of the Fit

### Regression Models

To evaluate the performance of a statistical learning model we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.

In **regression**, the most commonly used measure is the *mean squared error* (MSE):

$$
MSE = \frac{1}{n}\sum_{i = 1}^{n} (y_i - \hat{f}(x_i))^2
$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation. The MSE will be small if the predicted values are very close to the true responses, and will be larger if for some observations the predicted and true responses differ significantly. 
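As a minimal sketch of the formula above, the MSE can be computed directly from paired responses and predictions. The helper name `mean_squared_error` and the four observations below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between responses and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical responses y_i and predictions f_hat(x_i) for n = 4 observations
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 8.0, 9.5]

mse = mean_squared_error(y, y_hat)
print(mse)  # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
```

The single observation with an error of $1$ contributes two thirds of the total, which shows how squaring lets a few large errors dominate the MSE.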

In most cases we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. In other words, we want to know whether $\hat{f}(x_0)$ is approximately equal to $y_0$, where $(x_0, y_0)$ is a previously **unseen test observation not used to train the statistical learning method**. We want to choose the method that gives the **lowest test MSE**:

$$
Avg(y_0 - \hat{f}(x_0))^2
$$

Since the training MSE and the test MSE appear to be closely related, one might be tempted to simply select the statistical learning method that minimizes the training MSE. However, there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

As model flexibility (i.e. the ability of a model to fit a wide variety of patterns in the data) increases, the training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be **overfitting** the data. This happens because our statistical learning method becomes more capable of fitting even the smallest details in the training data, **including noise and random fluctuations** that don't represent the true underlying patterns. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data.
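The behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true function `true_f`, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for flexibility are all hypothetical choices, not part of the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical true regression function, chosen only for illustration
    return np.sin(2 * np.pi * x)

def simulate(n):
    """Draw n observations (x, y) with y = true_f(x) + noise."""
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = simulate(30)     # data used to fit each model
x_test, y_test = simulate(1000)     # unseen data, used only for evaluation

degrees = [1, 3, 6, 9]              # higher degree = more flexible model
train_mse, test_mse = [], []
for d in degrees:
    # Least-squares polynomial fit of degree d on the training data
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=d)
    train_mse.append(np.mean((y_train - poly(x_train)) ** 2))
    test_mse.append(np.mean((y_test - poly(x_test)) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}: training MSE={tr:.3f}, test MSE={te:.3f}")
```

Because each polynomial family contains the lower-degree ones, the training MSE cannot increase as the degree grows. The test MSE carries no such guarantee, which is why it has to be estimated on data that was held out from fitting.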

**N o t e**\
Regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE.

### Bias-Variance Trade-Off

The expected test MSE, for a given value $x_0$, can always be decomposed into the sum of three fundamental quantities:

$$
E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)
$$

where $E(y_0 - \hat{f}(x_0))^2$ is the expected test MSE at $x_0$: the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each estimate at $x_0$. The overall expected test MSE can be computed by averaging $E(y_0 - \hat{f}(x_0))^2$ over all possible values of $x_0$ in the test set.

The equation above tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves **low variance** and **low bias**.
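A short derivation shows where the three terms come from. It is a sketch that relies on assumptions not stated in this section: the test response follows $y_0 = f(x_0) + \epsilon$ with $E(\epsilon) = 0$, the error $\epsilon$ is independent of $\hat{f}(x_0)$, and $Bias(\hat{f}(x_0)) = E(\hat{f}(x_0)) - f(x_0)$. First, the noise separates from the estimation error because the cross term $2E[(f(x_0) - \hat{f}(x_0))\epsilon]$ is zero:

$$
\begin{aligned}
E(y_0 - \hat{f}(x_0))^2 &= E(f(x_0) - \hat{f}(x_0) + \epsilon)^2 \\
&= E(f(x_0) - \hat{f}(x_0))^2 + 2E[(f(x_0) - \hat{f}(x_0))\epsilon] + E(\epsilon^2) \\
&= E(f(x_0) - \hat{f}(x_0))^2 + Var(\epsilon)
\end{aligned}
$$

Then, adding and subtracting $E(\hat{f}(x_0))$ splits the remaining term, again with a vanishing cross term since $E[\hat{f}(x_0) - E(\hat{f}(x_0))] = 0$:

$$
\begin{aligned}
E(f(x_0) - \hat{f}(x_0))^2 &= [E(\hat{f}(x_0)) - f(x_0)]^2 + E[\hat{f}(x_0) - E(\hat{f}(x_0))]^2 \\
&= [Bias(\hat{f}(x_0))]^2 + Var(\hat{f}(x_0))
\end{aligned}
$$

Under these assumptions, $Var(\epsilon)$ does not depend on $\hat{f}$, so no choice of method can push the expected test MSE below it.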

#### What do we Mean by *Variance* and *Bias* of a Statistical Learning Method? 

*Variance* refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different $\hat{f}$. Ideally, however, the estimate of $f$ should not vary too much between training sets. If a method has high variance, then small changes in the training data can result in large changes in $\hat{f}$.

*Bias* refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

**N o t e**\
As a general rule, more flexible statistical learning methods have higher variance and lower bias.
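The two definitions can be made concrete by refitting a model on many simulated training sets and looking at how its prediction at a single point $x_0$ behaves. The sketch below rests on several assumptions that are not in the source: a hypothetical true function `true_f`, a known noise level `sigma`, predictor values held fixed across training sets (only the noise is redrawn), and polynomial degree as the measure of flexibility.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Hypothetical true function, assumed only for this simulation
    return np.sin(2 * np.pi * x)

sigma = 0.3                          # assumed noise standard deviation
x0 = 0.25                            # point at which f_hat is examined
x_train = np.linspace(0, 1, 30)      # fixed design: same x in every training set
n_sets = 2000                        # number of simulated training sets

def predictions_at_x0(degree):
    """Refit a degree-`degree` polynomial on each training set; return f_hat(x0)."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        y_train = true_f(x_train) + rng.normal(0, sigma, x_train.size)
        fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
        preds[s] = fit(x0)
    return preds

results = {}
for degree in (1, 7):                # an inflexible and a flexible model
    preds = predictions_at_x0(degree)
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
    variance = preds.var()                       # variance of f_hat(x0)
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the straight line cannot follow the curvature of `true_f`, so its predictions are stable but systematically off, while the degree-7 polynomial is close on average but moves more from one training set to the next.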

### Classification Models

Consider the following training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ where $y_1, \dots, y_n$ are qualitative. The most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the training *error rate*, the proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations:

$$
\frac{1}{n} \sum_{i = 1}^{n} I(y_i \neq \hat{y}_i)
$$

where $\hat{y}_i$ is the predicted class label for the $i$th observation using $\hat{f}$, and $I(y_i \neq \hat{y}_i)$ is an **indicator variable** that equals $1$ if $y_i \neq \hat{y}_i$ and $0$ if $y_i = \hat{y}_i$. Thus, if $I(y_i \neq \hat{y}_i) = 0$, the $i$th observation was classified correctly by our classification method; otherwise it was misclassified.

The equation above is called the *training error rate* because it is computed from the data that was used to train our classifier.
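The error rate is simply the mean of the indicator variable, which makes it a one-line computation. The function name `error_rate` and the five labels below are hypothetical, used only to illustrate the definition.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted label differs from the true one."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # (y_true != y_pred) plays the role of the indicator I(y_i != y_hat_i)
    return np.mean(y_true != y_pred)

# Hypothetical true and predicted class labels for n = 5 observations
y = ["a", "b", "a", "a", "b"]
y_hat = ["a", "a", "a", "b", "b"]

rate = error_rate(y, y_hat)
print(rate)  # 2 misclassified out of 5 = 0.4
```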

The *test error* rate associated with a set of test observations of the form $(x_0, y_0)$ is given by

$$
Avg(I(y_0 \neq \hat{y}_0))
$$

where $\hat{y}_0$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_0$. A good classifier is one for which the test error is smallest.
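To see the training and test error rates side by side, the sketch below fits a deliberately simple classifier and evaluates it on both sets. Everything specific here is an assumption made for illustration: the two simulated classes, the sample sizes, and the toy rule that thresholds at the midpoint between the training class means.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n_per_class):
    """Hypothetical data: class 0 drawn from N(0, 1), class 1 from N(2, 1)."""
    x = np.concatenate([rng.normal(0, 1, n_per_class),
                        rng.normal(2, 1, n_per_class)])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return x, y

x_train, y_train = simulate(50)      # used to build the classifier
x_test, y_test = simulate(5000)      # unseen observations (x_0, y_0)

# Toy classifier: threshold at the midpoint between the two training class means
threshold = (x_train[y_train == 0].mean() + x_train[y_train == 1].mean()) / 2

def classify(x):
    """Predict class 1 when x exceeds the learned threshold, else class 0."""
    return (x > threshold).astype(int)

train_error = np.mean(y_train != classify(x_train))   # training error rate
test_error = np.mean(y_test != classify(x_test))      # test error rate

print(f"training error rate = {train_error:.3f}")
print(f"test error rate     = {test_error:.3f}")
```

Only `test_error` measures performance on observations the classifier has not seen, so it is the quantity to compare when choosing between classifiers.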

Retrieved from: 

- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An introduction to statistical learning with applications in R* (2nd ed.). Springer.

***

MSc Statistical Computing by Mathematics Research Center (CIMAT Monterrey)

October 2024