# Coefficient of Determination $R^2$

## Definition

The *Coefficient of Determination* measures the proportion of variation in the regression target that is explained by the model.

Given inputs $\mathbf{x}=(x_1, \ldots, x_n)$, targets $\mathbf{y}=(y_1,\ldots, y_n)$ and predictions $\mathbf{\hat{y}}=(\hat{y}_1,\ldots, \hat{y}_n) = (f(x_1), \ldots, f(x_n))$, the *coefficient of determination* is

\begin{equation}
    R^2 = 1 - \frac{ \sum_{i=1}^n (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^n (y_i - \overline{y})^2 }.
\end{equation}

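Here $\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i$ is the sample mean of the targets. That is, $R^2$ is one minus the ratio of the residual sum-of-squares to the total sum-of-squares.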
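
The definition translates directly into a few lines of NumPy. The following is a minimal sketch; the function name `r_squared` and the toy data are illustrative assumptions, not part of any particular library.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)     # residual sum-of-squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum-of-squares
    return 1.0 - rss / tss


# Toy data: only the last prediction is off, by one unit.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 5.0])

# RSS = 1 and TSS = 5, so R^2 is approximately 0.8.
print(r_squared(y, y_hat))
```

Note that the function is undefined when all targets are equal, since the total sum-of-squares is then zero.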

## Properties

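- $R^2 \leq 1$ always. An ordinary least-squares model with an intercept has $0\leq R^2\leq 1$ on its training data, but in general (e.g. on held-out data, or for a model fitted without an intercept) $R^2$ can be arbitrarily negative, corresponding to arbitrarily bad predictions
- In an ordinary least-squares model with an intercept, $R^2$ equals the *square* of Pearson's correlation coefficient between the predictions $\mathbf{\hat{y}}$ and the observations $\mathbf{y}$
- In an ordinary least-squares model with an intercept, $R^2$ can be rewritten as the quotient of the *explained* sum-of-squares by the total sum-of-squares: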

\begin{equation}
    R^2 = \frac{ \sum_{i=1}^n (\hat{y}_i - \overline{y})^2 }{ \sum_{i=1}^n (y_i - \overline{y})^2 }.
\end{equation}
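
The two ordinary least-squares identities above can be checked numerically. This is a sketch on synthetic data (the coefficients, sample size and seed are arbitrary assumptions), using `np.polyfit` for a one-feature OLS fit with an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)

# OLS fit with an intercept; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

tss = np.sum((y - y.mean()) ** 2)      # total sum-of-squares
rss = np.sum((y - y_hat) ** 2)         # residual sum-of-squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum-of-squares

r2_definition = 1.0 - rss / tss
r2_explained = ess / tss
r2_correlation = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared Pearson correlation

# All three values agree up to floating-point error.
print(r2_definition, r2_explained, r2_correlation)
```

The agreement relies on the fit being least-squares with an intercept; for other predictors the three quantities generally differ.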

## Interpretations

- Unexplained Variance: $R^2$ is one minus the *Fraction of Variance Unexplained* (FVU), i.e. the fraction of the variation in $\mathbf{y}$ that the model fails to predict from $\mathbf{x}$
- Comparison with Base Error Rate: $R^2$ compares $f$ against predicting each $y_i$ using the sample mean. If $R^2 = 0$ then the model is no better than predicting using the mean, and if $R^2 < 0$ it is worse. If $R^2=1$ then the model is perfect.
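
The comparison with the mean predictor can be made concrete with three sets of predictions for the same targets. This is a small sketch with made-up data; `r_squared` is a hypothetical helper implementing the definition.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS (hypothetical helper)."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_mean = r_squared(y, np.full_like(y, y.mean()))  # always predict the sample mean
r2_perfect = r_squared(y, y)                       # predict every target exactly
r2_bad = r_squared(y, y[::-1])                     # predictions in reversed order

# Mean predictor gives 0, perfect predictor gives 1,
# and the reversed predictions give a negative value (RSS = 40, TSS = 10).
print(r2_mean, r2_perfect, r2_bad)
```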

## Drawbacks

- With an ordinary least-squares model, the training $R^2$ never decreases when features are added, so relying on $R^2$ alone can lead to overfitting
- The $R^2$ doesn't indicate whether a model is appropriate, e.g. OLS models yield nearly the same $R^2$ (about 0.67) on all members of Anscombe's quartet
- The $R^2$ reflects not only the quality of the regression, but also the distribution of the independent variables
- The $R^2$ depends strongly on the number of independent variables, so it can't be used for a meaningful comparison of models with significantly different numbers of independent variables
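
The first drawback in the list above, that the training $R^2$ of an ordinary least-squares model does not go down as features are added, can be seen by appending pure-noise columns to a regression. This is a sketch on synthetic data; the helper `ols_r_squared` and all constants are illustrative assumptions.

```python
import numpy as np


def ols_r_squared(X, y):
    """Training R^2 of an OLS fit with intercept (hypothetical helper)."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    residuals = y - A @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 20))  # features generated independently of y

# R^2 using x plus the first k noise features, for k = 0, ..., 20.
scores = [ols_r_squared(np.column_stack([x, noise[:, :k]]), y) for k in range(21)]

for k in (0, 5, 10, 20):
    print(k, round(scores[k], 4))
```

The noise columns carry no information about `y`, yet each one can only lower the residual sum-of-squares on the training data, so the printed scores form a non-decreasing sequence.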

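To illustrate the dependence on the distribution of the independent variable, suppose that $X$ and $Y$ are random variables with $Y=a+bX +\epsilon$, where $\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$ is random noise (independent of $X$). Then, for the true model $f(x) = a + bx$, the population value of $R^2$ (which the sample value approaches as $n$ grows) is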

\begin{equation}
    1 - \frac{\text{E}(\epsilon^2)}{\text{Var}(Y)} = 1 - \frac{\sigma^2}{b^2 \text{Var}(X) + \sigma^2} = \frac{b^2 \text{Var}(X)}{b^2 \text{Var}(X) + \sigma^2}.
\end{equation}

Note that even with the true model the value of $R^2$ can be anything in $(0, 1)$: it is determined entirely by the slope, the variance of the independent variable and the inherent noise. Conversely, a linear model can achieve $R^2$ close to 1 even with noticeably non-linear data ([example](https://stats.stackexchange.com/a/13317)).
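
The formula above can be checked by simulation: keep the true model and the noise fixed, and change only the spread of $X$. This is a sketch under assumed values ($a=1$, $b=2$, $\sigma=1$, normally distributed $X$); the function name is hypothetical.

```python
import numpy as np


def true_model_r_squared(x_scale, a=1.0, b=2.0, sigma=1.0, n=100_000, seed=0):
    """Sample R^2 of the true model a + b*x, and the value the formula predicts."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=x_scale, size=n)            # Var(X) = x_scale**2
    y = a + b * x + rng.normal(scale=sigma, size=n)  # noise independent of x
    y_hat = a + b * x                                # predictions from the true model
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    predicted = b**2 * x_scale**2 / (b**2 * x_scale**2 + sigma**2)
    return r2, predicted


# Same model, same noise level; only the spread of X changes.
for x_scale in (0.1, 1.0, 10.0):
    r2, predicted = true_model_r_squared(x_scale)
    print(x_scale, round(r2, 3), round(predicted, 3))
```

With a narrow range of $X$ the sample $R^2$ is close to 0, and with a wide range it is close to 1, although the model and its errors are identical in both cases.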

See [these lecture notes](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf) for a list of further issues with $R^2$.