# $\S$ 2.4. Statistical Decision Theory

> The best prediction of $Y$ at any point $X=x$ is the conditional mean, when best is measured by average squared error.

Let $X\in\mathbb{R}^p$ denote a real-valued random input vector, and $Y\in\mathbb{R}$ a real-valued random output variable, with joint distribution $\text{Pr}(X,Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a *loss function* $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is *squared error loss*:

\begin{equation}
L(Y, f(X)) = (Y-f(X))^2.
\end{equation}

This leads us to a criterion for choosing $f$, the expected (squared) prediction error:

\begin{align}
\text{EPE}(f) &= \text{E}(Y-f(X))^2\\
&= \int\left(y-f(x)\right)^2\text{Pr}(dx,dy).
\end{align}

By conditioning on $X$, we can write EPE as

\begin{equation}
\text{EPE}(f) = \text{E}_X\text{E}_{Y|X}\left(\left[Y-f(X)\right]^2|X\right)
\end{equation}

and we see that it suffices to minimize EPE pointwise:

\begin{equation}
f(x) = \arg\min_c\text{E}_{Y|X}\left(\left[Y-c\right]^2|X=x\right)
\end{equation}

The solution is the conditional expectation a.k.a. the *regression* function:

\begin{equation}
f(x) = \text{E}\left(Y|X=x\right)
\end{equation}
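
The step from the pointwise criterion to the conditional mean can be spelled out. The following is a short derivation, assuming the conditional variance is finite and writing $\mu(x)=\text{E}(Y|X=x)$ (a shorthand introduced here for convenience, not part of the notation above). Adding and subtracting $\mu(x)$ inside the square gives

\begin{align}
\text{E}\left(\left[Y-c\right]^2|X=x\right) &= \text{E}\left(\left[Y-\mu(x)\right]^2|X=x\right) + 2\left(\mu(x)-c\right)\text{E}\left(Y-\mu(x)|X=x\right) + \left(\mu(x)-c\right)^2\\
&= \text{Var}(Y|X=x) + \left(\mu(x)-c\right)^2.
\end{align}

The cross term vanishes because $\text{E}\left(Y-\mu(x)|X=x\right)=0$. The variance term does not depend on $c$, so the expression is minimized by taking $c=\mu(x)$.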

### Conclusions first

Both KNN and least squares will end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:
* Least squares assumes $f(x)$ is well approximated by a globally linear function.
* $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.

Although the latter seems more palatable, we will see below that we may pay a price for this flexibility.

### The light and shadows of KNN

The $k$-nearest neighbor (KNN) method attempts to implement this recipe directly, estimating the conditional expectation from the training data:

\begin{equation}
\hat{f}(x) = \text{Ave}\left(y_i|x_i\in N_k(x)\right),
\end{equation}

where "Ave" denotes average and $N_k(x)$ is the neighborhood containing the $k$ training points closest to $x$. Two approximations are happening here:
* Expectation is approximated by averaging over sample data;
* Conditioning at a point is relaxed to conditioning on some region "close" to the target point.
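
The two approximations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `knn_regress`, the Euclidean metric, and the synthetic data (where $\text{E}(Y|X=x)=\sin(2\pi x)$ by construction) are all assumptions made for the example.

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k):
    """Average the responses of the k training points closest to x0."""
    # Euclidean distance from x0 to every training input (an assumed metric).
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k nearest training points, i.e. the neighborhood N_k(x0).
    nearest = np.argsort(dists)[:k]
    # "Ave": the sample average stands in for the conditional expectation.
    return y_train[nearest].mean()

# Tiny hand-checkable case: the 3 points nearest to x0 = 1 have y = 1, 2, 3.
X_small = np.array([[0.0], [1.0], [2.0], [10.0]])
y_small = np.array([1.0, 2.0, 3.0, 100.0])
small_estimate = knn_regress(np.array([1.0]), X_small, y_small, k=3)

# Synthetic data (hypothetical): Y = sin(2*pi*X) + noise.
rng = np.random.default_rng(0)
N = 2000
X_train = rng.uniform(0.0, 1.0, size=(N, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(0.0, 0.3, size=N)

# The true regression function at x = 0.25 is sin(pi/2) = 1.
estimate = knn_regress(np.array([0.25]), X_train, y_train, k=50)
print(small_estimate, estimate)
```

With $k=50$ out of $N=2000$ points, the neighborhood around $x=0.25$ is narrow, so the average should land close to the true conditional mean of 1.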

Note that under mild regularity conditions on the joint probability distribution $\text{Pr}(X,Y)$, one can show that

\begin{equation}
\hat{f}(x)\rightarrow\text{E}(Y|X=x)\text{ as }N,k\rightarrow\infty\text{ s.t. }k/N\rightarrow0
\end{equation}

In light of this, why look further, since it seems we have a universal approximator?
* We often do not have very large samples. Linear models can usually get a more stable estimate than $k$-nearest neighbors, provided the structured model is appropriate.
* Curse of dimensionality: As the dimension $p$ gets large, so does the metric size of the $k$-nearest neighborhood. The convergence above still holds, but the *rate* of convergence decreases as the dimension increases.
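
The growth of the neighborhood with dimension can be observed numerically. The sketch below is an illustration under assumed settings (uniform inputs on the unit cube, $N=1000$, $k=10$, a query point at the cube's center); the helper name `kth_neighbor_distance` is hypothetical.

```python
import numpy as np

def kth_neighbor_distance(p, N=1000, k=10, seed=0):
    """Distance from the center of [0,1]^p to its k-th nearest of N uniform points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(N, p))
    center = np.full(p, 0.5)
    # Sorted Euclidean distances from the query point to all sample points.
    dists = np.sort(np.linalg.norm(X - center, axis=1))
    # Radius of the smallest ball around the center holding k points.
    return dists[k - 1]

radii = {p: kth_neighbor_distance(p) for p in (1, 2, 5, 10, 20)}
for p, r in radii.items():
    print(f"p={p:2d}  radius of the 10-nearest neighborhood: {r:.3f}")
```

With $N$ and $k$ held fixed, the radius needed to capture the same $k$ points grows with $p$, so the "local" average is taken over an increasingly non-local region.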

### Model-based approach for linear regression

The model-based approach specifies a model for the regression function. Assume that the regression function $f(x)$ is approximately linear in its arguments:
\begin{equation}
f(x)\approx x^T\beta
\end{equation}

Plugging this linear model for $f(x)$ into EPE and differentiating with respect to $\beta$, we can solve for $\beta$ theoretically:
\begin{equation}
\beta = \left[\text{E}\left(XX^T\right)\right]^{-1}\text{E}(XY)
\end{equation}
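
In practice the expectations in this formula are unknown; replacing them by averages over the training data gives the least squares estimate. The following sketch checks this on simulated data. The coefficients in `beta_true`, the Gaussian inputs, and the noise level are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5000, 3
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical coefficients

X = rng.normal(size=(N, p))              # each row is one x_i^T
y = X @ beta_true + rng.normal(0.0, 0.5, size=N)

# Sample analogues of E(X X^T) and E(X Y).
Exx_hat = X.T @ X / N
Exy_hat = X.T @ y / N

# Solve the linear system rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(Exx_hat, Exy_hat)

# The 1/N factors cancel, so this should agree with an ordinary least squares fit.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Note that nothing here conditions on $X$: the assumed linear structure lets the estimate pool information over all values of the input.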

### Bayes classifier: for a categorical output with 0-1 loss function

For a categorical output variable $G$ taking values in $\mathcal{G}$, the set of possible classes, we need a different loss function for penalizing prediction errors.

Our loss function can be represented by a $K\times K$ matrix $\mathbf{L}$, where $K=\text{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, where $L(k,l)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_l$.

Most often we use the *zero-one* loss function, where all misclassifications are charged a single unit, i.e.,

\begin{equation}
L(k,l) = \begin{cases}
0\text{ if }k = l,\\
1\text{ otherwise}.
\end{cases}
\end{equation}

The expected prediction error is

\begin{equation}
\text{EPE} = \text{E}\left[L\left(G, \hat{G}(X)\right)\right],
\end{equation}

where the expectation is taken w.r.t. the joint distribution $\text{Pr}(G, X)$. Again we condition, and can write EPE as

\begin{equation}
\text{EPE} = \text{E}_X\sum^K_{k=1}L\left(\mathcal{G}_k, \hat{G}(X)\right)\text{Pr}\left(\mathcal{G}_k|X\right)
\end{equation}

and again it suffices to minimize EPE pointwise:

\begin{equation}
\hat{G}(x) = \arg\min_{g\in\mathcal{G}}\sum^K_{k=1} L\left(\mathcal{G}_k,g\right)\text{Pr}\left(\mathcal{G}_k|X=x\right)
\end{equation}

With the 0-1 loss function this simplifies to

\begin{align}
\hat{G}(x) &= \arg\min_{g\in\mathcal{G}} \left[1 - \text{Pr}(g|X=x)\right]\\
&= \mathcal{G}_k \text{ if Pr}\left(\mathcal{G}_k|X=x\right) = \max_{g\in\mathcal{G}}\text{Pr}(g|X=x)
\end{align}

This reasonable solution is known as the *Bayes classifier*, and says that we classify to the most probable class, using the conditional (discrete) distribution $\text{Pr}(G|X)$. The error rate of the Bayes classifier is called the *Bayes rate*.
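
The pointwise minimization can be sketched for a single point $x$ whose class posterior is known. In the example below the posterior vector, the asymmetric loss matrix, and the function name `bayes_classify` are hypothetical, chosen only to show how the rule behaves; classes are indexed from 0.

```python
import numpy as np

def bayes_classify(posterior, loss):
    """Return the class index g minimizing sum_k loss[k, g] * posterior[k]."""
    # expected_loss[g] = sum_k L(G_k, g) * Pr(G_k | X = x)
    expected_loss = posterior @ loss
    return int(np.argmin(expected_loss))

K = 3
zero_one = np.ones((K, K)) - np.eye(K)    # 0 on the diagonal, 1 elsewhere

# Assumed posterior Pr(G_k | X = x) at one fixed point x.
posterior = np.array([0.2, 0.5, 0.3])

# Under 0-1 loss the rule reduces to picking the most probable class.
g_zero_one = bayes_classify(posterior, zero_one)

# A hypothetical asymmetric loss: missing true class 0 costs 10 units.
asymmetric = zero_one.copy()
asymmetric[0, 1] = asymmetric[0, 2] = 10.0
g_asymmetric = bayes_classify(posterior, asymmetric)

print(g_zero_one, g_asymmetric)
```

Under 0-1 loss the expected losses are $1-\text{Pr}(g|X=x)$, so the choice coincides with the largest posterior. With the asymmetric matrix the rule switches to class 0 even though it is the least probable, because misclassifying it is expensive.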

### KNN and Bayes classifier

Again we see that the $k$-nearest neighbor classifier directly approximates this solution -- a majority vote in a nearest neighborhood amounts to exactly this, except that
* conditional probability at a point is relaxed to conditional probability within a neighborhood of a point,
* and probabilities are estimated by training-sample proportions.
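
A majority-vote classifier along these lines can be sketched as follows. This is an illustrative sketch only: the name `knn_classify`, the Euclidean metric, integer class labels starting at 0, and the two-Gaussian synthetic data are assumptions made for the example.

```python
import numpy as np

def knn_classify(x0, X_train, g_train, k, n_classes):
    """Majority vote among the k training points nearest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    # Training-sample proportions of each class within the neighborhood;
    # these stand in for Pr(G_k | X = x0).
    proportions = np.bincount(g_train[nearest], minlength=n_classes) / k
    # argmax breaks ties in favor of the lowest class index.
    return int(np.argmax(proportions)), proportions

# Tiny hand-checkable case: the 3 points nearest to (0.2, 0.2) are all class 0.
X_small = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
g_small = np.array([0, 0, 0, 1, 1])
small_label, small_props = knn_classify(np.array([0.2, 0.2]), X_small, g_small, 3, 2)

# Synthetic data (hypothetical): two unit-variance Gaussian classes in the plane.
rng = np.random.default_rng(0)
n = 400
X_train = np.vstack([
    rng.normal(loc=[-2.0, 0.0], size=(n, 2)),   # class 0
    rng.normal(loc=[2.0, 0.0], size=(n, 2)),    # class 1
])
g_train = np.repeat([0, 1], n)

label_left, _ = knn_classify(np.array([-2.0, 0.0]), X_train, g_train, 15, 2)
label_right, _ = knn_classify(np.array([2.0, 0.0]), X_train, g_train, 15, 2)
print(small_label, label_left, label_right)
```

The vector `proportions` is the neighborhood estimate of the class posterior, and taking its `argmax` mirrors the Bayes classifier under 0-1 loss.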