# $\S$ 2.6. Statistical Models, Supervised Learning and Function Approximation

### Review

Goal: to find a useful approximation $\hat{f}(x)$ to $f(x)$ that underlies the predictive relationship between the inputs and outputs.

1. For a quantitative response, the squared error loss leads us to the regression function
\begin{equation}
f(x) = \text{E}(Y|X=x).
\end{equation}
2. The KNNs can be viewed as direct estimates of this conditional expectation.
3. But KNNs can fail in at least two ways:
  * If the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors.
  * If special structure is known to exist, KNN does not exploit it, whereas a structured model can use it to reduce both the bias and the variance of the estimates.
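
The first failure mode can be illustrated numerically. The following is a minimal sketch, assuming points drawn uniformly from the cube $[-1,1]^p$ and the origin as the target point; the function name `nearest_neighbor_distance`, the sample size and the seed are illustrative choices, not part of the text.

```python
import numpy as np

def nearest_neighbor_distance(p, n=1000, seed=0):
    """Distance from the origin to its nearest neighbor among n uniform points in [-1, 1]^p."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    # Euclidean distance of every sample point to the target point (the origin)
    return np.min(np.linalg.norm(X, axis=1))

d_low = nearest_neighbor_distance(p=1)    # one input dimension
d_high = nearest_neighbor_distance(p=10)  # ten input dimensions
print(d_low, d_high)
```

With the same number of points, the nearest neighbor in ten dimensions sits much farther from the target than in one dimension, so it is no longer "local".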
  
> We anticipate using other classes of models for $f(x)$, in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.

## $\S$ 2.6.1. A Statistical Model for the Joint Distribution $\text{Pr}(X,Y)$

### The simplest case
Suppose in fact that our data arose from a statistical model

\begin{equation}
Y = f(X) + \epsilon,
\end{equation}
where the random error $\epsilon$ has $\text{E}(\epsilon)=0$ and is independent of $X$. For this model (with squared loss),
\begin{equation}
f(x)=\text{E}(Y|X=x),
\end{equation}
and in fact the conditional distribution $\text{Pr}(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.
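
As a rough check of the additive error model, one can simulate it and compare the average of $Y$ at a fixed input with $f$ at that input. This is only a sketch: the choice $f=\sin$, the Gaussian error, the point `x0` and the helper `simulate_additive` are assumptions made for illustration.

```python
import numpy as np

def simulate_additive(f, x, sigma, rng):
    """Draw Y = f(x) + eps, with eps ~ N(0, sigma^2) drawn independently of x."""
    x = np.asarray(x, dtype=float)
    return f(x) + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
f = np.sin          # hypothetical regression function
x0 = 1.0            # fixed input value
y = simulate_additive(f, np.full(200_000, x0), sigma=0.5, rng=rng)

cond_mean = y.mean()  # Monte Carlo estimate of E(Y | X = x0)
print(cond_mean, f(x0))
```

The two printed numbers should agree up to Monte Carlo error, which is what $f(x)=\text{E}(Y|X=x)$ asserts.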

### Without independence assumption
The assumption that the errors are i.i.d. is not strictly necessary; e.g., we can have $\text{Var}(Y|X=x)=\sigma(x)$, and now both the mean and variance depend on $X$. In general $\text{Pr}(Y|X)$ can depend on $X$ in complicated ways, but the additive error model precludes these.

### For qualitative outputs
Additive error models are typically not used for qualitative outputs $G$; in this case the target function $p(X)$ is the conditional density $\text{Pr}(G|X)$, and this is modeled directly.

For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being $p(X)$, and the other $1 - p(X)$. Thus if $Y$ is the 0–1 coded version of $G$, then $\text{E}(Y|X = x) = p(x)$, but the variance depends on $x$ as well: $\text{Var}(Y|X = x) = p(x)[1 - p(x)]$.
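
The mean and variance formulas for the 0–1 coded response can be checked by simulation. This is a sketch under assumptions: the logistic form of `p` and the point `x0` are hypothetical and only serve to produce a probability that depends on $x$.

```python
import numpy as np

def p(x):
    """Hypothetical class-1 probability as a function of the input."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
x0 = 0.5
# Independent binary trials at the fixed input x0
y = rng.binomial(1, p(x0), size=200_000)

mean_hat = y.mean()  # estimates E(Y | X = x0) = p(x0)
var_hat = y.var()    # estimates Var(Y | X = x0) = p(x0) * (1 - p(x0))
print(mean_hat, p(x0))
print(var_hat, p(x0) * (1 - p(x0)))
```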

## $\S$ 2.6.2. Supervised Learning

Suppose for simplicity the additive model

\begin{equation}
Y = f(X) + \epsilon,
\end{equation}
is a reasonable assumption. Supervised learning attempts to learn $f$ by example through a *teacher*. One observes the system under study, both the inputs and outputs, and assembles a training set of observations $\mathcal{T} = (x_i, y_i), i = 1, \cdots, N$. The observed input values to the system $x_i$ are also fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs $\hat{f}(x_i)$ in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship $\hat{f}$ in response to differences $y_i - \hat{f}(x_i)$ between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.
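
One concrete way to picture "modifying $\hat{f}$ in response to the differences $y_i - \hat{f}(x_i)$" is gradient descent on the mean squared error. The sketch below is only one possible learning algorithm, not the one the text has in mind; the linear form of $\hat{f}$, the teacher $f(x)=2x+1$, and the names `w`, `b` and `lr` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
# The "teacher": outputs of the real system, here f(x) = 2x + 1 plus noise
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=N)

w, b = 0.0, 0.0   # parameters of the learner f_hat(x) = w * x + b
lr = 0.1          # step size (hypothetical choice)
for _ in range(2000):
    resid = y - (w * x + b)        # differences y_i - f_hat(x_i)
    w += lr * np.mean(resid * x)   # move w to shrink the squared differences
    b += lr * np.mean(resid)       # move b to shrink the squared differences

print(w, b)
```

After training, `w` and `b` should be close to the teacher's slope and intercept.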

This learning paradigm has been the motivation for research into the supervised learning problem in the fields of machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain). The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation.

## $\S$ 2.6.3. Function Approximation

Here the data pairs $\{x_i, y_i\}$ are viewed as points in a $(p+1)$-dimensional Euclidean space. The function $f(x)$ has domain $\mathbb{R}^p$, and is related to the data via a model such as $y_i=f(x_i)+\epsilon_i$.

Goal: to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb{R}^p$, given the representations in $\mathcal{T}$.

### Basis expansions

Many of the approximations we will encounter have an associated set of parameters $\theta$ that can be modified to suit the data at hand. For example, the linear model $f(x) = x^T\beta$ has $\theta=\beta$. Another class of useful approximators can be expressed as *linear basis expansions*
\begin{equation}
f_\theta(x) = \sum_{k=1}^K h_k(x)\theta_k,
\end{equation}
where the $h_k$ are a suitable set of functions or transformations of the input vector $x$. Traditional examples are polynomial and trigonometric expansions, where for example $h_k$ might be $x_1^2$, $x_1x_2^2$, $\text{cos}(x_1)$ and so on. We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,
\begin{equation}
h_k(x) = \frac{1}{1+\text{exp}(-x^T\beta_k)}.
\end{equation}
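
The basis functions mentioned above can be written as columns of a matrix $H$ with entries $h_k(x_i)$, so that $f_\theta$ evaluated at all inputs is $H\theta$. This is a minimal sketch; the function names and the small input matrix `X` are illustrative assumptions.

```python
import numpy as np

def polynomial_trig_basis(X):
    """Columns h_1 = x1^2, h_2 = x1 * x2^2, h_3 = cos(x1) for inputs X of shape (N, 2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x1 * x2**2, np.cos(x1)])

def sigmoid_basis(X, B):
    """Sigmoid units h_k(x) = 1 / (1 + exp(-x^T beta_k)); column k of B is beta_k."""
    return 1.0 / (1.0 + np.exp(-X @ B))

def f_theta(H, theta):
    """Linear basis expansion: sum_k h_k(x) * theta_k for every row of H."""
    return H @ theta

X = np.array([[0.0, 1.0],
              [2.0, 3.0]])
H = polynomial_trig_basis(X)
S = sigmoid_basis(X, np.zeros((2, 4)))  # all beta_k = 0, so every unit outputs 0.5
print(f_theta(H, np.array([1.0, 0.5, -1.0])))
```

Note that $f_\theta$ stays linear in $\theta$ even though each $h_k$ is nonlinear in $x$; the sigmoid units become nonlinear in the parameters only when the $\beta_k$ are also estimated.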

### Least squares again
We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did for the linear model, by minimizing the residual sum-of-squares
\begin{equation}
\text{RSS}(\theta) = \sum_{i=1}^N\left(y_i-f_\theta(x_i)\right)^2
\end{equation}
as a function of $\theta$.
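
For a linear basis expansion, minimizing $\text{RSS}(\theta)$ is an ordinary least squares problem in the basis matrix. The sketch below assumes a quadratic basis $(1, x, x^2)$ and simulated data; `theta_true` and the noise level are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
x = rng.uniform(-2.0, 2.0, size=N)

# Basis matrix with columns h_1(x) = 1, h_2(x) = x, h_3(x) = x^2
H = np.column_stack([np.ones(N), x, x**2])
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical "true" parameters
y = H @ theta_true + rng.normal(0.0, 0.1, size=N)

def rss(theta):
    """Residual sum-of-squares for the basis expansion f_theta = H @ theta."""
    return np.sum((y - H @ theta) ** 2)

# Least squares solution: the theta that minimizes RSS(theta)
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta_hat, rss(theta_hat))
```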

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is *maximum likelihood estimation*.

### Maximum-likelihood

> The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest.

Suppose we have a random sample $y_i, i=1,\cdots,N$ from a density $\text{Pr}_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is
\begin{equation}
L(\theta) = \sum_{i=1}^N\log\text{Pr}_\theta(y_i),
\end{equation}
and the maximum-likelihood estimate of $\theta$ is
\begin{equation}
\hat{\theta}_{\text{ML}} = \arg\max_\theta L(\theta).
\end{equation}
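
As a small worked example of maximizing $L(\theta)$, consider an exponential density $\text{Pr}_\theta(y)=\theta e^{-\theta y}$, for which the maximizer is known in closed form as $1/\bar{y}$. The density, the true rate and the grid search are assumptions made for illustration; the text does not prescribe them.

```python
import numpy as np

def log_likelihood(theta, y):
    """L(theta) = sum_i log(theta * exp(-theta * y_i)) = N log(theta) - theta * sum(y)."""
    return len(y) * np.log(theta) - theta * np.sum(y)

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0 / 2.0, size=5000)  # sample with true rate theta = 2

# Crude maximization: evaluate L on a grid with step 0.001 and take the largest value
grid = np.linspace(0.5, 4.0, 3501)
theta_ml = grid[np.argmax(log_likelihood(grid, y))]

theta_closed = 1.0 / y.mean()  # closed-form maximizer for the exponential density
print(theta_ml, theta_closed)
```

The grid maximizer and the closed-form value should agree to within the grid spacing.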

### ML = LS with Gaussian errors

Assuming the error term $\epsilon\sim N(0,\sigma^2)$, least squares is equivalent to maximum likelihood using the conditional likelihood
\begin{equation}
\text{Pr}(Y|X,\theta) = N(f_\theta(X), \sigma^2).
\end{equation}
The equivalence holds because the log-likelihood of the data is
\begin{equation}
L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma -\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i-f_\theta(x_i)\right)^2,
\end{equation}
and the only term involving $\theta$ is the last, which is $\text{RSS}(\theta)$ up to a scalar negative multiplier.
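
The equivalence can also be checked numerically: on any grid of candidate values, the $\theta$ with the largest Gaussian log-likelihood is the one with the smallest RSS. The sketch assumes a one-parameter model $f_\theta(x)=\theta x$ with known $\sigma$; the data and grid are hypothetical.

```python
import numpy as np

def rss(theta, x, y):
    """Residual sum-of-squares for the one-parameter model f_theta(x) = theta * x."""
    return np.sum((y - theta * x) ** 2)

def gaussian_loglik(theta, x, y, sigma):
    """Gaussian log-likelihood: -N/2 log(2 pi) - N log(sigma) - RSS(theta) / (2 sigma^2)."""
    n = len(y)
    return (-0.5 * n * np.log(2.0 * np.pi) - n * np.log(sigma)
            - rss(theta, x, y) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
sigma = 0.3
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(0.0, sigma, size=100)

grid = np.linspace(0.0, 3.0, 301)
rss_vals = np.array([rss(t, x, y) for t in grid])
ll_vals = np.array([gaussian_loglik(t, x, y, sigma) for t in grid])

theta_ls = grid[np.argmin(rss_vals)]  # least squares choice on the grid
theta_ml = grid[np.argmax(ll_vals)]   # maximum likelihood choice on the grid
print(theta_ls, theta_ml)
```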

### Multinomial likelihood for a qualitative output $G$

Suppose we have a model

\begin{equation}
\text{Pr}(G=\mathcal{G}_k|X=x) = p_{k,\theta}(x), \quad k=1,\cdots,K
\end{equation}
for the conditional density of each class given $X$, indexed by the parameter vector $\theta$. Then the log-likelihood (also referred to as the cross-entropy) is
\begin{equation}
L(\theta) = \sum_{i=1}^N\log p_{g_i,\theta}(x_i),
\end{equation}
and when maximized it delivers values of $\theta$ that best conform with the data in the likelihood sense.
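
The multinomial log-likelihood can be computed by picking, for each observation, the modeled probability of its observed class. The sketch below assumes a softmax parametrization $p_{k,\theta}(x)\propto\exp(x^T\theta_k)$, which is one common choice and not something the text specifies; classes are coded as integers $0,\ldots,K-1$.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_loglik(theta, X, g):
    """L(theta) = sum_i log p_{g_i, theta}(x_i); theta has shape (p, K), g holds class indices."""
    P = softmax(X @ theta)                        # P[i, k] = p_{k, theta}(x_i)
    return np.sum(np.log(P[np.arange(len(g)), g]))

X = np.ones((4, 2))
g = np.array([0, 1, 2, 1])
# With theta = 0 every class has probability 1/K, so L = -N log K
ll_uniform = multinomial_loglik(np.zeros((2, 3)), X, g)
print(ll_uniform, -4 * np.log(3))
```

Many software libraries define the cross-entropy loss as the negative of this quantity (often averaged over observations), so maximizing $L(\theta)$ corresponds to minimizing that loss.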