# $\S$ 2.8. Classes of Restricted Estimators

## $\S$ 2.8.1. Roughness Penalty and Bayesian Methods

> Penalty function, or *regularization* methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.

The class of functions is controlled by explicitly penalizing $\text{RSS}(f)$ with a roughness penalty $J$

\begin{equation}
\text{PRSS}(f;\lambda) = \text{RSS}(f) + \lambda J(f),
\end{equation}
where the user-selected functional $J$ will be large for functions $f$ that vary too rapidly over small regions of input space.

### Cubic smoothing spline

For example, the popular *cubic smoothing spline* for one-dimensional inputs is the solution to the penalized least-squares criterion

\begin{equation}
\text{PRSS}(f;\lambda) = \sum_{i=1}^N \left(y_i-f(x_i)\right)^2 + \lambda\int\left[f''(x)\right]^2dx.
\end{equation}
This roughness penalty controls large values of the second derivative of $f$, and the amount of penalty is dictated by $\lambda \ge 0$. For $\lambda = 0$ no penalty is imposed, and any interpolating function will do, while for $\lambda = \infty$ only functions linear in $x$ are permitted.
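
The two limits of $\lambda$ can be checked numerically. The sketch below is a discrete analogue of the smoothing spline, not the cubic spline itself: it assumes evenly spaced inputs and replaces $\int [f''(x)]^2dx$ with a sum of squared second differences (a Whittaker-type smoother). The function name `discrete_smoother` and the simulated data are illustrative assumptions.

```python
import numpy as np

def discrete_smoother(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i (f_{i-1} - 2 f_i + f_{i+1})^2.

    Assumes evenly spaced inputs, so second differences stand in for the
    second derivative (a discrete analogue, not the cubic smoothing spline).
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator D of shape (n - 2, n).
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives (I + lam * D^T D) f = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

f_interp = discrete_smoother(y, lam=0.0)  # no penalty: reproduces the data
f_linear = discrete_smoother(y, lam=1e9)  # heavy penalty: close to a straight line
```

With `lam=0.0` the fit passes through every observation, and with a very large `lam` it is nearly indistinguishable from the ordinary least-squares line, mirroring the two extremes described above.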

### Connection with Bayesian framework

Up to sign and additive constants, the penalty $J$ corresponds to a log-prior and $\text{PRSS}(f;\lambda)$ to the log-posterior distribution, so minimizing $\text{PRSS}(f;\lambda)$ amounts to finding the posterior mode.
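
A short worked derivation makes the correspondence concrete. It assumes, beyond what is stated above, Gaussian errors with variance $\sigma^2$ and a prior density on $f$ proportional to $\exp\left(-\tau J(f)\right)$; the symbols $\sigma^2$ and $\tau$ are introduced here only for illustration. Under these assumptions the negative log-posterior is

\begin{equation}
-\log \Pr(f \mid \mathbf{y}) = \frac{1}{2\sigma^2}\sum_{i=1}^N \left(y_i-f(x_i)\right)^2 + \tau J(f) + \text{const} = \frac{1}{2\sigma^2}\text{PRSS}(f;\lambda) + \text{const}, \quad \lambda = 2\sigma^2\tau,
\end{equation}
so the function minimizing $\text{PRSS}(f;\lambda)$ is the one maximizing the posterior density.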

## $\S$ 2.8.2. Kernel Methods and Local Regression

By specifying the nature of the local neighborhood, these methods explicitly provide estimates of regression functions or conditional expectation.

The local neighborhood is specified by a *kernel function* $K_\lambda(x_0, x)$ which assigns weights to points $x$ in a region around $x_0$. For example, the Gaussian kernel has a weight function based on the Gaussian density function

\begin{equation}
K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\left(-\frac{\|x-x_0\|^2}{2\lambda}\right).
\end{equation}
The parameter $\lambda$ corresponds to the variance of the Gaussian density, and controls the width of the neighborhood.

The simplest form of kernel estimate is the Nadaraya-Watson weighted average

\begin{equation}
\hat{f}(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}.
\end{equation}
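
As a minimal sketch of this estimate, assuming inputs stored as an array of shape `(N, p)` and the Gaussian kernel defined above (the function names and the toy data are illustrative choices):

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """K_lam(x0, x) = exp(-||x - x0||^2 / (2 lam)) / lam for each row x of X, shape (N, p)."""
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

def nadaraya_watson(x0, X, y, lam):
    """Kernel-weighted average of the responses y at the query point x0."""
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)

# Toy data: a noise-free sine curve on an even grid.
X = np.linspace(0.0, 1.0, 21)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
x0 = np.array([0.25])

fit_narrow = nadaraya_watson(x0, X, y, lam=1e-3)  # small neighborhood: tracks the local values
fit_wide = nadaraya_watson(x0, X, y, lam=1e3)     # huge neighborhood: close to the global mean of y
```

The factor $1/\lambda$ in the kernel cancels between numerator and denominator, so only the shape of the weights matters for this estimate.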

In general, we can define a local regression estimate of $f(x_0)$ as $f_{\hat{\theta}}(x_0)$, where $\hat{\theta}$ minimizes

\begin{equation}
\text{RSS}(f_\theta,x_0) = \sum_{i=1}^N K_\lambda(x_0, x_i)\left(y_i -f_\theta(x_i)\right)^2,
\end{equation}
where $f_\theta$ is some parameterized function, such as a low-order polynomial. Some examples are:
* $f_\theta(x) = \theta_0$, the constant function; this results in the Nadaraya-Watson estimate.
* $f_\theta(x) = \theta_0+\theta_1 x$ gives the popular local linear regression model.
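
Both examples can be sketched with one weighted least-squares routine. The code below assumes one-dimensional inputs and the Gaussian kernel; the function name `local_polynomial_fit` and the test data are hypothetical, chosen only for illustration.

```python
import numpy as np

def local_polynomial_fit(x0, x, y, lam, degree=1):
    """Fit a polynomial of the given degree by kernel-weighted least squares and evaluate it at x0."""
    # Gaussian kernel weights K_lam(x0, x_i) for one-dimensional inputs.
    w = np.exp(-(x - x0) ** 2 / (2.0 * lam)) / lam
    # Basis matrix with columns 1, x, ..., x^degree.
    B = np.vander(x, degree + 1, increasing=True)
    # Weighted least squares: scale rows by sqrt(w), then solve ordinary least squares.
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    # np.polyval expects the highest-degree coefficient first.
    return np.polyval(theta[::-1], x0)

x = np.linspace(0.0, 1.0, 30)
y_line = 2.0 + 3.0 * x

# Evaluate at the left boundary, where the two fits differ most clearly.
fit_constant = local_polynomial_fit(0.0, x, y_line, lam=0.01, degree=0)
fit_linear = local_polynomial_fit(0.0, x, y_line, lam=0.01, degree=1)
```

On exactly linear data the local linear fit recovers the true value at the boundary, while the local constant fit is pulled toward the interior because all of its neighbors lie on one side.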

### Association between kernel methods and nearest neighbors

Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the metric for $k$-nearest neighbors is

\begin{equation}
K_k(x,x_0) = I\left(\|x-x_0\| \le \|x_{(k)}-x_0\|\right),
\end{equation}
where $x_{(k)}$ is the training observation ranked $k$th in distance from $x_0$, and $I(S)$ is the indicator of the set $S$.
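
A minimal sketch of this indicator weight, plugged into the same weighted-average form as the Nadaraya-Watson estimate, recovers the usual $k$-nearest-neighbor average. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_kernel(x0, X, k):
    """Weight 1 for training points within the k-th smallest distance from x0, else 0 (ties included)."""
    dist = np.linalg.norm(X - x0, axis=1)
    radius = np.sort(dist)[k - 1]  # distance ||x_(k) - x0|| to the k-th nearest point
    return (dist <= radius).astype(float)

def knn_average(x0, X, y, k):
    """Weighted average with indicator weights, i.e. the mean response of the neighbors."""
    w = knn_kernel(x0, X, k)
    return np.sum(w * y) / np.sum(w)

X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x0 = np.array([0.9])

weights = knn_kernel(x0, X, k=3)
estimate = knn_average(x0, X, y, k=3)
```

Unlike the Gaussian kernel, whose width $\lambda$ is fixed in advance, the radius here adapts to how densely the training points are packed around $x_0$.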

## $\S$ 2.8.3. Basis Functions and Dictionary Methods

### Linear expansion

The model for $f$ is a linear expansion of basis functions

\begin{equation}
f_\theta(x) = \sum_{m=1}^M\theta_m h_m(x),
\end{equation}
where each of the $h_m$ is a function of the input $x$, and the term linear here refers to the action of the parameter $\theta$.

This class covers a wide variety of methods. In some cases the sequence of basis functions is prescribed, such as a basis for polynomials in $x$ of total degree $M$.
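
As a hedged illustration of why the expansion is still a linear problem, the sketch below fits monomial basis functions by ordinary least squares. The helper name `fit_linear_expansion`, the choice of monomials (including the constant $1$ as one of the $h_m$), and the test data are assumptions made for the example.

```python
import numpy as np

def fit_linear_expansion(basis_functions, x, y):
    """Least-squares estimate of theta in f(x) = sum_m theta_m * h_m(x)."""
    # Each column of H is one basis function evaluated at all inputs.
    H = np.column_stack([h(x) for h in basis_functions])
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

# Monomial basis 1, x, x^2, x^3 for a one-dimensional input.
basis = [lambda x, m=m: x ** m for m in range(4)]

x = np.linspace(-1.0, 1.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x ** 3  # a cubic, so it lies in the span of the basis

theta = fit_linear_expansion(basis, x, y)
```

The fitted function is nonlinear in $x$, yet estimating $\theta$ only requires solving a linear least-squares problem in the transformed features $h_m(x)$.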

### Radial basis functions

*Radial basis functions* are symmetric $p$-dimensional kernels located at particular centroids,

\begin{equation}
f_\theta(x) = \sum_{m=1}^M K_{\lambda_m}(\mu_m,x)\theta_m,
\end{equation}

for example, the Gaussian kernel $K_\lambda(\mu,x)=e^{-\|x-\mu\|^2/(2\lambda)}$ is popular.

Radial basis functions have centroids $\mu_m$ and scales $\lambda_m$ that have to be determined. In general we would like the data to dictate them as well. Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem. In practice, shortcuts such as greedy algorithms or two-stage processes are used.
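
One possible two-stage process can be sketched as follows. Stage 1 fixes the centroids and a single shared scale by a simple rule, and stage 2 solves the remaining linear least-squares problem for $\theta$. The even grid of centroids, the value of the scale, and the function names are all assumptions made for illustration, not a prescribed method.

```python
import numpy as np

def rbf_design(X, centers, lam):
    """Matrix with entries exp(-||x_i - mu_m||^2 / (2 lam)), shape (N, M)."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * lam))

def fit_rbf_two_stage(X, y, centers, lam):
    """Stage 2: with centroids and scale held fixed, theta solves a linear least-squares problem."""
    theta, *_ = np.linalg.lstsq(rbf_design(X, centers, lam), y, rcond=None)
    return theta

X = np.linspace(0.0, 1.0, 100)[:, None]
y = np.sin(2 * np.pi * X[:, 0])

# Stage 1 (assumption): place M = 10 centroids on an even grid and share one scale.
centers = np.linspace(0.0, 1.0, 10)[:, None]
lam = 0.01

theta = fit_rbf_two_stage(X, y, centers, lam)
y_hat = rbf_design(X, centers, lam) @ theta
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
```

Once the centroids and scale are frozen, the model is again a linear expansion in $\theta$, which is what makes the second stage cheap; the price is that the stage 1 choices are not optimized jointly with $\theta$.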