# $\S$ 2.8. Classes of Restricted Estimators

Nonparametric regression techniques, or learning methods, fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall into several classes.

Each of the classes has associated with it one or more parameters, sometimes appropriately called _smoothing_ parameters, that control the effective size of the local neighborhood.

Here we describe three broad classes.

## $\S$ 2.8.1. Roughness Penalty and Bayesian Methods

Here the class of functions is controlled by explicitly penalizing $\text{RSS}(f)$ with a roughness penalty $J$

\begin{equation}
\text{PRSS}(f;\lambda) = \text{RSS}(f) + \lambda J(f),
\end{equation}

where the user-selected functional $J(f)$ will be large for functions $f$ that vary too rapidly over small regions of input space.

### Cubic smoothing spline

For example, the popular *cubic smoothing spline* for one-dimensional inputs is the solution to the penalized least-squares criterion

\begin{equation}
\text{PRSS}(f;\lambda) = \sum_{i=1}^N \left(y_i-f(x_i)\right)^2 + \lambda\int\left[f''(x)\right]^2 dx.
\end{equation}

This roughness penalty controls large values of the second derivative of $f$, and the amount of penalty is dictated by $\lambda \ge 0$. For $\lambda = 0$ no penalty is imposed, and any interpolating function will do, while for $\lambda = \infty$ only functions linear in $x$ are permitted.
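
The effect of $\lambda$ can be illustrated with a discrete analogue of this criterion. The sketch below is not the cubic smoothing spline itself: it assumes equally spaced inputs and replaces $\int [f''(x)]^2 dx$ by a sum of squared second differences of the fitted values. The function name `second_difference_smoother` and the simulated data are illustrative assumptions.

```python
import numpy as np

def second_difference_smoother(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i (f_{i-1} - 2 f_i + f_{i+1})^2."""
    n = len(y)
    # D is the (n-2) x n second-difference matrix, a discrete stand-in for f''.
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives the linear system (I + lam * D^T D) f = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Simulated data (assumption): a noisy sine curve on an equally spaced grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

f_interp = second_difference_smoother(y, 0.0)   # no penalty: the fit reproduces y
f_linear = second_difference_smoother(y, 1e8)   # heavy penalty: nearly a straight line
```

With `lam = 0` the fit interpolates the data, and with a very large `lam` it approaches the least-squares line, mirroring the two limiting cases described above.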

### Various penalties

Penalty functionals $J$ can be constructed for functions in any dimension, and special versions can be created to impose special structure. For example, additive penalties

\begin{equation}
J(f) = \sum_{j=1}^p J(f_j)
\end{equation}

are used in conjunction with additive functions $f(X) = \sum_{j=1}^p f_j(X_j)$ to create additive models with smooth coordinate functions.

Similarly, _projection pursuit regression_ models have

\begin{equation}
f(X) = \sum_{m=1}^M g_m(\alpha_m^T X)
\end{equation}

for adaptively chosen directions $\alpha_m$, and the functions $g_m$ can each have an associated roughness penalty.

### Connection with Bayesian framework

Penalty function methods, or _regularization_ methods, express our prior belief that the functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.
* The penalty $J$ corresponds to a log-prior, and
* $\text{PRSS}(f;\lambda)$ corresponds to the log-posterior distribution, and
* minimizing $\text{PRSS}$ amounts to finding the posterior mode.
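
As a sketch of this correspondence, assume (these assumptions are not stated in the text above) additive Gaussian noise with known variance $\sigma^2$ and a formal prior on $f$ with density proportional to $\exp\left(-\lambda J(f)/(2\sigma^2)\right)$. Then, with a constant that does not depend on $f$,

\begin{equation}
-2\sigma^2 \log \text{Pr}(f \mid \text{data}) = \sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \lambda J(f) + \text{const} = \text{PRSS}(f;\lambda) + \text{const}.
\end{equation}

Under these assumptions $\text{PRSS}$ is a negatively scaled log-posterior, so the function that minimizes $\text{PRSS}$ is the one that maximizes the posterior.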

We discuss roughness-penalty approaches in Chapter 5 and the Bayesian paradigm in Chapter 8.

## $\S$ 2.8.2. Kernel Methods and Local Regression

These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally.

The local neighborhood is specified by a _kernel function_ $K_\lambda(x_0, x)$, which assigns weights to points $x$ in a region around $x_0$ (see Figure 6.1). For example, the Gaussian kernel has a weight function based on the Gaussian density function

\begin{equation}
K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\left(-\frac{\|x-x_0\|^2}{2\lambda}\right),
\end{equation}

and assigns weights to points that die exponentially with their squared Euclidean distance from $x_0$. The parameter $\lambda$ corresponds to the variance of the Gaussian density, and controls the width of the neighborhood.

The simplest form of kernel estimate is the Nadaraya-Watson weighted average

\begin{equation}
\hat{f}(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}.
\end{equation}
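
A minimal NumPy sketch of this weighted average with the Gaussian kernel defined above. The function names, the simulated data, and the two values of `lam` are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """K_lambda(x0, x) = exp(-||x - x0||^2 / (2 * lam)) / lam for each row x of X."""
    sq_dist = np.sum((np.atleast_2d(X) - x0) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

def nadaraya_watson(x0, X, y, lam):
    """Kernel-weighted average of the responses around the query point x0."""
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)

# Simulated one-dimensional data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(4.0 * X[:, 0]) + 0.2 * rng.standard_normal(200)

x0 = np.array([0.5])
fit_narrow = nadaraya_watson(x0, X, y, lam=0.01)   # local average near x0
fit_wide = nadaraya_watson(x0, X, y, lam=100.0)    # nearly uniform weights
```

As `lam` grows the weights become almost equal and the estimate tends to the overall mean of `y`. The $1/\lambda$ factor in the kernel cancels in the ratio, so it does not affect the estimate.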

### Formulation

In general, we can define a local regression estimate of $f(x_0)$ as $f_{\hat{\theta}}(x_0)$, where $\hat{\theta}$ minimizes

\begin{equation}
\text{RSS}(f_\theta,x_0) = \sum_{i=1}^N K_\lambda(x_0, x_i)\left(y_i -f_\theta(x_i)\right)^2,
\end{equation}

where $f_\theta$ is some parameterized function, such as a low-order polynomial. Some examples are:
* $f_\theta(x) = \theta_0$, the constant function; this results in the Nadaraya-Watson estimate.
* $f_\theta(x) = \theta_0+\theta_1 x$ gives the popular local linear regression model.
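
The local linear case can be sketched as a kernel-weighted least-squares fit at each query point. The code below assumes one-dimensional inputs and a Gaussian kernel with its constant factor dropped. The name `local_linear_fit` and the noise-free test data are illustrative.

```python
import numpy as np

def local_linear_fit(x0, x, y, lam):
    """Estimate f(x0) by minimizing sum_i K(x0, x_i) * (y_i - theta0 - theta1 * x_i)^2."""
    w = np.exp(-(x - x0) ** 2 / (2.0 * lam))     # Gaussian kernel weights
    B = np.column_stack([np.ones_like(x), x])    # design matrix with rows (1, x_i)
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w_i) turns weighted least squares into ordinary least squares.
    theta, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return theta[0] + theta[1] * x0

x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x                                # exactly linear data, no noise
boundary_fit = local_linear_fit(0.0, x, y, lam=0.05)
```

On exactly linear data the local linear fit recovers the line even at the boundary point $x_0 = 0$. A kernel-weighted average at the same point would be pulled above 2, since every training response is at least 2.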

### Association between kernel methods and kNN

Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the kernel for $k$-nearest neighbors is

\begin{equation}
K_k(x,x_0) = I\left(\|x-x_0\| \le \|x_{(k)}-x_0\|\right),
\end{equation}

where
* $x_{(k)}$ is the training observation ranked $k$th in distance from $x_0$, and
* $I(S)$ is the indicator of the set $S$.
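
A small sketch, assuming Euclidean distance and no ties among distances, showing that plugging this indicator kernel into a weighted average reproduces the usual $k$-nearest-neighbor mean. The function names and the random data are illustrative.

```python
import numpy as np

def knn_kernel_weights(x0, X, k):
    """Indicator weights I(||x_i - x0|| <= ||x_(k) - x0||) for each training point x_i."""
    dist = np.linalg.norm(X - x0, axis=1)
    radius = np.sort(dist)[k - 1]        # distance from x0 to its k-th nearest training point
    return (dist <= radius).astype(float)

def knn_average(x0, X, y, k):
    """Weighted average of y using the k-nearest-neighbor kernel."""
    w = knn_kernel_weights(x0, X, k)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = rng.standard_normal(100)
x0 = np.zeros(2)

w = knn_kernel_weights(x0, X, k=5)       # exactly five weights equal to one
fit = knn_average(x0, X, y, k=5)
```

The neighborhood radius here is set by the data around `x0` rather than by a fixed width, which is the sense in which the metric is data-dependent.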

These methods of course need to be modified in high dimensions, to avoid the curse of dimensionality.

Various adaptations are discussed in Chapter 6.

## $\S$ 2.8.3. Basis Functions and Dictionary Methods

This class of methods includes the familiar linear and polynomial expansions, but more importantly a wide variety of more flexible models.

### Linear expansion

The model for $f$ is a linear expansion of basis functions

\begin{equation}
f_\theta(x) = \sum_{m=1}^M\theta_m h_m(x),
\end{equation}

where each of the $h_m$ is a function of the input $x$, and the term _linear_ here refers to the action of the parameters $\theta$.

This class covers a wide variety of methods.

In some cases the sequence of basis functions is prescribed, such as a basis for polynomials in $x$ of total degree $M$.

#### Splines

For 1D $x$, polynomial splines of degree $K$ can be represented by an appropriate sequence of $M$ spline basis functions, determined in turn by $M-K-1$ _knots_. These produce functions that are piecewise polynomials of degree $K$ between the knots, and joined up with continuity of degree $K-1$ at the knots.

#### Linear splines

As an example, consider linear splines, or piecewise linear functions. One intuitively satisfying basis consists of the functions

\begin{align}
b_1(x) &= 1, \\
b_2(x) &= x, \\
b_{m+2}(x) &= (x - t_m)_+, \quad m=1,\ldots,M-2,
\end{align}

where
* $t_m$ is the $m$th knot, and
* $z_+$ denotes the positive part of $z$, that is, $\max(z, 0)$.
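
A minimal sketch of this basis in NumPy, followed by an ordinary least-squares fit. The knot locations, the target function $|x - 0.5|$, and the name `linear_spline_basis` are assumptions chosen for illustration.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns b_1 = 1, b_2 = x, and b_{m+2} = (x - t_m)_+ for each knot t_m."""
    x = np.asarray(x, dtype=float)
    hinge = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, hinge])

x = np.linspace(0.0, 1.0, 101)
knots = [0.25, 0.5, 0.75]
H = linear_spline_basis(x, knots)    # shape (101, 5): M = 5 basis functions, M - 2 = 3 knots

y = np.abs(x - 0.5)                  # piecewise linear with a single kink at the knot 0.5
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
```

Because $|x - 0.5| = 0.5 - x + 2(x - 0.5)_+$, the fitted coefficients are approximately $(0.5, -1, 0, 2, 0)$: only the hinge at the knot $0.5$ is needed.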

Tensor products of spline bases can be used for inputs with dimensions larger than one (see $\S$ 5.2, and the CART and MARS models in Chapter 9).

The parameter $M$ controls the degree of the polynomial or the number of knots in the case of splines.

### Radial basis functions

_Radial basis functions_ are symmetric $p$-dimensional kernels located at particular centroids,

\begin{equation}
f_\theta(x) = \sum_{m=1}^M K_{\lambda_m}(\mu_m,x) \theta_m,
\end{equation}

for example, the Gaussian kernel $K_\lambda(\mu,x)=e^{-\|x-\mu\|^2/(2\lambda)}$ is popular.

Radial basis functions have centroids $\mu_m$ and scales $\lambda_m$ that have to be determined. The spline basis functions have knots. In general we would like the data to dictate them as well.

> Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem.

In practice, shortcuts such as greedy algorithms or two-stage processes are used ($\S$ 6.7).
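
One possible two-stage process is sketched below. It is an assumption for illustration, not necessarily the procedure of $\S$ 6.7: the centroids are taken as a random subset of the training inputs, a common scale is fixed by hand, and the coefficients $\theta_m$ are then found by ordinary linear least squares. The function name `rbf_design` and the simulated data are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, lam):
    """Matrix with entries exp(-||x_i - mu_m||^2 / (2 * lam))."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lam))

# Simulated two-dimensional data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = np.sin(3.0 * X[:, 0]) * np.cos(2.0 * X[:, 1]) + 0.1 * rng.standard_normal(300)

# Stage 1 (assumption): pick 20 training inputs at random as centroids, fix one scale.
centers = X[rng.choice(300, size=20, replace=False)]
lam = 0.25

# Stage 2: with centroids and scale fixed, the model is linear in theta.
K = rbf_design(X, centers, lam)
theta, *_ = np.linalg.lstsq(K, y, rcond=None)
rss = np.sum((y - K @ theta) ** 2)
```

Fixing the centroids and scales first is what keeps the second stage a straightforward linear problem. Optimizing them jointly with $\theta$ would make the fit nonlinear.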

### Neural networks

A single-layer feed-forward neural network model with linear output weights can be thought of as an adaptive basis function method.

The model has the form

\begin{equation}
f_\theta(x) = \sum_{m=1}^M \beta_m \sigma(\alpha_m^T x + b_m),
\end{equation}

where

\begin{equation}
\sigma(x) = \frac{1}{1+e^{-x}}
\end{equation}

is known as the _activation_ function.

Here, as in the projection pursuit model, the directions $\alpha_m$ and the _bias_ terms $b_m$ have to be determined, and their estimation is the meat of the computation (Chapter 11).
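
The forward computation of this model can be sketched in a few lines of NumPy. The parameter values below are random placeholders, not fitted values, and the function names are illustrative.

```python
import numpy as np

def sigmoid(v):
    """Activation function sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def single_layer_network(X, alpha, b, beta):
    """f(x) = sum_m beta_m * sigmoid(alpha_m^T x + b_m) for each row x of X."""
    hidden = sigmoid(X @ alpha.T + b)   # shape (N, M): the M adaptive basis functions
    return hidden @ beta                # linear output weights

rng = np.random.default_rng(0)
p, M = 3, 4
alpha = rng.standard_normal((M, p))     # directions alpha_m, one per row
b = rng.standard_normal(M)              # bias terms b_m
beta = rng.standard_normal(M)           # output weights beta_m
X = rng.standard_normal((5, p))

out = single_layer_network(X, alpha, b, beta)
```

If `alpha` and `b` were held fixed, the columns of `hidden` would be an ordinary basis expansion and `beta` could be fitted by linear least squares. The adaptivity comes from estimating `alpha` and `b` as well.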

### Dictionary

These adaptively chosen basis function methods are also known as _dictionary_ methods, where one has available a possibly infinite set or dictionary $\mathcal{D}$ of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.
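
The text does not fix a particular search mechanism. As one hypothetical example, the sketch below performs greedy forward selection over a small finite dictionary: at each step it adds the candidate most correlated with the current residual, then refits all selected coefficients by least squares. The cosine dictionary and the name `greedy_select` are assumptions for illustration.

```python
import numpy as np

def greedy_select(D, y, n_terms):
    """Greedy forward selection over the columns of D (one column per candidate basis function)."""
    chosen = []
    theta = np.zeros(0)
    residual = y.copy()
    norms = np.linalg.norm(D, axis=0)
    for _ in range(n_terms):
        # Score each candidate by its normalized correlation with the current residual.
        scores = np.abs(D.T @ residual) / norms
        if chosen:
            scores[chosen] = -np.inf     # never pick the same column twice
        chosen.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        theta, *_ = np.linalg.lstsq(D[:, chosen], y, rcond=None)
        residual = y - D[:, chosen] @ theta
    return chosen, theta

n = 64
x = (np.arange(n) + 0.5) / n
# Hypothetical finite dictionary: cosines cos(k * pi * x) for k = 0, ..., 15.
D = np.cos(np.pi * np.outer(x, np.arange(16)))
y = 2.0 * D[:, 3] - 1.5 * D[:, 7]

chosen, theta = greedy_select(D, y, n_terms=2)
```

On this grid the cosine columns are mutually orthogonal, so the search picks columns 3 and 7 and recovers their coefficients. With correlated candidates a greedy search is only a heuristic and need not find the best subset.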