# $\S$ 5.8. Regularization and Reproducing Kernel Hilbert Spaces
## $\S$ 5.8.1. Spaces of Functions Generated by Kernels

An important subclass of problems of the form

\begin{equation}
\min_{f\in\mathcal{H}} \left[ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda J(f) \right]
\end{equation}

is generated by a positive definite kernel $K(x,y)$; the corresponding space of functions $\mathcal{H}_K$ is called a _reproducing kernel Hilbert space_ (RKHS). The penalty functional $J$ is defined in terms of the kernel as well.

We give a brief and simplified introduction to this class of models, drawing on the following references:
* Wahba (1990)
* Girosi et al. (1995)
* Evgeniou et al. (2000)

### Definitions and assumptions

Let $x, y \in\mathbb{R}^p$ and a kernel $K$ be given.

We consider the space $\mathcal{H}_K$ of functions generated by the linear span of

\begin{equation}
\lbrace K(\cdot, y), y\in\mathbb{R}^p \rbrace;
\end{equation}

i.e., arbitrary linear combinations of the form

\begin{equation}
f(x) = \sum_m \alpha_m K(x, y_m).
\end{equation}

Suppose that $K$ has an eigen-expansion

\begin{equation}
K(x,y) = \sum_{i=1}^\infty \gamma_i \phi_i(x)\phi_i(y),
\end{equation}

with $\gamma_i \ge 0$, $\sum_{i=1}^\infty \gamma_i^2 \lt \infty$.

Elements of $\mathcal{H}_K$ have an expansion in terms of these eigen-functions,

\begin{equation}
f(x) = \sum_{i=1}^\infty c_i \phi_i(x),
\end{equation}

with the constraints that

\begin{equation}
\|f\|_{\mathcal{H}_K}^2 := \sum_{i=1}^\infty c_i^2/\gamma_i \lt \infty,
\end{equation}

where $\|f\|_{\mathcal{H}_K}$ is the norm induced by $K$.

The penalty functional $J$ for the space $\mathcal{H}_K$ is defined to be the squared norm

\begin{equation}
J(f) = \|f\|_{\mathcal{H}_K}^2.
\end{equation}

The quantity $J(f)$ can be interpreted as a generalized ridge penalty: components $\phi_i$ with large eigenvalues $\gamma_i$ in the expansion are penalized less, and those with small eigenvalues are penalized more.

### Solution

Rewriting the minimization problem, we have

\begin{equation}
\min_{f\in\mathcal{H}_K} \left[ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_K}^2\right],
\end{equation}

or equivalently

\begin{equation}
\min_{\lbrace c_j\rbrace_1^\infty} \left[ \sum_{i=1}^N L\left(y_i, \sum_{j=1}^\infty c_j\phi_j(x_i)\right) + \lambda \sum_{j=1}^\infty \frac{c_j^2}{\gamma_j} \right].
\end{equation}

It can be shown (Wahba, 1990; see also Exercise 5.15) that the solution is finite-dimensional and has the form

\begin{equation}
f(x) = \sum_{i=1}^N \alpha_i K(x,x_i).
\end{equation}


### Reproducing property of $\mathcal{H}_K$

The basis function

\begin{equation}
h_i(x) = K(x,x_i)
\end{equation}

is known as the _representer of evaluation_ at $x_i$ in $\mathcal{H}_K$, since for $f\in\mathcal{H}_K$, it is easily seen that

\begin{equation}
\langle K(\cdot,x_i), f\rangle_{\mathcal{H}_K} = f(x_i).
\end{equation}
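
One way to see this is sketched below. It assumes that the inner product inducing the norm above is $\langle f, g\rangle_{\mathcal{H}_K} = \sum_j c_j d_j/\gamma_j$ for $f = \sum_j c_j\phi_j$ and $g = \sum_j d_j\phi_j$ (the symbols $d_j$ are introduced here only for illustration). The eigen-expansion of the kernel gives $K(\cdot,x_i) = \sum_j \left[\gamma_j\phi_j(x_i)\right]\phi_j(\cdot)$, so its coefficients are $d_j = \gamma_j\phi_j(x_i)$, and therefore

\begin{equation}
\langle K(\cdot,x_i), f\rangle_{\mathcal{H}_K} = \sum_{j=1}^\infty \frac{\gamma_j\phi_j(x_i)\,c_j}{\gamma_j} = \sum_{j=1}^\infty c_j\phi_j(x_i) = f(x_i).
\end{equation}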

Similarly,

\begin{equation}
\langle K(\cdot,x_i), K(\cdot,x_j)\rangle_{\mathcal{H}_K} = K(x_i,x_j)
\end{equation}

(a.k.a. the _reproducing_ property of $\mathcal{H}_K$), and hence

\begin{equation}
J(f) = \sum_{i=1}^N\sum_{j=1}^N K(x_i,x_j)\alpha_i \alpha_j
\end{equation}

for $f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$.
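
The identity between the eigen-expansion penalty $\sum_j c_j^2/\gamma_j$ and the quadratic form $\alpha^T\mathbf{K}\alpha$ can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the text: a hypothetical finite-rank kernel with $M = 6$ terms, cosine functions standing in for the $\phi_j$, and arbitrary decaying values for the $\gamma_j$. The check is purely algebraic, so it does not depend on how the $\phi_j$ are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-rank kernel: K(x, y) = sum_j gamma_j * phi_j(x) * phi_j(y).
M = 6
gamma = 1.0 / (1.0 + np.arange(M)) ** 2   # illustrative decaying eigenvalues

def phi(x):
    """Evaluate the illustrative basis cos(j * x), j = 0..M-1; returns (len(x), M)."""
    x = np.asarray(x, dtype=float)
    return np.cos(np.outer(x, np.arange(M)))

x_train = rng.uniform(0.0, np.pi, size=8)   # the points x_1, ..., x_N
alpha = rng.normal(size=8)                  # arbitrary coefficients alpha_i

Phi = phi(x_train)            # N x M matrix with entries phi_j(x_i)
K = (Phi * gamma) @ Phi.T     # Gram matrix with entries K(x_i, x_j)

# f(x) = sum_i alpha_i K(x, x_i) = sum_j c_j phi_j(x),
# with c_j = gamma_j * sum_i alpha_i phi_j(x_i).
c = gamma * (Phi.T @ alpha)

penalty_eigen = np.sum(c**2 / gamma)   # sum_j c_j^2 / gamma_j
penalty_gram = alpha @ K @ alpha       # alpha^T K alpha

# The two representations of f also agree at new points.
x_new = np.linspace(0.0, np.pi, 5)
f_kernel = (phi(x_new) * gamma) @ Phi.T @ alpha
f_eigen = phi(x_new) @ c

print(penalty_eigen, penalty_gram)
```

The two printed values should agree up to floating-point error, as should `f_kernel` and `f_eigen`.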

### Reduction to a finite-dimensional problem

Then the minimization problem reduces to a finite-dimensional criterion

\begin{equation}
\min_\alpha L(\mathbf{y}, \mathbf{K\alpha}) + \lambda\mathbf{\alpha}^T\mathbf{K\alpha},
\end{equation}

where $\mathbf{K}$ is the $N \times N$ matrix with $ij$th entry $K(x_i,x_j)$, and so on. Simple numerical algorithms can be used to optimize this criterion.
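
As a minimal sketch of the finite-dimensional criterion, consider the special case of squared-error loss, $L(\mathbf{y}, \mathbf{K}\alpha) = \|\mathbf{y} - \mathbf{K}\alpha\|^2$. Setting the gradient to zero shows that $\alpha = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$ is a minimizer. The Gaussian kernel, its `scale` parameter, the function names, and the simulated data below are all assumptions chosen for illustration, not something prescribed by the text.

```python
import numpy as np

def rbf_kernel(A, B, scale=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 * scale^2)) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * scale**2))

def fit_kernel_ridge(X, y, lam, scale=1.0):
    """Minimize ||y - K alpha||^2 + lam * alpha^T K alpha via (K + lam I) alpha = y."""
    K = rbf_kernel(X, X, scale)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_new, scale=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) at the rows of X_new."""
    return rbf_kernel(X_new, X_train, scale) @ alpha

# Simulated one-dimensional data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
alpha = fit_kernel_ridge(X, y, lam)
fitted = predict(X, alpha, X)
print(np.mean((y - fitted) ** 2))   # training mean squared error
```

For other loss functions there is generally no closed form, and the same criterion would be minimized iteratively over $\alpha$.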

This phenomenon, whereby the infinite-dimensional problem reduces to a finite-dimensional optimization problem, has been dubbed the _kernel property_ in the literature on support-vector machines (Chapter 12).

### Bayesian interpretation

This class of models has a Bayesian interpretation, in which $f$ is viewed as a realization of a zero-mean stationary Gaussian process with prior covariance function $K$. The eigen-decomposition then produces a series of orthogonal eigen-functions $\phi_j(x)$ with associated variances $\gamma_j$.

The typical scenario is that
* "smooth" functions $\phi_j$ have large prior variances,
* while "rough" $\phi_j$ have small prior variances.

The penalty $J$ is the contribution of the prior to the joint likelihood, and penalizes more those components with smaller prior variance.
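
To make this correspondence concrete, here is a formal sketch under assumptions that are not stated above: Gaussian noise $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, and independent priors $c_j \sim N(0, \gamma_j)$ on the coefficients of $f = \sum_j c_j\phi_j$. The negative log-posterior of the coefficients, scaled by $2\sigma^2$, is then

\begin{equation}
-2\sigma^2 \log p(c \mid \mathbf{y}) = \sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \sigma^2 \sum_{j=1}^\infty \frac{c_j^2}{\gamma_j} + \text{const},
\end{equation}

which matches the penalized criterion with squared-error loss and $\lambda = \sigma^2$. Under these assumptions, a coefficient $c_j$ with small prior variance $\gamma_j$ receives a large weight $1/\gamma_j$ in the penalty.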

### More general approach

For simplicity we have dealt with the case here where all members of $\mathcal{H}$ are penalized, as in the above minimization problem.

More generally, there may be some components in $\mathcal{H}$ that we wish to leave alone, such as the linear functions for cubic smoothing splines in $\S$ 5.4. The multidimensional thin-plate splines of $\S$ 5.7 and tensor product splines fall into this category as well.

In these cases there is a more convenient representation

\begin{equation}
\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1,
\end{equation}

with the _null space_ $\mathcal{H}_0$ consisting of, e.g., low degree polynomials in $x$ that do not get penalized. Then the penalty becomes

\begin{equation}
J(f) = \|P_1 f\|^2,
\end{equation}

where $P_1 f$ is the orthogonal projection of $f$ onto $\mathcal{H}_1$.

The solution has the form

\begin{equation}
f(x) = \sum_{j=1}^M \beta_j h_j(x) + \sum_{i=1}^N \alpha_i K(x,x_i),
\end{equation}

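where the first term represents an expansion in $\mathcal{H}_0$; here $h_1, \ldots, h_M$ denote a basis of $\mathcal{H}_0$, not the representers of evaluation defined earlier.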

From a Bayesian perspective, the coefficients of components in $\mathcal{H}_0$ have improper priors, with infinite variance.
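
A minimal sketch of fitting a solution of this form follows, under several assumptions that go beyond the text: squared-error loss, a null space $\mathcal{H}_0$ spanned by the constant and linear functions, a Gaussian kernel generating $\mathcal{H}_1$, and the penalty $\alpha^T\mathbf{K}\alpha$ on the kernel part only. Setting the gradients with respect to $\alpha$ and $\beta$ to zero (and assuming $\mathbf{K}$ is nonsingular) leads to the linear system $(\mathbf{K} + \lambda\mathbf{I})\alpha + \mathbf{H}\beta = \mathbf{y}$, $\mathbf{H}^T\alpha = 0$, where $\mathbf{H}$ is the hypothetical $N \times M$ matrix with entries $h_j(x_i)$.

```python
import numpy as np

def rbf_kernel(A, B, scale=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 * scale^2)) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * scale**2))

def fit_with_null_space(X, y, lam, scale=1.0):
    """Minimize ||y - H beta - K alpha||^2 + lam * alpha^T K alpha.

    H holds the unpenalized basis (here: intercept and linear terms).
    Returns (alpha, beta) from the stationarity conditions
        (K + lam I) alpha + H beta = y,   H^T alpha = 0.
    """
    N = len(y)
    H = np.column_stack([np.ones(N), X])    # h_1(x) = 1, then the coordinates of x
    K = rbf_kernel(X, X, scale)
    M = H.shape[1]
    A = np.block([[K + lam * np.eye(N), H],
                  [H.T, np.zeros((M, M))]])
    rhs = np.concatenate([y, np.zeros(M)])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N:]

# Illustrative check: exactly linear data should be fit by the null space alone.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
y = 1.0 + 2.0 * X[:, 0]

alpha, beta = fit_with_null_space(X, y, lam=0.5)
print(beta)                    # unpenalized coefficients
print(np.max(np.abs(alpha)))   # size of the penalized kernel part
```

Because the linear part is not penalized, exactly linear data are reproduced by $\beta$ alone, with the kernel coefficients $\alpha$ numerically zero regardless of $\lambda$.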