## Chapter X: Kernel methods

# X.1  Motivation and basic examples

In this Chapter we describe fixed feature kernels, a method of representing fixed basis features so that they scale more gracefully when applied to vector valued input.

A serious practical issue presents itself when applying fixed basis features to vector valued input: even with a moderate sized input dimension $N$, the corresponding dimension
$M$ of the transformed features grows rapidly with $N$ and quickly becomes prohibitively
large in terms of storage and computation. For example, the precise number $M$ of non-
bias features/feature weights of a degree $D$ polynomial of an input with dimension
$N$ is $\left(\begin{array}{c}
N+D\\
D
\end{array}\right)-1=\frac{\left(N+D\right)!}{N!D!}-1$. Even if the input dimension is of reasonably small size, for instance $N=100$ or $N=500$, the associated degree $D=5$ polynomial feature maps have dimension $M=96{,}560{,}645$ and $M=268{,}318{,}178{,}225$ respectively! In the latter case we cannot even hold the feature vectors in memory on a modern computer.

The corresponding number of transformed features with a Fourier basis/map is even more gargantuan: the degree $D$ Fourier feature map of arbitrary input dimension $N$ has $\left(2D + 1\right)^N$ associated features/feature weights (counting the bias). When $D=5$ and $N=80$ this is $11^{80}$, a number larger than current estimates of the number of atoms in the visible universe!
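These counts are easy to check with exact integer arithmetic. The following is a minimal sketch; the helper names `num_poly_features` and `num_fourier_features` are hypothetical, and the value $10^{80}$ is only a commonly quoted order of magnitude for the number of atoms in the visible universe, used here as an assumption.

```python
from math import comb


def num_poly_features(N, D):
    """Number of non-bias features of a degree D polynomial in N inputs."""
    return comb(N + D, D) - 1


def num_fourier_features(N, D):
    """Number of features (bias included) of a degree D Fourier map in N inputs."""
    return (2 * D + 1) ** N


# polynomial feature counts for the two input dimensions used in the text
for N in (100, 500):
    print(N, num_poly_features(N, 5))

# Python integers have arbitrary precision, so 11**80 is computed exactly
print(num_fourier_features(80, 5) > 10 ** 80)
```

Even for modest $D$ the counts grow so quickly with $N$ that storing one value per feature is out of reach.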

This crucial issue, of not being able to effectively store high dimensional fixed basis feature transformations, motivates the search for more efficient representations of fixed bases. Here we introduce kernels or kernelized representations of fixed feature transformations, which are clever ways of constructing them that do not require explicit construction of the fixed features themselves. Kernels allow us to avoid this combinatorial storage problem and use fixed features with vector input (at the cost, as we will see, of scaling poorly with the size of a dataset). Additionally they provide a way of generating new fixed feature maps defined solely through such a kernelized representation.

## The fundamental theorem of linear algebra

Before discussing the concept of kernelization, it will be helpful to first recall a useful fact, generally referred to as the fundamental theorem of linear algebra. This is a simple statement about how to decompose a length $M$ vector $\mathbf{w} \in \mathbb{R}^{M}$ over the columns of a given matrix.

Recall that a set of $M$-dimensional vectors $\left\{\mathbf{f}_p\right\}_{p=1}^P$ spans a subspace of dimension at most $P$ (exactly $P$ when the vectors are linearly independent, which requires $P \leq M$), and that any vector $\mathbf{w}$ in this subspace can be written as some linear combination of the vectors as

\begin{equation}
\mathbf{w} = \sum_{p=1}^{P} \mathbf{f}_p\,z_p
\end{equation}

where $z_p$ are weights associated with $\mathbf{w}$. By stacking the vectors $\mathbf{f}_p$ column-wise into an $M \times P$ matrix $\mathbf{F}$ and the $z_{p}$'s together into a $P \times 1$ vector $\mathbf{z}$ this relationship can be written more compactly as

\begin{equation}
\mathbf{w} = \mathbf{F}\mathbf{z}
\end{equation}

As illustrated in Figure 1, any vector $\mathbf{w} \in \mathbb{R}^{M}$ can then be decomposed into two pieces: the portion of $\mathbf{w}$ belonging to the subspace spanned by the columns of $\mathbf{F}$ and an orthogonal component $\mathbf{r}$. Formally this decomposition is written as

\begin{equation}
\mathbf{w} = \mathbf{F}\mathbf{z}+\mathbf{r}
\end{equation}

Note that $\mathbf{r}$ being orthogonal to the span of $\mathbf{F}$’s columns means formally that $\mathbf{F}^T\mathbf{r}=\mathbf{0}_{P \times 1}$.
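The decomposition can be illustrated numerically. This is a minimal sketch under stated assumptions: the matrix `F` and vector `w` are random stand-ins, and the weights `z` are obtained with NumPy's least squares solver, which yields the projection of `w` onto the column space of `F`.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 6, 3
F = rng.standard_normal((M, P))   # columns play the role of the vectors f_p
w = rng.standard_normal(M)        # an arbitrary vector in R^M

# least squares weights z give the projection F z of w onto the span of F's columns
z, *_ = np.linalg.lstsq(F, w, rcond=None)
r = w - F @ z                     # leftover component

print(np.allclose(F.T @ r, 0))    # r is orthogonal to every column of F
print(np.allclose(F @ z + r, w))  # the two pieces rebuild w
```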

<figure>
  <img src= '../../mlrefined_images/kernel_images/Fig_7_1.png' width="40%"/>
  <figcaption> 
      <strong>Figure 1:</strong> 
      <em> 
An illustration of the fundamental theorem of linear algebra which states that any vector $\mathbf{w}$ in an $M$-dimensional space can be decomposed as $\mathbf{w} = \mathbf{F}\mathbf{z}+\mathbf{r}$. Here the vector $\mathbf{F}\mathbf{z}$ belongs in the subspace determined by the columns of the matrix $\mathbf{F}$ and $\mathbf{r}$ is orthogonal to this subspace.
      </em>
  </figcaption>
</figure>

As we will now see, this simple statement is the key to representing fixed basis features more effectively when they are used to transform vector valued input, and it applies to every cost function discussed in this book.

## Kernelizing cost functions

Suppose that we have a dataset of $P$ points $\left\{\left(\mathbf{x}_p,y_p\right)\right\}_{p=1}^P$ where each input $\mathbf{x}_p$ has dimension $N$. Recall that when employing any fixed feature basis for regression we learn proper parameters by minimizing the Least Squares cost

\begin{equation}
g\left(b,\mathbf{w}\right) = \sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\mathbf{w}-y_p\right)^2
\end{equation}

where we have used the vector notation

\begin{equation}
\mathbf{f}_{p}=\left[\begin{array}{c}
f_{1}\left(\mathbf{x}_{p}\right)\\
f_{2}\left(\mathbf{x}_{p}\right)\\
\vdots\\
f_{M}\left(\mathbf{x}_{p}\right)
\end{array}\right]
\end{equation}

to denote the $M$ fixed basis feature transformations of the input $\mathbf{x}_p$. 

Denote by $\mathbf{F}$ the $M \times P$ matrix formed by stacking the vectors $\mathbf{f}_p$ column-wise. Now, employing the fundamental theorem of linear algebra discussed in the previous section we may write $\mathbf{w}$ here as

\begin{equation}
\mathbf{w} = \mathbf{F}\mathbf{z}+\mathbf{r}
\end{equation}

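where $\mathbf{r}$ satisfies $\mathbf{F}^T\mathbf{r}=\mathbf{0}_{P \times 1}$. Since each $\mathbf{f}_p$ is a column of $\mathbf{F}$, this means $\mathbf{f}_p^T\mathbf{r}=0$ for every $p$, so plugging this representation of $\mathbf{w}$ back into the cost function eliminates $\mathbf{r}$ and gives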

\begin{equation}
\sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\left(\mathbf{F}\mathbf{z}+\mathbf{r}\right)-y_p\right)^2 = \sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\mathbf{F}\mathbf{z}-y_p\right)^2 
\end{equation}

Finally, denoting by $\mathbf{H} = \mathbf{F}^T\mathbf{F}$ the symmetric $P \times P$ matrix referred to as a fixed basis *kernel matrix* (where $\mathbf{h}_p = \mathbf{F}^T\mathbf{f}_p$ is the $p^{th}$ column of this matrix), our original cost function becomes equivalently

\begin{equation}
g\left(b,\mathbf{z}\right) = \sum_{p=1}^{P} \left(b+\mathbf{h}_p^T\mathbf{z}-y_p\right)^2
\end{equation}

Note that we have changed the arguments of the cost function from $g\left(b,\mathbf{w}\right)$ to $g\left(b,\mathbf{z}\right)$ due to our substitution of $\mathbf{w}$. The original problem of minimizing the Least Squares cost may now be written equivalently in this *kernelized form* as

\begin{equation}
\underset{b,\,\mathbf{z}}{\text{minimize}}\,\, \sum_{p=1}^{P} \left(b+\mathbf{h}_p^T\mathbf{z}-y_p\right)^2
\end{equation}
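The equivalence of the two forms can be checked numerically. The sketch below is illustrative only: it uses a random stand-in for $\mathbf{F}$ (in practice the point of kernelization is to avoid forming $\mathbf{F}$ at all), and the function names `cost_original` and `cost_kernelized` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
M, P = 8, 5
F = rng.standard_normal((M, P))   # column p holds the feature vector f_p
y = rng.standard_normal(P)
b = 0.3


def cost_original(b, w):
    # sum_p (b + f_p^T w - y_p)^2
    return np.sum((b + F.T @ w - y) ** 2)


H = F.T @ F                       # P x P kernel matrix


def cost_kernelized(b, z):
    # sum_p (b + h_p^T z - y_p)^2, with h_p the p-th column of H
    return np.sum((b + H.T @ z - y) ** 2)


# build w = F z + r with r orthogonal to the columns of F
z = rng.standard_normal(P)
v = rng.standard_normal(M)
r = v - F @ np.linalg.lstsq(F, v, rcond=None)[0]
w = F @ z + r

print(np.isclose(cost_original(b, w), cost_kernelized(b, z)))
```

Note that the original cost is evaluated at an $M$ dimensional $\mathbf{w}$ while the kernelized cost only needs the $P$ dimensional $\mathbf{z}$.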

Using precisely the same argument given here we may kernelize all of the cost functions discussed in this book including: the softmax cost/logistic regression classifier, the squared margin-perceptron/soft-margin SVMs, the multiclass softmax cost function, as well as any $\ell_2$ regularized version of these models. We show both the original and kernelized forms of these formulae in Table 1 for easy reference.

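**Table 1:** Original and kernelized forms of the cost functions discussed in this Section, where $\mathbf{h}_p$ denotes the $p^{th}$ column of the kernel matrix $\mathbf{H}=\mathbf{F}^T\mathbf{F}$.

| Cost function | Original form | Kernelized form |
|:--|:--|:--|
| Least Squares | $\sum_{p=1}^{P}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}-y_{p}\right)^{2}$ | $\sum_{p=1}^{P}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}-y_{p}\right)^{2}$ |
| Softmax cost/logistic regression | $\sum_{p=1}^{P}\mbox{log}\left(1+e^{-y_{p}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}\right)}\right)$ | $\sum_{p=1}^{P}\mbox{log}\left(1+e^{-y_{p}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}\right)}\right)$ |
| Squared margin-perceptron/soft-margin SVM | $\sum_{p=1}^{P}\mbox{max}^{2}\left(0,1-y_{p}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}\right)\right)$ | $\sum_{p=1}^{P}\mbox{max}^{2}\left(0,1-y_{p}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}\right)\right)$ |
| Multiclass softmax | $\sum_{c=1}^{C}\sum_{p\in\Omega_{c}}\mbox{log}\left(1+\sum_{j=1,\,j\neq c}^{C}e^{\left(b_{j}-b_{c}\right)+\mathbf{f}_{p}^{T}\left(\mathbf{w}_{j}-\mathbf{w}_{c}\right)}\right)$ | $\sum_{c=1}^{C}\sum_{p\in\Omega_{c}}\mbox{log}\left(1+\sum_{j=1,\,j\neq c}^{C}e^{\left(b_{j}-b_{c}\right)+\mathbf{h}_{p}^{T}\left(\mathbf{z}_{j}-\mathbf{z}_{c}\right)}\right)$ |
| $\ell_2$ regularizer | $\lambda\left\Vert \mathbf{w}\right\Vert _{2}^{2}$ | $\lambda\mathbf{z}^{T}\mathbf{H}\mathbf{z}$ |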

## The value of kernelization

The real value of kernelizing any cost function is that for many fixed feature maps, including polynomials and Fourier features, the kernel matrix $\mathbf{H}$ may be constructed without first building the matrix $\mathbf{F}$. That is, we need not form it explicitly as $\mathbf{H}=\mathbf{F}^T\mathbf{F}$; instead it may be constructed entry-wise via simple formulae. In fact, as we will see, thinking about kernel matrices in this way leads to the construction of fixed feature bases by defining the kernel matrix first (that is, not by beginning with an explicit feature transformation). As we see in the next section this can be done for both degree $D$ polynomial and Fourier feature bases, as well as many other fixed maps. This is highly advantageous since, as discussed previously, even with moderate sized input dimension $N$ the dimension $M$ of a fixed feature transformation will likely be gargantuan, so large that we may not even be able to store the matrix $\mathbf{F}$, let alone compute with it.

However, note that in moving from the original to the kernelized form the non-bias optimization variable has changed from $\mathbf{w}$, which had dimension $M$ in Equation (4), to $\mathbf{z}$, which has dimension $P$ in the kernelized version shown in Equation (8). This is precisely how the dimension of the non-bias optimization variable changes with the other kernelized cost functions as well, like those shown in Table 1.

While it is true that for large datasets (that is large values of $P$, e.g., in the thousands or tens of thousands) the minimization of a kernelized cost function becomes more challenging, the main obstacle is storing the $P \times P$ kernel matrix itself, whose size grows quadratically with $P$. For example, with $P=10,000$ the corresponding kernel matrix will be of size $10,000 \times 10,000$, with $10^8$ values to store (roughly 0.8 GB in double precision), and with $P=100,000$ it has $10^{10}$ values (roughly 80 GB), more than a typical machine can hold in memory at once. Moreover, the amount of computation required to perform, e.g., gradient descent grows dramatically with the size of the kernel matrix.

Common ways of dealing with these issues for large datasets include: 1) using advanced first order methods such as stochastic gradient descent, discussed in Chapter 13, so that only a small number of the kernelized points are dealt with at a time when optimizing; 2) reducing the dimension of data using techniques like Principal Component Analysis and hence avoiding the need for kernelized versions of fixed bases; 3) using the explicit structure of certain problems (see e.g., [[1,2]](#bib_cell)); and 4) employing the tools from function approximation to avoid explicit construction of the kernel matrix [[3,4,5]](#bib_cell).

## Examples of kernels

Here we present a list of examples of kernels for popular fixed feature transformations that may be built without first constructing the explicit feature transformation itself. While these are the most commonly used kernels in practice, the reader can see e.g., [[6,7]](#bib_cell) for a more exhaustive list of kernels and their properties.

#### <span style="color:#a50e3e;">Example 1. </span> The polynomial kernel

Consider the following second degree polynomial mapping from $N=2$ to $M=5$ dimensional space given by

\begin{equation}
\mathbf{f}\left(\left[\begin{array}{c}
x_{1}\\
x_{2}
\end{array}\right]\right)=\left[\begin{array}{ccccc}
\sqrt{2}x_{1}^{\,} & \sqrt{2}x_{2}^{\,} & x_{1}^{2} & \sqrt{2}x_{1}^{\,}x_{2}^{\,} & x_{2}^{2}\end{array}\right]^{T}
\end{equation}

This is entirely equivalent to a standard degree $2$ polynomial, as the $\sqrt{2}$ attached to several of the terms can be absorbed by their associated weights when taking the corresponding weighted sum $\underset{m=1}{\overset{5}{\sum}}f_{m}\left(\mathbf{x}\right)w_{m}$.

Denoting briefly by $\mathbf{u}=\mathbf{x}_{i}$ and $\mathbf{v}=\mathbf{x}_{j}$
the $i^{\textrm{th}}$ and $j^{\textrm{th}}$ input data points respectively,
the $\left(i,j\right)^{\textrm{th}}$ element of the kernel matrix
for a degree 2 polynomial $\mathbf{H}=\mathbf{F}^{T}\mathbf{F}$ may
be written as

\begin{equation}
\begin{aligned}
\mathbf{H}_{ij} & =\left[\begin{array}{ccccc}
\sqrt{2}u_{1} & \sqrt{2}u_{2} & u_{1}^{2} & \sqrt{2}u_{1}u_{2} & u_{2}^{2}\end{array}\right]\left[\begin{array}{c}
\sqrt{2}v_{1}\\
\sqrt{2}v_{2}\\
v_{1}^{2}\\
\sqrt{2}v_{1}v_{2}\\
v_{2}^{2}
\end{array}\right]\\
 & =\left(1+2u_{1}v_{1}+2u_{2}v_{2}+u_{1}^{2}v_{1}^{2}+2u_{1}u_{2}v_{1}v_{2}+u_{2}^{2}v_{2}^{2}\right)-1\\
 & =\left(1+u_{1}v_{1}+u_{2}v_{2}\right)^{2}-1=\left(1+\mathbf{u}^{T}\mathbf{v}\right)^{2}-1
\end{aligned}
\end{equation}

In short, the *polynomial kernel* matrix $\mathbf{H}$ may be built without first constructing the explicit features in Equation (10), and may be simply defined
entry-wise as

\begin{equation}
\mathbf{H}_{ij}=\left(1+\mathbf{x}_{i}^{T}\mathbf{x}_{j}\right)^{2}-1
\end{equation}

Again note that with the polynomial kernel defined above we only require access to the original input data, not the explicit polynomial features themselves.

Although the kernel construction rule in Equation (12) was derived specifically for $N=2$ and a degree two polynomial, one can show that a polynomial kernel can be defined entry-wise for general $N$ and degree $D$ analogously as

\begin{equation}
\mathbf{H}_{ij}=\left(1+\mathbf{x}_{i}^{T}\mathbf{x}_{j}\right)^{D}-1
\end{equation}
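For the $N=2$, degree two case the entry-wise rule can be compared against the explicit feature map. This is a minimal sketch; the function names `poly_features_deg2` and `polynomial_kernel` are hypothetical, and the data are random stand-ins stored as the columns of an $N \times P$ array.

```python
import numpy as np


def poly_features_deg2(x):
    """Explicit degree 2 feature map for an N = 2 input (the map of Example 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])


def polynomial_kernel(X, D):
    """Kernel matrix with entries (1 + x_i^T x_j)^D - 1; X is N x P."""
    return (1.0 + X.T @ X) ** D - 1.0


rng = np.random.default_rng(2)
X = rng.standard_normal((2, 4))   # P = 4 points stored as columns

# explicit route: build F column by column, then form F^T F
F = np.stack([poly_features_deg2(X[:, p]) for p in range(X.shape[1])], axis=1)
H_explicit = F.T @ F

# kernel route: only inner products of the raw inputs are needed
H_kernel = polynomial_kernel(X, D=2)

print(np.allclose(H_explicit, H_kernel))
```

The kernel route costs one $P \times P$ matrix of inner products regardless of $D$, whereas the explicit route requires all $M$ features per point.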

#### <span style="color:#a50e3e;">Example 2. </span>  The Fourier kernel

The degree $D$ Fourier feature transformation from $N=1$ to $M=2D$ dimensional
space is given as

\begin{equation}
\mathbf{f}_{p}=\left[\begin{array}{ccccc}
\sqrt{2}\mbox{cos}\left(2\pi x_{p}\right) & \sqrt{2}\mbox{sin}\left(2\pi x_{p}\right) & \cdots & \sqrt{2}\mbox{cos}\left(2D\pi x_{p}\right) & \sqrt{2}\mbox{sin}\left(2D\pi x_{p}\right)\end{array}\right]^{T}
\end{equation}

For a dataset of $P$ points the corresponding $\left(i,j\right)^{th}$
element of the corresponding kernel matrix $\mathbf{H}$ can be written
as

\begin{equation}
\mathbf{H}_{ij}=\mathbf{f}_{i}^{T}\mathbf{f}_{j}=2 \sum_{m=1}^{D}\left[\mbox{cos}\left(2\pi mx_{i}\right)\mbox{cos}\left(2\pi mx_{j}\right)+\mbox{sin}\left(2\pi mx_{i}\right)\mbox{sin}\left(2\pi mx_{j}\right)\right]
\end{equation}

Using trigonometric identities one can show (see the Section on Fourier kernel calculations for scalar input below)
that this may equivalently be written as

\begin{equation}
\mathbf{H}_{ij}=\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{i}-x_{j}\right)\right)}{\mbox{sin}\left(\pi\left(x_{i}-x_{j}\right)\right)}-1.
\end{equation}

Note that whenever $x_{i}-x_{j}$ is integer valued the term $\frac{\text{sin}\left(\left(2D+1\right)\pi\left(x_{i}-x_{j}\right)\right)}{\text{sin}\left(\pi\left(x_{i}-x_{j}\right)\right)}$
is not technically defined. In these cases it is simply replaced by
its associated limit which, regardless of the integer value $x_{i}-x_{j}$,
is always equal to $2D+1$, meaning that $\mathbf{H}_{ij}=2D$.
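The closed form for scalar input, including the limit used at integer differences, can be compared with the explicit features. This is a sketch under stated assumptions: the names `fourier_features` and `fourier_kernel_1d` are hypothetical, and integer differences are detected with a floating point tolerance rather than exact equality.

```python
import numpy as np


def fourier_features(x, D):
    """Explicit degree D Fourier features of scalar inputs x; returns a 2D x P array."""
    x = np.asarray(x, dtype=float)
    rows = []
    for m in range(1, D + 1):
        rows.append(np.sqrt(2.0) * np.cos(2 * np.pi * m * x))
        rows.append(np.sqrt(2.0) * np.sin(2 * np.pi * m * x))
    return np.stack(rows, axis=0)


def fourier_kernel_1d(x, D):
    """Closed-form Fourier kernel matrix for scalar inputs."""
    x = np.asarray(x, dtype=float)
    t = x[:, None] - x[None, :]                 # all pairwise differences x_i - x_j
    at_integer = np.isclose(t, np.round(t))     # where the ratio is undefined
    den = np.where(at_integer, 1.0, np.sin(np.pi * t))
    num = np.sin((2 * D + 1) * np.pi * t)
    # at integer differences use the limit 2D + 1 in place of the ratio
    ratio = np.where(at_integer, 2 * D + 1, num / den)
    return ratio - 1.0


x = np.array([0.1, 0.35, 0.8, 1.1])   # 1.1 - 0.1 is an integer difference
D = 3
F = fourier_features(x, D)
print(np.allclose(F.T @ F, fourier_kernel_1d(x, D)))
```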

Moreover for general $N$ dimensional input the corresponding kernel
can be written similarly entry-wise as

\begin{equation}
\mathbf{H}_{ij}=\underset{n=1}{\overset{N}{\prod}}\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{in}-x_{jn}\right)\right)}{\mbox{sin}\left(\pi\left(x_{in}-x_{jn}\right)\right)}-1.
\end{equation}

As with the one dimensional version whenever $x_{in}-x_{jn}$ is integer
valued the associated term $\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{in}-x_{jn}\right)\right)}{\mbox{sin}\left(\pi\left(x_{in}-x_{jn}\right)\right)}$
in the product is replaced by its limit which, regardless of the value
of $x_{in}-x_{jn}$, is always equal to $2D+1$. See the Section on Fourier kernel calculations for vector input below
for further details.

With this formula we may compute the degree $D$ Fourier kernel matrix
for arbitrary $N$ dimensional input vectors without calculating the
enormous number of basis features explicitly (recall that there are
$\left(2D+1\right)^{N}$ of them, counting the bias).

#### <span style="color:#a50e3e;">Example 3. </span>  Kernel representation of radial basis function (RBF) features

Another popular choice of kernel is the *radial basis function*
(RBF) kernel which is typically defined explicitly as a kernel matrix
over the input data as

\begin{equation}
\mathbf{H}_{ij}=e^{-\beta\left\Vert \mathbf{x}_{i}-\mathbf{x}_{j}\right\Vert _{2}^{2}}
\end{equation}

Here the kernel parameter $\beta$ is tuned to the data in practice
via cross-validation. 
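Building the RBF kernel matrix requires only pairwise squared distances between inputs. A minimal sketch follows; the function name `rbf_kernel`, the column-wise data layout, and the value of `beta` are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, beta):
    """RBF kernel matrix for inputs stored as the columns of the N x P array X."""
    sq_norms = np.sum(X ** 2, axis=0)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives caused by round-off
    return np.exp(-beta * sq_dists)


rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(3, 6))   # N = 3 dimensional inputs, P = 6 points
H = rbf_kernel(X, beta=10.0)
print(H.shape)
```

Larger values of `beta` make the kernel decay faster with distance, which is why this parameter is usually tuned on held-out data.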

While the RBF kernel is typically defined directly as above, it can be traced back to an explicit fixed feature basis as with the polynomial and Fourier kernels i.e., we have that

\begin{equation}
\mathbf{H}_{ij}=\mathbf{f}_{i}^{T}\mathbf{f}_{j},
\end{equation}

where $\mathbf{f}_{i}$ is the fixed feature transformation of the input $\mathbf{x}_{i}$ based on a fixed basis. While the length of a feature transformation corresponding to a degree $D$ polynomial/Fourier kernel matrix can be extremely large (as discussed in the introduction to this Section), with the RBF kernel the associated feature transformation is always *infinite* dimensional. For example when $N=1$ the feature vector $\mathbf{f}_{i}$ takes the form $\mathbf{f}_{i}=\left[\begin{array}{cccc}
f_{1}\left(x_{i}\right) & f_{2}\left(x_{i}\right) & f_{3}\left(x_{i}\right) & \cdots\end{array}\right]^{T}$, where the $m^{th}$ fixed basis feature is defined as

\begin{equation}
f_{m}\left(x_{i}\right)=e^{-\beta x_{i}^{2}}\sqrt{\frac{\left(2\beta\right)^{m-1}}{\left(m-1\right)!}}x_{i}^{m-1}\quad\textrm{for all }m\geq1
\end{equation}

When $N>1$ the corresponding feature vector takes on an analogous form (and is also infinite in length), but regardless of the input dimension it would be impossible to even construct and store a single $\mathbf{f}_{i}$, let alone such transformations of the entire dataset.
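Although the full feature vector cannot be stored, truncating it to its first few entries shows how the inner product approaches the RBF kernel value for $N=1$. This is only an illustrative sketch: the function name `rbf_features_truncated`, the truncation lengths, and the sample inputs are assumptions.

```python
import numpy as np
from math import factorial


def rbf_features_truncated(x, beta, K):
    """First K entries of the infinite RBF feature vector for a scalar input x."""
    k = np.arange(K)   # k = m - 1 in the indexing of the text
    coeff = np.sqrt(np.array([(2 * beta) ** j / factorial(j) for j in k]))
    return np.exp(-beta * x ** 2) * coeff * x ** k


beta = 1.0
xi, xj = 0.3, -0.6
exact = np.exp(-beta * (xi - xj) ** 2)   # RBF kernel value for this pair

# the truncated inner product should approach the exact value as K grows
for K in (2, 5, 20):
    approx = rbf_features_truncated(xi, beta, K) @ rbf_features_truncated(xj, beta, K)
    print(K, approx, exact)
```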

The polynomial, Fourier, and RBF kernel matrices introduced earlier are all similarity matrices, essentially encoding how close or similar a collection of data points are to one another, with points in proximity to one another receiving a high value and those far apart receiving a low value. In this sense all three kernels discussed here, and hence all three corresponding fixed feature bases, define some kind of similarity between data points $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ from different geometric perspectives.

In Figure 2 we compare these three kernels geometrically
by fixing a point $\mathbf{x}_{p}=\left[\begin{array}{cc}
0.5 & 0.5\end{array}\right]^{T}$ and plotting $\mathbf{H}\left(\mathbf{x},\mathbf{x}_{p}\right)$
over the range $\mathbf{x}\in\left[0,1\right]^{2}$, producing a color-coded
surface showing how each kernel treats points near $\mathbf{x}_{p}$.
Analyzing this Figure we can judge more generally how the three kernels
define 'similarity' between points.

<figure>
  <img src= '../../mlrefined_images/kernel_images/Fig_7_2.png' width="80%"/>
  <figcaption> 
      <strong>Figure 2:</strong> 
      <em> 
Surfaces generated by polynomial, Fourier, and RBF kernels centered at $\mathbf{x}_{p}=\left[\begin{array}{cc}0.5 & 0.5\end{array}\right]^{T}$ with the surfaces color-coded based on their similarity to $\mathbf{x}_{p}$. (left panel) A degree 2 polynomial kernel, (middle panel) degree 3 Fourier kernel, and (right panel) RBF kernel with $\beta = 10$. See text for further details.
      </em>
  </figcaption>
</figure>

Firstly, we can see that a polynomial kernel treats data points $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ as similar if their inner product is high or, in other words, they highly correlate with each other. Likewise the points are treated as dissimilar when they are orthogonal to one another. On the other hand, the Fourier kernel treats points as similar if they lie close together, but their similarity oscillates like a “sinc” function as their distance from each other grows. Finally an RBF kernel provides a smooth similarity between points: if they are close to each other in a Euclidean sense they are highly similar; however, once the distance between them passes a certain threshold they rapidly become dissimilar.

### Further kernel calculations

#### Kernelizing various cost functions

Here we derive the kernelization of the three core classification models: softmax cost/logistic regression, soft-margin SVMs, and the multiclass softmax classifier. Although we will only describe how to kernelize the $\ell_{2}$ regularizer along with the SVM model, precisely the same argument can be made in combination with either the two-class or multiclass softmax classifiers. As with the derivation
for Least Squares regression shown in the Section on kernelizing cost functions,
here the main tool for kernelizing these models is again the *fundamental theorem of linear algebra* described earlier.

Throughout this Section we will suppose that an arbitrary $M$ dimensional fixed feature transformation has been applied to the input of $P$ points $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$,
giving feature vectors $\mathbf{f}_{p}=\left[\begin{array}{cccc}
f_{1}\left(\mathbf{x}_{p}\right) & f_{2}\left(\mathbf{x}_{p}\right) & \cdots & f_{M}\left(\mathbf{x}_{p}\right)\end{array}\right]^{T}$ for each $\mathbf{x}_{p}$.

#### <span style="color:#a50e3e;">Example 4. </span> Kernelizing two-class softmax classification/logistic regression

Recall that the softmax perceptron cost function with fixed feature
mapped input is given as

\begin{equation}
g\left(b,\mathbf{w}\right)=\underset{p=1}{\overset{P}{\sum}}\mbox{log}\left(1+e^{-y_{p}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}\right)}\right)
\end{equation}

Using the fundamental theorem of linear algebra for any $\mathbf{w}$
we can then write $\mathbf{w}=\mathbf{F}\mathbf{z}+\mathbf{r}$ where
$\mathbf{F}^{T}\mathbf{r}=\mathbf{0}_{P\times1}$. Making this substitution
into the above and simplifying gives

\begin{equation}
g\left(b,\mathbf{z}\right)=\begin{aligned}\underset{p=1}{\overset{P}{\sum}}\mbox{log}\left(1+e^{-y_{p}\left(b+\mathbf{f}_{p}^{T}\mathbf{F}\mathbf{z}\right)}\right)\end{aligned}
\end{equation}

and denoting the kernel matrix $\mathbf{H}=\mathbf{F}^{T}\mathbf{F}$
(where $\mathbf{h}_{p}=\mathbf{F}^{T}\mathbf{f}_{p}$ is the $p^{th}$
column of $\mathbf{H}$) we can then write the above in kernelized
form as

\begin{equation}
g\left(b,\mathbf{z}\right)=\begin{aligned}\underset{p=1}{\overset{P}{\sum}}\mbox{log}\left(1+e^{-y_{p}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}\right)}\right)\end{aligned}
\end{equation}

This is the kernelized form of logistic regression shown in Table 1.
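As a usage illustration, the kernelized softmax cost can be minimized directly over $\left(b,\mathbf{z}\right)$ with plain gradient descent. The sketch below makes several assumptions that are not from the text: a toy dataset with a circular class boundary, a degree two polynomial kernel, labels $y_p \in \{-1,+1\}$, and a fixed steplength equal to the reciprocal of a standard Lipschitz bound for this cost.

```python
import numpy as np

rng = np.random.default_rng(3)
P = 40
X = rng.uniform(-1, 1, size=(2, P))                     # N = 2 inputs as columns
y = np.where(np.sum(X ** 2, axis=0) < 0.5, 1.0, -1.0)   # circular class boundary

H = (1.0 + X.T @ X) ** 2 - 1.0                          # degree 2 polynomial kernel


def cost(b, z):
    # log(1 + exp(-s)) evaluated stably as logaddexp(0, -s)
    return np.sum(np.logaddexp(0.0, -y * (b + H @ z)))


def gradient(b, z):
    s = y * (b + H @ z)
    # derivative of each summand with respect to b + h_p^T z is -y_p / (1 + e^{s_p}),
    # written with tanh to avoid overflow in the exponential
    c = -y * 0.5 * (1.0 - np.tanh(0.5 * s))
    return np.sum(c), H @ c


# steplength 1/L, using the bound L = ||A||_2^2 / 4 with rows of A equal to [1, h_p^T]
A = np.column_stack([np.ones(P), H])
step = 4.0 / np.linalg.norm(A, 2) ** 2

b, z = 0.0, np.zeros(P)
history = [cost(b, z)]
for _ in range(200):
    gb, gz = gradient(b, z)
    b, z = b - step * gb, z - step * gz
    history.append(cost(b, z))

print(history[0], history[-1])
```

Only the $P \times P$ matrix `H` is ever formed here; the explicit polynomial features never appear.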

#### <span style="color:#a50e3e;">Example 5. </span> Kernelizing soft-margin SVM/regularized margin-perceptron

Recall the soft-margin SVM cost/regularized margin-perceptron cost

\begin{equation}
g\left(b,\mathbf{w}\right)=\underset{p=1}{\overset{P}{\sum}}\mbox{max}^{2}\left(0,1-y_{p}^{\,}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}\right)\right)+\lambda\left\Vert \mathbf{w}\right\Vert _{2}^{2}
\end{equation}

Applying the fundamental theorem of linear algebra we may then write
$\mathbf{w}$ as $\mathbf{w}=\mathbf{F}\mathbf{z}+\mathbf{r}$ where
$\mathbf{F}^{T}\mathbf{r}=\mathbf{0}_{P\times1}$. Denoting $\mathbf{H}=\mathbf{F}^{T}\mathbf{F}$ as the kernel matrix,
the cross terms vanish because $\mathbf{F}^{T}\mathbf{r}=\mathbf{0}_{P\times1}$, and so $\mathbf{w}^{T}\mathbf{w}=\left(\mathbf{F}\mathbf{z}+\mathbf{r}\right)^{T}\left(\mathbf{F}\mathbf{z}+\mathbf{r}\right)=\mathbf{z}^{T}\mathbf{F}^{T}\mathbf{F}\mathbf{z}+\mathbf{r}^{T}\mathbf{r}=\mathbf{z}^{T}\mathbf{H}\mathbf{z}+\left\Vert \mathbf{r}\right\Vert _{2}^{2}$.
Substituting this into the cost we may rewrite the above equivalently as

\begin{equation}
\begin{aligned}g\left(b,\mathbf{z},\mathbf{r}\right)=\underset{p=1}{\overset{P}{\sum}}\mbox{max}^{2}\left(0,1-y_{p}^{\,}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}\right)\right)+\lambda\mathbf{z}^{T}\mathbf{H}\mathbf{z}+\lambda\left\Vert \mathbf{r}\right\Vert _{2}^{2}\end{aligned}
\end{equation}

Notice that since we are aiming to minimize the quantity above over
$\left(b,\mathbf{z},\mathbf{r}\right)$, and since the only term in which
$\mathbf{r}$ remains is $\lambda\left\Vert \mathbf{r}\right\Vert _{2}^{2}$,
the optimal value of $\mathbf{r}$ is zero, for otherwise the
value of the cost function would be larger than necessary. Therefore
we can ignore $\mathbf{r}$ and write the cost function above in kernelized
form as

\begin{equation}
g\left(b,\mathbf{z}\right)=\underset{p=1}{\overset{P}{\sum}}\mbox{max}^{2}\left(0,1-y_{p}^{\,}\left(b+\mathbf{h}_{p}^{T}\mathbf{z}\right)\right)+\lambda\mathbf{z}^{T}\mathbf{H}\mathbf{z}
\end{equation}

as originally shown in Table 1.
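The role of the leftover term $\lambda\left\Vert \mathbf{r}\right\Vert _{2}^{2}$ can be seen numerically. This is a minimal sketch with a random stand-in for $\mathbf{F}$; the function names `svm_cost` and `svm_cost_kernelized` and the value of `lam` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
M, P = 9, 4
F = rng.standard_normal((M, P))   # column p holds the feature vector f_p
y = rng.choice([-1.0, 1.0], size=P)
H = F.T @ F                       # kernel matrix
lam = 0.1


def svm_cost(b, w):
    margins = np.maximum(0.0, 1.0 - y * (b + F.T @ w))
    return np.sum(margins ** 2) + lam * (w @ w)


def svm_cost_kernelized(b, z):
    margins = np.maximum(0.0, 1.0 - y * (b + H @ z))
    return np.sum(margins ** 2) + lam * (z @ H @ z)


z = rng.standard_normal(P)
v = rng.standard_normal(M)
r = v - F @ np.linalg.lstsq(F, v, rcond=None)[0]   # component with F^T r = 0

# with r included, the original cost exceeds the kernelized one by lam * ||r||^2
gap = svm_cost(0.2, F @ z + r) - svm_cost_kernelized(0.2, z)
print(np.isclose(gap, lam * (r @ r)))

# setting r = 0 closes the gap
print(np.isclose(svm_cost(0.2, F @ z), svm_cost_kernelized(0.2, z)))
```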

#### <span style="color:#a50e3e;">Example 6. </span>  Kernelizing the multiclass softmax loss

Recall that the multiclass softmax cost function is written as

\begin{equation}
g\left(b_{1},...,b_{C},\mathbf{w}_{1},...,\mathbf{w}_{C}\right)=\underset{c=1}{\overset{C}{\sum}}\underset{p\in\Omega_{c}}{\sum}\mbox{log}\left(1+\underset{\underset{j\neq c}{j=1}}{\overset{C}{\sum}}e^{\left(b_{j}^{\,}-b_{c}^{\,}\right)+\mathbf{f}_{p}^{T}\left(\mathbf{w}_{j}^{\,}-\mathbf{w}_{c}^{\,}\right)}\right)
\end{equation}

Rewriting each $\mathbf{w}_{j}$ as $\mathbf{w}_{j}=\mathbf{F}\mathbf{z}_{j}+\mathbf{r}_{j}$
where $\mathbf{F}^{T}\mathbf{r}_{j}=\mathbf{0}_{P\times1}$ for all
$j$ we can rewrite each $\mathbf{f}_{p}^{T}\left(\mathbf{w}_{j}^{\,}-\mathbf{w}_{c}^{\,}\right)$
term as $\mathbf{f}_{p}^{T}\left(\mathbf{w}_{j}^{\,}-\mathbf{w}_{c}^{\,}\right)=\mathbf{f}_{p}^{T}\left(\mathbf{F}\left(\mathbf{z}_{j}-\mathbf{z}_{c}\right)+\left(\mathbf{r}_{j}-\mathbf{r}_{c}\right)\right)=\mathbf{f}_{p}^{T}\mathbf{F}\left(\mathbf{z}_{j}-\mathbf{z}_{c}\right)$.
Denoting by $\mathbf{H}=\mathbf{F}^{T}\mathbf{F}$ the kernel matrix
we then have that $\mathbf{f}_{p}^{T}\left(\mathbf{w}_{j}^{\,}-\mathbf{w}_{c}^{\,}\right)=\mathbf{h}_{p}^{T}\left(\mathbf{z}_{j}-\mathbf{z}_{c}\right)$, and so the cost may be written equivalently in kernelized form as

\begin{equation}
g\left(b_{1},...,b_{C},\mathbf{z}_{1},...,\mathbf{z}_{C}\right)=\underset{c=1}{\overset{C}{\sum}}\underset{p\in\Omega_{c}}{\sum}\mbox{log}\left(1+\underset{\underset{j\neq c}{j=1}}{\overset{C}{\sum}}e^{\left(b_{j}^{\,}-b_{c}^{\,}\right)+\mathbf{h}_{p}^{T}\left(\mathbf{z}_{j}-\mathbf{z}_{c}\right)}\right)
\end{equation}

as shown in Table 1.

#### Fourier kernel calculations - scalar input

From Example 2 the $\left(i,j\right)^{th}$ element
of the kernel matrix $\mathbf{H}$ is given as

\begin{equation}
\begin{array}{c}
\mathbf{H}_{ij}=2\underset{m=1}{\overset{D}{\sum}}\left[\mbox{cos}\left(2\pi mx_{i}\right)\mbox{cos}\left(2\pi mx_{j}\right)+\mbox{sin}\left(2\pi mx_{i}\right)\mbox{sin}\left(2\pi mx_{j}\right)\right]\end{array}
\end{equation}

Writing this using the complex exponential notation (see the corresponding exercise
on the complex Fourier representation), where the $i$ multiplying $2\pi$ denotes the imaginary unit, we have equivalently

\begin{equation}
\begin{array}{c}
\mathbf{H}_{ij}=\underset{m=-D}{\overset{D}{\sum}}e^{2\pi im\left(x_{i}-x_{j}\right)}-1\end{array}
\end{equation}

If $x_{i}-x_{j}$ is an integer then $e^{2\pi im\left(x_{i}-x_{j}\right)}=1$
and so clearly the above sums to $2D$. Supposing this is not the
case, examining the summation alone we may write

\begin{equation}
\underset{m=-D}{\overset{D}{\sum}}e^{2\pi im\left(x_{i}-x_{j}\right)}=e^{-2\pi iD\left(x_{i}-x_{j}\right)}\underset{m=0}{\overset{2D}{\sum}}e^{2\pi im\left(x_{i}-x_{j}\right)}
\end{equation}

Now, the sum on the right hand side above is a geometric series, thus
we have the above is equal to

\begin{equation}
e^{-2\pi iD\left(x_{i}-x_{j}\right)}\frac{1-e^{2\pi i\left(x_{i}-x_{j}\right)\left(2D+1\right)}}{1-e^{2\pi i\left(x_{i}-x_{j}\right)}}=\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{i}-x_{j}\right)\right)}{\mbox{sin}\left(\pi\left(x_{i}-x_{j}\right)\right)}
\end{equation}

where the final equality follows from the definition of the complex exponential.
Because in the limit as $t$ approaches any integer value the ratio $\frac{\mbox{sin}\left(\left(2D+1\right)\pi t\right)}{\mbox{sin}\left(\pi t\right)}$ tends to $2D+1$,
which one can show using L'Hospital's rule from basic calculus, we
may therefore generally write in conclusion that

\begin{equation}
\mathbf{H}_{ij}=\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{i}-x_{j}\right)\right)}{\mbox{sin}\left(\pi\left(x_{i}-x_{j}\right)\right)}-1
\end{equation}

where at integer values of the input it is defined by the associated limit.
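The geometric series step can be sanity checked numerically by comparing the complex exponential sum with the sine ratio at a few non-integer values of $t=x_{i}-x_{j}$. This is a small sketch; the function names `exponential_sum` and `sine_ratio` and the test values are assumptions.

```python
import numpy as np


def exponential_sum(t, D):
    """Sum of e^{2 pi i m t} over m = -D, ..., D."""
    m = np.arange(-D, D + 1)
    return np.sum(np.exp(2j * np.pi * m * t))


def sine_ratio(t, D):
    """Closed form sin((2D + 1) pi t) / sin(pi t), valid for non-integer t."""
    return np.sin((2 * D + 1) * np.pi * t) / np.sin(np.pi * t)


D = 4
for t in (0.13, -0.42, 2.7):
    s = exponential_sum(t, D)
    # the imaginary part cancels, leaving a real value that matches the ratio
    print(t, s.real, sine_ratio(t, D))

# close to an integer the ratio approaches its limit 2D + 1
print(sine_ratio(1.0 + 1e-6, D))
```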

#### Fourier kernel calculations - vector input

Like the multidimensional polynomial basis element,
with the complex exponential notation for a general $N$ dimensional
input each Fourier basis element takes the form $f_{\mathbf{m}}\left(\mathbf{x}\right)=e^{2\pi im_{1}x_{1}}e^{2\pi im_{2}x_{2}}\cdots e^{2\pi im_{N}x_{N}}=e^{2\pi i\mathbf{m}^{T}\mathbf{x}}$
where $\mathbf{m}=\left[\begin{array}{cccc}
m_{1} & m_{2} & \cdots & m_{N}\end{array}\right]^{T}$, a product of one dimensional basis elements. Further, a 'degree $D$'
sum contains all such basis elements where $-D\leq m_{1},\,m_{2},\,\cdots,\,m_{N}\leq D$,
and one may deduce that there are $M=\left(2D+1\right)^{N}-1$ non-constant
basis elements in this sum.

The corresponding $\left(i,j\right)$th entry of the kernel matrix
in this instance takes the form

\begin{equation}
\mathbf{H}_{ij}=\mathbf{f}{}_{i}^{T}\overline{\mathbf{f}_{j}}=\left(\underset{-D\leq m_{1},\,m_{2},\,\cdots,\,m_{N}\leq D}{\sum}e^{2\pi i\mathbf{m}^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{j}\right)}\right)-1
\end{equation}

Since $e^{a+b}=e^{a}e^{b}$ we may write each summand above as $e^{2\pi i\mathbf{m}^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{j}\right)}=\underset{n=1}{\overset{N}{\prod}}e^{2\pi im_{n}\left(x_{in}-x_{jn}\right)}$,
and the entire summation as

\begin{equation}
\underset{-D\leq m_{1},\,m_{2},\,\cdots,\,m_{N}\leq D}{\sum}\,\,\,\underset{n=1}{\overset{N}{\prod}}e^{2\pi im_{n}\left(x_{in}-x_{jn}\right)}
\end{equation}

Finally one can show that the above can be written simply as

\begin{equation}
\underset{-D\leq m_{1},\,m_{2},\,\cdots,\,m_{N}\leq D}{\sum}\,\,\,\underset{n=1}{\overset{N}{\prod}}e^{2\pi im_{n}\left(x_{in}-x_{jn}\right)}=\underset{n=1}{\overset{N}{\prod}}\left(\underset{m=-D}{\overset{D}{\sum}}e^{2\pi im\left(x_{in}-x_{jn}\right)}\right)
\end{equation}

Since we already have that $\underset{m=-D}{\overset{D}{\sum}}e^{2\pi im\left(x_{in}-x_{jn}\right)}=\frac{\sin\left(\left(2D+1\right)\pi\left(x_{in}-x_{jn}\right)\right)}{\sin\left(\pi\left(x_{in}-x_{jn}\right)\right)}$,
the $\left(i,j\right)$th entry of the kernel matrix can easily be
calculated as

\begin{equation}
\mathbf{H}_{ij}=\underset{n=1}{\overset{N}{\prod}}\frac{\mbox{sin}\left(\left(2D+1\right)\pi\left(x_{in}-x_{jn}\right)\right)}{\mbox{sin}\left(\pi\left(x_{in}-x_{jn}\right)\right)}-1
\end{equation}
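For small $N$ and $D$ the product formula can be checked against a brute-force sum over every index vector $\mathbf{m}$. The sketch below is illustrative: the function names are hypothetical, integer coordinate differences are detected with a floating point tolerance, and the brute-force loop visits $\left(2D+1\right)^{N}$ terms, so it is only feasible for tiny cases.

```python
import itertools
import numpy as np


def fourier_kernel_bruteforce(xi, xj, D):
    """Sum e^{2 pi i m^T (xi - xj)} over all m in {-D, ..., D}^N, then subtract 1."""
    t = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    total = 0.0 + 0.0j
    for m in itertools.product(range(-D, D + 1), repeat=len(t)):
        total += np.exp(2j * np.pi * np.dot(m, t))
    return total.real - 1.0


def fourier_kernel_product(xi, xj, D):
    """Product over coordinates of the sine ratios, minus 1."""
    t = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    at_integer = np.isclose(t, np.round(t))
    den = np.where(at_integer, 1.0, np.sin(np.pi * t))
    # coordinates with an integer difference contribute the limit 2D + 1
    ratio = np.where(at_integer, 2 * D + 1, np.sin((2 * D + 1) * np.pi * t) / den)
    return np.prod(ratio) - 1.0


xi = np.array([0.2, 0.9, 0.5])
xj = np.array([0.7, 0.9, 0.1])   # the second coordinate difference is zero
D = 2
print(np.isclose(fourier_kernel_bruteforce(xi, xj, D),
                 fourier_kernel_product(xi, xj, D)))
```

The product form needs only $N$ sine ratios per kernel entry, in contrast to the exponentially many terms of the explicit sum.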

    import numpy as np
    import matplotlib.pyplot as plt


    def kernel_visualizer():
        # kernel parameters and the fixed center point x_p
        beta = 10
        poly_deg = 2
        fourier_deg = 3
        o = np.array([0.5, 0.5])

        # evaluation grid over the unit square [0, 1]^2, one grid point per row
        s = np.linspace(0, 1, 50)
        s1, s2 = np.meshgrid(s, s)
        pts = np.stack([s1.ravel(), s2.ravel()], axis=1)

        types = ['poly', 'Fourier', 'Gauss']
        fig = plt.figure(figsize=(12, 4), facecolor='w')

        for k, t in enumerate(types):
            if t == 'poly':
                # polynomial kernel (1 + x^T x_p)^D; the constant -1 only shifts the surface
                f = (pts @ o + 1) ** poly_deg
            elif t == 'Fourier':
                # Fourier kernel: product over coordinates of the sine ratios
                dist = pts - o
                num = np.sin((2 * fourier_deg + 1) * np.pi * dist)
                den = np.sin(np.pi * dist)
                # where the denominator vanishes use the limit 2D + 1
                zero = np.isclose(den, 0)
                ratio = np.full_like(dist, 2 * fourier_deg + 1)
                ratio[~zero] = num[~zero] / den[~zero]
                f = np.prod(ratio, axis=1)
            else:
                # RBF (Gaussian) kernel
                d = pts - o
                f = np.exp(-beta * np.sum(d * d, axis=1))

            # plot the resulting surface in 3d space
            f = f.reshape(s1.shape)
            ax = fig.add_subplot(1, 3, k + 1, projection='3d')
            ax.plot_surface(s1, s2, f, cmap='winter')
            ax.set_xlabel('$x_1$', fontsize=18)
            ax.set_ylabel('$x_2$', fontsize=18)
            ax.set_title(t)

        plt.show()

<a id='bib_cell'></a>

## References

[1] Léon Bottou and Chih-Jen Lin. Support vector machine solvers. Large Scale Kernel Machines, pp. 301–320, MIT Press, 2007.

[2] Zhiyun Lu, Avner May, Kuan Liu, et al. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.

[3] Jeffrey Pennington, Felix Yu, and Sanjiv Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, pp. 1837–1845, NIPS, 2015.

[4] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, NIPS, 2007.

[5] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pp. 555–561. IEEE, 2008.

[6] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[7] David J. C. MacKay. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168:133–166, 1998.