# Kernelizing cost functions

Suppose that we have a dataset of $P$ points $\left\{\left(\mathbf{x}_p,y_p\right)\right\}_{p=1}^P$ where each input $\mathbf{x}_p$ has dimension $N$. Recall that when employing any fixed feature basis we tune the parameters $b$ and $\mathbf{w}$ by minimizing the Least Squares regression cost

\begin{equation}
g\left(b,\mathbf{w}\right) = \sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\mathbf{w}-y_p\right)^2
\end{equation}

where we have used the vector notation

\begin{equation}
\mathbf{f}_{p}=\left[\begin{array}{c}
f_{1}\left(\mathbf{x}_{p}\right)\\
f_{2}\left(\mathbf{x}_{p}\right)\\
\vdots\\
f_{M}\left(\mathbf{x}_{p}\right)
\end{array}\right]
\end{equation}

to denote the $M$ fixed basis feature transformations of the input $\mathbf{x}_p$. 

Denote by $\mathbf{F}$ the $M \times P$ matrix formed by stacking the vectors $\mathbf{f}_p$ column-wise. Employing the fundamental theorem of linear algebra discussed in the previous section, we may write $\mathbf{w}$ here as

\begin{equation}
\mathbf{w} = \mathbf{F}\mathbf{z}+\mathbf{r}
\end{equation}

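where $\mathbf{z}$ is a $P \times 1$ vector and $\mathbf{r}$ satisfies $\mathbf{F}^T\mathbf{r}=\mathbf{0}_{P \times 1}$. Since each $\mathbf{f}_p$ is a column of $\mathbf{F}$, this condition says that $\mathbf{f}_p^T\mathbf{r} = 0$ for every $p$, so the term involving $\mathbf{r}$ vanishes when we plug this representation of $\mathbf{w}$ back into the cost function, giving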

\begin{equation}
\sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\left(\mathbf{F}\mathbf{z}+\mathbf{r}\right)-y_p\right)^2 = \sum_{p=1}^{P} \left(b+\mathbf{f}_p^T\mathbf{F}\mathbf{z}-y_p\right)^2 
\end{equation}
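
As a quick numerical sanity check of this step, the following minimal sketch builds a random $\mathbf{F}$ and $\mathbf{w}$, splits $\mathbf{w}$ into $\mathbf{F}\mathbf{z}+\mathbf{r}$, and confirms that $\mathbf{r}$ contributes nothing to any inner product $\mathbf{f}_p^T\mathbf{w}$. The sizes `M` and `P`, the random data, and the use of `np.linalg.lstsq` to compute the projection are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 6, 4                      # illustrative sizes; M > P so that r can be nonzero
F = rng.standard_normal((M, P))  # columns play the role of f_1, ..., f_P
w = rng.standard_normal(M)       # an arbitrary weight vector

# Project w onto the column space of F, so that w = F z + r
z, *_ = np.linalg.lstsq(F, w, rcond=None)
r = w - F @ z

# r is orthogonal to every column of F, i.e. F^T r = 0
print(np.allclose(F.T @ r, 0.0))

# hence f_p^T w and f_p^T F z agree for every p
print(np.allclose(F.T @ w, F.T @ (F @ z)))
```

Both checks should hold up to floating-point tolerance, which is all the derivation requires: the cost only ever sees $\mathbf{w}$ through the products $\mathbf{f}_p^T\mathbf{w}$.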

Finally, denote by $\mathbf{H} = \mathbf{F}^T\mathbf{F}$ the symmetric $P \times P$ matrix referred to as a fixed basis *kernel matrix*, and by $\mathbf{h}_p = \mathbf{F}^T\mathbf{f}_p$ its $p^{th}$ column. With this notation our original cost function becomes equivalently

\begin{equation}
g\left(b,\mathbf{z}\right) = \sum_{p=1}^{P} \left(b+\mathbf{h}_p^T\mathbf{z}-y_p\right)^2
\end{equation}
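
The structure of the kernel matrix can be checked directly. The sketch below, again using an arbitrary random $\mathbf{F}$ with illustrative sizes (both assumptions, not values from the text), forms $\mathbf{H}$ and verifies that it is symmetric, that it is $P \times P$ regardless of $M$, and that $\mathbf{F}^T\mathbf{f}_p$ is indeed one of its columns.

```python
import numpy as np

rng = np.random.default_rng(2)
M, P = 5, 3                      # illustrative sizes only
F = rng.standard_normal((M, P))  # column p holds the feature vector f_p

H = F.T @ F                      # kernel matrix of pairwise inner products f_i^T f_j

p = 1                            # zero-based index of one data point
h_p = F.T @ F[:, p]              # h_p = F^T f_p

print(H.shape)                   # the size depends on P only, not on M
print(np.allclose(H, H.T))       # H is symmetric
print(np.allclose(h_p, H[:, p])) # h_p is the corresponding column of H
```

Because $\mathbf{H}$ is symmetric, $\mathbf{h}_p^T$ is also its $p^{th}$ row, which is convenient when assembling the kernelized cost in matrix form.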

Note that we have changed the arguments of the cost function from $g\left(b,\mathbf{w}\right)$ to $g\left(b,\mathbf{z}\right)$ due to our substitution of $\mathbf{w}$. The original problem of minimizing the Least Squares cost may now be written equivalently in this *kernelized form* as

\begin{equation}
\underset{b,\,\mathbf{z}}{\text{minimize}}\,\, \sum_{p=1}^{P} \left(b+\mathbf{h}_p^T\mathbf{z}-y_p\right)^2
\end{equation}
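
To illustrate the equivalence, here is a minimal sketch that solves both the original and the kernelized Least Squares problems on a small synthetic dataset and compares the resulting costs. Everything specific in it is an assumption chosen for illustration: the toy data, the monomial basis inside the hypothetical helper `features`, and the use of `np.linalg.lstsq` as the solver.

```python
import numpy as np

def features(x, M):
    # Hypothetical fixed basis: f_m(x) = x^m for m = 1, ..., M.
    # Returns the M x P matrix F whose p-th column is f_p.
    return np.vstack([x**m for m in range(1, M + 1)])

rng = np.random.default_rng(1)
P, M = 12, 3
x = np.linspace(-1.0, 1.0, P)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(P)  # toy regression targets

F = features(x, M)   # M x P
H = F.T @ F          # P x P kernel matrix
ones = np.ones((P, 1))

# Original form: minimize over (b, w); row p of the system is [1, f_p^T]
A_orig = np.hstack([ones, F.T])
theta_orig, *_ = np.linalg.lstsq(A_orig, y, rcond=None)
cost_orig = np.sum((A_orig @ theta_orig - y) ** 2)

# Kernelized form: minimize over (b, z); row p of the system is [1, h_p^T].
# H is rank deficient when M < P, so an explicit rcond cutoff is used here.
A_kern = np.hstack([ones, H])
theta_kern, *_ = np.linalg.lstsq(A_kern, y, rcond=1e-10)
cost_kern = np.sum((A_kern @ theta_kern - y) ** 2)

# Recover weights in the original feature space via w = F z
b, z = theta_kern[0], theta_kern[1:]
w = F @ z

print(cost_orig, cost_kern)
```

The two printed costs should agree up to numerical tolerance, and `b + F.T @ w` reproduces the kernelized predictions. Note that the kernelized problem has $P+1$ unknowns rather than $M+1$, so it is most attractive when $M$ is large relative to $P$.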

Using precisely the same argument given here we may kernelize all of the cost functions discussed in this book, including the softmax cost/logistic regression classifier, the squared margin-perceptron/soft-margin SVMs, the multiclass softmax cost function, as well as any $\ell_2$ regularized version of these models. We show both the original and kernelized forms of these formulae in Table 1 for easy reference.

<figure>
  <img src= '../../mlrefined_images/kernel_images/kernel_table.png' width="60%"/>
</figure>