## Chapter 12: Kernel methods

# 12.4  Optimization of kernelized models

We have seen that virtually any machine learning model - supervised or unsupervised - can be kernelized.  The real value of kernelization is that - for a large range of kernel features - we can construct the kernel matrix $\mathbf{H}$ *without* explicitly defining the underlying feature transformations themselves, and can even define new kernels directly via their kernel matrix (e.g., the RBF kernel).  As we have seen, this lets us sidestep the poor scaling of such features with the input dimension.  Moreover, because the final kernelized model remains linear in its parameters (the kernels themselves having no internal parameters tuned during optimization), the corresponding kernelized cost functions are quite 'nice' in terms of their general shape.  For example, any convex cost function for regression and classification *remains convex when kernelized*.  This allows virtually any optimization method to be used to tune a kernelized supervised learner - from zero order and first order methods to powerful second order approaches like Newton's method.
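As a rough illustration of these two points (building $\mathbf{H}$ directly from the data, and the model staying linear in its parameters), here is a minimal NumPy sketch.  The function names, the one-point-per-row data layout, the width parameter `beta`, and the choice of a Least Squares cost are all assumptions made for this example rather than notation fixed by the text.

```python
import numpy as np

def rbf_kernel_matrix(X, beta=1.0):
    # X has shape (P, N): one training point per row (an assumed layout).
    # Entry (i, j) of the result is exp(-beta * ||x_i - x_j||^2).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-beta * sq_dists)

def kernel_least_squares(b, z, H, y):
    # The kernelized model at training point p is b + h_p^T z, where h_p is
    # the p-th column of H.  It is linear in (b, z), so this cost is convex.
    residual = b + H.T @ z - y
    return np.mean(residual**2)

# Small synthetic example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
H = rbf_kernel_matrix(X, beta=0.5)
cost_at_zero = kernel_least_squares(0.0, np.zeros(50), H, y)
```

Note that no feature transformation is ever written down: the kernel matrix is built from pairwise distances alone, and the cost depends on the data only through $\mathbf{H}$ and the labels.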

However, because a kernel matrix $\mathbf{H}$ is of size $P \times P$ - where $P$ is the size of the training set - it inherently scales very poorly in the size of the training data.  For example, with $P=10,000$ the corresponding kernel matrix is of size $10,000 \times 10,000$, with $10^8$ values to store (roughly 800 MB in double precision).  Because storage grows quadratically in $P$, a dataset with $P=1,000,000$ already yields $10^{12}$ values (roughly 8 TB), far more than a typical computer can hold in memory at once.  This makes training kernelized models extremely challenging on large datasets, since the computation required by even comparatively cheap optimization schemes such as gradient descent grows rapidly with the size of the kernel matrix.  Even predictions with a kernelized model - which, as we saw, require evaluating the kernel against *every training datapoint* - become more expensive as the size of the training data increases.
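The quadratic growth in storage can be made concrete with a little arithmetic.  The helper below is a hypothetical convenience function (not part of any library), and it assumes a dense matrix stored at 8 bytes per entry, i.e., double precision.

```python
def kernel_matrix_gigabytes(P, bytes_per_entry=8):
    # A dense P x P kernel matrix has P**2 entries; report its size in
    # gigabytes (1 GB taken as 10**9 bytes).
    return P * P * bytes_per_entry / 1e9

for P in (10_000, 100_000, 1_000_000):
    print(f"P = {P:>9,}: about {kernel_matrix_gigabytes(P):,.1f} GB")
```

Each tenfold increase in $P$ multiplies the storage requirement by one hundred.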

Most standard ways of dealing with this crippling scaling issue revolve around avoiding the creation of the entire kernel matrix $\mathbf{H}$ at once, especially during training.  For example, one can use first order methods such as stochastic gradient descent so that only a small number of training points are dealt with at a time, meaning that only a small subset of the columns of $\mathbf{H}$ ever needs to exist concurrently during training.  Sometimes the particular structure of a problem allows explicit kernel construction to be avoided altogether [[3,4,5]](#bib_cell).
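The column-at-a-time idea can be sketched as follows for a kernelized Least Squares regressor with an RBF kernel.  This is a minimal illustration under stated assumptions, not a tuned implementation: the function names, the one-point-per-row layout, the fixed steplength, and the toy dataset are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_columns(X, batch_idx, beta=1.0):
    # Build only the P x B block of H whose columns belong to the batch.
    diffs = X[:, None, :] - X[batch_idx][None, :, :]   # shape (P, B, N)
    return np.exp(-beta * np.sum(diffs**2, axis=2))    # shape (P, B)

def minibatch_kernel_least_squares(X, y, beta=1.0, batch_size=10,
                                   step=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    P = X.shape[0]
    b, z = 0.0, np.zeros(P)
    for _ in range(epochs):
        order = rng.permutation(P)
        for start in range(0, P, batch_size):
            idx = order[start:start + batch_size]
            H_batch = rbf_columns(X, idx, beta)        # never the full P x P
            residual = b + H_batch.T @ z - y[idx]      # model minus labels
            # Gradient of the mean squared residual over the batch.
            b -= step * 2.0 * np.mean(residual)
            z -= step * 2.0 * (H_batch @ residual) / len(idx)
    return b, z

# Toy one-dimensional regression problem (hypothetical data).
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
b, z = minibatch_kernel_least_squares(X, y, beta=10.0)
```

At any moment only a $P \times B$ block of the kernel matrix is held in memory, where $B$ is the batch size.  The weight vector `z` still has $P$ entries, however, and a prediction still requires kernel evaluations against every training point, so this addresses the storage of $\mathbf{H}$ during training rather than the cost of prediction.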