## Chapter 11: Principles of Feature Learning

# 11.0  Introduction

In the previous Chapter we saw how to inject arbitrary nonlinear features into both supervised and unsupervised learning paradigms. However, the specific examples we examined there were rather simplistic, both in the sense that they were all low-dimensional (allowing us to visualize each dataset) and in that they contained common visual patterns that we could identify 'by eye'.  In this Chapter we introduce the fundamental tools and principles of nonlinear *feature learning*, a set of diverse tools for automating the process of engineering nonlinear features for arbitrary datasets.

## 11.0.1  The practical limitations of nonlinear feature engineering

In order to engineer nonlinear features properly we cannot in general rely on visualizations - most (particularly modern) datasets have far more than two inputs, making visualization impossible.  Moreover, even in cases where data visualization is possible, we cannot rely on our own pattern recognition skills either.  Take the two simple examples below - one (on the left) a regression dataset with $N=1$ dimensional input and the other (on the right) a two-class classification dataset with $N=2$ dimensional input.  Note that the true underlying nonlinear model used to generate the regression data is shown in dashed black, and likewise the true nonlinear decision boundary separating the two classes of the classification data is shown in dashed black as well.  We humans are typically taught only how to recognize the simplest of nonlinear patterns 'by eye' - like those created by elementary functions (e.g., low degree polynomials, exponential functions, sine waves) and simple shapes (e.g., circles, squares, etc.).   Neither of the patterns shown here matches such simple nonlinear functionality.  

<figure>
    <img src= '../../mlrefined_images/nonlinear_superlearn_images/nonlinear_combined.png' width="75%" height="75%" alt=""/>
<figcaption>   
<strong>Figure 1:</strong> <em> 
A regression dataset (on the left - with true nonlinear model shown in dashed black) and a two-class classification dataset (on the right - with true nonlinear decision boundary shown in dashed black) that clearly exhibit nonlinear behavior.  However, in each case this behavior is difficult to assess 'by eye', since neither belongs to the basic set of nonlinear patterns we humans are taught to recognize.  Therefore even in cases like these - where we can visualize data - it can be difficult if not impossible to properly engineer nonlinear features ourselves.
</em>  </figcaption> 
</figure>

So, as the two examples above illustrate, whether or not a dataset can be visualized, engineering proper nonlinear features ourselves can be difficult if not impossible.

## 11.0.2  The limitless potential of nonlinear feature learning

It is precisely this challenge which motivates the fundamental *feature learning* tools described in this Chapter.  *In short, these technologies automate the process of identifying appropriate nonlinear features for arbitrary datasets.*  With these tools in hand we no longer need to 'engineer' proper nonlinearities ourselves - we can learn their appropriate forms.  This is why the phrase *feature learning* is used to collectively describe these tools, since using them we may *learn* proper nonlinear features automatically as opposed to engineering them ourselves.  Compared to our own limited nonlinear pattern recognition abilities, feature learning tools can identify virtually any nonlinear pattern present in a dataset regardless of its input dimension.

## 11.0.3  Chapter summary

The aim to automate nonlinear learning is an ambitious one and is perhaps at first glance an intimidating one as well, for there is an infinite variety of nonlinearities / nonlinear functions to choose from.  How do we, in general, parse this infinitude automatically to determine the appropriate nonlinearity for a given dataset?  

The first step - as we will see in Section 11.1 - is to organize the pursuit of automation by first placing the fundamental building blocks of this infinitude into *manageable collections* of (relatively simple) nonlinear functions.  These collections are often called *universal approximators*, of which three strains are popularly used and which we introduce here: kernels, neural networks, and trees.  After introducing universal approximators we then discuss the fundamental concepts underlying how they are employed automatically - including a description of the *bias-variance tradeoff* in Section 11.2, the necessity for *validation error* as a measurement tool in Section 11.3, the automatic tuning of nonlinear capacity via *boosting and regularization* in Sections 11.4 and 11.5 respectively, the notion of *ensembling* in Section 11.6, and finally *testing error* in Section 11.7.
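To make the idea of a *collection* of simple nonlinear functions concrete, the following is a minimal sketch, not the method developed later in the Chapter. It assumes a toy one-dimensional regression dataset generated from a sine wave, and it uses plain polynomial units $x, x^2, \ldots, x^D$ as a stand-in collection. The function names (`polynomial_features`, `fit_linear_combination`, `predict`) and the dataset are hypothetical choices made only for illustration. The model is a bias plus a weighted sum of units, tuned by least squares.

```python
import numpy as np

def polynomial_features(x, degree):
    # Stack the units x, x**2, ..., x**degree as columns.
    # (Assumed stand-in collection, chosen only for illustration.)
    return np.stack([x**d for d in range(1, degree + 1)], axis=1)

def fit_linear_combination(x, y, degree):
    # Least-squares fit of a bias plus a weighted sum of the units.
    F = np.column_stack([np.ones_like(x), polynomial_features(x, degree)])
    weights, *_ = np.linalg.lstsq(F, y, rcond=None)
    return weights

def predict(x, weights):
    # The number of units is recovered from the weight vector (bias excluded).
    degree = len(weights) - 1
    F = np.column_stack([np.ones_like(x), polynomial_features(x, degree)])
    return F @ weights

# Assumed toy dataset: a noisy sine wave on [-1, 1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(x.size)

# Training error as more units from the collection are used.
errors = {}
for degree in (1, 3, 7):
    w = fit_linear_combination(x, y, degree)
    errors[degree] = float(np.mean((predict(x, w) - y) ** 2))
    print(f"units: {degree}  training MSE: {errors[degree]:.5f}")
```

Because each larger model contains the smaller ones, the training error here cannot increase as units are added. This says nothing about how well the model generalizes, which is why the Chapter goes on to discuss validation and testing error.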

&copy; This material is not to be distributed, copied, or reused without written permission from the authors.