Feature extraction is often neglected in ML courses, yet
it is often one of the main bottlenecks in the ML pipeline.

A _feature template_ is a group of features all computed in the same way.
It is basically a feature description with blanks in it; each way of filling in the blanks gives one feature.

e.g.
- "string length is greater than [blank]"
- "string contains the character [blank]"
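
As a minimal sketch of how the two templates above could be filled in, the function below produces one feature per choice of blank. The function name, the thresholds, and the character set are all assumptions made for illustration.

```python
def extract_features(s, length_thresholds=(5, 10), alphabet="@._"):
    """Fill in the blanks of two feature templates for the string s.

    The thresholds and alphabet are hypothetical choices for the blanks.
    """
    features = {}
    # Template: "string length is greater than [blank]"
    for t in length_thresholds:
        features[f"length>{t}"] = int(len(s) > t)
    # Template: "string contains the character [blank]"
    for c in alphabet:
        features[f"contains:{c}"] = int(c in s)
    return features


example = extract_features("abc@gmail.com")
```

Each template expands into several concrete features here (two length features and three character features), so a handful of templates can describe a large feature vector.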


# Hypothesis Class

Suppose you have a particular form of predictor in mind,
like a linear classifier with score $f_w(x) = w^T \phi(x)$.

Now you are trying to pick a feature extraction function $\phi$.

One thing to consider is the _hypothesis class_ induced by your choice of $\phi$.

The hypothesis class corresponding to $\phi$ is the set $\{f_w\ |\ w\in\mathbb{R}^d\}$, where $d$ is the number of features.

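Note that linear classifiers can produce crazy nonlinear decision boundaries in the original input space, all depending on how $\phi$ is chosen. The score is linear in $w$ and in $\phi(x)$, but it need not be linear in $x$.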
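
To make this concrete, here is a small sketch with a hypothetical feature map $\phi(x) = (1, x_1^2 + x_2^2)$ and a hand-picked weight vector (both are assumptions, not something learned from data). The classifier is linear in $\phi(x)$, yet its decision boundary in the input space is the unit circle.

```python
import numpy as np


def phi(x):
    """Hypothetical feature map: (x1, x2) -> (1, x1^2 + x2^2)."""
    x = np.asarray(x, dtype=float)
    return np.array([1.0, x @ x])


# Hand-picked weights (an assumption): score = 1 - ||x||^2
w = np.array([1.0, -1.0])


def predict(x):
    """Linear classifier on phi(x): +1 inside the unit circle, -1 outside."""
    return 1 if w @ phi(x) >= 0 else -1
```

With the identity feature map $\phi(x) = x$ no choice of $w$ could produce a circular boundary, so the choice of $\phi$ changes which predictors are in the hypothesis class.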


# Neural Networks

Neural networks are basically linear predictors strung together with nonlinear activation functions. Each layer solves a bunch of sub-problems, and the output of each layer acts as the features for the next layer.
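
A rough sketch of that idea for a two-layer network, assuming a sigmoid activation and hand-picked weights (the names `V`, `w`, and `two_layer_predict` are illustrative, not from any particular library):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def two_layer_predict(x, V, w):
    """Each row of V is a linear predictor for one sub-problem."""
    h = sigmoid(V @ x)  # hidden layer: linear predictors plus activation
    return w @ h        # output layer: linear predictor on the features h


# Hypothetical weights, chosen only for illustration
V = np.array([[1.0, -1.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0])
score = two_layer_predict(np.array([0.0, 0.0]), V, w)
```

The hidden vector `h` plays the role that $\phi(x)$ plays for a linear predictor, except that it is computed from weights that can be learned.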

Taking the gradient of the loss function can be done in a systematic way as follows:
- Create computation graph for expression, where nodes are building blocks like $+$ and $\cdot$ and $\sigma$, etc.
- Label edges with partial derivatives of the building blocks wrt their inputs.
- Observe that the chain rule is now simply multiplication of edges along paths. In other words, if you want to know the derivative of one node's value wrt one of its descendants, just multiply the edge values along the path joining them (and if several paths join them, sum the products over all paths).
- Define forward values, $f_i$'s, to be the values on the nodes, i.e. the values of subexpressions. Define backward values, $g_i$'s, to be the derivatives $\frac{\partial \text{output}}{\partial f_i}$, i.e. how changing that node influences the root.
- Backpropagation algorithm:
    - Forward pass: Compute each $f_i$, leaves to root. So basically just compute the top-level expression and make sure to remember the values of all subexpressions along the way.
    - Backward pass: Compute each $g_i$, root to leaves. This is just a matter of traveling down paths to the nodes you want a derivative wrt, multiplying the derivative expressions you labeled on the edges along the way, which you can now evaluate because you have all the $f_i$'s in place.
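
The two passes can be sketched by hand for one small expression. The example below assumes a squared loss on a sigmoid of a scalar score, $(\sigma(w x) - y)^2$; the expression and the function name `loss_and_grad` are choices made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, x, y):
    """Return the loss (sigmoid(w*x) - y)^2 and its derivative wrt w."""
    # Forward pass: compute and remember every subexpression, leaves to root
    f1 = w * x          # product node
    f2 = sigmoid(f1)    # activation node
    f3 = f2 - y         # residual node
    f4 = f3 ** 2        # root: squared loss

    # Backward pass: g_i = d(output)/d(f_i), root to leaves,
    # multiplying by the edge derivative at each step
    g4 = 1.0
    g3 = g4 * (2 * f3)           # d(f3^2)/d(f3) = 2 * f3
    g2 = g3 * 1.0                # d(f2 - y)/d(f2) = 1
    g1 = g2 * (f2 * (1 - f2))    # d(sigmoid(f1))/d(f1)
    gw = g1 * x                  # d(w * x)/d(w) = x
    return f4, gw
```

A quick sanity check for code like this is to compare `gw` against a finite-difference estimate of the derivative at a few values of `w`.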

# Nearest Neighbors

Store your _entire_ training data set. Given a new input $x$, find the data point in your training set whose input is nearest to $x$. The associated output is then your prediction. That's it! With spatial input and Euclidean distance, the resulting decision regions form a Voronoi diagram.

There is no pre-defined set of parameters: each new data point is its own parameter. Model complexity scales with the size of the data set. Such a model is referred to as a _non-parametric model_.
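
A minimal sketch of the procedure, assuming Euclidean distance and a brute-force search over the stored points (the class name and the toy data are made up for illustration):

```python
import numpy as np


class NearestNeighbor:
    """1-nearest-neighbor predictor; 'learning' is just storing the data."""

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)  # stored training inputs
        self.y = np.asarray(y)               # stored training outputs
        return self

    def predict(self, x):
        # Euclidean distance from x to every stored training input
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return self.y[np.argmin(dists)]


# Toy data, chosen only for illustration
model = NearestNeighbor().fit([[0, 0], [1, 1], [5, 5]], ["a", "b", "c"])
```

Note that `predict` scans the whole training set, so in this brute-force form the cost of a single prediction grows with the number of stored points.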

---

# Summary

- Linear predictors are fast and easy to learn, but make weak use of features.
- Neural nets are fast and hard to learn, but make powerful use of features.
- Nearest neighbor is slow and easy to learn, and makes powerful use of features.