# $\S$ 2.7. Structured Regression Models

### Review & motivation

We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data.

This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes.

## $\S$ 2.7.1. Difficulty of the Problem

Consider the RSS criterion for an arbitrary function $f$,

\begin{equation}
\text{RSS}(f) = \sum_{i=1}^N \left( y_i - f(x_i) \right)^2.
\end{equation}

Minimizing the RSS leads to infinitely many solutions: any function $\hat{f}$ passing through the training points $(x_i,y_i)$ is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points.
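To make the multiplicity concrete, here is a minimal sketch in NumPy. The training data, the helper `bump`, and the constant `c` are all made up for illustration: `bump` vanishes at every training input, so adding any multiple of it to an interpolant gives another interpolant with zero RSS but different predictions elsewhere.

```python
import numpy as np

# Hypothetical training set: 8 equally spaced inputs with noisy responses.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=x_train.size)

def rss(f, x, y):
    """Residual sum of squares of f on the pairs (x, y)."""
    return float(np.sum((y - f(x)) ** 2))

def bump(x):
    """Product of (x - x_i) over the training inputs; zero at every x_i."""
    x = np.asarray(x, dtype=float)
    return np.prod(x[..., None] - x_train, axis=-1)

def make_interpolant(c):
    """Piecewise-linear interpolant plus c * bump; interpolates for every c."""
    return lambda x: np.interp(x, x_train, y_train) + c * bump(x)

# Midpoints between training inputs serve as test points.
x_test = (x_train[:-1] + x_train[1:]) / 2

for c in (0.0, 5e4, -5e4):
    f = make_interpolant(c)
    print(f"c={c:+.0e}  RSS={rss(f, x_train, y_train):.2e}  "
          f"f(x_test[0])={f(x_test[0]):+.3f}")
```

Every value of `c` gives the same training RSS, so the criterion alone cannot choose among these functions.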

If there are multiple observation pairs $(x_i,y_{il})$, $l=1,\ldots,N_i$, at each value of $x_i$, the risk is limited. In this case, the solutions pass through the average values of the $y_{il}$ at each $x_i$ (Exercise 2.6). The situation is similar to the one we have already visited in $\S$ 2.4; indeed, the above RSS is the finite-sample version of the expected prediction error

\begin{equation}
\text{EPE}(f) = \text{E}\left( Y - f(X) \right)^2 = \int \left( y - f(x) \right)^2 \text{Pr}(dx, dy).
\end{equation}
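The claim that, with replicates at each input, the RSS minimizer passes through the per-input averages can be checked numerically. The sketch below uses made-up inputs and replicate counts. It treats the fitted values $f(x_i)$ as free parameters, sets them to the group means, and confirms that random perturbations only increase the RSS.

```python
import numpy as np

rng = np.random.default_rng(1)
x_unique = np.array([0.0, 0.5, 1.0])   # hypothetical distinct inputs x_i
counts = np.array([4, 3, 5])           # hypothetical replicate counts N_i
idx = np.repeat(np.arange(x_unique.size), counts)  # group index per observation
x = x_unique[idx]
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

def rss_of_values(values):
    """RSS when f(x_i) = values[i]; f is unconstrained away from the x_i."""
    return float(np.sum((y - values[idx]) ** 2))

# Average of the y_il at each x_i.
group_means = np.bincount(idx, weights=y) / counts
best = rss_of_values(group_means)

# Perturb the fitted values and record the resulting RSS each time.
perturbed_rss = [
    rss_of_values(group_means + rng.normal(scale=0.1, size=group_means.size))
    for _ in range(5)
]
print("RSS at group means:", best)
print("RSS after perturbations:", perturbed_rss)
```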

### Necessity & limit of the restriction

If the sample size $N$ were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.

In order to obtain useful results for finite $N$, we must restrict the eligible solutions of the RSS minimization to a smaller set of functions.

> How to decide on the nature of the restrictions is based on considerations outside of the data.

These restrictions are sometimes
* encoded via the parametric representation of $f_\theta$, or
* built into the learning method itself, either implicitly or explicitly.

> These restricted classes of solutions are the major topic of this book.

One thing should be clear, though.

> Any restrictions imposed on $f$ that lead to a unique solution to RSS do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

### Complexity

In general the constraints imposed by most learning methods can be described as _complexity_ restrictions of one kind or another.

> This usually means some kind of regular behavior in small neighborhoods of the input space.

That is, for all input points $x$ sufficiently close to each other in some metric, $\hat{f}$ exhibits some special structure such as
* nearly constant,
* linear or
* low-order polynomial behavior.

The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size.

> The larger the size, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint.

For example,
* local constant fits in infinitesimally small neighborhoods are no constraint at all;
* local linear fits in very large neighborhoods are almost a globally linear model, and are very restrictive.
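The two extremes above can be sketched numerically. This is an illustration under assumptions not in the text: a Gaussian weighting kernel whose bandwidth `h` plays the role of neighborhood size, and made-up one-dimensional data. With a tiny bandwidth the local constant fit reproduces the training responses. With a huge bandwidth the local linear fit is close to the global least-squares line.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 21)  # hypothetical inputs, spacing 0.05
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def gaussian_weights(x0, h):
    """Kernel weights of the training inputs around the target x0."""
    return np.exp(-0.5 * ((x - x0) / h) ** 2)

def local_constant(x0, h):
    """Kernel-weighted average of the responses near x0."""
    w = gaussian_weights(x0, h)
    return float(np.sum(w * y) / np.sum(w))

def local_linear(x0, h):
    """Kernel-weighted least-squares line, evaluated at x0."""
    sw = np.sqrt(gaussian_weights(x0, h))
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return float(beta[0])  # intercept = fitted value at x0

tiny = np.array([local_constant(x0, h=0.005) for x0 in x])
huge = np.array([local_linear(x0, h=100.0) for x0 in x])

# Global least-squares line for comparison.
slope, intercept = np.polyfit(x, y, deg=1)
ols = intercept + slope * x

print("max |local constant (tiny h) - y|   :", np.max(np.abs(tiny - y)))
print("max |local linear (huge h) - OLS fit|:", np.max(np.abs(huge - ols)))
```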

### Metric

The nature of the constraint depends on the metric used.

Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The kNN methods discussed so far are based on the assumption that locally the function is constant; close to a target input $x_0$, the function does not change much, and so close outputs can be averaged to produce $\hat{f}(x_0)$.
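As a minimal sketch of this averaging, assuming the Euclidean metric and a hypothetical helper name `knn_regress`, the kNN estimate at $x_0$ is the mean response over the $k$ closest training inputs:

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Average the responses of the k training inputs closest to x0.

    X has shape (N, p), y has shape (N,), and 1 <= k <= N is assumed.
    Euclidean distance is an illustrative choice of metric.
    """
    distances = np.linalg.norm(X - x0, axis=1)
    nearest = np.argpartition(distances, k - 1)[:k]  # indices of the k smallest
    return float(np.mean(y[nearest]))

# Toy data, made up for illustration.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
x0 = np.array([1.2])

for k in (1, 2, 4):
    print(k, knn_regress(x0, X, y, k))
```

Here the metric and the neighborhood size are both explicit: the first is the distance function, the second is `k`.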

Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In $\S$ 5.4.1 we discuss the concept of an _equivalent kernel_, which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above -- peaked at the target point and falling smoothly away from it.
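For a method that is linear in the outputs, the fitted values can be written as $\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}$, and row $i$ of $\mathbf{S}$ gives the weights that each $y_j$ receives in $\hat{f}(x_i)$. The sketch below recovers those weights for one illustrative basis-function method, a ridge-penalized Gaussian basis expansion. The centers, width, and penalty are arbitrary assumptions, not taken from the text.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 41)
centers = np.linspace(0.0, 1.0, 10)  # hypothetical basis centers
width, lam = 0.1, 1e-3               # hypothetical basis width and ridge penalty

# Design matrix of Gaussian basis functions evaluated at the inputs.
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def fit(y):
    """Ridge-penalized basis-function fit, evaluated at the training inputs."""
    A = Phi.T @ Phi + lam * np.eye(centers.size)
    return Phi @ np.linalg.solve(A, Phi.T @ y)

# The fit is linear in y, so column j of S is the fit to the j-th unit vector.
S = np.column_stack([fit(e) for e in np.eye(x.size)])

i0 = x.size // 2
equivalent_kernel = S[i0]  # weights on y_1, ..., y_N used for f_hat(x[i0])
print("target input:", x[i0])
print("input with largest weight:", x[np.argmax(equivalent_kernel)])
```

Plotting `equivalent_kernel` against `x` shows how the method weights nearby and distant observations, even though no neighborhood was ever specified.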

### Curse of dimensionality

One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions -- again the curse of dimensionality.

And conversely, all methods that overcome the dimensionality problems have an associated -- and often implicit or adaptive -- metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.
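A standard back-of-the-envelope calculation shows why small isotropic neighborhoods fail. Assume inputs uniformly distributed in the unit hypercube in $p$ dimensions; the helper name `edge_length` is hypothetical. A sub-cube capturing a fraction $r$ of the data must have edge length $r^{1/p}$, which approaches 1 as $p$ grows.

```python
def edge_length(fraction, p):
    """Edge of a sub-cube holding `fraction` of uniform data in [0, 1]^p."""
    return fraction ** (1.0 / p)

for p in (1, 2, 3, 10, 100):
    print(f"p={p:>3}  edge for 1% of data: {edge_length(0.01, p):.3f}  "
          f"edge for 10% of data: {edge_length(0.10, p):.3f}")
```

In ten dimensions, a neighborhood holding only 1% of the data already spans about 63% of the range of each input, so it is no longer local in any single coordinate.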