# $\S$ 6.2. Selecting the Width of the Kernel

In each of the kernels $K_\lambda$, $\lambda$ is a parameter that controls its width:

* For the Epanechnikov or tri-cube kernel with metric width, $\lambda$ is the radius of the support region.
* For the Gaussian kernel, $\lambda$ is the standard deviation.
* For $k$-nearest neighborhoods, $\lambda$ is the number $k$ of nearest neighbors, often expressed as a fraction or _span_ $k/N$ of the total training sample.
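
The three roles of $\lambda$ can be made concrete with a minimal sketch. The function names below (`kernel_weights`, `knn_width`, and so on) are illustrative choices, not taken from the text, and the Gaussian is used without the $1/\lambda$ normalizing constant, which cancels once the weights are normalized.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) on |t| <= 1, zero outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 on |t| <= 1, zero outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    """Standard normal density; lambda then plays the role of the standard deviation."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

def kernel_weights(x0, x, lam, D=epanechnikov):
    """Metric-width weights K_lambda(x0, x_i) = D(|x_i - x0| / lambda)."""
    return D(np.abs(x - x0) / lam)

def knn_width(x0, x, k):
    """Adaptive width: distance from x0 to its k-th nearest training point."""
    return np.sort(np.abs(x - x0))[k - 1]

x = np.linspace(0.0, 1.0, 11)

# Metric width: lambda is the radius of the support region.
w_metric = kernel_weights(0.5, x, lam=0.25)

# Nearest-neighbor width: lambda is set by k (here a span of k/N = 3/11).
h_knn = knn_width(0.5, x, k=3)
w_knn = kernel_weights(0.5, x, lam=h_knn, D=tricube)
```

With a compactly supported kernel, the $k$-th neighbor itself sits on the edge of the support and so receives (essentially) zero weight.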

### Bias-variance tradeoff, again and again

There is a natural bias-variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:

* If the window is narrow, $\hat{f}(x_0)$ is an average of a small number of $y_i$ close to $x_0$, and its variance will be relatively large -- close to that of an individual $y_i$. The bias will tend to be small, again because each of the $\text{E}(y_i) = f(x_i)$ should be close to $f(x_0)$.
* If the window is wide, the variance of $\hat{f}(x_0)$ will be small relative to the variance of any $y_i$, because of the effects of averaging. The bias will be higher, because we are now using observations $x_i$ farther from $x_0$, and there is no guarantee that $f(x_i)$ will be close to $f(x_0)$.
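
For a fixed design the tradeoff can be computed exactly, since a kernel-weighted local average is $\hat{f}(x_0) = \sum_i w_i y_i$ with weights that do not depend on $y$. Its bias is then $\sum_i w_i f(x_i) - f(x_0)$ and its variance is $\sigma^2 \sum_i w_i^2$. The sketch below assumes a hypothetical true function $f(x) = \sin(4x)$, a noise standard deviation of $1/3$, and an Epanechnikov kernel; none of these choices come from the text.

```python
import numpy as np

def local_average_weights(x0, x, lam):
    """Normalized Epanechnikov weights of a local average at x0."""
    t = np.abs(x - x0) / lam
    k = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)
    return k / k.sum()

def bias_and_variance(x0, x, f, sigma, lam):
    """Exact bias and variance of the local average at x0 (fixed design)."""
    w = local_average_weights(x0, x, lam)
    bias = w @ f(x) - f(x0)
    variance = sigma**2 * np.sum(w**2)
    return bias, variance

# Assumed setup for illustration only.
x = np.linspace(0.0, 1.0, 100)
f = lambda u: np.sin(4 * u)
sigma = 1.0 / 3.0

results = {lam: bias_and_variance(0.5, x, f, sigma, lam) for lam in (0.05, 0.2, 0.4)}
for lam, (bias, var) in results.items():
    print(f"lambda={lam:.2f}  bias={bias:+.4f}  variance={var:.4f}")
```

Under these assumptions the narrowest window has the largest variance and the smallest absolute bias, and the ordering reverses for the widest window.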

Similar arguments apply to local regression estimates, say local linear:

* As the width goes to zero, the estimates approach a piecewise-linear function that interpolates the training data.
* As the width gets infinitely large, the fit approaches the global linear least-squares fit to the data.
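
The wide-window limit is easy to check numerically. The following is a minimal sketch, assuming a Gaussian kernel and synthetic data (both are illustrative choices): with a very large $\lambda$ all weights are nearly equal, so the local linear fit should agree with the ordinary least-squares line.

```python
import numpy as np

def local_linear_fit(x0, x, y, lam):
    """Local linear estimate at x0 via kernel-weighted least squares."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)   # Gaussian kernel weights
    B = np.column_stack([np.ones_like(x), x])  # rows are (1, x_i)
    BtW = B.T * w                              # B^T W without forming W
    beta = np.linalg.solve(BtW @ B, BtW @ y)
    return beta[0] + beta[1] * x0

# Synthetic data, assumed for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
y = np.sin(4 * x) + rng.normal(0.0, 0.3, size=x.size)

grid = np.linspace(0.0, 1.0, 21)
wide = np.array([local_linear_fit(x0, x, y, lam=1e3) for x0 in grid])
global_fit = np.polyval(np.polyfit(x, y, 1), grid)

print(np.max(np.abs(wide - global_fit)))  # should be very close to zero
```

The narrow-window limit is harder to demonstrate this way, because the weighted normal equations become ill-conditioned once a single observation dominates the weights.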

The discussion in Chapter 5 on selecting the regularization parameter for smoothing splines applies here, and will not be repeated.

Local regression smoothers are linear estimators; the smoother matrix in

\begin{equation}
\hat{\mathbf{f}} = \mathbf{S}_\lambda\mathbf{y}
\end{equation}

is built up from the equivalent kernels ($\S$ 6.1.1), and has $ij$th entry $\{\mathbf{S}_\lambda\}_{ij} = l_i(x_j)$.

Leave-one-out cross-validation is particularly simple (Exercise 6.7), as are generalized cross-validation, $C_p$ (Exercise 6.10), and $k$-fold cross-validation. The effective degrees of freedom is again defined as $\text{trace}(\mathbf{S}_\lambda)$, and can be used to calibrate the amount of smoothing.
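
As a rough sketch of how these quantities can be computed, the code below builds the smoother matrix for a local linear fit row by row, then reads off the effective degrees of freedom, a leave-one-out score, and a GCV score. The Gaussian kernel, the synthetic data, and the helper names are assumptions made for illustration. The leave-one-out score uses the standard shortcut for linear smoothers fitted by weighted least squares, $\frac{1}{N}\sum_i \left[(y_i - \hat{f}(x_i)) / (1 - \{\mathbf{S}_\lambda\}_{ii})\right]^2$, which avoids refitting $N$ times.

```python
import numpy as np

def smoother_matrix(x, lam):
    """Row i holds the weights on y that produce the local linear fit at x_i."""
    n = len(x)
    B = np.column_stack([np.ones(n), x])            # rows are (1, x_i)
    S = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / lam) ** 2)  # Gaussian weights centred at x_i
        BtW = B.T * w
        S[i] = B[i] @ np.linalg.solve(BtW @ B, BtW)
    return S

def smoothing_summary(x, y, lam):
    """Effective degrees of freedom, LOOCV and GCV scores for width lam."""
    S = smoother_matrix(x, lam)
    resid = y - S @ y
    df = np.trace(S)
    loocv = np.mean((resid / (1.0 - np.diag(S))) ** 2)
    gcv = np.mean((resid / (1.0 - df / len(x))) ** 2)
    return df, loocv, gcv

# Synthetic data, assumed for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(4 * x) + rng.normal(0.0, 0.3, size=x.size)

for lam in (0.05, 0.1, 0.3):
    df, loocv, gcv = smoothing_summary(x, y, lam)
    print(f"lambda={lam:.2f}  df={df:.2f}  LOOCV={loocv:.4f}  GCV={gcv:.4f}")
```

Scanning $\lambda$ and reporting the trace alongside each score makes it possible to compare smoothers on a common degrees-of-freedom scale, which is how the two fits in Figure 6.7 are matched.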

FIGURE 6.7 compares the equivalent kernels for a smoothing spline and local linear regression. The local regression smoother has a span of $40\%$, which results in $\text{df} = \text{trace}(\mathbf{S}_\lambda) = 5.86$. The smoothing spline was calibrated to have the same $\text{df}$, and their equivalent kernels are qualitatively quite similar.

In [1]:
"""FIGURE 6.7. Equivalent kernels for a local linear regression smoother and
a smoothing spline"""
print('Under construction ...')

Under construction ...
