# Chapter 6 - Kernel Smoothing Methods

* Fit a different, but simple model, separately at each point by using observations close to the target point.
* Localization is achieved via a weighting function, or kernel, $K_\lambda(x_0, x_i)$, which assigns each observation $x_i$ a weight based on its distance from the target point $x_0$.

### 6.1 - One-Dimensional Kernel Smoothers

* K-nearest neighbor regression is an example of a type of kernel smoother.
* With nearest neighbor, the size of the kernel adapts to the local density of points.
* Other (metric) methods fix the width of the kernel window instead, so the number of points averaged varies with the local density.
* Let $\lambda$ be the smoothing parameter that determines the width of the local neighborhood. A large $\lambda$ implies lower variance (more observations are averaged) but higher bias (the function is assumed to be roughly constant within the window).
* Observation weights need some care. With metric neighborhoods, you can simply multiply them by the kernel weights before taking the weighted average. With nearest neighborhoods, it is more natural to insist on a neighborhood with a fixed total weight content, since each observation no longer counts as exactly one neighbor.
* Locally weighted averages can be badly biased at the boundaries of the domain because the kernel is asymmetric in that region. Fitting local linear regressions within the kernel window removes this bias to first order.
* To reduce the bias even further, we can fit local polynomials of higher degree than linear. These fits are subject to the bias-variance tradeoff: the lower bias comes at the cost of increased variance.
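The boundary bias described above can be checked numerically. The following is a minimal sketch, assuming an Epanechnikov kernel with a metric window and a noiseless linear target; the function names and the toy data are illustrative choices, not taken from the text.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel D(t) = 0.75 * (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average of y around x0 (metric window of width lam)."""
    w = epanechnikov((x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Kernel-weighted least-squares line around x0, evaluated at x0."""
    w = epanechnikov((x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
    WB = B * w[:, None]                         # rows scaled by kernel weights
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)  # weighted normal equations
    return beta[0] + beta[1] * x0

# Toy data (an assumption for illustration): noiseless linear trend on [0, 1]
x = np.linspace(0.0, 1.0, 101)
y = 2.0 * x + 1.0

# At the left boundary the true value is 1.0
nw_boundary = nadaraya_watson(0.0, x, y, lam=0.2)  # pulled upward by the one-sided window
ll_boundary = local_linear(0.0, x, y, lam=0.2)     # recovers the linear trend
print(nw_boundary, ll_boundary)
```

In the interior both estimators agree on this data, because the window is symmetric and the linear trend averages out; the difference only appears where the window is cut off by the edge of the domain.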

### 6.2 - Selecting the Width of the Kernel

* In k-nearest neighbors, $\lambda = k$.
* For the Epanechnikov or tri-cube kernel with metric width, $\lambda$ is the radius of the support region.
* For the Gaussian kernel, $\lambda$ is the standard deviation.
* If the window is narrow, the variance will be large and the bias will be small.
* If the window is wide, the variance will be smaller but the bias will be higher.
* Similar arguments apply to the local regression estimates.
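The tradeoff above can be made concrete for a kernel-weighted average, which is linear in the responses: the fit at $x_0$ is $\sum_i l_i(x_0) y_i$. Assuming $y_i = f(x_i) + \varepsilon_i$ with independent noise of variance $\sigma^2$, the bias is $\sum_i l_i(x_0) f(x_i) - f(x_0)$ and the variance is $\sigma^2 \sum_i l_i(x_0)^2$. The sketch below uses a Gaussian kernel; the target function, noise level, and the two bandwidths are arbitrary choices for illustration.

```python
import numpy as np

def smoother_weights(x0, x, lam):
    """Normalized Gaussian-kernel weights l_i(x0); lam is the standard deviation."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
    return w / w.sum()

def bias_and_variance(x0, x, f, sigma, lam):
    """Exact bias and variance at x0 of the kernel-weighted average."""
    l = smoother_weights(x0, x, lam)
    bias = l @ f(x) - f(x0)               # E[fit] - truth
    variance = sigma**2 * np.sum(l**2)    # Var of a linear combination of y_i
    return bias, variance

def f(t):
    """Hypothetical true regression function."""
    return np.sin(4.0 * t)

x = np.linspace(0.0, 1.0, 100)
x0, sigma = 0.4, 0.3

bias_narrow, var_narrow = bias_and_variance(x0, x, f, sigma, lam=0.02)
bias_wide, var_wide = bias_and_variance(x0, x, f, sigma, lam=0.3)
print("narrow:", bias_narrow, var_narrow)
print("wide:  ", bias_wide, var_wide)
```

The narrow window averages only a handful of points, so its variance is larger, while the wide window flattens the peak of the function near $x_0$ and is therefore more biased.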

### 6.3 - Local Regression in $\mathbb{R}^p$

* Kernel smoothing and local regression generalize naturally to two or more dimensions.
* Boundary effects become a much bigger problem in higher dimensions, as the fraction of points close to the boundary increases to one as $p \rightarrow \infty$. Local polynomial regression performs boundary correction in any dimension.
* Local regression becomes less useful in dimensions much higher than two or three -- it becomes impossible to maintain locality (low bias) and a sizable sample in the neighborhood (low variance) at the same time without extraordinary sample sizes.
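A quick way to see why boundary effects dominate in high dimensions: assuming points are uniform on the unit cube $[0, 1]^p$ (an assumption made here for illustration), the fraction lying within $\epsilon$ of some face is $1 - (1 - 2\epsilon)^p$. The sketch below compares that formula with a Monte Carlo estimate; `eps`, `n`, and `seed` are arbitrary.

```python
import numpy as np

def boundary_fraction(p, eps=0.05):
    """Exact fraction of the unit cube [0, 1]^p within eps of some face."""
    return 1.0 - (1.0 - 2.0 * eps) ** p

def boundary_fraction_mc(p, eps=0.05, n=20000, seed=0):
    """Monte Carlo estimate of the same fraction from n uniform draws."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    near_edge = np.any((X < eps) | (X > 1.0 - eps), axis=1)
    return near_edge.mean()

for p in (1, 2, 10, 50):
    print(p, round(boundary_fraction(p), 4), round(boundary_fraction_mc(p), 4))
```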

### 6.4 - Structured Local Regression Models in $\mathbb{R}^p$

* When the ratio of dimension to sample size is unfavorable, local regression does not help much. However, we can make some structural assumptions about the model.
* One approach is to modify the kernel -- a default spherical kernel gives equal weight to each coordinate. We can standardize each variable to unit standard deviation or, more generally, weight different coordinates to downgrade or omit their importance.
* Another approach is to consider the ANOVA decomposition of the regression function and eliminate some of the higher-order terms. *Varying coefficient models* belong to this class.
* Varying coefficient example -- if we measure aorta diameter w.r.t. age, gender, and depth down the aorta, we may model the diameter as a linear function of age (a long-known effect), but allow the coefficients to vary with gender and depth down the aorta.
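The kernel modification above can be sketched with a positive semidefinite matrix $A$ inside the distance, $K_{\lambda, A}(x_0, x) = D\big(\sqrt{(x - x_0)^T A (x - x_0)} / \lambda\big)$. This is a minimal sketch assuming the Epanechnikov profile for $D$; the function name and the toy points are hypothetical.

```python
import numpy as np

def structured_kernel(x0, X, A, lam):
    """Kernel weights with a quadratic-form distance; A must be positive semidefinite."""
    d = X - x0
    q = np.einsum("ij,jk,ik->i", d, A, d)  # (x - x0)^T A (x - x0) for each row
    t = np.sqrt(q) / lam
    return np.where(t <= 1, 0.75 * (1 - t**2), 0.0)  # Epanechnikov profile

x0 = np.zeros(2)
X = np.array([[0.0, 0.0],
              [0.0, 0.5],   # differs from x0 only in the second coordinate
              [0.5, 0.0]])  # differs from x0 only in the first coordinate

spherical = structured_kernel(x0, X, np.eye(2), lam=1.0)
drop_second = structured_kernel(x0, X, np.diag([1.0, 0.0]), lam=1.0)
print(spherical)    # both offset points are downweighted equally
print(drop_second)  # the second coordinate no longer affects the weight
```

Setting a diagonal entry of $A$ to zero omits that coordinate entirely; intermediate values only downweight it.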


### 6.5 - Local Likelihood and Other Models
* We can fit a broad class of models locally. Any method can be made local if it accommodates observation weights.
* For example, multi-class logistic regression can be used locally.
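As a rough illustration of making a model local through observation weights, the sketch below refits a logistic regression at each query point with Gaussian kernel weights passed as `sample_weight`. It assumes scikit-learn is available; note that `LogisticRegression` applies L2 regularization by default, so this is only an approximation of a local likelihood fit. The helper name, bandwidth, and toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_proba(x0, X, y, lam):
    """Class probabilities at x0 from a kernel-weighted logistic fit."""
    # Gaussian kernel weights based on distance from the query point
    w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam**2)
    model = LogisticRegression()
    model.fit(X, y, sample_weight=w)
    return model.predict_proba(x0.reshape(1, -1))[0]

# Toy 1-D data: class 1 in the middle band, class 0 on both sides,
# which a single global linear logistic model cannot capture
x = np.linspace(-3.0, 3.0, 121)
X = x.reshape(-1, 1)
y = (np.abs(x) < 1.0).astype(int)

p_center = local_logistic_proba(np.array([0.0]), X, y, lam=0.5)
p_edge = local_logistic_proba(np.array([2.5]), X, y, lam=0.5)
print(p_center, p_edge)
```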

### 6.6 - Kernel Density Estimation and Classification

* Kernel density estimation is an unsupervised learning procedure that historically precedes kernel regression.
* A natural local estimate of the density at a point is the fraction of observations falling in a small neighborhood of that point, divided by the neighborhood width. This estimate is bumpy. The Parzen estimate smooths it by weighting observations with a kernel, which amounts to adding a bit of noise to each observation.
* Naive Bayes is appropriate when the feature space is high-dimensional (making density estimation unattractive). It assumes that, given the class, the features are independent, so each class density $f_j(X)$ can be expressed as a product of one-dimensional densities $f_{jk}(X_k)$. While generally not true, this assumption simplifies the estimation.
* Naive Bayes allows each of the class-conditional densities (one per feature) to be estimated separately using one-dimensional kernel density estimates.
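The two ideas above can be combined in a few lines. This is a minimal sketch, assuming a Gaussian Parzen estimate with a single shared bandwidth `lam` for every feature and class; the function names and the synthetic two-class data are hypothetical.

```python
import numpy as np

def parzen_density(x0, x, lam):
    """One-dimensional Gaussian Parzen estimate of the density at x0."""
    z = (x0 - x) / lam
    return np.mean(np.exp(-0.5 * z**2)) / (lam * np.sqrt(2.0 * np.pi))

def naive_bayes_kde_posterior(x0, X, y, lam):
    """Posterior class probabilities at x0 using per-feature 1-D KDEs."""
    classes = np.unique(y)
    scores = []
    for c in classes:
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        # Naive Bayes: product of one-dimensional class-conditional densities
        density = np.prod([parzen_density(x0[k], Xc[:, k], lam)
                           for k in range(X.shape[1])])
        scores.append(prior * density)
    scores = np.array(scores)
    return classes, scores / scores.sum()

# Synthetic data: two well-separated classes in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=2.0, size=(50, 2))])
y = np.repeat([0, 1], 50)

classes, posterior = naive_bayes_kde_posterior(np.array([2.0, 2.0]), X, y, lam=0.5)
print(classes, posterior)
```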

### 6.7 - Radial Basis Functions and Kernels
* Combine the flexibility of local fitting with the flexibility gained by fitting basis expansions.
* A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, or alternatively on the distance from some other point c, called a center.
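To make the definition concrete, the sketch below builds Gaussian radial basis features around a set of centers and fits their coefficients by ordinary least squares. The evenly spaced centers, the shared width `lam`, and the sine target are assumptions made for illustration, not a prescribed way of choosing them.

```python
import numpy as np

def gaussian_rbf_features(x, centers, lam):
    """Feature matrix whose (i, j) entry depends only on |x_i - c_j|."""
    dist = np.abs(x[:, None] - centers[None, :])
    return np.exp(-0.5 * (dist / lam) ** 2)

# Toy regression problem: one period of a sine wave
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)

centers = np.linspace(0.0, 1.0, 8)   # hypothetical choice of centers
Phi = gaussian_rbf_features(x, centers, lam=0.15)

# Linear least squares in the RBF feature space
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ coef
print(Phi.shape, np.mean((y - y_hat) ** 2))
```

Once the centers and width are fixed, the model is linear in the coefficients, which is what makes the basis-expansion view convenient.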

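```python
# From sklearn documentation
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
import numpy as np

# XOR-style toy data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 1])

# Approximate the RBF kernel feature map with random Fourier features
rbf_feature = RBFSampler(gamma=1, random_state=1)
X_kernel = rbf_feature.fit_transform(X)

# Fit a linear classifier in the approximate kernel feature space
clf = SGDClassifier()
clf.fit(X_kernel, y)
clf.score(X_kernel, y)
```

Output:

```
1.0
```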

```python
print('Original feature space shape: {}'.format(X.shape))
print('Kernel feature space shape: {}'.format(X_kernel.shape))
print('Now feed it into your favorite linear algorithm!')
```

Output:

```
Original feature space shape: (4, 2)
Kernel feature space shape: (4, 100)
Now feed it into your favorite linear algorithm!
```


### 6.8 - Mixture Models for Density Estimation and Classification
* A Gaussian mixture model is a weighted sum of Gaussian density functions, with mixing weights that sum to one. It is useful for probability density estimation.
* Mixture models can be viewed as a kind of kernel method.
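A short sketch of density estimation with a two-component Gaussian mixture, assuming scikit-learn's `GaussianMixture` is available. The synthetic data and the number of components are choices made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two well-separated Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
X = x.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

grid = np.array([[-2.0], [0.5], [3.0]])
density = np.exp(gmm.score_samples(grid))   # score_samples returns the log-density
responsibilities = gmm.predict_proba(grid)  # per-component membership probabilities

print(gmm.weights_, gmm.means_.ravel())
print(density)
```

The responsibilities returned by `predict_proba` give, for each point, the estimated probability of belonging to each component, which is what makes the mixture usable for classification as well as density estimation.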

### 6.9 - Computational Considerations

* Kernel methods are memory-based: the model is the entire training set, and the fitting is done at evaluation or prediction time. This can prove computationally infeasible for many real-time applications.