# $\S$ 6.5. Local Likelihood and Other Models

The concept of local regression and varying coefficient models is extremely broad: Any parametric model can be made local if the fitting method accommodates observation weights. Here are some examples:

1.

Associated with each observation $y_i$ is a parameter

\begin{equation}
\theta_i = \theta(x_i) = x_i^T\beta
\end{equation}

linear in the covariate(s) $x_i$, and inference for $\beta$ is based on the log-likelihood

\begin{equation}
l(\beta) = \sum_{i=1}^N l(y_i,x_i^T\beta).
\end{equation}

We can model $\theta(X)$ more flexibly by using the likelihood local to $x_0$ for inference of $\theta(x_0) = x_0^T\beta(x_0)$:

\begin{equation}
l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0,x_i) l(y_i,x_i^T\beta(x_0)).
\end{equation}

Many likelihood models, in particular the family of generalized linear models including logistic and log-linear models, involve the covariates in a linear fashion. Local likelihood allows a relaxation from a globally linear model to one that is locally linear.
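A minimal sketch of this idea for a two-class logistic model with a single covariate. The Gaussian kernel, the Newton-Raphson solver, the synthetic data, and the function names `gaussian_kernel` and `local_logistic_prob` are all assumptions made for illustration; they are not taken from the text.

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """Gaussian kernel weights K_lambda(x0, x_i) for scalar inputs."""
    return np.exp(-0.5 * ((x - x0) / lam) ** 2)

def local_logistic_prob(x0, x, y, lam, n_iter=25):
    """Fitted Pr(y=1 | x0) from the kernel-weighted Bernoulli log-likelihood."""
    w = gaussian_kernel(x0, x, lam)
    X = np.column_stack([np.ones_like(x), x])            # rows are (1, x_i)
    beta = np.zeros(2)                                   # beta(x0), refit for each x0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - p))                      # gradient of the local log-likelihood
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X  # negative Hessian
        beta = beta + np.linalg.solve(info, score)       # Newton-Raphson step
    theta0 = beta[0] + beta[1] * x0                      # theta(x0) = x0^T beta(x0)
    return 1.0 / (1.0 + np.exp(-theta0))

# Synthetic data (an assumption for illustration): the true logit is sin(2x),
# which a globally linear logistic model cannot capture.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 1000)
true_prob = lambda t: 1.0 / (1.0 + np.exp(-np.sin(2.0 * t)))
y = rng.binomial(1, true_prob(x)).astype(float)

grid = np.linspace(-2.5, 2.5, 11)
fitted = np.array([local_logistic_prob(x0, x, y, lam=0.3) for x0 in grid])
print(np.round(fitted, 2))           # local likelihood estimates on the grid
print(np.round(true_prob(grid), 2))  # true probabilities, for comparison
```

The only change from an ordinary logistic fit is the weight vector `w`, which is why any fitting method that accepts observation weights can be localized in the same way.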

2.

As above, except different variables are associated with $\theta$ from those used for defining the local likelihood:

\begin{equation}
l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0,z_i) l\left(y_i,\eta(x_i,\theta(z_0))\right).
\end{equation}

For example, $\eta(x,\theta) = x^T\theta$ could be a linear model in $x$. This will fit a varying coefficient model $\theta(z)$ by maximizing the local likelihood.
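As a rough sketch, assume a Gaussian likelihood, so that maximizing the local likelihood reduces to kernel-weighted least squares. The kernel acts on $z$ while the linear model is in $x$. The function name `varying_coefficients`, the Gaussian kernel, and the simulated data are illustrative assumptions.

```python
import numpy as np

def varying_coefficients(z0, z, X, y, lam):
    """Estimate theta(z0) by kernel-weighted least squares of y on X."""
    w = np.exp(-0.5 * ((z - z0) / lam) ** 2)    # K_lambda(z0, z_i)
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted normal equations

# Simulated data (assumed): intercept fixed at 1, slope in x equal to 2z.
rng = np.random.default_rng(1)
n = 500
z = rng.uniform(0.0, 1.0, n)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + (2.0 * z) * x + 0.1 * rng.normal(size=n)

for z0 in (0.25, 0.5, 0.75):
    theta = varying_coefficients(z0, z, X, y, lam=0.1)
    print(z0, np.round(theta, 2))   # (intercept, slope) estimated at z0
```

The estimated slope should track $2z_0$ as $z_0$ moves, while the intercept stays near 1.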

3.

Autoregressive time series models of order $k$ have the form

\begin{equation}
y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \epsilon_t.
\end{equation}

Denoting the _lag set_ by

\begin{equation}
z_t = (y_{t-1}, y_{t-2}, \cdots, y_{t-k}),
\end{equation}

the model looks like a standard linear model

\begin{equation}
y_t = z_t^T\beta + \epsilon_t,
\end{equation}

and is typically fit by least squares. Fitting by local least squares with a kernel $K(z_0,z_t)$ allows the model to vary according to the short-term history of the series.

This is to be distinguished from the more traditional dynamic linear models that vary by windowing time.
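A small sketch of local least squares on the lag set, under assumptions chosen for illustration: a Gaussian kernel on $z_t$, an explicit intercept column for $\beta_0$, and a simulated threshold AR(1) series whose slope depends on the sign of the previous value. The helper names `lag_matrix` and `local_ar_fit` are hypothetical.

```python
import numpy as np

def lag_matrix(y, k):
    """Rows are z_t = (y_{t-1}, ..., y_{t-k}); also returns the targets y_t."""
    n = len(y)
    Z = np.column_stack([y[k - j : n - j] for j in range(1, k + 1)])
    return Z, y[k:]

def local_ar_fit(z0, Z, target, lam):
    """Local least squares at lag vector z0 with kernel K(z0, z_t)."""
    w = np.exp(-0.5 * np.sum((Z - z0) ** 2, axis=1) / lam ** 2)
    X = np.column_stack([np.ones(len(Z)), Z])    # intercept beta_0 plus the lags
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ target)

# Simulated series (assumed): slope 0.8 after a negative value, -0.5 otherwise.
rng = np.random.default_rng(2)
n = 5000
y = np.zeros(n)
for t in range(1, n):
    slope = 0.8 if y[t - 1] < 0 else -0.5
    y[t] = slope * y[t - 1] + rng.normal()

Z, target = lag_matrix(y, k=1)
beta_neg = local_ar_fit(np.array([-1.5]), Z, target, lam=0.5)
beta_pos = local_ar_fit(np.array([1.5]), Z, target, lam=0.5)
print(np.round(beta_neg, 2))   # (intercept, lag-1 coefficient) near z0 = -1.5
print(np.round(beta_pos, 2))   # (intercept, lag-1 coefficient) near z0 = +1.5
```

The weights depend on where the recent history $z_t$ lies, not on how close $t$ is to some target time, which is the distinction drawn above.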

As an illustration of local likelihood, we consider the local version of the multiclass linear logistic regression model of Chapter 4. The data consist of features $x_i$ and an associated categorical response $g_i \in \{1,2,\cdots,J\}$, and the linear model has the form

\begin{equation}
\text{Pr}(G=j|X=x) = \frac{\exp\left(\beta_{j0} + \beta_j^Tx\right)}{1 + \sum_{k=1}^{J-1} \exp\left(\beta_{k0} + \beta_k^Tx\right)}.
\end{equation}

The local log-likelihood for this $J$ class model can be written

\begin{equation}
\sum_{i=1}^N K_\lambda(x_0,x_i) \left\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T(x_i-x_0) - \log\left( 1 + \sum_{k=1}^{J-1} \exp\left( \beta_{k0}(x_0) + \beta_k(x_0)^T(x_i-x_0) \right) \right) \right\}.
\end{equation}

Notice that
* we have used $g_i$ as a subscript in the first two terms to pick out the appropriate numerator;
* $\beta_{J0} = 0$ and $\beta_J = 0$ by the definition of the model;
* we have centered the local regressions at $x_0$, so that the fitted posterior probabilities at $x_0$ are simply  

  \begin{equation}
  \hat{\text{Pr}}(G=j|X=x_0) = \frac{\exp\left(\hat\beta_{j0}(x_0)\right)}{1 + \sum_{k=1}^{J-1} \exp\left(\hat\beta_{k0}(x_0)\right)}.
  \end{equation}
  
This model can be used for flexible multiclass classification in moderately low dimensions, although successes have been reported with the high-dimensional ZIP-code classification problem. Generalized additive models (Chapter 9) using kernel smoothing methods are closely related, and avoid dimensionality problems by assuming an additive structure for the regression function.
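The following is a rough sketch of the local multiclass fit, not a reference implementation. It assumes a Gaussian kernel, plain gradient ascent with a fixed step size, labels coded $0,\ldots,J-1$ with the last class as the reference (rather than $1,\ldots,J$), and simulated two-dimensional data. The function name `local_multilogit` is hypothetical. Because the design is centered at $x_0$, the linear terms vanish there and only the fitted intercepts enter the probabilities at $x_0$.

```python
import numpy as np

def local_multilogit(x0, X, g, J, lam, n_iter=500, step=0.5):
    """Gradient ascent on the kernel-weighted J-class log-likelihood at x0.

    g holds labels 0, ..., J-1; the last class is the reference, with its
    coefficients fixed at zero. Returns the J class probabilities at x0.
    """
    w = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / lam ** 2)
    w = w / w.sum()                                  # normalize so a fixed step is stable
    D = np.column_stack([np.ones(len(X)), X - x0])   # design centered at x0
    Y = np.eye(J)[g][:, : J - 1]                     # indicators for classes 0..J-2
    B = np.zeros((D.shape[1], J - 1))                # column j holds (beta_j0, beta_j)
    for _ in range(n_iter):
        expo = np.exp(D @ B)
        P = expo / (1.0 + expo.sum(axis=1, keepdims=True))
        B += step * D.T @ (w[:, None] * (Y - P))     # weighted score direction
    e0 = np.exp(B[0])                                # intercepts only at x0
    return np.append(e0, 1.0) / (1.0 + e0.sum())

# Simulated data (assumed): three overlapping Gaussian classes in the plane.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [2.5, 0.0], [0.0, 2.5]])
g = np.repeat(np.arange(3), 200)
X = centers[g] + rng.normal(size=(600, 2))

probs_a = local_multilogit(centers[0], X, g, J=3, lam=1.0)
probs_c = local_multilogit(centers[2], X, g, J=3, lam=1.0)
print(np.round(probs_a, 3))   # probabilities at the first class center
print(np.round(probs_c, 3))   # probabilities at the reference class center
```

Gradient ascent is used here only to keep the sketch short; a Newton or quasi-Newton solver would normally converge in far fewer iterations.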

### Heart disease data, again

As a simple illustration we fit a two-class local linear logistic model to the heart disease data of Chapter 4. FIGURE 6.12 shows the univariate local logistic models fit to two of the risk factors (separately). This is a useful screening device for detecting nonlinearities, when the data themselves have little visual information to offer. In this case an unexpected anomaly is uncovered in the data, which may have gone unnoticed with traditional methods.

```python
"""FIGURE 6.12. Local logistic models for the heart disease data"""
print('Under construction ...')
```


Since $\textsf{CHD}$ is a binary indicator, we could estimate the conditional prevalence $\text{Pr}(G=j|x_0)$ by simply smoothing this binary response directly without resorting to a likelihood formulation. This amounts to fitting a locally constant logistic regression model (Exercise 6.5). In order to enjoy the bias-correction of local-linear smoothing, however, it is more natural to operate on the unrestricted logit scale.
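The equivalence can be checked numerically. With only an intercept, the local score equation is $\sum_i K_\lambda(x_0,x_i)(y_i - p) = 0$, whose solution is the kernel-weighted average of the $y_i$. The sketch below compares the two on simulated data; the Gaussian kernel, the data, and the function names are assumptions for illustration.

```python
import numpy as np

def kernel_weights(x0, x, lam):
    """Gaussian kernel weights K_lambda(x0, x_i)."""
    return np.exp(-0.5 * ((x - x0) / lam) ** 2)

def smoothed_prevalence(x0, x, y, lam):
    """Kernel-weighted average of the binary response."""
    w = kernel_weights(x0, x, lam)
    return np.sum(w * y) / np.sum(w)

def local_constant_logistic(x0, x, y, lam, n_iter=50):
    """Newton iterations for the intercept-only local logistic model."""
    w = kernel_weights(x0, x, lam)
    b0 = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-b0))
        score = np.sum(w * (y - p))        # derivative of the local log-likelihood
        info = np.sum(w) * p * (1.0 - p)   # negative second derivative
        b0 += score / info
    return 1.0 / (1.0 + np.exp(-b0))

# Simulated binary data (assumed): true logit equal to x.
rng = np.random.default_rng(4)
x = rng.uniform(-2.0, 2.0, 300)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x))).astype(float)

p_smooth = smoothed_prevalence(0.3, x, y, lam=0.4)
p_logit = local_constant_logistic(0.3, x, y, lam=0.4)
print(p_smooth, p_logit)   # the two estimates should agree up to solver tolerance
```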

Typically with logistic regression, we compute parameter estimates as well as their standard errors. This can be done locally as well, and so we can produce, as shown in the plot, estimated pointwise standard-error bands about our fitted prevalence.
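One way such bands might be computed is sketched below. It is an assumption, not a description of how the figure was produced: it uses a sandwich-type large-sample approximation to the covariance of the local estimate, a Gaussian kernel, a design centered at $x_0$ so that the fitted logit is the intercept, and bands of $\pm 2$ standard errors formed on the logit scale and then mapped to probabilities. The function name `local_logistic_band` and the data are hypothetical.

```python
import numpy as np

def local_logistic_band(x0, x, y, lam, n_iter=25, z=2.0):
    """Return (lower, fit, upper) for Pr(y=1 | x0) from a local linear logistic fit."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])   # centered: logit at x0 is beta[0]
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        A = (X * (w * p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(A, X.T @ (w * (y - p)))  # Newton step
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    v = p * (1.0 - p)
    A = (X * (w * v)[:, None]).T @ X        # kernel-weighted information
    B = (X * (w ** 2 * v)[:, None]).T @ X   # variance of the weighted score
    A_inv = np.linalg.inv(A)
    cov = A_inv @ B @ A_inv                 # sandwich approximation (assumed)
    se = np.sqrt(cov[0, 0])                 # standard error of the logit at x0
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    return expit(beta[0] - z * se), expit(beta[0]), expit(beta[0] + z * se)

# Simulated binary data (assumed): true logit equal to 0.8 x.
rng = np.random.default_rng(5)
x = rng.uniform(-3.0, 3.0, 1000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.8 * x))).astype(float)

lower, fit, upper = local_logistic_band(0.5, x, y, lam=0.5)
print(lower, fit, upper)
```

Forming the band on the logit scale keeps both limits inside $(0, 1)$ after transformation.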