In [1]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib


This notebook covers Part III of the Kutner (2004) textbook *Applied Linear Statistical Models* on **Nonlinear Regression**.

### Chapter 13  Introduction to Nonlinear Regression and Neural Networks

**Linear and Nonlinear Regression Models**

Regression models can be written in the form:

$$
Y_i = f(\mathbf{X}_i, \boldsymbol{\beta}) + \epsilon_i
$$

and a linear regression model is a special case of this more general form, which includes nonlinear regression models. We will use $\gamma$ rather than $\beta$ when discussing nonlinear regression parameters.

A typical nonlinear regression model is the *exponential model*. Some examples follow:

\begin{align}
Y_i &= \gamma_0\exp(\gamma_1X_{i1}) + \epsilon_i \\
Y_i &= \gamma_0 + \gamma_1\exp(\gamma_2X_{i1}) + \epsilon_i
\end{align}

**Diversion**: Application to tuning curve analysis in an attention task with two conditions. Here we set the scene for a discussion that continues below as the relevant points come up in the chapter. The problem is as follows. We have recorded a neuron's responses to different orientations and computed a raw tuning curve of response versus orientation across all trials on which an orientation was shown. We fit a von Mises function to this tuning curve to obtain a set of overall parameters, which reduces the number of parameters that must be estimated in a subsequent step. That step compares the tuning curves in two attentional conditions, attend-toward and attend-away. The von Mises function takes a form similar to the exponential model shown above:

$$
Y = \gamma_0 + \exp(\gamma_2 + \gamma_3\cos(X - \gamma_4)),
$$

which can be re-written as

$$
Y = \gamma_0 + \exp(\gamma_2)\theta,
$$

where $\theta = \exp(\gamma_3\cos(X - \gamma_4))$ becomes a known quantity for each $X$ when the shape and preferred-orientation parameters, $\gamma_3$ and $\gamma_4$, are known. They are known in our case because they are determined by the original tuning curve fit performed across all trials, regardless of attentional state. We next want to fit two regression functions of the form shown immediately above, one per condition, and determine whether $\gamma_0$ and $\gamma_2$ are the same or differ across conditions.
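
Because $\theta$ is known for every $X$ once $\gamma_3$ and $\gamma_4$ are fixed, the model is linear in $\gamma_0$ and $\exp(\gamma_2)$ and can be fit by ordinary least squares. The following is a minimal sketch on synthetic data, not the actual recording analysis: the parameter values, the noise level, the helper names `simulate` and `fit_linear`, and the treatment of $X$ as an angle in radians are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "known" shape and preferred-orientation values (assumed here;
# in the text they come from the overall tuning curve fit).
gamma3, gamma4 = 1.5, np.pi / 2

# Assumption: X is an angle in radians sampled at 16 equally spaced values.
x = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
theta = np.exp(gamma3 * np.cos(x - gamma4))  # known once gamma3, gamma4 are fixed

def simulate(gamma0, gamma2, noise_sd=0.2):
    """Synthetic responses Y = gamma0 + exp(gamma2) * theta + noise."""
    return gamma0 + np.exp(gamma2) * theta + rng.normal(0.0, noise_sd, x.size)

def fit_linear(y):
    """OLS fit of Y = gamma0 + a * theta, where a = exp(gamma2)."""
    design = np.column_stack([np.ones_like(theta), theta])
    (gamma0_hat, a_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
    return gamma0_hat, a_hat

# Two hypothetical conditions that share gamma0 but differ in gamma2.
toward = fit_linear(simulate(2.0, np.log(3.0)))
away = fit_linear(simulate(2.0, np.log(2.0)))
print("attend-toward (gamma0, exp(gamma2)):", toward)
print("attend-away   (gamma0, exp(gamma2)):", away)
```

Estimating $a = \exp(\gamma_2)$ rather than $\gamma_2$ itself keeps the fit linear; $\gamma_2$ can be recovered afterwards as $\log_e a$ whenever the estimate of $a$ is positive.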

**Return from Diversion**

Another important nonlinear model is the *logistic regression model*, which takes the following form:

$$
Y_i = \frac{\gamma_0}{1 + \gamma_1\exp(\gamma_2X_{i1})} + \epsilon_i
$$

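One important feature of nonlinear regression models is that the number of parameters is not directly determined by the number of $X$ variables. The exponential and logistic models above each have a single predictor $X_{i1}$ but two or three parameters.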
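
To make the response functions concrete, here is a small sketch of the exponential and logistic models written as Python functions. The function names and the parameter values used in the example call are hypothetical; only the functional forms come from the equations above.

```python
import numpy as np

def exponential_2p(x, g0, g1):
    """g0 * exp(g1 * x): one predictor, two parameters."""
    return g0 * np.exp(g1 * x)

def exponential_3p(x, g0, g1, g2):
    """g0 + g1 * exp(g2 * x): one predictor, three parameters."""
    return g0 + g1 * np.exp(g2 * x)

def logistic_3p(x, g0, g1, g2):
    """g0 / (1 + g1 * exp(g2 * x)): one predictor, three parameters."""
    return g0 / (1.0 + g1 * np.exp(g2 * x))

# Illustrative parameter values only.
x = np.linspace(0.0, 10.0, 5)
print(exponential_2p(x, 5.0, -0.3))
print(exponential_3p(x, 1.0, 4.0, -0.3))
print(logistic_3p(x, 10.0, 20.0, -1.0))
```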

As a point of caution, some nonlinear response functions can be linearized by a suitable transformation, in which case they are called *intrinsically linear* response functions. Linear regression may still be inappropriate for such models, however, because the assumption of constant error variance may not hold after the transformation. The error terms should be examined in this case to check whether linear regression is appropriate.
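
As a worked case, consider the exponential response function under the assumption of a multiplicative error term (an assumption made only for this illustration; the models written above use an additive error). Taking logarithms gives

$$
Y_i = \gamma_0\exp(\gamma_1X_{i1})\,\epsilon_i \quad\Longrightarrow\quad \log_e Y_i = \log_e\gamma_0 + \gamma_1X_{i1} + \log_e\epsilon_i,
$$

which is linear in $\beta_0 = \log_e\gamma_0$ and $\beta_1 = \gamma_1$. With an additive error, $\log_e(\gamma_0\exp(\gamma_1X_{i1}) + \epsilon_i)$ does not separate into a linear predictor plus an error term, so the transformed model no longer has a simple additive error with constant variance.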

**Estimation of Regression Parameters**

Least squares and maximum likelihood are used to estimate the parameters of nonlinear regression functions, as was the case for linear regression. A major difference, however, is that analytical solutions are usually unavailable, so numerical search methods are generally required. We will discuss how to find least squares estimates, focusing first on the normal equations and second on numerical search procedures.

*Normal Equations*

In this case we minimize

$$
Q = \sum_{i=1}^n[Y_i - f(\mathbf{X_i}, \boldsymbol{\gamma})]^2,
$$

with respect to each $\gamma_k$. The partial derivative of $Q$ with respect to $\gamma_k$ is:

$$
\frac{\partial{Q}}{\partial{\gamma_k}} = \sum_{i=1}^n-2[Y_i - f(\mathbf{X_i},\boldsymbol{\gamma})]\Big[ \frac{\partial{f(\mathbf{X_i},\boldsymbol{\gamma})}}{\partial{\gamma_k}} \Big].
$$

Setting the $p$ partial derivatives equal to $0$ and replacing the parameters $\gamma_k$ by the least squares estimates $g_k$, we get:

$$
0 = \sum_{i=1}^n Y_i \Big[\frac{\partial{f(\mathbf{X}_i,\boldsymbol{\gamma})}}{\partial{\gamma_k}}\Big]_{\boldsymbol{\gamma}=\mathbf{g}} - \sum_{i=1}^n f(\mathbf{X}_i,\mathbf{g}) \Big[\frac{\partial{f(\mathbf{X}_i,\boldsymbol{\gamma})}}{\partial{\gamma_k}}\Big]_{\boldsymbol{\gamma}=\mathbf{g}}, \quad k = 0, 1, \ldots, p-1.
$$
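
As a worked instance (a sketch for the two-parameter exponential model $f(\mathbf{X}_i,\boldsymbol{\gamma}) = \gamma_0\exp(\gamma_1X_{i1})$ introduced earlier), the partial derivatives are $\partial f/\partial\gamma_0 = \exp(\gamma_1X_{i1})$ and $\partial f/\partial\gamma_1 = \gamma_0X_{i1}\exp(\gamma_1X_{i1})$. The two normal equations then become, after dividing the second by $g_0$ (assumed nonzero):

\begin{align}
\sum_{i=1}^n Y_i\exp(g_1X_{i1}) - g_0\sum_{i=1}^n\exp(2g_1X_{i1}) &= 0 \\
\sum_{i=1}^n Y_iX_{i1}\exp(g_1X_{i1}) - g_0\sum_{i=1}^n X_{i1}\exp(2g_1X_{i1}) &= 0
\end{align}

These equations are nonlinear in $g_0$ and $g_1$ and have no closed-form solution, which is why a numerical search is needed.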

*Direct Numerical Search: Gauss-Newton Method*

One option is to apply a numerical search to the normal equations derived above. In many instances, however, it is more practical to search directly for the least squares estimates. One such direct method is the *Gauss-Newton* or *linearization method*, which uses a Taylor series expansion to approximate the nonlinear regression model with linear terms and then applies ordinary least squares to the approximation to estimate the parameters.

The first step is to initialize the $g_k^{(0)}$ for use in the first iteration of the algorithm. We then use this initial estimate to evaluate the linear terms of the Taylor expansion to approximate $f(\mathbf{X_i},\boldsymbol{\gamma})$.

$$
f(\mathbf{X}_i,\boldsymbol{\gamma}) \approx f(\mathbf{X}_i, \mathbf{g}^{(0)}) + \sum_{k=0}^{p-1} \Big[\frac{\partial{f(\mathbf{X}_i,\boldsymbol{\gamma})}}{\partial{\gamma_k}}\Big]_{\boldsymbol{\gamma} = \mathbf{g}^{(0)}}(\gamma_k - g_k^{(0)}) = f_i^{(0)} + \sum_{k=0}^{p-1}D_{ik}^{(0)}\beta_k^{(0)},
$$

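where $f_i^{(0)} = f(\mathbf{X}_i, \mathbf{g}^{(0)})$, $D_{ik}^{(0)}$ is the partial derivative of $f$ with respect to $\gamma_k$ evaluated at $\mathbf{g}^{(0)}$ for case $i$, and $\beta_k^{(0)} = \gamma_k - g_k^{(0)}$. Now taking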

$$
Y_i - f_i^{(0)} = Y_i^{(0)} \approx \sum_{k=0}^{p-1}D_{ik}^{(0)}\beta_k^{(0)} + \epsilon_i,
$$

we see that it takes the form of a linear regression model

$$
Y_i = \sum_{k=0}^{p-1} \beta_kX_{ik} + \epsilon_i
$$

In this approach, then, the $Y_i^{(0)}$ are residuals between the response and the regression function evaluated at the initial estimate for $g$, the $D_{ik}^{(0)}$ are the partial derivatives evaluated for each of $n$ cases with the parameters replaced by the initial estimates, and the $\beta_k^{(0)}$ represent the difference between the true parameters and the initial estimates. We can write this model as:

$$
\mathbf{Y}^{(0)} \approx \mathbf{D}^{(0)}\boldsymbol{\beta}^{(0)} + \boldsymbol{\epsilon},
$$

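whose ordinary least squares solution, $\mathbf{b}^{(0)} = (\mathbf{D}^{(0)'}\mathbf{D}^{(0)})^{-1}\mathbf{D}^{(0)'}\mathbf{Y}^{(0)}$, estimates $\boldsymbol{\beta}^{(0)}$ and yields our revised parameter estimates according to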

$$
\mathbf{g}^{(1)} = \mathbf{g}^{(0)} + \mathbf{b}^{(0)}.
$$

We next want to evaluate

\begin{align}
SSE^{(0)} &= \sum_{i=1}^n\big(Y_i - f_i^{(0)}\big)^2 \\
SSE^{(1)} &= \sum_{i=1}^n\big(Y_i - f_i^{(1)}\big)^2
\end{align}

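to ensure that $SSE^{(1)} < SSE^{(0)}$, that is, that we are moving in the right direction. The procedure is then repeated with $\mathbf{g}^{(1)}$ as the new starting point, and iteration continues until the change in the estimates or in $SSE$ between successive steps becomes negligible.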
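
The Gauss-Newton steps described above can be sketched for the two-parameter exponential model as follows. This is an illustrative implementation under stated assumptions, not the textbook's code: the data are synthetic, the starting values are hypothetical, and the step-halving safeguard (used when a full step fails to reduce $SSE$) is an addition for robustness.

```python
import numpy as np

def f(x, g):
    """Exponential response function g[0] * exp(g[1] * x)."""
    return g[0] * np.exp(g[1] * x)

def jacobian(x, g):
    """n x p matrix D whose columns are df/dgamma0 and df/dgamma1."""
    e = np.exp(g[1] * x)
    return np.column_stack([e, g[0] * x * e])

def gauss_newton(x, y, g_init, max_iter=50, tol=1e-10):
    g = np.asarray(g_init, dtype=float)
    sse = np.sum((y - f(x, g)) ** 2)
    for _ in range(max_iter):
        D = jacobian(x, g)
        # OLS of the current residuals Y^(0) on D gives the correction b
        b, *_ = np.linalg.lstsq(D, y - f(x, g), rcond=None)
        step = 1.0
        while True:  # assumption: halve the step until SSE decreases
            g_new = g + step * b
            sse_new = np.sum((y - f(x, g_new)) ** 2)
            if sse_new < sse or step < 1e-8:
                break
            step /= 2.0
        if not sse_new < sse:
            break  # no improving step found; keep the current estimate
        converged = (sse - sse_new) <= tol * (sse + tol)
        g, sse = g_new, sse_new
        if converged:
            break
    return g, sse

# Synthetic data with hypothetical true parameters (5.0, -0.3).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
y = f(x, [5.0, -0.3]) + rng.normal(0.0, 0.1, x.size)

g_start = [4.0, -0.2]  # hypothetical starting values
g_hat, sse_hat = gauss_newton(x, y, g_start)
print("estimates:", g_hat, "SSE:", sse_hat)
```

Using `np.linalg.lstsq` for the inner regression avoids forming $(\mathbf{D'D})^{-1}$ explicitly, which tends to be numerically safer when the columns of $\mathbf{D}$ are nearly collinear.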

**Inferences about Parameters for Nonlinear Regression**

Exact inference for regression parameters is available for linear regression with normal error terms. Unfortunately, the same cannot be said for nonlinear regression: even with normal error terms, the least squares and maximum likelihood estimators are not in general normally distributed, unbiased, or minimum variance. With large samples, however, these properties hold approximately.

Inferences about regression parameters require an estimate of the error term variance $\sigma^2$, which we obtain via:

$$
MSE = \frac{SSE}{n-p} = \frac{\sum_{i=1}^n\big(Y_i - f(\mathbf{X}_i,\mathbf{g})\big)^2}{n-p}
$$

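When the sample size is large, $\mathbf{g}$ is approximately unbiased and its variance-covariance matrix is estimated by $s^2\{\mathbf{g}\}$, with $\mathbf{D}$ evaluated at the final estimates $\mathbf{g}$:

\begin{align}
E[\mathbf{g}] &\approx \boldsymbol{\gamma} \\
s^2\{\mathbf{g}\} &= MSE(\mathbf{D'D})^{-1}
\end{align}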

Bootstrap inference may be successful when large-sample theory falls short.
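
The large-sample variance estimate $MSE(\mathbf{D'D})^{-1}$ can be computed directly once the least squares estimates are available. The sketch below assumes the two-parameter exponential model and synthetic data; the helper name `large_sample_cov` and all numeric values are hypothetical, and the plain Gauss-Newton loop is started at the true values only to keep the example short.

```python
import numpy as np

def f(x, g):
    """Exponential response function g[0] * exp(g[1] * x)."""
    return g[0] * np.exp(g[1] * x)

def jacobian(x, g):
    """n x p matrix D of partial derivatives evaluated at g."""
    e = np.exp(g[1] * x)
    return np.column_stack([e, g[0] * x * e])

def large_sample_cov(x, y, g):
    """Approximate covariance of g: MSE * (D'D)^{-1}, with D evaluated at g."""
    n, p = x.size, len(g)
    mse = np.sum((y - f(x, g)) ** 2) / (n - p)
    D = jacobian(x, g)
    return mse * np.linalg.inv(D.T @ D)

# Synthetic data with hypothetical true parameters.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
gamma_true = np.array([5.0, -0.3])
y = f(x, gamma_true) + rng.normal(0.0, 0.1, x.size)

# Plain Gauss-Newton iterations to reach the least squares estimates.
g = gamma_true.copy()
for _ in range(20):
    b, *_ = np.linalg.lstsq(jacobian(x, g), y - f(x, g), rcond=None)
    g = g + b

cov = large_sample_cov(x, y, g)
se = np.sqrt(np.diag(cov))  # approximate standard errors
print("estimates:", g)
print("approx. standard errors:", se)
```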

### Remaining Topics in Chapters 13 and 14: Neural Networks, Poisson Regression, Logistic Regression, GLMs

In the following section we cover, in a more piecemeal fashion, the remaining regression model types discussed in Kutner (2004). For the time being we skip logistic and polytomous regression and begin with Poisson regression and GLMs.

**Poisson Regression**

The Poisson probability distribution, for discrete outcomes that are counts, is of the form:

$$
p(Y) = \frac{\mu^Ye^{-\mu}}{Y!}.
$$

If counts are collected over time intervals of different lengths, the form may be modified to:

$$
p(Y) = \frac{(t\mu)^Ye^{-t\mu}}{Y!}.
$$

For example, if $\mu$ is the mean count per month (taken as 30 days), then $t=1$ when $Y$ is a count over one month, whereas $t=\frac{7}{30}$ when $Y$ is a count over one week.
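
A short sketch of the Poisson probability with an exposure factor $t$, using only the standard library. The function name `poisson_pmf` and the example values of $\mu$ are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu, t=1.0):
    """P(Y = y) for a Poisson count with mean t * mu (y a non-negative integer)."""
    rate = t * mu
    return rate ** y * math.exp(-rate) / math.factorial(y)

# Hypothetical monthly mean of 4 events; weekly counts use t = 7/30.
monthly = [poisson_pmf(y, 4.0) for y in range(5)]
weekly = [poisson_pmf(y, 4.0, t=7 / 30) for y in range(5)]
print("monthly:", monthly)
print("weekly: ", weekly)
```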

For Poisson regression, the typical form for the regression model is used:

$$
Y_i = E[Y_i] + \epsilon_i,
$$

where the mean response for the $i^{th}$ case, denoted now as $\mu_i$, is some function of the predictors and coefficients. Some examples follow:

\begin{align}
\mu_i &= \mu(\mathbf{X}_i,\boldsymbol{\beta}) = \mathbf{X}_i\boldsymbol{\beta} \\
\mu_i &= \mu(\mathbf{X}_i,\boldsymbol{\beta}) = \log_e(\mathbf{X}_i\boldsymbol{\beta}) \\
\mu_i &= \mu(\mathbf{X}_i,\boldsymbol{\beta}) = \exp(\mathbf{X}_i\boldsymbol{\beta})
\end{align}

We use *Maximum Likelihood* to estimate the parameters $\boldsymbol{\beta}$, so we need the likelihood function:

\begin{align}
L(\boldsymbol{\beta}) = \prod_{i=1}^n f_i(Y_i) &= \prod_{i=1}^n \frac{[\mu(\mathbf{X}_i, \boldsymbol{\beta})]^{Y_i}\exp[-\mu(\mathbf{X}_i,\boldsymbol{\beta})]}{Y_i!} \\
\log_eL(\boldsymbol{\beta}) &= \sum_{i=1}^n Y_i\log_e[\mu(\mathbf{X}_i,\boldsymbol{\beta})] - \sum_{i=1}^n\mu(\mathbf{X}_i,\boldsymbol{\beta}) - \sum_{i=1}^n\log_e(Y_i!)
\end{align}

For model development, i.e., testing whether subsets of predictor variables are needed, the likelihood ratio test statistic $G^2$ can be used (as with logistic regression).
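
The log-likelihood and the $G^2$ comparison can be sketched as follows for the exponential mean function $\mu_i = \exp(\mathbf{X}_i\boldsymbol{\beta})$. This is a minimal illustration on synthetic data. The Newton-Raphson fitting routine, the helper names, and the parameter values are assumptions, and $G^2$ is taken here as $-2[\log_eL(R) - \log_eL(F)]$ for a reduced model $R$ nested in a full model $F$.

```python
import math
import numpy as np

def log_likelihood(beta, X, y):
    """Poisson log-likelihood with mean mu_i = exp(X_i beta)."""
    eta = X @ beta
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])  # log(Y_i!)
    return float(np.sum(y * eta - np.exp(eta) - log_factorial))

def fit_poisson(X, y, n_iter=25):
    """Newton-Raphson ML fit; assumes column 0 of X is a column of ones."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())            # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])    # negative Hessian
        beta = beta + np.linalg.solve(info, score)
    return beta

# Synthetic counts with hypothetical true coefficients.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0.0, 2.0, n)
X_full = np.column_stack([np.ones(n), x])
X_reduced = X_full[:, :1]                 # intercept only
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X_full @ beta_true))

b_full = fit_poisson(X_full, y)
b_reduced = fit_poisson(X_reduced, y)

# Likelihood ratio statistic for dropping x from the model.
G2 = -2.0 * (log_likelihood(b_reduced, X_reduced, y)
             - log_likelihood(b_full, X_full, y))
print("full-model estimates:", b_full)
print("G^2:", G2)
```

In large samples $G^2$ is compared with a chi-square distribution whose degrees of freedom equal the number of coefficients dropped (one here, for which the 0.95 quantile is about 3.84).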

A **Generalized Linear Model** is any model meeting the following criteria:

1. The $Y_i$ are $n$ independent responses that follow an *exponential family* probability distribution with expectation $E\{Y_i\} = \mu_i$.
2. A *linear predictor*, $\mathbf{X}_i\boldsymbol{\beta}$, is used.
3. A *link function* relates the linear predictor to the mean response: $\mathbf{X}_i\boldsymbol{\beta} = g(\mu_i)$.
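
The role of the link function can be illustrated with a small sketch that pairs a few common links $g$ with their inverses, so that $\mu_i = g^{-1}(\mathbf{X}_i\boldsymbol{\beta})$. The `LINKS` table, the `mean_response` helper, and the numeric values are hypothetical names and data chosen for illustration.

```python
import numpy as np

# name: (link g mapping mu to the linear predictor, inverse link g^{-1})
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (np.log, np.exp),
    "logit": (lambda mu: np.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + np.exp(-eta))),
}

def mean_response(X, beta, link="log"):
    """mu_i = g^{-1}(X_i beta) for the chosen link."""
    _, inverse = LINKS[link]
    return inverse(X @ beta)

# Illustrative design matrix (intercept plus one predictor) and coefficients.
X = np.column_stack([np.ones(3), [0.0, 1.0, 2.0]])
beta = np.array([0.5, 0.8])

mu_log = mean_response(X, beta, link="log")      # Poisson-style mean, always > 0
mu_logit = mean_response(X, beta, link="logit")  # probability in (0, 1)
print(mu_log, mu_logit)
```

The log link corresponds to the exponential mean function used for Poisson regression above, and the identity link with normally distributed responses recovers ordinary linear regression.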


