In [1]:
from IPython.core.display import HTML
HTML("""
<style>
div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.3em;
line-height:1.4em;
padding-left:1.5em;
padding-right:1.5em;
}
</style>
""")

<h1><center>Moving Beyond Linearity</center></h1>

Linear models are limited in predictive power because they assume a linear relationship between the predictors and the response. They can be extended in several simple ways:

 - <b>Polynomial regression</b> extends linear regression by adding extra predictors, obtained by raising the original predictors to higher powers.
 
 
 - <b>Step functions</b> cut the range of a variable into $K$ distinct regions in order to produce a qualitative variable.
 
 
 - <b>Regression splines</b> extend polynomial regression and step functions. They divide the range of the predictor $X$ into $K$ distinct regions and fit a polynomial function to the data within each region.
 
 
 - <b>Smoothing splines</b>
 
 
 - <b>Local regression</b>
 
  
 - <b>Generalized additive models</b>
  

### 7.1 Polynomial Regression

A standard linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

can be replaced by a more generic polynomial function

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + ... + \beta_d x_i^d + \epsilon_i$$

This approach is known as <b>polynomial regression</b>, and for large enough values of $d$ it can produce a highly non-linear curve. It is unusual to use $d$ greater than 3 or 4, because the curve can then become overly flexible. The model is still linear in the coefficients $\beta_0, ..., \beta_d$, so they can be estimated by ordinary least squares, treating $x_i, x_i^2, ..., x_i^d$ as the predictors. Similarly, polynomial terms can be used as predictors in <b>logistic regression</b>.
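As a minimal sketch of the least squares fit described above, the following builds the design matrix with columns $1, x, ..., x^d$ and solves for the coefficients with NumPy. The synthetic data, the seed and the choice $d = 3$ are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption for illustration): a cubic signal plus noise.
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=x.size)

d = 3  # polynomial degree

# Design matrix with columns 1, x, x^2, ..., x^d.
X = np.vander(x, N=d + 1, increasing=True)

# Ordinary least squares: the model is linear in beta_0, ..., beta_d.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat
rss = float(np.sum((y - y_hat) ** 2))
print("estimated coefficients:", np.round(beta_hat, 3))
```

The same coefficients (in reverse order) are returned by `np.polyfit(x, y, d)`; the explicit design matrix is shown here because it makes the link to linear regression visible.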

### 7.2 Step Functions

Polynomial regression imposes a <b>global</b> structure on the fit. With <b>step functions</b>, we instead divide the range of $X$ into <b>bins</b> and fit a different constant in each bin. We create $K$ <b>cutpoints</b> $c_1, c_2, ..., c_K$ in the range of $X$, and then construct $K+1$ new <b>categorical</b> (indicator) variables as:

$$C_0(X) = I(X < c_1), \quad C_j(X) = I(c_j \leq X < c_{j+1}) \ \text{for} \ j = 1, ..., K-1, \quad C_K(X) = I(c_K \leq X)$$

where $I(.)$ is an <b>indicator function</b> which returns 1 if the condition is true and 0 otherwise. For any value of $X$, $C_0(X) + C_1(X) + ... + C_K(X) = 1$, as $X$ falls in exactly one of the $K+1$ bins. We can then fit a least squares linear model using $C_1(X), C_2(X),...,C_K(X)$ as predictors. We omit $C_0(X)$ because it is redundant with the intercept. The linear model is given as:

$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + ... + \beta_K C_K(x_i) + \epsilon_i$$

$\beta_0$ is the mean response for $X<c_1$. The mean response for $c_j \leq X < c_{j+1}$ is $\beta_0 + \beta_j$. Hence, $\beta_j$ represents the average increase in the response for $X$ in $c_j \leq X < c_{j+1}$ relative to $X < c_1$. A logistic regression model can be fitted in the same way.
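The interpretation of the coefficients can be checked numerically. The sketch below, on synthetic data with two hand-picked cutpoints (both assumptions for illustration), builds the indicator predictors with `np.digitize`, fits the linear model, and compares $\beta_0$ and $\beta_0 + \beta_j$ with the per-bin sample means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic piecewise-constant signal plus noise (an assumption for illustration).
x = rng.uniform(0.0, 10.0, size=300)
y = np.where(x < 3, 1.0, np.where(x < 7, 4.0, 2.0)) + rng.normal(scale=0.5, size=x.size)

cutpoints = np.array([3.0, 7.0])  # c_1, c_2 (K = 2), chosen by hand
K = len(cutpoints)

# Bin index: 0 for x < c_1, 1 for c_1 <= x < c_2, 2 for x >= c_2.
bins = np.digitize(x, cutpoints)

# Columns: intercept, C_1(x), ..., C_K(x); C_0 is omitted as the baseline.
X = np.column_stack(
    [np.ones_like(x)] + [(bins == j).astype(float) for j in range(1, K + 1)]
)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample mean of y within each bin, for comparison with the coefficients.
bin_means = np.array([y[bins == j].mean() for j in range(K + 1)])
print("beta_0          :", beta_hat[0], " vs mean of bin 0:", bin_means[0])
print("beta_0 + beta_j :", beta_hat[0] + beta_hat[1:], " vs", bin_means[1:])
```

With indicator predictors, least squares reproduces the bin means exactly, which is why $\beta_j$ reads as a difference in average response relative to the baseline bin.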

### 7.3 Basis Functions

Polynomial and piecewise-constant regression models are special cases of a <b>basis function</b> approach to regression. In the basis function approach, we apply a family of functions $b_1(X), b_2(X), ..., b_K(X)$ to $X$ and, instead of fitting a linear model in $X$, we fit a linear model in the transformed predictors:

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + ... + \beta_K b_K(x_i) + \epsilon_i$$

The basis functions are fixed and known. For polynomial regression, the basis functions are $b_j(x_i) = x_i^j$. For piecewise constant functions, they are $b_j(x_i) = I(c_j \leq x_i < c_{j+1})$. Because the basis function approach is just a linear model fitted on the transformed variables, all the inference tools for linear models (standard errors, F-statistics and so on) can be used.

### 7.4 Regression Splines

Regression splines are a flexible class of basis functions that extend the polynomial and piecewise constant regression approaches.

#### 7.4.1 Piecewise Polynomials

<b>Piecewise polynomial regression</b> fits separate low-degree polynomials over different regions of $X$. For example, a piecewise quadratic polynomial fits a quadratic regression model of the form

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$

where the coefficients $\beta_0, \beta_1, \beta_2$ differ in different parts of the range of $X$. The points where the coefficients change are called <b>knots</b>. Each of the polynomial functions can be fit by least squares, using only the observations in its region. Increasing the number of knots gives a more flexible piecewise polynomial.
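As a rough illustration, the sketch below fits two separate quadratics on either side of a single knot and compares the result with one global quadratic. The data, the knot location and the helper names `fit_quadratic` and `predict_quadratic` are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knot = 5.0  # a single knot, chosen by hand for illustration


def fit_quadratic(xs, ys):
    """Least-squares coefficients (beta_0, beta_1, beta_2) of a quadratic."""
    X = np.vander(xs, N=3, increasing=True)
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return beta


def predict_quadratic(beta, xs):
    """Evaluate beta_0 + beta_1 x + beta_2 x^2 at xs (scalar or array)."""
    return np.vander(np.atleast_1d(xs), N=3, increasing=True) @ beta


left = x < knot
beta_left = fit_quadratic(x[left], y[left])      # coefficients for x < knot
beta_right = fit_quadratic(x[~left], y[~left])   # coefficients for x >= knot
beta_global = fit_quadratic(x, y)                # one quadratic for all x

y_hat_piecewise = np.where(
    left, predict_quadratic(beta_left, x), predict_quadratic(beta_right, x)
)
rss_piecewise = float(np.sum((y - y_hat_piecewise) ** 2))
rss_global = float(np.sum((y - predict_quadratic(beta_global, x)) ** 2))

# Nothing ties the two pieces together, so they generally disagree at the knot.
jump_at_knot = float(
    predict_quadratic(beta_right, knot)[0] - predict_quadratic(beta_left, knot)[0]
)
print("RSS piecewise:", rss_piecewise, " RSS global:", rss_global)
print("jump at the knot:", jump_at_knot)
```

The piecewise fit can never have a larger training RSS than the global quadratic, since the global model is the special case where both pieces share the same coefficients. The price is the jump at the knot.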

#### 7.4.2 Constraints and Splines

With piecewise polynomial regression, the fitted curve may have a <b>discontinuity at the knots</b>, a sign that the fit is too flexible. Instead, we can fit a piecewise polynomial under the constraint that the fitted curve must be continuous. We can add further constraints, for example that both the first and second derivatives of the piecewise polynomial must be continuous at the knots. <b>Each added constraint removes one degree of freedom from the model, reducing the complexity of the resulting piecewise polynomial fit</b>. Hence, by imposing three constraints at a knot (continuity of the function and of its first and second derivatives), we reduce the degrees of freedom of the model by 3 for that knot.

A piecewise cubic polynomial with these three constraints at each knot (continuity of the function and of its first and second derivatives) is called a <b>cubic spline</b>. A cubic spline has $K+4$ degrees of freedom, where $K$ is the <b>number of knots</b>. This can be explained as follows: the leftmost (or rightmost) cubic piece has 4 degrees of freedom, as it has 4 coefficients to estimate. Each knot then adds one parameter (the new cubic piece brings 4 coefficients, and the three imposed constraints leave one of them free), making a total of $K+4$ parameters for $K$ knots. In general, a <b>degree-$d$ spline</b> is a piecewise degree-$d$ polynomial with continuous derivatives up to order $d-1$ at each knot.
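The same count can be reached by direct arithmetic. $K$ knots split the range of $X$ into $K+1$ regions, each with an unconstrained cubic (4 coefficients), and each knot carries 3 constraints:

$$\underbrace{4(K+1)}_{\text{coefficients of } K+1 \text{ cubic pieces}} - \underbrace{3K}_{\text{constraints at } K \text{ knots}} = K + 4$$

For example, with $K = 3$ knots there are $4 \times 4 = 16$ coefficients and $3 \times 3 = 9$ constraints, leaving $16 - 9 = 7 = K + 4$ degrees of freedom.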

#### 7.4.3 The Spline Basis Representation

A cubic spline with $K$ knots can be modeled as:

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + ... + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$$

First, the equation reflects the fact that a cubic spline has $K+4$ degrees of freedom, so we have to estimate $K+4$ parameters (an intercept and $K+3$ coefficients). Next, we need to choose the <b>basis functions</b> $b_1, b_2, ..., b_{K+3}$. As explained above, a cubic spline can be interpreted as a cubic polynomial on the leftmost (or rightmost) region, which has 4 degrees of freedom and is fit without any constraint. This gives the intercept and the first three basis functions $x, x^2$ and $x^3$. Then we have to add one degree of freedom (parameter) per knot, while keeping the function and its first and second derivatives continuous. This behaviour can be captured by adding one <b>truncated power basis function</b> per knot, which is given as:

$$
h(x, \xi) = (x - \xi)^3_+ =
\begin{cases}
  (x - \xi)^3, & \text{if}\ x > \xi \\
  0, & \text{otherwise}
\end{cases}
$$

where $\xi$ is the knot. Adding a term $\beta h(x, \xi)$ to a cubic polynomial introduces a discontinuity only in the third derivative at $\xi$; the function and its first and second derivatives remain continuous. Hence, to fit a cubic spline with $K$ knots to a data set, we perform least squares regression with an intercept and $3+K$ predictors $X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), ..., h(X, \xi_K)$, where $\xi_1, \xi_2, ..., \xi_K$ are the knots.
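A minimal sketch of this construction follows. The helper name `truncated_power_basis`, the synthetic data and the three knot locations are assumptions for illustration. In practice a numerically better-conditioned basis such as B-splines is often preferred, but the truncated power basis matches the description above most directly.

```python
import numpy as np


def truncated_power_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3, (x - xi_1)^3_+, ..., (x - xi_K)^3_+."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=4, increasing=True)
    # (x - xi)_+ is max(x - xi, 0); cubing it gives the truncated power function.
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** 3
    return np.hstack([poly, trunc])


rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, size=200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

knots = np.array([0.25, 0.5, 0.75])  # K = 3, chosen by hand
B = truncated_power_basis(x, knots)

# Least squares on the K + 4 columns (intercept included in the basis).
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta_hat
print("number of parameters:", beta_hat.size)  # K + 4
print("training MSE:", float(np.mean((y - y_hat) ** 2)))
```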

Cubic splines have high variance near the boundaries of the predictor range. A <b>natural spline</b> adds <b>boundary constraints</b> (the function is required to be linear beyond the boundary knots, which removes 2 degrees of freedom at each boundary) and hence reduces the variance, producing more stable estimates near the boundaries.

#### 7.4.4 Choosing the Number and Locations of the Knots

A regression spline is most flexible in regions that contain many knots. One approach is therefore to place more knots in the regions where we expect the function to vary most rapidly. In practice, it is common to place knots in a uniform fashion, for example at uniform quantiles of the data. The number of knots can be chosen by inspecting the fitted curve visually or by cross-validation.
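One possible way to combine both ideas (quantile-based knot placement and cross-validation over the number of knots) is sketched below. The helper names, the truncated power basis, the 10-fold split and the candidate list are all assumptions for illustration, not a prescribed procedure.

```python
import numpy as np


def spline_basis(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - xi_k)^3_+ for each knot."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=4, increasing=True)
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** 3
    return np.hstack([poly, trunc])


def quantile_knots(x, n_knots):
    """Interior knots at evenly spaced quantiles of x."""
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.quantile(x, probs)


def cv_mse(x, y, n_knots, n_folds=10, seed=0):
    """K-fold cross-validated mean squared error for a given number of knots."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(x.size, dtype=bool)
        train[test_idx] = False
        # Knots are recomputed from the training part only.
        knots = quantile_knots(x[train], n_knots)
        beta, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        pred = spline_basis(x[test_idx], knots) @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))


rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=300)
y = np.sin(4 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

candidates = [1, 2, 3, 5, 8, 12]
scores = {k: cv_mse(x, y, k) for k in candidates}
best_k = min(scores, key=scores.get)
print("CV MSE by number of knots:", {k: round(v, 4) for k, v in scores.items()})
print("selected number of knots:", best_k)
```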

#### 7.4.5 Comparison to Polynomial Regression

Regression splines often give better results than polynomial regression. Regression splines increase the flexibility of the model by increasing the number of knots while keeping the degree fixed. As we add knots, we can place more of them in the regions where the function $f$ seems to change rapidly and fewer in the regions where it is stable. In polynomial regression, the only way to increase flexibility is to increase the degree of the polynomial, which may result in instability (especially near the boundaries) and overfitting.

### 7.5 Smoothing Splines

#### 7.5.1 An Overview of Smoothing Splines

Regression splines are created by specifying a set of knots, producing a sequence of basis functions, and then estimating the spline coefficients using least squares. Smoothing splines take a different route to a spline.

To fit a smooth curve to a data set, we need to find a function $g(x)$ such that $RSS = \sum_{i=1}^{n}(y_i - g(x_i))^2$ is minimum. If we do not put any constraint on $g(x)$, we can always find a function $g(x)$, which will make RSS 0. This function will be too flexible and will overfit the data. Hence, we need to find a function $g$ which makes RSS small and which is <b>smooth</b> as well.

One way to find such a smooth function is to minimize:

$$\sum_{i=1}^{n}(y_i - g(x_i))^2 + \lambda \int g^{''}(t)^2 dt$$

where $\lambda$ is a nonnegative <b>tuning parameter</b>. The function that minimizes this quantity is called a <b>smoothing spline</b>. The first term is a <b>loss function</b> and the second term is a <b>penalty</b> that penalizes the variability of $g$. The second derivative of a function measures its roughness, as it is the rate at which the slope of the curve changes. Hence, the second term encourages $g$ to be smooth. The larger the value of $\lambda$, the smoother $g$ will be. When $\lambda = 0$, the model is very flexible and interpolates the training data. As $\lambda \to \infty$, $g$ becomes a straight line, and the model corresponds to simple <b>least squares linear regression</b>. In a nutshell, $\lambda$ <b>controls the bias-variance trade-off of the smoothing spline</b>.

The function $g$ that minimizes the above quantity is a <b>natural cubic spline</b> with knots at the unique values of $x_1, ..., x_n$. It is a piecewise cubic polynomial with continuous first and second derivatives at each knot, and it is linear in the regions outside the extreme knots. However, it is not the same natural cubic spline that the basis function approach would give with these knots: it is a <b>shrunken</b> version of it, where the tuning parameter $\lambda$ controls the amount of shrinkage.
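The effect of $\lambda$ can be illustrated with a discrete analogue of the criterion above. The sketch below is not an exact smoothing spline: it assumes equally spaced, sorted $x$ values and replaces $\int g''(t)^2 dt$ with a sum of squared second differences of the fitted values. The function name `second_difference_smoother` and the data are hypothetical.

```python
import numpy as np


def second_difference_smoother(y, lam):
    """Minimise sum (y_i - g_i)^2 + lam * sum (g_{i-1} - 2 g_i + g_{i+1})^2.

    Assumes equally spaced, sorted x values, so that second differences stand
    in for the second derivative in the smoothing-spline penalty.
    """
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)  # (n - 2) x n second-difference matrix
    # Setting the gradient to zero gives the linear system (I + lam D^T D) g = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)


rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

g_rough = second_difference_smoother(y, lam=0.0)    # interpolates the data
g_mid = second_difference_smoother(y, lam=10.0)     # a compromise
g_smooth = second_difference_smoother(y, lam=1e9)   # essentially a straight line

# Least-squares straight line, for comparison with the large-lambda fit.
line = np.polyval(np.polyfit(x, y, 1), x)
print("max |g_rough - y|     :", float(np.max(np.abs(g_rough - y))))
print("max |g_smooth - line| :", float(np.max(np.abs(g_smooth - line))))
```

The two printed gaps should both be close to zero, matching the two limiting cases $\lambda = 0$ and $\lambda \to \infty$.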

#### 7.5.2 Choosing the Smoothing Parameter λ

The tuning parameter $\lambda$ controls the flexibility of the smoothing spline, and hence the <b>effective degree of freedom</b>. As $\lambda$ increases from 0 to $\infty$, the effective degree of freedom ($df_{\lambda}$) decreases from $n$ to 2.

Usually, degrees of freedom refer to the number of free parameters (coefficients) in a model. A smoothing spline has $n$ parameters and hence $n$ nominal degrees of freedom, but these $n$ parameters are heavily constrained (shrunk). The effective degrees of freedom measure the flexibility that remains after this shrinkage.

In fitting a smoothing spline, we do not need to select the number or location of the knots, as there is a knot at each training observation. Our main concern is the choice of $\lambda$. One possible approach is to choose $\lambda$ by cross-validation. LOOCV can be computed very efficiently for smoothing splines, using the following formula for the cross-validated RSS:

$$RSS_{cv}(\lambda) = \sum_{i=1}^{n} (y_i - \widehat{g_{\lambda}}^{(-i)}(x_i))^2 = 
\sum_{i=1}^{n} \bigg[ \frac{y_i - \widehat{g_{\lambda}}(x_i)}{1- (S_{\lambda})_{ii}} \bigg] ^2$$ 

Here $\widehat{g_{\lambda}}^{(-i)}(x_i)$ indicates the fitted value of the smoothing spline evaluated at $x_i$, where the model is fit to all the training observations except the $i$th observation $(x_i, y_i)$ (according to the definition of LOOCV). $\widehat{g_{\lambda}}(x_i)$ indicates the fit at $x_i$ using all the training observations. The matrix $S_{\lambda}$ is defined through the relation:

$$\widehat{g_{\lambda}} = S_{\lambda}y$$

where $\widehat{g_{\lambda}}$ is the vector of fitted values for a particular value of $\lambda$ and $y$ is the response vector. Hence, the <b>RSS of LOOCV can be computed using only $\widehat{g_{\lambda}}$, the original fit to the entire data set</b>, which makes it efficient: no refitting is needed. The effective degrees of freedom of the smoothing spline are given as:

$$df_{\lambda} = \sum_{i=1}^{n} (S_{\lambda})_{ii}$$
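Both formulas can be checked numerically on any penalized least squares smoother of the form $\widehat{g} = S_{\lambda} y$. The sketch below uses a second-difference penalty on equally spaced points as a stand-in for the exact smoothing spline (an assumption of this example, as are all function names), and compares the shortcut with a brute-force LOOCV that refits $n$ times.

```python
import numpy as np


def smoother_matrix(n, lam):
    """S_lambda for the second-difference smoother, so that g_hat = S_lambda @ y."""
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)


def loocv_shortcut(y, lam):
    """LOOCV RSS from a single fit, using the diagonal of S_lambda."""
    S = smoother_matrix(y.size, lam)
    resid = y - S @ y
    return float(np.sum((resid / (1.0 - np.diag(S))) ** 2))


def loocv_brute_force(y, lam):
    """LOOCV RSS by refitting n times, giving observation i zero weight each time."""
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * D.T @ D
    total = 0.0
    for i in range(n):
        w = np.ones(n)
        w[i] = 0.0  # drop observation i from the loss term
        g = np.linalg.solve(np.diag(w) + penalty, w * y)
        total += (y[i] - g[i]) ** 2
    return float(total)


rng = np.random.default_rng(6)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

lam = 5.0
S = smoother_matrix(y.size, lam)
df_lambda = float(np.trace(S))  # effective degrees of freedom
rss_cv_fast = loocv_shortcut(y, lam)
rss_cv_slow = loocv_brute_force(y, lam)
print("effective degrees of freedom:", df_lambda)
print("LOOCV RSS (shortcut)   :", rss_cv_fast)
print("LOOCV RSS (brute force):", rss_cv_slow)
```

The trace of `smoother_matrix(n, lam)` moves from $n$ at `lam = 0` towards 2 as `lam` grows, mirroring the behaviour of $df_{\lambda}$ described above.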

### 7.6 Local Regression

<b>Local regression</b> computes the fit at a target point $x_0$ using only the nearby training observations. The algorithm for local regression is as follows:

 - Gather the $k$ points closest to $x_0$.
 - Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, such that the point farthest from $x_0$ has the lowest weight and the closest has the highest. All points other than these $k$ nearest neighbors have weight 0.
 - Fit a <b>weighted least squares regression</b> of the $y_i$ on the $x_i$ using these weights, by finding $\beta_0, \beta_1$ that minimize
 
 $$\sum_{i=1}^{n}K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$$
 
 
 - The fitted value at $x_0$ is given as $\widehat{f}(x_0) = \widehat{\beta_0} + \widehat{\beta_1} x_0$.
 
The <b>span s</b> of a local regression is defined as $s = \frac{k}{n}$, where $n$ is the total number of training samples. It controls the flexibility of the non-linear fit. The smaller the value of $s$, the more local and wiggly the fit. For larger values of $s$, we obtain a more global fit. An appropriate value of $s$ can be chosen by cross-validation.
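The steps above can be sketched as follows. The tricube kernel, the way distances are scaled, the helper names and the synthetic data are assumptions for illustration; the algorithm only requires that weights decrease with distance from $x_0$.

```python
import numpy as np


def tricube(u):
    """Tricube kernel on scaled distances u in [0, 1]; zero at and beyond 1."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u**3) ** 3


def local_linear_fit(x0, x, y, span=0.3):
    """Local linear regression estimate of f(x0) from the k = span * n nearest points."""
    n = x.size
    k = min(n, max(3, int(np.ceil(span * n))))
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]          # step 1: the k closest points
    # Step 2: weights decrease with distance; the farthest neighbour gets weight 0.
    h = max(dist[nearest].max(), np.finfo(float).eps)
    w = tricube(dist[nearest] / h)
    # Step 3: weighted least squares, solved as ordinary least squares
    # after scaling rows by the square roots of the weights.
    sw = np.sqrt(w)
    X = np.column_stack([np.ones(k), x[nearest]])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    # Step 4: evaluate the local line at x0.
    return float(beta[0] + beta[1] * x0)


rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(0.5, 9.5, 50)
fit_local = np.array([local_linear_fit(x0, x, y, span=0.1) for x0 in grid])
fit_global = np.array([local_linear_fit(x0, x, y, span=0.8) for x0 in grid])

mse_local = float(np.mean((fit_local - np.sin(grid)) ** 2))
mse_global = float(np.mean((fit_global - np.sin(grid)) ** 2))
print("error vs. true curve, span 0.1:", mse_local)
print("error vs. true curve, span 0.8:", mse_global)
```

On this curved signal the small span tracks the true function more closely, while the large span behaves more like a single global line. On noisier or flatter data the ranking can reverse, which is why the span is usually chosen by cross-validation.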

### 7.7 Generalized Additive Models

<b>Generalized additive models (GAMs)</b> predict $Y$ on the basis of $p$ predictors $X_1, X_2, ..., X_p$. This can be viewed as an extension of multiple linear regression.

#### 7.7.1 GAMs for Regression Problems

Multiple linear regression model can be given as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i$$

In order to incorporate a non-linear relationship between each feature and the response, each linear component $\beta_j x_{ij}$ can be replaced with a smooth non-linear function $f_j(x_{ij})$. The model can be expressed as:

$$y_i = \beta_0 + \sum_{j=1}^{p} f_{j}(x_{ij}) + \epsilon_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + ... + f_p(x_{ip}) + \epsilon_i$$

This is an example of GAM. GAM is <b>additive</b> as we fit a separate non-linear model for each predictor and then add together their contributions. We can use any regression method to fit these individual models.
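As a minimal sketch of an additive fit, the example below represents each $f_j$ with a cubic polynomial basis and estimates all coefficients in one least squares problem. The choice of a polynomial basis (rather than splines or local regression), the helper name `poly_block` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def poly_block(x, degree=3):
    """Columns x, x^2, ..., x^degree (no intercept; the model has one shared beta_0)."""
    return np.column_stack([x**p for p in range(1, degree + 1)])


rng = np.random.default_rng(8)
n = 400
x1 = rng.uniform(-2.0, 2.0, size=n)
x2 = rng.uniform(-2.0, 2.0, size=n)

# Assumed truth for illustration: additive and non-linear in each predictor.
f1_true = x1**2
f2_true = x2**3 - x2
y = 1.0 + f1_true + f2_true + rng.normal(scale=0.3, size=n)

# One block of basis columns per predictor, stacked side by side.
B1, B2 = poly_block(x1), poly_block(x2)
X = np.column_stack([np.ones(n), B1, B2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

beta0 = coef[0]
f1_hat = B1 @ coef[1:4]  # estimated contribution of X_1
f2_hat = B2 @ coef[4:7]  # estimated contribution of X_2
y_hat = beta0 + f1_hat + f2_hat
print("beta_0:", round(float(beta0), 3))
print("coefficients of f_1:", np.round(coef[1:4], 3))
print("coefficients of f_2:", np.round(coef[4:7], 3))
```

Because the model is additive, `f1_hat` and `f2_hat` can be plotted against `x1` and `x2` separately to inspect the effect of each predictor.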

##### Pros and Cons of GAMs

 - As GAM models a non-linear relationship for each individual predictor, it will automatically capture the non-linear behaviour of the response.
 
 
 - As the model is additive, we can analyze the effect of each predictor on response by keeping other predictors constant.
 
 
 - The smoothness of each individual function can be summarized by its degree of freedom.
 
 
 - The main limitation of a GAM is its additive nature. Because of it, interactions between predictors are missed. However, we can manually add interaction terms (of the form $X_j \times X_k$) to a GAM.
 
#### 7.7.2 GAMs for Classification Problems

For a qualitative variable $Y$, which takes on two values 0 and 1, the logistic regression model can be given as:

$$log\bigg( \frac{p(X)}{1-p(X)} \bigg) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$$

where $p(X) = Pr(Y=1 | X)$ and the left hand side of the equation is called the <b>logit</b>, or log of the odds. To accommodate non-linearity, the above model can be modified as:

$$log\bigg( \frac{p(X)}{1-p(X)} \bigg) = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)$$
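A rough sketch of such a logistic additive model is given below. It assumes a quadratic polynomial basis for each $f_j$ and fits the coefficients by Newton's method (iteratively reweighted least squares). The basis, the helper names `fit_logistic_irls` and `poly_block`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def sigmoid(eta):
    """Inverse logit, with clipping to avoid overflow in exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))


def fit_logistic_irls(X, y, n_iter=25, ridge=1e-8):
    """Logistic regression by Newton's method; the tiny ridge only keeps the Hessian invertible."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def poly_block(x, degree=2):
    """Columns x, x^2, ..., x^degree for one predictor."""
    return np.column_stack([x**p for p in range(1, degree + 1)])


rng = np.random.default_rng(9)
n = 500
x1 = rng.uniform(-2.0, 2.0, size=n)
x2 = rng.uniform(-2.0, 2.0, size=n)

# Assumed truth: logit = beta_0 + f_1(x1) + f_2(x2) with f_1 = -2 x1^2, f_2 = x2.
logit_true = 2.0 - 2.0 * x1**2 + x2
y = (rng.uniform(size=n) < sigmoid(logit_true)).astype(float)

# Columns: 1, x1, x1^2, x2, x2^2.
X = np.column_stack([np.ones(n), poly_block(x1), poly_block(x2)])
beta_hat = fit_logistic_irls(X, y)

p_hat = sigmoid(X @ beta_hat)
accuracy = float(np.mean((p_hat > 0.5) == (y == 1.0)))
print("estimated coefficients:", np.round(beta_hat, 2))
print("training accuracy:", accuracy)
```

A logistic regression that is linear in $X_1$ could not represent the inverted-U effect of $X_1$ on the log odds; the $x_1^2$ column is what lets $f_1$ capture it here.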