# Chapter 6: Beyond Linear Models

- **Polynomial regression** extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, $X$, $X^2$, and $X^3$, as predictors. This approach is a simple way to obtain a non-linear fit to data.
- **Step functions** cut the range of a variable into $K$ distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
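
Both ideas amount to building new columns from $x$ and running ordinary least squares on them. The sketch below illustrates this with NumPy on simulated data; the helper names (`polynomial_basis`, `step_basis`, `fit_least_squares`), the cutpoints, and the data are assumptions made for illustration, not part of the notes.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns x, x^2, ..., x^degree (no intercept column)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def step_basis(x, cutpoints):
    """One 0/1 indicator column per region defined by the sorted cutpoints."""
    x = np.asarray(x, dtype=float)
    region = np.digitize(x, cutpoints)  # region index in 0..len(cutpoints)
    return (region[:, None] == np.arange(len(cutpoints) + 1)).astype(float)

def fit_least_squares(design, y):
    """Ordinary least squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2, 2, 200))
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)

# Cubic regression: intercept plus x, x^2, x^3.
X_poly = np.column_stack([np.ones_like(x), polynomial_basis(x, 3)])
poly_fit = X_poly @ fit_least_squares(X_poly, y)

# Step function with K = 4 regions (3 cutpoints). The indicators sum to 1
# in every row, so no separate intercept column is needed.
X_step = step_basis(x, cutpoints=[-1.0, 0.0, 1.0])
step_coef = fit_least_squares(X_step, y)
step_fit = X_step @ step_coef
```

With indicator columns and no intercept, each step coefficient is simply the mean of $y$ in its region, which is why the fit is piecewise constant.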

## Regression Splines
### Constraints and Splines
Constraints are imposed at the knots so that the fitted piecewise polynomial is continuous or smooth.

**Cubic Spline**: Constraints include continuity, continuity of the first derivative, and continuity of the second derivative.

---
*Model Structure*
$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$

- $b_j(x)$: basis functions (can be $x$, $x^2$, $x^3$, or special truncated power functions)
- $K$: number of knots
- Total: $K+3$ basis functions plus an intercept → $K+4$ coefficients

*Choice of Basis Functions*
- Start with cubic polynomial terms:  
  $1, \quad x, \quad x^2, \quad x^3$
- For each knot $\xi_k$, add a **truncated cubic** term:  
  $h(x, \xi_k) = (x - \xi_k)^3_+ =
  \begin{cases}
  (x - \xi_k)^3, & x > \xi_k \\
  0, & x \leq \xi_k
  \end{cases}$
  
  where “$_+$” means “positive part”: zero below the knot, cubic above it.

*Why This Works*
- Without knots: $1, x, x^2, x^3$ define one global cubic polynomial (fixed shape).
- Adding $(x - \xi_k)^3_+$ allows extra curvature **only after** the knot.
- This construction keeps the function, first derivative, and second derivative continuous; only the third derivative changes at the knot.

*Degrees of Freedom*
- Intercept: 1 parameter  
- $x, x^2, x^3$: 3 parameters  
- $K$ truncated cubic terms: $K$ parameters  
- **Total:** $K+4$ parameters → $K+4$ degrees of freedom

*Intuition*

A cubic spline with $K$ knots can be fit as an ordinary linear regression on  
$1, \quad x, \quad x^2, \quad x^3, \quad (x - \xi_1)^3_+, \dots, (x - \xi_K)^3_+$  
where each knot’s truncated cubic term “activates” after that knot, giving the spline flexibility while keeping smoothness up to the second derivative.
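
The construction above can be sketched directly: build the $K+4$ columns and fit by least squares. This is a minimal illustration assuming NumPy; the function name `truncated_power_basis`, the knot locations, and the simulated data are hypothetical choices.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3 and (x - knot)^3_+ per knot."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x), x, x ** 2, x ** 3]
    for knot in knots:
        # Positive part: zero at or below the knot, cubic above it.
        columns.append(np.clip(x - knot, 0.0, None) ** 3)
    return np.column_stack(columns)

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = [2.5, 5.0, 7.5]                  # K = 3, an arbitrary illustrative choice
B = truncated_power_basis(x, knots)      # shape (n, K + 4)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)

def spline(x_new):
    """Evaluate the fitted cubic spline at new points."""
    return truncated_power_basis(x_new, knots) @ beta
```

The truncated power basis is easy to read but can be poorly conditioned when there are many knots; that is a practical caveat of this sketch rather than a property of splines themselves.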

---


**Degree-$d$ Spline**: a piecewise degree-$d$ polynomial whose derivatives up to order $d-1$ are continuous at each knot.

Splines can have high variance at the outer range of the predictors.

**Natural Spline**: a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where $X$ is smaller than the smallest knot, or larger than the largest knot). These constraints generally produce more stable estimates at the boundaries.
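
The notes do not specify a basis for natural splines. One standard construction, used here as an assumption, starts from the truncated cubic terms and combines them so that the quadratic and cubic parts cancel beyond the last knot: with knots $\xi_1 < \dots < \xi_K$ and $d_k(x) = \frac{(x-\xi_k)^3_+ - (x-\xi_K)^3_+}{\xi_K - \xi_k}$, the basis is $1,\ x,\ d_k(x) - d_{K-1}(x)$ for $k = 1, \dots, K-2$. The function name `natural_cubic_basis` and the knots below are hypothetical.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis with K columns for K sorted knots.

    The first and last knots act as the boundary knots; the function is
    linear to the left of the first and to the right of the last.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    last = knots[-1]

    def d(k):
        numerator = (np.clip(x - knots[k], 0.0, None) ** 3
                     - np.clip(x - last, 0.0, None) ** 3)
        return numerator / (last - knots[k])

    columns = [np.ones_like(x), x]
    d_penultimate = d(len(knots) - 2)
    for k in range(len(knots) - 2):
        columns.append(d(k) - d_penultimate)
    return np.column_stack(columns)

knots = [1.0, 3.0, 5.0, 7.0, 9.0]          # illustrative knots
grid = np.linspace(0.0, 10.0, 101)
N = natural_cubic_basis(grid, knots)       # shape (101, 5)
```

Any linear combination of these columns is a cubic spline that is linear outside the range of the knots, so it can be fit by least squares like the other bases.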

### Choose the Number and Locations of the Knots
- Specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.

- **Cross-validation**: remove a portion of the data (say 10%), fit a spline with a certain number of knots to the remaining data, and use that spline to predict the held-out portion. Repeat until each observation has been left out once, then compute the overall cross-validated RSS. Repeat the procedure for different numbers of knots $K$, and choose the value of $K$ that gives the smallest RSS.
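
A minimal sketch of the cross-validation procedure, assuming NumPy, 10 folds, a truncated power basis, and knots at uniform quantiles of the training fold. The helper names (`quantile_knots`, `cv_rss`), the candidate range for $K$, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Columns 1, x, x^2, x^3 and (x - knot)^3_+ for each knot."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x), x, x ** 2, x ** 3]
    columns += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(columns)

def quantile_knots(x, n_knots):
    """Place n_knots interior knots at uniform quantiles of x."""
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.quantile(x, probs)

def cv_rss(x, y, n_knots, n_folds=10, seed=0):
    """Cross-validated RSS: every observation is held out exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    rss = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        knots = quantile_knots(x[train], n_knots)
        beta, *_ = np.linalg.lstsq(
            truncated_power_basis(x[train], knots), y[train], rcond=None)
        pred = truncated_power_basis(x[held_out], knots) @ beta
        rss += np.sum((y[held_out] - pred) ** 2)
    return rss

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 400)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

scores = {K: cv_rss(x, y, K) for K in range(1, 11)}
best_K = min(scores, key=scores.get)   # K with the smallest cross-validated RSS
```

Using the same `seed` for every $K$ keeps the fold assignment fixed, so the candidates are compared on identical splits.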

## Smoothing Splines
$$
\sum_{i=1}^n (y_i - g(x_i))^2 \;+\; \lambda \int g''(t)^2 \, dt
\tag{7.11}
$$

where $\lambda$ is a nonnegative *tuning parameter*. The first term measures how well $g$ fits the data, and the second term penalizes roughness in $g$.
The function $g$ that minimizes (7.11) is known as a **smoothing spline**.

### Choose the tuning parameter $\lambda$
We can write  

$$
\hat{g}_\lambda = \mathbf{S}_\lambda \mathbf{y},
\tag{7.12}
$$  

where $\hat{g}_\lambda$ is the solution to (7.11) for a particular choice of $\lambda$; that is, it is an $n$-vector containing the fitted values of the smoothing spline at the training points $x_1, \dots, x_n$. Equation (7.12) indicates that the vector of fitted values when applying a smoothing spline to the data can be written as an $n \times n$ matrix $\mathbf{S}_\lambda$ (for which there is a formula) times the response vector $\mathbf{y}$.

Then the **effective degrees of freedom** is defined to be  

$$
df_\lambda = \sum_{i=1}^n \{ \mathbf{S}_\lambda \}_{ii},
\tag{7.13}
$$  

the sum of the diagonal elements of the matrix $\mathbf{S}_\lambda$.
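
The notes do not give the formula for $\mathbf{S}_\lambda$, so the sketch below swaps in a simpler linear smoother to illustrate (7.12) and (7.13): a discrete analog that penalizes squared second differences of the fitted values at equally spaced points. This is not the smoothing spline itself; it is an assumed stand-in that shares the form $\hat{g}_\lambda = \mathbf{S}_\lambda \mathbf{y}$. The function names are hypothetical.

```python
import numpy as np

def difference_smoother_matrix(n, lam):
    """S = (I + lam * D^T D)^{-1}, with D the second-difference operator.

    Stand-in for the smoothing-spline matrix: it minimizes
    sum (y_i - g_i)^2 + lam * sum (g_{i+2} - 2 g_{i+1} + g_i)^2.
    """
    D = np.diff(np.eye(n), n=2, axis=0)      # shape (n - 2, n)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def effective_df(S):
    """Effective degrees of freedom: the trace of the smoother matrix."""
    return np.trace(S)

n = 50
lams = [0.0, 1.0, 100.0, 1e4]
dfs = [effective_df(difference_smoother_matrix(n, lam)) for lam in lams]
for lam, df in zip(lams, dfs):
    print(f"lambda = {lam:g}, effective df = {df:.2f}")
```

In this stand-in, $\lambda = 0$ reproduces the data (effective df equal to $n$), and the effective df shrinks toward 2 as $\lambda$ grows, because only straight lines have zero second differences.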

**Leave-One-Out Cross-Validation (LOOCV)**

Cross-validated RSS:

$$
RSS_{cv}(\lambda) = \sum_{i=1}^n \big(y_i - \hat{g}_\lambda^{(-i)}(x_i)\big)^2
= \sum_{i=1}^n \left[\frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{\mathbf{S}_\lambda\}_{ii}}\right]^2
$$

- $\hat{g}_\lambda^{(-i)}(x_i)$: fitted value at $x_i$ with observation $i$ left out
- $\hat{g}_\lambda(x_i)$: fitted value at $x_i$ using all the data
- $\{\mathbf{S}_\lambda\}_{ii}$: $i$-th diagonal element of the smoother matrix
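
The right-hand side of the formula needs only one fit on the full data. The sketch below checks the identity numerically, again using an assumed stand-in for the smoothing spline: a second-difference penalty smoother on equally spaced points, with leave-one-out defined by giving observation $i$ zero weight. Function names, the $\lambda$ grid, and the data are hypothetical.

```python
import numpy as np

def penalty(n):
    """D^T D for the second-difference operator D."""
    D = np.diff(np.eye(n), n=2, axis=0)
    return D.T @ D

def loocv_shortcut(y, lam):
    """LOOCV RSS from a single fit, using the diagonal of S."""
    n = len(y)
    S = np.linalg.inv(np.eye(n) + lam * penalty(n))
    residuals = y - S @ y
    return np.sum((residuals / (1.0 - np.diag(S))) ** 2)

def loocv_brute_force(y, lam):
    """LOOCV RSS by refitting n times, each time dropping one observation."""
    n = len(y)
    P = penalty(n)
    total = 0.0
    for i in range(n):
        W = np.eye(n)
        W[i, i] = 0.0                        # observation i gets zero weight
        fit = np.linalg.solve(W + lam * P, W @ y)
        total += (y[i] - fit[i]) ** 2
    return total

# Simulated data on an equally spaced grid (hypothetical).
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=t.size)

grid = [0.1, 1.0, 10.0, 100.0, 1000.0]       # lam must be > 0 here
best_lam = min(grid, key=lambda lam: loocv_shortcut(y, lam))
```

For this smoother the two functions agree up to rounding error, while the shortcut costs one matrix inverse instead of $n$ linear solves.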


## Local Regression
---
*Local Regression at $X = x_0$*

1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.

2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood,  so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight.  
   All but these $k$ nearest neighbors get weight zero.

3. Fit a *weighted least squares regression* of the $y_i$ on the $x_i$ using the aforementioned weights,  
   by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize  
   $$
   \sum_{i=1}^n K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2.
   $$

4. The fitted value at $x_0$ is given by  
   $$
   \hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0.
   $$

---
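
The four steps above can be sketched as a single function. The notes leave the weighting function $K$ unspecified; the tricube kernel used here is an assumption, chosen because it gives the furthest neighbor weight zero and the closest the largest weight. The function name `local_linear_fit`, the span, and the data are hypothetical.

```python
import numpy as np

def local_linear_fit(x, y, x0, span=0.3):
    """Local linear regression estimate of f(x0) using a fraction `span` of the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: the k = span * n training points closest to x0.
    k = max(int(np.ceil(span * len(x))), 2)
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]

    # Step 2: tricube weights (assumed kernel); the furthest neighbor gets 0.
    # Assumes the neighbors are not all located exactly at x0.
    scaled = dist[nearest] / dist[nearest].max()
    weights = (1.0 - scaled ** 3) ** 3

    # Step 3: weighted least squares of y on x, via sqrt-weight rescaling.
    design = np.column_stack([np.ones(k), x[nearest]])
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y[nearest] * sw, rcond=None)

    # Step 4: fitted value at x0.
    return beta[0] + beta[1] * x0

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

grid = np.linspace(0.5, 9.5, 50)
fitted = np.array([local_linear_fit(x, y, x0) for x0 in grid])
```

Points outside the neighborhood are simply dropped, which is equivalent to giving them weight zero in the weighted sum. A smaller `span` gives a more local, wigglier fit; a larger one gives a smoother fit.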

## Generalized Additive Models
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.

### GAMs for Regression Models
$$
y_i = \beta_0+\sum_{j=1}^p f_j(x_{ij})+\epsilon_i
$$

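Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since in the case of smoothing splines, least squares cannot be used. Instead, the model is fit by **backfitting** (implementations are available in R): we repeatedly update the fit for each predictor in turn, holding the others fixed. Each time we update a function, we simply apply the fitting method for that variable to a partial residual. For example, the partial residual for $X_j$ is $r_i = y_i - \beta_0 - \sum_{k \neq j} f_k(x_{ik})$.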
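
A minimal sketch of the backfitting loop, assuming NumPy. To keep the example checkable, the per-variable smoother is a hypothetical stand-in (a least squares cubic polynomial fit) rather than a smoothing spline; any function that maps a predictor and a partial residual to fitted values could be passed in its place. The names `cubic_smoother` and `backfit` and the simulated data are illustrative.

```python
import numpy as np

def cubic_smoother(x, r):
    """Stand-in smoother: least squares cubic polynomial fit of r on x."""
    coeffs = np.polyfit(x, r, deg=3)
    return np.polyval(coeffs, x)

def backfit(X, y, smoother=cubic_smoother, n_iter=50):
    """Fit y = beta0 + sum_j f_j(x_j) by cycling over the predictors."""
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                     # f[:, j] holds f_j at the training points
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove everything except the j-th function.
            partial_residual = y - beta0 - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], partial_residual)
            f[:, j] -= f[:, j].mean()        # center f_j so beta0 stays identifiable
    return beta0, f

# Simulated additive data (hypothetical, for illustration only).
rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(300, 2))
y = 1.0 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

beta0, f = backfit(X, y)
fitted = beta0 + f.sum(axis=1)
```

With this polynomial stand-in, backfitting converges to the same fitted values as one joint least squares regression on all the polynomial terms, which gives a simple way to sanity-check the loop before swapping in a different smoother.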

### GAMs for Classification Problems
$$
\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0 + f_1(X_1)+f_2(X_2)+\dots+f_p(X_p)
$$
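
When each $f_j$ is represented by a fixed set of basis functions, this model is an ordinary logistic regression on the expanded columns. The sketch below assumes NumPy, uses centered cubic polynomial terms as a hypothetical stand-in for each $f_j$, and fits the coefficients with plain Newton iterations; the function names and the simulated data are illustrative.

```python
import numpy as np

def additive_design(X, degree=3):
    """Intercept plus centered polynomial terms for each predictor."""
    columns = [np.ones(len(X))]
    for j in range(X.shape[1]):
        for d in range(1, degree + 1):
            col = X[:, j] ** d
            columns.append(col - col.mean())
    return np.column_stack(columns)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(B, y, n_iter=25):
    """Maximum likelihood logistic regression via Newton's method."""
    beta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        p = sigmoid(B @ beta)
        gradient = B.T @ (y - p)
        hessian = B.T @ (B * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta

# Simulated binary data from an additive log-odds model (hypothetical).
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(500, 2))
log_odds = -1.0 + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
y = (rng.random(500) < sigmoid(log_odds)).astype(float)

B = additive_design(X)
beta = fit_logistic(B, y)
p_hat = sigmoid(B @ beta)                    # fitted probabilities p(X)
```

Plain Newton steps work for data like this, where the classes overlap; perfectly separable data would need a safeguard such as step halving or a penalty, which this sketch omits.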