# Modeling Non-Linearities

Linear models are simple to describe, implement and interpret.

But: they have significant limitations in terms of predictive power.

Ridge Regression, Lasso, PCR and others try to reduce the variance of the estimates, but they still assume a linear relationship between predictors and response.

**Non-Linear Models:**

 - Polynomial Regression
 
 - Step Functions
 
 - Regression Splines

## Polynomial Regression

Consider the following linear model

$$ y_i = \beta_0 + \beta_1 x_i +\epsilon_i $$

The corresponding polynomial model of degree $d$ would be

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 +\dots + \beta_d x_i^d +\epsilon_i $$

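Hence, for large enough degree $d$, Polynomial Regression allows for extremely non-linear curves. The model is still linear in the coefficients, so it can be fitted by ordinary least squares.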
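
A minimal sketch of such a fit, assuming NumPy and simulated data (the cubic signal, the sample size and the seed are illustrative choices, not part of the notes). The design matrix simply stacks the powers of $x$ as columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): cubic signal plus standard normal noise.
n, d = 200, 3
x = rng.normal(size=n)
y = 1.0 + 1.5 * x - 1.5 * x**2 + 1.5 * x**3 + rng.normal(size=n)

# Design matrix with columns 1, x, x^2, ..., x^d.
X = np.vander(x, N=d + 1, increasing=True)

# Least squares estimate of (beta_0, ..., beta_d).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The estimates should land close to the coefficients used in the simulation, since the fitted degree matches the true one.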

### Computation of SE and CI for Polynomial Regression

Think of a cubic model evaluated at a point $x_0$

$$ \hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 $$

Estimation uses least squares, and the variance of the fit at $x_0$ is

$$ Var(\hat{f}(x_0)) = l_0'\hat{C}l_0 $$

where $l_0' = (1,x_0,x_0^2,x_0^3)$ and $\hat{C}$ is the $4 \times 4$ covariance matrix of the coefficient estimates $\hat{\beta}_j$.

From least squares models we know that

$$ SE = \sqrt{Var(\hat{f}(x_0))} $$

$$ CI \approx \hat{f}(x_0) \pm 2 \cdot SE $$
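
The computation can be sketched as follows, assuming NumPy, simulated cubic data and the usual least squares estimate $\hat{C} = \hat{\sigma}^2 (X'X)^{-1}$ (the notes do not spell out how $\hat{C}$ is obtained, so this formula is an assumption about the intended estimator). The evaluation point `x0 = 0.5` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (assumption): cubic signal plus standard normal noise.
n = 200
x = rng.normal(size=n)
y = 1.0 + 1.5 * x - 1.5 * x**2 + 1.5 * x**3 + rng.normal(size=n)

# Cubic design matrix and least squares fit.
X = np.vander(x, N=4, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated coefficient covariance: sigma^2_hat * (X'X)^{-1}.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
C_hat = sigma2_hat * np.linalg.inv(X.T @ X)

# Pointwise variance, standard error and approximate interval at x0.
x0 = 0.5
l0 = np.array([1.0, x0, x0**2, x0**3])
f_hat = l0 @ beta_hat
se = np.sqrt(l0 @ C_hat @ l0)
ci = (f_hat - 2 * se, f_hat + 2 * se)
print(f_hat, se, ci)
```

Repeating the last block over a grid of $x_0$ values gives the pointwise confidence band around the fitted curve.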

### Logistic Polynomial Regression

$$ Pr(y_i>250|x_i) = \frac{\exp\{\beta_0 + \beta_1 x_i + \beta_2 x_i^2 +\dots + \beta_d x_i^d\}} {1+\exp\{\beta_0 + \beta_1 x_i + \beta_2 x_i^2 +\dots + \beta_d x_i^d\}} $$

Estimation works as in standard Logistic Regression (maximum likelihood), with the polynomial terms as predictors.
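
As an illustration, the likelihood can be maximized with a few Newton-Raphson steps. This is a sketch under assumptions: the binary response is simulated rather than derived from a threshold such as $y_i > 250$, the degree is $d = 2$, and Newton-Raphson is just one common way to fit a logistic regression, not necessarily the one used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated binary data (assumption): quadratic log-odds in x.
n, d = 500, 2
x = rng.uniform(-2, 2, size=n)
true_beta = np.array([-0.5, 1.0, 0.5])
X = np.vander(x, N=d + 1, increasing=True)
p_true = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = rng.binomial(1, p_true)

# Newton-Raphson iterations on the logistic log-likelihood.
beta = np.zeros(d + 1)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (y - p)                  # score vector
    W = p * (1.0 - p)                     # Bernoulli variances
    hess = X.T @ (X * W[:, None])         # negative Hessian
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break

# Fitted probabilities Pr(y = 1 | x).
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(beta)
```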

## Step Functions

**Idea:** 
break the range of $X$ into different bins $\Rightarrow$ fit a different constant in each bin

**Procedure:**
 1. create cutpoints $c_1,c_2,\dots,c_k$ in the range of $X$
 
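 2. create $k+1$ Dummy Variables $$C_0(X)=I(X<c_1),\quad C_j(X)=I(c_j\leq X < c_{j+1})\ \text{for } j=1,\dots,k-1,\quad C_k(X)=I(c_k\leq X)$$ with $C_j(x) \in \{0,1\}\quad \forall j$ and $\sum_{j=0}^k C_j(x)=1 \quad \forall x$
 
 3. fit a linear model using least squares ($C_0$ is left out because it is collinear with the intercept, so $\beta_0$ is the mean response for $X<c_1$) $$y_i=\beta_0+\beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_k C_k(x_i) + \epsilon_i$$ 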
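
The procedure can be sketched with NumPy as follows. The data, the two cutpoints and the noise level are assumptions made for illustration; the lowest bin ($x$ below the first cutpoint) serves as the baseline captured by the intercept.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (assumption): a piecewise-constant signal with noise.
n = 300
x = rng.uniform(0, 10, size=n)
signal = np.where(x < 3, 1.0, np.where(x < 7, 4.0, 2.0))
y = signal + rng.normal(scale=0.5, size=n)

# Step 1: cutpoints c_1, c_2 (so k = 2).
cuts = np.array([3.0, 7.0])

# Step 2: bin index 0..k per observation, then one dummy column per bin.
bins = np.digitize(x, cuts)  # 0: x < c_1, 1: c_1 <= x < c_2, 2: x >= c_2
C = (bins[:, None] == np.arange(len(cuts) + 1)).astype(float)

# Step 3: least squares on intercept plus C_1..C_k (baseline bin dropped).
X = np.column_stack([np.ones(n), C[:, 1:]])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Here `beta_hat[0]` estimates the mean response in the baseline bin, and each further coefficient estimates the difference between its bin and the baseline.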
 
Again, step functions can also be used as predictors in a Logistic Regression.
 
### What is the problem with Step Functions?

Unless there are natural breakpoints in the predictor, piecewise-constant functions can miss the action (e.g. a steadily increasing trend within a bin is flattened to a single constant).

## Basis Functions

**Idea:** 
use a family of functions/transformations $b_1(X), b_2(X), \dots, b_k(X)$ of $X$, fixed and known in advance, such that

$$y_i=\beta_0+\beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_k b_k(x_i) + \epsilon_i$$ 

Note: Polynomial Regression and Step Functions are special cases of basis functions

 * $b_j(x_i)=x_i^j$
 
 * $b_j(x_i)=I(c_j \leq x_i <c_{j+1})$
 
Because the model is linear in the $b_j(x_i)$, we can apply all tools available for least squares (estimation, inference, etc.).
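
A small sketch of this idea, assuming NumPy: a hypothetical helper `design_matrix` (not from the notes) turns any list of basis functions into a design matrix, so the same least squares call handles both the polynomial and the step-function case. Data and cutpoints are illustrative.

```python
import numpy as np

def design_matrix(x, basis):
    """Stack a constant column and b_1(x), ..., b_k(x) side by side."""
    cols = [np.ones_like(x, dtype=float)]
    cols += [np.asarray(b(x), dtype=float) for b in basis]
    return np.column_stack(cols)

rng = np.random.default_rng(4)

# Simulated data (assumption): a sine signal with noise.
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# Two choices of basis: powers of x, and indicators of bins.
poly_basis = [lambda t: t, lambda t: t**2, lambda t: t**3]
step_basis = [lambda t: (3 <= t) & (t < 7), lambda t: t >= 7]

for name, basis in [("polynomial", poly_basis), ("step", step_basis)]:
    B = design_matrix(x, basis)
    beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
    rss = np.sum((y - B @ beta_hat) ** 2)
    print(name, B.shape, round(rss, 2))
```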

## Regression Splines

**Idea:**
piecewise polynomial regression: fit separate low-degree polynomials on different regions of $X$ (like a combination of polynomial regression and step functions)

$$
y_i= 
\begin{cases}
\beta_{01}+\beta_{11} x_i + \beta_{21} x_i^2 + \dots + \beta_{d1} x_i^d + \epsilon_i \quad , x_i<c \\
\beta_{02}+\beta_{12} x_i + \beta_{22} x_i^2 + \dots + \beta_{d2} x_i^d + \epsilon_i \quad , x_i\geq c 
\end{cases}
$$

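The thresholds $c$ are called knots; more knots $\Rightarrow$ more flexible piecewise polynomial. A regression spline additionally constrains the pieces to join smoothly at the knots (the function and its first $d-1$ derivatives are continuous).

**Natural Spline:**
regression spline with additional boundary constraints (function is required to be linear at the boundary)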
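
One standard way to fit a cubic regression spline by least squares is the truncated power basis: the columns $1, x, x^2, x^3$ plus one term $(x - c)_+^3$ per knot $c$. The notes do not name a particular basis, so this choice is an assumption, as are the simulated data and the three knot locations. The sketch does not impose the natural-spline boundary constraints.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns 1, x, x^2, x^3 and (x - knot)_+^3 for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)

# Simulated data (assumption): a sine signal with noise.
x = np.sort(rng.uniform(0, 10, size=300))
y = np.sin(x) + rng.normal(scale=0.3, size=300)

# Three knots placed uniformly over the range of x.
knots = [2.5, 5.0, 7.5]
B = cubic_spline_basis(x, knots)

# Least squares fit: 4 polynomial coefficients plus one per knot.
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta_hat
print(B.shape, np.mean((y - y_hat) ** 2))
```

Each truncated term is zero to the left of its knot and has continuous first and second derivatives there, which is what makes the fitted pieces join smoothly.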

### Where to place knots?

 - more knots $\Rightarrow$ more flexible function
 
 - at points where the function needs to be flexible 
 
 - *often place knots uniformly*
 
### How many knots should we use?

**apply CV:**
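 1. remove a portion of the data (e.g. 10%)
 
 2. fit a spline with a certain number of knots to the remaining data
 
 3. make predictions for the held-out portion

 4. repeat until each observation has been held out once
 
 5. calculate the overall cross-validated RSS
 
 6. repeat with different numbers of knots and pick the number with the smallest RSS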
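
The steps above can be sketched as a 10-fold cross-validation, assuming NumPy, simulated data, uniformly spaced interior knots and a cubic spline built from a truncated power basis (the basis, the fold count and the helper names `cubic_spline_basis` and `cv_rss` are illustrative assumptions).

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns 1, x, x^2, x^3 and (x - knot)_+^3 for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def cv_rss(x, y, n_knots, n_folds=10, seed=0):
    """Total held-out RSS for a cubic spline with n_knots uniform knots."""
    # Interior knots spaced uniformly over the observed range of x.
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    idx = np.random.default_rng(seed).permutation(len(x))
    rss = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)           # steps 1-2: hold out, fit
        B_train = cubic_spline_basis(x[train], knots)
        beta_hat, *_ = np.linalg.lstsq(B_train, y[train], rcond=None)
        pred = cubic_spline_basis(x[fold], knots) @ beta_hat  # step 3
        rss += np.sum((y[fold] - pred) ** 2)      # steps 4-5: accumulate
    return rss

rng = np.random.default_rng(6)

# Simulated data (assumption): a sine signal with noise.
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)

# Step 6: compare different numbers of knots.
scores = {k: cv_rss(x, y, k) for k in range(1, 9)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Using the same `seed` for every candidate keeps the folds identical across knot counts, so the RSS values are directly comparable.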
 
### Comparing Polynomials and Splines

Regression Splines often give superior results to Polynomial Regression.

 - polynomials need high degree to be flexible
 
 - splines introduce flexibility by increasing knots and keep degree fixed
 
**generally Regression Splines produce more stable estimates!**

## Exercises

Consider the following data generating process, in which we have $n$ observations and one covariate $X$ in addition to a constant, with $X \sim N(0, \sigma^2)$. The coefficients are $\beta = (1, 1.5, -1.5, 1.5)$ and the errors are drawn from a normal distribution, $\epsilon \sim N(0, 1)$. The response is generated by $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i$.

a) Calculate the analytical standard errors for the polynomial specification above (as presented in the lecture) and use these to calculate the approximate confidence intervals $\hat{f}(x) \pm 2 \cdot SE$ for each value of $X$ you consider.

b) Calculate the naive bootstrap confidence intervals for B bootstrap draws from the original data, such that the nominal coverage for the two methods is the same.

c) Calculate the coverage probability at four different values of X, chosen by you.

d) Calculate the interval length at four different values of X.
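
As a possible starting point (not a full solution), the two interval types can be set up as below. Everything beyond the stated data generating process is an assumption: $n = 200$, $B = 500$, $\sigma = 1$ for $X$, the four evaluation points, and reading "naive bootstrap" as resampling $(x_i, y_i)$ pairs with percentile intervals. Since $\pm 2 \cdot SE$ covers about 95.45% under normality, the percentiles 2.275% and 97.725% are used to match the nominal coverage.

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 200, 500                      # sample size and bootstrap draws (assumed)
beta = np.array([1.0, 1.5, -1.5, 1.5])
sigma_x = 1.0                        # assumption: sigma is not fixed in the exercise

def design(x):
    """Cubic design matrix with columns 1, x, x^2, x^3."""
    return np.vander(np.asarray(x, dtype=float), N=4, increasing=True)

# One draw from the data generating process.
x = rng.normal(scale=sigma_x, size=n)
X = design(x)
y = X @ beta + rng.normal(size=n)

x_grid = np.array([-1.0, -0.5, 0.5, 1.0])   # four evaluation points (arbitrary)
L = design(x_grid)

# Analytical interval: f_hat +/- 2 * SE with C_hat = sigma^2_hat (X'X)^{-1}.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
C_hat = (resid @ resid / (n - 4)) * np.linalg.inv(X.T @ X)
se = np.sqrt(np.einsum("ij,jk,ik->i", L, C_hat, L))   # l0' C l0 per grid point
f_hat = L @ beta_hat
ci_analytic = np.column_stack([f_hat - 2 * se, f_hat + 2 * se])

# Naive bootstrap: resample pairs with replacement, refit, predict on the grid.
boot = np.empty((B, len(x_grid)))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    boot[b] = L @ beta_b
ci_boot = np.quantile(boot, [0.02275, 0.97725], axis=0).T

# Interval lengths at the four grid points.
length_analytic = ci_analytic[:, 1] - ci_analytic[:, 0]
length_boot = ci_boot[:, 1] - ci_boot[:, 0]
print(length_analytic, length_boot)
```

Coverage probabilities require repeating the whole block over many simulated data sets and counting how often each interval contains the true $f(x)$ at each grid point.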

**Simulation Study**

Evaluate the two types of confidence intervals above along two dimensions: interval length and coverage probability.

a) Calculate both for a small simulation study of 100 repetitions.

b) How could you change the data-generating process to give a competitive advantage to the bootstrap?
Suggest two changes and check each in a simulation study.