<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Module5_Tutorial_NonLinearModeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Non-linear Regression Techniques

Daniel Bauer, 2022

In this tutorial, we will get acquainted with non-linear models.  In particular, we introduce a number of non-linear regression techniques -- including polynomial regression, regression splines, smoothing splines, and local regression -- in a simple example setting.  

Some of the techniques we introduce are not available in the packages that come with Colab, so we will install one additional library: scikit-lego, which provides local regression.

In [None]:
!pip install scikit-lego

With that, let's import the relevant libraries.

In [4]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import Pipeline

from sklego.linear_model import LowessRegression

import statsmodels.api as sm

Next, we clone our git repository to get access to the data.

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

## Non-linear Regression Models

### Primer on Non-linear Regression Techniques

Non-linear models expand on the more foundational models that we discussed before, particularly linear regression.  We consider three different though related categories of approaches, which are relatively straightforward in the context of a single feature, i.e. if there is only one $x$.

1. Use *transformations* of $x$: A traditional approach that falls in this category is *polynomial regression*, where we simply add powers of the feature $x$ -- i.e. $x^2$, $x^3$, etc. -- to the regression problem.  However, polynomials have limitations, particularly near the boundaries of the data, because they fit the regression function globally.  Instead, *spline regression* uses piecewise polynomials that are fit separately between *knots*, with restrictions imposed so that the overall fit is still continuous and smooth.  With so-called *natural splines*, a linear function is used in the boundary regions so as to avoid erratic behavior when extrapolating.  Depending on the number of knots, the function can mimic more or less arbitrary shapes.

2. Use an arbitrary function but *penalize* said function for variation: This yields a so-called *smoothing spline* (so the approach is related to 1.), which depends on a smoothing parameter that governs how much variation is allowed.  Similar to regularized methods, this parameter can be used to control *model complexity*.

3. *Local regression*: Instead of using all points for predicting at a given point $x_0$, we run a *local* regression where we put more weight on the data points that are close to $x_0$ and less weight on data points that are far away.  The weighting is typically done via a so-called *kernel* function (generalized bell curve) so this is also referred to as *kernel regression*.
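
To make the penalization in 2. concrete, the standard textbook form of the smoothing spline criterion can be written as follows. The symbols $g$, $\lambda$, and $n$ are notation introduced here for illustration and are not used elsewhere in this tutorial:

$$
\hat{g} = \arg\min_{g} \; \sum_{i=1}^{n} \bigl(y_i - g(x_i)\bigr)^2 + \lambda \int g''(t)^2 \, dt
$$

As $\lambda \to 0$ the fit is free to interpolate the data, while as $\lambda \to \infty$ any curvature becomes prohibitively expensive and the solution approaches the ordinary least-squares line.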


### Example in the context of height-weight relationship

We use a straightforward example: regressing people's weight on their height. The relevant dataset is a [well-known example dataset from Davis (1990)](https://rdrr.io/cran/carData/man/Davis.html):

In [8]:
# Load the data and keep a copy sorted by height, which we use for drawing fitted lines
hwdata2 = pd.read_csv('ML_656/Davis.csv', index_col=0) 
hwdata = hwdata2.sort_values('height')

Let's take a look:

In [None]:
hwdata2.head()

And let's define the $x$ and $y$ vectors:

In [10]:
X = hwdata2[['height']]
y = hwdata2['weight']

Let's start with polynomial regression. We take advantage of so-called *model pipelines* within scikit-learn, where we "pipe" the polynomial features into our linear regression model:

In [None]:
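# PolynomialFeatures already adds a constant column, so the linear step is fit without an intercept
polynomial_regression = Pipeline([('poly', PolynomialFeatures(degree=4)),('linear', LinearRegression(fit_intercept=False))])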
polynomial_regression = polynomial_regression.fit(X, y)
polynomial_regression.named_steps['linear'].coef_
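
To see what the pipeline does under the hood, here is a minimal NumPy sketch of the same idea: expand $x$ into the columns $1, x, x^2, x^3, x^4$ and solve an ordinary least-squares problem without a separate intercept. The data are synthetic and noise-free (not the Davis dataset), and the names `true_coef` and `design` are made up for this illustration.

```python
import numpy as np

# Synthetic, noise-free stand-in for a single feature (not the Davis data).
x = np.linspace(-1.0, 1.0, 50)
true_coef = np.array([1.0, -2.0, 0.5, 3.0, -1.5])  # order: 1, x, x^2, x^3, x^4

# Design matrix with columns 1, x, x^2, x^3, x^4: the polynomial feature expansion.
design = np.vander(x, N=5, increasing=True)
y = design @ true_coef

# Ordinary least squares on the expanded features; the column of ones plays
# the role of the intercept, so no separate intercept is estimated.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 6))
```

Because the synthetic response is an exact quartic, `coef` recovers `true_coef` up to floating-point error. With raw heights in centimeters the higher powers become very large, so the individual coefficients of the fitted pipeline are hard to interpret on their own.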

Let's visualize via a scatter plot with polynomial regression line:

In [None]:
height_grid = np.arange(hwdata.height.min(), hwdata.height.max()).reshape(-1,1)
pred_poly = polynomial_regression.predict(hwdata[['height']])
plt.figure(figsize=(15,6))
plt.scatter(hwdata.height, hwdata.weight, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(hwdata.height, pred_poly, color='g', label='polynomial regression degree=4')
plt.legend()  # without this call the label is never displayed

It seems like we may be overfitting a bit...
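
One way to check this suspicion is cross-validation over the polynomial degree. The following is a NumPy-only sketch on synthetic data (a noisy straight line, not the Davis dataset). The helper `cv_mse_by_degree` is hypothetical, and the `KFold` and `mean_squared_error` utilities imported above could be used to the same effect.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def cv_mse_by_degree(x, y, degrees, n_folds=5, seed=0):
    """Return {degree: mean out-of-fold MSE} for polynomial least-squares fits."""
    rng = np.random.default_rng(seed)
    # Shuffle the observation indices once and split them into folds.
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = {}
    for degree in degrees:
        fold_mse = []
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Fit on the training folds, evaluate on the held-out fold.
            coef = P.polyfit(x[train_idx], y[train_idx], degree)
            pred = P.polyval(x[test_idx], coef)
            fold_mse.append(np.mean((y[test_idx] - pred) ** 2))
        scores[degree] = float(np.mean(fold_mse))
    return scores


# Synthetic example: the true relationship is linear with noise variance 0.09.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 120)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=120)

scores = cv_mse_by_degree(x, y, degrees=range(1, 9))
best_degree = min(scores, key=scores.get)
print(best_degree, scores)
```

If the higher degrees do not lower the out-of-fold error relative to the simpler fits, the extra flexibility is mostly fitting noise.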

Let's look at regression splines, again using a model pipeline (we get splines via 'SplineTransformer'):

In [None]:
# B-spline basis functions sum to one, so they already include the constant term
spline_regression = Pipeline([('splines', SplineTransformer(degree=2, n_knots=3, extrapolation='linear')),('linear', LinearRegression(fit_intercept=False))])
spline_regression = spline_regression.fit(X, y)
spline_regression.named_steps['linear'].coef_
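
For intuition about how a piecewise polynomial can be fit by plain linear regression, here is a small sketch using the truncated power basis: the columns $1, x, x^2$ plus one hinge term $(x - \xi)_+^2$ per knot $\xi$. Note that this is an assumption-laden illustration rather than what `SplineTransformer` does internally, since that class builds B-splines, a different basis. The helper `truncated_power_basis` and the synthetic data are made up for this example.

```python
import numpy as np


def truncated_power_basis(x, knots, degree=2):
    """Columns 1, x, ..., x^degree plus one hinge term (x - knot)_+^degree per knot."""
    x = np.asarray(x, dtype=float)
    powers = [x ** p for p in range(degree + 1)]
    # Each hinge term is zero to the left of its knot, so the fitted curve
    # changes curvature at the knot while staying continuous and smooth.
    hinges = [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(powers + hinges)


# Synthetic piecewise-quadratic target with a single knot at x = 1.
x = np.linspace(0.0, 2.0, 81)
y = 1.0 + 0.5 * x - x ** 2 + 3.0 * np.clip(x - 1.0, 0.0, None) ** 2

basis = truncated_power_basis(x, knots=[1.0], degree=2)
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print(np.round(coef, 6))
```

With degree 2 and one interior knot this basis has four columns, the same number of basis functions that a quadratic spline with three knots (two boundary, one interior) has.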

Again, let's visualize via a scatter plot with a spline regression line:

In [None]:
height_grid = np.arange(hwdata.height.min(), hwdata.height.max()).reshape(-1,1)
pred_splines = spline_regression.predict(hwdata[['height']])
plt.figure(figsize=(15,6))
plt.scatter(hwdata.height, hwdata.weight, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(hwdata.height, pred_splines, color='g', label='Spline regression')
plt.legend()  # without this call the label is never displayed

Looks neat.

Finally, let's run a local regression, where we rely on 'LowessRegression' from scikit-lego:

In [15]:
# sigma sets the width of the weighting kernel: larger values give a smoother fit
local_regression = LowessRegression(sigma=50).fit(X,y)
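
To illustrate the idea behind local regression, the sketch below predicts at a single point $x_0$ by fitting a straight line with Gaussian kernel weights. This is a generic illustration under my own assumptions: the helper `local_linear_predict` and its `bandwidth` parameter are hypothetical, and the weighting formula is not guaranteed to match the one `LowessRegression` uses for `sigma`.

```python
import numpy as np


def local_linear_predict(x_train, y_train, x0, bandwidth):
    """Predict at x0 with a kernel-weighted straight-line fit (hypothetical helper)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Gaussian kernel weights: points close to x0 count more than distant ones.
    weights = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    # Center the feature at x0 so that the intercept is the fitted value at x0.
    design = np.column_stack([np.ones_like(x_train), x_train - x0])
    # Weighted least squares, done by scaling rows with the square root of the weights.
    root_w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y_train * root_w, rcond=None)
    return beta[0]


# Synthetic non-linear example (not the Davis data).
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

narrow = local_linear_predict(x, y, x0=1.0, bandwidth=0.2)    # follows the curve locally
wide = local_linear_predict(x, y, x0=1.0, bandwidth=100.0)    # close to one global line
print(narrow, wide, np.sin(1.0))
```

A narrow kernel tracks the local shape of the data, whereas a very wide kernel weights all points almost equally and therefore behaves much like ordinary linear regression.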

Again, let's visualize via a scatter plot with a regression line:

In [None]:
height_grid = np.arange(hwdata.height.min(), hwdata.height.max()).reshape(-1,1)
pred_local = local_regression.predict(hwdata[['height']])
plt.figure(figsize=(15,6))
plt.scatter(hwdata.height, hwdata.weight, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(hwdata.height, pred_local, color='g', label='local regression')
plt.legend()  # without this call the label is never displayed

Looks neat.

So the moral of the story is that these techniques present very neat tools for modeling non-linear relationships. 

Let's summarize by plotting all three fits together:

In [None]:
plt.figure(figsize=(15,6))
plt.scatter(hwdata.height, hwdata.weight, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(hwdata.height, pred_poly, color='r', label='Polynomial regression degree=4')
plt.plot(hwdata.height, pred_splines, color='b', label='Spline regression (degree 2, 3 knots)')
plt.plot(hwdata.height, pred_local, color='y', label='Local regression')
plt.legend()  # show which line belongs to which model