# Regression: Polynomial 

Introduction to Machine Learning, BCAM & UPV/EHU course, by Carlos Cernuda, Ekhine Irurozki and Aritz Perez.


## References 

* James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
* Data sets: http://www-bcf.usc.edu/~gareth/ISL/data.html
* scikit-learn library examples: http://scikit-learn.org
* Reference Jupyter notebooks:
    - R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)
    http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html
    - General Assembly's Data Science course in Washington, DC
    https://github.com/justmarkham/DAT4
    - An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) adapted to Python code
    https://github.com/JWarmenhoven/ISLR-python

## Python Libraries

In [None]:
##########################################################
import numpy as np #scientific computing (n-dim arrays, etc)
import pandas as pd #data analysis library
##########################################################
# Plots:
import matplotlib.pyplot as plt 
import matplotlib.pylab as pylab
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns #visualization library based on matplotlib
%matplotlib inline
plt.style.use(['seaborn-white'])  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
params = {'legend.fontsize': 'xx-large',
              'figure.figsize': (15, 5),
              'axes.labelsize': 'xx-large',
              'axes.titlesize':'xx-large',
              'xtick.labelsize':'xx-large',
              'ytick.labelsize':'xx-large'}    
pylab.rcParams.update(params)  #fix the parameters for the plots

pd.set_option('display.notebook_repr_html', False)
##########################################################
# SKLEARN: scikit-learn machine learning tools
from sklearn.preprocessing import scale ###Standardize a dataset 
import sklearn.linear_model as skl_lm ###linear model regression
##########################################################
# STATSMODELS: provides classes and functions for the estimation of many different statistical models, 
# as well as for conducting statistical tests, and statistical data exploration.
import statsmodels.api as sm ##
import statsmodels.formula.api as smf ##
##########################################################
np.random.seed(0)

## Multiple Regression: additive without linear assumptions

## Data set: Auto

Auto data set: 
Miles per gallon (mpg) of a car in relation to cylinders, weight, horsepower, etc., for 392 different car models. 


In [None]:
auto = pd.read_csv('dataset/Auto.csv', na_values='?').dropna()
auto.describe()

In [None]:
auto.head(3) # show the first 3 rows
# mpg: miles per gallon (response variable)
# cylinders, displacement, horsepower, weight, acceleration (predictor variables)

## Regression

Predict mpg using horsepower as the only predictor variable. 

The linear model is: 

$$
mpg \sim \beta_0 +\beta_1 horsepower
$$

We can also try a non-linear model (a polynomial model of degree two):

$$
mpg \sim \beta_0 +\beta_1 horsepower +\beta_2 horsepower^2
$$

We are going to calculate the estimated coefficients as before and plot the resulting regression curve against the predictor variable horsepower.
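
The coefficient estimates for both models can be sketched with ordinary least squares on a design matrix whose columns are the powers of the predictor. The snippet below is only an illustration: it uses synthetic data as a stand-in for the Auto data set, and the generating coefficients and the helper name `fit_polynomial` are assumptions made for this example, not values or functions from the course material.

```python
import numpy as np

# Synthetic stand-in for the Auto data (hypothetical values, NOT the real data set)
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 220, size=200)
mpg = 57.0 - 0.47 * horsepower + 0.0012 * horsepower**2 + rng.normal(0, 1.0, size=200)

def fit_polynomial(x, y, degree):
    """Return least-squares coefficients [beta_0, beta_1, ..., beta_degree]."""
    # Design matrix with columns 1, x, x^2, ..., x^degree
    A = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

beta_linear = fit_polynomial(horsepower, mpg, degree=1)     # [beta_0, beta_1]
beta_quadratic = fit_polynomial(horsepower, mpg, degree=2)  # [beta_0, beta_1, beta_2]

print("linear:   ", beta_linear)
print("quadratic:", beta_quadratic)
```

The quadratic model is still linear in the coefficients, which is why the same least-squares routine handles both cases; only the design matrix gains a column.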

In [None]:
# Linear regression
fig = plt.figure(figsize=(5,5))
# x and y are passed as keywords (required by recent seaborn versions)
sns.regplot(x=auto.horsepower, y=auto.mpg, order=1, ci=None, scatter_kws={'color':'b'}, color='k')
plt.title('Linear regression of mpg using horsepower value')
plt.xlim(40,240)
plt.ylim(5,55);

In [None]:
# Polynomial regression of degree (order) 2
fig = plt.figure(figsize=(5,5))
sns.regplot(x=auto.horsepower, y=auto.mpg, order=2, ci=None, scatter_kws={'color':'b'}, color='k')
plt.title('Polynomial regression (quadratic) of mpg using horsepower value')
plt.xlim(40,240)
plt.ylim(5,55);

In [None]:
# Polynomial regression of degree 5
fig = plt.figure(figsize=(5,5))
sns.regplot(x=auto.horsepower, y=auto.mpg, order=5, ci=None, scatter_kws={'color':'b'}, color='k')
plt.title('Polynomial regression (degree 5) of mpg using horsepower value')
plt.xlim(40,240)
plt.ylim(5,55);

## Identifying non-linearity using residual plots

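Residual plots are a useful graphical tool for identifying non-linear relationships between the predictors and the response. Given a fitted regression model, we plot the residuals against the fitted values; a visible pattern suggests that the model is missing some structure in the data. We compare two fits: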

* The linear model:

$$
\hat{Y} = \hat{\beta}_0 +\hat{\beta}_1 horsepower
$$

* The polynomial model of degree two (quadratic):

$$
\hat{Y} = \hat{\beta}_0 +\hat{\beta}_1 horsepower +\hat{\beta}_2 horsepower^2
$$

* The residuals are obtained as

$$
e_i = y_i - \hat{y}_i
$$

at each observed point $x_i$.
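
As a small numerical illustration of the residual formula, the sketch below fits both models with `np.polyfit` and computes $e_i = y_i - \hat{y}_i$. The data are synthetic (a hypothetical stand-in for the Auto data), and the helper name `residuals` is introduced only for this example.

```python
import numpy as np

# Synthetic stand-in data (hypothetical, NOT the Auto data set)
rng = np.random.default_rng(1)
x = rng.uniform(50, 220, size=100)
y = 57.0 - 0.47 * x + 0.0012 * x**2 + rng.normal(0, 1.0, size=100)

def residuals(x, y, degree):
    """Fit a polynomial of the given degree and return e_i = y_i - y_hat_i."""
    coefs = np.polyfit(x, y, degree)   # coefficients, highest power first
    y_hat = np.polyval(coefs, x)       # fitted values at the observed points
    return y - y_hat

e_linear = residuals(x, y, degree=1)
e_quadratic = residuals(x, y, degree=2)

# With an intercept in the model, least-squares residuals average to (numerically) zero
print("mean residual (linear):   ", e_linear.mean())
print("mean residual (quadratic):", e_quadratic.mean())
print("RSS linear vs quadratic:  ", (e_linear**2).sum(), (e_quadratic**2).sum())
```

Because the residuals of either fit average to zero, the mean alone cannot reveal non-linearity; it is the shape of the residuals across the fitted values that matters, which is what the residual plot shows.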

In [None]:
# Include a new predictor variable
auto['horsepower2'] = auto.horsepower**2
auto.head(3) # show the first 3 rows, now including horsepower2

In [None]:
# Data
X = auto.horsepower.values.reshape(-1,1)
y = auto.mpg

# Create the model (the same estimator object is refitted for each set of predictors)
regr_model = skl_lm.LinearRegression()

# Linear fit
regr_model.fit(X, y)

auto['pred1'] = regr_model.predict(X)
auto['resid1'] = auto.mpg - auto.pred1

# Quadratic fit
X2 = auto[['horsepower', 'horsepower2']].values  # .as_matrix() was removed in pandas 1.0
regr_model.fit(X2, y)

auto['pred2'] = regr_model.predict(X2)
auto['resid2'] = auto.mpg - auto.pred2

auto.head(3) #show 

In [None]:
# Plot of the residuals
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot (Linear regression)

# Plot x-axis predicted response and y-axis residual values for each data point

# Smooth fitting
# sns.regplot with lowess: use statsmodels to estimate a nonparametric lowess model 
# (locally weighted linear regression)
sns.regplot(x=auto.pred1, y=auto.resid1, lowess=True, 
            ax=ax1, line_kws={'color':'r', 'lw':1})
ax1.hlines(0,
           xmin=ax1.xaxis.get_data_interval()[0],
           xmax=ax1.xaxis.get_data_interval()[1], 
           linestyles='dotted')
ax1.set_title('Residual Plot for Linear Fit')

# Right plot (Quadratic regression)
sns.regplot(x=auto.pred2, y=auto.resid2, lowess=True,
            line_kws={'color':'r', 'lw':1}, ax=ax2)
ax2.hlines(0,
           xmin=ax2.xaxis.get_data_interval()[0],
           xmax=ax2.xaxis.get_data_interval()[1], 
           linestyles='dotted')
ax2.set_title('Residual Plot for Quadratic Fit')

for ax in fig.axes:
    ax.set_xlabel(r'$\hat{y}_i$')  # raw string avoids the invalid '\h' escape
    ax.set_ylabel('$e_i$')

The red line is a smooth (lowess) fit to the residuals, which makes any trend easier to see. 
* Left: a linear regression of mpg on horsepower. The strong pattern in the residuals indicates non-linearity in the data. 
* Right: a linear regression of mpg on horsepower and horsepower$^2$. There is no clear pattern in the residuals. 
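
The visual reading of the two panels can also be checked numerically. The sketch below is a crude substitute for the lowess curve, not the method used in the plots: it sorts the observations by fitted value, splits them into a few equal-sized bins, and averages the residuals in each bin. The data are synthetic (hypothetical, not the Auto data), and the helper names are assumptions made for this example.

```python
import numpy as np

# Synthetic stand-in data with a curved relationship (hypothetical, NOT the Auto data set)
rng = np.random.default_rng(2)
x = rng.uniform(50, 220, size=300)
y = 57.0 - 0.47 * x + 0.0012 * x**2 + rng.normal(0, 1.0, size=300)

def fitted_and_residuals(x, y, degree):
    """Fit a polynomial of the given degree; return fitted values and residuals."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    return fitted, y - fitted

def binned_residual_means(fitted, resid, n_bins=5):
    """Mean residual within equal-sized bins of observations ordered by fitted value."""
    order = np.argsort(fitted)
    groups = np.array_split(resid[order], n_bins)
    return np.array([g.mean() for g in groups])

means_linear = binned_residual_means(*fitted_and_residuals(x, y, degree=1))
means_quadratic = binned_residual_means(*fitted_and_residuals(x, y, degree=2))

print("linear fit, bin means:   ", np.round(means_linear, 2))
print("quadratic fit, bin means:", np.round(means_quadratic, 2))
```

On data generated this way, the bin means for the linear fit trace a U shape (positive at both ends, negative in the middle), while those for the quadratic fit stay close to zero, mirroring the difference between the left and right panels.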