# Tutorial: Simple Polynomial Regression

In this tutorial, we will train a simple polynomial regression model to fit a single-variable dataset and select an appropriate polynomial degree for the fit.

Portions of this tutorial are adapted from the Python Data Science Handbook by Jake VanderPlas.

## Importing the libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set() 
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

## Importing the dataset

In [None]:
df = pd.read_csv('poly_example.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
X = df[["Feature 1"]] 
y = df["Target"]

## Visualizing the dataset

In [None]:
plt.scatter(X, y)

The data is clearly not linear, so a polynomial fit is a reasonable candidate.

## Splitting the dataset into the Training set and Test set

We will use a 70-30 train-test split for this case.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

## Polynomial Features

`PolynomialFeatures()` expands the input into polynomial features up to the specified degree.

In [None]:
# Task: change the degree and see how the transformed output changes
p5 = PolynomialFeatures(degree=5)
X_5 = p5.fit_transform(X_train)

In [None]:
X_train.head()

In [None]:
#put in data frame to make it easily readable
pd.DataFrame(X_5).head()

## Visualizing polynomial fits of multiple degrees

The key question here is: what degree of polynomial is appropriate? Visualization may be a good tool for answering that.

In [None]:
# Create a range of x values for plotting
X_plot = np.linspace(-0.1, 5, 500)[:, None]

#plot the data
plt.scatter(X_train["Feature 1"], y_train, color='black')

#plotting polynomial
axis = plt.axis()
for deg in range(1, 6):  # try degrees 1 to 5
    #changing polynomial to the degree we want
    pf = PolynomialFeatures(degree=deg)
    X_p = pf.fit_transform(X_train)
    #fitting it to regression
    lr = LinearRegression()
    lr.fit(X_p, y_train)
    #plotting functions 
    y_plot = lr.predict(pf.fit_transform(X_plot))
    plt.plot(X_plot.ravel(), y_plot, label='degree={0}'.format(deg))
plt.xlim(-0.1, 1.0);
plt.ylim(-18, 18);
plt.legend(loc='best');

## Visualize the validation curve

Scikit-learn provides a tool that computes training and validation scores as a parameter varies.

We can use this to see what happens when we vary the degree of the polynomial.

**The code in this section is taken from Python Data Science Handbook**

In [None]:
#making a pipeline to chain the polynomial feature transformation and linear regression
from sklearn.pipeline import make_pipeline

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [None]:
from sklearn.model_selection import validation_curve
degree = np.arange(0, 21)
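# param_name and param_range must be passed as keyword arguments in current scikit-learn
train_score, val_score = validation_curve(PolynomialRegression(), X_train, y_train,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=8)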

plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');

We see that a polynomial of around degree 3 is probably the most appropriate.
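
Reading the degree off the plot can also be done programmatically. The sketch below is one possible way to do it, not the handbook's method: it takes the degree with the highest median validation score. It runs on a synthetic noisy cubic (`X_demo`, `y_demo` are made-up stand-ins, not the tutorial data), so the degree it reports says nothing about `poly_example.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the tutorial data (assumption): a noisy cubic
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1, 1, size=(200, 1))
x = X_demo.ravel()
y_demo = 1 - 2 * x + 5 * x ** 3 + rng.normal(scale=0.1, size=200)

degrees = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X_demo, y_demo,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
)

# One row per degree, one column per CV fold; summarize folds with the median
median_val = np.median(val_scores, axis=1)
best_degree = degrees[np.argmax(median_val)]
print("best degree:", best_degree)
print("median validation R^2 per degree:", median_val.round(3))
```

Because validation scores for neighbouring degrees are often nearly identical, the argmax can land on a higher degree than necessary; preferring the smallest degree whose score is close to the maximum is a common, more conservative choice.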

## Predicting the test data

In [None]:
#fit the model to polynomial of degree 3
pf = PolynomialFeatures(3)
X_p = pf.fit_transform(X_train)
lr = LinearRegression()
lr.fit(X_p, y_train)

In [None]:
#predicting the test result (transform only: the transformer was already fitted on the training data)
y_pred = lr.predict(pf.transform(X_test))

## Evaluate the performance

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred) #not bad

In [None]:
#studying the fit of the training data
y_pred_train = lr.predict(X_p)
r2_score(y_train, y_pred_train) 

The model seems to generalize reasonably well.
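
The transform, fit, predict, and score steps can also be bundled into a single pipeline, which removes the risk of accidentally refitting the feature transformer on test data. The following is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made-up stand-ins for `poly_example.csv`), so the printed scores do not correspond to the tutorial dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): one period of a noisy sine wave
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 1, size=(150, 1))
y_demo = np.sin(2 * np.pi * X_demo.ravel()) + rng.normal(scale=0.2, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the feature expansion and the regression on training data only
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_tr, y_tr)

# Similar training and test scores suggest the model is not overfitting
r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
print("train R^2:", r2_train)
print("test  R^2:", r2_test)
```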

## What if the polynomial degree is too high?

In [None]:
#fit the model to polynomial of degree 12
pf12 = PolynomialFeatures(12)
X_p12 = pf12.fit_transform(X_train)
lr12 = LinearRegression()
lr12.fit(X_p12, y_train)

In [None]:
#predict the test result (transform only: pf12 was already fitted on the training data)
y_pred12 = lr12.predict(pf12.transform(X_test))

In [None]:
#check the r2_score
from sklearn.metrics import r2_score

r2_score(y_test, y_pred12) #compare with the degree-3 test score

In [None]:
#studying the fit of the training data
y_pred_train12 = lr12.predict(X_p12)
r2_score(y_train, y_pred_train12) 

In [None]:
# Create a range of x values for plotting
X_plot12 = np.linspace(-0.1, 5, 500)[:, None]

#plot the data
plt.scatter(X_train["Feature 1"], y_train, color='black')

#plotting polynomial
axis = plt.axis()
y_plot12 = lr12.predict(pf12.fit_transform(X_plot12))
plt.plot(X_plot12.ravel(), y_plot12, label='degree={0}'.format(12))
plt.xlim(-0.1, 1.0);
plt.ylim(-18, 18);
plt.legend(loc='best');
plt.title("Polynomial of degree 12 (training data)");

We see that a curve of this shape will probably not generalize well to the test data.

In [None]:
#plot the data
plt.scatter(X_test["Feature 1"], y_test, color='black')

#plotting polynomial
axis = plt.axis()
y_plot12 = lr12.predict(pf12.fit_transform(X_plot12))
plt.plot(X_plot12.ravel(), y_plot12, label='degree={0}'.format(12))
plt.xlim(-0.1, 1.0);
plt.ylim(-18, 18);
plt.legend(loc='best');
plt.title("Polynomial of degree 12 (test data)");
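
The same comparison can be made numerically by putting training and test R² side by side for a low and a high degree. This is a hedged sketch on a small synthetic sample (`X_demo`, `y_demo` are made-up, not the tutorial data). On the training set the higher degree can only match or improve the fit, because the degree-3 model is a special case of the degree-12 model; a test score that drops well below the training score is the usual sign of overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic sample (assumption): a noisy cubic with only 40 points
rng = np.random.RandomState(2)
X_demo = rng.uniform(-1, 1, size=(40, 1))
x = X_demo.ravel()
y_demo = x ** 3 - x + rng.normal(scale=0.1, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Store (train R^2, test R^2) for each degree
scores = {}
for deg in (3, 12):
    model = make_pipeline(PolynomialFeatures(degree=deg), LinearRegression())
    model.fit(X_tr, y_tr)
    scores[deg] = (
        r2_score(y_tr, model.predict(X_tr)),
        r2_score(y_te, model.predict(X_te)),
    )
    print("degree", deg, "train R^2:", scores[deg][0], "test R^2:", scores[deg][1])
```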