## Polynomial Regression

First, import the libraries used for data manipulation, math, plotting, and modeling.

For polynomial regression, pick the order (degree) that gives both low training error and low test error. A model with low error on both sets fits the training data and also generalizes well to data it has not seen.

In [None]:
import numpy as np
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model, preprocessing
import numpy.polynomial.polynomial as poly
%matplotlib inline

## Import Data

Use the pandas module to read the file into a DataFrame, and print the DataFrame to check that the data was imported correctly. Using "dataframe.values" returns a NumPy representation of the DataFrame, which can then be sliced as needed. Here the first column is taken as the feature and the second column as the target. Reshape both so that the feature matrix and target vector have the same rank (two-dimensional, with one column each).

In [None]:
df = pd.read_csv('filename.csv')
print(df)
data = df.values
X = data[:,0]
Y = data[:,1]
X = X.reshape(X.shape[0],1)
Y = Y.reshape(Y.shape[0],1)
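
The slicing and reshaping step can be checked on a tiny example. The sketch below uses a small in-memory DataFrame in place of the CSV file; the column names `x` and `y`, the values, and the `_demo` variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV file: feature column first, target second.
df_demo = pd.DataFrame({"x": [0.0, 0.5, 1.0], "y": [1.0, 2.0, 4.0]})

data_demo = df_demo.values       # NumPy array of shape (3, 2)
X_demo = data_demo[:, 0]         # 1-D slice of shape (3,)
Y_demo = data_demo[:, 1]         # 1-D slice of shape (3,)

# reshape(-1, 1) turns each 1-D slice into a column vector of shape (3, 1);
# -1 lets NumPy infer the number of rows.
X_demo = X_demo.reshape(-1, 1)
Y_demo = Y_demo.reshape(-1, 1)

print(data_demo.shape, X_demo.shape, Y_demo.shape)
```

Using `reshape(-1, 1)` is equivalent to `reshape(X.shape[0], 1)` for a 1-D array and avoids repeating the row count.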

## Training Set and Test Set

Here the feature matrix and target vector are each partitioned into two sets: one for training and one for testing. The first 100 rows form the training set and the next 100 rows form the test set.

In [None]:
# this fixed split is probably not necessary if we do k-fold validation
# first 100 rows: training set (taken as 1-D arrays)
X_tr = X[0:100,0]
Y_tr = Y[0:100,0]
# next 100 rows: test set
X_test = X[100:200,0]
Y_test = Y[100:200,0]
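
The code comment above mentions k-fold validation as an alternative to a fixed split. A minimal sketch of how the fold indices could be generated with NumPy alone is shown here. The helper name `k_fold_indices`, the choice of 5 folds, the seed, and the assumption of 200 samples are illustrative and not part of the original notebook.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)     # random order of row indices
    folds = np.array_split(shuffled, k)       # k nearly equal index groups
    for i in range(k):
        val_idx = folds[i]                    # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Assumes 200 rows, matching the 0:200 range used in the split above.
splits = list(k_fold_indices(n_samples=200, k=5))
train_idx, val_idx = splits[0]
print(len(train_idx), len(val_idx))
```

With index arrays like these, `X[train_idx, 0]` and `X[val_idx, 0]` would select the rows for one fold, and the errors could be averaged over all folds.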

## Plotting the Data

Plot the training set to show the shape of the original data.

In [None]:
plt.scatter(X_tr,Y_tr)
plt.xlabel('Training set of X')
plt.ylabel('Training set of Y')
plt.show()

## Model Selection

Fit polynomial regression models of increasing order, starting from linear regression (degree = 1) and going up to degree = 10. Compute the training and test error for every order and plot these errors against the degree. Select the order that fits the data best, based on low training and test error. Use the poly.polyfit method from NumPy to find the coefficients of each model, and poly.polyval to evaluate the fitted polynomial at given points.
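
To make the roles of the two NumPy functions concrete: poly.polyfit estimates the coefficients, and poly.polyval evaluates a polynomial with those coefficients. The sketch below fits made-up noiseless data generated from y = 1 + 2x + 3x^2, so the recovered coefficients should be close to 1, 2, 3. The `_demo` names and the data are assumptions for illustration.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

# Made-up noiseless data from y = 1 + 2x + 3x^2
x_demo = np.linspace(-1, 1, 20)
y_demo = 1 + 2 * x_demo + 3 * x_demo**2

# polyfit returns coefficients ordered from lowest to highest degree
coeffs = poly.polyfit(x_demo, y_demo, 2)

# polyval evaluates the polynomial with those coefficients at the given points
y_hat = poly.polyval(x_demo, coeffs)

# Halved mean squared error: sum of squared residuals divided by 2 * n
error = np.sum(np.square(y_demo - y_hat)) / (2 * y_demo.shape[0])
print(np.round(coeffs, 6), error)
```

Note that numpy.polynomial.polynomial orders coefficients from the constant term upward, which is the reverse of the older np.polyfit.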

In [None]:
Errors_test = []
Errors_tr = []
for degree in range(1, 11):
    print("Degree: ", degree)
    # polyfit will return array of polynomial coefficients 
    polyCoefficients = poly.polyfit(X_tr, Y_tr, degree)
    print("Polynomial Coefficients: ", polyCoefficients)
    
    yHat_tr = poly.polyval(X_tr, polyCoefficients)
    TrainingError = np.sum(np.square(Y_tr - yHat_tr)) * 1/(2*Y_tr.shape[0])
    Errors_tr.append(TrainingError)
    print("Training Errors: ", Errors_tr)
    
    yHat_test  = poly.polyval(X_test, polyCoefficients)
    TestError = np.sum(np.square(Y_test - yHat_test)) * 1/(2*Y_test.shape[0])
    Errors_test.append(TestError)
    print("Test Errors: ", Errors_test)
    
    # creates evenly spaced numbers over a specified interval
    xAxisSpacing = np.linspace(-1, 1, 150)
    # evaluates the polynomial at all points across the xAxisSpacing
    yAxisHat = poly.polyval(xAxisSpacing, polyCoefficients)
    plt.xlim(-2,2)
    plt.ylim(-25,25)
    plt.plot(xAxisSpacing,yAxisHat, 'r-')
    plt.scatter(X_tr,Y_tr)
    plt.xlabel("Training Set of X")
    plt.ylabel("Training Set of Y")
    plt.show()

Now plot the training error and the test error against the degree. The plots make it easier to see which order gives low error on both sets.

In [None]:

degree = np.array(range(1,11))
# Plot the training error against the degree
plt.plot(degree,Errors_tr)
plt.xlabel("Degree Order")
plt.ylabel("Training Error")
plt.title("Training Error vs. Degree Order")
plt.grid()

In [None]:
degree = np.array(range(1,11))
# Plot the test error against the degree
plt.plot(degree,Errors_test)
plt.xlabel("Degree Order")
plt.ylabel("Test Error")
plt.title("Test Error vs. Degree Order")
plt.grid()
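
The selection step can also be done numerically instead of reading it off the plots. The sketch below is one simple reading of the rule above: it picks the degree with the smallest test error. The helper names `half_mse` and `select_degree`, the synthetic cubic data, the noise level, and the seed are all assumptions for illustration, not results from the notebook's dataset.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

def half_mse(y, y_hat):
    """Sum of squared residuals divided by 2 * n."""
    return np.sum(np.square(y - y_hat)) / (2 * y.shape[0])

def select_degree(x_tr, y_tr, x_test, y_test, degrees):
    """Fit one polynomial per degree; return the degree with the lowest test error."""
    tr_errors, test_errors = [], []
    for d in degrees:
        c = poly.polyfit(x_tr, y_tr, d)
        tr_errors.append(half_mse(y_tr, poly.polyval(x_tr, c)))
        test_errors.append(half_mse(y_test, poly.polyval(x_test, c)))
    best = degrees[int(np.argmin(test_errors))]
    return best, tr_errors, test_errors

# Synthetic data (assumption): a noisy cubic on [-1, 1], 100 train + 100 test points
rng = np.random.default_rng(0)
x_all = rng.uniform(-1, 1, 200)
y_all = 1 - 2 * x_all + 5 * x_all**3 + rng.normal(0, 0.5, 200)

degrees = list(range(1, 11))
best_degree, tr_errors, test_errors = select_degree(
    x_all[:100], y_all[:100], x_all[100:], y_all[100:], degrees
)
print("Selected degree:", best_degree)
```

Training error can only go down as the degree grows, so it is the test error that signals overfitting. Choosing the degree on the test set makes the reported test error optimistic; a separate validation set or k-fold validation is a common way around that.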