# Building two simple polynomial regression models

We fit polynomial regressions to two data sets and compare their R2 scores with those of linear regressions.

## Case 1: Profit prediction for an agricultural problem

In the following we would like to predict the profit from a harvest given the size of the field (its width and length).

In [150]:
#importing the necessary packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import PolynomialFeatures

In [185]:
#define data frame
df = pd.read_csv("fields.csv")

In [186]:
df.head()

Unnamed: 0,length,profit,width
0,807.0,634630.0,1032.0
1,299.0,124074.0,337.0
2,431.0,1338300.0,1631.0
3,744.0,327720.0,553.0
4,364.0,500244.0,827.0


We start off with a simple linear regression:

In [187]:
#define the variables
X = df[["width", "length"]].values
Y = df[["profit"]].values

#split the data set into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.25)

#train the model
model = LinearRegression()
model.fit(X_train, Y_train)

#report the R2 score
print(model.score(X_test, Y_test))

0.9265510548118597
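
The number printed above is the coefficient of determination, R2 = 1 - SS_res / SS_tot. The following is a minimal sketch of that formula in plain Python, not scikit-learn's own implementation; the function name `r2_score_manual` and the sample values are made up for illustration and are not taken from fields.csv.

```python
def r2_score_manual(y_true, y_pred):
    """Return R2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# illustrative values only
y_true = [1.0, 2.0, 3.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0])       # exact predictions
baseline = r2_score_manual(y_true, [2.0, 2.0, 2.0])      # always predict the mean
reversed_fit = r2_score_manual(y_true, [3.0, 2.0, 1.0])  # worse than the mean
print(perfect, baseline, reversed_fit)
```

A score of 1 is a perfect fit, 0 matches a constant prediction of the mean, and anything below 0 is worse than that constant prediction.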


In a next step we proceed with an attempt at polynomial fitting for the data:

In [190]:
#PolynomialFeatures?

pf = PolynomialFeatures(degree = 2, include_bias = False) #bias term not needed here
#fit must be called before transform (required by sklearn); it only records the number of input columns
pf.fit(X_train)

#generate new columns
X_train_transformed = pf.transform(X_train)
X_test_transformed = pf.transform(X_test)

#print the exponent of each input in every generated column (the terms created by the transform method)
#print(pf.powers_) 

model = LinearRegression()
model.fit(X_train_transformed, Y_train)

print(model.score(X_test_transformed, Y_test))

0.9878259248140406
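
To see which columns a degree-2 expansion of two inputs contains, the exponent patterns can be enumerated by hand. This is a sketch in plain Python; the helper `monomial_exponents` is hypothetical, and its ordering is only meant to mirror what `pf.powers_` reports, which is worth checking by uncommenting `print(pf.powers_)` in the cell above.

```python
from itertools import combinations_with_replacement

def monomial_exponents(n_features, degree):
    """List exponent tuples of all monomials of total degree 1..degree (no bias term)."""
    exponents = []
    for d in range(1, degree + 1):
        # each combination picks d input columns (repeats allowed) to multiply
        for combo in combinations_with_replacement(range(n_features), d):
            exponents.append(tuple(combo.count(j) for j in range(n_features)))
    return exponents

# two inputs (width, length), degree 2
powers = monomial_exponents(2, 2)
print(powers)  # [(1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
```

With the column order `width, length`, these five tuples stand for width, length, width^2, width*length and length^2.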


We may repeat the analysis without fixing random_state in train_test_split, so that every iteration draws a different split, and average the score over 1000 splits:

In [207]:
scores = []
intercepts = []
coefs = []

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    X_train_transformed = pf.transform(X_train)
    X_test_transformed = pf.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_transformed, Y_train)
    
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test_transformed, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)))


Average score: 0.986807601767222
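
The mean alone hides how much the score varies from split to split. Below is a minimal sketch using the standard-library `statistics` module; the helper `summarize_scores` and the sample values are hypothetical and are not results from this notebook.

```python
import statistics

def summarize_scores(scores):
    """Return mean, median, sample standard deviation and extremes of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

# made-up scores: four ordinary splits and one very bad one
example_scores = [0.98, 0.99, 0.97, 0.98, -5.0]
summary = summarize_scores(example_scores)
print(summary)
```

Passing the `scores` list from the loop instead of `example_scores` gives the same summary for the actual splits; a median far from the mean would point to a few extreme splits.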


In [208]:
#np.array(coefs).shape

Now we would like to restrict the fit to a subset of the generated columns:

In [210]:
scores = []
intercepts = []
coefs = []

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    
    #select columns by index and check if the score improves;
    #[0, 1, 2, 3, 4] keeps all five columns, remove an index to exclude that column
    X_train_transformed = pf.transform(X_train)[:, [0, 1, 2, 3, 4]]
    X_test_transformed = pf.transform(X_test)[:, [0, 1, 2, 3, 4]]

    model = LinearRegression()
    model.fit(X_train_transformed, Y_train)
    
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test_transformed, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)))


Average score: 0.9868476616439141
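
To actually leave a term out, the index list has to omit it. The sketch below works on plain lists so that it runs without the data set; the names in `TERMS` assume the column order `width, length` and the degree-2 layout without a bias column, and `select_columns` is a hypothetical helper.

```python
# assumed layout of the five degree-2 columns for inputs (width, length)
TERMS = ["width", "length", "width^2", "width*length", "length^2"]

def select_columns(rows, keep):
    """Keep only the columns whose indices are listed in keep."""
    return [[row[j] for j in keep] for row in rows]

# one transformed row for width=2, length=3 (illustrative numbers)
rows = [[2, 3, 4, 6, 9]]

keep_all = [0, 1, 2, 3, 4]  # same as no filtering
keep_some = [0, 1, 3]       # drop the two pure squares

print([TERMS[j] for j in keep_some])
print(select_columns(rows, keep_all))
print(select_columns(rows, keep_some))
```

In the notebook's loop the equivalent would be `pf.transform(X_train)[:, [0, 1, 3]]`, with the same index list applied to the test set.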


## Case 2: Diamond price prediction

In the following we would like to model the prices of diamonds via linear and polynomial regressions and compare the quality of the results via the R2 score.

In [223]:
#define data frame
df = pd.read_csv("diamonds.csv")
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


We start off with two simple linear regressions to get a feeling for the data:

In [226]:
# price over carat
#define the variables
X = df[["carat"]].values
Y = df[["price"]].values

#split the data set into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.25)

#train the model
model = LinearRegression()
model.fit(X_train, Y_train)

#report the R2 score
print(model.score(X_test, Y_test))

0.8506009410929625


In [227]:
# price over dimensions x,y,z
#define the variables
X = df[["x","y","z"]].values
Y = df[["price"]].values

#split the data set into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.25)

#train the model
model = LinearRegression()
model.fit(X_train, Y_train)

#report the R2 score
print(model.score(X_test, Y_test))

0.7834062179737853


Comment: Given these scores (about 0.85 for carat against about 0.78 for the dimensions), carat seems to be the better indicator of a diamond's price.

Now we try a polynomial regression:

In [251]:
# price over dimensions x,y,z
#define the variables
X = df[["x", "y", "z"]].values
Y = df[["price"]].values

pf = PolynomialFeatures(degree = 2, include_bias = False) #bias term not needed here
#fit must be called before transform; it only records the number of input columns,
#so we fit on X rather than on an X_train left over from an earlier cell
pf.fit(X)

scores = []
intercepts = []
coefs = []

for i in range(0,200):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    
    #expand both sets with the degree-2 terms (no columns are excluded here)
    X_train_transformed = pf.transform(X_train)
    X_test_transformed = pf.transform(X_test)
    model = LinearRegression()
    model.fit(X_train_transformed, Y_train)
    
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test_transformed, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)))

Average score: -23.75882427600454


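This result indicates that, in the current setting, the linear regression on the dimensions outperforms the polynomial one: a negative average R2 means that the polynomial model predicts the test prices worse, on average, than a constant prediction of their mean. The linear and polynomial regressions on carat in turn outperform those on the dimensions.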