# Polynomial Regression Models

In this notebook, we explore polynomial regression models on the same flat resale price data set (`flat-prices.csv`).

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt 

## Repeat the data import process

In [2]:
data = pd.read_csv('flat-prices.csv')
data.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price
0,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,9000
1,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,04 TO 06,31.0,IMPROVED,1977,6000
2,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,8000
3,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,07 TO 09,31.0,IMPROVED,1977,6000
4,1990-01,ANG MO KIO,3 ROOM,216,ANG MO KIO AVE 1,04 TO 06,73.0,NEW GENERATION,1976,47200


In [3]:
label = LabelEncoder()

data['town'] = label.fit_transform(data['town'])
data['flat_type'] = label.fit_transform(data['flat_type'])
data['storey_range'] = label.fit_transform(data['storey_range'])
data['flat_model'] = label.fit_transform(data['flat_model'])

selected_features = data[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date']]

X = selected_features.values 

y = data['resale_price']

## Summary of Encoding

Town:
0 - ANG MO KIO
1 - BEDOK
2 - BISHAN
3 - BUKIT BATOK
4 - BUKIT MERAH
5 - BUKIT PANJANG
6 - BUKIT TIMAH
7 - CENTRAL AREA
8 - CHOA CHU KANG
9 - CLEMENTI
10 - GEYLANG
11 - HOUGANG
12 - JURONG EAST
13 - JURONG WEST
14 - KALLANG/WHAMPOA
15 - LIM CHU KANG
16 - MARINE PARADE
17 - PASIR RIS
18 - QUEENSTOWN
19 - SEMBAWANG
20 - SENGKANG
21 - SERANGOON
22 - TAMPINES
23 - TOA PAYOH
24 - WOODLANDS
25 - YISHUN

Flat Type:
0 - 1 ROOM
1 - 2 ROOM
2 - 3 ROOM
3 - 4 ROOM
4 - 5 ROOM
5 - EXECUTIVE 
6 - MULTI GENERATION

Storey Range:
0 - 01 TO 03 
1 - 04 TO 06 
2 - 07 TO 09
3 - 10 TO 12
4 - 13 TO 15
5 - 16 TO 18
6 - 19 TO 21
7 - 22 TO 24
8 - 25 TO 27

Flat Model:
0 - 2-ROOM
1 - APARTMENT
2 - IMPROVED
3 - IMPROVED-MAISONETTE
4 - MAISONETTE
5 - MODEL A
6 - MODEL A-MAISONETTE
7 - MULTI GENERATION
8 - NEW GENERATION
9 - PREMIUM APARTMENT
10 - SIMPLIFIED
11 - STANDARD
12 - TERRACE
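
The mappings above can be read off a fitted `LabelEncoder`: its `classes_` attribute holds the original labels in sorted order, and each label's position is its integer code. The sketch below uses a short, made-up table rather than the real data set. It keeps one encoder per column in a dictionary (`encoders` is an illustrative name, not part of this notebook) so that a column's mapping is not overwritten when the next column is fitted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "town": ["BEDOK", "ANG MO KIO", "BISHAN", "BEDOK"],
    "flat_type": ["5 ROOM", "1 ROOM", "3 ROOM", "3 ROOM"],
})

encoders = {}  # one fitted encoder per column (hypothetical helper dict)
for column in ["town", "flat_type"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ is sorted; the index of each label is its integer code.
town_codes = {str(label): code for code, label in enumerate(encoders["town"].classes_)}
print(town_codes)
```

The same dictionary comprehension applied to each column's encoder would reproduce the full summary for the real data.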

## Training the model

Perform polynomial regression for orders 2 to 7 and find the best model.
We evaluate each model on the training data and the test data, and then make a prediction for the following example:

A 5-ROOM flat in BEDOK, FLOOR 04 TO 06, IMPROVED, FLOOR AREA = 121, LEASE COMMENCED in 1980


The corresponding feature vector, using the encodings above, is:
1 - BEDOK
4 - 5 ROOM
1 - 04 TO 06
121 - FLOOR AREA
2 - IMPROVED
1980 - LEASE COMMENCEMENT DATE


Actual Resale Price: 145,000

In [4]:
from sklearn.preprocessing import PolynomialFeatures
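
`PolynomialFeatures` expands each row into all monomials of the input features up to the chosen degree, including the constant term. For n features and degree d this gives C(n + d, d) columns, which determines whether the matrix P is tall or wide. The following sketch checks that count for the six features used here. The zero-filled array is only a placeholder with the right shape.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 6  # town, flat_type, storey_range, floor_area_sqm, flat_model, lease_commence_date
placeholder = np.zeros((1, n_features))  # values are irrelevant; only the shape matters

column_counts = {}
for degree in range(2, 8):
    n_columns = PolynomialFeatures(degree=degree).fit_transform(placeholder).shape[1]
    # Closed form: number of monomials of total degree <= `degree` in n_features variables
    assert n_columns == comb(n_features + degree, degree)
    column_counts[degree] = n_columns

print(column_counts)  # {2: 28, 3: 84, 4: 210, 5: 462, 6: 924, 7: 1716}
```

P is therefore tall at every order from 2 to 7 as long as the training split has more than 1,716 rows.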

In [5]:
# Orders 2 to 4

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_example = np.array([1, 4, 1, 121, 2, 1980])
results = []  # kept separate from the `data` DataFrame loaded earlier

for order in range(2, 5):
    poly = PolynomialFeatures(degree=order)
    P = poly.fit_transform(X_train)  # Check if this forms a tall or wide matrix
    P_T = P.T

    if P.shape[0] > P.shape[1]:  # More rows than columns == Tall
        # Left inverse (least squares): w = (P^T P)^-1 P^T y
        invPTP = np.linalg.inv(P_T.dot(P))
        w = (invPTP.dot(P_T)).dot(y_train)
    elif P.shape[0] < P.shape[1]:  # More columns than rows == Wide
        # Right inverse (minimum norm): w = P^T (P P^T)^-1 y
        invPPT = np.linalg.inv(P.dot(P_T))
        w = (P_T.dot(invPPT)).dot(y_train)
    else:
        continue  # Square P: neither formula is used, so skip this order

    # Evaluate based on train data
    y_pred_train = P.dot(w)
    RMSE_train = sqrt(mean_squared_error(y_train, y_pred_train))

    # Evaluate based on test data (reuse the transformer fitted on the train data)
    P_test = poly.transform(X_test)
    y_pred_test = P_test.dot(w)
    RMSE_test = sqrt(mean_squared_error(y_test, y_pred_test))

    # Example prediction
    P_example = poly.transform(X_example.reshape(1, -1))
    y_pred_example = P_example.dot(w)

    results.append([order, RMSE_train, RMSE_test, y_pred_example[0]])

col = ['Order', 'RMSE_train', 'RMSE_test', 'Prediction']
df = pd.DataFrame(results, columns=col)

print(df.to_string(columns=col, index=False))

 Order    RMSE_train     RMSE_test    Prediction
     2  7.423838e+04  7.438035e+04  2.946108e+05
     3  2.976056e+05  2.978686e+05  5.893932e+05
     4  2.481218e+06  2.479612e+06  3.387227e+06
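
The two branches in the loop above are the left inverse (tall P, least-squares solution) and the right inverse (wide P, minimum-norm solution). As a small sanity check on random matrices, not on the flat-price data, both formulas agree with NumPy's `np.linalg.pinv` when P has full rank:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall case: more rows than columns -> left inverse (P^T P)^-1 P^T
P_tall = rng.normal(size=(8, 3))
left_inverse = np.linalg.inv(P_tall.T @ P_tall) @ P_tall.T

# Wide case: more columns than rows -> right inverse P^T (P P^T)^-1
P_wide = rng.normal(size=(3, 8))
right_inverse = P_wide.T @ np.linalg.inv(P_wide @ P_wide.T)

# Compare against the Moore-Penrose pseudo-inverse computed via SVD
tall_matches = np.allclose(left_inverse, np.linalg.pinv(P_tall))
wide_matches = np.allclose(right_inverse, np.linalg.pinv(P_wide))
print(tall_matches, wide_matches)
```

The agreement holds for well-conditioned matrices like these; for badly scaled polynomial features the explicit inverse can drift away from the SVD-based result.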


In [6]:
# Orders 5 to 7

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_example = np.array([1, 4, 1, 121, 2, 1980])
results = []  # kept separate from the `data` DataFrame loaded earlier

for order in range(5, 8):
    poly = PolynomialFeatures(degree=order)
    P = poly.fit_transform(X_train)  # Check if this forms a tall or wide matrix
    P_T = P.T

    if P.shape[0] > P.shape[1]:  # More rows than columns == Tall
        # Left inverse (least squares): w = (P^T P)^-1 P^T y
        invPTP = np.linalg.inv(P_T.dot(P))
        w = (invPTP.dot(P_T)).dot(y_train)
    elif P.shape[0] < P.shape[1]:  # More columns than rows == Wide
        # Right inverse (minimum norm): w = P^T (P P^T)^-1 y
        invPPT = np.linalg.inv(P.dot(P_T))
        w = (P_T.dot(invPPT)).dot(y_train)
    else:
        continue  # Square P: neither formula is used, so skip this order

    # Evaluate based on train data
    y_pred_train = P.dot(w)
    RMSE_train = sqrt(mean_squared_error(y_train, y_pred_train))

    # Evaluate based on test data (reuse the transformer fitted on the train data)
    P_test = poly.transform(X_test)
    y_pred_test = P_test.dot(w)
    RMSE_test = sqrt(mean_squared_error(y_test, y_pred_test))

    # Example prediction
    P_example = poly.transform(X_example.reshape(1, -1))
    y_pred_example = P_example.dot(w)

    results.append([order, RMSE_train, RMSE_test, y_pred_example[0]])

col = ['Order', 'RMSE_train', 'RMSE_test', 'Prediction']
df = pd.DataFrame(results, columns=col)

print(df.to_string(columns=col, index=False))

 Order    RMSE_train     RMSE_test    Prediction
     5  2.683441e+06  2.671977e+06  1.716856e+05
     6  2.642312e+07  2.643392e+07  3.642878e+07
     7  1.665590e+08  1.675967e+08  1.438205e+08
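
Forming and inverting P^T P explicitly can lose precision when the columns of P differ by many orders of magnitude, as they do when raw values such as 1980 are raised to high powers. One common alternative is to standardize the features first and solve with `np.linalg.lstsq`. The sketch below does this on synthetic data, not on the flat-prices file. The helper name `fit_poly_rmse` and the generated data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (e.g. a code, an area, a year)
X = rng.normal(size=(300, 3)) * [2.0, 30.0, 10.0] + [5.0, 100.0, 1980.0]
y = 1000.0 * X[:, 1] + 50.0 * (X[:, 2] - 1980.0) ** 2 + rng.normal(scale=500.0, size=300)


def fit_poly_rmse(X, y, degree):
    """Fit a polynomial model by least squares and return the training RMSE."""
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
    P = PolynomialFeatures(degree=degree).fit_transform(X_scaled)
    w, *_ = np.linalg.lstsq(P, y, rcond=None)  # avoids forming inv(P.T @ P)
    residuals = y - P @ w
    return float(np.sqrt(np.mean(residuals ** 2)))


train_rmse = {degree: fit_poly_rmse(X, y, degree) for degree in range(1, 5)}
for degree, rmse in train_rmse.items():
    print(degree, round(rmse, 1))
```

Because each higher-degree model contains the lower-degree ones, the training RMSE from such a fit should not increase with degree. A rise is a sign of numerical trouble rather than of a worse model.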


## Conclusion

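Of the polynomial models trained (orders 2 to 7), the order-2 model gives the lowest RMSE on both the training data (about 74,238) and the test data (about 74,380). Even so, its prediction for the example flat (about 294,611) is roughly twice the actual resale price of 145,000. Training RMSE should not rise with order for an exact least-squares fit, because each higher-order model contains the lower-order ones. The increasing errors above therefore likely reflect numerical instability when inverting P^T P on unscaled features, rather than genuinely worse models.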