# Multi-dimensional linear regression

## Case 1: Hotel price prediction

In the following we want to create a model that predicts hotel prices from two variables: revenue and square meters. We also want to check the quality of the model via the R2 score.

Note that linear regression mathematically amounts to solving a potentially large system of linear equations. It is only necessary to include linearly independent columns of the data frame: dependent columns carry redundant information and thus will not improve the quality of the fit. If such redundancies are not clear a priori from the data frame itself, LinearRegression() will still cope with them, however, potentially at the cost of computational time.
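The effect of a redundant column can be checked on a small example. The following is a minimal sketch on synthetic data (not the hotel data), assuming a plain least-squares fit via `np.linalg.lstsq` as a stand-in for the solver behind LinearRegression(). All variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two independent features and a target that depends
# on them linearly, plus a little noise.
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + 0.5 + 0.1 * rng.normal(size=n)

ones = np.ones(n)  # constant column, plays the role of the intercept
A_full = np.column_stack([ones, x1, x2])
A_redundant = np.column_stack([ones, x1, x2, x1 + x2])  # last column is dependent

# Least-squares fits; lstsq also handles the rank-deficient matrix
coef_full, *_ = np.linalg.lstsq(A_full, y, rcond=None)
coef_red, _, rank_red, _ = np.linalg.lstsq(A_redundant, y, rcond=None)

pred_full = A_full @ coef_full
pred_red = A_redundant @ coef_red

print("rank of redundant design matrix:", rank_red)
print("max difference in fitted values:", np.max(np.abs(pred_full - pred_red)))
```

The redundant matrix has four columns but only rank 3, and the fitted values of both models agree up to rounding error, so the extra column does not improve the fit.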

In [203]:
#import the necessary packages

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [189]:
#define data frame from csv file and print head

df = pd.read_csv("hotels.csv")

df.head()

Unnamed: 0,revenue,price in m,square meters,city
0,119000.0,21.88,3938.0,Berlin
1,250000.0,27.95,3986.0,Munich
2,250000.0,16.09,2574.0,Cologne
3,145000.0,27.58,4155.0,Munich
4,110000.0,23.76,3795.0,Berlin


In [174]:
#define the variables from the data frame and convert them to numpy arrays
X = df[["revenue", "square meters"]].values
Y = df[["price in m"]].values
#print(X,Y)

In [175]:
#split X and Y into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.25)

In [176]:
#build model from training data
model = LinearRegression()
model.fit(X_train, Y_train)

#print the intercept c and the coefficients a, b of "z = a * x + b * y + c"
print("Intercept: " + str(model.intercept_))
print("Coef: " + str(model.coef_))

Intercept: [6.48370247]
Coef: [[6.39855984e-06 3.89642288e-03]]


Comment: The result is sensitive to the choice of "random_state = ...". We elaborate on this further down.

In [177]:
#generate prediction for remaining test data
Y_test_pred = model.predict(X_test)

#print(Y_test_pred)

### R2 score to compare models

We would like to compare our prediction to the actual test data. In a first step we juxtapose the respective values; in a second step we use the R2 score to gauge the quality of our model.
(For information on this quantity, see: https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)

In [178]:
#simple comparison between predicted and actual test data
#for i in range(0, len(Y_test_pred)):
#    print(str(Y_test_pred[i][0]) + " - " + str(Y_test[i][0]))

In [179]:
#compute the R2 score of the prediction on the test data
r2 = r2_score(Y_test, Y_test_pred)
print(r2)

0.8783249527580934


Since this value is close to 1, the model fits the test data rather well.
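To make the meaning of this number more concrete, the R2 score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch with made-up numbers; the function name `r2_by_hand` and the sample values are hypothetical and not part of the hotel data.

```python
def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot for two equally long sequences of numbers."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # spread of the data
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_good = [2.8, 5.1, 7.2, 8.9]   # predictions close to the true values
y_mean = [6.0, 6.0, 6.0, 6.0]   # always predicting the mean of y_true

print(round(r2_by_hand(y_true, y_good), 3))  # 0.995
print(r2_by_hand(y_true, y_mean))            # 0.0
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model that is worse than predicting the mean gets a negative score.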

We may also fit the model and get the score in a compact fashion:

In [180]:
model = LinearRegression()
model.fit(X_train, Y_train)

print(model.score(X_test, Y_test))

0.8783249527580934


Since the R2 score is sensitive to the random_state option of the train_test_split function, we may average it over 1000 fits with different random splits. In addition, this allows us to compare fits that use more or fewer columns of the data frame, which ultimately shows which data is relevant and which is not:

In [181]:
#when considering the revenue and square meters column

scores = []

X = df[["revenue","square meters"]].values
Y = df[["price in m"]].values

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
    
print(sum(scores)/ len(scores))

0.8189770791478341


In [182]:
#when considering the square meters column

scores = []

X = df[["square meters"]].values
Y = df[["price in m"]].values

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
    
print(sum(scores)/ len(scores))

0.8211365797393368


In [183]:
#when considering the revenue column

scores = []

X = df[["revenue"]].values
Y = df[["price in m"]].values

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
    
print(sum(scores)/ len(scores))

0.3066324116319227


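The model that uses only the square meters column (mean R2 of about 0.82) performs as well as the model that uses both columns (also about 0.82), whereas the revenue column alone reaches only about 0.31. From these values we deduce that the revenue column is not of significance to the model building/fitting.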
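The three cells above repeat the same loop with different columns. The following is a minimal sketch of a helper that wraps this loop and also reports the spread of the scores. The function name `repeated_split_scores` and the stand-in data `X_demo`, `Y_demo` are hypothetical; the data is synthetic, not hotels.csv. Unlike the cells above, the sketch passes the loop index as random_state so that a rerun reproduces the same scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def repeated_split_scores(X, Y, n_runs=1000, test_size=0.25):
    """Fit on n_runs random train/test splits and return all test R2 scores."""
    scores = []
    for seed in range(n_runs):
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_train, Y_train)
        scores.append(model.score(X_test, Y_test))
    return np.array(scores)


# Synthetic stand-in data: one feature with a linear effect plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(1000, 5000, size=(120, 1))
Y_demo = 0.004 * X_demo + 6.0 + rng.normal(scale=1.0, size=(120, 1))

demo_scores = repeated_split_scores(X_demo, Y_demo, n_runs=200)
print("mean R2:", demo_scores.mean(), "std:", demo_scores.std())
```

With the hotel data one would call the helper once per column selection, for example with `df[["square meters"]].values` as X, and compare the means in light of their standard deviations.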

### Including nominal data into the model building

To include the information regarding the city we need one-hot encoding, which turns a nominal column into one 0/1 column per category and can be done with a pandas method; see also http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example and https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/.

In [192]:
#implementation of one-hot-encoding via pandas to generate a new data frame
df = pd.read_csv("hotels.csv")

df = pd.get_dummies(df, columns = ["city"])

df.head()

Unnamed: 0,revenue,price in m,square meters,city_Berlin,city_Cologne,city_Munich
0,119000.0,21.88,3938.0,1,0,0
1,250000.0,27.95,3986.0,0,0,1
2,250000.0,16.09,2574.0,0,1,0
3,145000.0,27.58,4155.0,0,0,1
4,110000.0,23.76,3795.0,1,0,0


In [225]:
#define variables
X = df[["revenue", "square meters", "city_Berlin", "city_Munich"]].values
#alternatively: X = df.drop(columns = ["Unnamed: 0", "price in m", "city_Cologne"]).values
Y = df[["price in m"]].values

Notice that adding the third "city" column (city_Cologne) would not include any new information, as this column would be linearly dependent on the other two together with the intercept: city_Cologne = 1 - city_Berlin - city_Munich.
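This dependence can be verified numerically. The following is a minimal sketch that one-hot encodes the city values of the five rows shown above by hand and compares matrix ranks; the variable names are chosen for illustration only.

```python
import numpy as np

# City values of the first five rows of the table above
cities = ["Berlin", "Munich", "Cologne", "Munich", "Berlin"]
categories = sorted(set(cities))  # ['Berlin', 'Cologne', 'Munich']

# One-hot encode by hand: one 0/1 column per category
one_hot = np.array([[int(c == cat) for cat in categories] for c in cities])

# Each row contains exactly one 1, so the three columns add up to the
# constant column that represents the intercept.
intercept = np.ones((len(cities), 1))
all_three = np.hstack([intercept, one_hot])
two_only = np.hstack([intercept, one_hot[:, [0, 2]]])  # Cologne left out

print(np.linalg.matrix_rank(all_three))  # 3, although there are 4 columns
print(np.linalg.matrix_rank(two_only))   # 3, with 3 columns
```

With all three city columns the matrix has four columns but only rank 3, so one of them is redundant. Leaving out one city removes the redundancy without losing information.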

In [226]:
scores = []

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
    
print(sum(scores)/ len(scores))

0.9691996044271944


In [229]:
#model.coef_

In [230]:
#to make a prediction for a hotel in the city whose column was left out (Cologne), set both city columns to 0:
model.predict([[10000, 300, 0, 0]])

array([[4.93943481]])
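Setting both city columns to 0 works because the left-out city acts as the baseline: its effect is absorbed into the intercept, and the coefficients of the remaining city columns measure the offset relative to it. The following is a minimal sketch with hypothetical coefficients, chosen for illustration only; they are not the fitted values of the model above.

```python
import numpy as np

# Hypothetical intercept and coefficients (NOT the fitted values above).
# Column order: revenue, square meters, city_Berlin, city_Munich
intercept = 5.0
coef = np.array([1e-6, 0.004, 2.0, 4.5])


def predict(revenue, square_meters, city):
    """Linear prediction; Cologne is the baseline with both city columns 0."""
    row = np.array(
        [revenue, square_meters, city == "Berlin", city == "Munich"],
        dtype=float,
    )
    return intercept + coef @ row


base = predict(100000, 3000, "Cologne")
diff_berlin = predict(100000, 3000, "Berlin") - base
diff_munich = predict(100000, 3000, "Munich") - base
print(diff_berlin, diff_munich)
```

For otherwise identical hotels, the predicted difference to the baseline city is exactly the coefficient of the respective city column.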

## Case 2: Car price prediction

In the following we would like to create two models for car price prediction and compare them via their R2 scores.

In [164]:
#define data frame
df = pd.read_csv("cars_prepared.csv")

#print the head of the data frame
df.head()

Unnamed: 0,price,yearOfRegistration,powerPS,kilometer,model,fuelType,name
0,1450,1997,75,90000,andere,benzin,Toyota_Toyota_Starlet_1._Hand__TÜV_neu
1,13100,2005,280,5000,golf,benzin,R32_tauschen_oder_kaufen
2,4500,2008,87,90000,yaris,benzin,Toyota_Yaris_1.3_VVT_i
3,6000,2009,177,125000,3er,diesel,320_Alpinweiss_Kohlenstoff
4,3990,1999,118,90000,3er,benzin,BMW_318i_E46_+++_1._Hand_+++_Liebhaberfahrzeug


### Model 1: Kilometer - Price fitting

In [165]:
scores = []
intercepts = []
coefs = []

#define X,Y from data frame
X = df[["kilometer"]].values #to generate numpy array, check via type(X)
Y = df[["price"]].values

In [166]:
for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)), "Average intercept: " + str((sum(intercepts)/len(intercepts))[0]), "Average coef: " + str((sum(coefs)/len(coefs))[0][0]))

Average score: 0.14359169088307702 Average intercept: 15968.270188875034 Average coef: -0.08775728962333613


### Model 2: Kilometer, PS and year of registration - Price fitting

In [167]:
scores = []
intercepts = []
coefs = []


#define X,Y from data frame
X = df[["kilometer","powerPS","yearOfRegistration"]].values #to generate numpy array, check via type(X)
Y = df[["price"]].values

In [162]:
for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)), "Average intercept: " + str((sum(intercepts)/len(intercepts))), "Average coefs: " + str((sum(coefs)/len(coefs))[0]))

Average score: 0.5120554989281118 Average intercept: [-451910.62329917] Average coefs: [-6.80736336e-02  8.94824496e+01  2.27091393e+02]


Comment: We see that the R2 score of the extended model (about 0.51) is much better than that of the simple fit (about 0.14).
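To see how the averaged intercept and coefficients translate into a price, they can be applied by hand to one row of the data frame. The following is a minimal sketch that copies the averaged values from the output above and uses row 2 of the head (the Toyota Yaris). It assumes that the averaged coefficients are used as one linear model, which is an illustration rather than one of the 1000 fitted models.

```python
# Averaged values copied from the output above
intercept = -451910.62329917
coef_km, coef_ps, coef_year = -6.80736336e-02, 8.94824496e+01, 2.27091393e+02

# Row 2 of the data frame head: Toyota Yaris
kilometer, power_ps, year = 90000, 87, 2008
listed_price = 4500

# price = intercept + a * kilometer + b * powerPS + c * yearOfRegistration
predicted = intercept + coef_km * kilometer + coef_ps * power_ps + coef_year * year

print(round(predicted, 2))
print(round(predicted - listed_price, 2))
```

For this row the sketch yields roughly 5750 against a listed price of 4500, a deviation that is plausible for an R2 score of about 0.51. The large negative intercept mainly offsets the year term, which alone contributes about 456000.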

### Including nominal data into the model building

In the following we would like to train yet another model to predict car prices by including information on the model and fuel types. This nominal data can be incorporated via one-hot-encoding.

In [235]:
#implementation of one-hot-encoding via pandas to generate a new data frame
df = pd.read_csv("cars_prepared.csv")

df = pd.get_dummies(df, columns = ["model", "fuelType"])

df.head()

Unnamed: 0,price,yearOfRegistration,powerPS,kilometer,name,model_147,model_156,model_1_reihe,model_1er,model_2_reihe,...,model_up,model_vectra,model_vivaro,model_yaris,model_yeti,model_zafira,fuelType_benzin,fuelType_diesel,fuelType_hybrid,fuelType_lpg
0,1450,1997,75,90000,Toyota_Toyota_Starlet_1._Hand__TÜV_neu,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,13100,2005,280,5000,R32_tauschen_oder_kaufen,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,4500,2008,87,90000,Toyota_Yaris_1.3_VVT_i,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
3,6000,2009,177,125000,320_Alpinweiss_Kohlenstoff,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,3990,1999,118,90000,BMW_318i_E46_+++_1._Hand_+++_Liebhaberfahrzeug,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [238]:
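#define variables and include in X only some of the models but all fuel types
#(if every car has exactly one of these four fuel types, the four fuel columns add up to 1, so one of them is redundant)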
X = df[["kilometer","powerPS","yearOfRegistration", "model_147", "model_156", "model_1_reihe", "model_1er", "model_2_reihe", "model_up", "model_vectra", "model_vivaro", "model_yaris", "model_yeti", "model_zafira", "fuelType_benzin", "fuelType_diesel", "fuelType_hybrid", "fuelType_lpg"]].values
Y = df[["price"]].values

In [237]:
scores = []
intercepts = []
coefs = []

for i in range(0,1000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
    scores.append(model.score(X_test, Y_test))
    
print("Average score: " + str(sum(scores)/ len(scores)))


Average score: 0.5621081255590241


Notice that in this case the inclusion of nominal data yields a slight improvement over the previous fit (mean R2 of about 0.56 versus about 0.51).
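The feature list above names only the model columns that are visible in the truncated head() output. A minimal sketch of how all dummy columns could be selected by prefix instead, assuming a tiny hypothetical frame with the same column names as cars_prepared.csv (the rows are taken from the head shown earlier; the frame `cars` itself is made up for illustration):

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as cars_prepared.csv
cars = pd.DataFrame({
    "price": [1450, 13100, 4500, 6000],
    "kilometer": [90000, 5000, 90000, 125000],
    "model": ["andere", "golf", "yaris", "3er"],
    "fuelType": ["benzin", "benzin", "benzin", "diesel"],
})

# drop_first=True omits one reference category per encoded column,
# dtype=int yields 0/1 integers instead of booleans
encoded = pd.get_dummies(
    cars, columns=["model", "fuelType"], drop_first=True, dtype=int
)

# Collect every dummy column by its prefix instead of typing the names
dummy_cols = [c for c in encoded.columns if c.startswith(("model_", "fuelType_"))]
X = encoded[["kilometer"] + dummy_cols].values
Y = encoded[["price"]].values

print(dummy_cols)
print(X.shape)
```

Whether using all model columns actually improves the averaged R2 score on the real data is not shown here and would have to be checked with the loop above.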