In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
sb.set()
from sklearn.linear_model import LinearRegression

In [2]:
data = pd.read_csv('real_estate_price_dataset.csv')
data

Unnamed: 0,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009
...,...,...,...
95,252460.400,549.80,2009
96,310522.592,1037.44,2009
97,383635.568,1504.75,2006
98,225145.248,648.29,2015


In [4]:
data.describe()

Unnamed: 0,price,size,year
count,100.0,100.0,100.0
mean,292289.47016,853.0242,2012.6
std,77051.727525,297.941951,4.729021
min,154282.128,479.75,2006.0
25%,234280.148,643.33,2009.0
50%,280590.716,696.405,2015.0
75%,335723.696,1029.3225,2018.0
max,500681.128,1842.51,2018.0


## Create the Regression

### Declare the dependent and independent variables

In [5]:
x = data[['size','year']]
y = data['price']

### Scale the inputs

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)

## Regression

In [8]:
reg = LinearRegression()
reg.fit(x_scaled,y)

#### Intercept

In [9]:
reg.intercept_

292289.4701599997

#### Coefficients

In [10]:
reg.coef_

array([67501.57614152, 13724.39708231])
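Because the model was fitted on standardized inputs, each coefficient is the price change per one standard deviation of the feature, not per sq.ft. or per year. As a rough sketch, dividing by the standard deviations converts them back to original units. The standard deviations are taken from `data.describe()` above; pandas reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population one (ddof=0), so a small correction is applied. The variable names are illustrative and not part of the notebook.

```python
import math

n = 100  # number of observations, from data.describe()

# Sample standard deviations (ddof=1) as reported by data.describe()
sample_std = {"size": 297.941951, "year": 4.729021}

# Coefficients of the model fitted on standardized inputs (reg.coef_)
scaled_coef = {"size": 67501.57614152, "year": 13724.39708231}

# StandardScaler divides by the population standard deviation (ddof=0)
correction = math.sqrt((n - 1) / n)

# Price change per original unit (per sq.ft. and per year)
raw_coef = {
    name: scaled_coef[name] / (sample_std[name] * correction)
    for name in scaled_coef
}

for name, value in raw_coef.items():
    print(f"{name}: {value:.2f} per unit")
```

The rescaled values should be close to what a regression on the unscaled `x` would give, since standardization only changes the units of the coefficients.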

### Calculate the R-squared

In [11]:
reg.score(x_scaled,y)

0.7764803683276793

### Adjusted R-squared

In [13]:
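def adj_R_squared(R):
    # n = number of observations, p = number of predictors (both read from the global x)
    n = x.shape[0]
    p = x.shape[1]
    # Penalize the R-squared for the number of predictors used
    adj_R = 1 - (1-R)*(n-1)/(n-p-1)
    return adj_R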

In [15]:
adj_R = adj_R_squared(reg.score(x_scaled,y))
adj_R

0.77187171612825
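For reference, the function above implements the usual adjusted R-squared formula, where $n$ is the number of observations and $p$ the number of predictors. Written out with this notebook's values ($n = 100$, $p = 2$, R-squared rounded to four decimals):

$$
R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} = 1 - (1 - 0.7765)\cdot\frac{99}{97} \approx 0.7719
$$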

## Compare the R-Squared and Adj-R-Squared

The R-squared (0.776) is only slightly larger than the Adjusted R-squared (0.772), implying that we were not penalized much for including 2 independent variables.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Comparing the Adjusted R-squared with the R-squared of the simple linear regression (when only 'size' was used, a couple of lectures ago), we see that 'year' does not add much explanatory power to the model.
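The R-squared of the simple regression is not shown in this notebook, but as a sketch it can be recovered from the univariate F-statistic for 'size' that `f_regression` reports further down (about 285.92). For a single feature with an intercept, F = r² / (1 - r²) · (n - 2), so r² = F / (F + n - 2), and r² equals the R-squared of the simple regression. The variable names below are illustrative.

```python
n = 100  # number of observations

# Univariate F-statistic for 'size', as reported by f_regression in this notebook
f_size = 285.92105192

# For a single feature with an intercept: F = r^2 / (1 - r^2) * (n - 2)
r2_simple = f_size / (f_size + (n - 2))

# Adjusted R-squared of the two-feature model, from the notebook
adj_r2_multiple = 0.77187171612825

print(f"R-squared, size only: {r2_simple:.4f}")
print(f"Adjusted R-squared, size + year: {adj_r2_multiple:.4f}")
print(f"Difference: {adj_r2_multiple - r2_simple:.4f}")
```

The difference between the two numbers is the gain attributable to adding 'year', after the penalty for the extra predictor.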

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [16]:
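# Use the same column names as the training data so the scaler sees matching features
new_data = pd.DataFrame([[750, 2009]], columns=['size', 'year'])
new_data_scaled = scaler.transform(new_data)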



In [17]:
reg.predict(new_data_scaled)

array([258330.34465995])
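To see what `predict` does with the scaled input, the calculation can be sketched by hand: standardize each feature, multiply by its coefficient, and add the intercept. The means and standard deviations below are copied from `data.describe()`, so they are rounded; the sample standard deviations are converted to the population version that `StandardScaler` uses. The variable names are illustrative.

```python
import math

n = 100  # number of observations

# Summary statistics copied from data.describe() (std is the sample std, ddof=1)
mean = {"size": 853.0242, "year": 2012.6}
sample_std = {"size": 297.941951, "year": 4.729021}

# Fitted model parameters (reg.intercept_ and reg.coef_)
intercept = 292289.4701599997
coef = {"size": 67501.57614152, "year": 13724.39708231}

new_apartment = {"size": 750, "year": 2009}

# StandardScaler uses the population std (ddof=0)
correction = math.sqrt((n - 1) / n)

prediction = intercept
for name, value in new_apartment.items():
    z = (value - mean[name]) / (sample_std[name] * correction)  # standardized feature
    prediction += coef[name] * z

print(round(prediction, 2))
```

Up to the rounding in the summary statistics, the result should agree with the output of `reg.predict` above.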

# Calculate the univariate p-values of the variables

In [18]:
from sklearn.feature_selection import f_regression

In [19]:
f_regression(x_scaled,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [22]:
# f_regression returns (F-statistics, p-values); keep only the p-values
p_values = f_regression(x_scaled,y)[1]
p_values.round(3)

array([0.   , 0.357])

# Creating the summary table

In [23]:
reg_summary = pd.DataFrame(data = x.columns.values, columns = ['Features'])
reg_summary

Unnamed: 0,Features
0,size
1,year


In [24]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

Unnamed: 0,Features,Coefficients,p-values
0,size,67501.576142,0.0
1,year,13724.397082,0.357


It seems that 'year' is not even significant (its univariate p-value of 0.357 is well above the usual 0.05 threshold), therefore we should remove it from the model.

Note that this dataset is extremely clean and probably artificially created, so standardization does not really add any value here.