# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is a standard example in regression courses: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, predict the price of an apartment with a size of 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [2]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

## Load the data

In [3]:
data1 = pd.read_csv('real_estate_price_size_year.csv')

In [4]:
data1.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


## Create the regression

### Declare the dependent and the independent variables

In [5]:
x_var = data1[['size','year']]
y_var = data1['price']

### Regression

In [10]:
reg = LinearRegression()
reg.fit(x_var, y_var)
# Check the shape of the inputs: (observations, features)
x_var.shape


(100, 2)

### Find the intercept

In [12]:
reg.intercept_

-5772267.017463278

### Find the coefficients

In [13]:
reg.coef_

array([ 227.70085401, 2916.78532684])

### Calculate the R-squared

In [15]:
reg.score(x_var, y_var)

0.7764803683276793

### Calculate the Adjusted R-squared

In [16]:
r2 = reg.score(x_var, y_var)

In [17]:
n = x_var.shape[0]
p = x_var.shape[1]
adj_r = 1-(1-r2)*(n-1)/(n-p-1)
adj_r

0.77187171612825

### Compare the R-squared and the Adjusted R-squared

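The R-squared (0.776) is slightly higher than the Adjusted R-squared (0.772). The gap is small, so the model is penalized only slightly for using two features.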

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

In [39]:
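# Fit the simple regression on 'size' alone and store its R-squared
x_matrix = x_var['size'].values.reshape(-1, 1)
reg.fit(x_matrix, y_var)
r2_simple = reg.score(x_matrix, y_var)
# Refit on both features so that reg is the multiple regression again
reg.fit(x_var, y_var)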

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
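Refitting the same `reg` object is easy to get wrong, so one alternative is to keep two separate estimators. The sketch below uses synthetic data, because the CSV is not reproduced here: the column names match the exercise, but every number in `toy` is made up, and `reg_simple` and `reg_multi` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real estate data (values are invented)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    'size': rng.uniform(500, 1500, n),
    'year': rng.choice([2006, 2009, 2012, 2015, 2018], n),
})
toy['price'] = (100_000 + 200 * toy['size']
                + 3_000 * (toy['year'] - 2006)
                + rng.normal(0, 30_000, n))

# One estimator per model, so neither fit overwrites the other
reg_simple = LinearRegression().fit(toy[['size']], toy['price'])
reg_multi = LinearRegression().fit(toy[['size', 'year']], toy['price'])

r2_simple = reg_simple.score(toy[['size']], toy['price'])
r2_multi = reg_multi.score(toy[['size', 'year']], toy['price'])
print(r2_simple, r2_multi)
```

The in-sample R-squared of a least-squares model cannot decrease when a feature is added, which is why the Adjusted R-squared is the fairer number to set against the simple regression.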

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [62]:
# Build the new observation with the same column names used for fitting
df = pd.DataFrame(data=[[750, 2009]], columns=['size', 'year'])
# Predict from the two feature columns and store the rounded estimate
df['est. price'] = reg.predict(df[['size', 'year']]).round(2)
df

| | size | year | est. price |
|---|---|---|---|
| 0 | 750 | 2009 | 258330.34 |
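As a sanity check, the same prediction can be reproduced by hand from the fitted parameters, since a linear regression predicts intercept + coefficients · features. The sketch below hard-codes the intercept and coefficients printed earlier in this notebook; the variable names are illustrative.

```python
import numpy as np

# Fitted parameters as printed by reg.intercept_ and reg.coef_ above
intercept = -5772267.017463278
coefficients = np.array([227.70085401, 2916.78532684])  # size, year

# The apartment to price: 750 sq.ft., built in 2009
apartment = np.array([750, 2009])

# Linear model: price = intercept + coef_size * size + coef_year * year
manual_prediction = intercept + coefficients @ apartment
print(round(manual_prediction, 2))
```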


### Calculate the univariate p-values of the variables

In [50]:
f_regression(x_var,y_var)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))
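`f_regression` returns two arrays: the F-statistics and the corresponding p-values, one entry per feature, each from a simple regression of the target on that feature alone. With centering (the default), the F-statistic is R²/(1 − R²) · (n − 2), where R² is the R-squared of that simple regression. The sketch below checks this on synthetic data (all values in `size`, `year` and `price` are made up), then inverts the formula for the F-statistic of 'size' printed above, assuming n = 100 as in this notebook.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends on size, not on year (values are invented)
rng = np.random.default_rng(1)
n = 100
size = rng.uniform(500, 1500, n)
year = rng.choice([2006, 2009, 2012, 2015, 2018], n).astype(float)
price = 100_000 + 200 * size + rng.normal(0, 50_000, n)
X = np.column_stack([size, year])

f_stats, p_vals = f_regression(X, price)

# Rebuild the F-statistic of 'size' from its simple-regression R-squared
size_only = X[:, [0]]
r2_size = LinearRegression().fit(size_only, price).score(size_only, price)
f_manual = r2_size / (1 - r2_size) * (n - 2)
p_manual = stats.f.sf(f_manual, 1, n - 2)
print(f_manual, f_stats[0])

# Inverting the formula for the notebook's F-statistic of 'size' (n = 100)
f_size_doc = 285.92105192
r2_simple_doc = f_size_doc / (f_size_doc + 100 - 2)
print(r2_simple_doc)
```

Under that assumption, the last line gives the R-squared of the size-only regression (about 0.745), which can be set against the Adjusted R-squared of the two-feature model.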

In [51]:
p_values = f_regression(x_var,y_var)[1].round(3)

In [57]:
x_var.columns.values

array(['size', 'year'], dtype=object)

### Create a summary table with your findings

In [59]:
reg_summary = pd.DataFrame(data = x_var.columns.values, columns = ['Features'])
reg_summary['Coefficients'] = reg.coef_


In [61]:
reg_summary['P-values'] = p_values
reg_summary['Significance'] = np.where(reg_summary['P-values']<=0.05, 'Yes','No')
reg_summary

| | Features | Coefficients | P-values | Significance |
|---|---|---|---|---|
| 0 | size | 227.700854 | 0.0 | Yes |
| 1 | year | 2916.785327 | 0.357 | No |


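Using the univariate p-values, 'size' is statistically significant at the 0.05 level (p-value ≈ 0.000), while 'year' is not (p-value ≈ 0.357). Note that `f_regression` tests each feature in a separate simple regression, so these p-values do not account for the other feature.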
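The exercise also mentions multivariate p-values, which sklearn's `LinearRegression` does not provide. One way to obtain them is the standard t-test on each coefficient of the full model. The sketch below is an assumption about how that could be done, not the method from the referenced article: `multivariate_p_values` is a hypothetical helper, and the data are synthetic.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

def multivariate_p_values(X, y):
    """Two-sided t-test p-values for the slopes of an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    reg = LinearRegression().fit(X, y)
    residuals = y - reg.predict(X)
    dof = n - p - 1                      # residual degrees of freedom
    sigma2 = residuals @ residuals / dof
    # Centering X yields the covariance of the slopes only and is
    # numerically stabler than inverting the uncentered design matrix
    Xc = X - X.mean(axis=0)
    cov = sigma2 * np.linalg.inv(Xc.T @ Xc)
    t_stats = reg.coef_ / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)

# Synthetic data: price depends on size, not on year (values are invented)
rng = np.random.default_rng(2)
n = 100
size = rng.uniform(500, 1500, n)
year = rng.choice([2006, 2009, 2012, 2015, 2018], n).astype(float)
price = 100_000 + 200 * size + rng.normal(0, 50_000, n)
X = np.column_stack([size, year])

p_multi = multivariate_p_values(X, price)
print(p_multi)

# With a single feature, the t-test and the univariate F-test coincide
p_year_alone = multivariate_p_values(X[:, [1]], price)[0]
p_year_univariate = f_regression(X[:, [1]], price)[1][0]
print(p_year_alone, p_year_univariate)
```

Unlike the univariate values, each of these p-values is conditional on the other feature being in the model, so the two sets can disagree when the features are correlated.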