# Multiple Linear Regression with Dummies - Exercise

You are given a real estate dataset. 

Real estate is a staple example in regression courses: it is easy to understand, and there is almost always a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year_view.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

In this exercise, the dependent variable is 'price', while the independent variables are 'size', 'year', and 'view'.

#### Regarding the 'view' variable:
There are two options: 'Sea view' and 'No sea view'. You are expected to create a dummy variable for 'view' and include it in the regression.
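
Written out, the model this exercise asks for could look like the following. The coefficient symbols $\beta_0, \dots, \beta_3$, the error term $\varepsilon$, and the 0/1 coding of the dummy are notational assumptions, not something fixed by the dataset:

```latex
\[
\text{price} = \beta_0 + \beta_1\,\text{size} + \beta_2\,\text{year} + \beta_3\,\text{view} + \varepsilon,
\qquad
\text{view} =
\begin{cases}
1 & \text{if 'Sea view'} \\
0 & \text{if 'No sea view'}
\end{cases}
\]
```

Under this coding, $\beta_3$ is the expected price difference between a property with a sea view and an otherwise identical one without.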

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Load the data

In [2]:
raw_data = pd.read_csv('real_estate_price_size_year_view.csv')

## Create a dummy variable for 'view'

In [None]:
data = raw_data.copy()

data['view'] = data['view'].map({'No sea view':0,'Sea view':1})
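
One thing worth checking after a `map` like this: any value that is not a key in the dictionary becomes `NaN`, which could then distort or break the regression. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the CSV):

```python
import pandas as pd

# Hypothetical toy data; the third label is misspelled on purpose
toy = pd.DataFrame({'view': ['Sea view', 'No sea view', 'sea view']})

# Same mapping as in the exercise
mapped = toy['view'].map({'No sea view': 0, 'Sea view': 1})

# Labels without a matching key become NaN, so count them before fitting
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

On the exercise data, the same check would be `data['view'].isna().sum()`, which should be 0 if every row carries one of the two expected labels.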

In [7]:
data.describe()

| | price | size | year | view |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 | 0.49 |
| std | 77051.727525 | 297.941951 | 4.729021 | 0.502418 |
| min | 154282.128 | 479.75 | 2006.0 | 0.0 |
| 25% | 234280.148 | 643.33 | 2009.0 | 0.0 |
| 50% | 280590.716 | 696.405 | 2015.0 | 0.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 | 1.0 |
| max | 500681.128 | 1842.51 | 2018.0 | 1.0 |


## Create the regression

### Declare the dependent and the independent variables

In [9]:
y = data['price']
x1 = data[['size','year','view']]

### Regression

In [12]:
x = sm.add_constant(x1)
result = sm.OLS(y,x).fit()
result.summary2()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.91 |
| Dependent Variable: | price | AIC: | 2297.2007 |
| Date: | 2024-02-04 21:53 | BIC: | 2307.6214 |
| No. Observations: | 100 | Log-Likelihood: | -1144.6 |
| Df Model: | 3 | F-statistic: | 335.2 |
| Df Residuals: | 96 | Prob (F-statistic): | 1.02e-50 |
| R-squared: | 0.913 | Scale: | 533490000.0 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -5397914.1816 | 993836.9550 | -5.4314 | 0.0000 | -7370664.9453 | -3425163.4178 |
| size | 223.0316 | 7.8381 | 28.4549 | 0.0000 | 207.4732 | 238.5901 |
| year | 2718.9489 | 493.5018 | 5.5095 | 0.0000 | 1739.3556 | 3698.5422 |
| view | 56726.0198 | 4627.6954 | 12.2579 | 0.0000 | 47540.1171 | 65911.9225 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 29.224 | Durbin-Watson: | 1.965 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 64.957 |
| Skew: | 1.088 | Prob(JB): | 0.0 |
| Kurtosis: | 6.295 | Condition No.: | 941885.0 |
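
To see what the coefficients mean, it can help to plug them into the fitted equation by hand. The sketch below copies the rounded coefficients from the summary table. The helper `predict_price` and the example property (size 1000, built in 2015) are made up for illustration.

```python
# Coefficients copied (rounded) from the regression summary
CONST = -5397914.1816
B_SIZE = 223.0316
B_YEAR = 2718.9489
B_VIEW = 56726.0198

def predict_price(size, year, view):
    """Fitted price for one property; view is 1 for 'Sea view', 0 otherwise."""
    return CONST + B_SIZE * size + B_YEAR * year + B_VIEW * view

# Hypothetical property: same size and year, with and without a sea view
with_view = predict_price(1000, 2015, 1)
without_view = predict_price(1000, 2015, 0)

# The gap between the two predictions equals the 'view' coefficient
print(round(with_view - without_view, 4))
```

Because the dummy only shifts the intercept, the two predictions differ by the 'view' coefficient whatever size and year are chosen.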


In [21]:
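# Draw 10 random observations and keep only the regressors
sample = data.sample(10)
sample = sample[['size','year','view']]
# has_constant='add' forces the intercept column even if a sampled column happens to be constant
sample = sm.add_constant(sample, has_constant='add')

# Predicted prices for the sampled observations
result.predict(sample)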

65    265235.051167
64    290377.563336
11    532118.144648
21    203366.079933
2     254331.773344
24    181726.391529
83    289080.099653
44    325418.483700
81    387811.156605
57    423832.993462
dtype: float64