# Multiple Linear Regression
<h3>Often the dependent variable depends on more than one factor. In that case we use a multiple linear regression model.</h3>
<h3>Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a causal relationship to be found.</h3>

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

In [5]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sb
sb.set()


<h4>The dataset has 3 columns. The dependent variable is 'price'; the other two variables are 'size' and 'year'. We might expect that the price of a house does not depend on an arbitrary year. Let's see what happens when we include this variable in the model.</h4>

In [3]:
data = pd.read_csv('real_estate_price_size_year.csv')
data

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |
| ... | ... | ... | ... |
| 95 | 252460.400 | 549.80 | 2009 |
| 96 | 310522.592 | 1037.44 | 2009 |
| 97 | 383635.568 | 1504.75 | 2006 |
| 98 | 225145.248 | 648.29 | 2015 |


In [4]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


<h2>Y<sub>i</sub> = b<sub>0</sub> + b<sub>1</sub>X<sub>1</sub> + b<sub>2</sub>X<sub>2</sub><br>
Here b<sub>0</sub> is the intercept, b<sub>1</sub> is the coefficient of the variable 'size' (X<sub>1</sub>) and b<sub>2</sub> is the coefficient of the variable 'year' (X<sub>2</sub>).</h2>
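To make the equation concrete, here is a minimal sketch that builds prices from made-up coefficients and then recovers them by least squares. The data and the values of `b0`, `b1` and `b2` are hypothetical, chosen only for illustration; they are not taken from the real CSV. It uses plain NumPy rather than statsmodels so that the column of ones for the intercept is visible.

```python
import numpy as np

# Hypothetical data: 100 houses with made-up sizes and years
rng = np.random.default_rng(0)
n = 100
size = rng.uniform(480, 1850, n)
year = rng.choice([2006, 2009, 2015, 2018], n).astype(float)

# Made-up coefficients for Y = b0 + b1*X1 + b2*X2 (no noise added)
b0, b1, b2 = -5.0e6, 200.0, 2600.0
price = b0 + b1 * size + b2 * year

# Design matrix: a column of ones (for b0), then X1 = size and X2 = year
X = np.column_stack([np.ones(n), size, year])

# Least-squares estimate of (b0, b1, b2)
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
predicted = X @ coef

print(coef)
```

Because the synthetic prices contain no noise, the estimated coefficients should match the made-up ones up to floating-point error.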

In [14]:
Y = data['price'] # Y => the dependent variable
x = data[['size','year']] #x => independent variables 'size' and 'year'

# Fitting the model using the OLS regression method

In [16]:
x1 = sm.add_constant(x) # adds a column of ones so the model can estimate the intercept b0
result = sm.OLS(Y,x1).fit() # fits price on const, size and year by ordinary least squares
result.summary() # shows the summary

| Dep. Variable: | price | R-squared: | 0.776 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.772 |
| Method: | Least Squares | F-statistic: | 168.5 |
| Date: | Fri, 17 Jun 2022 | Prob (F-statistic): | 2.77e-32 |
| Time: | 01:43:56 | Log-Likelihood: | -1191.7 |
| No. Observations: | 100 | AIC: | 2389.0 |
| Df Residuals: | 97 | BIC: | 2397.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -5.772e+06 | 1.58e+06 | -3.647 | 0.000 | -8.91e+06 | -2.63e+06 |
| size | 227.7009 | 12.474 | 18.254 | 0.000 | 202.943 | 252.458 |
| year | 2916.7853 | 785.896 | 3.711 | 0.000 | 1357.000 | 4476.571 |

| Omnibus: | 10.083 | Durbin-Watson: | 2.25 |
|---|---|---|---|
| Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 3.678 |
| Skew: | 0.095 | Prob(JB): | 0.159 |
| Kurtosis: | 2.08 | Cond. No. | 941000.0 |


<h3>From the summary, the adjusted R-squared is 0.772. To judge whether adding 'year' improves the model, compare this value with the adjusted R-squared of the simple linear regression we did in regression analysis 2. The constant is about -5,772,000, but this is not in itself a sign of a bad fit: it is the predicted price at size = 0 and year = 0, which lies far outside the range of the data.<br><br> The p-value of 'year' is 0.000 and its 95% confidence interval (1357.000 to 4476.571) does not include zero. So, in this model, the variable 'year' is statistically significant, and this summary gives no grounds for dismissing it as insignificant.</h3>
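Two of the figures in the summary can be checked by hand. The sketch below applies the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), and computes the t-statistic for 'year' as coefficient divided by standard error. The input numbers are the rounded values printed in the summary; the function name `adjusted_r_squared` is made up for this example.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Rounded values copied from the summary: R-squared, observations, predictors
adj = adjusted_r_squared(0.776, n=100, p=2)

# t-statistic for 'year' = coefficient / standard error
t_year = 2916.7853 / 785.896

print(round(adj, 3), round(t_year, 3))
```

The adjusted R-squared comes out at about 0.771 rather than the reported 0.772 because the R-squared fed in is already rounded to three decimals. The t-statistic agrees with the 3.711 shown for 'year'.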