# Multiple Linear Regression 

You are given a real estate dataset. 

Real estate is a standard example in regression courses because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.


## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd

## Load the data

In [2]:
dataset = pd.read_csv("real_estate.csv")

In [3]:
dataset.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


**Assigning the independent and dependent variables**

In [4]:
y = dataset.iloc[:,0].values
x = dataset.iloc[:,1:3].values

In [5]:
x.shape

(100, 2)
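
The positional `iloc` slices above work because `price` is the first column and `size` and `year` follow it. As a minimal sketch, the same selection can be written with column names, which keeps working if the column order changes. The `sample` frame below is rebuilt by hand from the five rows shown by `dataset.head()`; the names `sample`, `y_named` and `x_named` are only for illustration.

```python
import pandas as pd

# The five rows printed by dataset.head(), retyped here so the example runs on its own.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
    "year": [2015, 2009, 2018, 2015, 2009],
})

# Select by name. Bracket access is needed for "size", because
# sample.size is a built-in DataFrame attribute (the number of cells).
y_named = sample["price"].values
x_named = sample[["size", "year"]].values

print(x_named.shape)  # rows x 2 features
print(y_named.shape)
```

On the full dataset the same two lines would be applied to `dataset` instead of `sample`.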

In [6]:
y

array([234314.144, 228581.528, 281626.336, 401255.608, 458674.256,
       245050.28 , 265129.064, 175716.48 , 331101.344, 218630.608,
       279555.096, 494778.992, 215472.104, 418753.008, 444192.008,
       440201.616, 248337.6  , 234178.16 , 225451.984, 299416.976,
       268125.08 , 171795.24 , 412569.472, 183459.488, 168047.264,
       362519.72 , 271793.312, 406852.304, 297760.44 , 368988.432,
       301635.728, 225452.32 , 207742.248, 191486.896, 285223.176,
       302000.92 , 269225.92 , 233493.208, 292965.216, 245747.2  ,
       310045.712, 217468.224, 287350.   , 414682.648, 293044.496,
       300061.48 , 204302.976, 201778.048, 257828.416, 262423.504,
       225656.12 , 393069.76 , 258637.008, 269523.056, 255629.16 ,
       500681.128, 320345.52 , 395242.096, 330677.128, 251332.592,
       251188.824, 263311.696, 359674.44 , 334938.872, 302393.384,
       304587.272, 355251.2  , 271726.752, 294582.944, 454512.76 ,
       276875.632, 181587.576, 298926.496, 211724.096, 228313.

**Splitting the dataset into training set and test set**

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size = .2, random_state = 0)

In [9]:
Y_test.shape

(20,)
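
With 100 observations and `test_size = .2`, the split keeps 80 rows for training and 20 for testing, and fixing `random_state` makes the split reproducible. The sketch below illustrates both points on synthetic arrays (`x_demo` and `y_demo` are made-up placeholders, not the real estate data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shapes as x and y in this exercise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(size=(100, 2))
y_demo = rng.uniform(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```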

**Fitting the multiple linear regression to the training set**

In [10]:
from sklearn.linear_model import LinearRegression

In [11]:
LR = LinearRegression()

In [12]:
LR.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
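
After fitting, the learned parameters are available as `coef_` (one value per feature) and `intercept_`. The sketch below shows this on synthetic, noise-free data generated from a hypothetical equation, so the recovered values can be checked against the ones used to build it. The data and the coefficients 200, 3000 and 1000 are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features loosely shaped like 'size' and 'year'.
rng = np.random.default_rng(0)
size = rng.uniform(400, 1800, 100)
year = rng.choice([2006, 2009, 2015, 2018], 100)
X_demo = np.column_stack([size, year])

# Hypothetical noise-free relationship: y = 1000 + 200*size + 3000*year
y_demo = 1000 + 200 * size + 3000 * year

model = LinearRegression()
model.fit(X_demo, y_demo)

print(model.coef_)       # one coefficient per column of X_demo
print(model.intercept_)  # the constant term
```

On the real estate model, `LR.coef_` and `LR.intercept_` can be read in the same way once `LR.fit(X_train, Y_train)` has run.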

In [13]:
Y_pred = LR.predict(X_test)

**Now we compare the regressor's predictions with the actual test-set values to see how well it has learned from the training set**

In [14]:
Y_test

array([271793.312, 154282.128, 281626.336, 500681.128, 286161.6  ,
       266684.248, 248337.6  , 211724.096, 255629.16 , 252460.4  ,
       269523.056, 298170.88 , 251560.04 , 418753.008, 175716.48 ,
       301635.728, 412569.472, 168047.264, 191486.896, 331101.344])

In [15]:
Y_pred

array([237099.32312463, 195463.56906443, 218690.8058951 , 443633.63814185,
       262331.76318841, 265152.49411796, 263445.96291546, 227063.61949839,
       248576.92252859, 216280.14593898, 250101.19322907, 258930.61248233,
       245447.51665936, 333625.13188944, 226526.8409014 , 312295.9981511 ,
       371876.55437879, 210997.60612634, 221781.07968575, 261622.72699847])

**Creating an OLS (Ordinary Least Squares) regressor with the statsmodels library**

In [16]:
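# sm.OLS lives in statsmodels.api; newer statsmodels releases no longer expose it via statsmodels.formula.api
import statsmodels.api as sm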

In [17]:
# Prepend a column of ones so that the OLS model includes an intercept term
x = np.append(arr=np.ones((x.shape[0], 1)), values=x, axis=1)

In [18]:
x_red = x[:, [0, 1, 2]]  # keep all three columns: constant, size, year

In [19]:
help(sm.OLS)

Help on class OLS in module statsmodels.regression.linear_model:

class OLS(WLS)
 |  A simple ordinary least squares model.
 |  
 |  
 |  Parameters
 |  ----------
 |  endog : array-like
 |      1-d endogenous response variable. The dependent variable.
 |  exog : array-like
 |      A nobs x k array where `nobs` is the number of observations and `k`
 |      is the number of regressors. An intercept is not included by default
 |      and should be added by the user. See
 |      :func:`statsmodels.tools.add_constant`.
 |  missing : str
 |      Available options are 'none', 'drop', and 'raise'. If 'none', no nan
 |      checking is done. If 'drop', any observations with nans are dropped.
 |      If 'raise', an error is raised. Default is 'none.'
 |  hasconst : None or bool
 |      Indicates whether the RHS includes a user-supplied constant. If True,
 |      a constant is not checked for and k_constant is set to 1 and all
 |      result statistics are calculated as if a constant is present.

**Fitting the statsmodels OLS regressor**

In [20]:
LR_OLS = sm.OLS(endog = y, exog = x_red).fit()

In [21]:
LR_OLS.summary()  # summary statistics for the multiple linear regression

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.776 |
| Model: | OLS | Adj. R-squared: | 0.772 |
| Method: | Least Squares | F-statistic: | 168.5 |
| Date: | Sun, 15 Sep 2019 | Prob (F-statistic): | 2.77e-32 |
| Time: | 12:30:38 | Log-Likelihood: | -1191.7 |
| No. Observations: | 100 | AIC: | 2389 |
| Df Residuals: | 97 | BIC: | 2397 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -5.772e+06 | 1.58e+06 | -3.647 | 0.000 | -8.91e+06 | -2.63e+06 |
| x1 | 227.7009 | 12.474 | 18.254 | 0.000 | 202.943 | 252.458 |
| x2 | 2916.7853 | 785.896 | 3.711 | 0.000 | 1357.000 | 4476.571 |

| | | | |
|---|---|---|---|
| Omnibus: | 10.083 | Durbin-Watson: | 2.25 |
| Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 3.678 |
| Skew: | 0.095 | Prob(JB): | 0.159 |
| Kurtosis: | 2.08 | Cond. No. | 9.41e+05 |
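
The coefficient table defines the fitted equation. Since the columns of `x` are the constant, `size` and `year` in that order, `x1` corresponds to `size` and `x2` to `year`, giving roughly price = const + 227.7009 * size + 2916.7853 * year. The sketch below evaluates that equation by hand for the first row of the dataset. Because the summary prints `const` rounded to four significant figures, the result is only approximate; `predict_price` is a helper name made up for this example.

```python
# Coefficients as printed in the summary table.
# The constant is shown rounded (-5.772e+06), so results are approximate.
const = -5.772e6
coef_size = 227.7009    # x1
coef_year = 2916.7853   # x2


def predict_price(size, year):
    """Evaluate the fitted regression equation for one property."""
    return const + coef_size * size + coef_year * year


# First row of the dataset: size 643.09, year 2015, actual price 234314.144
estimate = predict_price(643.09, 2015)
print(round(estimate))
```

With the fitted model object, `LR_OLS.params` holds the unrounded coefficients and `LR_OLS.predict(x_red)` applies the same equation to every row.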


**Null Hypothesis**  
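The main null hypothesis of a multiple regression is that there is no relationship between the X variables and the Y variable; in other words, the Y values predicted from the regression equation are no closer to the actual Y values than you would expect by chance. In the summary above, Prob (F-statistic) is 2.77e-32, far below any conventional significance level, so this null hypothesis is rejected for the real estate model.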