# Boston Housing 

### Load data

In [37]:
# load data
dataset = read.csv('Boston_Housing.csv')
dataset <- dataset[ -c(1) ] # drop the first column (index)


In [38]:
# import libraries
# install.packages(c('caTools', 'MLmetrics'))  # run once if not installed
library(caTools)   # sample.split for the train/test split
library(MLmetrics) # MAE and MSE for evaluating the models


### Split dataset

In [39]:
set.seed(2)

# split data with a requested ratio of 75/25
split = sample.split(dataset$target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE) # rows flagged TRUE form the training set
test_set = subset(dataset, split == FALSE) # rows flagged FALSE form the test set

head(training_set)

CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
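
A quick sanity check on the split is to compare the realized training fraction with the requested 0.75. `sample.split` is documented as preserving the relative ratios of the labels in its first argument, so with a continuous target the realized fraction may differ from `SplitRatio`. The second model's summary further down reports 424 residual degrees of freedom with 4 estimated coefficients, i.e. 428 training rows; if the CSV holds the usual 506 rows (an assumption, the row count is not shown here), that is closer to 85% than 75%. The sketch below uses base R only and a synthetic data frame. The names `toy`, `train_idx`, `toy_train` and `toy_test` are illustrative, and the plain `sample()` split is an alternative technique, not what this notebook uses.

```r
# Synthetic stand-in for the housing data (values are made up for illustration)
set.seed(2)
toy <- data.frame(x = rnorm(200), target = rnorm(200))

# Plain random row split: draw 75% of the row indices without replacement
n_train   <- floor(0.75 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Realized training fraction; the same ratio can be computed as
# nrow(training_set) / nrow(dataset) on the real data
train_fraction <- nrow(toy_train) / nrow(toy)
print(train_fraction)
```

Running the same ratio on `training_set` and `dataset` shows whether the split actually matches the intended 75/25.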


### Create linear model using all features

In [40]:
# create regressor using all features 
regressor = lm(formula = target ~ .,
               data = training_set)

# show the summary and coefficients for regressor
cat("--------------------------------------------------\n")
summary(regressor)
cat("--------------------------------------------------\n")
regressor$coefficients

# use model to predict the y_test
y_pred <- predict(regressor, newdata = test_set)
y_actual <- test_set$target 
cat("--------------------------------------------------\n")
cat("Linear Regression model using all features \nThe results for MAE: ",MAE(y_actual, y_pred),
    "\nThe results for MSE: ",MSE(y_actual, y_pred),"\n")



--------------------------------------------------



Call:
lm(formula = target ~ ., data = training_set)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.0924  -2.8010  -0.6469   1.8781  25.8189 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.266776   5.619595   7.165 3.58e-12 ***
CRIM         -0.104269   0.033893  -3.076 0.002234 ** 
ZN            0.051647   0.015199   3.398 0.000744 ***
INDUS        -0.014955   0.068736  -0.218 0.827872    
CHAS          1.964905   0.955527   2.056 0.040374 *  
NOX         -18.429207   4.200852  -4.387 1.46e-05 ***
RM            3.523838   0.450074   7.829 4.15e-14 ***
AGE          -0.003865   0.014514  -0.266 0.790165    
DIS          -1.622514   0.220702  -7.352 1.06e-12 ***
RAD           0.317138   0.072294   4.387 1.46e-05 ***
TAX          -0.012796   0.004073  -3.141 0.001802 ** 
PTRATIO      -0.962455   0.147129  -6.542 1.80e-10 ***
B             0.009137   0.002820   3.241 0.001289 ** 
LSTAT        -0.527369   0.054671  -9.646  < 2e-16 ***
---

--------------------------------------------------


--------------------------------------------------
Linear Regression model using all features 
The results for MAE:  3.141123 
The results for MSE:  19.52406 
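
For reference, the two metrics reported here follow the usual definitions: MAE is the mean of the absolute errors and MSE is the mean of the squared errors. The sketch below reimplements them in base R on a tiny made-up vector. The helper names `mae` and `mse` are hypothetical and are not the `MLmetrics` functions. Both metrics are symmetric in their two arguments, so the argument order in the calls above does not affect the results.

```r
# Hypothetical helpers mirroring the usual definitions of MAE and MSE
mae <- function(actual, predicted) mean(abs(actual - predicted))
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Tiny made-up example (not values from this notebook)
actual    <- c(24.0, 21.6, 34.7)
predicted <- c(25.0, 20.6, 32.7)

mae_value <- mae(actual, predicted)  # absolute errors 1, 1, 2
mse_value <- mse(actual, predicted)  # squared errors 1, 1, 4
print(c(MAE = mae_value, MSE = mse_value))
```

Because MSE squares each error, a few large residuals raise it much more than they raise MAE, which is worth keeping in mind when comparing the two numbers.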


### Create linear model using the most influential features

In [41]:
regressor2 = lm(formula = target ~ ZN + RM + DIS, data = training_set)

# show the summary and coefficients for regressor2
cat("--------------------------------------------------\n")
summary(regressor2)
cat("--------------------------------------------------\n")
regressor2$coefficients

# use model to predict the y_test
y_pred_2 <- predict(regressor2, newdata = test_set)
cat("--------------------------------------------------\n")
cat("Linear Regression model using the most affected features\nThe results for MAE: "
    ,MAE(y_actual, y_pred_2),
    "\nThe results for MSE: ",MSE(y_actual, y_pred_2),"\n")


--------------------------------------------------



Call:
lm(formula = target ~ ZN + RM + DIS, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.987  -3.337   0.257   2.771  39.150 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -29.72023    3.04296  -9.767  < 2e-16 ***
ZN            0.06617    0.01893   3.495 0.000523 ***
RM            8.12511    0.48137  16.879  < 2e-16 ***
DIS           0.14113    0.20644   0.684 0.494579    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.661 on 424 degrees of freedom
Multiple R-squared:  0.5026,	Adjusted R-squared:  0.499 
F-statistic: 142.8 on 3 and 424 DF,  p-value: < 2.2e-16


--------------------------------------------------


--------------------------------------------------
Linear Regression model using the most affected features
The results for MAE:  3.658271 
The results for MSE:  29.71482 


## Conclusion

Based on the coefficient table of the first model, the following features are significant at the 0.001 level (marked `***` in the summary):
**ZN, NOX, RM, DIS, RAD, PTRATIO, LSTAT**
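
A feature list like the one above can also be read off programmatically from the coefficient table instead of by eye. The following is a minimal sketch, assuming a threshold of 0.001 and using the built-in `mtcars` data purely as a stand-in; the names `fit`, `coef_table`, `p_values` and `significant` are illustrative and do not appear in this notebook.

```r
# Fit on the built-in mtcars data purely for illustration (not the housing data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Keep the terms whose p-value is below the threshold, dropping the intercept
p_values    <- coef_table[, "Pr(>|t|)"]
significant <- setdiff(names(p_values)[p_values < 0.001], "(Intercept)")
print(significant)
```

Applied to `regressor`, the same lines would use `summary(regressor)$coefficients` in place of `coef_table`.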


The first model achieves lower error on the test set (MAE 3.14 vs. 3.66, MSE 19.52 vs. 29.71), so it performs better than the second model.