## Boston Housing Price Assessment

##### crim = per capita crime rate by town.
##### zn = proportion of residential land zoned for lots over 25,000 sq.ft.
##### indus = proportion of non-retail business acres per town.
##### chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
##### nox = nitrogen oxides concentration (parts per 10 million).
##### rm = average number of rooms per dwelling.
##### age = proportion of owner-occupied units built prior to 1940.
##### dis = weighted mean of distances to five Boston employment centres.
##### rad = index of accessibility to radial highways.
##### tax = full-value property-tax rate per USD 10,000.
##### ptratio = pupil-teacher ratio by town.
##### black = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
##### lstat = lower status of the population (percent).
##### medv = median value of owner-occupied homes in USD 1000s.

In [1]:
#Pkg.add("DecisionTree")
Pkg.add("DataFrames")
Pkg.add("RDatasets")
Pkg.add("GLM")
Pkg.add("PyPlot")
Pkg.add("StatPlots")

INFO: Package DataFrames is already installed
INFO: METADATA is out-of-date — you may not have the latest version of DataFrames
INFO: Use `Pkg.update()` to get the latest versions of your packages
INFO: Package RDatasets is already installed
INFO: METADATA is out-of-date — you may not have the latest version of RDatasets
INFO: Use `Pkg.update()` to get the latest versions of your packages
INFO: Package GLM is already installed
INFO: METADATA is out-of-date — you may not have the latest version of GLM
INFO: Use `Pkg.update()` to get the latest versions of your packages
INFO: Package PyPlot is already installed
INFO: METADATA is out-of-date — you may not have the latest version o

In [2]:
#using DecisionTree
using RDatasets, DataFrames
using GLM
using PyPlot, StatPlots

In [3]:
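# Load the Boston housing data from the MASS collection in RDatasets
bos = dataset("MASS", "Boston")
# Copy all 14 columns into a plain array (not used by the model fit below)
features = convert(Array, bos[:, 1:14]);
#tag = convert(Array, bos[:, 1]);
# Column names as symbols
labels = names(bos)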

14-element Array{Symbol,1}:
 :Crim   
 :Zn     
 :Indus  
 :Chas   
 :NOx    
 :Rm     
 :Age    
 :Dis    
 :Rad    
 :Tax    
 :PTRatio
 :Black  
 :LStat  
 :MedV   

In [4]:
show(labels)

Symbol[:Crim, :Zn, :Indus, :Chas, :NOx, :Rm, :Age, :Dis, :Rad, :Tax, :PTRatio, :Black, :LStat, :MedV]

In [5]:
describe(bos)

Crim
Summary Stats:
Mean:           3.613524
Minimum:        0.006320
1st Quartile:   0.082045
Median:         0.256510
3rd Quartile:   3.677083
Maximum:        88.976200
Length:         506
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Zn
Summary Stats:
Mean:           11.363636
Minimum:        0.000000
1st Quartile:   0.000000
Median:         0.000000
3rd Quartile:   12.500000
Maximum:        100.000000
Length:         506
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Indus
Summary Stats:
Mean:           11.136779
Minimum:        0.460000
1st Quartile:   5.190000
Median:         9.690000
3rd Quartile:   18.100000
Maximum:        27.740000
Length:         506
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Chas
Summary Stats:
Mean:           0.069170
Minimum:        0.000000
1st Quartile:   0.000000
Median:         0.000000
3rd Quartile:   0.000000
Maximum:        1.000000
Length:         506
Type:           Int64
Number 

In [6]:
head(bos,5)

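| Row | Crim | Zn | Indus | Chas | NOx | Rm | Age | Dis | Rad | Tax | PTRatio | Black | LStat | MedV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |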


In [7]:
# Ordinary least squares fit of median home value on five predictors
fm1 = fit(LinearModel, @formula(MedV ~ LStat + Rm + Dis + Chas + Crim), bos)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: MedV ~ 1 + LStat + Rm + Dis + Chas + Crim

Coefficients:
              Estimate Std.Error  t value Pr(>|t|)
(Intercept)    2.16889   3.28148 0.660949   0.5089
LStat         -0.65435 0.0504212 -12.9777   <1e-32
Rm             4.88422  0.434206  11.2486   <1e-25
Dis          -0.490096  0.135231 -3.62413   0.0003
Chas           3.49103  0.950007  3.67474   0.0003
Crim         -0.119994 0.0317739  -3.7765   0.0002
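
Read as an equation, the coefficient table above corresponds to the following fitted model. The coefficients are rounded to three decimals here purely for readability; the hat on MedV marks a predicted value and is notation introduced for this note.

$$
\widehat{\text{MedV}} = 2.169 - 0.654\,\text{LStat} + 4.884\,\text{Rm} - 0.490\,\text{Dis} + 3.491\,\text{Chas} - 0.120\,\text{Crim}
$$

For example, holding the other predictors fixed, one additional room per dwelling is associated with a predicted median value about 4.88 higher, in units of USD 1000.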


In [8]:
colwise(median,bos)

14-element Array{Any,1}:
 [0.25651]
 [0.0]    
 [9.69]   
 [0.0]    
 [0.538]  
 [6.2085] 
 [77.5]   
 [3.20745]
 [5.0]    
 [330.0]  
 [19.05]  
 [391.44] 
 [11.36]  
 [21.2]   
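
One possible use of these medians, sketched here as an illustration rather than as a step the notebook itself performs, is to plug them into the fitted model to get a predicted `MedV` for a "typical" tract. The vectors `b` and `x` and the name `medv_hat` are hypothetical; their values are copied by hand from the coefficient table and the median output above.

```julia
# Coefficients copied from the fit above, in the order
# (Intercept), LStat, Rm, Dis, Chas, Crim.
b = [2.16889, -0.65435, 4.88422, -0.490096, 3.49103, -0.119994]

# Predictor values: a leading 1 for the intercept, then the column medians
# of LStat, Rm, Dis, Chas and Crim taken from the colwise(median, bos) output.
x = [1.0, 11.36, 6.2085, 3.20745, 0.0, 0.25651]

# The linear prediction is the sum of coefficient * value over all terms.
medv_hat = sum(b .* x)
println(medv_hat)   # predicted median home value, in USD 1000s
```

The result is in the same units as `MedV`, so it can be compared directly with the observed median of 21.2.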

In [9]:
stderr(fm1)

6-element Array{Float64,1}:
 3.28148  
 0.0504212
 0.434206 
 0.135231 
 0.950007 
 0.0317739
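
These standard errors connect the `Estimate` and `t value` columns of the coefficient table: each t value is the estimate divided by its standard error. The symbols below are notation introduced for this note, and the worked number uses the `LStat` row.

$$
t_j = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)}, \qquad
t_{\text{LStat}} = \frac{-0.65435}{0.0504212} \approx -12.98
$$

This agrees with the value of -12.9777 reported for `LStat` in the table.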

In [10]:
confint(fm1)

6×2 Array{Float64,2}:
 -4.2783     8.61608  
 -0.753413  -0.555286 
  4.03112    5.73731  
 -0.755787  -0.224404 
  1.62453    5.35753  
 -0.182421  -0.0575674
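
The intervals above can be approximately reproduced by hand as estimate ± critical value × standard error. The following is a minimal sketch, assuming 95% intervals based on a Student t distribution with 506 - 6 = 500 residual degrees of freedom. The critical value `tcrit` is hard-coded as an assumption (with Distributions.jl it could be obtained from `quantile(TDist(500), 0.975)`), and the names `b`, `se`, and `ci` are hypothetical, with values copied from the outputs in this notebook.

```julia
# Estimates and standard errors copied from the coefficient table, in the
# order (Intercept), LStat, Rm, Dis, Chas, Crim.
b  = [2.16889, -0.65435, 4.88422, -0.490096, 3.49103, -0.119994]
se = [3.28148, 0.0504212, 0.434206, 0.135231, 0.950007, 0.0317739]

n, p = 506, 6     # observations and fitted coefficients
dof = n - p       # residual degrees of freedom (500)

# Assumed value: approximate 0.975 quantile of Student's t with 500 df.
tcrit = 1.96472

# Column 1 holds the lower bounds, column 2 the upper bounds.
ci = hcat(b .- tcrit .* se, b .+ tcrit .* se)
println(ci)
```

Each row of `ci` should match the corresponding row of the `confint(fm1)` output up to rounding of the copied inputs.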

In [11]:
coef(fm1)

6-element Array{Float64,1}:
  2.16889 
 -0.65435 
  4.88422 
 -0.490096
  3.49103 
 -0.119994