# Boston Housing Value Regression

## Problem: Predict the median value of owner-occupied homes.

1. CRIM - per capita crime rate by town 
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per \$10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of black people (African Americans) by town
13. LSTAT - percent lower economic status of the population
14. MEDV - Median value of owner-occupied homes in \$1000's

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
df = pd.read_csv('data/BostonHousing.csv', header=0)

In [3]:
df.head()

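| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01965 | 80.0 | 1.76 | 0 | 0.385 | 6.23 | 31.5 | 9.0892 | 1 | 241.0 | 18.2 | 0.01965 | 80.0 | 1.76 |
| 1 | 0.01096 | 55.0 | 2.25 | 0 | 0.389 | 6.453 | 31.9 | 7.3073 | 1 | 300.0 | 15.3 | 0.01096 | 55.0 | 2.25 |
| 2 | 0.04819 | 80.0 | 3.64 | 0 | 0.392 | 6.108 | 32.0 | 9.2203 | 1 | 315.0 | 16.4 | 0.04819 | 80.0 | 3.64 |
| 3 | 0.03548 | 80.0 | 3.64 | 0 | 0.392 | 5.876 | 19.1 | 9.2203 | 1 | 315.0 | 16.4 | 0.03548 | 80.0 | 3.64 |
| 4 | 0.01538 | 90.0 | 3.75 | 0 | 0.394 | 7.454 | 34.2 | 6.3361 | 3 | 244.0 | 15.9 | 0.01538 | 90.0 | 3.75 |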


In [4]:
df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 3.613524 | 11.363636 | 11.136779 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 8.601545 | 23.322453 | 6.860353 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.00632 | 0.0 | 0.46 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 0.082045 | 0.0 | 5.19 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 0.25651 | 0.0 | 9.69 |
| 75% | 3.677082 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 3.677082 | 12.5 | 18.1 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 88.9762 | 100.0 | 27.74 |
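
In the `df.head()` and `df.describe()` output above, the B, LSTAT, and MEDV columns carry exactly the same values as CRIM, ZN, and INDUS. A check for repeated columns is a cheap sanity test before fitting anything. The sketch below is one possible way to do it; `duplicated_column_pairs` and the small `sample` frame (built from the first three rows printed above) are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def duplicated_column_pairs(frame):
    """Return (earlier, later) column-name pairs whose values are identical."""
    pairs = []
    cols = list(frame.columns)
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if np.array_equal(frame[left].to_numpy(), frame[right].to_numpy()):
                pairs.append((left, right))
    return pairs

# A few columns from the first three rows shown by df.head() above
sample = pd.DataFrame({
    "CRIM": [0.01965, 0.01096, 0.04819],
    "ZN": [80.0, 55.0, 80.0],
    "INDUS": [1.76, 2.25, 3.64],
    "RM": [6.23, 6.453, 6.108],
    "B": [0.01965, 0.01096, 0.04819],
    "LSTAT": [80.0, 55.0, 80.0],
    "MEDV": [1.76, 2.25, 3.64],
})

pairs = duplicated_column_pairs(sample)
print(pairs)  # [('CRIM', 'B'), ('ZN', 'LSTAT'), ('INDUS', 'MEDV')]
```

Running the same function on the full `df` would show whether the duplication holds for all 506 rows, which the matching `describe()` statistics suggest.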


In [5]:
target = df[['MEDV']]
features = df.loc[:, df.columns != 'MEDV']

#### StatsModels Approach

In [6]:
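y = target['MEDV']
# AGE and DIS are not included in this model
x = features[['CRIM','ZN','INDUS','CHAS','NOX','RM','RAD','TAX','PTRATIO','B','LSTAT']]
# Single-predictor alternative:
#x = features[['RM']]
x = sm.add_constant(x)  # add the intercept column; sm.OLS does not add one itself
model = sm.OLS(y, x).fit()
predictions = model.predict(x)  # in-sample predictions from the fitted model
model.summary()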



| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 4.78e+30 |
| Date: | Sun, 17 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 18:30:14 | Log-Likelihood: | 15166.0 |
| No. Observations: | 506 | AIC: | -30310.0 |
| Df Residuals: | 496 | BIC: | -30270.0 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.329e-15 | 2.13e-14 | 0.250 | 0.803 | -3.65e-14 | 4.72e-14 |
| CRIM | -1.006e-16 | 7.88e-17 | -1.278 | 0.202 | -2.55e-16 | 5.41e-17 |
| ZN | 1.018e-16 | 2.99e-17 | 3.400 | 0.001 | 4.3e-17 | 1.61e-16 |
| INDUS | 1.0000 | 2.95e-16 | 3.39e+15 | 0.000 | 1.000 | 1.000 |
| CHAS | -5.551e-16 | 4.25e-15 | -0.131 | 0.896 | -8.9e-15 | 7.79e-15 |
| NOX | -3.775e-15 | 1.63e-14 | -0.231 | 0.817 | -3.59e-14 | 2.83e-14 |
| RM | -1.166e-15 | 1.71e-15 | -0.681 | 0.496 | -4.53e-15 | 2.2e-15 |
| RAD | -5.69e-16 | 3.27e-16 | -1.742 | 0.082 | -1.21e-15 | 7.27e-17 |
| TAX | 9.021e-17 | 1.86e-17 | 4.847 | 0.000 | 5.36e-17 | 1.27e-16 |

| | | | |
|---|---|---|---|
| Omnibus: | 27.535 | Durbin-Watson: | 0.049 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 15.615 |
| Skew: | -0.269 | Prob(JB): | 0.000407 |
| Kurtosis: | 2.328 | Cond. No. | 1.86e+17 |
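
The Kurtosis entry in the last table is computed from the residuals of the fit. As a rough sketch of what that number measures, the snippet below assumes the usual fourth-standardized-moment definition, under which a normal distribution scores 3. The function name `pearson_kurtosis` and the synthetic samples are illustrative only.

```python
import numpy as np

def pearson_kurtosis(values):
    """Fourth standardized moment; a normal distribution gives about 3."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m4 / m2 ** 2

rng = np.random.default_rng(0)
k_normal = pearson_kurtosis(rng.normal(size=100_000))            # close to 3
k_uniform = pearson_kurtosis(rng.uniform(-1.0, 1.0, 100_000))    # close to 1.8

print(round(k_normal, 2), round(k_uniform, 2))
```

A value below 3, such as the 2.328 reported here, indicates lighter tails than a normal distribution; the uniform sample is an extreme case of that.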


### The above represents a multiple linear regression using eleven of the thirteen available predictors (AGE and DIS are left out). From these results, we can make a few determinations about our data set.

1. An R-squared of 1.0 says the model reproduces every MEDV value in this sample exactly. For real-world housing data that is a warning sign, and the earlier output explains it: in `df.head()` and `df.describe()` the B, LSTAT, and MEDV columns repeat the CRIM, ZN, and INDUS values, so the target is an exact copy of the INDUS predictor. The summary agrees: INDUS has a coefficient of 1.0000, every other coefficient shown is no larger than about 5e-15 in magnitude (floating-point noise), and the condition number of 1.86e+17 points to exactly collinear columns. Until the CSV is corrected, this fit describes a quirk of the loaded file, not the housing market.

2. The Kurtosis of 2.328 is computed from the model's residuals, not from the dependent variable. A normal distribution has a kurtosis of 3, so a value below 3 means the residuals have lighter tails than a normal bell curve. Here the residuals themselves are rounding error, so their shape carries no information about home values.

3. Taking the conventional significance level of 0.05, the variables that are statistically significant in our dataset are ZN, INDUS, TAX, and LSTAT (the LSTAT row is cut off in the output above). Statistical significance is not the same as practical effect, though: the ZN and TAX coefficients are about 1e-16, which is zero for all practical purposes, and INDUS is significant only because MEDV duplicates it. I therefore cannot draw any conclusion from this fit about how lot zoning, non-retail business acreage, the full-value property-tax rate per \$10,000, or the share of lower-status population affects the value of a home.
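
An R-squared of exactly 1.0 together with a single coefficient of 1.0000 is what least squares produces when the target is a copy of one predictor, which is the situation for MEDV and INDUS in the tables above. The sketch below reproduces that pattern on synthetic data with plain NumPy; the variable names and the random data are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
noise_feature = rng.normal(size=n)          # predictor unrelated to the target
leaked_feature = rng.uniform(0, 30, n)      # plays the role of INDUS
y = leaked_feature.copy()                   # target is an exact copy of it

# Design matrix with an intercept column, similar to what sm.add_constant builds
X = np.column_stack([np.ones(n), noise_feature, leaked_feature])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ coef
r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Intercept and noise coefficient come out near 0, the leaked one near 1
print(coef, r_squared)
```

The fit is "perfect" only because the answer was among the inputs, so it says nothing about how well the model would predict home values.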

#### scikit-learn Approach

In [7]:
from sklearn import linear_model

In [8]:
x = features
y = target['MEDV']

In [9]:
lm = linear_model.LinearRegression()
model = lm.fit(x,y)

In [10]:
predictions = lm.predict(x)
print(predictions[0:5])

[1.76 2.25 3.64 3.64 3.75]


In [11]:
lm.score(x,y)

1.0
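
For a regressor, `lm.score(x, y)` returns the coefficient of determination, R-squared, of the predictions against `y`. A minimal sketch of that calculation follows; the helper name `r_squared` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                  # predictions equal the target
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
partial = r_squared(y_true, [4.0, 5.0, 7.0, 8.0])    # two predictions off by 1

print(perfect, baseline, partial)
```

A score of 1.0 means the residual sum of squares is zero, and predicting the mean for every row gives 0.0.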

In [12]:
lm.coef_

array([ 4.60699727e-17,  1.56036045e-16,  1.00000000e+00, -2.18657100e-15,
       -4.59987494e-15,  3.13750969e-16,  1.19314068e-16,  1.05284705e-16,
        5.72244261e-16, -3.31528873e-16,  1.37519559e-17,  3.54328888e-16,
       -8.07660030e-17])

In [13]:
lm.intercept_

1.1368683772161603e-13

### The scikit-learn approach is a little more basic than the statsmodels approach. Out of the box it provides an R-squared value, coefficients, and an intercept. However, it does not report residual diagnostics (such as Kurtosis), p-values, or confidence intervals, which are crucial to evaluating the model. Both approaches can return an individual prediction for each data point through their `predict` methods.
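
Confidence intervals can still be computed by hand from an ordinary least-squares fit when a library does not report them. The sketch below shows one way to do that with NumPy. It is an approximation under stated assumptions: it uses a normal quantile instead of the t distribution (reasonable for large samples such as 506 rows), it requires linearly independent columns (so it would fail on the duplicated columns in this file), and `ols_with_intervals` and the synthetic data are illustrative names, not scikit-learn or statsmodels API.

```python
import numpy as np
from statistics import NormalDist

def ols_with_intervals(features, target, level=0.95):
    """Fit OLS with an intercept; return coefficients and approximate intervals."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    X = np.column_stack([np.ones(len(features)), features])  # intercept first
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    dof = X.shape[0] - X.shape[1]               # residual degrees of freedom
    sigma2 = (residuals @ residuals) / dof      # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # needs full column rank
    std_err = np.sqrt(np.diag(cov))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal approximation to t
    return coef, coef - z * std_err, coef + z * std_err

# Synthetic data: y = 2 + 3*x + noise
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(500, 1))
y_demo = 2.0 + 3.0 * x_demo[:, 0] + rng.normal(scale=0.5, size=500)

coef, lower, upper = ols_with_intervals(x_demo, y_demo)
print(coef, lower, upper)
```

In practice, fitting the same data with statsmodels is the simpler route when intervals and p-values are needed.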