# Boston Housing Value Regression

## Problem: Predict the median value of owner-occupied homes.

1. CRIM - per capita crime rate by town 
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per \$10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
13. LSTAT - percent lower status of the population
14. MEDV - median value of owner-occupied homes in \$1000s

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
df = pd.read_csv('data/BostonHousing.csv', header=0)

In [3]:
df.head()

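|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01709 | 90.0 | 2.02 | 0 | 0.41 | 6.728 | 36.10 1 | 2.1265 | 5 | 187.0 | 17.0 | 384.46 | 4.5 | 30.1 |
| 1 | 0.0795 | 60.0 | 1.69 | 0 | 0.411 | 6.579 | 35.90 1 | 0.7103 | 4 | 411.0 | 18.3 | 370.78 | 5.49 | 24.1 |
| 2 | 0.04301 | 80.0 | 1.91 | 0 | 0.413 | 5.663 | 21.90 1 | 0.5857 | 4 | 334.0 | 22.0 | 382.8 | 8.05 | 18.2 |
| 3 | 0.10659 | 80.0 | 1.91 | 0 | 0.413 | 5.936 | 19.50 1 | 0.5857 | 4 | 334.0 | 22.0 | 376.04 | 5.57 | 20.6 |
| 4 | 0.07244 | 60.0 | 1.69 | 0 | 0.411 | 5.884 | 18.50 1 | 0.7103 | 4 | 411.0 | 18.3 | 392.33 | 7.79 | 18.6 |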


In [4]:
df.describe()

|  | ZN | INDUS | CHAS | NOX | RM | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 11.166008 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 4.332016 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 22.991219 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 1.417166 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 4.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 5.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 95.0 | 27.74 | 1.0 | 0.871 | 8.78 | 8.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
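
`describe()` summarises only numeric columns by default, so the absence of CRIM, AGE and DIS from the table above suggests that pandas read those columns as text rather than numbers. A minimal sketch of how one might confirm this follows. It uses a small stand-in frame named `demo` (a hypothetical name) instead of `df`, with AGE strings borrowed from the `df.head()` output.

```python
import pandas as pd

# Hypothetical stand-in for df: one clean numeric column, one column holding strings
demo = pd.DataFrame({
    "RM": [6.728, 6.579, 5.663],
    "AGE": ["36.10 1", "35.90 1", "21.90 1"],
})

# Column dtypes: numeric columns show float64/int64,
# text columns show object (or a string dtype in newer pandas versions)
print(demo.dtypes)

# Names of the columns that are not numeric and are skipped by describe()
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running the same two calls on `df` would list the columns that need cleaning before they can be used as regressors.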


In [5]:
target = df[['MEDV']]
features = df.loc[:, df.columns != 'MEDV']

#### statsmodels approach

In [23]:
y = target['MEDV']
# Use only the columns that pandas parsed as numeric; CRIM, AGE and DIS are left out
x = features[['ZN','INDUS','CHAS','NOX','RM','RAD','TAX','PTRATIO','B','LSTAT']]
#x = features[['RM']]
x = sm.add_constant(x)  # add the intercept column
model = sm.OLS(y, x).fit()
predictions = model.predict(x)  # in-sample predictions from the fitted model
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.698 |
| Model: | OLS | Adj. R-squared: | 0.692 |
| Method: | Least Squares | F-statistic: | 114.2 |
| Date: | Sun, 26 Apr 2020 | Prob (F-statistic): | 1.06e-121 |
| Time: | 14:54:01 | Log-Likelihood: | -1537.6 |
| No. Observations: | 506 | AIC: | 3097.0 |
| Df Residuals: | 495 | BIC: | 3144.0 |
| Df Model: | 10 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 13.7448 | 4.856 | 2.831 | 0.005 | 4.204 | 23.285 |
| ZN | -0.0053 | 0.013 | -0.398 | 0.691 | -0.032 | 0.021 |
| INDUS | 0.0619 | 0.062 | 0.994 | 0.321 | -0.060 | 0.184 |
| CHAS | 3.2867 | 0.922 | 3.564 | 0.000 | 1.475 | 5.099 |
| NOX | -4.5811 | 3.614 | -1.268 | 0.205 | -11.681 | 2.519 |
| RM | 4.7100 | 0.426 | 11.060 | 0.000 | 3.873 | 5.547 |
| RAD | 0.1358 | 0.168 | 0.810 | 0.418 | -0.194 | 0.465 |
| TAX | -0.0002 | 0.002 | -0.096 | 0.923 | -0.005 | 0.004 |
| PTRATIO | -0.9025 | 0.138 | -6.529 | 0.000 | -1.174 | -0.631 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 213.356 | Durbin-Watson: | 1.584 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1189.923 |
| Skew: | 1.773 | Prob(JB): | 4.09e-259 |
| Kurtosis: | 9.623 | Cond. No. | 12800.0 |
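
The R-squared value in the summary is the share of the target's variance that the fitted model explains. As a rough illustration of what an ordinary least squares fit estimates, the sketch below solves the same kind of problem with NumPy on synthetic data and computes R-squared as 1 - SS_res / SS_tot. The data, the seed and every variable name here are made up for illustration and are unrelated to the housing file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 2 features, known coefficients plus noise
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Prepend a column of ones for the intercept (the role sm.add_constant plays)
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution of X_design @ beta ~ y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# R-squared = 1 - residual sum of squares / total sum of squares
residuals = y - X_design @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients:", beta)
print("R-squared:", r_squared)
```

The estimated coefficients should land close to the values used to generate the data (3, 2 and -1), which is a quick sanity check that is not available on the real housing data.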


#### scikit-learn approach

In [36]:
from sklearn import linear_model

In [37]:
x = features
y = target['MEDV']

Clean up values that cannot be parsed as numbers (for example, strings such as `' 0.01432 1'`), since they make the fit fail.

In [None]:
for column in x:
    

In [38]:
lm = linear_model.LinearRegression()
model = lm.fit(x,y)

ValueError: could not convert string to float: ' 0.01432 1'
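
The `ValueError` is raised because `features` still contains text columns, and `LinearRegression` needs every value to be convertible to a float. One possible way to clean such columns is sketched below. It rests on a loud assumption: that the first whitespace-separated token of each value is the intended number and anything after it is debris. The helper `first_number` and the frame `demo` are hypothetical names introduced for illustration, and the sample strings only mimic the one in the error message.

```python
import pandas as pd

# Hypothetical stand-in for `features`; the malformed strings mimic the error message
demo = pd.DataFrame({
    "CRIM": [" 0.01432 1", "0.0795", " 0.04301"],
    "RM": [6.728, 6.579, 5.663],
})

def first_number(series):
    """Keep the first whitespace-separated token of each value and parse it as a float.

    ASSUMPTION: the leading token is the intended value and the rest is debris.
    """
    tokens = series.astype(str).str.split().str[0]
    return pd.to_numeric(tokens, errors="coerce")  # unparseable values become NaN

cleaned = demo.copy()  # copy so the original frame is left untouched
for column in cleaned:
    if not pd.api.types.is_numeric_dtype(cleaned[column]):
        cleaned[column] = first_number(cleaned[column])

print(cleaned.dtypes)
print(cleaned)
```

Two cautions apply before reusing this on the real data. The stray trailing token may be a fragment of the neighbouring field rather than noise, so the cleaned values are worth comparing against the raw file. Any value coerced to NaN also has to be dropped or imputed, because `LinearRegression` does not accept missing values.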