# Boston Housing

## Predicting Housing Values in Suburbs of Boston

## Background
In this project, I evaluate the performance and predictive power of models trained and tested on data collected from homes in suburbs of Boston, Massachusetts. A model that predicts home values well could be useful to someone like a real estate agent, who works with this kind of information daily.

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Housing). The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data about 14 variables (13 features plus the target) for homes from various suburbs of Boston, Massachusetts.

The **medv** variable is the target variable.

## Data description

The full Boston data frame has 506 rows and 14 columns. The training file used below holds 333 of those rows, with an additional ID column.

This data frame contains the following columns:

***crim:*** per capita crime rate by town.

***zn:*** proportion of residential land zoned for lots over 25,000 sq.ft.

***indus:*** proportion of non-retail business acres per town.

***chas:*** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

***nox:*** nitrogen oxides concentration (parts per 10 million).

***rm:*** average number of rooms per dwelling.

***age:*** proportion of owner-occupied units built prior to 1940.

***dis:*** weighted mean of distances to five Boston employment centres.

***rad:*** index of accessibility to radial highways.

***tax:*** full-value property-tax rate per $10,000.

***ptratio:*** pupil-teacher ratio by town.

***black:*** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

***lstat:*** lower status of the population (percent).

***medv:*** median value of owner-occupied homes in $1000s.
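
Two of the columns above are encoded rather than raw quantities. As a small illustration, the snippet below evaluates the **black** formula and converts a **medv** value to dollars. The helper names `black_index` and `medv_to_dollars` are made up for this sketch and are not part of the dataset or of any library.

```python
def black_index(bk):
    """Evaluate 1000 * (Bk - 0.63)^2 for a proportion bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2


def medv_to_dollars(medv):
    """Convert a medv value (given in $1000s) to dollars."""
    return medv * 1000


# On the interval [0, 1] the formula is largest at bk = 0 and drops to 0 at bk = 0.63
index_at_zero = black_index(0.0)
index_at_063 = black_index(0.63)

# A medv of 24.0 corresponds to a median home value of $24,000
value_in_dollars = medv_to_dollars(24.0)
```

Under this formula a proportion of 0 gives about 396.9, which is also the largest **black** value that appears in the data summaries further down.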

In [4]:
# Import libraries
import numpy as np
import pandas as pd
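# Legacy import path: sklearn.cross_validation was removed in scikit-learn 0.20
# (newer releases expose ShuffleSplit in sklearn.model_selection)
from sklearn.cross_validation import ShuffleSplit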

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('train.csv')
    
# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

Boston housing dataset has 333 data points with 15 variables each.


## Explore data

In [7]:
data.head()

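|   | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 3 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 4 | 7 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |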


In [11]:
data.dtypes # data types of columns

ID           int64
crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

In [12]:
# Check for missing values
data.isnull().sum()

ID         0
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64

In [20]:
# Drop ID column
data = data.drop('ID', axis = 1)

In [21]:
data.describe()

|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 3.360341 | 10.689189 | 11.293483 | 0.06006 | 0.557144 | 6.265619 | 68.226426 | 3.709934 | 9.633634 | 409.279279 | 18.448048 | 359.466096 | 12.515435 | 22.768769 |
| std | 7.352272 | 22.674762 | 6.998123 | 0.237956 | 0.114955 | 0.703952 | 28.133344 | 1.981123 | 8.742174 | 170.841988 | 2.151821 | 86.584567 | 7.067781 | 9.173468 |
| min | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 3.561 | 6.0 | 1.1296 | 1.0 | 188.0 | 12.6 | 3.5 | 1.73 | 5.0 |
| 25% | 0.07896 | 0.0 | 5.13 | 0.0 | 0.453 | 5.884 | 45.4 | 2.1224 | 4.0 | 279.0 | 17.4 | 376.73 | 7.18 | 17.4 |
| 50% | 0.26169 | 0.0 | 9.9 | 0.0 | 0.538 | 6.202 | 76.7 | 3.0923 | 5.0 | 330.0 | 19.0 | 392.05 | 10.97 | 21.6 |
| 75% | 3.67822 | 12.5 | 18.1 | 0.0 | 0.631 | 6.595 | 93.8 | 5.1167 | 24.0 | 666.0 | 20.2 | 396.24 | 16.42 | 25.0 |
| max | 73.5341 | 100.0 | 27.74 | 1.0 | 0.871 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 37.97 | 50.0 |


In [32]:
y = data['medv']
X = data.drop('medv', axis = 1)

## Define a Performance Metric

In [75]:
# Import 'mean_squared_error'
from sklearn.metrics import mean_squared_error

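def performance_metric(y_true, y_predict):
    """ Returns the root mean squared error (RMSE) between true and predicted values. """
    # MSE is non-negative, so take its square root directly;
    # negating it first would produce NaN instead of a score
    score = mean_squared_error(y_true, y_predict)**0.5
    return score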
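
To make the metric concrete, here is a hand-computed root mean squared error on three made-up predictions (the numbers are illustrative and not taken from the housing data). It uses only the standard library, so it can be checked independently of scikit-learn.

```python
import math

y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]

# Squared errors: (3-2)^2 = 1, (5-5)^2 = 0, (7-9)^2 = 4
squared_errors = [(t - p) ** 2 for t, p in zip(y_true_demo, y_pred_demo)]

# Mean squared error: (1 + 0 + 4) / 3
mse_demo = sum(squared_errors) / len(squared_errors)

# Root mean squared error: the square root of the (non-negative) MSE
rmse_demo = math.sqrt(mse_demo)
print(round(rmse_demo, 4))  # 1.291
```

Because RMSE is in the same units as the target, a value of 1.29 on **medv** would mean a typical error of about $1,290.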

## Build linear regression model

In [68]:
from sklearn.linear_model import LinearRegression

# Initialize and fit a linear regression model with an intercept
lm1 = LinearRegression()
lm1.fit(X, y)

# Fit a second model without an intercept for comparison
lm2 = LinearRegression(fit_intercept=False)
lm2.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

In [69]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validated RMSE; the scorer returns negated MSE, hence the leading minus
scores1 = (-cross_val_score(lm1, X, y, scoring='neg_mean_squared_error', cv=10))**0.5
scores2 = (-cross_val_score(lm2, X, y, scoring='neg_mean_squared_error', cv=10))**0.5
print(scores1.mean())
print(scores2.mean())

5.74373078465
5.44708109301
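
The two numbers above are averages of per-fold RMSE values. As a rough sketch of what that computation involves, the snippet below runs a 10-fold evaluation of ordinary least squares by hand with NumPy on synthetic data. The function `kfold_rmse` and the demo arrays are invented for illustration. It assumes contiguous, unshuffled folds, which is my understanding of what an integer `cv` gives for a regressor in scikit-learn.

```python
import numpy as np


def kfold_rmse(X, y, k=10):
    """Mean per-fold RMSE of ordinary least squares with an intercept."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)   # contiguous folds, no shuffling
    rmses = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Prepend a column of ones so the fit includes an intercept
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        rmses.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(rmses))


# Synthetic data (NOT the housing data): a linear signal plus noise with std 0.1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

demo_rmse = kfold_rmse(X_demo, y_demo, k=10)
```

Note that averaging per-fold RMSE values is not the same as taking the square root of the average MSE, so the two summaries can differ slightly.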


In [85]:
# Import 'make_scorer', 'RandomForestRegressor', 'GridSearchCV', and 'ShuffleSplit'
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def fit_model(X, y):
    """ Performs grid search over the 'n_estimators' parameter for a 
        random forest regressor trained on the input data [X, y]. """
    
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 1)

    # Create a random forest regressor object
    regressor = RandomForestRegressor(random_state=1)

    # Create a dictionary for the parameter 'n_estimators' with a range from 15 to 25
    params = {'n_estimators':(15,16,17,18,19,20,21,22,23,24,25)}

    # Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric, greater_is_better = False)

    # Create the grid search object. Note: 'cv_sets' and 'scoring_fnc' are not passed
    # here, so the search uses the default splitting and the default R^2 scoring.
    grid = GridSearchCV(regressor, params)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_

In [80]:
# Import 'train_test_split'
from sklearn.model_selection import train_test_split

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Success
print("Training and testing split was successful.")

Training and testing split was successful.
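
For a sense of scale, the arithmetic below works out how many of the 333 rows land in each subset with `test_size=0.2`. It assumes the test share is rounded up to a whole row, which is my understanding of how `train_test_split` handles a fractional `test_size` and is worth confirming against the installed version.

```python
import math

n_samples = 333      # rows in the training file loaded above
test_size = 0.2

# Assumption: the test share is rounded up and the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 266 67
```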


In [86]:
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)

# Produce the value for 'n_estimators'
print("Parameter 'n_estimators' is {} for the optimal model.".format(reg.get_params()['n_estimators']))

Parameter 'n_estimators' is 20 for the optimal model.
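
The `fit_model` function above builds a shuffle-split object and a custom scorer, but the `GridSearchCV` call itself takes only the estimator and the parameter grid. The sketch below shows one way such objects could be handed to the search through its `scoring` and `cv` arguments. It runs on small synthetic data rather than the housing data, uses the `sklearn.model_selection` API of recent scikit-learn releases, and the names `rmse`, `X_demo`, `y_demo`, and `search` are made up for the example. It is not a rerun of the search above, so it says nothing about which value would win on the real data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, ShuffleSplit


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    return mean_squared_error(y_true, y_pred) ** 0.5


# Synthetic stand-in data (NOT the housing data), kept small so the search is quick
X_demo, y_demo = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=1)

# greater_is_better=False makes the scorer return -RMSE, so "higher is better" still holds
scorer = make_scorer(rmse, greater_is_better=False)
cv_sets = ShuffleSplit(n_splits=3, test_size=0.20, random_state=1)

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={'n_estimators': [5, 10]},
    scoring=scorer,   # explicit metric instead of the default R^2
    cv=cv_sets,       # explicit shuffle-split folds instead of the default k-fold
)
search.fit(X_demo, y_demo)

best_n = search.best_params_['n_estimators']
best_rmse = -search.best_score_   # flip the sign back to a positive RMSE
```

With `scoring` and `cv` supplied this way, the selected parameter reflects RMSE on the shuffle-split folds rather than the default R^2 on the default folds.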


In [92]:
# Import 'make_scorer', 'RandomForestRegressor', 'GridSearchCV', and 'ShuffleSplit'
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def fit_model(X, y):
    """ Performs grid search over the 'max_features' parameter for a 
        random forest regressor trained on the input data [X, y]. """
    
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 1)

    # Create a random forest regressor object with the 'n_estimators' value found above
    regressor = RandomForestRegressor(n_estimators=20,random_state=1)

    # Create a dictionary for the parameter 'max_features' with a range from 1 to 13
    params = {'max_features':(1,2,3,4,5,6,7,8,9,10,11,12,13)}

    # Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric, greater_is_better = False)

    # Create the grid search object. Note: 'cv_sets' and 'scoring_fnc' are not passed
    # here, so the search uses the default splitting and the default R^2 scoring.
    grid = GridSearchCV(regressor, params)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_

In [93]:
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)

# Produce the value for 'max_features'
print("Parameter 'max_features' is {} for the optimal model.".format(reg.get_params()['max_features']))

Parameter 'max_features' is 13 for the optimal model.


In [94]:
# Import 'make_scorer', 'RandomForestRegressor', 'GridSearchCV', and 'ShuffleSplit'
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a 
        random forest regressor trained on the input data [X, y]. """
    
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 1)

    # Create a random forest regressor object with the values found above
    regressor = RandomForestRegressor(n_estimators=20,
                                      max_features=13,
                                      random_state=1)

    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 13
    params = {'max_depth':(1,2,3,4,5,6,7,8,9,10,11,12,13)}

    # Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric, greater_is_better = False)

    # Create the grid search object. Note: 'cv_sets' and 'scoring_fnc' are not passed
    # here, so the search uses the default splitting and the default R^2 scoring.
    grid = GridSearchCV(regressor, params)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_

In [95]:
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)

# Produce the value for 'max_depth'
print("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))

Parameter 'max_depth' is 9 for the optimal model.
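
The three searches above tune one parameter at a time, fixing each result before moving to the next. The toy example below, with an invented table of validation errors for two hypothetical parameters `a` and `b`, sketches how that procedure works and how its answer can differ from a search over every combination. The numbers are made up purely to show the mechanics.

```python
# Hypothetical validation errors (lower is better) for two parameters, a and b
errors = {
    (1, 1): 5.0, (2, 1): 6.0,
    (1, 2): 4.0, (2, 2): 3.0,
}
a_values, b_values = (1, 2), (1, 2)

# Step 1: tune a with b held at its starting value of 1
best_a = min(a_values, key=lambda a: errors[(a, 1)])
# Step 2: tune b with a fixed at the value found in step 1
best_b = min(b_values, key=lambda b: errors[(best_a, b)])
sequential_choice = (best_a, best_b)

# Joint search: evaluate every combination at once
joint_choice = min(errors, key=errors.get)

print(sequential_choice, joint_choice)  # (1, 2) (2, 2)
```

Whether this matters for the random forest here is not something the recorded results can tell us. Passing all three parameters in a single `params` dictionary would be one way to check, at the cost of a larger search.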


## Build random forest model

tree = RandomForestRegressor(n_estimators=20,
                             max_features=13,
                             max_depth=9,
                             random_state=1)
tree.fit(X, y)

In [99]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validated RMSE of the random forest
treescores = (-cross_val_score(tree, X, y, scoring='neg_mean_squared_error', cv=10))**0.5

print(treescores.mean())

4.20784280934
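
To put the cross-validated scores side by side, the short calculation below compares the random forest's mean RMSE with the two linear models' scores reported earlier. The helper `relative_reduction` is a name made up for this sketch.

```python
# Mean 10-fold RMSE values reported in this notebook (units: $1000s of medv)
rmse_linear = 5.74373078465
rmse_linear_no_intercept = 5.44708109301
rmse_forest = 4.20784280934


def relative_reduction(baseline, candidate):
    """Fraction by which candidate improves on baseline."""
    return (baseline - candidate) / baseline


vs_linear = relative_reduction(rmse_linear, rmse_forest)
vs_no_intercept = relative_reduction(rmse_linear_no_intercept, rmse_forest)
print("{:.1%} {:.1%}".format(vs_linear, vs_no_intercept))
```

The forest's error is roughly 27% lower than the linear model with an intercept and roughly 23% lower than the one without. Since **medv** is in $1000s, an RMSE of about 4.21 corresponds to a typical error of around $4,200.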


## Processing The Test Set

In [5]:
test = pd.read_csv("test.csv")
test.isnull().sum()

ID         0
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
dtype: int64

In [6]:
test.shape

(173, 14)

In [102]:
# Make predictions using the test set.
predictors = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age','dis','rad','tax','ptratio','black','lstat']
predictions = tree.predict(test[predictors])
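
The prediction step relies on the test columns lining up with the columns the forest was fit on. A quick sanity check along those lines is sketched below using plain lists of the column names shown in this notebook. In practice the same comparison could be run on `X.columns` and `test[predictors].columns`, which I have not done here.

```python
# Column names as they appear in this notebook
train_columns = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
                 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']   # after dropping ID
test_columns = ['ID', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
                'rad', 'tax', 'ptratio', 'black', 'lstat']

# Features the model was fit on: every training column except the target
fit_features = [c for c in train_columns if c != 'medv']
# Features available for prediction: every test column except the identifier
predict_features = [c for c in test_columns if c != 'ID']

columns_match = fit_features == predict_features   # compares names AND order
print(columns_match)  # True
```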

In [110]:
# Create a new DataFrame with only the columns Kaggle wants and write it to CSV.
submission = pd.DataFrame({
    "ID": test["ID"],
    "medv": predictions
    })
submission.set_index('ID').to_csv('submission.csv')