# Predicting House Prices - Boston

### Paulo C. Rios Jr. | Oct 23, 2017

### To do

1. Rename the columns
2. Split the data randomly into train and test sets, with 20% in test and random_state = 1
3. Train a simple linear model on the train data
4. Perform cross-validation (with 5 folds)
5. Get the MSE and r2 in the validation
6. Test the model on the test data
7. Get the MSE and r2 on the test data
8. Compare the MSE and r2 in the validation with those on the test data
9. Do the above with a RandomForestRegressor and identify the features that are most important to the prediction
10. Compare, using MSE, the results of using a simple linear model with the results of using a random forest regressor

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [37]:
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict

In [3]:
from sklearn.ensemble import RandomForestRegressor

In [4]:
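# Load the Boston house-prices dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older scikit-learn release.
boston = datasets.load_boston()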

In [5]:
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [6]:
type(boston.DESCR)

str

In [7]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [8]:
boston_X = boston.data

In [9]:
boston_X.shape

(506, 13)

### 1. Rename the columns

In [10]:
boston_X_df = pd.DataFrame(boston_X)
boston_X_df.columns = list(boston.feature_names)

In [11]:
boston_X_df.head()

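| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |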


### 2. Split the data randomly into train and test sets, 20% in test, random_state = 1

In [15]:
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(
                                        X,
                                        y, 
                                        test_size=0.2, 
                                        random_state=1)
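
As a quick sanity check on the split, the sketch below reproduces the same call on placeholder arrays with the shape of the Boston data (506 rows, 13 features). The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative only and are not part of the notebook. With `test_size=0.2`, scikit-learn rounds the test size up, so 506 rows should split into 404 train rows and 102 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data.
# The values are dummies; only the shapes matter for this check.
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Print the resulting train and test shapes.
print(X_tr.shape, X_te.shape)
```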

### 3. Train a simple linear model on the train data

In [17]:
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### 4. Perform Cross-validation (with 5 folds)

In [19]:
cv_score = cross_val_score(lm, X_train, y_train, cv=5 )
cv_score

array([ 0.7535665 ,  0.69205634,  0.68197127,  0.6687108 ,  0.71574185])

In [20]:
cv_score_train_mean =  np.mean(cv_score)
cv_score_train_mean

0.70240935282099237
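
The scores above are r2 values, because `cross_val_score` falls back to the estimator's own `score` method when no `scoring` argument is given, and for regressors that method returns r2. The sketch below shows how per-fold MSE could be requested instead. It runs on synthetic data from `make_regression` (an assumption made so the example is self-contained, not the Boston data), and the `_demo` and `_per_fold` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1
)

model = LinearRegression()

# Default scoring for a regressor is its .score method, i.e. r2.
r2_per_fold = cross_val_score(model, X_demo, y_demo, cv=5)

# Asking for r2 explicitly gives the same numbers.
r2_explicit = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# scikit-learn scorers follow "greater is better", so MSE comes back negated.
neg_mse_per_fold = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse_per_fold

print("r2 per fold: ", np.round(r2_per_fold, 3))
print("MSE per fold:", np.round(mse_per_fold, 2))
```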

### 5. Get the MSE and r2 in the validation

In [21]:
y_train_pred_cv = cross_val_predict(lm, X_train, y_train, cv=5)

In [22]:
# The mean squared error of the cross-validated predictions
validation_mse = mean_squared_error(y_train, y_train_pred_cv)
print("Mean squared error: %.2f" % validation_mse)

Mean squared error: 23.91


In [23]:
# Coefficient of determination (r2): 1 is perfect prediction
validation_r2_score = r2_score(y_train, y_train_pred_cv)
print('R2 score: %.2f' % validation_r2_score)

R2 score: 0.70
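
Note that this r2 is computed once over all pooled out-of-fold predictions, which is why it (0.704) differs slightly from the mean of the five per-fold scores in step 4 (0.702). To make the two metrics concrete, the sketch below computes MSE and r2 by hand with NumPy and compares them with the scikit-learn functions. The vectors `y_true` and `y_pred` are small made-up examples, not values from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up vectors, used only to illustrate the formulas.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# r2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, "vs sklearn:", mean_squared_error(y_true, y_pred))
print("r2: ", r2_manual, "vs sklearn:", r2_score(y_true, y_pred))
```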


### 6. Test the model on the test data

In [24]:
y_test_pred = lm.predict(X_test)

### 7. Get the MSE and r2 on the test data 

In [25]:
test_mse = mean_squared_error(y_test, y_test_pred)
print("Mean squared error: %.2f" % test_mse)

Mean squared error: 23.37


In [26]:
# Coefficient of determination (r2): 1 is perfect prediction
test_r2_score = r2_score(y_test, y_test_pred)
print('R2 score: %.2f' % test_r2_score)

R2 score: 0.76


### 8. Compare the MSE and r2 in the validation with those on the test data

In [27]:
comparison = {"Validation" : [validation_mse,validation_r2_score],
             "Test": [test_mse, test_r2_score]}
pd.DataFrame(comparison, index = ["MSE Score", "R2 Score"])

| | Test | Validation |
|---|---|---|
| MSE Score | 23.374563 | 23.906838 |
| R2 Score | 0.763481 | 0.704055 |
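
MSE is expressed in squared target units, which makes it hard to read directly. Taking the square root gives an RMSE in the units of the target itself. The sketch below does this for the two MSE values in the table; the `_reported` and `_rmse` names are illustrative. Assuming the target is the median home value in thousands of dollars, as in the standard description of this dataset, both errors come out a little under 5 (thousand dollars).

```python
import math

# MSE values copied from the comparison table above.
validation_mse_reported = 23.906838
test_mse_reported = 23.374563

# RMSE is the square root of MSE, in the same units as the target.
validation_rmse = math.sqrt(validation_mse_reported)
test_rmse = math.sqrt(test_mse_reported)

print("Validation RMSE: %.2f" % validation_rmse)
print("Test RMSE: %.2f" % test_rmse)
```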


### 9. Do the above with a RandomForestRegressor and identify the features that are most important to the prediction

In [28]:
X_rf = boston.data
y_rf = boston.target
rf_model = RandomForestRegressor()

In [29]:
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
                                                                X_rf,
                                                                y_rf, 
                                                                test_size=0.2, 
                                                                random_state=1)

In [31]:
rf_model.fit(X_rf_train, y_rf_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
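
The output above shows `random_state=None`, so the forest is built from a different random seed on each run, and the scores and importances in this section may change slightly if the notebook is re-executed. The sketch below illustrates, on synthetic data (an assumption, not the Boston data), that fixing `random_state` makes two fits agree. The names `rf_a`, `rf_b` and the `_demo` arrays are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the Boston dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1
)

# Two forests with the same seed and the same settings.
rf_a = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)
rf_b = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# With a fixed seed, predictions and importances should match.
same_predictions = np.allclose(rf_a.predict(X_demo), rf_b.predict(X_demo))
same_importances = np.allclose(rf_a.feature_importances_, rf_b.feature_importances_)

print(same_predictions, same_importances)
```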

In [33]:
cv_scores_rf = cross_val_score(rf_model, X_rf_train, y_rf_train, 
                               cv=5)
cv_scores_rf

array([ 0.82084008,  0.85727007,  0.80984658,  0.88036616,  0.85404168])

In [34]:
cv_scores_rf_mean =  np.mean(cv_scores_rf)
cv_scores_rf_mean

0.84447291314574513

In [35]:
y_rf_test_pred = rf_model.predict(X_rf_test)

In [39]:
rf_model.feature_importances_

array([ 0.05038429,  0.00102363,  0.01100077,  0.00232228,  0.01907472,
        0.39356669,  0.00996119,  0.07135472,  0.00414375,  0.01609765,
        0.02192768,  0.01106756,  0.38807507])

In [40]:
imp_list = list(zip(boston.feature_names, 
                    rf_model.feature_importances_))
imp_df = pd.DataFrame(imp_list, columns = ["Features", "Importance"])
imp_df.sort_values(by = "Importance", ascending = False)

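| | Features | Importance |
|---|---|---|
| 5 | RM | 0.393567 |
| 12 | LSTAT | 0.388075 |
| 7 | DIS | 0.071355 |
| 0 | CRIM | 0.050384 |
| 10 | PTRATIO | 0.021928 |
| 4 | NOX | 0.019075 |
| 9 | TAX | 0.016098 |
| 11 | B | 0.011068 |
| 2 | INDUS | 0.011001 |
| 6 | AGE | 0.009961 |
| 8 | RAD | 0.004144 |
| 3 | CHAS | 0.002322 |
| 1 | ZN | 0.001024 |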
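
To summarize the ranking, the sketch below takes the importance array printed in cell 39 and the column names from step 1, sorts them, and accumulates the shares. The variable names are illustrative. On these numbers, RM and LSTAT together account for roughly 78% of the total importance. Since the forest was fitted without a fixed seed, the exact shares would likely shift a little on a rerun.

```python
import numpy as np

# Column names as shown in step 1.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Importances copied from the rf_model.feature_importances_ output above.
importances = np.array([
    0.05038429, 0.00102363, 0.01100077, 0.00232228, 0.01907472,
    0.39356669, 0.00996119, 0.07135472, 0.00414375, 0.01609765,
    0.02192768, 0.01106756, 0.38807507,
])

# Sort from most to least important and accumulate the shares.
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])

for rank, idx in enumerate(order[:4], start=1):
    print("%d. %-8s %.3f (cumulative %.3f)"
          % (rank, feature_names[idx], importances[idx], cumulative[rank - 1]))
```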


### 10. Compare, using MSE, the results of using a simple linear model with the results of using a random forest regressor

In [42]:
# The mean squared error
test_rf_mse = mean_squared_error(y_rf_test, y_rf_test_pred)
print("Mean squared error: %.2f" % test_rf_mse)

Mean squared error: 12.86


In [43]:
# Coefficient of determination (r2): 1 is perfect prediction
test_rf_r2_score = r2_score(y_rf_test, y_rf_test_pred)
print('R2 score: %.2f' % test_rf_r2_score)

R2 score: 0.87


In [47]:
comparison_final = {"Linear Model Test": [test_mse, test_r2_score],
                    "Linear Model Validation" : [validation_mse,validation_r2_score],
                    "Random Forest Test": [test_rf_mse, test_rf_r2_score]}
pd.DataFrame(comparison_final, index = ["MSE Score", "R2 Score"])

| | Linear Model Test | Linear Model Validation | Random Forest Test |
|---|---|---|---|
| MSE Score | 23.374563 | 23.906838 | 12.860173 |
| R2 Score | 0.763481 | 0.704055 | 0.869872 |
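
The gap between the two models can be put as a relative reduction in test MSE. The sketch below computes it from the two test values in the table; the variable names are illustrative. On these numbers the random forest's test MSE is about 45% lower than the linear model's. This comes from a single train/test split and an unseeded forest, so it is best read as an indication rather than a precise estimate.

```python
# Test MSE values copied from the comparison table above.
linear_test_mse = 23.374563
rf_test_mse = 12.860173

# Fraction by which the random forest lowers the test MSE.
relative_reduction = (linear_test_mse - rf_test_mse) / linear_test_mse

print("Relative MSE reduction: %.1f%%" % (100 * relative_reduction))
```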
