# Predictive Analysis

The following analysis was performed on an **Amazon EC2** instance running the latest Anaconda packages (as of Nov. 2018) and Python 3.7.

All data was imported from **Amazon S3 buckets**.

To stay within Amazon's free tier, the analysis uses a random subset of 10,000 samples. The memory limit could be lifted by scaling up the EC2 instance.

The analysis shows that additional feature engineering is required. Most notably:
- All available data should be used in future analysis to obtain the most predictive model possible.
- MLS real estate data should be made available for training the model.
    - Features such as number of rooms and housing size (area) would likely improve the model substantially.
    - Datasets such as [Boston Housing](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) show that these are powerful features for value prediction.
- A model trained on a previous year's data should be tested on the current year's data to see how well it predicts from one year to the next.
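
The last point, training on one year and testing on the next, could be sketched roughly as follows. The `assessment_year` column and all toy values are hypothetical; the data loaded in this notebook does not include a year column.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: 'assessment_year' is an assumed column,
# not part of the dataset used in this notebook.
toy = pd.DataFrame({
    'assessment_year': [2017] * 4 + [2018] * 4,
    'lot_size': [557.0, 566.0, 886.0, 580.0, 554.0, 600.0, 900.0, 570.0],
    'year_built': [1964.0, 1962.0, 1968.0, 1962.0, 1965.0, 1970.0, 1968.0, 1963.0],
    'value': [312500, 388500, 554500, 421500, 413000, 400000, 560000, 415000],
})

features = ['lot_size', 'year_built']

# Split by time instead of at random: fit on the earlier year, evaluate on the later one.
train = toy[toy['assessment_year'] == 2017]
test = toy[toy['assessment_year'] == 2018]

model = RandomForestRegressor(n_estimators=50, random_state=123)
model.fit(train[features], train['value'])

rmse = mean_squared_error(test['value'], model.predict(test[features])) ** 0.5
print('RMSE on the later year:', rmse)
```

Splitting by year rather than at random keeps later-year rows out of training, which is what the year-over-year question requires.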

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, make_scorer

In [2]:
ml = pd.read_csv('https://s3-us-west-2.amazonaws.com/schellenbergers3bucket/property_assess_1')
ml.head()

Unnamed: 0,value,nb_id,garage,zoning,lot_size,year_built,crime_per_capita
0,312500,6350.0,True,RF1,557.0,1964.0,0.012926
1,388500,6350.0,True,RF1,566.0,1962.0,0.012926
2,554500,6350.0,True,RF4,886.0,1968.0,0.012926
3,421500,6350.0,True,RF1,580.0,1962.0,0.012926
4,413000,6350.0,True,RF1,554.0,1965.0,0.012926
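
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV. If it is left in place, it stays in `X` as a feature once `value` is dropped. A minimal sketch of reading it as the index instead, using the five rows shown above as an in-memory stand-in for the S3 file (the blank first header field is an assumption about how the file was saved):

```python
import io
import pandas as pd

# The five rows shown above, used as a stand-in for the file on S3.
# Assumption: the first header field is blank, which pandas displays as 'Unnamed: 0'.
csv_text = """,value,nb_id,garage,zoning,lot_size,year_built,crime_per_capita
0,312500,6350.0,True,RF1,557.0,1964.0,0.012926
1,388500,6350.0,True,RF1,566.0,1962.0,0.012926
2,554500,6350.0,True,RF4,886.0,1968.0,0.012926
3,421500,6350.0,True,RF1,580.0,1962.0,0.012926
4,413000,6350.0,True,RF1,554.0,1965.0,0.012926
"""

# index_col=0 treats the first column as the row index rather than as data.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```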


In [3]:
inds = np.random.choice(len(ml), 10000)
inds

array([100838, 153053, 287741, ..., 272948, 114375,  38744])
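
Note that `np.random.choice` samples with replacement by default, so the 10,000 indices above may contain repeats, and the draw changes on every run because no seed is set. A sketch of a reproducible draw without replacement; the seed and the row count of 300,000 are arbitrary stand-ins, since the true `len(ml)` is not shown here:

```python
import numpy as np

n_rows = 300000      # stand-in for len(ml); an assumption for this sketch
n_samples = 10000

rng = np.random.RandomState(123)   # fixed seed so the subset can be reproduced
inds_unique = rng.choice(n_rows, size=n_samples, replace=False)

# Both counts match when no index is repeated.
print(len(inds_unique), len(np.unique(inds_unique)))
```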

In [4]:
dummy_cols = ['nb_id', 'garage', 'zoning']
df = pd.get_dummies(ml, columns=dummy_cols, drop_first=True)
df = df.iloc[list(inds)]

In [5]:
len(df)

10000

In [6]:
X = df.drop('value', axis=1).values
y = df['value'].values

In [7]:
y.reshape(-1,1)

array([[318000],
       [388000],
       [192500],
       ...,
       [363000],
       [331000],
       [109000]])

In [8]:
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
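
For readers on a later scikit-learn release: `Imputer` was deprecated in version 0.20 and removed in 0.22, where `sklearn.impute.SimpleImputer` takes its place. A sketch of the same preprocessing steps with the newer class, on made-up data (`X_demo`, `y_demo`, and `demo_pipeline` are illustrative names, not part of this notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Made-up feature matrix with one missing value, for illustration only.
X_demo = np.array([[557.0, 1964.0],
                   [566.0, np.nan],
                   [886.0, 1968.0],
                   [580.0, 1962.0]])
y_demo = np.array([312500, 388500, 554500, 421500])

demo_pipeline = Pipeline([
    # SimpleImputer always imputes column-wise, so there is no axis argument.
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler(with_mean=False)),
    ('rfr', RandomForestRegressor(n_estimators=10, max_depth=10, random_state=123)),
])
demo_pipeline.fit(X_demo, y_demo)
print(demo_pipeline.predict(X_demo[:1]))
```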



In [9]:
scoring = make_scorer(mean_squared_error)
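
One thing to keep in mind with this scorer: `make_scorer` defaults to `greater_is_better=True`, and `GridSearchCV` keeps the candidate with the highest score. With a raw MSE scorer, that favors the candidate with the largest cross-validated error. A sketch of the loss-style alternative, which negates the score so that lower error ranks higher (`neg_mse_scorer` and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, make_scorer

# greater_is_better=False makes the scorer return the negated MSE.
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Toy data and a trivial model that always predicts the training mean.
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 6.0])
model = DummyRegressor(strategy='mean').fit(X_demo, y_demo)

score = neg_mse_scorer(model, X_demo, y_demo)
mse = mean_squared_error(y_demo, model.predict(X_demo))
print(score, mse)
```

With a scorer like this, the grid search's `score` method also returns the negated value, so the sign has to be flipped before taking a square root. The built-in scoring string `'neg_mean_squared_error'` behaves the same way.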

# RandomForest

In [11]:
# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler(copy=True, with_mean=False, with_std=True)),
         ('rfr', RandomForestRegressor(max_depth=10, random_state=123, max_features=None))]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'rfr__n_estimators':(121, 150)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)

# Create the GridSearchCV object: yeg_cv
yeg_cv = GridSearchCV(pipeline, param_grid=parameters, scoring=scoring, cv=5)

# Fit to the training set
yeg_cv.fit(X_train, y_train)

# Compute and print the metrics
mse = yeg_cv.score(X_test, y_test)
print("Tuned RandomForest estimators: {}".format(yeg_cv.best_params_))
print("Tuned RandomForest Mean Squared Error: {}".format(mse))
print('Root Mean Squared Error:', np.sqrt(mse))

Tuned RandomForest estimators: {'rfr__n_estimators': 150}
Tuned RandomForest Mean Squared Error: 14547303036.63804
Root Mean Squared Error: 120612.20102725114


In [16]:
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[1])), 'actual value:', np.atleast_2d(y_test[1]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[10])), 'actual value:', np.atleast_2d(y_test[10]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[100])), 'actual value:', np.atleast_2d(y_test[100]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[1000])), 'actual value:', np.atleast_2d(y_test[1000]))

predicted value: [451975.32584509] actual value: [[376500]]
predicted value: [359591.91606518] actual value: [[415500]]
predicted value: [331644.21804453] actual value: [[267000]]
predicted value: [210460.69161245] actual value: [[211500]]
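
The last metric printed for the tuned model above is the square root of the MSE, i.e. the root mean squared error (RMSE), which weights large misses more heavily than a plain average of absolute errors does. A small sketch using only the four predicted/actual pairs printed above; four points are far too few to characterize the model and serve only to show the two calculations side by side:

```python
import numpy as np

# The four predicted/actual pairs printed above.
predicted = np.array([451975.32584509, 359591.91606518, 331644.21804453, 210460.69161245])
actual = np.array([376500, 415500, 267000, 211500])

errors = predicted - actual
mae = np.mean(np.abs(errors))           # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # root mean squared error
print('MAE:', mae, 'RMSE:', rmse)
```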


# ExtraTreesRegressor

In [17]:
from sklearn.ensemble import ExtraTreesRegressor
# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler(copy=True, with_mean=False, with_std=True)),
         ('rfr', ExtraTreesRegressor(bootstrap=True, criterion='mse', max_depth=None, 
                                     max_features='sqrt', max_leaf_nodes=None, 
                                     min_impurity_decrease=0.0, min_impurity_split=None, 
                                     min_samples_leaf=1, min_samples_split=2, 
                                     min_weight_fraction_leaf=0.0, n_jobs=1, 
                                     oob_score=False, random_state=0, verbose=False, warm_start=False))]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'rfr__n_estimators':(121, 150)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)

# Create the GridSearchCV object: yeg_cv
yeg_cv = GridSearchCV(pipeline, param_grid=parameters, scoring=scoring, cv=5)

# Fit to the training set
yeg_cv.fit(X_train, y_train)

# Compute and print the metrics
mse = yeg_cv.score(X_test, y_test)
print("Tuned ExtraTrees estimators: {}".format(yeg_cv.best_params_))
print("Tuned ExtraTrees Mean Squared Error: {}".format(mse))
print('Root Mean Squared Error:', np.sqrt(mse))

Tuned ExtraTrees estimators: {'rfr__n_estimators': 121}
Tuned ExtraTrees Mean Squared Error: 13560600712.697529
Root Mean Squared Error: 116449.99232588007


In [18]:
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[1])), 'actual value:', np.atleast_2d(y_test[1]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[10])), 'actual value:', np.atleast_2d(y_test[10]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[100])), 'actual value:', np.atleast_2d(y_test[100]))
print('predicted value:', yeg_cv.predict(np.atleast_2d(X_test[1000])), 'actual value:', np.atleast_2d(y_test[1000]))

predicted value: [562116.73553719] actual value: [[376500]]
predicted value: [384214.87603306] actual value: [[415500]]
predicted value: [274422.67867578] actual value: [[267000]]
predicted value: [194538.31168831] actual value: [[211500]]
