# Random Forest: Predicting which properties get sold at foreclosure sale using Texas Harris County 2017 Foreclosure Sales data.

<h3> Each month, lenders in Harris County, Texas list thousands of real estate properties for foreclosure auction. However, more than 80% of these sales are cancelled or delayed for various reasons. Below I try to predict such cancellations using actual data from 2017.</h3>

- Step: 1) Data loading and cleaning
- Step: 2) Formatting data for machine learning 
- Step: 3) Cross validation to assess the performance of three predictive models
- Step: 4) Select the best model and tune its hyperparameters using the pipeline and GridSearchCV modules
- Step: 5) Use the tuned model to assess its predictive power on test data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline

In [2]:
# Load the data
path1 = ("C:/Users/aath/Dropbox/MAEN/Thankful/Data/fls/FLS_Hist2017_clean.csv")
df = pd.read_csv(path1)

In [3]:
# Check null status of the columns
df.isnull().sum()

rec_num                  0
keymap                1022
sold3rd                  0
tax_id                 439
org_loan_amt           361
mon_org_loan_date      277
year_org_loan_date     279
sale_date                0
est_loan_bal           857
mortgagee               28
bedr_num              1286
prop_val               506
Term                  1005
Trustee                  1
sq_ft                  614
time_sold             3591
trustee_ref_num       4762
open_bid              4359
final_bid             3605
loan_type               43
dtype: int64

In [4]:
# Drop the few rows that have nulls in these columns
df = df[pd.notnull(df['Trustee'])]
df = df[pd.notnull(df['mortgagee'])]
df = df[pd.notnull(df['loan_type'])]
df = df.drop('rec_num', axis=1)   # Arbitrary record index, not a predictor

# Actual 2017 sale dates; the time-series aspect is ignored in this analysis
df = df.drop('sale_date', axis=1)

In [5]:
# Drop every remaining column that still contains at least one null
df = df.dropna(axis='columns')
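
Note that `dropna(axis='columns')` removes a column as soon as it holds a single null, which is why only three features survive in this notebook. The sketch below uses a small made-up frame (the values are hypothetical, only the column names echo the real data) to contrast that default with the `thresh` argument, which keeps columns that have at least a chosen number of non-null values. It is an illustration of the option, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one complete column and two with gaps
toy = pd.DataFrame({
    "sold3rd": [0, 1, 0, 0],
    "prop_val": [100.0, np.nan, 250.0, 300.0],   # one missing value
    "open_bid": [np.nan, np.nan, np.nan, 90.0],  # mostly missing
})

# Default behaviour: a single null is enough to drop the column
strict = toy.dropna(axis="columns")

# Alternative: keep columns with at least 75% non-null values
min_non_null = int(0.75 * len(toy))
lenient = toy.dropna(axis="columns", thresh=min_non_null)

print(list(strict.columns))
print(list(lenient.columns))
```

Columns kept by the lenient rule would still need their remaining nulls imputed or dropped row-wise before modelling.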

In [6]:
features_df = df.drop('sold3rd', axis=1)  # Create feature matrix
labels_df = df['sold3rd']                 # Create target vector

In [7]:
# Check to see how many unique categories we may need to create
categorical = features_df.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

mortgagee
854
Trustee
300
loan_type
10


In [8]:
# Set up feature transforming functions

def transform_feature( features_df, column_name ):
    unique_values = set( features_df[column_name].tolist() )
    transformer_dict = {}
    for ii, value in enumerate(unique_values):
        transformer_dict[value] = ii

    def label_map(y):
        return transformer_dict[y]
    features_df[column_name] = features_df[column_name].apply( label_map )
    return features_df


# Apply the transformation to each categorical column

names_of_columns_to_transform = ["mortgagee", "Trustee", "loan_type"]
for column in names_of_columns_to_transform:
    features_df = transform_feature(features_df, column)

print(features_df.columns.values)

['mortgagee' 'Trustee' 'loan_type']
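
One caveat with the helper above: it enumerates a Python `set`, and the iteration order of a set of strings can differ between Python sessions, so the integer assigned to a given mortgagee may change from run to run. A minimal sketch of a reproducible alternative, assuming pandas' `factorize` with `sort=True` is acceptable here, is shown on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy feature frame; the values are hypothetical stand-ins for the real columns
toy = pd.DataFrame({
    "mortgagee": ["Bank B", "Bank A", "Bank B", "Bank C"],
    "loan_type": ["FHA", "CONV", "FHA", "VA"],
})

encoded = toy.copy()
mappings = {}
for col in ["mortgagee", "loan_type"]:
    # sort=True assigns codes in sorted order of the unique values,
    # so the mapping is identical on every run
    codes, uniques = pd.factorize(toy[col], sort=True)
    encoded[col] = codes
    mappings[col] = {value: code for code, value in enumerate(uniques)}

print(encoded)
print(mappings["loan_type"])
```

Keeping the `mappings` dictionary also makes it possible to apply the same encoding to new data later.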


In [9]:
X = features_df.values   # NumPy feature matrix (as_matrix() was removed from pandas)
y = labels_df.tolist()   # Target vector as a plain list

In [10]:
# Perform 5-fold cross-validation for three different models

import sklearn.linear_model
import sklearn.model_selection   # replaces the removed sklearn.cross_validation module
import sklearn.tree
import sklearn.ensemble

clf = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score(clf, X, y, cv=5)
print("Logistic Regression Scores {}".format(score))

clf = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score(clf, X, y, cv=5)
print("Decision Tree Scores {}".format(score))

clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score(clf, X, y, cv=5)
print("Random Forest Scores {}".format(score))



Logistic Regression Scores [ 0.86223984  0.86223984  0.86223984  0.86223984  0.86295929]
Decision Tree Scores [ 0.82457879  0.82061447  0.81863231  0.82160555  0.78152929]
Random Forest Scores [ 0.84241824  0.81367691  0.82854311  0.82953419  0.80536246]


Above we tested three different classifiers, and all of them perform very similarly. However, these algorithms have adjustable parameters, and so far we have used only the defaults.
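
Because the classes are imbalanced, accuracy on its own can be hard to interpret: a model that always predicts the majority class already scores high. One way to put the numbers above in context is to cross-validate a trivial baseline. The sketch below does this with scikit-learn's `DummyClassifier` on synthetic stand-in data (the 14% positive rate and the `X_demo`/`y_demo` names are assumptions for illustration, not the foreclosure data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): about 14% positives, features carry no signal
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.14).astype(int)

# A classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Majority-class share: {:.3f}".format(1 - y_demo.mean()))
print("Baseline accuracy per fold: {}".format(baseline_scores))
```

Running the same check with the real `X` and `y` would show how much of each model's accuracy is explained by class imbalance alone.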

Let's focus on the Random Forest model and see whether tuning its parameters gives better results. We can adjust them manually or use the GridSearchCV tool, which exhaustively searches over the specified parameter values and reports the best combination. It makes sense to run this search on the most informative features, and for that we can use SelectKBest, which scores each feature with a univariate statistical test (the ANOVA F-test, f_classif, by default) and keeps the k highest-scoring features, i.e. those with the lowest p-values.
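
To make the feature-ranking step concrete, here is a minimal sketch of SelectKBest on synthetic data in which only the first feature is related to the label (the data and the `X_demo`/`y_demo` names are assumptions for illustration). After fitting, the selector exposes the per-feature scores and p-values it used for the ranking.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): only the first of three features relates to the label
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=500)
X_demo = rng.normal(size=(500, 3))
X_demo[:, 0] += 2.0 * y_demo   # shift feature 0 by class to make it informative

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

print("F-scores: ", selector.scores_)
print("p-values: ", selector.pvalues_)
print("Kept feature indices:", selector.get_support(indices=True))
```

With label-encoded categorical columns such as the ones used here, the F-test treats the integer codes as if they were ordered quantities, so the resulting ranking should be read with some caution.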

As you can see, this workflow has many moving parts. The pipeline package helps combine them: Pipeline chains the transformation step (SelectKBest) with the estimation step (RandomForestClassifier) into one coherent workflow.

In [11]:

import sklearn.pipeline
import sklearn.feature_selection   # needed for SelectKBest below

# Set up modules of pipeline
select = sklearn.feature_selection.SelectKBest(k='all')
clf = sklearn.ensemble.RandomForestClassifier()

# Chain modules into a stepwise list 

steps = [('feature_selection', select),
         ('random_forest', clf)]

pipeline = sklearn.pipeline.Pipeline(steps)

In [12]:
# Create k-Fold cross-validation
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score

# Class 1 (sold to a 3rd party) is much rarer than class 0, so we use stratified k-fold validation

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Conduct k-fold cross-validation
cv_results = cross_val_score(pipeline, # Pipeline
                             X, # Feature matrix
                             y, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Evaluation metric
                             n_jobs=1) # Use a single CPU core (n_jobs=-1 would use all cores)
# Calculate mean
cv_results.mean()

0.83621079120124442
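
The reason for choosing StratifiedKFold can be checked directly: each test fold should contain roughly the same share of positives as the full data set. The sketch below shows this on synthetic labels (the 13% positive rate is an assumed figure and `y_demo` is a hypothetical stand-in for the real target).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption): 130 positives out of 1000
y_demo = np.array([1] * 130 + [0] * 870)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]

print("Overall positive rate: {:.3f}".format(y_demo.mean()))
print("Positive rate per fold:", np.round(fold_rates, 3))
```

With a plain KFold on unshuffled data like this, some folds would contain no positives at all, which makes per-fold scores hard to compare.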

In [13]:
import sklearn.model_selection   # replaces the removed grid_search and cross_validation modules
import sklearn.metrics

k_range = [i+1 for i in range(3)]         # Number of features selected
n_range = [i+1 for i in range(50)]        # The number of trees in the forest
split_range = [i+2 for i in range(3)]     # The minimum number of samples required to split a node

parameters = dict(feature_selection__k=k_range,
              random_forest__n_estimators=n_range,
              random_forest__min_samples_split=split_range)

# Define the grid search. Note: cv is never fit in this cell, so the results
# below come from the pipeline with its default settings.
cv = sklearn.model_selection.GridSearchCV(pipeline, param_grid=parameters)

print(pipeline.named_steps)


X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

# fit pipeline on X_train and y_train
pipeline.fit( X_train, y_train )

# call pipeline.predict() on X_test data to make a set of test predictions
y_prediction = pipeline.predict( X_test )

# test predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )

# and print the report
print(report)


{'feature_selection': SelectKBest(k='all', score_func=<function f_classif at 0x000000000C1F8E18>), 'random_forest': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)}
             precision    recall  f1-score   support

          0       0.87      0.95      0.91       878
          1       0.17      0.06      0.09       131

avg / total       0.78      0.84      0.80      1009
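
The report above reflects the pipeline with its default settings, because only `pipeline.fit` was called; the grid search object itself was never fit. A minimal sketch of what running the search could look like is shown below. It uses synthetic data from `make_classification` and a deliberately small grid so that it runs quickly; the data, the variable names and the grid values are assumptions for illustration, not results for the foreclosure data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 3 features, imbalanced classes
X_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=2,
                                     n_redundant=0, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

pipe = Pipeline([("feature_selection", SelectKBest(k="all")),
                 ("random_forest", RandomForestClassifier(random_state=0))])

# Deliberately small grid so the sketch runs quickly
param_grid = {"feature_selection__k": [1, 2, 3],
              "random_forest__n_estimators": [10, 50],
              "random_forest__min_samples_split": [2, 4]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)   # this call is what actually runs the search

print(search.best_params_)
print(classification_report(y_te, search.predict(X_te)))
```

After fitting, `search.best_params_` holds the winning combination and `search.predict` uses the pipeline refit with those settings. With imbalanced classes, a scoring choice such as `"f1"` instead of `"accuracy"` may be worth considering.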





In [14]:
pipeline.score(X_test, y_test)

0.83845391476709619

In [15]:
# Sanity check: a standalone random forest with the same settings (n_estimators=10, no feature selection step) gives a similar score
sklearn.ensemble.RandomForestClassifier(n_estimators = 10).fit(X_train, y_train).score(X_test, y_test)

0.84539147670961345