# Ames House Price Data - Extreme Random Forest

> Jupyter notebook, running a Julia 0.5.2 kernel, with the help of machine learning modules written by the author

*We build an extreme random forest using the 12 most important features.*

Estimated 95% confidence interval for log-RMS error of Sale Price predictions for the tuned model: 

    0.145 ± 0.018

## Reading in and transforming the data
Our extreme random forest regressors are constructed in the same way as regular random forests, but with different parameter settings. Once again, the input data should be converted to a `TreeCollections.DataTable` object.

In [15]:
push!(LOAD_PATH, pwd()) # Allow loading of modules from current directory 
addprocs(3) # for parallel processing
using Regressors, Preprocess, Validation
import DataFrames: DataFrame, readtable, writetable
import TreeCollections.DataTable

df = readtable("2.cleaned/train_randomized.csv")
important_features = convert(Array{Symbol}, readtable("3.important_features/important_features.csv")[1])
important_features = important_features[1:12]
const X = DataTable(df[important_features])
const y = collect(df[:target]);

In [2]:
xtreme = BaggedRegressor(atom=TreeRegressor(extreme=true), n = 200, bagging = 0.0)

unfitted BaggedRegressor{Regressors.TreeRegressor}@...7796
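
For readers unfamiliar with the term, here is a minimal sketch of how a node split is usually chosen in extremely randomized trees (Extra-Trees): each candidate feature gets a single cut point drawn uniformly at random between its minimum and maximum, and the best of these random candidates is kept. The whole training set is normally used for every tree, which is consistent with `bagging = 0.0` above. This is an illustration only: `extreme_split` and `sse` are hypothetical names, the code assumes Julia 1.x (where `Random` and `Statistics` are standard libraries), and we have not confirmed that `TreeRegressor(extreme=true)` works exactly this way.

```julia
using Random, Statistics

# Sum of squared deviations from the mean (zero for an empty vector).
sse(v) = isempty(v) ? 0.0 : sum(abs2, v .- mean(v))

# Hypothetical helper, NOT part of the author's modules.
# For each of `k` randomly chosen features, draw ONE threshold uniformly
# between that feature's min and max, then keep the candidate whose two
# children have the lowest total squared error.
function extreme_split(rng::AbstractRNG, X::AbstractMatrix, y::AbstractVector, k::Integer)
    best = (feature = 0, threshold = NaN, score = Inf)
    for j in randperm(rng, size(X, 2))[1:k]
        lo, hi = extrema(view(X, :, j))
        lo == hi && continue              # constant feature: nothing to split on
        t = lo + rand(rng) * (hi - lo)    # random cut point, not an optimised one
        left = X[:, j] .< t
        score = sse(y[left]) + sse(y[.!left])
        if score < best.score
            best = (feature = j, threshold = t, score = score)
        end
    end
    return best
end

# Synthetic data, for illustration only: the target depends on feature 2.
rng = MersenneTwister(0)
X_demo = rand(rng, 50, 4)
y_demo = 3 .* X_demo[:, 2] .+ 0.1 .* randn(rng, 50)
chosen = extreme_split(rng, X_demo, y_demo, 4)
```

A regular random forest would instead search every feature for the threshold that minimises the score; skipping that search makes each tree cheaper to grow and more random.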

## Determining number of iterations for convergence

In [3]:
train, valid = split_bag(1:length(y), 70)

([1,2,3,4,5,6,7,8,9,10  …  1010,1011,1012,1013,1014,1015,1016,1017,1018,1019],[1020,1021,1022,1023,1024,1025,1026,1027,1028,1029  …  1447,1448,1449,1450,1451,1452,1453,1454,1455,1456])
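
The output above is consistent with a simple 70/30 split that sends the first 70% of the indices to training and the remainder to validation. Because the rows were already shuffled (the file is `train_randomized.csv`), a contiguous split like this is a reasonable hold-out. The sketch below reproduces the same boundaries; `split_indices` is a hypothetical stand-in, not the author's `split_bag`, whose implementation we have not seen.

```julia
# Hypothetical stand-in for `split_bag`: the first `percent`% of the
# indices go to training, the rest to validation.
function split_indices(indices::AbstractVector{<:Integer}, percent::Real)
    n_train = floor(Int, length(indices) * percent / 100)
    return indices[1:n_train], indices[n_train+1:end]
end

train_demo, valid_demo = split_indices(1:1456, 70)
(length(train_demo), length(valid_demo))   # (1019, 437)
```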

In [4]:
u,v=learning_curve(xtreme, X, y, train, valid, 1:6:300, parallel=true, verbose=false)

([1.0,7.0,13.0,19.0,25.0,31.0,37.0,43.0,49.0,55.0  …  241.0,247.0,253.0,259.0,265.0,271.0,277.0,283.0,289.0,295.0],[0.215906,0.166803,0.159735,0.156064,0.156488,0.155814,0.154921,0.154207,0.153699,0.15353  …  0.152586,0.152493,0.152381,0.152342,0.152497,0.152452,0.152411,0.152408,0.152316,0.152356])

In [5]:
using Plots; pyplot(size=(600,300))
plot(u,v, ylim=(0.1,0.19), xlab="number of trees", ylab="LRMS validation error")

## Tuning `min_patterns_split`
Sometimes increasing the value of `min_patterns_split` above the default value of 2 improves the accuracy of extreme random forests.

In [6]:
full = vcat(train, valid)
u,v = @getfor σ [2,3,4,5,6,7,8,9,10] rms_error(BaggedRegressor(X, y, train;
        atom=TreeRegressor(extreme=true, min_patterns_split=σ), 
        n=200, bagging=0.0, parallel=true, verbose=false), X, y, valid)

σ=10

([2,3,4,5,6,7,8,9,10],[0.152266,0.15314,0.150861,0.150776,0.150831,0.150376,0.151431,0.152024,0.151598])
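
To put numbers on how flat this curve is, the short sketch below takes the values printed above and computes the best setting and the relative gap between the worst and best validation errors. It assumes Julia 1.x; the variable names are ours.

```julia
σ_values = [2, 3, 4, 5, 6, 7, 8, 9, 10]
errors   = [0.152266, 0.15314, 0.150861, 0.150776, 0.150831,
            0.150376, 0.151431, 0.152024, 0.151598]

best_error, best_index = findmin(errors)
best_σ = σ_values[best_index]                     # 7 for this particular run
relative_spread = (maximum(errors) - best_error) / best_error
round(100 * relative_spread, digits = 2)          # 1.84 (percent)
```

The whole range of settings changes the error by less than 2%, so a single run like this one cannot separate neighbouring values of `min_patterns_split` reliably.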

In [7]:
plt=plot(u,v, xlab="minimum patterns for a split", ylab="LRMS validation error")

We'll make several more runs to get an idea of variability:

In [8]:
for _ in 1:4
    u,v = @getfor σ [2,3,4,5,6,7,8,9,10] rms_error(BaggedRegressor(X, y, train;
        atom=TreeRegressor(extreme=true, min_patterns_split=σ), 
        n=200, bagging=0.0, parallel=true, verbose=false), X, y, valid)
    plot!(u,v)
end
plt

σ=10

This is probably overkill, given the small changes in error (<2%), but we will go ahead and fine-tune with cross-validation:

In [9]:
u,v = @getfor σ [6,7,8,9, 10] cv_error(
    BaggedRegressor(atom=TreeRegressor(extreme=true, min_patterns_split=σ), 
        n=400, bagging=0), X, y, full; n_folds=6, verbose=false, parallel=true)

σ=10

([6,7,8,9,10],[0.147432,0.147473,0.14742,0.147155,0.14749])
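
Ranking the cross-validated errors printed above makes the choice of setting explicit. This is a small sketch using only those five values (Julia 1.x syntax; variable names are ours).

```julia
σ_cv      = [6, 7, 8, 9, 10]
cv_scores = [0.147432, 0.147473, 0.14742, 0.147155, 0.14749]

ranking = σ_cv[sortperm(cv_scores)]   # settings ordered from lowest to highest error
ranking                               # [9, 8, 6, 7, 10]
```

The lowest cross-validated error occurs at `min_patterns_split = 9`, although the differences between settings are in the fourth decimal place.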

In [10]:
plot(u,v)

## Cross-validation of tuned model

In [11]:
xtreme = BaggedRegressor(atom=TreeRegressor(extreme=true, min_patterns_split=9), 
    n = 200, bagging = 0)
errors_xtreme = cv_errors(xtreme, X, y, full;
    n_folds=12, verbose=false, parallel=true)


12-element Array{Float64,1}:
 0.156125
 0.153891
 0.148563
 0.117961
 0.167092
 0.134842
 0.13731 
 0.147561
 0.117183
 0.165965
 0.126086
 0.167543
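
Each of the 12 numbers above is the error on one held-out fold, from a model trained on the other 11. As a rough illustration of the bookkeeping involved, the sketch below partitions a set of row indices into contiguous folds and pairs each test fold with its complementary training set. `kfold_pairs` is a hypothetical helper written for this note; the author's `cv_errors` may partition the rows differently.

```julia
# Hypothetical helper: partition `indices` into `k` contiguous folds and
# return one (train, test) pair of index collections per fold.
function kfold_pairs(indices::AbstractVector{<:Integer}, k::Integer)
    n = length(indices)
    # Fold boundaries; resulting fold sizes differ by at most one.
    edges = [round(Int, i * n / k) for i in 0:k]
    return [(vcat(indices[1:edges[i]], indices[edges[i+1]+1:end]),
             indices[edges[i]+1:edges[i+1]]) for i in 1:k]
end

folds = kfold_pairs(1:1456, 12)
fold_sizes = length.(last.(folds))   # number of held-out rows in each fold
```

With 1456 rows and 12 folds, each model is tested on roughly 121 rows, which helps explain why the per-fold errors vary as much as they do.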

Approximate 95% confidence interval for the log-RMS error of the model's Sale Price predictions, computed as the mean ± one standard deviation of the 12 fold errors:

In [12]:
string(mean(errors_xtreme), " ± ", std(errors_xtreme))


"0.14501009771648404 ± 0.018297596150772935"

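The figure quoted at the top of the notebook can be reproduced from the 12 fold errors printed above. The sketch below assumes Julia 1.x, where `mean` and `std` live in the `Statistics` standard library (in the Julia 0.5 kernel used here they are available without an import). The values are copied from the rounded output, so the last digits differ slightly from the string shown above.

```julia
using Statistics

fold_errors = [0.156125, 0.153891, 0.148563, 0.117961, 0.167092, 0.134842,
               0.13731, 0.147561, 0.117183, 0.165965, 0.126086, 0.167543]

m = mean(fold_errors)   # average error across folds
s = std(fold_errors)    # sample standard deviation (n - 1 denominator)
summary_text = string(round(m, digits = 3), " ± ", round(s, digits = 3))   # "0.145 ± 0.018"
```
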
## Appending cross-validation errors to file

In [13]:
using JLD
file = jldopen("cv_errors.jld", "r+") # read/append mode
write(file, "errors_xtreme", errors_xtreme)
close(file)

In [14]:
d=load("cv_errors.jld")

Dict{String,Any} with 3 entries:
  "errors_reg"    => [0.175715,0.196048,0.175332,0.140217,0.189921,0.167402,0.1…
  "errors_rf"     => [0.158927,0.171014,0.150358,0.117973,0.165848,0.143109,0.1…
  "errors_xtreme" => [0.156125,0.153891,0.148563,0.117961,0.167092,0.134842,0.1…