# Ames House Price Data - Gradient Boosted Trees (XGBoost)

> Jupyter notebook, running a Julia 0.5.2 kernel, with the help of machine learning modules written by the author

*We build an XGBoost model using all available features*.

Cross-validated log-RMS error of Sale Price predictions for the tuned model (mean ± standard deviation over 12 folds):

    0.113 ± 0.020

## Reading in and transforming the data
Our gradient boosting regressor is just a wrapper around the popular XGBoost algorithm (sitting atop [XGBoost.jl](https://github.com/dmlc/xgboost) of the Distributed (Deep) Machine Learning Community). Because XGBoost does not handle categorical features directly, we one-hot encode them after loading and standardizing the data. Standardization is recommended because the effect of the regularization parameters is scale-dependent.

In [5]:
push!(LOAD_PATH, pwd()) # Allow loading of modules from current directory 
addprocs(3) # for parallel processing
using Preprocess
import DataFrames: head, readtable, writetable
using Regressors, Validation
import TreeCollections: DataTable

df = readtable("2.cleaned/train_randomized.csv")

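const y = collect(df[:target]) # target vector

x = df[2:end-1] # features: every column except the first and the last
s = StandardizationScheme(x) # fit per-feature mu and sigma (reported below)
xx = transform(s, x)
t = HotEncodingScheme(xx) # fit the one-hot encoding on the standardized data
const X = Array(transform(t, xx)); # plain array passed to the regressor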

Features standarized: 
  :LotFrontage    mu=71.23589949492492  sigma=24.720946372947626
  :LotArea    mu=10448.78434065934  sigma=9860.763448771824
  :OverallQual    mu=6.0885989010989015  sigma=1.3696691706201316
  :OverallCond    mu=5.576236263736264  sigma=1.1139656335816104
  :YearBuilt    mu=1971.1854395604396  sigma=30.201589946070243
  :YearRemodAdd    mu=1984.819368131868  sigma=20.652142559919664
  :MasVnrArea    mu=101.52678571428571  sigma=177.0117726788596
  :BsmtFinSF1    mu=436.99107142857144  sigma=430.25505173352497
  :BsmtFinSF2    mu=46.6771978021978  sigma=161.52237571837978
  :BsmtUnfSF    mu=566.9903846153846  sigma=442.19718189265285
  :TotalBsmtSF    mu=1050.6586538461538  sigma=412.15571520305724
  :x1stFlrSF    mu=1157.1085164835165  sigma=369.3073305084563
  :x2ndFlrSF    mu=343.532967032967  sigma=431.5289149184655
  :LowQualFinSF    mu=5.860576923076923  sigma=48.68890442210665
  :GrLivArea    mu=1506.5020604395604  sigma=496.8153784562889
  :BsmtFullBath   
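
The standardize-then-one-hot-encode step above can be sketched in plain Julia as follows. This is a minimal illustration only, not the author's `Preprocess` module: the helper names `standardize` and `onehot` and the toy data are hypothetical, the use of the sample standard deviation is an assumption, and the code targets a current Julia 1.x release rather than the 0.5.2 kernel used in this notebook.

```julia
using Statistics  # mean, std (standard library in Julia 1.x)

# Hypothetical helper: rescale a numeric column to zero mean and unit
# sample standard deviation, returning the fitted mu and sigma as well.
function standardize(v::AbstractVector{<:Real})
    mu, sigma = mean(v), std(v)
    return (v .- mu) ./ sigma, mu, sigma
end

# Hypothetical helper: one-hot encode a categorical column,
# producing one 0/1 column per distinct level (levels sorted).
function onehot(v::AbstractVector)
    levels = sort(unique(v))
    return Float64[x == l for x in v, l in levels], levels
end

# Toy data, made up for illustration (not taken from the Ames data set).
lot_area = [8450.0, 9600.0, 11250.0, 9550.0]
street   = ["Pave", "Grvl", "Pave", "Pave"]

z, mu, sigma = standardize(lot_area)
H, levels    = onehot(street)

# All-numeric feature matrix of the kind a tree booster can consume.
X_toy = hcat(z, H)
```

Standardizing before encoding, as in the notebook, leaves the 0/1 indicator columns untouched, so only the originally numeric features are rescaled.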

## Parameter tuning and cross-validation

XGBoost has quite a few parameters to tune. We carried out the tuning (not published in this notebook) following [this post](https://www.analyticsvidhya.com/wp-content/uploads/2016/02/5.-gamma.png) by Aarshay Jain.


In [None]:
rgs=XGBoostRegressor(alpha=0.006, lambda=0.0, subsample=0.66,
                     colsample_bytree=0.55, eta=0.01, n=2000, min_child_weight=6.0, max_depth=3)


# determine cross-validation error:
errors_xgboost=cv_errors(rgs, X, y, n_folds=12, parallel=true, verbose=false); 
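
For readers without access to the author's `Validation` module, the kind of k-fold procedure that `cv_errors` presumably carries out can be sketched as below. Everything here is an assumption made for illustration: the names `kfold_errors`, `rms` and `fit_ols` are hypothetical, the folds are contiguous blocks (reasonable only if the rows are already shuffled), an ordinary least-squares fit stands in for the XGBoost regressor, and the data are synthetic. The code targets a current Julia 1.x release rather than the 0.5.2 kernel.

```julia
using Statistics  # mean, std

# Root-mean-square error; applied to log prices it is a log-RMS error.
rms(yhat, y) = sqrt(mean(abs2, yhat .- y))

# Hypothetical k-fold cross-validation over contiguous folds.
# `fit(X, y)` must return a function mapping a feature matrix to predictions.
function kfold_errors(fit, X::AbstractMatrix, y::AbstractVector; n_folds::Int=12)
    n = length(y)
    edges = [div(k * n, n_folds) for k in 0:n_folds]  # fold boundaries
    errors = Float64[]
    for k in 1:n_folds
        test  = (edges[k] + 1):edges[k + 1]   # held-out rows for this fold
        train = setdiff(1:n, test)            # all remaining rows
        predict = fit(X[train, :], y[train])
        push!(errors, rms(predict(X[test, :]), y[test]))
    end
    return errors
end

# Stand-in model (not XGBoost): least squares with an intercept column.
function fit_ols(X, y)
    beta = hcat(ones(size(X, 1)), X) \ y
    return Xnew -> hcat(ones(size(Xnew, 1)), Xnew) * beta
end

# Synthetic, exactly linear data, so each fold error should be near zero.
X_demo = [sin(i * j) for i in 1:24, j in 1:2]
y_demo = 1 .+ X_demo * [2.0, -3.0]

errs = kfold_errors(fit_ols, X_demo, y_demo; n_folds=12)
summary_line = string(mean(errs), " ± ", std(errs))
```

The per-fold errors are then summarized by their mean and standard deviation, which is the form of the figure reported in this notebook.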

In [7]:
string(mean(errors_xgboost), " ± ", std(errors_xgboost))

"0.11299541209957703 ± 0.020329535514609094"

In [8]:
import JLD: jldopen, write, close
file = jldopen("cv_errors.jld", "r+") # open the existing file read-write, keeping its contents
write(file, "errors_xgboost", errors_xgboost)
close(file)