# Regression trees on the 2003 dataset

In [1]:
using DecisionTree, ScikitLearn, DataFrames, CSV, MLDataUtils



In [2]:
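# define error functions

# mean absolute error (MAE), in the same units as y
function mean_abs_err(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0                      # Float64 accumulator; avoids shadowing Base.sum
    for i = 1:n
        total += abs(y_output[i] - y_true[i])
    end
    return total / n
end

# mean absolute percentage deviation (MAPD), in percent
# note: divides by y_true[i], so it assumes no true value is zero
function mean_abs_percent(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0
    for i = 1:n
        total += abs((y_output[i] - y_true[i]) / y_true[i])
    end
    return 100 * total / n
end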

mean_abs_percent (generic function with 1 method)
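
As a quick sanity check on the two metrics, the sketch below writes them in vectorized form and evaluates them on a tiny hand-made example. The names `mae`, `mape`, `y_true_toy`, and `y_pred_toy` are hypothetical and not part of the notebook, and the toy numbers are invented purely for illustration.

```julia
# Vectorized equivalents of the loop-based metrics (hypothetical names).
mae(y_output, y_true) = sum(abs.(y_output .- y_true)) / length(y_true)
mape(y_output, y_true) = 100 * sum(abs.((y_output .- y_true) ./ y_true)) / length(y_true)

# Toy data, made up for illustration only.
y_true_toy = [100.0, 200.0, 400.0]
y_pred_toy = [110.0, 190.0, 400.0]

mae_toy = mae(y_pred_toy, y_true_toy)    # (10 + 10 + 0) / 3
mape_toy = mape(y_pred_toy, y_true_toy)  # 100 * (0.10 + 0.05 + 0) / 3
```

The same absolute error of 10 counts as 10% against a true value of 100 but only 5% against 200, which is why the two metrics can rank models differently.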

In [3]:
tap_train_2003 = readtable("TAP_train_2003.csv");
#tap_test_2003 = readtable("TAP_test_2003.csv");

In [4]:
# convert training data into array and shuffle observations by row
TAP_2003 = convert(Array, tap_train_2003);
TAP_2003 = shuffleobs(TAP_2003, obsdim = 1);

In [5]:
# split into X and y
X_data = TAP_2003[:,2:42];
y_data = TAP_2003[:,43];

In [7]:
# split the training data (80% of the full data) 75/25 into training and validation sets
X_train_2003, X_val_2003 = splitobs(X_data, at=0.75, obsdim=1);
y_train_2003, y_val_2003 = splitobs(y_data, at=0.75, obsdim=1);
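
The features and the target are split by two separate calls here. Their rows stay paired only because both come from the same already-shuffled matrix and are cut at the same fraction without reshuffling. A minimal plain-Julia sketch of that idea follows; `split_rows` and the toy arrays are hypothetical, and the rounding rule is an assumption that may differ from what `splitobs` does internally.

```julia
# Hypothetical helper: cut X and y at the same row index so pairs stay aligned.
function split_rows(X, y, at)
    n = size(X, 1)
    n_train = round(Int, at * n)   # assumed rounding rule, for illustration
    return X[1:n_train, :], X[n_train+1:end, :], y[1:n_train], y[n_train+1:end]
end

# Toy data: 8 observations, 2 features; column 1 of X equals y by construction.
X_toy = reshape(collect(1.0:16.0), 8, 2)
y_toy = collect(1.0:8.0)

X_tr, X_va, y_tr, y_va = split_rows(X_toy, y_toy, 0.75)
```

Because the first feature column was built to equal the target, comparing `X_va[:, 1]` with `y_va` confirms that the validation rows still match their labels.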

In [11]:
# convert arrays from real to Float64
X_train_2003 = convert(Array{Float64}, X_train_2003);
X_val_2003 = convert(Array{Float64}, X_val_2003);
y_train_2003 = convert(Array{Float64}, y_train_2003);
y_val_2003 = convert(Array{Float64}, y_val_2003);

In [12]:
# fit regr1: single regression tree with an average of 10 samples per leaf, using build_tree
regr1 = build_tree(y_train_2003, X_train_2003, 10);

# test fit of model on validation set
output_regr1 = apply_tree(regr1, X_val_2003);

# compute errors on validation set
MAE_regr1 = mean_abs_err(output_regr1, y_val_2003);
MAPD_regr1 = mean_abs_percent(output_regr1, y_val_2003);

In [13]:
MAE_regr1

58.878993479009864

In [15]:
MAPD_regr1

3.840747418696398

In [16]:
# fit three more models through the ScikitLearn interface
regr_2 = DecisionTreeRegressor()
regr_3 = DecisionTreeRegressor(pruning_purity_threshold=0.05)
regr_4 = RandomForestRegressor(ntrees=30)
ScikitLearn.fit!(regr_2, X_train_2003, y_train_2003)
ScikitLearn.fit!(regr_3, X_train_2003, y_train_2003)
ScikitLearn.fit!(regr_4, X_train_2003, y_train_2003)

DecisionTree.RandomForestRegressor(0, 5, 30, 0.7, -1, MersenneTwister(UInt32[0x11a7058f, 0x1cb622ba, 0x29d386e2, 0x3c7ee860], Base.dSFMT.DSFMT_state(Int32[2103179619, 1073161036, -2118814837, 1072943612, -2068753212, 1072791350, -815251381, 1073114271, -1567038479, 1073308193  …  -516135598, 1073413935, 1670056433, 1072929290, 1638151636, -996999553, -468722575, -1891921472, 382, 0]), [1.44612, 1.23877, 1.09356, 1.40152, 1.58646, 1.88811, 1.44411, 1.32428, 1.94316, 1.34846  …  1.27963, 1.14275, 1.86028, 1.51351, 1.48083, 1.74592, 1.9411, 1.70605, 1.6873, 1.22511], 270), Ensemble of Decision Trees
Trees:      30
Avg Leaves: 1977.5333333333333
Avg Depth:  28.9)

In [17]:
# compute errors on validation set
output_regr2 = ScikitLearn.predict(regr_2, X_val_2003);
output_regr3 = ScikitLearn.predict(regr_3, X_val_2003);
output_regr4 = ScikitLearn.predict(regr_4, X_val_2003);

# errors for regr2: DecisionTreeRegressor
MAE_regr2 = mean_abs_err(output_regr2, y_val_2003);
MAPD_regr2 = mean_abs_percent(output_regr2, y_val_2003);

# errors for regr3: DecisionTreeRegressor with pruning purity threshold of 0.05
MAE_regr3 = mean_abs_err(output_regr3, y_val_2003);
MAPD_regr3 = mean_abs_percent(output_regr3, y_val_2003);

# errors for regr4: Random Forest with n=30 trees
MAE_regr4 = mean_abs_err(output_regr4, y_val_2003);
MAPD_regr4 = mean_abs_percent(output_regr4, y_val_2003);

Overview of models for 2003 data:
* model 1: regression tree trained on the training set using the DecisionTree.jl package, with an average of 10 samples per leaf
* model 2: full regression tree trained on the training set using the ScikitLearn package
* model 3: regression tree trained on the training set using ScikitLearn, with a pruning purity threshold of 0.05
* model 4: random forest trained on the training set using ScikitLearn, with n=30 trees

In [25]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr1 MAPD_regr1
@printf "For model 2 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr2 MAPD_regr2
@printf "For model 3 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr3 MAPD_regr3
@printf "For model 4 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr4 MAPD_regr4

For model 1 on the validation set, the MAE is 58.878993 and the MAPD is 3.840747 
For model 2 on the validation set, the MAE is 61.010784 and the MAPD is 3.909482 
For model 3 on the validation set, the MAE is 84.209590 and the MAPD is 5.387782 
For model 4 on the validation set, the MAE is 53.441672 and the MAPD is 3.370886 
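
To make the comparison explicit, the sketch below ranks the four models by validation MAE, using the values copied from the printed output above. It assumes Julia 1.x (for `argmin`); the names `results`, `best`, and `rel_gain` are hypothetical. Since this is a single validation split, the ranking should be read as indicative rather than conclusive.

```julia
# Validation errors copied from the printed output above: (name, MAE, MAPD).
results = [
    ("model 1", 58.878993, 3.840747),
    ("model 2", 61.010784, 3.909482),
    ("model 3", 84.209590, 5.387782),
    ("model 4", 53.441672, 3.370886),
]

# Pick the entry with the lowest validation MAE.
best = results[argmin([r[2] for r in results])]

# Relative MAE reduction (in percent) of the best model versus model 1.
rel_gain = 100 * (results[1][2] - best[2]) / results[1][2]

println("Lowest validation MAE: ", best[1])
```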
