# Regression trees with the 2015 dataset

In [90]:
using DecisionTree, ScikitLearn, DataFrames, CSV, MLDataUtils

In [131]:
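# define error functions

# mean absolute error (MAE) between predictions and true values
function mean_abs_err(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0  # Float64 accumulator; avoids shadowing Base.sum
    for i = 1:n
        total += abs(y_output[i] - y_true[i])
    end
    return total / n
end

# mean absolute percentage deviation (MAPD), in percent
# note: undefined if any entry of y_true is zero
function mean_abs_percent(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0
    for i = 1:n
        total += abs((y_output[i] - y_true[i]) / y_true[i])
    end
    return 100 * total / n
end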

mean_abs_percent (generic function with 1 method)
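
As a quick sanity check, both metrics can be worked by hand on a tiny example. The sketch below uses vectorized one-liners that should give the same results as the loop-based functions; the names `mae_vec` and `mapd_vec` and the toy vectors are invented for illustration and are not part of the original notebook.

```julia
# Vectorized equivalents of the loop-based error functions (hypothetical names).
mae_vec(y_output, y_true) = sum(abs.(y_output .- y_true)) / length(y_true)
mapd_vec(y_output, y_true) = 100 * sum(abs.((y_output .- y_true) ./ y_true)) / length(y_true)

# Toy data, invented for this example only.
y_pred_toy = [110.0, 190.0, 330.0]
y_true_toy = [100.0, 200.0, 300.0]

# Absolute errors are 10, 10 and 30, so the MAE is 50 / 3.
toy_mae = mae_vec(y_pred_toy, y_true_toy)

# Relative errors are 0.10, 0.05 and 0.10, so the MAPD is 100 * 0.25 / 3.
toy_mapd = mapd_vec(y_pred_toy, y_true_toy)

println("MAE = ", toy_mae, ", MAPD = ", toy_mapd)
```

The MAE is in the units of the target, while the MAPD is a percentage of the true value, so the MAPD is only meaningful when no true value is zero.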

In [91]:
tap_train_2015 = readtable("TAP_train_2015.csv");
#tap_test_2015 = readtable("TAP_test_2015.csv");

In [93]:
# convert training data into array and shuffle observations by row
TAP_2015 = convert(Array, tap_train_2015);
TAP_2015 = shuffleobs(TAP_2015, obsdim = 1);

In [101]:
# split into X and y
X_data = TAP_2015[:,2:42];
y_data = TAP_2015[:,43];

In [102]:
# split the training data (80% of the full dataset) into training (75%) and validation (25%) sets
X_train_2015, X_val_2015 = splitobs(X_data, at=0.75, obsdim=1);
y_train_2015, y_val_2015 = splitobs(y_data, at=0.75, obsdim=1);
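
The two `splitobs` calls give matching rows only because `X_data` and `y_data` were sliced from the same shuffled matrix and split at the same fraction. A minimal library-free sketch of that idea follows; `split_rows` is a hypothetical helper, the toy matrix is invented, and rounding down is an assumption that may differ from what `splitobs` does internally.

```julia
# Hypothetical helper: split the rows of `data` at fraction `at`.
function split_rows(data, at)
    n = size(data, 1)
    n_train = floor(Int, at * n)  # assumption: round down
    return data[1:n_train, :], data[(n_train + 1):n, :]
end

# Toy matrix with 8 rows: one feature column and one target column.
toy = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0; 5.0 50.0; 6.0 60.0; 7.0 70.0; 8.0 80.0]

toy_train, toy_val = split_rows(toy, 0.75)

# Slice features and targets after splitting, so rows stay paired.
X_toy_train, y_toy_train = toy_train[:, 1], toy_train[:, 2]
X_toy_val, y_toy_val = toy_val[:, 1], toy_val[:, 2]

println(size(toy_train, 1), " training rows, ", size(toy_val, 1), " validation rows")
```

Splitting the combined matrix first and slicing afterwards makes it impossible for features and targets to fall out of alignment.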

In [103]:
X_train_2015

5957×41 SubArray{Real,2,Array{Real,2},Tuple{UnitRange{Int64},Base.Slice{Base.OneTo{Int64}}},false}:
 1  0  1  1  0  1  0  0  0  0  0  0  1  …  0  0  0  0  0  0  0  0  22500.5  1
 1  0  1  1  1  0  0  0  0  0  0  1  0     0  0  0  0  0  0  0  0  65500.5  1
 1  0  0  1  1  0  0  0  0  1  0  0  0     0  0  0  0  1  0  0  0   5500.5  1
 1  1  0  1  1  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  39500.5  1
 1  0  0  1  0  1  0  0  0  1  0  0  0     0  0  0  0  0  0  0  0  31500.5  1
 1  1  0  1  1  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0   3500.5  1
 1  1  1  1  1  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  70500.5  1
 1  0  1  1  0  1  0  0  0  0  0  1  0     0  0  0  0  0  0  0  0  78500.5  1
 1  0  0  1  1  0  0  0  0  1  0  0  0     0  0  0  0  1  0  0  0   7500.5  1
 1  1  1  1  1  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  59500.5  1
 1  0  1  1  0  0  1  0  0  0  0  1  0  …  0  0  0  0  0  0  0  0  24500.5  1
 1  1  1  1  0  1  0  0  0  0  0  0  0    

In [105]:
# convert arrays from real to Float64
X_train_2015 = convert(Array{Float64}, X_train_2015);
X_val_2015 = convert(Array{Float64}, X_val_2015);
y_train_2015 = convert(Array{Float64}, y_train_2015);
y_val_2015 = convert(Array{Float64}, y_val_2015);

In [135]:
# fit a single regression tree, averaging 10 samples per leaf, using build_tree
regr1 = build_tree(y_train_2015, X_train_2015, 10);

# test fit of model on validation set
output_regr1 = apply_tree(regr1, X_val_2015);

# compute errors on validation set
MAE_regr1 = mean_abs_err(output_regr1, y_val_2015);
MAPD_regr1 = mean_abs_percent(output_regr1, y_val_2015);

In [153]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f" MAE_regr1 MAPD_regr1

For model 1 on the validation set, the MAE is 69.342213 and the MAPD is 3.497738

In [149]:
# fit three more models through the ScikitLearn.jl interface (the models themselves come from DecisionTree.jl)
regr_2 = DecisionTreeRegressor()
regr_3 = DecisionTreeRegressor(pruning_purity_threshold=0.05)
regr_4 = RandomForestRegressor(ntrees=30)
ScikitLearn.fit!(regr_2, X_train_2015, y_train_2015)
ScikitLearn.fit!(regr_3, X_train_2015, y_train_2015)
ScikitLearn.fit!(regr_4, X_train_2015, y_train_2015)

DecisionTree.RandomForestRegressor(0, 5, 30, 0.7, -1, MersenneTwister(UInt32[0xa5acfa3e, 0xf52ad4e6, 0x72c9cc4d, 0xe31e7ecd], Base.dSFMT.DSFMT_state(Int32[751335842, 1073488004, 557103945, 1073425639, 222132640, 1072939513, -1492378665, 1073723113, 499479115, 1073436289  …  1094131347, 1073085658, 661725898, 1073459706, 1374341142, 1647895786, 1330651376, -677878687, 382, 0]), [1.75794, 1.69846, 1.23486, 1.98216, 1.70862, 1.10772, 1.01477, 1.6654, 1.62104, 1.51122  …  1.17767, 1.43478, 1.69752, 1.78802, 1.08077, 1.55432, 1.38774, 1.22249, 1.37423, 1.73095], 306), Ensemble of Decision Trees
Trees:      30
Avg Leaves: 1366.4333333333334
Avg Depth:  25.1)

In [150]:
# compute errors on validation set
output_regr2 = ScikitLearn.predict(regr_2, X_val_2015);
output_regr3 = ScikitLearn.predict(regr_3, X_val_2015);
output_regr4 = ScikitLearn.predict(regr_4, X_val_2015);

# errors for regr2: DecisionTreeRegressor
MAE_regr2 = mean_abs_err(output_regr2, y_val_2015);
MAPD_regr2 = mean_abs_percent(output_regr2, y_val_2015);

# errors for regr3: DecisionTreeRegressor with pruning purity threshold of 0.05
MAE_regr3 = mean_abs_err(output_regr3, y_val_2015);
MAPD_regr3 = mean_abs_percent(output_regr3, y_val_2015);

# errors for regr4: Random Forest with n=30 trees
MAE_regr4 = mean_abs_err(output_regr4, y_val_2015);
MAPD_regr4 = mean_abs_percent(output_regr4, y_val_2015);

Overview of models for the 2015 data (all trained on the training set):
* model 1: regression tree built with build_tree from DecisionTree.jl, averaging 10 samples per leaf
* model 2: full regression tree (DecisionTreeRegressor with default settings), fit through the ScikitLearn.jl interface
* model 3: regression tree (DecisionTreeRegressor) with a pruning purity threshold of 0.05, fit through the ScikitLearn.jl interface
* model 4: random forest (RandomForestRegressor) with 30 trees, fit through the ScikitLearn.jl interface

In [156]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr1 MAPD_regr1
@printf "For model 2 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr2 MAPD_regr2
@printf "For model 3 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr3 MAPD_regr3
@printf "For model 4 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr4 MAPD_regr4

For model 1 on the validation set, the MAE is 69.342213 and the MAPD is 3.497738 
For model 2 on the validation set, the MAE is 70.966871 and the MAPD is 3.546807 
For model 3 on the validation set, the MAE is 94.943151 and the MAPD is 4.488609 
For model 4 on the validation set, the MAE is 64.523201 and the MAPD is 3.417253 
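
The reported validation errors can also be compared programmatically. The sketch below hard-codes the MAE values printed above and picks the model with the lowest one; the variable names are illustrative, and a single validation split is only a rough basis for choosing between models.

```julia
# Validation MAE for models 1 to 4, copied from the printed results.
mae_values = [69.342213, 70.966871, 94.943151, 64.523201]

# findmin returns the smallest value and its index.
best_mae, best_model = findmin(mae_values)

# Relative MAE reduction of the best model compared with model 1, in percent.
improvement = 100 * (mae_values[1] - best_mae) / mae_values[1]

println("Lowest validation MAE: model ", best_model, " (", best_mae, ")")
println("MAE reduction relative to model 1: ", improvement, " %")
```

On these numbers the random forest (model 4) has the lowest MAE and MAPD, while the pruned tree (model 3) has the highest.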
