# Regression Trees with entire dataset

In [1]:
using DecisionTree, ScikitLearn, DataFrames, CSV, MLDataUtils



In [2]:
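# define error functions

# mean absolute error (MAE) between predictions and true values
function mean_abs_err(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0  # Float64 accumulator; avoids shadowing Base.sum
    for i = 1:n
        total += abs(y_output[i] - y_true[i])
    end
    return total / n
end

# mean absolute percentage deviation (MAPD), returned in percent;
# undefined when any entry of y_true is zero
function mean_abs_percent(y_output, y_true)
    n = size(y_output, 1)
    total = 0.0
    for i = 1:n
        total += abs((y_output[i] - y_true[i]) / y_true[i])
    end
    return 100 * total / n
end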

mean_abs_percent (generic function with 1 method)
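
As a quick sanity check of the two metrics, the sketch below recomputes them on three made-up values. The names `mae` and `mapd` are illustrative stand-ins for the functions above, and the toy numbers are assumptions, not data from TAP.

```julia
# Illustrative re-implementations (hypothetical names) of the two error metrics.
mae(y_pred, y_true) = sum(abs.(y_pred .- y_true)) / length(y_true)
mapd(y_pred, y_true) = 100 * sum(abs.((y_pred .- y_true) ./ y_true)) / length(y_true)

# Toy values, made up for illustration only.
y_true = [100.0, 100.0, 250.0]
y_pred = [110.0, 90.0, 200.0]

# Absolute errors are 10, 10, 50; relative errors are 0.1, 0.1, 0.2.
toy_mae = mae(y_pred, y_true)     # (10 + 10 + 50) / 3
toy_mapd = mapd(y_pred, y_true)   # 100 * (0.1 + 0.1 + 0.2) / 3
println("MAE = ", toy_mae, ", MAPD = ", toy_mapd, " %")
```

Because the percentage metric already multiplies by 100, a printed MAPD of 0.0096 reads as 0.0096 %, not 0.96 %.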

In [3]:
tap_train = readtable("TAP_train.csv");
#tap_test = readtable("TAP_test.csv");

In [10]:
# convert training data into array and shuffle observations by row
TAP = convert(Array, tap_train);
TAP = shuffleobs(TAP, obsdim = 1);

In [11]:
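# split into X and y
# NOTE: X_data keeps all 43 columns, including the target (column 43),
# so the features leak the target; TAP[:, 1:42] would exclude it
X_data = TAP[:,:];
y_data = TAP[:,43];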

In [12]:
# split the training data (80% of the full dataset) into training (75%) and validation (25%) sets
X_train, X_val = splitobs(X_data, at=0.75, obsdim=1);
y_train, y_val = splitobs(y_data, at=0.75, obsdim=1);
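
The same kind of split can be written in plain Julia. The sketch below uses a made-up 8×3 matrix whose last column plays the role of the target, drops that column from the features, and cuts both at the same row index. The names `toy`, `n_train`, and the rest are hypothetical, and the rows are assumed to be shuffled already.

```julia
# Toy data: 8 observations, 2 feature columns, target in the last column.
toy = reshape(collect(1.0:24.0), 8, 3)

# Separate features from the target so the target is not used as a feature.
X_toy = toy[:, 1:end-1]
y_toy = toy[:, end]

# Cut features and target at the same row so observations stay aligned.
n_train = floor(Int, 0.75 * size(toy, 1))
X_tr, X_va = X_toy[1:n_train, :], X_toy[n_train+1:end, :]
y_tr, y_va = y_toy[1:n_train], y_toy[n_train+1:end]

println(size(X_tr), " ", size(X_va), " ", length(y_tr), " ", length(y_va))
```

Splitting both arrays at one shared index keeps every feature row paired with its own target value.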

In [13]:
X_train

114578×43 SubArray{Real,2,Array{Real,2},Tuple{UnitRange{Int64},Base.Slice{Base.OneTo{Int64}}},false}:
 2013  1  0  0  1  48500.5  0  1  0  …  0  0  0  0  0  0  0  0  1   500.0  
 2005  1  0  0  1  13500.5  1  0  0     0  0  0  0  0  0  0  0  1  2263.46 
 2006  1  1  0  1   6500.5  1  0  0     0  0  0  0  0  0  0  0  1  5014.75 
 2004  1  0  0  1  73500.5  0  1  0     0  0  0  0  0  0  0  0  1   500.0  
 2005  1  0  1  1   2500.5  0  1  0     0  0  0  0  0  0  0  0  1  3920.45 
 2000  0  1  1  1   5500.5  0  0  0  …  0  0  0  0  0  0  1  0  1   277.812
 2003  1  1  1  1  16500.5  1  0  0     0  0  0  0  0  0  0  0  1  3234.61 
 2008  0  0  0  1   4500.5  0  0  0     0  0  0  0  0  0  0  1  1    75.0  
 2013  1  0  1  1  61500.5  0  1  0     0  0  0  0  0  0  0  0  1   500.0  
 2003  1  0  0  1  18500.5  1  0  0     0  0  0  0  0  0  0  0  1  1716.79 
 2007  1  0  1  1  50500.5  1  0  0  …  0  0  0  0  0  0  0  0  1   500.0  
 2002  1  1  1  1  18500.5  1  0  0     0  0  0  0  0  0  0  0

In [14]:
# convert arrays from real to Float64
X_train = convert(Array{Float64}, X_train);
X_val = convert(Array{Float64}, X_val);
y_train = convert(Array{Float64}, y_train);
y_val = convert(Array{Float64}, y_val);

## model 1: full regression tree, with average of 10 samples per leaf

In [15]:
# fit a single regression tree with build_tree, averaging `avg` samples per leaf
avg = 10
regr1 = build_tree(y_train, X_train, avg);

# test fit of model on validation set
output_regr1 = apply_tree(regr1, X_val);

# compute errors on validation set
MAE_regr1 = mean_abs_err(output_regr1, y_val);
MAPD_regr1 = mean_abs_percent(output_regr1, y_val);

In [17]:
regr1

Decision Tree
Leaves: 21852
Depth:  80

In [16]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f" MAE_regr1 MAPD_regr1

For model 1 on the validation set, the MAE is 0.091499 and the MAPD is 0.009618

## model 2: pruned tree

In [18]:
# prune tree with specified pruning threshold
pruning_threshold = 0.05
regr2 = prune_tree(regr1, 1-pruning_threshold)

# test fit of model on validation set
output_regr2 = apply_tree(regr2, X_val);

# compute errors on validation set
MAE_regr2 = mean_abs_err(output_regr2, y_val);
MAPD_regr2 = mean_abs_percent(output_regr2, y_val);

In [24]:
regr2

Decision Tree
Leaves: 12319
Depth:  25

In [19]:
@printf "For model 2 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr2 MAPD_regr2

For model 2 on the validation set, the MAE is 0.091934 and the MAPD is 0.009645 


## model 3: random forest

In [20]:
# build random forest: 20 random features, 30 trees,
# averaging 50 samples per leaf, 70% of samples per tree
numfeatures = 20
numtrees = 30
avg = 50
portion_samples = 0.7
regr3 = build_forest(y_train, X_train, numfeatures, numtrees, avg, portion_samples)

# test fit of model on validation set
output_regr3 = apply_forest(regr3, X_val);

# compute errors on validation set
MAE_regr3 = mean_abs_err(output_regr3, y_val);
MAPD_regr3 = mean_abs_percent(output_regr3, y_val);

In [25]:
regr3

Ensemble of Decision Trees
Trees:      30
Avg Leaves: 3088.233333333333
Avg Depth:  38.1

In [21]:
@printf "For model 3 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr3 MAPD_regr3

For model 3 on the validation set, the MAE is 4.636779 and the MAPD is 0.438844 


## overview of model errors

In [23]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr1 MAPD_regr1
@printf "For model 2 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr2 MAPD_regr2
@printf "For model 3 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr3 MAPD_regr3

For model 1 on the validation set, the MAE is 0.091499 and the MAPD is 0.009618 
For model 2 on the validation set, the MAE is 0.091934 and the MAPD is 0.009645 
For model 3 on the validation set, the MAE is 4.636779 and the MAPD is 0.438844 
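
The comparison above can also be made programmatically. A minimal sketch, assuming the three validation results are copied by hand from the printed output into a hypothetical `results` list; `lowest_mae` is an illustrative helper, not part of any package.

```julia
# (name, MAE, MAPD) copied from the validation output above.
results = [
    ("model 1", 0.091499, 0.009618),
    ("model 2", 0.091934, 0.009645),
    ("model 3", 4.636779, 0.438844),
]

# Return the name of the entry with the smallest MAE (hypothetical helper).
function lowest_mae(results)
    best = results[1]
    for r in results
        if r[2] < best[2]
            best = r
        end
    end
    return best[1]
end

for (name, mae, mapd) in results
    println(name, ": MAE = ", mae, ", MAPD = ", mapd)
end
println("lowest validation MAE: ", lowest_mae(results))
```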


## Test error

In [26]:
# import test set
tap_test = readtable("TAP_test.csv");
tap_test = convert(Array, tap_test);

In [27]:
# split data and convert to correct type
# NOTE: as with the training data, X_test keeps all 43 columns, including the target (column 43)
X_test = tap_test[:,:];
y_test = tap_test[:,43];

X_test = convert(Array{Float64}, X_test);
y_test = convert(Array{Float64}, y_test);

### model 1 on test set

In [28]:
# test fit of model on test set
output_test1 = apply_tree(regr1, X_test);

# compute errors on test set
MAE_test1 = mean_abs_err(output_test1, y_test);
MAPD_test1 = mean_abs_percent(output_test1, y_test);

In [29]:
@printf "For model 1 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test1 MAPD_test1

For model 1 on the test set, the MAE is 0.088761 and the MAPD is 0.017228


### model 2 on test set

In [30]:
# test fit of model on test set
output_test2 = apply_tree(regr2, X_test);

# compute errors on test set
MAE_test2 = mean_abs_err(output_test2, y_test);
MAPD_test2 = mean_abs_percent(output_test2, y_test);

In [31]:
@printf "For model 2 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test2 MAPD_test2

For model 2 on the test set, the MAE is 0.089073 and the MAPD is 0.017260


### model 3 on test set

In [34]:
# test fit of model on test set
output_test3 = apply_forest(regr3, X_test);

# compute errors on test set
MAE_test3 = mean_abs_err(output_test3, y_test);
MAPD_test3 = mean_abs_percent(output_test3, y_test);

In [35]:
@printf "For model 3 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test3 MAPD_test3

For model 3 on the test set, the MAE is 4.578450 and the MAPD is 0.472857


### overview of test errors

In [36]:
@printf "For model 1 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test1 MAPD_test1
@printf "For model 2 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test2 MAPD_test2
@printf "For model 3 on the test set, the MAE is %f and the MAPD is %f\n" MAE_test3 MAPD_test3

For model 1 on the test set, the MAE is 0.088761 and the MAPD is 0.017228
For model 2 on the test set, the MAE is 0.089073 and the MAPD is 0.017260
For model 3 on the test set, the MAE is 4.578450 and the MAPD is 0.472857


# IGNORE BELOW FOR NOW

## models 2, 3, and 4: using ScikitLearn

In [None]:
# fit three more models using ScikitLearn package

# initial parameters
numtrees = 30
pruning_threshold = 0.05

regr_2 = DecisionTreeRegressor()
#regr_3 = DecisionTreeRegressor(pruning_purity_threshold=pruning_threshold)
regr_4 = RandomForestRegressor(ntrees=numtrees)
ScikitLearn.fit!(regr_2, X_train, y_train)
#ScikitLearn.fit!(regr_3, X_train, y_train)
#ScikitLearn.fit!(regr_4, X_train, y_train)

In [None]:
# compute errors on validation set
output_regr2 = ScikitLearn.predict(regr_2, X_val);
#output_regr3 = ScikitLearn.predict(regr_3, X_val);
#output_regr4 = ScikitLearn.predict(regr_4, X_val);

# errors for regr2: DecisionTreeRegressor
MAE_regr2 = mean_abs_err(output_regr2, y_val);
MAPD_regr2 = mean_abs_percent(output_regr2, y_val);

# errors for regr3: DecisionTreeRegressor with pruning purity threshold of 0.05
#MAE_regr3 = mean_abs_err(output_regr3, y_val);
#MAPD_regr3 = mean_abs_percent(output_regr3, y_val);

# errors for regr4: Random Forest with n=30 trees
#MAE_regr4 = mean_abs_err(output_regr4, y_val);
#MAPD_regr4 = mean_abs_percent(output_regr4, y_val);

Overview of models trained on all data:
* model 1: regression tree trained on the training set using the DecisionTree.jl package, averaging 10 samples per leaf
* model 2: full regression tree trained on the training set using the ScikitLearn package
* model 3: regression tree trained on the training set using ScikitLearn, with a pruning purity threshold of 0.05
* model 4: random forest trained on the training set using ScikitLearn, with n=30 trees

In [None]:
@printf "For model 1 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr1 MAPD_regr1
@printf "For model 2 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr2 MAPD_regr2
@printf "For model 3 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr3 MAPD_regr3
@printf "For model 4 on the validation set, the MAE is %f and the MAPD is %f \n" MAE_regr4 MAPD_regr4