# DECISION TREE REGRESSION
---

```julia
VERSION # -> v"1.11.1"
```

In [1]:
cd(@__DIR__)

In [2]:
using Pkg; pkg"activate .."

[32m[1m  Activating[22m[39m project at `~/Work/git-repos/AI-ML-DL/jlai/Codes/Julia/Part-2`


Import libraries

In [3]:
using CSV, DataFrames
using MLJ

Load data from CSV file

In [4]:
df = CSV.read("../../Datasets/50_Startups.csv", DataFrame)
schema(df)

┌─────────────────┬────────────┬──────────┐
│ names           │ scitypes   │ types    │
├─────────────────┼────────────┼──────────┤
│ R&D Spend       │ Continuous │ Float64  │
│ Administration  │ Continuous │ Float64  │
│ Marketing Spend │ Continuous │ Float64  │
│ State           │ Textual    │ String15 │
│ Profit          │ Continuous │ Float64  │
└─────────────────┴────────────┴──────────┘


Design the features

In [5]:
X = df[!, 1:4]
colnames = ["rd", "admin", "spend", "state"]
rename!(X, Symbol.(colnames))
coerce!(X, :state => Multiclass)

 Row │ rd         admin      spend      state
     │ Float64    Float64    Float64    Cat…
─────┼─────────────────────────────────────────────
   1 │ 1.65349e5  1.36898e5  4.71784e5  New York
   2 │ 1.62598e5  1.51378e5  4.43899e5  California
   3 │ 1.53442e5  1.01146e5  4.07935e5  Florida
   4 │ 1.44372e5  1.18672e5  3.832e5    New York
   5 │ 1.42107e5  91391.8    3.66168e5  Florida
   6 │ 1.31877e5  99814.7    3.62861e5  New York
   7 │ 1.34615e5  1.47199e5  1.27717e5  California
   8 │ 1.30298e5  1.4553e5   3.23877e5  Florida
   9 │ 1.20543e5  148719.0   3.11613e5  New York
  10 │ 1.23335e5  1.08679e5  3.04982e5  California


Encoding the state column

In [6]:
ce = ContinuousEncoder()
X = machine(ce, X) |> fit! |> MLJ.transform

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mTraining machine(ContinuousEncoder(drop_last = false, …), …).


 Row │ rd         admin      spend      state__California  state__Florida  state__New York
     │ Float64    Float64    Float64    Float64            Float64         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 1.65349e5  1.36898e5  4.71784e5  0.0                0.0             1.0
   2 │ 1.62598e5  1.51378e5  4.43899e5  1.0                0.0             0.0
   3 │ 1.53442e5  1.01146e5  4.07935e5  0.0                1.0             0.0
   4 │ 1.44372e5  1.18672e5  3.832e5    0.0                0.0             1.0
   5 │ 1.42107e5  91391.8    3.66168e5  0.0                1.0             0.0
   6 │ 1.31877e5  99814.7    3.62861e5  0.0                0.0             1.0
   7 │ 1.34615e5  1.47199e5  1.27717e5  1.0                0.0             0.0
   8 │ 1.30298e5  1.4553e5   3.23877e5  0.0                1.0             0.0
   9 │ 1.20543e5  148719.0   3.11613e5  0.0                0.0             1.0
  10 │ 1.23335e5  1.08679e5  3.04982e5  1.0                0.0             0.0
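
The output above shows the `state` column replaced by one 0/1 column per level (one-hot encoding). As a rough illustration of that idea only, and not of how `ContinuousEncoder` is implemented, here is a minimal sketch in plain Julia. The helper `onehot_demo` and the sample labels are hypothetical.

```julia
# Hypothetical helper (not part of MLJ): one-hot encode a vector of labels.
# Returns a Dict mapping each level to a Float64 indicator vector.
function onehot_demo(labels::AbstractVector{<:AbstractString})
    levels = sort(unique(labels))
    # `labels .== level` gives a BitVector; convert it to 0.0/1.0 values
    return Dict(level => Float64.(labels .== level) for level in levels)
end

states_demo = ["New York", "California", "Florida", "New York"]
encoded_demo = onehot_demo(states_demo)

encoded_demo["New York"]    # [1.0, 0.0, 0.0, 1.0]
encoded_demo["California"]  # [0.0, 1.0, 0.0, 0.0]
```

Each row has exactly one indicator set to 1.0, which matches the pattern in the encoded table.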


Extract target vector

In [7]:
y = df.Profit

50-element Vector{Float64}:
 192261.83
 191792.06
 191050.39
 182901.99
 166187.94
 156991.12
 156122.51
 155752.6
 152211.77
 149759.96
 146121.95
 144259.4
 141585.52
      ⋮
  81229.06
  81005.76
  78239.91
  77798.83
  71498.49
  69758.98
  65200.33
  64926.08
  49490.75
  42559.73
  35673.41
  14681.4

Split the data into training and test sets

In [8]:
train, test = partition(eachindex(y), 0.8, shuffle=true, rng=123)
Xtrain, Xtest = X[train, :], X[test, :]
ytrain, ytest = y[train], y[test]

([141585.52, 192261.83, 81005.76, 156991.12, 96778.92, 69758.98, 78239.91, 96712.8, 14681.4, 125370.37  …  134307.35, 182901.99, 129917.04, 71498.49, 77798.83, 191050.39, 99937.59, 108552.04, 42559.73, 132602.65], [166187.94, 35673.41, 105008.31, 107404.34, 126992.93, 118474.03, 105733.54, 124266.9, 146121.95, 96479.51])
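
The call to `partition` shuffles the row indices and keeps 80% for training, which leaves 40 training rows and 10 test rows here. The sketch below shows the same idea with only the `Random` standard library. The helper `split_indices_demo` is hypothetical, its rounding rule is an assumption, and it will not reproduce the exact indices that MLJ selects for `rng=123`.

```julia
using Random

# Hypothetical helper: shuffle 1:n and cut it at the requested fraction.
function split_indices_demo(n::Integer, fraction::Real; rng=MersenneTwister(123))
    idx = shuffle(rng, 1:n)            # random permutation of the row indices
    ntrain = round(Int, fraction * n)  # assumption: round to the nearest integer
    return idx[1:ntrain], idx[ntrain+1:end]
end

train_idx_demo, test_idx_demo = split_indices_demo(50, 0.8)
length(train_idx_demo), length(test_idx_demo)  # (40, 10)
```

Passing a seeded generator makes the split reproducible across runs, which is the role of `rng=123` in the cell above.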

Load & instantiate the decision tree regression model

In [9]:
DTR = @load DecisionTreeRegressor pkg=DecisionTree

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFor silent loading, specify `verbosity=0`. 


import MLJDecisionTreeInterface ✔


MLJDecisionTreeInterface.DecisionTreeRegressor

In [10]:
dtr_ = DTR(max_depth=5, min_samples_split=3)

DecisionTreeRegressor(
  max_depth = 5, 
  min_samples_leaf = 5, 
  min_samples_split = 3, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  feature_importance = :impurity, 
  rng = Random.TaskLocalRNG())

For details, see [DecisionTree.jl](https://github.com/bensadeghi/DecisionTree.jl); the unwrapped model type is `MLJDecisionTreeInterface.DecisionTree.DecisionTreeRegressor`.

Train & fit

In [11]:
dtr = machine(dtr_, Xtrain, ytrain) |> fit!

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mTraining machine(DecisionTreeRegressor(max_depth = 5, …), …).


trained Machine; caches model-specific representations of data
  model: DecisionTreeRegressor(max_depth = 5, …)
  args: 
    1:	Source @628 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @829 ⏎ AbstractVector{Continuous}


In [12]:
println("Params of fitted model are $(fitted_params(dtr).tree)")

Params of fitted model are DecisionTree.InfoNode{Float64, Float64}(Decision Tree
Leaves: 7
Depth:  4, nchildren=2)
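
The fitted tree has 7 leaves, so it contains 6 internal split nodes. A common way for a regression tree to pick a split is to scan candidate thresholds on a feature and keep the one that minimises the summed squared error of the two children. The sketch below illustrates that criterion on a single feature. It is a simplified illustration, not DecisionTree.jl's actual code; the helpers `sse_demo` and `best_split_demo`, the midpoint thresholds, and the toy data are all assumptions.

```julia
using Statistics

# Sum of squared deviations from the mean of a node.
sse_demo(v) = isempty(v) ? 0.0 : sum(abs2, v .- mean(v))

# Hypothetical helper: try the midpoint between each pair of consecutive
# distinct feature values and return the threshold with the lowest cost.
function best_split_demo(x, y; min_samples_leaf=1)
    best_threshold, best_cost = NaN, Inf
    values = sort(unique(x))
    for i in 1:length(values)-1
        threshold = (values[i] + values[i+1]) / 2
        left = y[x .< threshold]
        right = y[x .>= threshold]
        # Skip splits that would leave a child with too few samples
        (length(left) < min_samples_leaf || length(right) < min_samples_leaf) && continue
        cost = sse_demo(left) + sse_demo(right)
        if cost < best_cost
            best_threshold, best_cost = threshold, cost
        end
    end
    return best_threshold, best_cost
end

x_demo = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_demo = [5.0, 6.0, 7.0, 50.0, 51.0, 52.0]
threshold_demo, cost_demo = best_split_demo(x_demo, y_demo)  # (6.5, 4.0)
```

In this sketch, `min_samples_leaf` plays the same role as the hyperparameter of that name: it rules out splits that would create very small leaves.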


Prediction

In [13]:
yhat_dtr = predict(dtr, Xtest)

10-element Vector{Float64}:
 182999.478
  59682.62777777778
 108446.73599999999
  98526.50999999998
 108446.73599999999
 108446.73599999999
 108446.73599999999
 133036.09
 133036.09
  87323.16
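
Several predictions above are identical (for example, 108446.736 appears four times). This is expected for a regression tree: every sample that lands in the same leaf receives that leaf's stored value, typically the mean of the training targets in the leaf. A minimal sketch with a single-split tree follows; the `StumpDemo` type, its helper functions, and the toy data are hypothetical.

```julia
using Statistics

# Hypothetical one-split tree: each leaf stores the mean of its training targets.
struct StumpDemo
    threshold::Float64
    left_mean::Float64
    right_mean::Float64
end

# Fit the two leaf means for a given threshold.
fit_stump_demo(x, y, threshold) =
    StumpDemo(threshold, mean(y[x .< threshold]), mean(y[x .>= threshold]))

# Route each sample to a leaf and return that leaf's mean.
predict_stump_demo(s::StumpDemo, x) =
    [xi < s.threshold ? s.left_mean : s.right_mean for xi in x]

x_fit_demo = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_fit_demo = [5.0, 6.0, 7.0, 50.0, 51.0, 52.0]
stump_demo = fit_stump_demo(x_fit_demo, y_fit_demo, 6.5)

preds_demo = predict_stump_demo(stump_demo, [0.5, 2.5, 9.0, 20.0])  # [6.0, 6.0, 51.0, 51.0]
```

A tree with 7 leaves can therefore produce at most 7 distinct predicted values.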

Results & metrics

In [14]:
println("Mean squared error is $(sum((yhat_dtr .- ytest).^2) / length(ytest))")

Mean squared error is 1.733555526993269e8
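
The value printed above is a mean squared error, so it is in squared units of Profit. Its square root, roughly 13,166, is easier to compare with the target values. The sketch below defines the mean squared error, its root, and the coefficient of determination in plain Julia. The helper names and the toy vectors are hypothetical, and MLJ ships its own measures that could be used instead.

```julia
using Statistics

# Hypothetical helpers, named with a _demo suffix to avoid clashing with MLJ exports.
mse_demo(yhat, y) = mean(abs2, yhat .- y)          # mean squared error
rmse_demo(yhat, y) = sqrt(mse_demo(yhat, y))       # root mean squared error
r2_demo(yhat, y) = 1 - sum(abs2, y .- yhat) / sum(abs2, y .- mean(y))  # R²

y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

mse_demo(y_pred_demo, y_true_demo)   # 0.5
rmse_demo(y_pred_demo, y_true_demo)  # ≈ 0.7071
r2_demo(y_pred_demo, y_true_demo)    # ≈ 0.9
```

With the notebook's variables, the same helpers could be called as `rmse_demo(yhat_dtr, ytest)` and `r2_demo(yhat_dtr, ytest)`.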
