# Lightning tour of MLJ

*For a more elementary introduction to MLJ, see [Getting
Started](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/).*

Install required packages (beginners needn't worry about
understanding this cell):

In [1]:
using Pkg
DIR = joinpath(@__DIR__, "src")
Pkg.activate(DIR)
include(joinpath(DIR, "instantiate.jl"))

  Activating environment at `~/Google Drive/Julia/HelloJulia/tutorials/lightning_tour/src/Project.toml`


In MLJ, a *model* is just a container for hyper-parameters, and nothing
more. Here we apply several kinds of model composition before
binding the resulting "meta-model" to data in a *machine*, which we
then evaluate using cross-validation.
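To make the model/machine split concrete, here is a minimal sketch in plain Julia. The names `ToyMeanShrinker`, `ToyMachine`, `toy_fit!` and `toy_predict` are hypothetical stand-ins invented for illustration and are not part of MLJ; the point is only that the model struct holds hyper-parameters, while anything learned from data lives in the machine.

```julia
# Hypothetical stand-ins for illustration only; these are NOT MLJ types.

# The "model": nothing but hyper-parameters.
Base.@kwdef mutable struct ToyMeanShrinker
    shrinkage::Float64 = 1.0
end

# The "machine": binds a model to data and stores what is learned.
mutable struct ToyMachine
    model::ToyMeanShrinker
    y::Vector{Float64}
    fitresult::Union{Nothing,Float64}   # learned parameter, initially absent
end
ToyMachine(model, y) = ToyMachine(model, y, nothing)

# Training writes to the machine and leaves the model untouched.
function toy_fit!(mach::ToyMachine)
    mach.fitresult = mach.model.shrinkage * sum(mach.y) / length(mach.y)
    return mach
end

# Prediction uses the learned parameter stored in the machine.
toy_predict(mach::ToyMachine, n::Integer) = fill(mach.fitresult, n)

toy_model = ToyMeanShrinker(shrinkage=0.5)
toy_mach = toy_fit!(ToyMachine(toy_model, [1.0, 2.0, 3.0]))
toy_predictions = toy_predict(toy_mach, 2)   # [1.0, 1.0]
```

Because the model carries no learned state, the same model instance can be bound to different data in different machines.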

Loading and instantiating a gradient tree-boosting model:

In [2]:
using MLJ
MLJ.color_off()

Booster = @load EvoTreeRegressor # loads code defining a model type
booster = Booster(max_depth=2)   # specify hyper-parameter at construction

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main.##405 /Users/anthony/.julia/packages/MLJModels/5itei/src/loading.jl:168
import EvoTrees ✔


EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @344

In [3]:
booster.nrounds=50               # or mutate post facto
booster

EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 50,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @344

This model is an example of an iterative model. As it stands, the
number of iterations `nrounds` is fixed.

### Composition 1: Wrapping the model to make it "self-iterating"

Let's create a new model that automatically learns the number of iterations,
using the `NumberSinceBest(3)` criterion, as applied to an
out-of-sample `l1` loss:

In [4]:
using MLJIteration
iterated_booster = IteratedModel(model=booster,
                                 resampling=Holdout(fraction_train=0.8),
                                 controls=[Step(2), NumberSinceBest(3), NumberLimit(300)],
                                 measure=l1,
                                 retrain=true)

DeterministicIteratedModel(
    model = EvoTreeRegressor(
            loss = EvoTrees.Linear(),
            nrounds = 50,
            λ = 0.0,
            γ = 0.0,
            η = 0.1,
            max_depth = 2,
            min_weight = 1.0,
            rowsample = 1.0,
            colsample = 1.0,
            nbins = 64,
            α = 0.5,
            metric = :mse,
            rng = MersenneTwister(444),
            device = "cpu"),
    controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],
    resampling = Holdout(
            fraction_train = 0.8,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    measure = LPLoss(
            p = 1),
    weights = nothing,
    class_weights = nothing,
    operation = MLJModelInterface.predict,
    retrain = true,
    check_measure = true,
    iteration_parameter = nothing,
    cache = true) @399
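The interplay of the three controls can be sketched in plain Julia. This is a toy loop under our reading of the controls, not MLJIteration's implementation: `Step(2)` adds two iterations per control cycle, `NumberSinceBest(3)` stops once three cycles have passed since the best loss so far, and `NumberLimit(300)` caps the number of cycles. The functions `toy_loss` and `toy_early_stopping` are hypothetical.

```julia
# Hypothetical out-of-sample loss as a function of the iteration count,
# with its minimum at 40 iterations.
toy_loss(n) = (n - 40)^2 / 1000 + 1.0

# Toy control loop; `patience` plays the role of n in NumberSinceBest(n).
function toy_early_stopping(loss; step=2, patience=3, limit=300)
    niter = 0          # total iterations trained so far
    best = Inf         # lowest loss seen so far
    best_niter = 0     # iteration count at which `best` occurred
    since_best = 0     # control cycles since `best` was last improved
    for cycle in 1:limit               # at most `limit` control cycles
        niter += step                  # train `step` more iterations
        current = loss(niter)
        if current < best
            best, best_niter, since_best = current, niter, 0
        else
            since_best += 1
        end
        since_best >= patience && break
    end
    return (best_niter=best_niter, stopped_at=niter)
end

toy_result = toy_early_stopping(toy_loss)   # (best_niter = 40, stopped_at = 46)
```

Note that with a step of 2, a limit of 300 cycles allows up to 600 iterations in this sketch.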

### Composition 2: Preprocess the input features

Combining the model with categorical feature encoding:

In [5]:
pipe = @pipeline ContinuousEncoder iterated_booster

Pipeline408(
    continuous_encoder = ContinuousEncoder(
            drop_last = false,
            one_hot_ordered_factors = false),
    deterministic_iterated_model = DeterministicIteratedModel(
            model = EvoTreeRegressor{Float64,…} @344,
            controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],
            resampling = Holdout @414,
            measure = LPLoss{Int64} @489,
            weights = nothing,
            class_weights = nothing,
            operation = MLJModelInterface.predict,
            retrain = true,
            check_measure = true,
            iteration_parameter = nothing,
            cache = true)) @230

### Composition 3: Wrapping the model to make it "self-tuning"

First, we define a hyper-parameter range for optimization of a
(nested) hyper-parameter:

In [6]:
max_depth_range = range(pipe,
                        :(deterministic_iterated_model.model.max_depth),
                        lower = 1,
                        upper = 10)

typename(MLJBase.NumericRange)(Int64, :(deterministic_iterated_model.model.max_depth), ... )
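The field `:(deterministic_iterated_model.model.max_depth)` is an ordinary Julia expression naming a nested property. As a rough illustration of how such an expression can be resolved against a nested object, here is a small recursive lookup in plain Julia. The helper `toy_getnested` and the named tuple `toy_pipe` are hypothetical; MLJ uses its own machinery for this.

```julia
# Resolve a nested property expression such as :(a.b.c) against `obj`.
# Julia parses :(a.b.c) as Expr(:., :(a.b), QuoteNode(:c)), so we recurse
# on the first argument and read the final property name from the second.
function toy_getnested(obj, ex)
    ex isa Symbol && return getproperty(obj, ex)
    inner = toy_getnested(obj, ex.args[1])
    return getproperty(inner, ex.args[2].value)
end

# A stand-in for the pipeline, built from nested named tuples.
toy_pipe = (deterministic_iterated_model = (model = (max_depth = 2,),),)

toy_depth = toy_getnested(toy_pipe, :(deterministic_iterated_model.model.max_depth))   # 2
```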

Now we can wrap the pipeline model in an optimization strategy to make
it "self-tuning":

In [7]:
self_tuning_pipe = TunedModel(model=pipe,
                              tuning=RandomSearch(),
                              ranges = max_depth_range,
                              resampling=CV(nfolds=3, rng=456),
                              measure=l1,
                              acceleration=CPUThreads(),
                              n=50)

DeterministicTunedModel(
    model = Pipeline408(
            continuous_encoder = ContinuousEncoder @733,
            deterministic_iterated_model = DeterministicIteratedModel{EvoTreeRegressor{Float64,…}} @399),
    tuning = RandomSearch(
            bounded = Distributions.Uniform,
            positive_unbounded = Distributions.Gamma,
            other = Distributions.Normal,
            rng = Random._GLOBAL_RNG()),
    resampling = CV(
            nfolds = 3,
            shuffle = true,
            rng = MersenneTwister(456)),
    measure = LPLoss(
            p = 1),
    weights = nothing,
    operation = MLJModelInterface.predict,
    range = NumericRange(
            field = :(deterministic_iterated_model.model.max_depth),
            lower = 1,
            upper = 10,
            origin = 5.5,
            unit = 4.5,
            scale = :linear),
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = 50,
    acceleration = CP
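Conceptually, random search draws candidate values from the range, scores each one, and keeps the best. The sketch below shows that idea in plain Julia for an integer range from 1 to 10 with 50 draws. It is a simplification: `toy_score` is a made-up stand-in for the cross-validated `l1` loss, and the way MLJ's `RandomSearch` actually samples a range may differ.

```julia
using Random

# Hypothetical stand-in for the cross-validated loss of a candidate depth.
toy_score(depth) = abs(depth - 4) + 0.1

# Draw 50 candidate depths uniformly from 1:10 with a seeded generator.
toy_rng = MersenneTwister(456)
toy_candidates = rand(toy_rng, 1:10, 50)

# Score every candidate and keep the one with the lowest score.
toy_scores = toy_score.(toy_candidates)
toy_best_depth = toy_candidates[argmin(toy_scores)]
```

With only ten distinct values available, many of the 50 draws are repeats; for a range this small, an exhaustive grid would cover the same ground.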

### Binding to data and evaluating performance

Generating a synthetic regression dataset (a table of features `X` and
a target `y`):

In [8]:
X, y = make_regression();

Binding the "self-tuning" pipeline model to data in a *machine* (which
will additionally store *learned* parameters):

In [9]:
mach = machine(self_tuning_pipe, X, y)

Machine{DeterministicTunedModel{RandomSearch,…},…} @614 trained 0 times; caches data
  args: 
    1:	Source @070 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @412 ⏎ `AbstractVector{Continuous}`


Evaluating the "self-tuning" pipeline model's performance using 5-fold
cross-validation (implies multiple layers of nested resampling):

In [10]:
evaluate!(mach,
          measures=[l1, l2],
          resampling=CV(nfolds=5, rng=123),
          acceleration=CPUThreads())

┌ Info: Performing evaluations using 5 threads.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/rN59G/src/resampling.jl:998


┌────────────────────┬───────────────┬──────────────────────────────────────┐
│ _.measure          │ _.measurement │ _.per_fold                           │
├────────────────────┼───────────────┼──────────────────────────────────────┤
│ LPLoss{Int64} @489 │ 0.198         │ [0.251, 0.187, 0.179, 0.198, 0.176]  │
│ LPLoss{Int64} @595 │ 0.0725        │ [0.1, 0.116, 0.0452, 0.0544, 0.0471] │
└────────────────────┴───────────────┴──────────────────────────────────────┘
_.per_observation = [[[0.313, 0.102, ..., 0.606], [0.0406, 0.373, ..., 0.0288], [0.152, 0.131, ..., 0.41], [0.303, 0.378, ..., 0.069], [0.34, 0.158, ..., 0.0683]], [[0.098, 0.0105, ..., 0.368], [0.00165, 0.139, ..., 0.000829], [0.023, 0.0171, ..., 0.168], [0.0916, 0.143, ..., 0.00476], [0.116, 0.0249, ..., 0.00467]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
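To get a feel for what the nested resampling costs, here is a back-of-envelope count of training runs for the settings used above. It rests on assumptions we have not verified against MLJ's internals: every one of the 50 candidates is evaluated on every inner fold with no caching or skipping of repeats, `train_best=true` adds one final pipeline fit per outer fold, and each `IteratedModel` fit with `retrain=true` trains the booster twice (once on the holdout split, once on all the data it receives).

```julia
outer_folds  = 5    # CV(nfolds=5) in evaluate!
n_candidates = 50   # n=50 in TunedModel
inner_folds  = 3    # CV(nfolds=3) in TunedModel
final_fits   = 1    # train_best=true: refit the best pipeline once

# Pipeline fits: per outer fold, every candidate on every inner fold,
# plus the final fit of the winning pipeline.
pipeline_fits = outer_folds * (n_candidates * inner_folds + final_fits)   # 755

# Assumed: each pipeline fit trains the booster twice (holdout run + retrain).
booster_runs_per_fit = 2
booster_runs = pipeline_fits * booster_runs_per_fit                       # 1510
```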


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*