# Lightning tour of MLJ

*For a more elementary introduction to MLJ, see [Getting
Started](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/).*

Inspect Julia version:

In [1]:
VERSION

v"1.6.3"

The following instantiates a package environment.

The package environment has been created using **Julia 1.6** and may not
instantiate properly for other Julia versions.

In [2]:
using Pkg
Pkg.activate("env")
Pkg.instantiate()

  Activating environment at `~/GoogleDrive/Julia/HelloJulia/demos/machine_learning_in_julia/env/Project.toml`
Precompiling project...
  ✓ Missings
  ✓ LogExpFunctions
  ✓ Literate
  ✓ ChainRulesCore
  ✓ Tables
  ✓ QuadGK
  ✓ PersistenceDiagramsBase
  ✓ CategoricalArrays
  ✓ TimeZones
  ✓ StatsBase
  ✓ StructArrays
  ✓ LatinHypercubeSampling
  ✓ LossFunctions
  ✓ Memento
  ✓ SpecialFunctions
  ✓ PrettyTables
  ✓ JLSO
  ✓ StatsFuns
  ✓ ScientificTypes
  ✓ GeometryBasics
  ✓ Distributions
  ✓ NetworkLayout
  ✓ MLJBase
  ✓ MLJIteration

In MLJ a *model* is just a container for hyper-parameters, and that's
all. Here we will apply several kinds of model composition before
binding the resulting "meta-model" to data in a *machine* for
evaluation, using cross-validation.
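
The split between a model (hyper-parameters only) and a machine (model plus data plus learned state) can be sketched in plain Julia. The names `ToyRidge`, `ToyMachine` and `toyfit!` below are hypothetical and are not part of MLJ; they only mimic the idea for a one-feature ridge regression through the origin.

```julia
# Hypothetical illustration only: none of these names belong to MLJ.

# A "model" holds hyper-parameters and nothing else.
mutable struct ToyRidge
    lambda::Float64   # regularization strength
end

# A "machine" binds a model to data and stores what training learns.
mutable struct ToyMachine
    model::ToyRidge
    x::Vector{Float64}
    y::Vector{Float64}
    coef::Union{Nothing,Float64}   # learned parameter; `nothing` until trained
end

# Convenience constructor for an untrained machine.
ToyMachine(model, x, y) = ToyMachine(model, x, y, nothing)

# One-feature ridge regression through the origin:
# coef = sum(x .* y) / (sum(x .^ 2) + lambda)
function toyfit!(mach::ToyMachine)
    mach.coef = sum(mach.x .* mach.y) / (sum(mach.x .^ 2) + mach.model.lambda)
    return mach
end

model = ToyRidge(0.0)
mach = toyfit!(ToyMachine(model, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Mutating `model.lambda` changes nothing in `mach.coef` until `toyfit!` is called again, which is the sense in which the model itself carries no learned state.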

Loading and instantiating a gradient tree-boosting model:

In [3]:
using MLJ
MLJ.color_off()

Booster = @load EvoTreeRegressor # loads code defining a model type
booster = Booster(max_depth=2)   # specify hyper-parameter at construction

[ Info: Precompiling MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]
[ Info: For silent loading, specify `verbosity=0`. 
[ Info: Precompiling EvoTrees [f6006082-12f8-11e9-0c9c-0d5d367ab1e5]
import EvoTrees ✔


EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @569

In [4]:
booster.nrounds=50               # or mutate post facto
booster

EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 50,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @569

This model is an example of an iterative model. As it stands, the
number of iterations `nrounds` is fixed.

### Composition 1: Wrapping the model to make it "self-iterating"

Let's create a new model that automatically learns the number of iterations,
using the `NumberSinceBest(3)` criterion, as applied to an
out-of-sample `l1` loss:

In [5]:
using MLJIteration
iterated_booster = IteratedModel(model=booster,
                                 resampling=Holdout(fraction_train=0.8),
                                 controls=[Step(2), NumberSinceBest(3), NumberLimit(300)],
                                 measure=l1,
                                 retrain=true)

DeterministicIteratedModel(
    model = EvoTreeRegressor(
            loss = EvoTrees.Linear(),
            nrounds = 50,
            λ = 0.0,
            γ = 0.0,
            η = 0.1,
            max_depth = 2,
            min_weight = 1.0,
            rowsample = 1.0,
            colsample = 1.0,
            nbins = 64,
            α = 0.5,
            metric = :mse,
            rng = MersenneTwister(444),
            device = "cpu"),
    controls = Any[IterationControl.Step(2), EarlyStopping.NumberSinceBest(3), EarlyStopping.NumberLimit(300)],
    resampling = Holdout(
            fraction_train = 0.8,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    measure = LPLoss(
            p = 1),
    weights = nothing,
    class_weights = nothing,
    operation = MLJModelInterface.predict,
    retrain = true,
    check_measure = true,
    iteration_parameter = nothing,
    cache = true) @339
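
To make the `NumberSinceBest(3)` idea concrete, here is a minimal plain-Julia sketch of a "number since best" stopping rule applied to a sequence of out-of-sample losses. The function `stop_index` is a hypothetical helper and not the EarlyStopping.jl implementation; it assumes one loss value per control cycle, which in the cell above would mean one loss after each `Step(2)`.

```julia
# Hypothetical helper, not the EarlyStopping.jl implementation.
# Returns the index of the loss at which a "number since best" rule with
# patience `n` would trigger, or `nothing` if it never triggers.
function stop_index(losses::AbstractVector{<:Real}, n::Integer)
    best = Inf
    since_best = 0
    for (i, loss) in enumerate(losses)
        if loss < best
            best = loss        # new best: reset the counter
            since_best = 0
        else
            since_best += 1    # one more cycle without improvement
        end
        since_best == n && return i
    end
    return nothing
end

losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
stop_index(losses, 3)
```

In this made-up sequence the rule triggers before the final, lower loss is ever seen, which illustrates the usual trade-off of patience-based stopping.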

### Composition 2: Preprocess the input features

Combining the model with categorical feature encoding:

In [6]:
pipe = @pipeline ContinuousEncoder iterated_booster

Pipeline270(
    continuous_encoder = ContinuousEncoder(
            drop_last = false,
            one_hot_ordered_factors = false),
    deterministic_iterated_model = DeterministicIteratedModel(
            model = EvoTreeRegressor{Float64,…} @569,
            controls = Any[IterationControl.Step(2), EarlyStopping.NumberSinceBest(3), EarlyStopping.NumberLimit(300)],
            resampling = Holdout @890,
            measure = LPLoss{Int64} @015,
            weights = nothing,
            class_weights = nothing,
            operation = MLJModelInterface.predict,
            retrain = true,
            check_measure = true,
            iteration_parameter = nothing,
            cache = true)) @212

### Composition 3: Wrapping the model to make it "self-tuning"

First, we define a hyper-parameter range for optimization of a
(nested) hyper-parameter:

In [7]:
max_depth_range = range(pipe,
                        :(deterministic_iterated_model.model.max_depth),
                        lower = 1,
                        upper = 10)

typename(MLJBase.NumericRange)(Int64, :(deterministic_iterated_model.model.max_depth), ... )
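
The expression `:(deterministic_iterated_model.model.max_depth)` names a field nested inside the pipeline. As a rough sketch of how such a dotted expression can be resolved against a nested object, consider the following plain-Julia code. The function `nested_get` and the named tuple `toy_pipe` are hypothetical stand-ins, and this is not how MLJ itself necessarily implements the lookup.

```julia
# Hypothetical helper: resolve a dotted expression such as :(a.b.c) against an object.
nested_get(obj, name::Symbol) = getproperty(obj, name)

function nested_get(obj, ex::Expr)
    ex.head === :. || throw(ArgumentError("expected a dotted expression"))
    # :(a.b) is represented as Expr(:., :a, QuoteNode(:b))
    parent, field = ex.args[1], ex.args[2].value
    return getproperty(nested_get(obj, parent), field)
end

# Toy stand-in for the pipeline's nesting, built from named tuples.
toy_pipe = (deterministic_iterated_model = (model = (max_depth = 2,),),)

nested_get(toy_pipe, :(deterministic_iterated_model.model.max_depth))
```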

Now we can wrap the pipeline model in an optimization strategy to make
it "self-tuning":

In [8]:
self_tuning_pipe = TunedModel(model=pipe,
                              tuning=RandomSearch(),
                              ranges = max_depth_range,
                              resampling=CV(nfolds=3, rng=456),
                              measure=l1,
                              acceleration=CPUThreads(),
                              n=50)

DeterministicTunedModel(
    model = Pipeline270(
            continuous_encoder = ContinuousEncoder @436,
            deterministic_iterated_model = DeterministicIteratedModel{EvoTreeRegressor{Float64,…}} @339),
    tuning = RandomSearch(
            bounded = Distributions.Uniform,
            positive_unbounded = Distributions.Gamma,
            other = Distributions.Normal,
            rng = Random._GLOBAL_RNG()),
    resampling = CV(
            nfolds = 3,
            shuffle = true,
            rng = MersenneTwister(456)),
    measure = LPLoss(
            p = 1),
    weights = nothing,
    operation = MLJModelInterface.predict,
    range = NumericRange(
            field = :(deterministic_iterated_model.model.max_depth),
            lower = 1,
            upper = 10,
            origin = 5.5,
            unit = 4.5,
            scale = :linear),
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = 50,
    acceleration = Co

### Binding to data and evaluating performance

Generating a synthetic regression dataset (features `X` and target `y`)
with `make_regression`:

In [9]:
X, y = make_regression();

Binding the "self-tuning" pipeline model to data in a *machine* (which
will additionally store *learned* parameters):

In [10]:
mach = machine(self_tuning_pipe, X, y)

Machine{DeterministicTunedModel{RandomSearch,…},…} @001 trained 0 times; caches data
  args: 
    1:	Source @339 ⏎ `ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}`
    2:	Source @492 ⏎ `AbstractVector{ScientificTypesBase.Continuous}`


Evaluating the "self-tuning" pipeline model's performance using 5-fold
cross-validation (implies multiple layers of nested resampling):

In [11]:
evaluate!(mach,
          measures=[l1, l2],
          resampling=CV(nfolds=5, rng=123),
          acceleration=CPUThreads())

[ Info: Performing evaluations using 5 threads.


┌────────────────────┬───────────────┬────────────────────────────────────────┐
│ _.measure          │ _.measurement │ _.per_fold                             │
├────────────────────┼───────────────┼────────────────────────────────────────┤
│ LPLoss{Int64} @015 │ 0.271         │ [0.29, 0.153, 0.174, 0.262, 0.477]     │
│ LPLoss{Int64} @069 │ 0.204         │ [0.134, 0.0437, 0.0456, 0.0962, 0.701] │
└────────────────────┴───────────────┴────────────────────────────────────────┘
_.per_observation = [[[0.024, 0.0954, ..., 0.104], [0.141, 0.0524, ..., 0.0415], [0.147, 0.302, ..., 0.0775], [0.0997, 0.194, ..., 0.276], [1.46, 0.158, ..., 3.11]], [[0.000577, 0.00911, ..., 0.0108], [0.0199, 0.00275, ..., 0.00172], [0.0217, 0.0913, ..., 0.006], [0.00995, 0.0376, ..., 0.0761], [2.13, 0.025, ..., 9.67]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
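
For readers unfamiliar with k-fold cross-validation, the following plain-Julia sketch shows how row indices can be split into train/test pairs. The function `kfold_pairs` is a hypothetical helper, not MLJ's `CV` implementation, and unlike the `CV(nfolds=5, rng=123)` used above it does no shuffling.

```julia
# Hypothetical helper, not MLJ's CV implementation: split rows 1:n into k
# contiguous test folds (no shuffling) and pair each with its training rows.
function kfold_pairs(n::Integer, k::Integer)
    base, extra = divrem(n, k)      # the first `extra` folds get one extra row
    splits = Tuple{Vector{Int},Vector{Int}}[]
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)
        test = collect(start:start + len - 1)
        train = setdiff(1:n, test)  # every row not in the test fold
        push!(splits, (train, test))
        start += len
    end
    return splits
end

splits = kfold_pairs(10, 3)
```

In the evaluation above, each outer training set is itself resampled again: first by the 3-fold `CV` inside the `TunedModel`, then by the `Holdout` inside the `IteratedModel`. That is the nesting referred to in the text.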


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*