# Machine Learning in Julia, JuliaCon 2020

A workshop introducing the machine learning toolbox
[MLJ](https://alan-turing-institute.github.io/MLJ.jl/stable/)

### Environment instantiation

The following line activates a Julia environment defined by the `Project.toml` and `Manifest.toml` files, which must be in the same directory as this file:

In [1]:
include(joinpath(@__DIR__, "setup.jl"))

 Activating environment at `~/Dropbox/Julia7/MLJ/MachineLearningInJulia2020/Project.toml`


## Part 1: Data Representation

> **Goals:**
> 1. Learn how MLJ specifies its data requirements using "scientific" types
> 2. Understand the options for representing tabular data
> 3. Learn how to inspect and fix the representation of data to meet MLJ requirements

### Scientific types

To help you focus on the intended *purpose* or *interpretation* of
data, MLJ models specify data requirements using *scientific types*,
instead of machine types. An example of a scientific type is
`OrderedFactor`. The other basic "scalar" scientific types are
illustrated below:

![](assets/scitypes.png)

A scientific type is an ordinary Julia type (so it can be used for
method dispatch, for example) but it usually has no instances. The
`scitype` function is used to articulate MLJ's convention about how
different machine types will be interpreted by MLJ models:

In [2]:
using MLJ
scitype(3.141)

Continuous

In [3]:
time = [2.3, 4.5, 4.2, 1.8, 7.1]
scitype(time)

AbstractArray{Continuous,1}
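Because scientific types are ordinary Julia types, they can take part in subtype checks and method dispatch. The following is a minimal sketch of that idea; `describe` is a hypothetical helper introduced only for illustration and is not part of MLJ.

```julia
using MLJ

# Scientific types form an ordinary type hierarchy
Continuous <: Infinite   # true
Count <: Infinite        # true
Multiclass <: Finite     # true

# `describe` is a hypothetical helper (not part of MLJ) that dispatches
# on the element scitype of a vector, rather than on its machine type
describe(v::AbstractVector) = describe(elscitype(v), v)
describe(::Type{<:Continuous}, v) = "continuous data"
describe(::Type{<:Count}, v)      = "count data"
describe(::Type, v)               = "something else"

describe([2.3, 4.5, 4.2])   # "continuous data"
describe([185, 153, 163])   # "count data"
describe(["who", "why"])    # "something else"
```

Dispatching on the scitype means the same method applies to any machine type that MLJ interprets as `Continuous`, not only to `Float64`.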

To fix data which MLJ is interpreting incorrectly, we use the
`coerce` method:

In [4]:
height = [185, 153, 163, 114, 180]
scitype(height)

AbstractArray{Count,1}

In [5]:
height = coerce(height, Continuous)

5-element Array{Float64,1}:
 185.0
 153.0
 163.0
 114.0
 180.0

Here's an example of data we want interpreted as `OrderedFactor`,
but which currently isn't:

In [6]:
exam_mark = ["rotten", "great", "bla",  missing, "great"]
scitype(exam_mark)

AbstractArray{Union{Missing, Textual},1}

In [7]:
exam_mark = coerce(exam_mark, OrderedFactor)

┌ Info: Trying to coerce from `Union{Missing, String}` to `OrderedFactor`.
│ Coerced to `Union{Missing,OrderedFactor}` instead.
└ @ MLJScientificTypes /Users/anthony/.julia/packages/MLJScientificTypes/wqfgN/src/convention/coerce.jl:126


5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "rotten"
 "great"
 "bla"
 missing
 "great"

In [8]:
levels(exam_mark)

3-element Array{String,1}:
 "bla"
 "great"
 "rotten"

Use `levels!` to put the classes in the right order:

In [9]:
levels!(exam_mark, ["rotten", "bla", "great"])
exam_mark[1] < exam_mark[2]

true

When subsampling, no levels are lost:

In [10]:
levels(exam_mark[1:2])

3-element Array{String,1}:
 "rotten"
 "bla"
 "great"

**Note on binary data.** There is no separate scientific type for binary
data. Binary data is `OrderedFactor{2}` if it has an intrinsic
"true" class (eg, "pass"/"fail") and `Multiclass{2}` otherwise (eg,
"male"/"female").
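As a rough illustration of the note on binary data, the two cases can be coerced as follows. The data here is made up for the example.

```julia
using MLJ

# Illustrative data only
result = coerce(["pass", "fail", "pass", "pass"], OrderedFactor)
gender = coerce(["male", "female", "female", "male"], Multiclass)

scitype(result)   # a vector scitype with element scitype OrderedFactor{2}
scitype(gender)   # a vector scitype with element scitype Multiclass{2}

# Levels default to lexicographic order, so "fail" precedes "pass" here;
# use `levels!` if the order needs changing
levels(result)
result[2] < result[1]   # true: "fail" < "pass"
```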

### Two-dimensional data

Whenever it makes sense, MLJ models generally expect two-dimensional
data to be *tabular*. All the tabular formats implementing the
[Tables.jl API](https://juliadata.github.io/Tables.jl/stable/) (see
this
[list](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md))
have a scientific type of `Table` and can be used with such models.

The simplest example of a table is the Julia native *column
table*, which is just a named tuple of equal-length vectors:

In [11]:
column_table = (h=height, e=exam_mark, t=time)

(h = [185.0, 153.0, 163.0, 114.0, 180.0],
 e = Union{Missing, CategoricalValue{String,UInt32}}["rotten", "great", "bla", missing, "great"],
 t = [2.3, 4.5, 4.2, 1.8, 7.1],)

In [12]:
scitype(column_table)

Table{Union{AbstractArray{Union{Missing, OrderedFactor{3}},1}, AbstractArray{Continuous,1}}}

Notice the `Table{K}` type parameter `K` encodes the scientific
types of the columns. (This is useful when comparing table scitypes
with `<:`). To inspect the individual column scitypes, we use the
`schema` method instead:

In [13]:
schema(column_table)

┌─────────┬─────────────────────────────────────────────────┬──────────────────────────────────┐
│ _.names │ _.types                                         │ _.scitypes                       │
├─────────┼─────────────────────────────────────────────────┼──────────────────────────────────┤
│ h       │ Float64                                         │ Continuous                       │
│ e       │ Union{Missing, CategoricalValue{String,UInt32}} │ Union{Missing, OrderedFactor{3}} │
│ t       │ Float64                                         │ Continuous                       │
└─────────┴─────────────────────────────────────────────────┴──────────────────────────────────┘
_.nrows = 5
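To make the remark about comparing table scitypes with `<:` concrete, here is a small sketch. It assumes the `Table(...)` type constructor from the scientific types package is available after `using MLJ`; the table itself is made up for illustration.

```julia
using MLJ

# Illustrative column table with one Continuous and one Count column
X = (h = [185.0, 153.0, 163.0], n = [2, 4, 1])

scitype(X) <: Table(Continuous, Count)   # true: every column is Continuous or Count
scitype(X) <: Table(Infinite)            # true: both are subtypes of Infinite
scitype(X) <: Table(Continuous)          # false: column `n` is Count
```

Checks of this kind are how a model's input requirement can be tested against a whole table at once, without listing the columns.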


Here are four other examples of tables:

In [14]:
row_table = [(a=1, b=3.4),
             (a=2, b=4.5),
             (a=3, b=5.6)]
schema(row_table)

┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ a       │ Int64   │ Count      │
│ b       │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 3
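Row tables and column tables hold the same information in different layouts. As a sketch, and assuming Tables.jl is available in the active environment (it is imported directly here, which the workshop code does not do), one can convert between the two:

```julia
import Tables

row_table = [(a=1, b=3.4),
             (a=2, b=4.5),
             (a=3, b=5.6)]

Tables.istable(row_table)              # true
cols = Tables.columntable(row_table)   # (a = [1, 2, 3], b = [3.4, 4.5, 5.6])
Tables.rowtable(cols) == row_table     # true: the round trip recovers the rows
```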


In [15]:
import DataFrames
df = DataFrames.DataFrame(column_table)

Row | h       | e       | t
----|---------|---------|--------
    | Float64 | Cat…?   | Float64
1   | 185.0   | rotten  | 2.3
2   | 153.0   | great   | 4.5
3   | 163.0   | bla     | 4.2
4   | 114.0   | missing | 1.8
5   | 180.0   | great   | 7.1


In [16]:
schema(df)

┌─────────┬─────────────────────────────────────────────────┬──────────────────────────────────┐
│ _.names │ _.types                                         │ _.scitypes                       │
├─────────┼─────────────────────────────────────────────────┼──────────────────────────────────┤
│ h       │ Float64                                         │ Continuous                       │
│ e       │ Union{Missing, CategoricalValue{String,UInt32}} │ Union{Missing, OrderedFactor{3}} │
│ t       │ Float64                                         │ Continuous                       │
└─────────┴─────────────────────────────────────────────────┴──────────────────────────────────┘
_.nrows = 5


In [17]:
using CSV
file = CSV.File(joinpath(DIR, "data", "horse.csv"));
schema(file) # (triggers a file read)

┌─────────────────────────┬─────────┬────────────┐
│ _.names                 │ _.types │ _.scitypes │
├─────────────────────────┼─────────┼────────────┤
│ surgery                 │ Int64   │ Count      │
│ age                     │ Int64   │ Count      │
│ rectal_temperature      │ Float64 │ Continuous │
│ pulse                   │ Int64   │ Count      │
│ respiratory_rate        │ Int64   │ Count      │
│ temperature_extremities │ Int64   │ Count      │
│ mucous_membranes        │ Int64   │ Count      │
│ capillary_refill_time   │ Int64   │ Count      │
│ pain                    │ Int64   │ Count      │
│ peristalsis             │

Most MLJ models do not accept a matrix in lieu of a table, but you can
wrap a matrix as a table:

In [18]:
matrix_table = MLJ.table(rand(2,3))
schema(matrix_table)

┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 2


Under the hood many algorithms convert tabular data to matrices. If
your table is a wrapped matrix like the above, then the compiler
will generally collapse the conversions to a no-op.
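The round trip between a matrix and its table wrapper can be sketched as follows, using `MLJ.matrix` to go back the other way. The `names` keyword of `MLJ.table` is used here on the assumption that it behaves as in the MLJ versions I am familiar with.

```julia
using MLJ

A = rand(2, 3)
tbl = MLJ.table(A)      # wrap the matrix as a table with columns x1, x2, x3
MLJ.matrix(tbl) == A    # true: unwrapping recovers the original entries

# Custom column names can be supplied with the `names` keyword
named = MLJ.table(A, names=[:a, :b, :c])
schema(named).names     # (:a, :b, :c)
```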

**Manipulating tabular data.** In this workshop we assume
familiarity with some kind of tabular data container (although it is
possible, in principle, to carry out the exercises without this).
For a quick-start introduction to `DataFrames`, see [this
tutorial](https://alan-turing-institute.github.io/DataScienceTutorials.jl/data/dataframe/).

### Fixing scientific types in tabular data

To show how we can correct the scientific types of data in tables,
we introduce a cleaned-up version of the UCI Horse Colic Data Set
(the cleaning workflow is described
[here](https://alan-turing-institute.github.io/DataScienceTutorials.jl/end-to-end/horse/#dealing_with_missing_values)):

In [19]:
using CSV
file = CSV.File(joinpath(DIR, "data", "horse.csv"));
horse = DataFrames.DataFrame(file); # convert to data frame without copying columns
first(horse, 4)

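Row | surgery | age   | rectal_temperature | pulse | respiratory_rate | temperature_extremities | mucous_membranes
----|---------|-------|--------------------|-------|------------------|-------------------------|-----------------
    | Int64   | Int64 | Float64            | Int64 | Int64            | Int64                   | Int64
1   | 2       | 1     | 38.5               | 66    | 66               | 3                       | 1
2   | 1       | 1     | 39.2               | 88    | 88               | 3                       | 4
3   | 2       | 1     | 38.3               | 40    | 40               | 1                       | 3
4   | 1       | 9     | 39.1               | 164   | 164              | 4                       | 6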


From [the UCI
docs](http://archive.ics.uci.edu/ml/datasets/Horse+Colic) we can
surmise how each variable ought to be interpreted (a step in our
workflow that cannot reliably be left to the computer):

variable                    | scientific type (interpretation)
----------------------------|-----------------------------------
`:surgery`                  | Multiclass
`:age`                      | Multiclass
`:rectal_temperature`       | Continuous
`:pulse`                    | Continuous
`:respiratory_rate`         | Continuous
`:temperature_extremities`  | OrderedFactor
`:mucous_membranes`         | Multiclass
`:capillary_refill_time`    | Multiclass
`:pain`                     | OrderedFactor
`:peristalsis`              | OrderedFactor
`:abdominal_distension`     | OrderedFactor
`:packed_cell_volume`       | Continuous
`:total_protein`            | Continuous
`:outcome`                  | Multiclass
`:surgical_lesion`          | OrderedFactor
`:cp_data`                  | Multiclass

Let's see how MLJ will actually interpret the data, as it is
currently encoded:

In [20]:
schema(horse)

┌─────────────────────────┬─────────┬────────────┐
│ _.names                 │ _.types │ _.scitypes │
├─────────────────────────┼─────────┼────────────┤
│ surgery                 │ Int64   │ Count      │
│ age                     │ Int64   │ Count      │
│ rectal_temperature      │ Float64 │ Continuous │
│ pulse                   │ Int64   │ Count      │
│ respiratory_rate        │ Int64   │ Count      │
│ temperature_extremities │ Int64   │ Count      │
│ mucous_membranes        │ Int64   │ Count      │
│ capillary_refill_time   │ Int64   │ Count      │
│ pain                    │ Int64   │ Count      │
│ peristalsis             │

As a first correction step, we can get MLJ to "guess" the
appropriate fix, using the `autotype` method:

In [21]:
autotype(horse)

Dict{Symbol,Type} with 11 entries:
  :abdominal_distension    => OrderedFactor
  :pain                    => OrderedFactor
  :surgery                 => OrderedFactor
  :mucous_membranes        => OrderedFactor
  :surgical_lesion         => OrderedFactor
  :outcome                 => OrderedFactor
  :capillary_refill_time   => OrderedFactor
  :age                     => OrderedFactor
  :temperature_extremities => OrderedFactor
  :peristalsis             => OrderedFactor
  :cp_data                 => OrderedFactor

Okay, this is not perfect, but a step in the right direction, which
we implement like this:

In [22]:
coerce!(horse, autotype(horse));
schema(horse)

┌─────────────────────────┬────────────────────────────────┬──────────────────┐
│ _.names                 │ _.types                        │ _.scitypes       │
├─────────────────────────┼────────────────────────────────┼──────────────────┤
│ surgery                 │ CategoricalValue{Int64,UInt32} │ OrderedFactor{2} │
│ age                     │ CategoricalValue{Int64,UInt32} │ OrderedFactor{2} │
│ rectal_temperature      │ Float64                        │ Continuous       │
│ pulse                   │ Int64                          │ Count            │
│ respiratory_rate        │ Int64                          │ Count            │
│ temperature_extremities │ CategoricalValue{Int64,UInt32} │ OrderedFactor{4} │
│ mucous_

All remaining `Count` data should be `Continuous`:

In [23]:
coerce!(horse, Count => Continuous);
schema(horse)

┌─────────────────────────┬────────────────────────────────┬──────────────────┐
│ _.names                 │ _.types                        │ _.scitypes       │
├─────────────────────────┼────────────────────────────────┼──────────────────┤
│ surgery                 │ CategoricalValue{Int64,UInt32} │ OrderedFactor{2} │
│ age                     │ CategoricalValue{Int64,UInt32} │ OrderedFactor{2} │
│ rectal_temperature      │ Float64                        │ Continuous       │
│ pulse                   │ Float64                        │ Continuous       │
│ respiratory_rate        │ Float64                        │ Continuous       │
│ temperature_extremities │ CategoricalValue{Int64,UInt32} │ OrderedFactor{4} │
│ mucous_

We'll correct the remaining truant entries manually:

In [24]:
coerce!(horse,
        :surgery               => Multiclass,
        :age                   => Multiclass,
        :mucous_membranes      => Multiclass,
        :capillary_refill_time => Multiclass,
        :outcome               => Multiclass,
        :cp_data               => Multiclass);
schema(horse)

┌─────────────────────────┬────────────────────────────────┬──────────────────┐
│ _.names                 │ _.types                        │ _.scitypes       │
├─────────────────────────┼────────────────────────────────┼──────────────────┤
│ surgery                 │ CategoricalValue{Int64,UInt32} │ Multiclass{2}    │
│ age                     │ CategoricalValue{Int64,UInt32} │ Multiclass{2}    │
│ rectal_temperature      │ Float64                        │ Continuous       │
│ pulse                   │ Float64                        │ Continuous       │
│ respiratory_rate        │ Float64                        │ Continuous       │
│ temperature_extremities │ CategoricalValue{Int64,UInt32} │ OrderedFactor{4} │
│ mucous_
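The same coercion steps can be tried on a small table without loading any file. The sketch below uses a made-up column table (the names echo the horse data but the values are illustrative) and the non-mutating `coerce`, which returns a new table rather than modifying its argument.

```julia
using MLJ

# Illustrative table only; every column starts out as Count
patients = (surgery = [2, 1, 2, 1],
            pulse   = [66, 88, 40, 164],
            pain    = [3, 1, 2, 2])

# Fix the categorical columns by name first ...
patients = coerce(patients, :surgery => Multiclass, :pain => OrderedFactor)

# ... then convert whatever is still Count to Continuous
patients = coerce(patients, Count => Continuous)

schema(patients).scitypes   # (Multiclass{2}, Continuous, OrderedFactor{3})
```

Doing the by-name coercions first matters: once the categorical columns are no longer `Count`, the type-to-type rule `Count => Continuous` leaves them alone.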

### Resources for Part 1

- From the MLJ manual:
   - [A preview of data type specification in MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#A-preview-of-data-type-specification-in-MLJ-1)
   - [Data containers and scientific types](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Data-containers-and-scientific-types-1)
   - [Working with Categorical Data](https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/)
- [Summary](https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/#Summary-of-the-MLJ-convention-1) of the MLJ convention for representing scientific types
- [MLJScientificTypes.jl](https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/)
- From Data Science Tutorials:
    - [Data interpretation: Scientific Types](https://alan-turing-institute.github.io/DataScienceTutorials.jl/data/scitype/)
    - [Horse colic data](https://alan-turing-institute.github.io/DataScienceTutorials.jl/end-to-end/horse/)
- [UCI Horse Colic Data Set](http://archive.ics.uci.edu/ml/datasets/Horse+Colic)

### Exercises for Part 1

#### Ex 1

Try to guess how each code snippet below will evaluate:

In [25]:
scitype(42)

Count

In [26]:
questions = ["who", "why", "what", "when"]
scitype(questions)

AbstractArray{Textual,1}

In [27]:
elscitype(questions)

Textual

In [28]:
t = (3.141, 42, "how")
scitype(t)

Tuple{Continuous,Count,Textual}

In [29]:
A = rand(2, 3)

2×3 Array{Float64,2}:
 0.949666  0.648992  0.968531
 0.657601  0.964264  0.621786

In [30]:
scitype(A)

AbstractArray{Continuous,2}

In [31]:
elscitype(A)

Continuous

In [32]:
using SparseArrays
Asparse = sparse(A)

2×3 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  0.949666
  [2, 1]  =  0.657601
  [1, 2]  =  0.648992
  [2, 2]  =  0.964264
  [1, 3]  =  0.968531
  [2, 3]  =  0.621786

In [33]:
scitype(Asparse)

AbstractArray{Continuous,2}

In [34]:
using CategoricalArrays
C1 = categorical(A)

2×3 CategoricalArray{Float64,2,UInt32}:
 0.9496659122609439  0.648991635558477   0.9685312128025074
 0.657600770969814   0.9642637988837086  0.6217855033748962

In [35]:
scitype(C1)

AbstractArray{Multiclass{6},2}

In [36]:
elscitype(C1)

Multiclass{6}

In [37]:
C2 = categorical(A, ordered=true)
scitype(C2)

AbstractArray{OrderedFactor{6},2}

In [38]:
v = [1, 2, missing, 4]
scitype(v)

AbstractArray{Union{Missing, Count},1}

In [39]:
elscitype(v)

Union{Missing, Count}

In [40]:
scitype(v[1:2])

AbstractArray{Union{Missing, Count},1}

Can you guess at the general behaviour of
`scitype` with respect to tuples, abstract arrays and missing
values? The answers are
[here](https://github.com/alan-turing-institute/ScientificTypes.jl#2-the-scitype-and-scitype-methods)
(ignore "Property 1").

#### Ex 2

Coerce the following vector to make MLJ recognize it as a vector of
ordered factors (with an appropriate ordering):

In [41]:
quality = ["good", "poor", "poor", "excellent", missing, "good", "excellent"]

7-element Array{Union{Missing, String},1}:
 "good"
 "poor"
 "poor"
 "excellent"
 missing
 "good"
 "excellent"

#### Ex 3 (fixing scitypes in a table)

Fix the scitypes for the [House Prices in King
County](https://mlr3gallery.mlr-org.com/posts/2020-01-30-house-prices-in-king-county/)
dataset:

In [42]:
file = CSV.File(joinpath(DIR, "data", "house.csv"));
house = DataFrames.DataFrame(file); # convert to data frame without copying columns
first(house, 4)

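Row | price    | bedrooms | bathrooms | sqft_living | sqft_lot | floors  | waterfront | view  | condition
----|----------|----------|-----------|-------------|----------|---------|------------|-------|----------
    | Float64  | Int64    | Float64   | Int64       | Int64    | Float64 | Int64      | Int64 | Int64
1   | 221900.0 | 3        | 1.0       | 1180        | 5650     | 1.0     | 0          | 0     | 3
2   | 538000.0 | 3        | 2.25      | 2570        | 7242     | 2.0     | 0          | 0     | 3
3   | 180000.0 | 2        | 1.0       | 770         | 10000    | 1.0     | 0          | 0     | 3
4   | 604000.0 | 4        | 3.0       | 1960        | 5000     | 1.0     | 0          | 0     | 5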


(Two features in the original data set have been deemed uninformative
and dropped, namely `:id` and `:date`. The original feature
`:yr_renovated` has been replaced by the `Bool` feature `:is_renovated`.)

## Part 2: Selecting, Training and Evaluating Models

> **Goals:**
> 1. Search MLJ's database of model metadata to identify model candidates for a supervised learning task.
> 2. Evaluate the performance of a model on a holdout set using the basic `fit!`/`predict` workflow.
> 3. Evaluate performance using other resampling strategies, such as cross-validation, in one line, using `evaluate!`
> 4. Plot a "learning curve", to inspect performance as a function of some model hyper-parameter, such as an iteration parameter

The "Hello World!" of machine learning is to classify Fisher's
famous iris data set. This time, we'll grab the data from
[OpenML](https://www.openml.org):

In [43]:
iris = OpenML.load(61); # a row table
iris = DataFrames.DataFrame(iris);
first(iris, 4)

Row | sepallength | sepalwidth | petallength | petalwidth | class
----|-------------|------------|-------------|------------|------------
    | Float64     | Float64    | Float64     | Float64    | SubStri…
1   | 5.1         | 3.5        | 1.4         | 0.2        | Iris-setosa
2   | 4.9         | 3.0        | 1.4         | 0.2        | Iris-setosa
3   | 4.7         | 3.2        | 1.3         | 0.2        | Iris-setosa
4   | 4.6         | 3.1        | 1.5         | 0.2        | Iris-setosa


**Goal.** To build and evaluate models for predicting the
`:class` variable, given the four remaining measurement variables.

### Step 1. Inspect and fix scientific types

In [44]:
schema(iris)

┌─────────────┬───────────────────┬────────────┐
│ _.names     │ _.types           │ _.scitypes │
├─────────────┼───────────────────┼────────────┤
│ sepallength │ Float64           │ Continuous │
│ sepalwidth  │ Float64           │ Continuous │
│ petallength │ Float64           │ Continuous │
│ petalwidth  │ Float64           │ Continuous │
│ class       │ SubString{String} │ Textual    │
└─────────────┴───────────────────┴────────────┘
_.nrows = 150


In [45]:
coerce!(iris, :class => Multiclass);
schema(iris)

┌─────────────┬─────────────────────────────────┬───────────────┐
│ _.names     │ _.types                         │ _.scitypes    │
├─────────────┼─────────────────────────────────┼───────────────┤
│ sepallength │ Float64                         │ Continuous    │
│ sepalwidth  │ Float64                         │ Continuous    │
│ petallength │ Float64                         │ Continuous    │
│ petalwidth  │ Float64                         │ Continuous    │
│ class       │ CategoricalValue{String,UInt32} │ Multiclass{3} │
└─────────────┴─────────────────────────────────┴───────────────┘
_.nrows = 150


### Step 2. Split data into input and target parts

Here's how we split the data into target and input features, which
is needed for MLJ supervised models. We randomize the data at the
same time:

In [46]:
y, X = unpack(iris, ==(:class), name->true; rng=123);
scitype(y)

AbstractArray{Multiclass{3},1}
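To see what `unpack` does on something smaller than the iris data, here is a sketch on a made-up three-row table. It mirrors the call above but omits `rng`, so the rows are not shuffled.

```julia
using MLJ

# Illustrative table: two measurement columns and a class label
tbl = (x1 = [1.0, 2.0, 3.0], x2 = [0.1, 0.2, 0.3], class = ["a", "b", "a"])

# Select :class first; the catch-all filter then takes the remaining columns.
# Omitting `rng` keeps the rows in their original order.
y, X = unpack(tbl, ==(:class), name -> true)

y                  # ["a", "b", "a"] (a vector, as only one column was selected)
schema(X).names    # (:x1, :x2)
```

The filters are applied in order and without replacement, which is why the catch-all `name -> true` does not pick up `:class` a second time.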

Do `?unpack` to learn more:

In [47]:
@doc unpack

```
t1, t2, ...., tk = unpack(table, f1, f2, ... fk; wrap_singles=false)
```

Split any Tables.jl compatible `table` into smaller tables (or vectors) `t1, t2, ..., tk` by making selections *without replacement* from the column names defined by the filters `f1`, `f2`, ..., `fk`. A *filter* is any object `f` such that `f(name)` is `true` or `false` for each column `name::Symbol` of `table`.

Whenever a returned table contains a single column, it is converted to a vector unless `wrap_singles=true`.

Scientific type conversions can be optionally specified (note semicolon):

```
unpack(table, t...; wrap_singles=false, col1=>scitype1, col2=>scitype2, ... )
```

### Example

```
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=[:A, :B])
julia> Z, XY = unpack(table, ==(:z), !=(:w);
               :x=>Continuous, :y=>Multiclass)
julia> XY
2×2 DataFrame
│ Row │ x       │ y            │
│     │ Float64 │ Categorical… │
├─────┼─────────┼──────────────┤
│ 1   │ 1.0     │ 'a'          │
│ 2   │ 2.0     │ 'b'          │

julia> Z
2-element Array{Float64,1}:
 10.0
 20.0
```


### On searching for a model

Here's how to see *all* models (not immediately useful):

In [48]:
kitchen_sink = models()

142-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = AffinityPropagation, package_name = ScikitLearn, ... )
 (name = AgglomerativeClustering, package_name = ScikitLearn, ... )
 (name = BaggingClassifier, package_name = ScikitLearn, ... )
 (name = BaggingRegressor, package_name = ScikitLearn, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianLDA, package_name = ScikitLearn, 

Each entry contains metadata for a model whose defining code is not yet loaded:

In [49]:
meta = kitchen_sink[3]

AdaBoost ensemble regression.
→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).
→ do `@load AdaBoostRegressor pkg="ScikitLearn"` to use the model.
→ do `?AdaBoostRegressor` for documentation.
(name = "AdaBoostRegressor",
 package_name = "ScikitLearn",
 is_supervised = true,
 docstring = "AdaBoost ensemble regression.\n→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).\n→ do `@load AdaBoostRegressor pkg=\"ScikitLearn\"` to use the model.\n→ do `?AdaBoostRegressor` for documentation.",
 hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing),
 hyperparameter_types = ("Any", "Int64", "Float64", "String", "Any"),
 hyperparameters = (:base_estimator, :n_estimators, :learning_rate, :loss, :random_state),
 implemented_methods = [:clean!, :fit, :fitted_params, :predict],
 is_pure_julia = false,
 is_wrapper = true,
 load_path = "MLJScikitLearnInterface.AdaBoostRegressor",
 package_license = "BSD",
 packag

In [50]:
targetscitype = meta.target_scitype

AbstractArray{Continuous,1}

In [51]:
scitype(y) <: targetscitype

false

So this model won't do. Let's find all pure Julia classifiers:

In [52]:
filt(meta) = AbstractVector{Finite} <: meta.target_scitype &&
        meta.is_pure_julia
models(filt)

16-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = DecisionTree, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = EvoTreeClassifier, package_name = EvoTrees, ... )
 (name = GaussianNBClassifier, package_name = NaiveBayes, ... )
 (name = KNNClassifier, package_name = NearestNeighbors, ... )
 (name = LDA, package_name = 

Find all models with "Classifier" in `name` (or `docstring`):

In [53]:
models("Classifier")

39-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BaggingClassifier, package_name = ScikitLearn, ... )
 (name = BernoulliNBClassifier, package_name = ScikitLearn, ... )
 (name = ComplementNBClassifier, package_name = ScikitLearn, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = DecisionTree, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DummyClassifier, package_name = ScikitLearn, ... )
 (name = EvoTreeClassifier, p

Find all (supervised) models that match my data!

In [54]:
models(matching(X, y))

42-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AdaBoostClassifier, package_name = ScikitLearn, ... )
 (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
 (name = BaggingClassifier, package_name = ScikitLearn, ... )
 (name = BayesianLDA, package_name = MultivariateStats, ... )
 (name = BayesianLDA, package_name = ScikitLearn, ... )
 (name = BayesianQDA, package_name = ScikitLearn, ... )
 (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = DecisionTree, ... )
 (name = DeterministicConstantClassifier, package_na
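The two kinds of query can be combined, since `matching` also accepts a model metadata entry as its first argument. The sketch below narrows the search to pure-Julia, probabilistic models compatible with some made-up data; the exact list returned depends on the version of the model registry installed.

```julia
using MLJ

# Illustrative data: a table of Continuous features and a Multiclass target
X = (x1 = rand(6), x2 = rand(6))
y = coerce(["a", "b", "a", "b", "a", "b"], Multiclass)

# Keep only pure-Julia, probabilistic models whose scitype requirements are met
task(meta) = matching(meta, X, y) &&
             meta.is_pure_julia &&
             meta.prediction_type == :probabilistic

candidates = models(task)
[m.name for m in candidates]
```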

### Step 3. Select and instantiate a model

In [55]:
model = @load NeuralNetworkClassifier

NeuralNetworkClassifier(
    builder = Short(
            n_hidden = 0,
            dropout = 0.5,
            σ = NNlib.σ),
    finaliser = NNlib.softmax,
    optimiser = Flux.Optimise.ADAM(0.001, (0.9, 0.999), IdDict{Any,Any}()),
    loss = Flux.crossentropy,
    epochs = 10,
    batch_size = 1,
    lambda = 0.0,
    alpha = 0.0,
    optimiser_changes_trigger_retraining = false) @631

In [56]:
info(model)

A neural network model for making probabilistic predictions of a `Multiclass` or `OrderedFactor` target, given a table of `Continuous` features. 
→ based on [MLJFlux](https://github.com/alan-turing-institute/MLJFlux.jl).
→ do `@load NeuralNetworkClassifier pkg="MLJFlux"` to use the model.
→ do `?NeuralNetworkClassifier` for documentation.
(name = "NeuralNetworkClassifier",
 package_name = "MLJFlux",
 is_supervised = true,
 docstring = "A neural network model for making probabilistic predictions of a `Multiclass` or `OrderedFactor` target, given a table of `Continuous` features. \n→ based on [MLJFlux](https://github.com/alan-turing-institute/MLJFlux.jl).\n→ do `@load NeuralNetworkClassifier pkg=\"MLJFlux\"` to use the model.\n→ do `?NeuralNetworkClassifier` for documentation.",
 hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing),
 hyperparameter_types = ("MLJFlux.Short", "typeof(NNlib.softmax)

In MLJ a *model* is just a struct containing hyper-parameters, and
that's all. A model does not store *learned* parameters. Models are
mutable:

In [57]:
model.epochs = 12

12

All models also have a keyword constructor, which works once `@load`
has been performed:

In [58]:
NeuralNetworkClassifier(epochs=12) == model

true
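The same behaviour can be checked without loading an external package. The sketch below uses `OneHotEncoder`, one of MLJ's built-in transformers, purely because it needs no `@load`; it is not otherwise part of this workflow, and its `drop_last` hyper-parameter is assumed to default to `false`.

```julia
using MLJ

# `OneHotEncoder` is a built-in transformer, used here only because it
# needs no `@load`; the same pattern applies to any model
encoder = OneHotEncoder()

encoder.drop_last = true                   # models are mutable
encoder == OneHotEncoder(drop_last=true)   # true: equality compares hyper-parameters
```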

### On fitting, predicting, and inspecting models

In MLJ a model and training/validation data are typically bound
together in a machine:

In [59]:
mach = machine(model, X, y)

Machine{NeuralNetworkClassifier{Short,…}} @005 trained 0 times.
  args: 
    1:	Source @273 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @347 ⏎ `AbstractArray{Multiclass{3},1}`


A machine stores *learned* parameters, among other things. We'll
train this machine on 70% of the data and evaluate on a 30% holdout
set. Let's start by dividing all row indices into `train` and `test`
subsets:

In [60]:
train, test = partition(eachindex(y), 0.7)

([1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  96, 97, 98, 99, 100, 101, 102, 103, 104, 105], [106, 107, 108, 109, 110, 111, 112, 113, 114, 115  …  141, 142, 143, 144, 145, 146, 147, 148, 149, 150])
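The split above can be mimicked in a few lines of plain Julia. This is only a sketch: `split_rows` is a hypothetical helper, it does not shuffle, and its rounding rule is an assumption that may differ from what `partition` does in edge cases.

```julia
# Split a collection of row indices at a given fraction, without shuffling.
# The rounding rule is an assumption and may differ from MLJ's `partition`.
function split_rows(rows, fraction)
    all_rows = collect(rows)
    n_first = round(Int, fraction * length(all_rows))
    return all_rows[1:n_first], all_rows[n_first+1:end]
end

train_rows, test_rows = split_rows(1:150, 0.7)
(length(train_rows), length(test_rows))   # (105, 45)
```

For 150 rows this reproduces the index ranges shown above: 1 to 105 for training and 106 to 150 for testing.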

In [61]:
fit!(mach, rows=train, verbosity=2)

┌ Info: Training Machine{NeuralNetworkClassifier{Short,…}} @005.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Loss is 1.132
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.091
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.062
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.042
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.024
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.013
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 1.003
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.9917
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.9854
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95


Machine{NeuralNetworkClassifier{Short,…}} @005 trained 1 time.
  args: 
    1:	Source @273 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @347 ⏎ `AbstractArray{Multiclass{3},1}`


After training, one can inspect the learned parameters:

In [62]:
fitted_params(mach)

(chain = Chain(Chain(Dense(4, 3, σ), Dropout(0.5), Dense(3, 3)), softmax),)

Everything else the user might be interested in is accessed from the
training *report*:

In [63]:
report(mach)

(training_losses = Any[1.1318434f0, 1.0909344f0, 1.0619767f0, 1.0417976f0, 1.024498f0, 1.0133653f0, 1.0034266f0, 0.99168724f0, 0.98537177f0, 0.97130626f0, 0.96130794f0, 0.9529781f0],)

Machines remember the last set of hyper-parameters used during fit,
which, in the case of iterative models, allows for a warm restart of
computations in the case that only the iteration parameter is
increased:

In [64]:
model.epochs = model.epochs + 4
fit!(mach, rows=train, verbosity=2)

┌ Info: Updating Machine{NeuralNetworkClassifier{Short,…}} @005.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Loss is 0.938
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.9288
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.9171
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.9071
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95


Machine{NeuralNetworkClassifier{Short,…}} @005 trained 2 times.
  args: 
    1:	Source @273 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @347 ⏎ `AbstractArray{Multiclass{3},1}`


By default (for this particular model) we can also increase the
optimiser's learning rate (`model.optimiser.eta`) without triggering
a cold restart:

In [65]:
model.epochs = model.epochs + 4
model.optimiser.eta = 10*model.optimiser.eta
fit!(mach, rows=train, verbosity=2)

┌ Info: Updating Machine{NeuralNetworkClassifier{Short,…}} @005.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Loss is 0.8051
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.7163
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.6627
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.5915
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95


Machine{NeuralNetworkClassifier{Short,…}} @005 trained 3 times.
  args: 
    1:	Source @273 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @347 ⏎ `AbstractArray{Multiclass{3},1}`


However, change any other parameter and training will restart from
scratch:

In [66]:
model.lambda = 0.001
fit!(mach, rows=train, verbosity=2)

┌ Info: Updating Machine{NeuralNetworkClassifier{Short,…}} @005.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Loss is 1.016
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.8518
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.766
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.6894
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.6453
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.6232
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.5906
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.5693
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.jl:95
┌ Info: Loss is 0.5476
└ @ MLJFlux /Users/anthony/.julia/packages/MLJFlux/HxJNU/src/core.j

Machine{NeuralNetworkClassifier{Short,…}} @005 trained 4 times.
  args: 
    1:	Source @273 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @347 ⏎ `AbstractArray{Multiclass{3},1}`
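The warm versus cold restart behaviour seen in the last three cells can be summarised by a small decision rule. The sketch below is hypothetical and is not MLJFlux's implementation: `restart_kind`, the `ignored` keyword and the named tuples are illustrative stand-ins for the hyper-parameters a machine remembers from its last fit.

```julia
# Hypothetical decision rule for an iterative model: compare the hyper-parameters
# recorded at the last fit (`old`) with the current ones (`new`).
function restart_kind(old::NamedTuple, new::NamedTuple;
                      iteration_parameter::Symbol = :epochs,
                      ignored = (:eta,))
    old == new && return :no_retraining
    for name in keys(old)
        name == iteration_parameter && continue
        name in ignored && continue
        old[name] == new[name] || return :cold_restart
    end
    # Only the iteration parameter (or an ignored one) changed.
    if new[iteration_parameter] >= old[iteration_parameter]
        return :warm_restart
    end
    return :cold_restart
end

last_fit = (epochs = 12, eta = 0.001, lambda = 0.0)
restart_kind(last_fit, (epochs = 16, eta = 0.001, lambda = 0.0))    # :warm_restart
restart_kind(last_fit, (epochs = 16, eta = 0.01,  lambda = 0.0))    # :warm_restart
restart_kind(last_fit, (epochs = 12, eta = 0.001, lambda = 0.001))  # :cold_restart
```

The three calls mirror the three cells above: more epochs, more epochs with a larger learning rate, and a changed `lambda`.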


Let's train silently for a total of 50 epochs, and look at a prediction:

In [67]:
model.epochs = 50
fit!(mach, rows=train)
yhat = predict(mach, X[test,:]); # or predict(mach, rows=test)
yhat[1]

┌ Info: Updating Machine{NeuralNetworkClassifier{Short,…}} @005.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318


UnivariateFinite{Multiclass{3}}(Iris-setosa=>0.0607, Iris-versicolor=>0.559, Iris-virginica=>0.38)

What's going on here?

In [68]:
info(model).prediction_type

:probabilistic

**Important**:
- In MLJ, a model that can predict probabilities (and not just point values) will do so by default. (These models have supertype `Probabilistic`, while point-estimate predictors have supertype `Deterministic`.)
- For most probabilistic predictors, the predicted object is a `Distributions.Distribution` object, supporting the `Distributions.jl` [API](https://juliastats.org/Distributions.jl/latest/extends/#Create-a-Distribution-1) for such objects. In particular, the methods `rand`, `pdf`, `mode`, `median` and `mean` will apply, where appropriate.

So, to obtain the probability of "Iris-virginica" in the first test
prediction, we do

In [69]:
pdf(yhat[1], "Iris-virginica")

0.38038898f0

To get the most likely observation, we do

In [70]:
mode(yhat[1])

CategoricalValue{String,UInt32} "Iris-versicolor"
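For a finite distribution, these two operations amount to a lookup and an argmax. The following sketch uses plain vectors instead of a `UnivariateFinite` object; `toy_pdf` and `toy_mode` are hypothetical helpers, and the probabilities are the rounded values printed for `yhat[1]` above.

```julia
# A toy finite distribution: class labels paired with probabilities.
# The numbers are the rounded probabilities printed for `yhat[1]` above.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
probs  = [0.0607, 0.559, 0.38]

# Probability mass of a given label (the role played by `pdf`).
toy_pdf(labels, probs, label) = probs[findfirst(==(label), labels)]

# Most probable label (the role played by `mode`).
toy_mode(labels, probs) = labels[argmax(probs)]

toy_pdf(labels, probs, "Iris-virginica")   # 0.38
toy_mode(labels, probs)                    # "Iris-versicolor"
```

The real `pdf` and `mode` methods work on the distribution objects themselves, so no label vector needs to be passed around.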

These can be broadcast over multiple predictions in the usual way:

In [71]:
broadcast(pdf, yhat[1:4], "Iris-versicolor")

4-element Array{Float32,1}:
 0.55894643
 0.3722597
 0.015246802
 0.3320612

In [72]:
mode.(yhat[1:4])

4-element CategoricalArray{String,1,UInt32}:
 "Iris-versicolor"
 "Iris-virginica"
 "Iris-setosa"
 "Iris-virginica"

Or, alternatively, you can use the `predict_mode` operation instead
of `predict`:

In [73]:
predict_mode(mach, X[test,:])[1:4] # or predict_mode(mach, rows=test)[1:4]

4-element CategoricalArray{String,1,UInt32}:
 "Iris-versicolor"
 "Iris-virginica"
 "Iris-setosa"
 "Iris-virginica"

For a more conventional matrix of probabilities you can do this:

In [74]:
L = levels(y)
pdf(yhat, L)[1:4, :]

4×3 Array{Float32,2}:
 0.0606645  0.558946   0.380389
 0.0330928  0.37226    0.594648
 0.984753   0.0152468  5.52088f-14
 0.0282612  0.332061   0.639678

However, in a typical MLJ workflow, this is not as useful as you
might imagine. In particular, all probabilistic performance measures
in MLJ expect distribution objects in their first slot:

In [75]:
cross_entropy(yhat, y[test]) |> mean

0.37720057f0

To apply a deterministic measure, we first need to obtain point-estimates:

In [76]:
misclassification_rate(mode.(yhat), y[test])

0.13333333333333333
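To see why the two kinds of measure need different inputs, here is a sketch of both computations on made-up numbers (none of the values below come from the iris model). It assumes the per-observation cross-entropy is the negative log of the probability assigned to the observed class; MLJ's own implementation may additionally clamp probabilities.

```julia
using Statistics

# Illustrative numbers only: the probability each prediction assigns to the
# class that was actually observed.
prob_of_observed = [0.9, 0.5, 0.8, 0.2]

# Cross-entropy per observation is the negative log of that probability.
per_observation = -log.(prob_of_observed)
mean_cross_entropy = mean(per_observation)

# A deterministic measure compares point predictions with observations.
predicted = ["a", "b", "a", "b"]
observed  = ["a", "a", "a", "a"]
error_rate = mean(predicted .!= observed)   # 0.5
```

The first computation needs probabilities, the second only needs labels, which is why `mode` is applied before calling `misclassification_rate`.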

We note in passing that there is also a search tool for measures
analogous to `models`:

In [77]:
measures(matching(y))

6-element Array{NamedTuple{(:name, :target_scitype, :supports_weights, :prediction_type, :orientation, :reports_each_observation, :aggregation, :is_feature_dependent, :docstring, :distribution_type),T} where T<:Tuple,1}:
 (name = accuracy, ...)
 (name = balanced_accuracy, ...)
 (name = cross_entropy, ...)
 (name = misclassification_rate, ...)
 (name = BrierScore{UnivariateFinite}, ...)
 (name = confusion_matrix, ...)

### Step 4. Evaluate the model performance

Naturally, MLJ provides boilerplate code for carrying out a model
evaluation with a lot less fuss. Let's repeat the performance
evaluation above and add an extra measure, `brier_score`:

In [78]:
evaluate!(mach, resampling=Holdout(fraction_train=0.7),
          measures=[cross_entropy, brier_score])

┌──────────────────────────────┬───────────────┬─────────────────┐
│ _.measure                    │ _.measurement │ _.per_fold      │
├──────────────────────────────┼───────────────┼─────────────────┤
│ cross_entropy                │ 0.377         │ Float32[0.377]  │
│ BrierScore{UnivariateFinite} │ -0.227        │ Float32[-0.227] │
└──────────────────────────────┴───────────────┴─────────────────┘
_.per_observation = [[[0.582, 0.988, ..., 0.0132]], [[-0.343, -0.749, ..., -0.000345]]]
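The negative sign on the Brier score reflects a "bigger is better" convention. As a sanity check, the sketch below recomputes the first per-observation values by hand from the probabilities printed earlier for the first holdout row. It assumes that the observed class for that row is Iris-versicolor and that the score is the negative sum of squared differences from the one-hot target; both assumptions happen to reproduce the `0.582` and `-0.343` shown above.

```julia
# Probabilities predicted for the first holdout row (from the matrix shown earlier).
p = [0.0606645, 0.558946, 0.380389]   # setosa, versicolor, virginica

# Assumption: the observed class for this row is Iris-versicolor.
onehot = [0.0, 1.0, 0.0]

cross_entropy_1 = -log(p[2])                 # ≈ 0.582
brier_score_1   = -sum((p .- onehot) .^ 2)   # ≈ -0.343
```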


Or applying cross-validation instead:

In [79]:
evaluate!(mach, resampling=CV(nfolds=6),
          measures=[cross_entropy, brier_score])



┌──────────────────────────────┬───────────────┬────────────────────────────────────────────────────────┐
│ _.measure                    │ _.measurement │ _.per_fold                                             │
├──────────────────────────────┼───────────────┼────────────────────────────────────────────────────────┤
│ cross_entropy                │ 0.277         │ Float32[0.28, 0.212, 0.287, 0.305, 0.349, 0.23]        │
│ BrierScore{UnivariateFinite} │ -0.139        │ Float32[-0.128, -0.101, -0.13, -0.168, -0.195, -0.112] │
└──────────────────────────────┴───────────────┴────────────────────────────────────────────────────────┘
_.per_observation = [[[0.262, 0.281, ..., 0.146], [0.0218, 0.474, ..., 0.0206], [0.136, 0.369, ..., 0.154], [0.00183, 0.48, ..., 0.00381], [0.35, 0.392, ..., 0.658], [0.00752, 0.352

Or Monte Carlo cross-validation (cross-validation repeated with
randomized folds):

In [80]:
e = evaluate!(mach, resampling=CV(nfolds=6, rng=123),
              repeats=3,
              measures=[cross_entropy, brier_score])



┌──────────────────────────────┬───────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ _.measure                    │ _.measurement │ _.per_fold                                                                                                                                              │
├──────────────────────────────┼───────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ cross_entropy                │ 0.306         │ Float32[0.202, 0.259, 0.279, 0.387, 0.361, 0.384, 0.372, 0.283, 0.33, 0.323, 0.309, 0.24, 0.367, 0.252, 0.214, 0.266, 0.339, 0.339]                     │
│ BrierScore{UnivariateFinite} │ -0.161        

One can access the following properties of the output `e` of an
evaluation: `measure`, `measurement`, `per_fold` (measurement for
each fold) and `per_observation` (measurement per observation, if
reported).
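To make `per_fold` and `measurement` concrete, here is a sketch of how six folds could be laid out and how the per-fold values combine. `cv_folds` is a hypothetical helper that uses contiguous, unshuffled folds, and treating the aggregate as the mean of the per-fold values is an assumption (it does agree with the `0.277` reported for `CV(nfolds=6)` above).

```julia
using Statistics

# Contiguous, unshuffled folds over the given rows (a simplifying assumption).
function cv_folds(rows::AbstractVector{Int}, nfolds::Int)
    n = length(rows)
    edges = [round(Int, k * n / nfolds) for k in 0:nfolds]
    return [(train = vcat(rows[1:edges[k]], rows[edges[k+1]+1:end]),
             test  = rows[edges[k]+1:edges[k+1]]) for k in 1:nfolds]
end

folds = cv_folds(collect(1:150), 6)
length(folds[1].test), length(folds[1].train)   # (25, 125)

# The per-fold cross-entropy values printed for `CV(nfolds=6)` above.
per_fold = Float32[0.28, 0.212, 0.287, 0.305, 0.349, 0.23]
aggregate = round(mean(per_fold), digits=3)     # 0.277f0
```

Each row appears in exactly one test fold, so every observation contributes one out-of-sample measurement.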

We finally note that you can restrict the rows of observations from
which train and test folds are drawn, by specifying `rows=...`. For
example, imagining the last 30% of target observations are `missing`
you might have a workflow like this:

In [81]:
train, test = partition(eachindex(y), 0.7)
mach = machine(model, X, y)
evaluate!(mach, resampling=CV(nfolds=6),
          measures=[cross_entropy, brier_score],
          rows=train)     # cv estimate, resampling from `train`
fit!(mach, rows=train)    # re-train using all of `train` observations
predict(mach, rows=test); # and predict missing targets

┌ Info: Creating subsamples from a subset of all rows. 
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/resampling.jl:336
┌ Info: Training Machine{NeuralNetworkClassifier{Short,…}} @839.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


### On learning curves

Since our model is an iterative one, we might want to inspect the
out-of-sample performance as a function of the iteration
parameter. For this we can use the `learning_curve` function (which,
incidentally, can be applied to any model hyper-parameter). This
starts by defining a one-dimensional range object for the parameter
(more on this when we discuss tuning in Part 4):

In [82]:
r = range(model, :epochs, lower=1, upper=60, scale=:log)
curve = learning_curve(mach,
                       range=r,
                       resampling=Holdout(fraction_train=0.7), # (default)
                       measure=cross_entropy)

┌ Info: Training Machine{ProbabilisticTunedModel{Grid,…}} @244.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Attempting to evaluate 22 models.
└ @ MLJTuning /Users/anthony/.julia/packages/MLJTuning/oLVRR/src/tuned_models.jl:474


(parameter_name = "epochs",
 parameter_scale = :log,
 parameter_values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11  …  17, 19, 22, 26, 30, 34, 39, 45, 52, 60],
 measurements = [1.0338083505630493, 0.9733019471168518, 0.8096140623092651, 0.6991646885871887, 0.6084268093109131, 0.5542607307434082, 0.5336317420005798, 0.5063855648040771, 0.47317537665367126, 0.44297146797180176  …  0.3798142671585083, 0.39341121912002563, 0.29644110798835754, 0.3708436191082001, 0.3389438986778259, 0.2866306006908417, 0.2980813682079315, 0.2841547727584839, 0.28716742992401123, 0.2718779444694519],)
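The 22 values of `epochs` tried above are consistent with a log-spaced grid that has been rounded to integers. The sketch below reproduces them in plain Julia. The grid resolution of 30 is an assumption, chosen because it yields the same values; it is not read from the output.

```julia
lower, upper = 1, 60
resolution = 30   # assumed grid resolution; not taken from the output above

# Evenly spaced points on a log scale, rounded to integers, duplicates removed.
grid = exp.(range(log(lower), log(upper), length=resolution))
epoch_values = unique(round.(Int, grid))

length(epoch_values)      # 22
epoch_values[1:10]        # [1, 2, 3, 4, 5, 6, 7, 8, 10, 11]
```

On a log scale the small values are densely sampled, which is why rounding collapses several grid points onto the same integer.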

using Plots
pyplot()
plt=plot(curve.parameter_values, curve.measurements)
xlabel!(plt, "epochs")
ylabel!(plt, "cross entropy on holdout set")
savefig("iris_learning_curve.png")

We will return to learning curves when we look at tuning in Part 4.

### Resources for Part 2

- From the MLJ manual:
    - [Getting Started](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/)
    - [Model Search](https://alan-turing-institute.github.io/MLJ.jl/dev/model_search/)
    - [Evaluating Performance](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/) (using `evaluate!`)
    - [Learning Curves](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_curves/)
    - [Performance Measures](https://alan-turing-institute.github.io/MLJ.jl/dev/performance_measures/) (loss functions, scores, etc)
- From Data Science Tutorials:
    - [Choosing and evaluating a model](https://alan-turing-institute.github.io/DataScienceTutorials.jl/getting-started/choosing-a-model/)
    - [Fit, predict, transform](https://alan-turing-institute.github.io/DataScienceTutorials.jl/getting-started/fit-and-predict/)

### Exercises for Part 2

#### Ex 4

(a) Identify all supervised MLJ models that can be applied (without
type coercion or one-hot encoding) to a supervised learning problem
with input features `X4` and target `y4` defined below:

In [83]:
import Distributions
poisson = Distributions.Poisson

age = 18 .+ 60*rand(10);
salary = coerce(rand([:small, :big, :huge], 10), OrderedFactor);
levels!(salary, [:small, :big, :huge]);
X4 = DataFrames.DataFrame(age=age, salary=salary)

n_devices(salary) = salary > :small ? rand(poisson(1.3)) : rand(poisson(2.9))
y4 = [n_devices(row.salary) for row in eachrow(X4)]

10-element Array{Int64,1}:
 1
 0
 0
 2
 0
 0
 1
 4
 3
 0

(b) What models can be applied if you coerce the salary to a
`Continuous` scitype?

#### Ex 5 (unpack)

After evaluating the following ...

In [84]:
data = (a = [1, 2, 3, 4],
     b = rand(4),
     c = rand(4),
     d = coerce(["male", "female", "female", "male"], OrderedFactor));
pretty(data)

using Tables
y, X, w = unpack(data, ==(:a),
                 name -> elscitype(Tables.getcolumn(data, name)) == Continuous,
                 name -> true);

┌───────┬──────────────────────┬─────────────────────┬─────────────────────────────────┐
│ a     │ b                    │ c                   │ d                               │
│ Int64 │ Float64              │ Float64             │ CategoricalValue{String,UInt32} │
│ Count │ Continuous           │ Continuous          │ OrderedFactor{2}                │
├───────┼──────────────────────┼─────────────────────┼─────────────────────────────────┤
│ 1     │ 0.6441721216226863   │ 0.6296110226768703  │ male                            │
│ 2     │ 0.21229097362580585  │ 0.9030090781929387  │ female                          │
│ 3     │ 0.012358733007368672 │ 0.10483128996037228 │ female                          │
│ 4     │ 0.2648144644130588   │ 0.7981670696077316  │ male                            │
└───────┴──────────────────────┴─────────────────────┴─────────────────────────────────┘


...attempt to guess the evaluations of the following:

In [85]:
y

4-element Array{Int64,1}:
 1
 2
 3
 4

In [86]:
pretty(X)

┌──────────────────────┬─────────────────────┐
│ b                    │ c                   │
│ Float64              │ Float64             │
│ Continuous           │ Continuous          │
├──────────────────────┼─────────────────────┤
│ 0.6441721216226863   │ 0.6296110226768703  │
│ 0.21229097362580585  │ 0.9030090781929387  │
│ 0.012358733007368672 │ 0.10483128996037228 │
│ 0.2648144644130588   │ 0.7981670696077316  │
└──────────────────────┴─────────────────────┘


In [87]:
w

4-element CategoricalArray{String,1,UInt32}:
 "male"
 "female"
 "female"
 "male"

#### Ex 6 (first steps in modelling Horse Colic)

(a) Suppose we want to predict the `:outcome` variable in the
Horse Colic study introduced in Part 1, based on the remaining
variables that are `Continuous` (one-hot encoding categorical
variables is discussed later in Part 3) *while ignoring the others*.
Extract from the `horse` data set (defined in Part 1) appropriate
input features `X` and target variable `y`. (Do not, however,
randomize the observations.)

(b) Create a 70:30 `train`/`test` split of the data and train a
`LogisticClassifier` model, from the `MLJLinearModels` package, on
the `train` rows. Use `lambda=100` and default values for the
other hyper-parameters. (Although one would normally standardize
(whiten) the continuous features for this model, do not do so here.)
After training:

- (i) Recalling that a logistic classifier (aka logistic regressor) is
  a linear-based model learning a *vector* of coefficients for each
  feature (one coefficient for each target class), use the
  `fitted_params` method to find this vector of coefficients in the
  case of the `:pulse` feature. (To convert a vector of pairs `v =
  [x1 => y1, x2 => y2, ...]` into a dictionary, do `Dict(v)`.)

- (ii) Evaluate the `cross_entropy` performance on the `test`
  observations.

- &star;(iii) In how many `test` observations does the predicted
  probability of the observed class exceed 50%?

- (iv) Find the `misclassification_rate` in the `test`
  set. (*Hint.* As this measure is deterministic, you will either
  need to broadcast `mode` or use `predict_mode` instead of
  `predict`.)

(c) Instead use a `RandomForestClassifier` model from the
    `DecisionTree` package and:

- (i) Generate an appropriate learning curve to convince yourself
  that out-of-sample estimates of the `cross_entropy` loss do not
  substantially improve for `n_trees > 50`. Use default values for
  all other hyper-parameters, and feel free to use all available
  data to generate the curve.

- (ii) Fix `n_trees=90` and use `evaluate!` to obtain a 9-fold
  cross-validation estimate of the `cross_entropy`, restricting
  sub-sampling to the `train` observations.

- (iii) Now use *all* available data but set
  `resampling=Holdout(fraction_train=0.7)` to obtain a score you can
  compare with the `LogisticClassifier` score in part (b)(ii). Which
  model is better?

## Part 3 - Transformers and Pipelines

### Transformers

Unsupervised models, which receive no target `y` during training,
always have a `transform` operation. They sometimes also support an
`inverse_transform` operation, with obvious meaning, and sometimes
support a `predict` operation (e.g., some clustering
algorithms). Otherwise, they are handled much like supervised
models.

For an illustration, let's re-encode *all* of the King County House
input features (see [Ex 3](#ex-3-fixing-scitypes-in-a-table)) into a
set of `Continuous` features. We do this with the `ContinuousEncoder`
model, which, by default, will:

- one-hot encode all `Multiclass` features
- coerce all `OrderedFactor` features to `Continuous` ones
- coerce all `Count` features to `Continuous` ones (there aren't any)
- drop any remaining non-Continuous features (there won't be any of these)
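The first two rules in this list can be sketched in plain Julia as follows. `one_hot` and `ordinal_codes` are hypothetical helpers and the data is made up; the second helper assumes that an ordered factor is mapped to the position of each value in the ordered list of levels.

```julia
# One-hot encode a categorical column: one 0/1 column per level.
one_hot(column, levels) = [Float64(x == level) for x in column, level in levels]

# Replace each value of an ordered factor by the position of its level.
ordinal_codes(column, levels) = [Float64(findfirst(==(x), levels)) for x in column]

zipcode = ["98101", "98052", "98101"]                 # made-up values
encoded = one_hot(zipcode, ["98052", "98101"])        # 3×2 matrix: [0 1; 1 0; 0 1]

condition = ["poor", "good", "fair"]                  # made-up values
codes = ordinal_codes(condition, ["poor", "fair", "good"])   # [1.0, 3.0, 2.0]
```

One-hot encoding adds a column per level, which is why a `Multiclass` feature with many levels can widen a table considerably.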

First, we load a version of the data with scitypes already fixed:

In [88]:
file = CSV.File(joinpath(DIR, "data", "house.csv"));
house = DataFrames.DataFrame(file)
coerce!(house, autotype(file))
coerce!(house, Count => Continuous, :zipcode => Multiclass);
schema(house)

┌───────────────┬──────────────────────────────────┬───────────────────┐
│ _.names       │ _.types                          │ _.scitypes        │
├───────────────┼──────────────────────────────────┼───────────────────┤
│ price         │ Float64                          │ Continuous        │
│ bedrooms      │ CategoricalValue{Int64,UInt32}   │ OrderedFactor{13} │
│ bathrooms     │ CategoricalValue{Float64,UInt32} │ OrderedFactor{30} │
│ sqft_living   │ Float64                          │ Continuous        │
│ sqft_lot      │ Float64                          │ Continuous        │
│ floors        │ CategoricalValue{Float64,UInt32} │ OrderedFactor{6}  │
│ waterfront    │ CategoricalValue{Int64,UInt32}   │ Ord

In [89]:
y, X = unpack(house, ==(:price), name -> true, rng=123);

Instantiate the unsupervised model (transformer):

In [90]:
encoder = ContinuousEncoder() # a built-in model; no need to @load it

ContinuousEncoder(
    drop_last = false,
    one_hot_ordered_factors = false) @347

Bind the model to the data and fit!

In [91]:
mach = machine(encoder, X) |> fit!;

┌ Info: Training Machine{ContinuousEncoder} @616.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


Transform and inspect the result:

In [92]:
Xcont = transform(mach, X);
schema(Xcont)

┌────────────────┬─────────┬────────────┐
│ _.names        │ _.types │ _.scitypes │
├────────────────┼─────────┼────────────┤
│ bedrooms       │ Float64 │ Continuous │
│ bathrooms      │ Float64 │ Continuous │
│ sqft_living    │ Float64 │ Continuous │
│ sqft_lot       │ Float64 │ Continuous │
│ floors         │ Float64 │ Continuous │
│ waterfront     │ Float64 │ Continuous │
│ view           │ Float64 │ Continuous │
│ condition      │ Float64 │ Continuous │
│ grade          │ Float64 │ Continuous │
│ sqft_above     │ Float64 │ Continuous │
│ sqft_basement  │ Float64 │ Continuous │
│

Here's a list of the unsupervised models (mostly transformers) available through MLJ:

In [93]:
models(m->!m.is_supervised)

28-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = AffinityPropagation, package_name = ScikitLearn, ... )
 (name = AgglomerativeClustering, package_name = ScikitLearn, ... )
 (name = Birch, package_name = ScikitLearn, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DBSCAN, package_name = ScikitLearn, ... )
 (name = FeatureAgglomeration, package_name = ScikitLearn, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = ICA, package_name = MultivariateStats, ... )
 (name = KMeans, package_name = Clustering, ... )
 (name = KMeans, package_name = ParallelKMean

Some commonly used ones are built-in (do not require `@load`ing):

model type                  | does what?
----------------------------|----------------------------------------------
ContinuousEncoder           | transform input to a table of `Continuous` features (see above)
FeatureSelector             | retain or drop selected features
FillImputer                 | impute missing values
OneHotEncoder               | one-hot encode `Multiclass` (and optionally `OrderedFactor`) features
Standardizer                | standardize (whiten) the `Continuous` features in a table
UnivariateBoxCoxTransformer | apply a learned Box-Cox transformation to a vector
UnivariateDiscretizer       | discretize a `Continuous` vector, and hence render its elscitype `OrderedFactor`
UnivariateStandardizer      | standardize (whiten) a `Continuous` vector

### Pipelines

In [94]:
length(schema(Xcont).names)

87

Let's suppose that additionally we'd like to reduce the dimension of
our data.  A model that will do this is `PCA` from
`MultivariateStats`:

In [95]:
reducer = @load PCA

PCA(
    maxoutdim = nothing,
    method = :auto,
    pratio = 0.99,
    mean = nothing) @424

Now, rather than simply repeating the workflow above, applying the new
transformation to `Xcont`, we can combine both the encoding and the
dimension-reducing models into a single model, known as a
*pipeline*. While MLJ offers a powerful interface for composing
models in a variety of ways, we'll stick to this simplest class of
composite models for now. The easiest way to construct them is using
the `@pipeline` macro:

In [96]:
pipe = @pipeline encoder reducer

Pipeline772(
    continuous_encoder = ContinuousEncoder(
            drop_last = false,
            one_hot_ordered_factors = false),
    pca = PCA(
            maxoutdim = nothing,
            method = :auto,
            pratio = 0.99,
            mean = nothing)) @533

Notice that `pipe` is an *instance* of an automatically generated
type called `Pipeline???`.

The new model behaves like any other transformer:

In [97]:
mach = machine(pipe, X) |> fit!;
Xsmall = transform(mach, X)
schema(Xsmall)

┌ Info: Training Machine{Pipeline772} @731.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{ContinuousEncoder} @525.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{PCA} @851.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 21613
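Conceptually, fitting a linear pipeline means fitting each stage on the output of the previous one, and transforming means applying the learned stages in order. The sketch below illustrates this in plain Julia with two toy stages. `fit_center`, `fit_scale` and `fit_pipeline` are hypothetical and are not how `@pipeline` is implemented.

```julia
using Statistics

# Each "fit" function learns something from its input and returns a transform.
function fit_center(x)
    m = mean(x)
    return z -> z .- m
end

function fit_scale(x)
    s = std(x)
    return z -> z ./ s
end

# Fit the stages in order, feeding each stage the output of the previous one.
function fit_pipeline(fit_stages, x)
    transforms = Function[]
    for fit_stage in fit_stages
        t = fit_stage(x)
        push!(transforms, t)
        x = t(x)
    end
    return z -> foldl((acc, t) -> t(acc), transforms; init = z)
end

x = [1.0, 2.0, 3.0, 4.0]
pipeline_transform = fit_pipeline([fit_center, fit_scale], x)
x_out = pipeline_transform(x)     # centred and scaled copy of `x`
```

The composite behaves like a single transformer: one fit, then one transform call on new data.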


Want to combine this pre-processing with a ridge regressor?

In [98]:
rgs = @load RidgeRegressor pkg=MLJLinearModels
pipe2 = @pipeline encoder reducer rgs

Pipeline780(
    continuous_encoder = ContinuousEncoder(
            drop_last = false,
            one_hot_ordered_factors = false),
    pca = PCA(
            maxoutdim = nothing,
            method = :auto,
            pratio = 0.99,
            mean = nothing),
    ridge_regressor = RidgeRegressor(
            lambda = 1.0,
            fit_intercept = true,
            penalize_intercept = false,
            solver = nothing)) @438

Now our pipeline is a supervised model, instead of a transformer,
whose performance we can evaluate:

In [99]:
mach = machine(pipe2, X, y) |> fit!
evaluate!(mach, measure=mae, resampling=Holdout()) # CV(nfolds=6) is default

┌ Info: Training Machine{Pipeline780} @223.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{ContinuousEncoder} @238.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{PCA} @092.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{RidgeRegressor} @013.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


┌───────────┬───────────────┬────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────┼───────────────┼────────────┤
│ mae       │ 234000.0      │ [234000.0] │
└───────────┴───────────────┴────────────┘
_.per_observation = [missing]


### Training of composite models is "smart"

Now notice what happens if we train on all the data, then change a
regressor hyper-parameter and retrain:

In [100]:
fit!(mach)

┌ Info: Training Machine{Pipeline780} @223.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{ContinuousEncoder} @666.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{PCA} @559.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317
┌ Info: Training Machine{RidgeRegressor} @544.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


Machine{Pipeline780} @223 trained 3 times.
  args: 
    1:	Source @020 ⏎ `Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{70},1}, AbstractArray{OrderedFactor{6},1}, AbstractArray{OrderedFactor{13},1}, AbstractArray{OrderedFactor{30},1}, AbstractArray{OrderedFactor{5},1}, AbstractArray{OrderedFactor{12},1}, AbstractArray{OrderedFactor{2},1}}}`
    2:	Source @121 ⏎ `AbstractArray{Continuous,1}`


In [101]:
pipe2.ridge_regressor.lambda = 0.1
fit!(mach)

┌ Info: Updating Machine{Pipeline780} @223.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Not retraining Machine{ContinuousEncoder} @666. Use `force=true` to force.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:320
┌ Info: Not retraining Machine{PCA} @559. Use `force=true` to force.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:320
┌ Info: Updating Machine{RidgeRegressor} @544.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318


Machine{Pipeline780} @223 trained 4 times.
  args: 
    1:	Source @020 ⏎ `Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{70},1}, AbstractArray{OrderedFactor{6},1}, AbstractArray{OrderedFactor{13},1}, AbstractArray{OrderedFactor{30},1}, AbstractArray{OrderedFactor{5},1}, AbstractArray{OrderedFactor{12},1}, AbstractArray{OrderedFactor{2},1}}}`
    2:	Source @121 ⏎ `AbstractArray{Continuous,1}`


The second time, only the ridge regressor is retrained!

Mutate a hyper-parameter of the `PCA` model and every model except
the `ContinuousEncoder` (which comes before it) will be retrained:

In [102]:
pipe2.pca.pratio = 0.9999
fit!(mach)

┌ Info: Updating Machine{Pipeline780} @223.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Not retraining Machine{ContinuousEncoder} @666. Use `force=true` to force.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:320
┌ Info: Updating Machine{PCA} @559.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:318
┌ Info: Training Machine{RidgeRegressor} @544.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


Machine{Pipeline780} @223 trained 5 times.
  args: 
    1:	Source @020 ⏎ `Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{70},1}, AbstractArray{OrderedFactor{6},1}, AbstractArray{OrderedFactor{13},1}, AbstractArray{OrderedFactor{30},1}, AbstractArray{OrderedFactor{5},1}, AbstractArray{OrderedFactor{12},1}, AbstractArray{OrderedFactor{2},1}}}`
    2:	Source @121 ⏎ `AbstractArray{Continuous,1}`
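
The retraining pattern in the three logs above can be mimicked with a small toy in plain Julia. This is only a sketch of the idea (retrain a stage if its own hyper-parameter changed or if anything upstream was retrained), not MLJ's implementation; `ToyStage`, `refit!` and the numeric values other than `lambda = 0.1` and `pratio = 0.9999` are made up for illustration.

```julia
# Toy illustration only: this is not how MLJ implements machines.
mutable struct ToyStage
    name::String
    param::Float64         # current hyper-parameter value
    fitted_param::Float64  # value used at the last training (NaN = never trained)
end
ToyStage(name, param) = ToyStage(name, param, NaN)

# Retrain a stage if its hyper-parameter changed since the last training,
# or if any stage before it was retrained. Returns the names of retrained stages.
function refit!(stages::Vector{ToyStage})
    retrained = String[]
    upstream_changed = false
    for s in stages
        if upstream_changed || s.param != s.fitted_param
            s.fitted_param = s.param
            push!(retrained, s.name)
            upstream_changed = true
        end
    end
    return retrained
end

stages = [ToyStage("encoder", 0.0), ToyStage("pca", 0.95), ToyStage("ridge", 1.0)]

first_fit = refit!(stages)    # nothing trained yet, so every stage trains
stages[3].param = 0.1         # like `pipe2.ridge_regressor.lambda = 0.1`
second_fit = refit!(stages)   # only the last stage
stages[2].param = 0.9999      # like `pipe2.pca.pratio = 0.9999`
third_fit = refit!(stages)    # the changed stage and everything after it
```

Here `second_fit` contains only `"ridge"`, while `third_fit` contains `"pca"` and `"ridge"`, matching the "Not retraining" messages in the logs.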


### Inspecting composite models

The dot syntax used above to change the values of *nested*
hyper-parameters is also useful when inspecting the learned
parameters and report generated when training a composite model:

In [103]:
fitted_params(mach).ridge_regressor

(coefs = [:x1 => -0.7328956348956883, :x2 => -0.16590563202915296, :x3 => 194.59515890822158, :x4 => 102.7130175613619],
 intercept = 540085.6428739978,)

In [104]:
report(mach).pca

(indim = 87,
 outdim = 4,
 mean = [4.369869985656781, 8.459121824827651, 2079.899736269838, 15106.96756581687, 1.988617961412113, 1.0075417572757137, 1.2343034284921113, 3.4094295100171195, 6.6569194466293435, 1788.3906907879518  …  0.011798454633785222, 0.012122333780595013, 0.006292509138018786, 0.012955165872391617, 0.01466709850552908, 47.56005251931713, -122.21389640494186, 1986.5524915560081, 12768.455651691113, 1.9577106371165505],
 principalvars = [2.177071551045085e9, 2.841813972643024e8, 1.6850160830643424e6, 277281.83841321553],
 tprincipalvar = 2.463215246230865e9,
 tresidualvar = 157533.26199674606,
 tvar = 2.4633727794928617e9,)
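
The numbers in the PCA report can be cross-checked with a few lines of arithmetic. The sketch below copies the values from the output above and assumes that `pratio` is the minimum fraction of total variance the retained components must explain; under that assumption it shows why four components were kept after `pratio` was set to `0.9999`.

```julia
# Values copied from the `report(mach).pca` output above.
principalvars = [2.177071551045085e9, 2.841813972643024e8,
                 1.6850160830643424e6, 277281.83841321553]
tprincipalvar = 2.463215246230865e9
tresidualvar  = 157533.26199674606
tvar          = 2.4633727794928617e9

# Variance retained by the principal components.
retained = sum(principalvars)

# Fraction of total variance kept by the first k components, k = 1, ..., 4.
kept = cumsum(principalvars) ./ tvar

# Smallest number of components reaching the 0.9999 threshold
# (assumed meaning of `pratio`).
outdim = findfirst(r -> r >= 0.9999, kept)
```

Three components fall just short of the threshold, so the fourth is needed, which agrees with `outdim = 4` in the report.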

### Incorporating target transformations

Next, suppose that instead of using the raw `:price` as the
training target, we want to use the log-price (a common practice in
dealing with house price data). However, suppose that we still want
to report final *predictions* on the original linear scale (and use
these for evaluation purposes). Then we supply appropriate functions
to the keyword arguments `target` and `inverse`.

First we'll extend `log` and `exp` so that they act elementwise on arrays:

In [105]:
Base.log(v::AbstractArray) = log.(v)
Base.exp(v::AbstractArray) = exp.(v)

Now for the new pipeline:

In [106]:
pipe3 = @pipeline encoder reducer rgs target=log inverse=exp
mach = machine(pipe3, X, y)
evaluate!(mach, measure=mae)

┌ Info: Treating pipeline as a `Deterministic` predictor.
│ To override, specify `prediction_type=...` (options: :deterministic, :probabilistic, :interval). 
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/composition/models/pipelines.jl:394


┌───────────┬───────────────┬──────────────────────────────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold                                                   │
├───────────┼───────────────┼──────────────────────────────────────────────────────────────┤
│ mae       │ 162000.0      │ [160000.0, 161000.0, 164000.0, 159000.0, 173000.0, 157000.0] │
└───────────┴───────────────┴──────────────────────────────────────────────────────────────┘
_.per_observation = [missing]
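
To make the role of `target` and `inverse` concrete, here is a sketch in plain Julia of the same idea: train on the transformed target, predict, then map predictions back to the original scale before measuring error. The data and the least-squares fit are made up for illustration and stand in for the pipeline's regressor; this is not what `@pipeline` does internally.

```julia
using LinearAlgebra

# Toy data (made up): a target that grows exponentially with x.
x = collect(1.0:5.0)
y = exp.(0.5 .* x .+ 2.0)

target(v)  = log.(v)   # applied to y before training
inverse(v) = exp.(v)   # applied to predictions afterwards

# Train an ordinary least-squares line on the transformed target.
A = hcat(ones(length(x)), x)
coefs = A \ target(y)

# Predict on the log scale, then return to the original scale.
yhat = inverse(A * coefs)

# Mean absolute error, measured on the original scale.
mae_value = sum(abs.(yhat .- y)) / length(y)
```

Because the error is computed after applying `inverse`, it is directly comparable with the error of a model trained on the raw target.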


MLJ will even allow you to insert *learned* target
transformations. For example, we might want to apply
`UnivariateStandardizer()` to the target, to standardize it, or
`UnivariateBoxCoxTransformer()` to make it look Gaussian. Then
instead of specifying a *function* for `target`, we specify a model
(or model type). One does not specify `inverse` because these are
models that implement `inverse_transform` in addition to
`transform`.
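
What distinguishes a *learned* transformation is that its parameters come from the training target, and the same parameters are reused to invert predictions. A minimal sketch in plain Julia, assuming a simple standardizer; `ToyStandardizer`, `fit_toy`, `transform_toy` and `inverse_transform_toy` are hypothetical names and the prices are made up, so this is not MLJ's `UnivariateStandardizer`:

```julia
using Statistics

# Hypothetical stand-in for a learned target transformer (not an MLJ type).
struct ToyStandardizer
    mu::Float64
    sigma::Float64
end

# "Training" learns the mean and standard deviation of the target.
fit_toy(y) = ToyStandardizer(mean(y), std(y))

transform_toy(t::ToyStandardizer, y) = (y .- t.mu) ./ t.sigma
inverse_transform_toy(t::ToyStandardizer, z) = z .* t.sigma .+ t.mu

y = [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]  # made-up prices
t = fit_toy(y)
z = transform_toy(t, y)               # what a regressor would be trained on
y_back = inverse_transform_toy(t, z)  # how predictions return to the original scale
```

Since the inverse is determined by the learned parameters, there is nothing for the user to supply as `inverse`.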

## Part 4 - Tuning hyper-parameters

In [107]:
r = range(pipe3, :(ridge_regressor.lambda), lower = 1e-6, upper=10, scale=:log)

MLJBase.NumericRange(Float64, :(ridge_regressor.lambda), ... )

If you're curious, you can see what `lambda` values this range will
generate for a given resolution:

In [108]:
iterator(r, 10)

10-element Array{Float64,1}:
  1.0000000000000004e-6
  5.994842503189412e-6
  3.593813663804628e-5
  0.0002154434690031884
  0.0012915496650148838
  0.007742636826811276
  0.046415888336127795
  0.27825594022071254
  1.668100537200059
 10.000000000000002
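
The values above are equally spaced on a logarithmic axis. As a sketch of what `scale=:log` amounts to (a plain-Julia reconstruction, not MLJ's code), the same grid can be built by spacing the exponents evenly; the variable names are illustrative:

```julia
lower, upper, resolution = 1e-6, 10.0, 10

# Evenly spaced exponents between log10(lower) and log10(upper) ...
exponents = range(log10(lower), stop=log10(upper), length=resolution)

# ... mapped back to the original scale.
grid = exp10.(exponents)
```

Up to floating-point rounding, `grid` agrees with the output of `iterator(r, 10)`.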

## Solutions to exercises

#### Ex 2 solution

In [109]:
quality = coerce(quality, OrderedFactor);
levels!(quality, ["poor", "good", "excellent"]);
elscitype(quality)

┌ Info: Trying to coerce from `Union{Missing, String}` to `OrderedFactor`.
│ Coerced to `Union{Missing,OrderedFactor}` instead.
└ @ MLJScientificTypes /Users/anthony/.julia/packages/MLJScientificTypes/wqfgN/src/convention/coerce.jl:126


Union{Missing, OrderedFactor{3}}

#### Ex 3 solution

First pass:

In [110]:
coerce!(house, autotype(house));
schema(house)

┌───────────────┬──────────────────────────────────┬───────────────────┐
│ _.names       │ _.types                          │ _.scitypes        │
├───────────────┼──────────────────────────────────┼───────────────────┤
│ price         │ Float64                          │ Continuous        │
│ bedrooms      │ CategoricalValue{Int64,UInt32}   │ OrderedFactor{13} │
│ bathrooms     │ CategoricalValue{Float64,UInt32} │ OrderedFactor{30} │
│ sqft_living   │ Float64                          │ Continuous        │
│ sqft_lot      │ Float64                          │ Continuous        │
│ floors        │ CategoricalValue{Float64,UInt32} │ OrderedFactor{6}  │
│ waterfront    │ CategoricalValue{Int64,UInt32}   │ Ord

All the "sqft" fields are measured in square feet, so they are
really `Continuous`. We'll regard `:yr_built` (the other `Count`
variable above) as `Continuous` as well. So:

In [111]:
coerce!(house, Count => Continuous);

And `:zipcode` should not be ordered:

In [112]:
coerce!(house, :zipcode => Multiclass);
schema(house)

┌───────────────┬──────────────────────────────────┬───────────────────┐
│ _.names       │ _.types                          │ _.scitypes        │
├───────────────┼──────────────────────────────────┼───────────────────┤
│ price         │ Float64                          │ Continuous        │
│ bedrooms      │ CategoricalValue{Int64,UInt32}   │ OrderedFactor{13} │
│ bathrooms     │ CategoricalValue{Float64,UInt32} │ OrderedFactor{30} │
│ sqft_living   │ Float64                          │ Continuous        │
│ sqft_lot      │ Float64                          │ Continuous        │
│ floors        │ CategoricalValue{Float64,UInt32} │ OrderedFactor{6}  │
│ waterfront    │ CategoricalValue{Int64,UInt32}   │ Ord

`:bathrooms` looks like it has a lot of levels, but on further
inspection we see why, and `OrderedFactor` remains appropriate:

In [113]:
import StatsBase.countmap
countmap(house.bathrooms)

Dict{CategoricalValue{Float64,UInt32},Int64} with 30 entries:
  CategoricalValue{Float64,UInt32} 5.5 (22/30)  => 10
  CategoricalValue{Float64,UInt32} 6.5 (26/30)  => 2
  CategoricalValue{Float64,UInt32} 2.0 (8/30)   => 1930
  CategoricalValue{Float64,UInt32} 1.5 (6/30)   => 1446
  CategoricalValue{Float64,UInt32} 3.25 (13/30) => 589
  CategoricalValue{Float64,UInt32} 4.75 (19/30) => 23
  CategoricalValue{Float64,UInt32} 4.5 (18/30)  => 100
  CategoricalValue{Float64,UInt32} 6.75 (27/30) => 2
  CategoricalValue{Float64,UInt32} 0.0 (1/30)   => 10
  CategoricalValue{Float64,UInt32} 2.75 (11/30) => 1185
  CategoricalValue{Float64,UInt32} 3.5 (14/30)  => 731
  CategoricalValue{Float64,UInt32} 1.25 (5/30)  => 9
  CategoricalValue{Float64,UInt32} 6.25 (25/30) => 2
  CategoricalValue{Float64,UInt32} 8.0 (30/30)  => 2
  CategoricalValue{Float64,UInt32} 6.0 (24/30)  => 6
  CategoricalValue{Float64,UInt32} 5.25 (21/30) => 13
  CategoricalValue{Float64,UInt32} 4.0 (16/30)  => 136
  CategoricalVal

#### Ex 4 solution

4(a)

There are *no* models that apply immediately:

In [114]:
models(matching(X4, y4))

0-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}

4(b)

In [115]:
y4 = coerce(y4, Continuous);
models(matching(X4, y4))

4-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = DecisionTreeRegressor, package_name = DecisionTree, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = RandomForestRegressor, package_name = DecisionTree, ... )

#### Ex 6 solution

6(a)

In [116]:
y, X = unpack(horse,
              ==(:outcome),
              name -> elscitype(Tables.getcolumn(horse, name)) == Continuous);

6(b)(i)

In [117]:
model = @load LogisticClassifier pkg=MLJLinearModels;
model.lambda = 100
mach = machine(model, X, y)
fit!(mach, rows=train)
fitted_params(mach)

┌ Info: Training Machine{LogisticClassifier} @869.
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/machines.jl:317


(classes = CategoricalValue{Int64,UInt32}[1, 2, 3],
 coefs = Pair{Symbol,SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true}}[:rectal_temperature => [0.061700165020208884, -0.06507181615992094, 0.003371651139712025], :pulse => [-0.009584825599058816, 0.004022558646948241, 0.005562266952111324], :respiratory_rate => [-0.009584825599058816, 0.004022558646948241, 0.005562266952111324], :packed_cell_volume => [-0.0430937217634404, 0.020859863954344793, 0.02223385780909599], :total_protein => [0.02750875236570991, -0.06317268044006659, 0.03566392807435661]],
 intercept = [0.0008917387282688827, -0.0008917385123632456, -4.972412452088422],)

In [118]:
coefs_given_feature = Dict(fitted_params(mach).coefs)
coefs_given_feature[:pulse]

#6(b)(ii)

yhat = predict(mach, rows=test); # or predict(mach, X[test,:])
err = cross_entropy(yhat, y[test]) |> mean

0.7187276476280999
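
For intuition about the number above: the cross-entropy loss of one prediction is the negative log of the probability it assigns to the class actually observed. A minimal sketch on made-up data, with each probabilistic prediction represented as a plain vector of class probabilities (MLJ uses its own distribution type, and this sketch ignores any clamping of probabilities a library may apply):

```julia
# Made-up predicted probabilities for classes 1, 2, 3, and the observed classes.
yhat_toy = [[0.7, 0.2, 0.1],
            [0.1, 0.6, 0.3],
            [0.5, 0.25, 0.25]]
y_toy = [1, 3, 2]

# Per-observation loss: -log of the probability given to the observed class.
losses = [-log(probs[c]) for (probs, c) in zip(yhat_toy, y_toy)]

# Aggregate by averaging, as `|> mean` does in the solution above.
err_toy = sum(losses) / length(losses)
```

A confident correct prediction contributes a loss near zero, while a low probability on the observed class is penalized heavily.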

6(b)(iii)

The predicted probabilities of the actual observations in the test
set are given by

In [119]:
p = broadcast(pdf, yhat, y[test]);

The number of times this probability exceeds 50% is:

In [120]:
n50 = filter(x -> x > 0.5, p) |> length

30

Or, as a proportion:

In [121]:
n50/length(test)

0.6666666666666666

6(b)(iv)

In [122]:
misclassification_rate(mode.(yhat), y[test])

0.28888888888888886
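
The quantities in 6(b)(iii) and 6(b)(iv) can be reproduced by hand on a toy example. The sketch below uses made-up predictions stored as plain probability vectors over classes 1, 2, 3; indexing plays the role of `pdf` and `argmax` the role of `mode`. The names ending in `_toy` are illustrative only.

```julia
# Made-up predicted probabilities for classes 1, 2, 3, and the observed classes.
yhat_toy = [[0.7, 0.2, 0.1],
            [0.1, 0.6, 0.3],
            [0.5, 0.25, 0.25]]
y_toy = [1, 3, 2]

# Probability assigned to the observed class (analogue of `pdf`).
p_toy = [probs[c] for (probs, c) in zip(yhat_toy, y_toy)]

# How often that probability exceeds 50%.
n50_toy = count(x -> x > 0.5, p_toy)

# Most probable class for each prediction (analogue of `mode`).
modes_toy = argmax.(yhat_toy)

# Fraction of observations whose most probable class is wrong.
rate_toy = count(modes_toy .!= y_toy) / length(y_toy)
```

With three classes, the most probable class can have probability below 0.5, so a prediction may be correct without counting toward `n50`. That is why the proportion 0.667 found above need not equal one minus the misclassification rate (about 0.711).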

6(c)(i)

In [123]:
model = @load RandomForestClassifier pkg=DecisionTree
mach = machine(model, X, y)
evaluate!(mach, resampling=CV(nfolds=6), measure=cross_entropy)

r = range(model, :n_trees, lower=10, upper=70, scale=:log)



MLJBase.NumericRange(Int64, :n_trees, ... )

Since random forests are inherently randomized, we generate multiple
curves:

plt = plot()
for i in 1:4
    curve = learning_curve(mach,
                           range=r,
                           resampling=Holdout(),
                           measure=cross_entropy)
    plt=plot!(curve.parameter_values, curve.measurements)
end
xlabel!(plt, "n_trees")
ylabel!(plt, "cross entropy")

6(c)(ii)

In [124]:
evaluate!(mach, resampling=CV(nfolds=9),
                measure=cross_entropy,
                rows=train).measurement[1]

model.n_trees = 90

┌ Info: Creating subsamples from a subset of all rows. 
└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/r3heT/src/resampling.jl:336


90

6(c)(iii)

In [125]:
err_forest = evaluate!(mach, resampling=Holdout(),
                       measure=cross_entropy).measurement[1]

1.2903445418223982

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*