# Machine Learning in Julia, JuliaCon2020

A workshop introducing the machine learning toolbox
[MLJ](https://alan-turing-institute.github.io/MLJ.jl/stable/).

### Set-up

Inspect Julia version:

In [3]:
VERSION

v"1.6.3"

The following instantiates a package environment.

The package environment has been created using **Julia 1.6** and may not
instantiate properly for other Julia versions.

In [4]:
using Pkg
Pkg.activate("env")
Pkg.instantiate()

  Activating environment at `~/.julia/dev/MLJTutorial/notebooks/01_data_representation/env/Project.toml`


## General resources

- [MLJ Cheatsheet](https://alan-turing-institute.github.io/MLJ.jl/dev/mlj_cheatsheet/)
- [Common MLJ Workflows](https://alan-turing-institute.github.io/MLJ.jl/dev/common_mlj_workflows/)
- [MLJ manual](https://alan-turing-institute.github.io/MLJ.jl/dev/)
- [Data Science Tutorials in Julia](https://juliaai.github.io/DataScienceTutorials.jl/)

## Part 1 - Data Representation

> **Goals:**
> 1. Learn how MLJ specifies its data requirements using "scientific" types
> 2. Understand the options for representing tabular data
> 3. Learn how to inspect and fix the representation of data to meet MLJ requirements

### Scientific types

To help you focus on the intended *purpose* or *interpretation* of
data, MLJ models specify data requirements using *scientific types*,
instead of machine types. An example of a scientific type is
`OrderedFactor`. The other basic "scalar" scientific types are
illustrated below:

![](https://github.com/ablaom/MachineLearningInJulia2020/blob/for-MLJ-version-0.16/assets/scitypes.png)

A scientific type is an ordinary Julia type (so it can be used for
method dispatch, for example) but it usually has no instances. The
`scitype` function is used to articulate MLJ's convention about how
different machine types will be interpreted by MLJ models:

In [5]:
using ScientificTypes
scitype(3.141)

Continuous

In [6]:
time = [2.3, 4.5, 4.2, 1.8, 7.1]
scitype(time)

AbstractVector{Continuous} (alias for AbstractArray{Continuous, 1})
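Because scientific types are ordinary Julia types, the value returned by `scitype` can drive method dispatch. The following is a minimal sketch of that idea; `describe_kind` is a hypothetical helper invented for this illustration and is not part of MLJ or ScientificTypes.

```julia
using ScientificTypes

# Hypothetical helper (not part of any package): branch on the
# scientific type of a value instead of its machine type.
describe_kind(x) = describe_kind(scitype(x), x)
describe_kind(::Type{<:Continuous}, x) = "continuous measurement"
describe_kind(::Type{<:Count}, x)      = "count"
describe_kind(::Type, x)               = "something else"  # fallback

describe_kind(3.141)    # Float64 is interpreted as Continuous
describe_kind(42)       # Int64 is interpreted as Count
describe_kind("hello")  # String is interpreted as Textual, so the fallback applies
```

The same pattern extends to any other scientific type by adding one method per type.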

To fix data which MLJ is interpreting incorrectly, we use the
`coerce` method:

In [7]:
height = [185, 153, 163, 114, 180]
scitype(height)

AbstractVector{Count} (alias for AbstractArray{Count, 1})

In [8]:
height = coerce(height, Continuous)

5-element Vector{Float64}:
 185.0
 153.0
 163.0
 114.0
 180.0

Here's an example of data we would want interpreted as
`OrderedFactor` but isn't:

In [9]:
exam_mark = ["rotten", "great", "bla",  missing, "great"]
scitype(exam_mark)

AbstractVector{Union{Missing, Textual}} (alias for AbstractArray{Union{Missing, Textual}, 1})

In [10]:
exam_mark = coerce(exam_mark, OrderedFactor)

┌ Info: Trying to coerce from `Union{Missing, String}` to `OrderedFactor`.
│ Coerced to `Union{Missing,OrderedFactor}` instead.
└ @ ScientificTypes /Users/anthony/.julia/packages/ScientificTypes/Vswzn/src/convention/coerce.jl:174


5-element CategoricalArrays.CategoricalArray{Union{Missing, String},1,UInt32}:
 "rotten"
 "great"
 "bla"
 missing
 "great"

In [11]:
levels(exam_mark)

3-element Vector{String}:
 "bla"
 "great"
 "rotten"

Use `levels!` to put the classes in the right order:

In [12]:
levels!(exam_mark, ["rotten", "bla", "great"])
exam_mark[1] < exam_mark[2]

true

When sub-sampling, no levels are lost:

In [13]:
levels(exam_mark[1:2])

3-element Vector{String}:
 "rotten"
 "bla"
 "great"
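The effect of level ordering on comparisons can be seen in a small self-contained sketch. The `shirt` data is made up for illustration, and `CategoricalArrays` is loaded explicitly on the assumption that it is available in the active environment.

```julia
using ScientificTypes, CategoricalArrays

shirt = coerce(["small", "large", "medium", "small"], OrderedFactor)

levels(shirt)            # default order is lexicographic: large, medium, small
shirt[1] < shirt[3]      # false: "small" currently ranks above "medium"

levels!(shirt, ["small", "medium", "large"])
shirt[1] < shirt[3]      # true: "small" now ranks below "medium"

# A sub-sample keeps the full pool of levels, including unobserved ones.
levels(shirt[1:1]) == ["small", "medium", "large"]
```

Setting the order once, immediately after coercion, avoids comparisons that silently use the lexicographic default.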

**Note on binary data.** There is no separate scientific type for
binary data. Binary data is `OrderedFactor{2}` or
`Multiclass{2}`. If a binary measure like `truepositive` is
applied to `OrderedFactor{2}` data then the "positive" class is assumed
to appear *second* in the ordering. If such a measure is applied to
`Multiclass{2}` data, a warning is issued. A single `OrderedFactor`
variable can be coerced to a single `Continuous` variable, for models
that require this, while a `Multiclass` variable can only be one-hot
encoded.
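To check or change which class counts as "positive", one can inspect the second level of an `OrderedFactor{2}` vector. A minimal sketch with made-up labels, again assuming `CategoricalArrays` is available in the environment:

```julia
using ScientificTypes, CategoricalArrays

y = coerce(["sick", "healthy", "healthy", "sick"], OrderedFactor)
scitype(y)        # AbstractVector{OrderedFactor{2}}
levels(y)[2]      # "sick": second in the default (lexicographic) order

# Make "healthy" the positive class instead by listing it second.
levels!(y, ["sick", "healthy"])
levels(y)[2]      # "healthy"
```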

### Two-dimensional data

Whenever it makes sense, MLJ models generally expect two-dimensional
data to be *tabular*. All the tabular formats implementing the
[Tables.jl API](https://juliadata.github.io/Tables.jl/stable/) (see
this
[list](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md))
have a scientific type of `Table` and can be used with such models.
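Whether an object counts as a table can be checked directly. The sketch below uses a named tuple of vectors (one of the table formats discussed in this section) and a plain matrix; the variable names are illustrative only.

```julia
using ScientificTypes, Tables

X = (x1 = [1.0, 2.0, 3.0], x2 = [10, 20, 30])  # a named tuple of equal-length vectors
M = rand(3, 2)                                  # a plain matrix

Tables.istable(X)        # true
scitype(X) <: Table      # true
Tables.istable(M)        # false
scitype(M) <: Table      # false
```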

Probably the simplest example of a table is the Julia native *column
table*, which is just a named tuple of equal-length vectors:

In [14]:
column_table = (h=height, e=exam_mark, t=time)

(h = [185.0, 153.0, 163.0, 114.0, 180.0], e = Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}["rotten", "great", "bla", missing, "great"], t = [2.3, 4.5, 4.2, 1.8, 7.1])

In [15]:
scitype(column_table)

Table{Union{AbstractVector{Union{Missing, OrderedFactor{3}}}, AbstractVector{Continuous}}}

Notice the `Table{K}` type parameter `K` encodes the scientific
types of the columns. (This is useful when comparing table scitypes
with `<:`). To inspect the individual column scitypes, we use the
`schema` method instead:

In [16]:
schema(column_table)

┌─────────┬──────────────────────────────────────────────────┬──────────────────
│ _.names │ _.types                                          │ _.scitypes      ⋯
├─────────┼──────────────────────────────────────────────────┼──────────────────
│ h       │ Float64                                          │ Continuous      ⋯
│ e       │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing,  ⋯
│ t       │ Float64                                          │ Continuous      ⋯
└─────────┴──────────────────────────────────────────────────┴──────────────────
                                                                1 column omitted
_.nrows = 5
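Comparing table scitypes with `<:` is most convenient using the `Table(...)` type constructor from ScientificTypes, which describes "a table whose columns each have one of the listed element scitypes". A small sketch on a made-up table `X`:

```julia
using ScientificTypes

X = (a = [1.0, 2.0], b = [3, 4])     # columns: Continuous and Count

scitype(X) <: Table(Continuous, Count)   # true: every column is one of these
scitype(X) <: Table(Infinite)            # true: both are subtypes of Infinite
scitype(X) <: Table(Continuous)          # false: column b is Count
```

A check of this kind is a quick way to confirm that a table meets a model's input requirements before training.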


Here are five other examples of tables:

In [17]:
dict_table = Dict(:h => height, :e => exam_mark, :t => time)
schema(dict_table)

┌─────────┬──────────────────────────────────────────────────┬──────────────────
│ _.names │ _.types                                          │ _.scitypes      ⋯
├─────────┼──────────────────────────────────────────────────┼──────────────────
│ e       │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing,  ⋯
│ h       │ Float64                                          │ Continuous      ⋯
│ t       │ Float64                                          │ Continuous      ⋯
└─────────┴──────────────────────────────────────────────────┴──────────────────
                                                                1 column omitted
_.nrows = 5


(To control column order here, instead use `LittleDict` from
OrderedCollections.jl.)

In [18]:
row_table = [(a=1, b=3.4),
             (a=2, b=4.5),
             (a=3, b=5.6)]
schema(row_table)

┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ a       │ Int64   │ Count      │
│ b       │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 3
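Row tables and column tables hold the same information and can be converted into one another with Tables.jl. A brief sketch using `Tables.columntable` and `Tables.rowtable` (the variable names `rows` and `cols` are illustrative):

```julia
using Tables

rows = [(a=1, b=3.4),
        (a=2, b=4.5),
        (a=3, b=5.6)]

cols = Tables.columntable(rows)   # named tuple of vectors: (a = [1, 2, 3], b = [3.4, 4.5, 5.6])
Tables.rowtable(cols) == rows     # true: the round trip recovers the row table
```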


In [19]:
import DataFrames
df = DataFrames.DataFrame(column_table)

 Row │ h        e        t
     │ Float64  Cat…?    Float64
─────┼───────────────────────────
   1 │ 185.0    rotten   2.3
   2 │ 153.0    great    4.5
   3 │ 163.0    bla      4.2
   4 │ 114.0    missing  1.8
   5 │ 180.0    great    7.1


In [20]:
schema(df) == schema(column_table)

true

In [21]:
using UrlDownload, CSV
csv_file = urldownload("https://raw.githubusercontent.com/ablaom/"*
                   "MachineLearningInJulia2020/"*
                   "for-MLJ-version-0.16/data/horse.csv");
schema(csv_file)

┌─────────────────────────┬─────────┬────────────┐
│ _.names                 │ _.types │ _.scitypes │
├─────────────────────────┼─────────┼────────────┤
│ surgery                 │ Int64   │ Count      │
│ age                     │ Int64   │ Count      │
│ rectal_temperature      │ Float64 │ Continuous │
│ pulse                   │ Int64   │ Count      │
│ respiratory_rate        │ Int64   │ Count      │
│ temperature_extremities │ Int64   │ Count      │
│ mucous_membranes        │ Int64   │ Count      │
│ capillary_refill_time   │ Int64   │ Count      │
│ pain                    │ Int64   │ Count      │
│ peristalsis             │ Int64   │ Count      │
│ abdominal_distension    │ Int64   │ Count      │
│ packed_cell_volume      │ Float64 │ Continuous │
│ total_protein           │ Float64 │ Continuous │
│ outcome                 │ Int64   │ Count      │
│ surgical_lesion         │ Int64   │ Count      │
│ cp_data                 │ Int64   │ Count      │
└───

Most MLJ models do not accept a matrix in lieu of a table, but you can
wrap a matrix as a table:

In [22]:
using Tables
matrix_table = Tables.table(rand(2,3))
schema(matrix_table)

┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ Column1 │ Float64 │ Continuous │
│ Column2 │ Float64 │ Continuous │
│ Column3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 2


The matrix is *not* copied, only wrapped. Some models may perform
better if one wraps the adjoint of the transpose - see
[here](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Observations-correspond-to-rows,-not-columns).
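That the wrapper shares memory with the matrix can be observed by mutating the matrix after wrapping it. The sketch below also uses the `header` keyword of `Tables.table` to choose column names; the names `:x`, `:y`, `:z` are arbitrary.

```julia
using Tables

A = [1.0 2.0 3.0;
     4.0 5.0 6.0]
t = Tables.table(A; header = [:x, :y, :z])   # custom column names

Tables.columnnames(t)          # the names :x, :y, :z
A[1, 1] = 99.0                 # mutate the original matrix...
Tables.getcolumn(t, :x)        # ...and the table sees the change: [99.0, 4.0]
```

One consequence is that later in-place changes to the matrix also change the data a model sees through the table.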

**Manipulating tabular data.** In this workshop we assume
familiarity with some kind of tabular data container (although it is
possible, in principle, to carry out the exercises without this).
For a quick-start introduction to `DataFrames`, see [this
tutorial](https://juliaai.github.io/DataScienceTutorials.jl/data/dataframe/).

### Fixing scientific types in tabular data

To show how we can correct the scientific types of data in tables,
we introduce a cleaned up version of the UCI Horse Colic Data Set
(the cleaning work-flow is described
[here](https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/horse/#dealing_with_missing_values)).
We already downloaded this data set immediately above.

In [23]:
horse = DataFrames.DataFrame(csv_file); # convert to data frame
first(horse, 4)

 Row │ surgery  age    rectal_temperature  pulse  respiratory_rate  temperature_extremities
     │ Int64    Int64  Float64             Int64  Int64             Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ 2        1      38.5                66     66                3
   2 │ 1        1      39.2                88     88                3
   3 │ 2        1      38.3                40     40                1
   4 │ 1        9      39.1                164    164               4


From [the UCI
docs](http://archive.ics.uci.edu/ml/datasets/Horse+Colic) we can
surmise how each variable ought to be interpreted (a step in our
work-flow that cannot reliably be left to the computer):

variable                    | scientific type (interpretation)
----------------------------|-----------------------------------
`:surgery`                  | Multiclass
`:age`                      | Multiclass
`:rectal_temperature`       | Continuous
`:pulse`                    | Continuous
`:respiratory_rate`         | Continuous
`:temperature_extremities`  | OrderedFactor
`:mucous_membranes`         | Multiclass
`:capillary_refill_time`    | Multiclass
`:pain`                     | OrderedFactor
`:peristalsis`              | OrderedFactor
`:abdominal_distension`     | OrderedFactor
`:packed_cell_volume`       | Continuous
`:total_protein`            | Continuous
`:outcome`                  | Multiclass
`:surgical_lesion`          | OrderedFactor
`:cp_data`                  | Multiclass

Let's see how MLJ will actually interpret the data, as it is
currently encoded:

In [24]:
schema(horse)

┌─────────────────────────┬─────────┬────────────┐
│ _.names                 │ _.types │ _.scitypes │
├─────────────────────────┼─────────┼────────────┤
│ surgery                 │ Int64   │ Count      │
│ age                     │ Int64   │ Count      │
│ rectal_temperature      │ Float64 │ Continuous │
│ pulse                   │ Int64   │ Count      │
│ respiratory_rate        │ Int64   │ Count      │
│ temperature_extremities │ Int64   │ Count      │
│ mucous_membranes        │ Int64   │ Count      │
│ capillary_refill_time   │ Int64   │ Count      │
│ pain                    │ Int64   │ Count      │
│ peristalsis             │ Int64   │ Count      │
│ abdominal_distension    │ Int64   │ Count      │
│ packed_cell_volume      │ Float64 │ Continuous │
│ total_protein           │ Float64 │ Continuous │
│ outcome                 │ Int64   │ Count      │
│ surgical_lesion         │ Int64   │ Count      │
│ cp_data                 │ Int64   │ Count      │
└───

As a first correction step, we can get MLJ to "guess" the
appropriate fix, using the `autotype` method:

In [25]:
autotype(horse)

Dict{Symbol, Type} with 11 entries:
  :abdominal_distension    => OrderedFactor
  :pain                    => OrderedFactor
  :surgery                 => OrderedFactor
  :mucous_membranes        => OrderedFactor
  :surgical_lesion         => OrderedFactor
  :outcome                 => OrderedFactor
  :capillary_refill_time   => OrderedFactor
  :age                     => OrderedFactor
  :temperature_extremities => OrderedFactor
  :peristalsis             => OrderedFactor
  :cp_data                 => OrderedFactor

Okay, this is not perfect, but a step in the right direction, which
we implement like this:

In [26]:
coerce!(horse, autotype(horse));
schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Int64                           │ Count            ⋯
│ respiratory_rate        │ Int64                           │ Count            ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ OrderedFactor{6} ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ OrderedFactor{3} ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│

All remaining `Count` data should be `Continuous`:

In [27]:
coerce!(horse, Count => Continuous);
schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ OrderedFactor{2} ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Float64                         │ Continuous       ⋯
│ respiratory_rate        │ Float64                         │ Continuous       ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ OrderedFactor{6} ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ OrderedFactor{3} ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│

We'll correct the remaining truant entries manually:

In [28]:
coerce!(horse,
        :surgery               => Multiclass,
        :age                   => Multiclass,
        :mucous_membranes      => Multiclass,
        :capillary_refill_time => Multiclass,
        :outcome               => Multiclass,
        :cp_data               => Multiclass);
schema(horse)

┌─────────────────────────┬─────────────────────────────────┬───────────────────
│ _.names                 │ _.types                         │ _.scitypes       ⋯
├─────────────────────────┼─────────────────────────────────┼───────────────────
│ surgery                 │ CategoricalValue{Int64, UInt32} │ Multiclass{2}    ⋯
│ age                     │ CategoricalValue{Int64, UInt32} │ Multiclass{2}    ⋯
│ rectal_temperature      │ Float64                         │ Continuous       ⋯
│ pulse                   │ Float64                         │ Continuous       ⋯
│ respiratory_rate        │ Float64                         │ Continuous       ⋯
│ temperature_extremities │ CategoricalValue{Int64, UInt32} │ OrderedFactor{4} ⋯
│ mucous_membranes        │ CategoricalValue{Int64, UInt32} │ Multiclass{6}    ⋯
│ capillary_refill_time   │ CategoricalValue{Int64, UInt32} │ Multiclass{3}    ⋯
│ pain                    │ CategoricalValue{Int64, UInt32} │ OrderedFactor{5} ⋯
│
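The two styles of coercion used above (by column name and by scitype) can be tried on a toy data frame. This is a sketch on made-up data, not the horse data set; the column names are hypothetical.

```julia
using ScientificTypes, DataFrames

df = DataFrame(clinic   = [1, 2, 1, 2],          # a label, not a quantity
               severity = [1, 3, 2, 3],          # ordered grades
               pulse    = [66, 88, 40, 164],     # a measurement stored as integers
               temp     = [38.5, 39.2, 38.3, 39.1])

coerce!(df, :clinic => Multiclass, :severity => OrderedFactor)  # by column name
coerce!(df, Count => Continuous)                                # by scitype, for what is left

# Collect the resulting column scitypes into a dictionary for easy checking.
s = schema(df)
result = Dict(zip(s.names, s.scitypes))
```

Here `result` maps `:clinic` to `Multiclass{2}`, `:severity` to `OrderedFactor{3}`, and both `:pulse` and `:temp` to `Continuous`.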

### Resources for Part 1

- From the MLJ manual:
   - [A preview of data type specification in MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#A-preview-of-data-type-specification-in-MLJ-1)
   - [Data containers and scientific types](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Data-containers-and-scientific-types-1)
   - [Working with Categorical Data](https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/)
- [Summary](https://juliaai.github.io/ScientificTypes.jl/dev/#Summary-of-the-default-convention) of the MLJ convention for representing scientific types
- [ScientificTypes.jl](https://juliaai.github.io/ScientificTypes.jl/dev/)
- From Data Science Tutorials:
    - [Data interpretation: Scientific Types](https://juliaai.github.io/DataScienceTutorials.jl/data/scitype/)
    - [Horse colic data](https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/horse/)
- [UCI Horse Colic Data Set](http://archive.ics.uci.edu/ml/datasets/Horse+Colic)

### Exercises for Part 1

#### Exercise 1

Try to guess how each code snippet below will evaluate:

In [29]:
scitype(42)

Count

In [30]:
questions = ["who", "why", "what", "when"]
scitype(questions)

AbstractVector{Textual} (alias for AbstractArray{Textual, 1})

In [31]:
elscitype(questions)

Textual

In [32]:
t = (3.141, 42, "how")
scitype(t)

Tuple{Continuous, Count, Textual}

In [33]:
A = rand(2, 3)

2×3 Matrix{Float64}:
 0.0802    0.517074  0.933195
 0.231843  0.496183  0.24546


In [34]:
scitype(A)

AbstractMatrix{Continuous} (alias for AbstractArray{Continuous, 2})

In [35]:
elscitype(A)

Continuous

In [36]:
using SparseArrays
Asparse = sparse(A)

2×3 SparseMatrixCSC{Float64, Int64} with 6 stored entries:
 0.0802    0.517074  0.933195
 0.231843  0.496183  0.24546

In [37]:
scitype(Asparse)

AbstractMatrix{Continuous} (alias for AbstractArray{Continuous, 2})

In [38]:
C = coerce(A, Multiclass)

2×3 CategoricalArrays.CategoricalArray{Float64,2,UInt32}:
 0.0802    0.517074  0.933195
 0.231843  0.496183  0.24546

In [39]:
scitype(C)

AbstractMatrix{Multiclass{6}} (alias for AbstractArray{Multiclass{6}, 2})

In [40]:
elscitype(C)

Multiclass{6}

In [41]:
v = [1, 2, missing, 4]
scitype(v)

AbstractVector{Union{Missing, Count}} (alias for AbstractArray{Union{Missing, Count}, 1})

In [42]:
elscitype(v)

Union{Missing, Count}

In [43]:
scitype(v[1:2])

AbstractVector{Union{Missing, Count}} (alias for AbstractArray{Union{Missing, Count}, 1})

Can you guess at the general behavior of
`scitype` with respect to tuples, abstract arrays and missing
values? The answers are
[here](https://github.com/juliaai/ScientificTypesBase.jl#2-the-scitype-and-scitype-methods)
(ignore "Property 1").

#### Exercise 2

Coerce the following vector to make MLJ recognize it as a vector of
ordered factors (with an appropriate ordering):

In [44]:
quality = ["good", "poor", "poor", "excellent", missing, "good", "excellent"]

7-element Vector{Union{Missing, String}}:
 "good"
 "poor"
 "poor"
 "excellent"
 missing
 "good"
 "excellent"

#### Exercise 3 (fixing scitypes in a table)

Fix the scitypes for the [House Prices in King
County](https://mlr3gallery.mlr-org.com/posts/2020-01-30-house-prices-in-king-county/)
dataset:

In [45]:
house_csv = urldownload("https://raw.githubusercontent.com/ablaom/"*
                        "MachineLearningInJulia2020/for-MLJ-version-0.16/"*
                        "data/house.csv");
house = DataFrames.DataFrame(house_csv)
first(house, 4)

 Row │ price     bedrooms  bathrooms  sqft_living  sqft_lot  floors   waterfront  view
     │ Float64   Int64     Float64    Int64        Int64     Float64  Int64       Int64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ 221900.0  3         1.0        1180         5650      1.0      0           0
   2 │ 538000.0  3         2.25       2570         7242      2.0      0           0
   3 │ 180000.0  2         1.0        770          10000     1.0      0           0
   4 │ 604000.0  4         3.0        1960         5000      1.0      0           0


(Two features in the original data set have been deemed uninformative
and dropped, namely `:id` and `:date`. The original feature
`:yr_renovated` has been replaced by the `Bool` feature `:is_renovated`.)

<a id='part-2-selecting-training-and-evaluating-models'></a>

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*