# Tutorial 3

An introduction to machine learning using MLJ and the Titanic
dataset. Explains how to train a simple decision tree model and
evaluate its performance on a holdout set.

MLJ is a *multi-paradigm* machine learning toolbox (i.e., not just
deep-learning).

For other MLJ learning resources see the [Learning
MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/learning_mlj/)
section of the
[manual](https://alan-turing-institute.github.io/MLJ.jl/dev/).

## Activate package environment

In [1]:
using Pkg
Pkg.activate(joinpath(@__DIR__, "..", ".."))
Pkg.instantiate()

  Activating project at `~/GoogleDrive/Julia/HelloJulia`


## Establishing correct data representation

In [2]:
using MLJ
import DataFrames

[ Info: Precompiling MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]


A ["scientific
type"](https://juliaai.github.io/ScientificTypes.jl/dev/) or
*scitype* indicates how MLJ will *interpret* data (as opposed to how
it is represented on your machine). For example, while we have

In [3]:
typeof(3.14)

Float64

we have

In [4]:
scitype(3.14)

ScientificTypesBase.Continuous

and also

In [5]:
scitype(3.143f0)

ScientificTypesBase.Continuous

In MLJ, model data requirements are articulated using scitypes.

Here are common "scalar" scitypes:

In [6]:
html"""
<div style="text-align: left;">
        <img src="https://github.com/ablaom/MLJTutorial.jl/blob/dev/notebooks/01_data_representation/scitypes.png?raw=true">
</div>
"""
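
As a quick check of the scalar scitypes, here is a minimal sketch using only `scitype` and `coerce` from MLJ. The literal values and the name `answers` are made up for illustration:

```julia
using MLJ

# The scitype reflects interpretation, not machine representation:
@show scitype(42)      # an integer is interpreted as Count
@show scitype(3.14)    # a float is interpreted as Continuous
@show scitype("cat")   # a string is interpreted as Textual

# Categorical data has to be marked as such, for example by coercion:
answers = coerce(["yes", "no", "yes"], Multiclass)
@show scitype(answers[1])   # Multiclass{2}, as there are two classes
```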

There are also container scitypes. For example, the scitype of any
vector is `AbstractVector{S}`, where `S` is the scitype of its
elements:

In [7]:
scitype(["cat", "mouse", "dog"])

AbstractVector{Textual} (alias for AbstractArray{ScientificTypesBase.Textual, 1})
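
The same rule applies to vectors with missing values and to tables. The following sketch uses a made-up vector `v` and a made-up column table `people` (a named tuple of vectors) to show this:

```julia
using MLJ

v = [1.0, missing, 3.0]
@show scitype(v)     # AbstractVector{Union{Missing, Continuous}}
@show elscitype(v)   # the element scitype, Union{Missing, Continuous}

# A table has a `Table` scitype, parameterized by its column scitypes:
people = (height = [1.8, 1.6], n_children = [0, 2])
@show scitype(people) <: Table(Continuous, Count)
```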

We'll be using [OpenML](https://www.openml.org/home) to grab the
Titanic dataset:

In [8]:
table = OpenML.load(42638)
df0 = DataFrames.DataFrame(table)
DataFrames.describe(df0)

 Row │ variable  mean      min     median   max      nmissing  eltype
     │ Symbol    Union…    Any     Union…   Any      Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ pclass              1                3        0         CategoricalValue{String, UInt32}
   2 │ sex                 female           male     0         CategoricalValue{String, UInt32}
   3 │ age       29.7589   0.42    30.0     80.0     0         Float64
   4 │ sibsp     0.523008  0.0     0.0      8.0      0         Float64
   5 │ fare      32.2042   0.0     14.4542  512.329  0         Float64
   6 │ cabin               E31              C148     687       Union{Missing, CategoricalValue{String, UInt32}}
   7 │ embarked            C                S        2         Union{Missing, CategoricalValue{String, UInt32}}
   8 │ survived            0                1        0         CategoricalValue{String, UInt32}


The `schema` operator summarizes the column scitypes of a table:

In [9]:
schema(df0)

┌──────────┬─────────────────────────────────┬──────────────────────────────────
│ names    │ scitypes                        │ types                           ⋯
├──────────┼─────────────────────────────────┼──────────────────────────────────
│ pclass   │ Multiclass{3}                   │ CategoricalValue{String, UInt32 ⋯
│ sex      │ Multiclass{2}                   │ CategoricalValue{String, UInt32 ⋯
│ age      │ Continuous                      │ Float64                         ⋯
│ sibsp    │ Continuous                      │ Float64                         ⋯
│ fare     │ Continuous                      │ Float64                         ⋯
│ cabin    │ Union{Missing, Multiclass{186}} │ Union{Missing, CategoricalValue ⋯
│ embarked │ Union{Missing, Multiclass{3}}   │ Union{Missing, CategoricalValue ⋯
│ survived │ Multiclass{2}                   │ CategoricalValue{String, UInt32 ⋯
└──────────┴─────────────────────────────────┴──────────────────────────────────


Looks like we need to fix `:sibsp`, the number of siblings/spouses,
which is a count but is currently interpreted as `Continuous`:

In [10]:
df1 = coerce(df0, :sibsp => Count)
schema(df1)

┌──────────┬─────────────────────────────────┬──────────────────────────────────
│ names    │ scitypes                        │ types                           ⋯
├──────────┼─────────────────────────────────┼──────────────────────────────────
│ pclass   │ Multiclass{3}                   │ CategoricalValue{String, UInt32 ⋯
│ sex      │ Multiclass{2}                   │ CategoricalValue{String, UInt32 ⋯
│ age      │ Continuous                      │ Float64                         ⋯
│ sibsp    │ Count                           │ Int64                           ⋯
│ fare     │ Continuous                      │ Float64                         ⋯
│ cabin    │ Union{Missing, Multiclass{186}} │ Union{Missing, CategoricalValue ⋯
│ embarked │ Union{Missing, Multiclass{3}}   │ Union{Missing, CategoricalValue ⋯
│ survived │ Multiclass{2}                   │ CategoricalValue{String, UInt32 ⋯
└──────────┴─────────────────────────────────┴──────────────────────────────────
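
Several columns can be coerced in one call. As a minimal sketch, here is `coerce` applied to a made-up column table `raw` (the column names are hypothetical):

```julia
using MLJ

raw = (n_siblings = [1.0, 0.0, 3.0], town = ["A", "B", "A"])

# One `name => scitype` pair per column to be re-interpreted:
fixed = coerce(raw, :n_siblings => Count, :town => Multiclass)
schema(fixed)
```

Coercing to `Count` converts the floats to integers, while coercing to `Multiclass` wraps the strings as categorical values.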


Let's take a closer look at our target column `:survived`. Here a value
of `0` means that the individual didn't survive, while a value of `1`
indicates that the individual survived.

In [11]:
levels(df1.survived)

2-element Vector{String}:
 "0"
 "1"

The `:cabin` feature has a lot of missing values, and low frequency
for other classes:

In [12]:
import StatsBase
StatsBase.countmap(df0.cabin)

Dict{Union{Missing, CategoricalArrays.CategoricalValue{String, UInt32}}, Int64} with 148 entries:
  "C104"    => 1
  "E50"     => 1
  "D20"     => 2
  "E58"     => 1
  "C46"     => 1
  "D37"     => 1
  "B96 B98" => 4
  "C86"     => 1
  "C106"    => 1
  "A5"      => 1
  "C52"     => 2
  "B19"     => 1
  "C65"     => 2
  "C30"     => 1
  "D48"     => 1
  missing   => 687
  "B42"     => 1
  "C128"    => 1
  "E38"     => 1
  ⋮         => ⋮

We'll make `missing` into a bona fide class and group all the other
classes into one:

In [13]:
function class(c)
    if ismissing(c)
        return "without cabin"
    else
        return "has cabin"
    end
end

class (generic function with 1 method)

Shorthand syntax would be `class(c) = ismissing(c) ? "without cabin" :
"has cabin"`. Now to transform the whole column:

In [14]:
df2 = DataFrames.transform(
    df1,
    :cabin => DataFrames.ByRow(class) => :cabin
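) # :cabin now has `Textual` scitype
# Note: there is no `:class` column, so the next call leaves `:cabin`
# as `Textual`, which is what the schema below shows.
coerce!(df2, :class => Multiclass)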
schema(df2)

┌──────────┬───────────────────────────────┬────────────────────────────────────
│ names    │ scitypes                      │ types                             ⋯
├──────────┼───────────────────────────────┼────────────────────────────────────
│ pclass   │ Multiclass{3}                 │ CategoricalValue{String, UInt32}  ⋯
│ sex      │ Multiclass{2}                 │ CategoricalValue{String, UInt32}  ⋯
│ age      │ Continuous                    │ Float64                           ⋯
│ sibsp    │ Count                         │ Int64                             ⋯
│ fare     │ Continuous                    │ Float64                           ⋯
│ cabin    │ Textual                       │ String                            ⋯
│ embarked │ Union{Missing, Multiclass{3}} │ Union{Missing, CategoricalValue{S ⋯
│ survived │ Multiclass{2}                 │ CategoricalValue{String, UInt32}  ⋯
└──────────┴───────────────────────────────┴────────────────────────────────────
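
To see a row-wise transformation followed by a coercion in isolation, here is a sketch on a hypothetical three-row table. The names `toy` and `has_cabin` are made up for illustration:

```julia
using MLJ
import DataFrames

# Hypothetical stand-in for the `:cabin` column, with one missing entry
toy = DataFrames.DataFrame(cabin = ["C85", missing, "E46"])

has_cabin(c) = ismissing(c) ? "without cabin" : "has cabin"

# `ByRow` applies the function to each element of the column:
toy2 = DataFrames.transform(toy, :cabin => DataFrames.ByRow(has_cabin) => :cabin)

# The new column holds plain strings (`Textual`); coerce it by its own name:
toy3 = coerce(toy2, :cabin => Multiclass)
schema(toy3)
```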


## Splitting into train and test sets

Here we split off 30% of our observations into a
lock-and-throw-away-the-key holdout set, called `df_test`:

In [15]:
df, df_test = partition(df2, 0.7, rng=123)
DataFrames.nrow(df)

624

In [16]:
DataFrames.nrow(df_test)

267
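
`partition` also works on vectors and ranges, which makes its behavior easy to inspect. A minimal sketch on the range `1:10` (the variable names are ours):

```julia
using MLJ

# Without `rng` or `shuffle`, observations are split in their given order:
train, test = partition(1:10, 0.7)

# Specifying `rng` shuffles the observations before splitting:
train_s, test_s = partition(1:10, 0.7, rng=123)

@show train test
@show length(train_s) length(test_s)
```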

## Cleaning the data

Let's construct an MLJ model to impute missing data:

In [17]:
cleaner = FillImputer()

FillImputer(
  features = Symbol[], 
  continuous_fill = MLJModels._median, 
  count_fill = MLJModels._round_median, 
  finite_fill = MLJModels._mode)

In MLJ a *model* is just a container for hyper-parameters associated with some ML
algorithm. It does not store learned parameters (unlike scikit-learn "estimators"). In
this case the hyper-parameters `features`, `continuous_fill`, `count_fill`, and
`finite_fill` specify which features should be imputed and how imputation should be
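carried out, depending on the scitype. Since we passed no arguments to the
constructor, every hyper-parameter takes its default value. In particular, the empty
`features` vector means that imputation is attempted for every column the model can handle.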
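
To make the "container for hyper-parameters" idea concrete, here is a sketch of a non-default imputer. The choice of columns and the mean-based fill are assumptions made for illustration, not settings used elsewhere in this tutorial:

```julia
using MLJ
using Statistics

# Hypothetical settings: only impute two columns, and fill
# continuous features with the mean instead of the median.
custom_cleaner = FillImputer(
    features = [:age, :fare],
    continuous_fill = v -> mean(skipmissing(v)),
)

# Nothing has been learned yet; the model only stores these choices:
@show custom_cleaner.features
@show custom_cleaner.continuous_fill([1.0, missing, 3.0])
```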

We now bind the model with training data in a *machine*:

In [18]:
machc = machine(cleaner, df)

untrained Machine; caches model-specific representations of data
  model: FillImputer(features = Symbol[], …)
  args: 
    1:	Source @375 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Textual}, AbstractVector{Union{Missing, ScientificTypesBase.Multiclass{3}}}, AbstractVector{ScientificTypesBase.Multiclass{2}}}}


And train the machine to store learned parameters there (the column
modes and medians to be used to impute missings):

In [19]:
fit!(machc);

[ Info: Training machine(FillImputer(features = Symbol[], …), …).


We can inspect the learned parameters if we want:

In [20]:
fitted_params(machc).filler_given_feature

Dict{Symbol, Any} with 7 entries:
  :sibsp    => 0
  :pclass   => "3"
  :survived => "0"
  :sex      => "male"
  :age      => 30.0
  :fare     => 14.5
  :embarked => "S"

Next, we apply the learned transformation on our data:

In [21]:
dfc      = transform(machc, df)
dfc_test = transform(machc, df_test)
schema(dfc)

┌──────────┬───────────────┬──────────────────────────────────┐
│ names    │ scitypes      │ types                            │
├──────────┼───────────────┼──────────────────────────────────┤
│ pclass   │ Multiclass{3} │ CategoricalValue{String, UInt32} │
│ sex      │ Multiclass{2} │ CategoricalValue{String, UInt32} │
│ age      │ Continuous    │ Float64                          │
│ sibsp    │ Count         │ Int64                            │
│ fare     │ Continuous    │ Float64                          │
│ cabin    │ Textual       │ String                           │
│ embarked │ Multiclass{3} │ CategoricalValue{String, UInt32} │
│ survived │ Multiclass{2} │ CategoricalValue{String, UInt32} │
└──────────┴───────────────┴──────────────────────────────────┘
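
The key point is that the fill values are learned once, from the training data, and then reused on any other data. A minimal sketch with made-up tables `Xtrain` and `Xnew`:

```julia
using MLJ

Xtrain = (age = [22.0, missing, 30.0, 40.0],)   # hypothetical training table
Xnew   = (age = [missing, 5.0],)                # hypothetical unseen table

mach = machine(FillImputer(), Xtrain)
fit!(mach, verbosity=0)

# The learned filler is the median of the non-missing training ages:
filler = fitted_params(mach).filler_given_feature[:age]

# The same filler is applied to both tables:
filled_train = transform(mach, Xtrain)
filled_new   = transform(mach, Xnew)
@show filler filled_train.age filled_new.age
```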


## Split the data into input features and target

The following method puts the column with name equal to `:survived`
into the vector `y`, and everything else into a table (`DataFrame`)
called `X`.

In [22]:
y, X = unpack(dfc, ==(:survived));
scitype(y)

AbstractVector{Multiclass{2}} (alias for AbstractArray{ScientificTypesBase.Multiclass{2}, 1})

While we're here, we'll do the same for the holdout test set:

In [23]:
y_test, X_test = unpack(dfc_test, ==(:survived));
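
For reference, here is `unpack` on a small made-up column table, so the result can be checked by eye. The table `data` and its column names are hypothetical:

```julia
using MLJ

data = (x1 = [1.0, 2.0, 3.0], x2 = [10, 20, 30], target = ["a", "b", "a"])

# One predicate on column names; the remaining columns form the last output:
y_toy, X_toy = unpack(data, ==(:target))

@show y_toy
@show schema(X_toy).names
```

A single selected column is returned as a vector, while the remaining columns are returned as a table.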

## Choosing a supervised model

There are not many models that can directly handle a mixture of
scitypes, as we have here:

In [24]:
models(matching(X, y))

4-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )

This can be mitigated with further pre-processing (such as one-hot
encoding) but we'll settle for one of the above models here:

In [25]:
doc("DecisionTreeClassifier", pkg="BetaML")

```julia
mutable struct DecisionTreeClassifier <: MLJModelInterface.Probabilistic
```

A simple Decision Tree model for classification with support for Missing data, from the Beta Machine Learning Toolkit (BetaML).

# Hyperparameters:

  * `max_depth::Int64`: The maximum depth the tree is allowed to reach. When this is reached the node is forced to become a leaf [def: `0`, i.e. no limits]
  * `min_gain::Float64`: The minimum information gain to allow for a node's partition [def: `0`]
  * `min_records::Int64`: The minimum number of records a node must holds to consider for a partition of it [def: `2`]
  * `max_features::Int64`: The maximum number of (random) features to consider at each partitioning [def: `0`, i.e. look at all features]
  * `splitting_criterion::Function`: This is the name of the function to be used to compute the information gain of a specific partition. This is done by measuring the difference betwwen the "impurity" of the labels of the parent node with those of the two child nodes, weighted by the respective number of items. [def: `gini`]. Either `gini`, `entropy` or a custom function. It can also be an anonymous function.
  * `rng::Random.AbstractRNG`: A Random Number Generator to be used in stochastic parts of the code [deafult: `Random.GLOBAL_RNG`]

# Example:

```julia

julia> using MLJ

julia> X, y                        = @load_iris;

julia> modelType                   = @load DecisionTreeClassifier pkg = "BetaML" verbosity=0
BetaML.Trees.DecisionTreeClassifier

julia> model                       = modelType()
DecisionTreeClassifier(
  max_depth = 0, 
  min_gain = 0.0, 
  min_records = 2, 
  max_features = 0, 
  splitting_criterion = BetaML.Utils.gini, 
  rng = Random._GLOBAL_RNG())

julia> (fitResults, cache, report) = MLJ.fit(model, 0, X, y);

julia> class_est                   = predict(model, fitResults, X)
150-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 ⋮
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
```


In [26]:
Tree = @load DecisionTreeClassifier pkg=BetaML  # model type
tree = Tree()                                   # default instance

[ Info: For silent loading, specify `verbosity=0`.
[ Info: Precompiling BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
[ Info: Precompiling ZygoteColorsExt [e68c091a-8ea5-5ca7-be4f-380657d4ad79]
import BetaML ✔


DecisionTreeClassifier(
  max_depth = 0, 
  min_gain = 0.0, 
  min_records = 2, 
  max_features = 0, 
  splitting_criterion = BetaML.Utils.gini, 
  rng = Random._GLOBAL_RNG())

Notice that by calling `Tree` with no arguments we get default
values for the various hyperparameters that control how the tree is
trained. We specify keyword arguments to override these defaults. For example:

In [27]:
small_tree = Tree(max_depth=3)

DecisionTreeClassifier(
  max_depth = 3, 
  min_gain = 0.0, 
  min_records = 2, 
  max_features = 0, 
  splitting_criterion = BetaML.Utils.gini, 
  rng = Random._GLOBAL_RNG())
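
Hyperparameters can also be changed after construction, because model instances are mutable. A short sketch, assuming BetaML is available in the active environment (the particular values are arbitrary):

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=BetaML verbosity=0

# Several defaults can be overridden at once:
shallow_tree = Tree(max_depth=3, min_records=5)

# Fields can be updated in place later on:
shallow_tree.max_depth = 4
@show shallow_tree.max_depth shallow_tree.min_records
```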

A decision tree is frequently not the best performing model, but it
is easy to interpret (and the algorithm is relatively easy to
explain). For example, here's a diagrammatic representation of a
tree trained on (some part of) the Titanic data set, which suggests
how prediction works:

In [28]:
html"""
<div style="text-align: left;">
        <img src="https://upload.wikimedia.org/wikipedia/commons/5/58/Decision_Tree_-_survival_of_passengers_on_the_Titanic.jpg">
</div>
"""

## The fit/predict workflow

We now bind the data to be used for training and evaluation to the model (i.e., a choice
of hyperparameters) in a machine, just as we did for missing value imputation. In this
case, however, we also need to specify the training target `y`:

In [29]:
macht = machine(tree, X, y)

untrained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 0, …)
  args: 
    1:	Source @807 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}, AbstractVector{ScientificTypesBase.Multiclass{2}}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Textual}}}
    2:	Source @909 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}


To train using *all* the bound data:

In [30]:
fit!(macht)

[ Info: Training machine(DecisionTreeClassifier(max_depth = 0, …), …).


trained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 0, …)
  args: 
    1:	Source @807 ⏎ ScientificTypesBase.Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{ScientificTypesBase.Count}, AbstractVector{ScientificTypesBase.Multiclass{2}}, AbstractVector{ScientificTypesBase.Multiclass{3}}, AbstractVector{ScientificTypesBase.Textual}}}
    2:	Source @909 ⏎ AbstractVector{ScientificTypesBase.Multiclass{2}}


And get predictions on the holdout set:

In [31]:
p = predict(macht, X_test);

These are *probabilistic* predictions:

In [32]:
p[3]

UnivariateFinite{ScientificTypesBase.Multiclass{2}}(0=>0.2, 1=>0.8)

In [33]:
pdf(p[3], "0")

0.2
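
A probabilistic prediction like this one can be built by hand, which helps when exploring what `pdf` and `mode` return. A minimal sketch, using the `UnivariateFinite` constructor with the same probabilities as above (the name `d` is ours):

```julia
using MLJ

# `pool=missing` asks for a fresh categorical pool for the labels "0" and "1"
d = UnivariateFinite(["0", "1"], [0.2, 0.8], pool=missing)

@show pdf(d, "0")   # probability of class "0"
@show pdf(d, "1")   # probability of class "1"
@show mode(d)       # the most probable class
```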

We can also get "point" predictions:

In [34]:
yhat = mode.(p)

267-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "1"
 "1"
 "1"
 "0"
 "0"
 "0"
 "1"
 "0"
 "0"
 "0"
 ⋮
 "0"
 "0"
 "0"
 "0"
 "1"
 "0"
 "1"
 "0"
 "0"
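
For probabilistic models, `predict_mode` returns the point predictions directly. The sketch below checks this on a made-up four-row data set, using the built-in `ConstantClassifier` (which always predicts the training class frequencies) so that no extra packages are needed:

```julia
using MLJ

Xtoy = (x = [1.0, 2.0, 3.0, 4.0],)
ytoy = coerce(["no", "yes", "yes", "yes"], Multiclass)

mach = machine(ConstantClassifier(), Xtoy, ytoy)
fit!(mach, verbosity=0)

ptoy = predict(mach, Xtoy)           # probabilistic predictions
yhat_a = mode.(ptoy)                 # point predictions, computed by hand
yhat_b = predict_mode(mach, Xtoy)    # point predictions, in one step

@show pdf(ptoy[1], "yes")
@show yhat_a == yhat_b
```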

We can evaluate performance using a probabilistic measure, as in

In [35]:
log_loss(p, y_test) |> mean

8.307615579043436

Or using a deterministic measure:

In [36]:
accuracy(yhat, y_test)

0.7265917602996255
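
To see what the deterministic measure computes, here is a sketch on four made-up labels, where the accuracy can be verified by counting matches:

```julia
using MLJ

ytrue = coerce(["0", "1", "1", "0"], Multiclass)
ypred = coerce(["0", "1", "0", "0"], Multiclass)

# Three of the four predictions match the ground truth:
acc = accuracy(ypred, ytrue)
err = misclassification_rate(ypred, ytrue)
@show acc err
```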

List all performance measures with `measures()`. Naturally, MLJ
includes functions to automate this kind of performance evaluation,
but this is beyond the scope of this tutorial. See, e.g., the
[Getting Started](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Getting-Started)
section of the manual.

## Learning more

Some suggestions for next steps are
[here](https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Getting-Started).

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*