## MLJ Basics

This notebook shows the basic tools needed to deal with a classification task in MLJ.

In particular it shows:

- How to load a model
- How to fit a model
- How to predict using a fitted model
- How to evaluate a model


In [1]:
using RDatasets, MLJ, NearestNeighbors, MLJModels
iris = dataset("datasets", "iris")
first(iris, 3) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┬────────────────────────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species                        │
│ Float64     │ Float64    │ Float64     │ Float64    │ CategoricalValue{String,UInt8} │
│ Continuous  │ Continuous │ Continuous  │ Continuous │ Multiclass{3}                  │
├─────────────┼────────────┼─────────────┼────────────┼────────────────────────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa                         │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa                         │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa                         │
└─────────────┴────────────┴─────────────┴────────────┴────────────────────────────────┘


In [2]:
y, X = unpack(iris, ==(:Species), colname -> true)
first(X, 1) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ Float64     │ Float64    │ Float64     │ Float64    │
│ Continuous  │ Continuous │ Continuous  │ Continuous │
├─────────────┼────────────┼─────────────┼────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │
└─────────────┴────────────┴─────────────┴────────────┘


In [3]:
println(typeof(X))
println(typeof(y))

DataFrame
CategoricalArrays.CategoricalArray{String,1,UInt8,String,CategoricalArrays.CategoricalValue{String,UInt8},Union{}}


In [4]:
for m in models(matching(X, y))
    if m.prediction_type == :probabilistic
        println(rpad(m.name, 30), "($(m.package_name))")
    end
end

AdaBoostClassifier            (ScikitLearn)
AdaBoostStumpClassifier       (DecisionTree)
BaggingClassifier             (ScikitLearn)
BayesianLDA                   (MultivariateStats)
BayesianLDA                   (ScikitLearn)
BayesianQDA                   (ScikitLearn)
BayesianSubspaceLDA           (MultivariateStats)
ConstantClassifier            (MLJModels)
DecisionTreeClassifier        (DecisionTree)
DummyClassifier               (ScikitLearn)
EvoTreeClassifier             (EvoTrees)
ExtraTreesClassifier          (ScikitLearn)
GaussianNBClassifier          (NaiveBayes)
GaussianNBClassifier          (ScikitLearn)
GaussianProcessClassifier     (ScikitLearn)
GradientBoostingClassifier    (ScikitLearn)
KNNClassifier                 (NearestNeighbors)
KNeighborsClassifier          (ScikitLearn)
LDA                           (MultivariateStats)
LGBMClassifier                (LightGBM)
LogisticCVClassifier          (ScikitLearn)
LogisticClassifier            (MLJLinearModels)
LogisticClas

## Training a KNN

### Choosing a model

In [5]:
knn = @load KNNClassifier verbosity = 0

KNNClassifier(
    K = 5,
    algorithm = :kdtree,
    metric = Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @176

We can set specific attributes of the model by assigning them directly on the loaded object.

For example, we can set the `leafsize` explicitly (here to 10, which is also the default value):

In [6]:
knn.leafsize = 10

10

In [7]:
knn

KNNClassifier(
    K = 5,
    algorithm = :kdtree,
    metric = Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @176

### Defining a machine with a model and the data

Now we will define a `MLJ.Machine` object that will contain 3 objects:

- The model `knn`
- The input data `X`
- The output data `y`

In [8]:
println(typeof(X))
println(typeof(y))

DataFrame
CategoricalArrays.CategoricalArray{String,1,UInt8,String,CategoricalArrays.CategoricalValue{String,UInt8},Union{}}


Let us define the machine with the function `machine`

In [9]:
m_knn = machine(knn, X, y)

Machine{KNNClassifier} @416 trained 0 times.
  args: 
    1:	Source @516 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @235 ⏎ `AbstractArray{Multiclass{3},1}`


Now we have a `Machine` object containing a `KNNClassifier`

In [10]:
typeof(m_knn)

Machine{MLJModels.NearestNeighbors_.KNNClassifier}

Since the machine already holds all the data, we only need to specify one set of indices for training and another for testing.

A very handy function to generate train and test splits is **`partition`**, which can take as input:

- `UnitRange`  (ex: `1:10`)
- `Array` (ex: `[1,2,3,4,5,6,7,8,9,10]`)
- `AbstractUnitRange` (ex: `eachindex(y)`)

Given a single fraction, it returns two arrays of indices: the train partition and the test partition.

If you want a reproducible partition you can call `Random.seed!(some_integer)` beforehand, so that the partition will always be the same.
 
 
#### Julia example
```julia
train_test = partition(1:10, 0.8, shuffle=true)
train_test

([4, 3, 9, 7, 2, 1, 6, 8], [10, 5])
```
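
To make the idea behind a shuffled split concrete, here is a minimal plain-Julia sketch. The helper `simple_partition` is hypothetical (it is not part of MLJ), and both the rounding rule and the use of an explicit `MersenneTwister` are assumptions made for illustration. It only shows the principle: shuffle the indices with a seeded generator, then cut them at the requested fraction.

```julia
using Random

# Hypothetical helper, NOT MLJ's `partition`: shuffle the rows with the given
# random number generator, then cut the shuffled rows at `fraction`.
function simple_partition(rng::AbstractRNG, rows, fraction::Real)
    shuffled = shuffle(rng, collect(rows))
    n_train = round(Int, fraction * length(shuffled))  # assumed rounding rule
    return shuffled[1:n_train], shuffled[n_train+1:end]
end

# The same seed gives the same split, which is what makes it reproducible.
train, test = simple_partition(MersenneTwister(123), 1:10, 0.8)
train_again, test_again = simple_partition(MersenneTwister(123), 1:10, 0.8)

println(length(train), " training rows, ", length(test), " test rows")
println("Reproducible: ", train == train_again && test == test_again)
```

Every index ends up in exactly one of the two parts, so the training and test sets never overlap.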

#### Sklearn equivalent


```python
import sklearn.model_selection

train_test = sklearn.model_selection.train_test_split(range(10),
                                                      train_size=0.8,
                                                      shuffle=True,
                                                      random_state=123)
train_test
[[7, 5, 8, 3, 1, 6, 9, 2], [4, 0]]
```


The following three ways of generating a train/test split are equivalent:

In [106]:
using Random
Random.seed!(123)
train_ind, test_ind = partition(Array(1:length(y)), 0.7, shuffle=true)

([125, 100, 130, 9, 70, 148, 39, 64, 6, 107  …  134, 114, 52, 74, 44, 61, 83, 18, 122, 26], [97, 78, 30, 108, 101, 24, 85, 91, 135, 96  …  112, 144, 140, 72, 109, 41, 106, 147, 47, 5])

In [104]:
using Random
Random.seed!(123)
train_ind, test_ind = partition(eachindex(y), 0.7, shuffle=true)

([125, 100, 130, 9, 70, 148, 39, 64, 6, 107  …  134, 114, 52, 74, 44, 61, 83, 18, 122, 26], [97, 78, 30, 108, 101, 24, 85, 91, 135, 96  …  112, 144, 140, 72, 109, 41, 106, 147, 47, 5])

In [105]:
using Random
Random.seed!(123)
train_ind, test_ind = partition(1:length(y), 0.7, shuffle=true)

([125, 100, 130, 9, 70, 148, 39, 64, 6, 107  …  134, 114, 52, 74, 44, 61, 83, 18, 122, 26], [97, 78, 30, 108, 101, 24, 85, 91, 135, 96  …  112, 144, 140, 72, 109, 41, 106, 147, 47, 5])
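
The three calls give the same split because, after seeding, each of them hands `partition` the same sequence of indices. A small plain-Julia check, using a made-up vector `y_demo` in place of the iris labels, illustrates that the three inputs enumerate identical indices:

```julia
# `y_demo` is a stand-in for the label vector `y` (hypothetical data).
y_demo = ["a", "b", "c", "d"]

as_array      = Array(1:length(y_demo))  # a plain Vector{Int}
as_unit_range = 1:length(y_demo)         # a UnitRange{Int}
as_eachindex  = eachindex(y_demo)        # an AbstractUnitRange

# All three describe the indices 1, 2, 3, 4.
same = collect(as_eachindex) == collect(as_unit_range) == as_array
println("Same indices: ", same)
```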

## Training and predicting with a machine
We can train a machine using **`fit!`**, and we can specify the rows used for training with **`rows=train_ind`**:

In [119]:
fit!(m_knn, rows=train_ind)

┌ Info: Training Machine{KNNClassifier} @416.
└ @ MLJBase /Users/davidbuchaca1/.julia/packages/MLJBase/5TNcr/src/machines.jl:319


Machine{KNNClassifier} @416 trained 2 times.
  args: 
    1:	Source @516 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @235 ⏎ `AbstractArray{Multiclass{3},1}`


**`fitted_params(machine)`** allows us to view the fitted parameters of a `machine`.

In [232]:
fitted_params(m_knn) |> print

(tree = KDTree{StaticArrays.SArray{Tuple{4},Float64,1,4},Euclidean,Float64}
  Number of points: 105
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true,)

**`predict`** allows us to get predictions from a machine.

There are two different approaches:

- **`predict(machine, rows=test_ind)`**: Use the internal dataset that was provided to the machine, specifying the rows for which to make predictions.


- **`predict(machine, X_df)`**: Use a dataframe `X_df` for which to make predictions.



In [120]:
ŷ_test  = predict(m_knn, rows=test_ind);
ŷ_train = predict(m_knn, rows=train_ind);

In [121]:
ŷ_train = predict(m_knn, X[train_ind,:])
ŷ_test  = predict(m_knn, X[test_ind,:])
y_train, y_test = y[train_ind], y[test_ind];  # true labels for the same rows, used below

In [123]:
typeof(ŷ_train)

MLJBase.UnivariateFiniteArray{Multiclass{3},String,UInt8,Float64,1}

In [125]:
ŷ_train[3]

UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)

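It is important to note that a `UnivariateFinite` distribution cannot be directly compared with a `CategoricalValue` (an element of a `CategoricalArray`): the comparison simply returns `false`, even when the distribution puts all its probability on that class.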

In [224]:
ŷ_train[1]

UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)

In [225]:
y_train[1]

CategoricalValue{String,UInt8} "virginica"

In [226]:
ŷ_train[1] == y_train[1]

false

### Predictions for classification algorithms in  MLJ

Notice that in MLJ, machine predictions for classification problems return an array of the following type: `MLJBase.UnivariateFiniteArray{Multiclass{3},String,UInt8,Float64,1}`.
This type might be a bit surprising, so let's look at the details.

You can interpret `ŷ_test[k]` as an array of length `n_classes`, although users can't do `ŷ_test[k][c]` directly.

Component `c` of this "array" contains the probability that example `k` belongs to the class indexed by integer `c`.

For example, the following line tells us that the probability of `versicolor` is 0.6 and the probability of `virginica` is 0.4.

```julia
ŷ_test[2]
UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.6, virginica=>0.4)
```
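
As a rough plain-Julia analogy (not MLJ's actual data structure), one such prediction can be pictured as a list of class labels paired with a list of probabilities. The names `classes` and `probs` below are made up for illustration, and the numbers are the ones shown above:

```julia
# Hypothetical stand-in for one probabilistic prediction.
classes = ["setosa", "versicolor", "virginica"]
probs   = [0.0, 0.6, 0.4]

# The probabilities over all classes add up to one.
total = sum(probs)

# The most likely class is the label with the largest probability.
most_likely = classes[argmax(probs)]
println("Most likely class: ", most_likely)
```

Keeping the labels attached to the probabilities is what lets a prediction answer both "how likely is each class?" and "which class is most likely?".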

Note that the output of `predict` is a **`MLJBase.UnivariateFiniteArray`**, and each element is of type **`UnivariateFinite`**, which contains the predicted probability of each class for the corresponding input row.


There are several advantages to using `UnivariateFinite` instead of a vector of floats **TODO: EXPAND THIS**.


#### Sklearn equivalent

In sklearn, `model.predict(X)` returns a `np.array` containing the predicted class labels (taken from the labels the model saw during `fit`, which are stored in `model.classes_`).
 
Notice that `predict(m_knn, X[test_ind,:])` in MLJ is pretty much the same as `model.predict_proba(X[test_ind,:])` in Sklearn. The main difference is that Sklearn returns a numpy array of float values, whereas MLJ returns a "weird" array of `UnivariateFinite{Multiclass{3}}` values.




In [242]:
ŷ_test[2]

UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.6, virginica=>0.4)

In [239]:
ŷ_test[2].prob_given_ref

OrderedCollections.LittleDict{UInt8,Float64,Array{UInt8,1},Array{Float64,1}} with 3 entries:
  0x01 => 0.0
  0x02 => 0.6
  0x03 => 0.4

**`prob_given_ref.vals`** returns a vector with the probabilities.

In [241]:
ŷ_test[2].prob_given_ref.vals

3-element Array{Float64,1}:
 0.0
 0.6000000000000001
 0.4

**`mode(ŷ_test[k])`**  can be used to get the most likely class for example `k`

In [260]:
mode(ŷ_test[2])

CategoricalValue{String,UInt8} "versicolor"

**`predict_mode(machine, X)`** returns a categorical array with the predicted classes from `X`

In [262]:
predict_mode(m_knn, X[test_ind[1:3],:])

3-element CategoricalArray{String,1,UInt8}:
 "versicolor"
 "versicolor"
 "setosa"

We can make comparisons to check whether the predicted class equals the true class:

In [204]:
mode(ŷ_test[2]) == y_test[2]

true

Users need to be careful not to confuse the previous line with the following one, which compares the whole distribution (not its mode) with the true class:

In [206]:
ŷ_test[2] == y_test[2]

false


In order to compute the accuracy between `ŷ_test` and `y_test` we could do:

In [193]:
function accuracy(y, ŷ)
    # Count the examples whose most likely predicted class matches the true label
    n_correct = 0.
    for m in 1:length(y)
        n_correct += y[m] == mode(ŷ[m])
    end
    return n_correct/length(y)
end

accuracy(y_test,ŷ_test)

0.9777777777777777
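
The same quantity can be written more compactly with broadcasting. The sketch below works on plain label vectors with made-up values; in the notebook, the predicted labels would first have to be extracted from the distributions, for instance with `predict_mode` or by applying `mode` to each element. The names `truth` and `predicted` are hypothetical.

```julia
using Statistics

# Hypothetical true and predicted labels for four examples.
truth     = ["setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]

# `.==` compares element by element; the mean of the resulting Bool
# values is the fraction of correct predictions.
acc = mean(predicted .== truth)
println("Accuracy: ", acc)
```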

In [247]:
mce = mean(cross_entropy(ŷ_test, y_test))
round(mce, digits=4)

0.0806
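
For a single example, the cross-entropy is the negative logarithm of the probability that the prediction assigns to the true class, so confident correct predictions cost almost nothing and confident wrong ones cost a lot. A minimal plain-Julia sketch with made-up probabilities (the vector `p_true` is hypothetical, and this is not MLJ's implementation):

```julia
using Statistics

# Hypothetical probabilities that three predictions assign to the TRUE class.
p_true = [1.0, 0.6, 0.9]

# Per-example cross-entropy: -log(probability of the true class).
losses = -log.(p_true)

# The summary reported above is the mean over the examples.
mean_cross_entropy = mean(losses)
println(round(mean_cross_entropy, digits=4))
```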

In [290]:
#evaluate!(m_knn, rows=test_ind);

## Tuning Hyperparameters

In [425]:
knn = @load KNNClassifier verbosity = 0

KNNClassifier(
    K = 5,
    algorithm = :kdtree,
    metric = Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @586

In [426]:
knn

KNNClassifier(
    K = 5,
    algorithm = :kdtree,
    metric = Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @586

In [427]:
K_range = range(knn, :K, lower=5, upper=20);

In [428]:
K_range

MLJBase.NumericRange(Int64, :K, ... )

Incidentally, a grid is generated internally "over the range" by calling the iterator method with an appropriate resolution:

In [429]:
iterator(K_range, 3)

3-element Array{Int64,1}:
  5
 12
 20
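
The values above are consistent with taking evenly spaced points between the bounds and rounding them to integers. The sketch below reproduces that idea in plain Julia; the helper `grid_points` is hypothetical, and whether MLJ computes its grid in exactly this way is an assumption based only on the output shown.

```julia
# Hypothetical helper, not MLJ's `iterator`: evenly spaced points between
# `lower` and `upper`, rounded to integers, with duplicates removed.
function grid_points(lower::Integer, upper::Integer, resolution::Integer)
    points = range(lower, stop=upper, length=resolution)
    return unique(round.(Int, points))
end

grid3 = grid_points(5, 20, 3)  # the middle point 12.5 rounds to 12
grid5 = grid_points(5, 20, 5)
println(grid3)
println(grid5)
```

Julia rounds ties to the nearest even integer by default, which is why 12.5 becomes 12 rather than 13.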

Now let us define a tuned model:

In [507]:
self_tuning_knn = TunedModel(model=knn,
                             resampling = CV(nfolds=5),
                             tuning = Grid(resolution=5),
                             range = K_range);

┌ Info: No measure specified. Setting measure=LogLoss{Float64} @278.
└ @ MLJTuning /Users/davidbuchaca1/.julia/packages/MLJTuning/6MZ7C/src/tuned_models.jl:222


In [508]:
m_self_tuning_knn = machine(self_tuning_knn, X, y)

Machine{ProbabilisticTunedModel{Grid,…}} @142 trained 0 times.
  args: 
    1:	Source @723 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @170 ⏎ `AbstractArray{Multiclass{3},1}`


In [509]:
fit!(m_self_tuning_knn, rows=train_ind, verbosity=0)

Machine{ProbabilisticTunedModel{Grid,…}} @142 trained 1 time.
  args: 
    1:	Source @723 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @170 ⏎ `AbstractArray{Multiclass{3},1}`


In [510]:
fitted_params(m_self_tuning_knn)

(best_model = KNNClassifier @504,
 best_fitted_params = (tree = KDTree{StaticArrays.SArray{Tuple{4},Float64,1,4},Euclidean,Float64}
  Number of points: 105
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true,),)

In [511]:
fitted_params(m_self_tuning_knn).best_model

KNNClassifier(
    K = 20,
    algorithm = :kdtree,
    metric = Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @504

**`report`** gives us access to a thorough report of the tuning process:

In [514]:
report(m_self_tuning_knn)

(best_model = KNNClassifier @504,
 best_history_entry = (model = KNNClassifier @504,
                       measure = LogLoss{Float64}[LogLoss{Float64} @278],
                       measurement = [0.1670047290499682],
                       per_fold = [[0.18149729477405077, 0.10005618653945388, 0.21909874840716415, 0.14570504613580174, 0.18866636939337042]],),
 history = NamedTuple{(:model, :measure, :measurement, :per_fold),Tuple{MLJModels.NearestNeighbors_.KNNClassifier,Array{LogLoss{Float64},1},Array{Float64,1},Array{Array{Float64,1},1}}}[(model = KNNClassifier @711, measure = [LogLoss{Float64} @278], measurement = [0.13665915288523797], per_fold = [[0.15450290845334771, 0.08915669132218772, 0.18930221668821798, 0.09256179562950509, 0.1577721523329314]]), (model = KNNClassifier @634, measure = [LogLoss{Float64} @278], measurement = [0.11642670571866609], per_fold = [[0.13751290362838622, 0.06696030644151882, 0.160

**`report(machine).history`** contains the per-fold measurements stored during training:

In [550]:
report(m_self_tuning_knn).history

5-element Array{NamedTuple{(:model, :measure, :measurement, :per_fold),Tuple{MLJModels.NearestNeighbors_.KNNClassifier,Array{LogLoss{Float64},1},Array{Float64,1},Array{Array{Float64,1},1}}},1}:
 (model = KNNClassifier @711, measure = [LogLoss{Float64} @278], measurement = [0.13665915288523797], per_fold = [[0.15450290845334771, 0.08915669132218772, 0.18930221668821798, 0.09256179562950509, 0.1577721523329314]])
 (model = KNNClassifier @634, measure = [LogLoss{Float64} @278], measurement = [0.11642670571866609], per_fold = [[0.13751290362838622, 0.06696030644151882, 0.16084031807589744, 0.06367493160880684, 0.1531450688387212]])
 (model = KNNClassifier @199, measure = [LogLoss{Float64} @278], measurement = [0.09329713262922215], per_fold = [[0.15522370568516472, 0.06795792169714998, 0.06682856328680156, 0.06488465878583707, 0.11159081369115739]])
 (model = KNNClassifier @504, measure = [LogLoss{Float64} @278

Note that the measurement used to evaluate a particular hyperparameter value is the mean of the measure over the different folds:

In [551]:
report(m_self_tuning_knn).history[1].measurement

1-element Array{Float64,1}:
 0.13665915288523797

This value can be computed directly from the `.per_fold` array:

In [553]:
mean(report(m_self_tuning_knn).history[1].per_fold[1])

0.13665915288523797

In [547]:
accuracy(y_test, predict(m_self_tuning_knn, X[test_ind,:]))

0.9333333333333333

### Defining a custom metric for the hyperparameter selection process

In [487]:
#MLJ.accuracy(y_train,ŷ_test)