# Getting started

## Fit, predict, transform

In [1]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/A-fit-predict"); Pkg.instantiate()

[32m[1m  Activating[22m[39m project at `D:\JULIA\6_ML_with_Julia\A-fit-predict`


> Preliminary steps
> * Data
> * MLJ Machine

> Training and testing a supervised model
> * Splitting the data
> * Fitting and testing the machine

> Unsupervised models

### Preliminary steps
---

### Data

As in "choosing a model", let's load the Iris dataset and unpack it:

In [2]:
using MLJ
import Statistics
using PrettyPrinting
using StableRNGs

X, y = @load_iris;

Let's also load the ```DecisionTreeClassifier```:

In [3]:
DecisionTreeClassifier = @load DecisionTreeClassifier pkg = DecisionTree
tree_model = DecisionTreeClassifier()

import MLJDecisionTreeInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\loading.jl:168


DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

### MLJ Machine

Remember that in MLJ a _model_ is only a container for the hyperparameters of the model. A _machine_ wraps a model together with data and stores what is learned during training; constructing a machine does not fit the model. It does, however, check that the model is compatible with the scientific type of the data and warns you if it is not.

In [4]:
tree = machine(tree_model, X, y)

Machine{DecisionTreeClassifier,…} trained 0 times; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
  args: 
    1:	Source @676 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @159 ⏎ `AbstractVector{Multiclass{3}}`
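
The split between a model (hyperparameters only) and a machine (model, data and learned state) can be illustrated with a small plain-Julia sketch. The names `ToyMeanModel`, `ToyMachine` and `toyfit!` are hypothetical and exist only for this illustration; this is not how MLJ implements machines.

```julia
using Statistics

# Hypothetical toy types for illustration only; NOT MLJ's implementation.
struct ToyMeanModel              # the "model": hyperparameters and nothing else
    shrinkage::Float64           # made-up hyperparameter in [0, 1]
end

mutable struct ToyMachine        # the "machine": model + data + learned state
    model::ToyMeanModel
    y::Vector{Float64}
    fitresult::Union{Nothing,Float64}
    state::Int                   # how many times the machine has been trained
end

# Constructing a machine only binds model and data; nothing is trained yet.
ToyMachine(model, y) = ToyMachine(model, collect(Float64, y), nothing, 0)

# Training mutates the machine, hence the `!` in the name.
function toyfit!(mach::ToyMachine; rows = eachindex(mach.y))
    mach.fitresult = (1 - mach.model.shrinkage) * mean(mach.y[rows])
    mach.state += 1
    return mach
end

toy = ToyMachine(ToyMeanModel(0.0), [1, 2, 3, 4])
@show toy.state                  # still untrained at this point
toyfit!(toy; rows = 1:2)
@show toy.state toy.fitresult
```

The counter in this sketch plays the same role as the "trained 0 times" message in the machine display above.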


A machine is used for both supervised and unsupervised models. This tutorial first gives an example for the supervised case and then moves on to the unsupervised case.

### Training and testing a supervised model
---

Now that you've declared the model you'd like to consider and the data, what remains are the standard training and testing steps of a supervised learning algorithm.

### Splitting the data

To split the data into a _training_ and a _testing_ set, you can use the function ```partition``` to obtain the indices of the data points that should be used for training and for testing:

In [5]:
rng = StableRNG(566)

StableRNGs.LehmerRNG(state=0x0000000000000000000000000000046d)

In [6]:
# eachindex: built-in function that returns the indices of a collection
# partition: plays a role similar to R's sample / Python's sklearn.model_selection.train_test_split

train, test = partition(eachindex(y), 0.7, shuffle = true, rng = rng)

([131, 145, 67, 55, 49, 18, 87, 2, 108, 109  …  41, 8, 58, 147, 120, 50, 92, 95, 105, 118], [39, 54, 9, 107, 97, 135, 68, 22, 1, 88  …  96, 80, 12, 33, 99, 16, 10, 114, 70, 113])

In [7]:
test[1:3]

3-element Vector{Int64}:
 39
 54
  9
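
To see what a shuffled 70/30 split amounts to, here is a minimal sketch using only Julia's standard library. The helper `toy_partition` is hypothetical, and because it uses a different random number generator it will not reproduce the indices returned by ```partition``` above; it only mirrors the idea.

```julia
using Random

# Hypothetical helper mimicking the idea of a shuffled split.
# It does NOT reproduce the exact indices that MLJ's `partition` returns.
function toy_partition(indices, fraction; rng = Random.default_rng())
    shuffled = shuffle(rng, collect(indices))          # random order of all indices
    n_train = round(Int, fraction * length(shuffled))  # size of the first part
    return shuffled[1:n_train], shuffled[n_train+1:end]
end

toy_train, toy_test = toy_partition(1:150, 0.7; rng = MersenneTwister(566))

@show length(toy_train) length(toy_test)
@show isempty(intersect(toy_train, toy_test))   # the two parts never overlap
```

Passing an explicit `rng` is what makes the split reproducible from one run to the next.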

### Fitting and testing the machine

To fit the machine, use the function ```fit!```, specifying the rows to be used for training:

In [8]:
fit!(tree, rows = train)

┌ Info: Training Machine{DecisionTreeClassifier,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464


Machine{DecisionTreeClassifier,…} trained 1 time; caches data
  model: MLJDecisionTreeInterface.DecisionTreeClassifier
  args: 
    1:	Source @676 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @159 ⏎ `AbstractVector{Multiclass{3}}`


Note that this **modifies** the machine, which now contains the trained parameters of the decision tree. You can inspect the result of the fitting with the ```fitted_params``` method:

In [9]:
fitted_params(tree) |> pprint

(tree = Decision Tree
Leaves: 5
Depth:  4,
 encoding =
     Dict(CategoricalArrays.CategoricalValue{String, UInt32} "virginica" =>
              0x00000003,
          CategoricalArrays.CategoricalValue{String, UInt32} "setosa" =>
              0x00000001,
          CategoricalArrays.CategoricalValue{String, UInt32} "versicolor" =>
              0x00000002))

In [10]:
tree.args

(Source @676, Source @159)

In [11]:
tree.data

((sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9  …  6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1  …  3.1, 3.1, 2.7, 3.2, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0], petal_length = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5  …  5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1], petal_width = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1  …  2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8]), CategoricalArrays.CategoricalValue{String, UInt32}["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])

In [12]:
tree.fit_okay # a Channel{Bool}, not a plain Bool

Channel{Bool}(1) (empty)

In [13]:
tree.fitresult

(Decision Tree
Leaves: 5
Depth:  4, CategoricalArrays.CategoricalValue{String, UInt32}["setosa", "versicolor", "virginica"], UInt32[0x00000001, 0x00000002, 0x00000003])

In [14]:
tree.frozen

false

In [15]:
tree.model

DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

In [16]:
tree.old_model

DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

In [17]:
tree.old_rows

105-element Vector{Int64}:
 131
 145
  67
  55
  49
  18
  87
   2
 108
 109
 133
  34
  60
   ⋮
  40
  98
  41
   8
  58
 147
 120
  50
  92
  95
 105
 118

In [18]:
tree.old_upstream_state

()

In [19]:
tree.report

(classes_seen = CategoricalArrays.CategoricalValue{String, UInt32}["setosa", "versicolor", "virginica"],
 print_tree = TreePrinter object (call with display depth),)

In [20]:
tree.resampled_data

((sepal_length = [7.4, 6.7, 5.6, 6.5, 5.3, 5.1, 6.7, 4.9, 7.3, 6.7  …  5.0, 5.0, 4.9, 6.3, 6.0, 5.0, 6.1, 5.6, 6.5, 7.7], sepal_width = [2.8, 3.3, 3.0, 2.8, 3.7, 3.5, 3.1, 3.0, 2.9, 2.5  …  3.5, 3.4, 2.4, 2.5, 2.2, 3.3, 3.0, 2.7, 3.0, 3.8], petal_length = [6.1, 5.7, 4.5, 4.6, 1.5, 1.4, 4.7, 1.4, 6.3, 5.8  …  1.3, 1.5, 3.3, 5.0, 5.0, 1.4, 4.6, 4.2, 5.8, 6.7], petal_width = [1.9, 2.5, 1.5, 1.5, 0.2, 0.3, 1.5, 0.2, 1.8, 1.8  …  0.3, 0.2, 1.0, 1.9, 1.5, 0.2, 1.4, 1.3, 2.2, 2.2]), CategoricalArrays.CategoricalValue{String, UInt32}["virginica", "virginica", "versicolor", "versicolor", "setosa", "setosa", "versicolor", "setosa", "virginica", "virginica"  …  "setosa", "setosa", "versicolor", "virginica", "virginica", "setosa", "versicolor", "versicolor", "virginica", "virginica"])

In [21]:
tree.state

1

The ```fitresult``` varies from model to model, though classifiers usually return a tuple whose first element is the fitted object and whose remaining elements keep track of how the classes are named and encoded (so that predictions can be labelled appropriately).

You can now use the machine to make predictions with the ```predict``` function specifying rows to be used for the prediction:

In [22]:
ŷ = predict(tree, rows = test) # to type ŷ, enter y\hat and press Tab

45-element CategoricalDistributions.UnivariateFiniteArray{Multiclass{3}, String, UInt32, Float64, 1}:
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{

In [23]:
@show ŷ[1]

ŷ[1] = UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)


                    UnivariateFinite{Multiclass{3}}
              ┌                                        ┐
       setosa ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1.0
   versicolor ┤ 0.0
    virginica ┤ 0.0
              └                                        ┘

Note that the output is probabilistic: each prediction is effectively a vector with a score for each class. You can get the most likely class by broadcasting the ```mode``` function over ```ŷ``` or by using ```predict_mode```:

In [24]:
ȳ = predict_mode(tree, rows = test)

45-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "setosa"
 "versicolor"
 "setosa"
 "versicolor"
 "versicolor"
 "virginica"
 "versicolor"
 "setosa"
 "setosa"
 "versicolor"
 "versicolor"
 "versicolor"
 "virginica"
 ⋮
 "versicolor"
 "versicolor"
 "versicolor"
 "versicolor"
 "setosa"
 "setosa"
 "versicolor"
 "setosa"
 "setosa"
 "virginica"
 "versicolor"
 "virginica"

The ```predict_mode``` function converts the predicted probabilities into class labels.

In [25]:
@show ȳ[1]

ȳ[1] = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"


CategoricalArrays.CategoricalValue{String, UInt32} "setosa"

In [26]:
@show mode(ŷ[1])

mode(ŷ[1]) = CategoricalArrays.CategoricalValue{String, UInt32} "setosa"

CategoricalArrays.CategoricalValue{String, UInt32} "setosa"
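
The relation between a probabilistic prediction and its mode can be sketched in plain Julia. The class scores below are made up for illustration and are not taken from the Iris run above; `classes`, `probs` and `toy_modes` are hypothetical names.

```julia
# Made-up class scores for two observations (not from the Iris run above).
classes = ["setosa", "versicolor", "virginica"]
probs = [0.1 0.7 0.2;
         0.6 0.3 0.1]            # one row per observation, one column per class

# The mode of each prediction is the class with the largest probability.
toy_modes = [classes[argmax(row)] for row in eachrow(probs)]
@show toy_modes
```

Taking the mode discards the information about how confident each prediction was, which is why the probabilistic output is kept for measures such as cross entropy.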


To measure the discrepancy between the predictions ```ŷ``` and the observed test labels ```y[test]```, you could use the average cross entropy:

In [27]:
mce = cross_entropy(ŷ, y[test]) |> mean

2.4029102259411435

In [28]:
round(mce, digits = 4)

2.4029
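
For intuition, the cross entropy of a single observation is the negative log of the probability that the model assigned to the true class. The sketch below uses made-up probabilities. Clamping at `eps()` is an assumption about how an infinite loss could be avoided when a probability is exactly 0; it is not confirmed by this tutorial. Under that assumption, three test points given probability 0 for their true class out of 45 would yield a mean close to the value reported above.

```julia
using Statistics

# Probability the model gave to the TRUE class of each observation (made-up values).
p_true = [0.9, 0.6, 1.0, 0.2]

# Per-observation cross entropy is -log(p_true).
# Clamping keeps the loss finite when a probability is exactly 0 or 1.
tol = eps(Float64)
losses = -log.(clamp.(p_true, tol, 1 - tol))
@show mean(losses)

# Assumption: with clamping at eps(), three hard misses among 45 test points
# would contribute 3 * -log(eps()) in total, and the hits almost nothing.
hard_miss_mean = 3 * -log(eps(Float64)) / 45
@show hard_miss_mean
```

A tree that outputs hard 0/1 probabilities is penalised heavily for every misclassification, which explains why the average can be large even when most predictions are correct.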

### Unsupervised models

---

Unsupervised models define a ```transform``` method, and may optionally implement an ```inverse_transform``` method. As in the supervised case, we use a machine to wrap the unsupervised model and the data:

In [29]:
v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)

Machine{UnivariateStandardizer,…} trained 0 times; caches data
  model: UnivariateStandardizer
  args: 
    1:	Source @909 ⏎ `AbstractVector{Count}`


We can then fit the machine and use it to apply the corresponding _data transformation_:

In [30]:
fit!(stand)

┌ Info: Training Machine{UnivariateStandardizer,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464


Machine{UnivariateStandardizer,…} trained 1 time; caches data
  model: UnivariateStandardizer
  args: 
    1:	Source @909 ⏎ `AbstractVector{Count}`


In [31]:
w = transform(stand, v)

4-element Vector{Float64}:
 -1.161895003862225
 -0.3872983346207417
  0.3872983346207417
  1.161895003862225

In [32]:
@show round.(w, digits = 2)
@show mean(w)
@show std(w)

round.(w, digits = 2) = [-1.16, -0.39, 0.39, 1.16]
mean(w) = 0.0
std(w) = 1.0


1.0
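
The numbers above are consistent with the usual standardization formula, subtracting the mean and dividing by the sample standard deviation. A minimal sketch with the standard library follows; `v_demo` and `w_demo` are illustrative names, and the formula is inferred from the output rather than taken from the ```UnivariateStandardizer``` source.

```julia
using Statistics

v_demo = [1, 2, 3, 4]
μ, σ = mean(v_demo), std(v_demo)   # `std` uses the n-1 (sample) correction by default
w_demo = (v_demo .- μ) ./ σ        # centre, then rescale to unit standard deviation

@show round.(w_demo, digits = 2)
```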

In this case, the model also has an inverse transform:

In [33]:
# map the standardized values back to the original scale
vv = inverse_transform(stand, w)

4-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0

In [34]:
sum(abs.(vv .- v))

0.0
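
The fit, transform and inverse-transform cycle can be mimicked with three small functions. This is a sketch with hypothetical helpers (`toy_fit`, `toy_transform`, `toy_inverse`), not MLJ's API. It shows that the parameters learned during fitting are all that is needed to map data in either direction.

```julia
using Statistics

# Hypothetical helpers; the "fit result" is just the learned mean and std.
toy_fit(v) = (μ = mean(v), σ = std(v))
toy_transform(p, v) = (v .- p.μ) ./ p.σ
toy_inverse(p, w) = w .* p.σ .+ p.μ

p = toy_fit([1, 2, 3, 4])
w_toy = toy_transform(p, [1, 2, 3, 4])

# Compare with ≈ rather than ==, since floating-point round-off is possible.
@show toy_inverse(p, w_toy) ≈ [1, 2, 3, 4]

# Parameters learned once can be reused on data not seen during fitting.
@show toy_transform(p, [2.5])
```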