# SoleXplorer

### The Swiss Army knife for machine learning

In [1]:
using SoleXplorer

Load NATOPS dataset, composed of time series

In [2]:
X, y = load_arff_dataset("NATOPS")

([1m360×24 DataFrame[0m
[1m Row [0m│[1m X[Hand tip l]                     [0m[1m Y[Hand tip l]                     [0m[1m Z[0m ⋯
     │[90m Array…                            [0m[90m Array…                            [0m[90m A[0m ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ [-0.519771, -0.52758, -0.531415,…  [-2.14011, -2.18043, -2.18425, -…  [ ⋯
   2 │ [-0.489753, -0.48607, -0.484529,…  [-1.55293, -1.54966, -1.55206, -…  [
   3 │ [-0.521346, -0.518394, -0.522321…  [-1.72326, -1.72407, -1.72326, -…  [
   4 │ [-0.57022, -0.562064, -0.565967,…  [-1.91196, -1.90369, -1.90527, -…  [
   5 │ [-0.624417, -0.626031, -0.625388…  [-1.84287, -1.84026, -1.84688, -…  [ ⋯
   6 │ [-0.502501, -0.502525, -0.499415…  [-2.17556, -2.15613, -2.18516, -…  [
   7 │ [-0.488461, -0.489463, -0.487539…  [-2.17242, -2.18203, -2.18057, -…  [
   8 │ [-0.468105, -0.410602, -0.473909…  [-1.86535, -1.89011, -1.87105, -…  [
  ⋮  │                 ⋮            

### downsize dataset
it is important to downsize the dataset to avoid long running times and to avoid memory issues

In [None]:
using StatsBase: sample
num_cols_to_sample, num_rows_to_sample, rng = 10, 50, Xoshiro(11)
chosen_cols = sample(rng, 1:size(X, 2), num_cols_to_sample; replace=false)
chosen_rows = sample(rng, 1:size(X, 1), num_rows_to_sample; replace=false)
X = X[chosen_rows, chosen_cols]
y = y[chosen_rows]

Xoshiro(0x0991231718e930cb, 0x28e1460087a5d0ff, 0x4d62c780da1946f0, 0x764f51fefd621192, 0x434e1895e0078176)

### Let's start diving into available models
## Decision Tree

In [5]:
model = symbolic_analysis(X, y; models=(type=:decisiontree,), preprocess=(;rng))

MethodError: MethodError: no method matching keys(::SoleXplorer.RulesParams)
The function `keys` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  keys(!Matched::Core.SimpleVector)
   @ Base essentials.jl:944
  keys(!Matched::Pkg.Registry.RegistryInstance)
   @ Pkg /snap/julia/136/share/julia/stdlib/v1.11/Pkg/src/Registry/registry_instance.jl:449
  keys(!Matched::Pkg.Types.Manifest)
   @ Pkg /snap/julia/136/share/julia/stdlib/v1.11/Pkg/src/Types.jl:313
  ...


For reproducible experiments, always include 'preprocess=(;rng)' in your 'preprocess' configuration.

In [53]:
model.mach

trained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = -1, …)
  args: 
    1:	Source @606 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @453 ⏎ AbstractVector{Multiclass{6}}


In [54]:
model.model

[34m▣[0m ([std(Z[Elbow l])w1] < 0.08596251206097288)
├✔ ([std(Y[Hand tip r])w1] < 1.1566050775773289)
│ ├✔ ([maximum(Y[Elbow r])w1] < 0.212721)
│ │ ├✔ ([maximum(Y[Hand tip r])w1] < 0.3105835)
│ │ │ ├✔ ([maximum(Y[Elbow r])w1] < 0.0018995000000000001)
│ │ │ │ ├✔ ([mean(Y[Hand tip l])w1] < -2.12836987254902)
│ │ │ │ │ ├✔ Not clear
│ │ │ │ │ └✘ All clear
│ │ │ │ └✘ ([std(X[Hand tip r])w1] < 0.7329422471876528)
│ │ │ │   ├✔ Not clear
│ │ │ │   └✘ All clear
│ │ │ └✘ ([mean(Y[Thumb r])w1] < -1.0434569705882355)
│ │ │   ├✔ ([maximum(Y[Elbow r])w1] < 0.0970635)
│ │ │   │ ├✔ All clear
│ │ │   │ └✘ Not clear
│ │ │   └✘ All clear
│ │ └✘ ([maximum(Y[Hand tip r])w1] < 0.7871275)
│ │   ├✔ Not clear
│ │   └✘ ([std(Y[Wrist r])w1] < 0.8482700353263318)
│ │     ├✔ All clear
│ │     └✘ Not clear
│ └✘ ([std(Y[Elbow l])w1] < 0.05585516323564049)
│   ├✔ I have command
│   └✘ ([minimum(Y[Thumb l])w1] < -1.9731075)
│     ├✔ I have command
│     └✘ All clear
└✘ ([std(X[Thumb r])w1] < 0.5878229455602122)
  ├✔

La prediction è fatta sul modello originale, non sul modello Sole. Chiedere se è possibile fare la predizione sul modello Sole.

In [70]:
preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.7361111111111112

Non c'è bisongo di definire se l'esperimento è di classificazione o regressione.

## Random Forest

In [73]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:randomforest, rng=rng))

ModelConfig:
    setup      =SymbolicModelSet(type=MLJDecisionTreeInterface.RandomForestClassifier, features=4)
    classifier =RandomForestClassifier(max_depth = -1, …)
    rules      =nothing
    accuracy   =nothing


In [74]:
preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.75

## XGBoost

In [4]:
model = traintest(X, y; models=(; type=:xgboost))

└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93


ModelConfig:
    setup      =SymbolicModelSet(type=MLJXGBoostInterface.XGBoostClassifier, features=4)
    classifier =XGBoostClassifier(test = 1, …)
    rules      =nothing
    accuracy   =nothing


In [5]:
model.model

[34m▣[0m Ensemble{CategoricalArrays.CategoricalValue{String, UInt32}} of 454 models of type Branch{CategoricalArrays.CategoricalValue{String, UInt32}}
├[1/454]┐ ([mean(X[Thumb r])w1] < 1.07237208)
│       ├✔ ([std(X[Wrist l])w1] < 0.0152500486)
│       │ ├✔ ([maximum(Y[Hand tip r])w1] < 0.743623018)
│       │ │ ├✔ All clear : (ninstances = 4, ncovered = 4, confidence = 1.0, lift = 1.0)
│       │ │ └✘ I have command : (ninstances = 6, ncovered = 6, confidence = 1.0, lift = 1.0)
│       │ └✘ ([mean(Y[Wrist r])w1] < -1.02672195)
│       │   ├✔ Not clear : (ninstances = 7, ncovered = 7, confidence = 0.57, lift = 1.0)
│       │   └✘ ([minimum(Z[Elbow l])w1] < -0.0650589988)
│       │     ├✔ Fold wings : (ninstances = 176, ncovered = 176, confidence = 0.27, lift = 1.0)
│       │     └✘ Not clear : (ninstances = 4, ncovered = 4, confidence = 0.5, lift = 1.0)
│       └✘ ([maximum(Y[Elbow r])w1] < 0.119737998)
│         ├✔ ([maximum(Y[Hand tip r])w1] < 0.715116024)
│         │ ├✔ ([maximum(Y[

In [25]:
preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.7777777777777778

## Modal DecisionTree

In [None]:
Random.seed!(train_seed)
model = traintest(X, y; models=(; type=:modaldecisiontree, rng=rng))

model isa SymbolicModel = true


┌ Info: Precomputing logiset...
└ @ SoleData /home/paso/Documents/Aclai/Sole/SoleData.jl/src/utils/autologiset-tools.jl:277


ModelConfig:
    setup      =SymbolicModelSet(type=ModalDecisionTrees.MLJInterface.ModalDecisionTree, features=4)
    classifier =ModalDecisionTree(max_depth = nothing, …)
    rules      =nothing
    accuracy   =nothing


In [27]:
preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.8194444444444444

## Modal RandomForest

In [None]:
Random.seed!(train_seed)
model = traintest(X, y; models=(; type=:modalrandomforest, rng=rng))

┌ Info: Precomputing logiset...
└ @ SoleData /home/paso/Documents/Aclai/Sole/SoleData.jl/src/utils/autologiset-tools.jl:277
[32mApplying trees... 100%|██████████████████████████████████| Time: 0:00:06[39m
└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93
└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93
└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93
└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93
└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93


ModelConfig:
    setup      =SymbolicModelSet(type=ModalDecisionTrees.MLJInterface.ModalRandomForest, features=4)
    classifier =ModalRandomForest(sampling_fraction = 0.7, …)
    rules      =nothing
    accuracy   =nothing


In [19]:
preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.875

## Scegliamo DecisionTree da ottimizzare, perchè è il più scarso

In [None]:
Random.seed!(train_seed)
model = traintest(X, y; models=(; type=:decisiontree, rng=rng))

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.7361111111111112

## Posso modificare i parametri di preparazione del dataset:
tipo, anzichè usare una sola "finestra" e ridurre il dato ad una singola dimensione,
potrei usare più finestre e ridurre il dato ad una dimensione per ogni finestra.

In [81]:
Random.seed!(train_seed)
model = traintest(X, y; 
    models=(
        type=:decisiontree,
        winparams=(; type=adaptivewindow, nwindows=5),)
    )

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.875

Potrei anche decidere che feature utilizzare

In [86]:
Random.seed!(train_seed)
model = traintest(X, y; 
    models=(
        type=:decisiontree,
        features=[minimum, maximum, mean, cov, std],)
    )

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.7638888888888888

Proviamo a unire?

Promemoria: const DEFAULT_FEATS = [maximum, minimum, mean, std]

In [92]:
Random.seed!(train_seed)
model = traintest(X, y; 
    models=(
        type=:decisiontree,
        features=[minimum, maximum, mean, cov, std],
        winparams=(; type=adaptivewindow, nwindows=5),)
    )

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

0.8472222222222222

Possiamo fare cross validation tramite la funzione MLJ stratified cv.

In [1]:
Random.seed!(train_seed)
model = traintest(X, y;
    models=(
        type=:decisiontree,
        features=[minimum, maximum, mean, cov, std],
        winparams=(; type=adaptivewindow, nwindows=5),
    ),
    preprocess=(
        stratified=true,
        nfolds=10,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

UndefVarError: UndefVarError: `Random` not defined in `Main`
Suggestion: check for spelling errors or missing imports.
Hint: Random is loaded but not imported in the active module Main.

Possiamo utilizzare le tecniche di tuning fornite da MLJ

In [None]:
Random.seed!(train_seed)
model = traintest(X, y;
    models=(
        type=:decisiontree,
        features=[minimum, maximum, mean, cov, std],
        winparams=(; type=adaptivewindow, nwindows=5),
        tuning=true
    ),
    preprocess=(
        stratified=true,
        nfolds=10,
    )
)

# verifica che rng vada ovunque

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

0.8361111111111112

Anche il tuning è parametrizzabile

In [34]:
Random.seed!(train_seed)
model = traintest(X, y;
    models=(
        type=:decisiontree,
        features=[minimum, maximum, mean, cov, std],
        winparams=(; type=adaptivewindow, nwindows=5),
        tuning=(
            method=(type=latinhypercube, rng=rng), 
            params=(repeats=20, n=10),
            ranges=[
                SoleXplorer.range(:merge_purity_threshold, lower=0.1, upper=2.0),
                SoleXplorer.range(:feature_importance, values=[:impurity, :split])
            ]
        ), 
    ),
    preprocess=(
        stratified=true,
        nfolds=10,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

0.836111111111111

### Modal RandomForest parametrizzato

In [None]:
Random.seed!(train_seed)
model = traintest(X, y;
    models=(
        type=:modalrandomforest,
        features=[minimum, maximum, mean, cov, std],
        tuning=(
            method=(type=latinhypercube, rng=rng), 
            params=(repeats=20, n=10),
            ranges=[
                SoleXplorer.range(:sampling_fraction, lower=0.1, upper=0.9),
                SoleXplorer.range(:feature_importance, values=[:impurity, :split])
            ]
        ), 
    ),
    preprocess=(
        stratified=true,
        nfolds=6,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

┌ Info: Precomputing logiset...
└ @ SoleData /home/paso/Documents/Aclai/Sole/SoleData.jl/src/utils/autologiset-tools.jl:277
┌ Info: Precomputing logiset...
└ @ SoleData /home/paso/Documents/Aclai/Sole/SoleData.jl/src/utils/autologiset-tools.jl:277
┌ Info: Precomputing logiset...
└ @ SoleData /home/paso/Documents/Aclai/Sole/SoleData.jl/src/utils/autologiset-tools.jl:277


### XGBoost

In [9]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost, seed=11))

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

└ @ SoleBase /home/paso/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:93


0.7916666666666666

    params = (;
        test                        = 1, 
        num_round                   = 100, 
        booster                     = "gbtree", 
        disable_default_eval_metric = 0, 
        eta                         = 0.3,      # alias: learning_rate
        num_parallel_tree           = 1, 
        gamma                       = 0.0, 
        max_depth                   = 6, 
        min_child_weight            = 1.0, 
        max_delta_step              = 0.0, 
        subsample                   = 1.0, 
        colsample_bytree            = 1.0, 
        colsample_bylevel           = 1.0, 
        colsample_bynode            = 1.0, 
        lambda                      = 1.0, 
        alpha                       = 0.0, 
        tree_method                 = "auto", 
        sketch_eps                  = 0.03, 
        scale_pos_weight            = 1.0, 
        updater                     = nothing, 
        refresh_leaf                = 1, 
        process_type                = "default", 
        grow_policy                 = "depthwise", 
        max_leaves                  = 0, 
        max_bin                     = 256, 
        predictor                   = "cpu_predictor", 
        sample_type                 = "uniform", 
        normalize_type              = "tree", 
        rate_drop                   = 0.0, 
        one_drop                    = 0, 
        skip_drop                   = 0.0, 
        feature_selector            = "cyclic", 
        top_k                       = 0, 
        tweedie_variance_power      = 1.5, 
        objective                   = "automatic", 
        base_score                  = 0.5, 
        early_stopping_rounds       = 0, 
        watchlist                   = nothing, 
        nthread                     = 1, 
        importance_type             = "gain", 
        seed                        = nothing, 
        validate_parameters         = false, 
        eval_metric                 = String[]
    )

Quasi fondamentale, in XGBoost, è l'utilizzo di early stopping.
Utilizza 2 dataset: uno di train e uno di validation, gli alberi vengono man mano ottimizzati,
ma si fermerà automaticamente quando i parametri di controllo (ad esempio, logloss) sono stabili.

In [14]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11)
    ),
    # with early stopping a validation set is required
    preprocess=(; valid_ratio = 0.8)
)

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.18508738538493286	eval-mlogloss:1.43033578889123314
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.84708688310954883	eval-mlogloss:1.25396987590296516
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.62557121048802911	eval-mlogloss:1.10083969553996774
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.47018248322217360	eval-mlogloss:0.98764142085765971
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.35891678631305696	eval-mlogloss:0.90764036003885595
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

0.7777777777777778

Possiamo provare a modificare qualche parametro come per DecisionTree...

In [None]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11),
    features=[minimum, maximum, mean, cov, std],
    ),
    preprocess=(; valid_ratio = 0.8)
)

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.18513766216195138	eval-mlogloss:1.43191179735907204
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.84520895662515061	eval-mlogloss:1.24948379089092376
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.62513551362182784	eval-mlogloss:1.09960836685937036
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.47193204993787019	eval-mlogloss:0.99195546836688597
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.36234309414158694	eval-mlogloss:0.91941915606630265
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

0.7777777777777778

In [None]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11),
        winparams=(; type=adaptivewindow, nwindows=5),
        features=[minimum, maximum, mean, cov, std],
    ),
    preprocess=(; valid_ratio = 0.8)
)

preds = MLJ.predict(model.mach, model.ds.Xtest)
yhat = MLJ.mode.(preds)
acc = MLJ.accuracy(yhat, model.ds.ytest)

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.12219827175140385	eval-mlogloss:1.27251948364849743
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.79069778297258464	eval-mlogloss:1.02420010854457977
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.58083628960277722	eval-mlogloss:0.84634144450056137
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.43454184169354648	eval-mlogloss:0.73508124567311384
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.33020068977190098	eval-mlogloss:0.65528494530710679
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

0.8055555555555556

Cross validation

In [21]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11),
        winparams=(; type=adaptivewindow, nwindows=5),
        features=[minimum, maximum, mean, cov, std],
    ),
    preprocess=(
        valid_ratio = 0.8,
        stratified=true,
        nfolds=10,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.13404390296420532	eval-mlogloss:1.23521783351898184
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.80589340690480238	eval-mlogloss:0.96364070727274964
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.59449707863413692	eval-mlogloss:0.77508205633897043
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.44630928976195200	eval-mlogloss:0.65414448609718912
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.33868955323134609	eval-mlogloss:0.55019021080090447
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

0.8472222222222221

Cross validation and MLJ tuning

In [22]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11),
        winparams=(; type=adaptivewindow, nwindows=5),
        features=[minimum, maximum, mean, cov, std],
        tuning=(
            method=(type=latinhypercube, rng=rng), 
            params=(repeats=20, n=10),
            ranges=[
                SoleXplorer.range(:grow_policy, values=["depthwise", "lossguide"]),
                SoleXplorer.range(:booster, values=["gbtree", "dart"])
            ]
        ), 
    ),
    preprocess=(
        valid_ratio = 0.8,
        stratified=true,
        nfolds=10,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.23177356747586764	eval-mlogloss:1.26494487982529868
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.92939475305292141	eval-mlogloss:0.98090204275571380
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.72728197155772034	eval-mlogloss:0.79119169895465558
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.58293874832193826	eval-mlogloss:0.66121822228798499
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.48190357303067066	eval-mlogloss:0.56509272318619952
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

0.8472222222222221

In [6]:
Random.seed!(train_seed)
model = traintest(X, y; models=(type=:xgboost,
    params=(
        num_round=10000,
        max_depth=6,
        objective="multi:softprob",
        early_stopping_rounds=20,
        watchlist=makewatchlist,
        seed=11),
        winparams=(; type=adaptivewindow, nwindows=5),
        features=catch9
    ),
    preprocess=(
        valid_ratio = 0.8,
        # stratified=true,
        # nfolds=6,
    )
)

preds = [MLJ.predict(mach, Xtest) for (mach, Xtest) in zip(model.mach, model.ds.Xtest)]
yhat = [MLJ.mode.(p) for p in preds]
acc = mean([MLJ.accuracy(hat, ytest) for (hat, ytest) in zip(yhat, model.ds.ytest)])

┌ Info: XGBoost: starting training.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:601
┌ Info: Will train until there has been no improvement in 20 rounds.
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:438
┌ Info: [1]	train-mlogloss:1.13305319547653194	eval-mlogloss:1.28214483631068266
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [2]	train-mlogloss:0.79413719073585842	eval-mlogloss:1.02041349945397219
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [3]	train-mlogloss:0.57979233886884607	eval-mlogloss:0.85356328251032998
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [4]	train-mlogloss:0.43162072767382081	eval-mlogloss:0.71635070443153381
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/booster.jl:451
┌ Info: [5]	train-mlogloss:0.32667233684788577	eval-mlogloss:0.62758346364415929
└ @ XGBoost /home/paso/.julia/packages/XGBoost/nqMqQ/src/b

MethodError: MethodError: no method matching bestguess(::Vector{Union{Nothing, String}}; suppress_parity_warning::Bool)
The function `bestguess` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  bestguess(!Matched::AbstractVector{<:AbstractFloat}, !Matched::Union{Nothing, AbstractVector}; suppress_parity_warning)
   @ SoleBase ~/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:102
  bestguess(!Matched::AbstractVector{<:Union{AbstractString, Integer, CategoricalArrays.CategoricalValue}}, !Matched::Union{Nothing, AbstractVector}; suppress_parity_warning)
   @ SoleBase ~/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:71
  bestguess(!Matched::AbstractVector{<:AbstractFloat}; ...)
   @ SoleBase ~/Documents/Aclai/Sole/SoleBase.jl/src/machine-learning-utils.jl:102
  ...
