# Getting started

## Composing models

> https://juliaai.github.io/DataScienceTutorials.jl/getting-started/composing-models/
> <br> (project folder) https://raw.githubusercontent.com/juliaai/DataScienceTutorials.jl/gh-pages/__generated/A-composing-models.tar.gz

In [1]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/A-composing-models"); Pkg.instantiate()

  Activating project at `D:\JULIA\6_ML_with_Julia\A-composing-models`


> Generating dummy data <br>
> Declaring a pipeline

### Generating dummy data

---

Let's start by generating some dummy data with both numerical values and categorical values:

In [2]:
using MLJ
using PrettyPrinting

KNNRegressor = @load KNNRegressor

# input
X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))

# target
height = [178, 194, 165, 173, 168]

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\loading.jl:168


import NearestNeighborModels ✔


5-element Vector{Int64}:
 178
 194
 165
 173
 168

Note that the scientific type of ```age``` is ```Count``` here:

In [3]:
scitype(X.age)

AbstractVector{Count} (alias for AbstractArray{Count, 1})

In [4]:
schema(X)

┌────────┬───────────────┬────────────────────────────────┐
│ names  │ scitypes      │ types                          │
├────────┼───────────────┼────────────────────────────────┤
│ age    │ Count         │ Int64                          │
│ gender │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
└────────┴───────────────┴────────────────────────────────┘


We will want to coerce ```age``` to ```Continuous``` so that it can be passed to a regressor that expects such values.
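
As a minimal sketch of that coercion, assuming the same ```X``` as above, ```coerce``` can be applied to a single column of the table. The name ```Xc``` is only illustrative:

```julia
using MLJ

X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))

# Coerce only the :age column; :gender is left untouched.
Xc = coerce(X, :age => Continuous)

scitype(Xc.age)  # scientific type of the coerced column
schema(Xc)       # inspect the scitypes of all columns
```

The original ```X``` is not modified; ```coerce``` returns a new table.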

### Declaring a pipeline

---

A typical workflow for such data is to one-hot-encode the categorical data and then apply some regression model on the data. Let's say that we want to apply the following steps:

1. One-hot encode the categorical features in ```X```.

2. Standardize the target variable (```height```).

3. Train a KNN regression model on the one-hot encoded data and the standardized target.

The ```Pipeline``` constructor helps you define such a simple (non-branching) pipeline of steps to be applied in order:

In [5]:
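pipe = Pipeline(
    # coerce :age from Count to Continuous
    coercer = X -> coerce(X, :age => Continuous),
    # one-hot encode the categorical features
    one_hot_encoder = OneHotEncoder(),
    # standardize the target, then fit the KNN regressor
    transformed_target_model = TransformedTargetModel(
        model = KNNRegressor(K = 3),
        target = UnivariateStandardizer()
    )
)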

DeterministicPipeline(
    coercer = var"#1#2"(),
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    transformed_target_model = TransformedTargetModelDeterministic(
            model = KNNRegressor,
            target = UnivariateStandardizer,
            inverse = nothing,
            cache = true),
    cache = true)

Note the coercion of the ```:age``` variable to ```Continuous```, since ```KNNRegressor``` expects ```Continuous``` input. Note also the ```TransformedTargetModel```, which learns a transformation of the target variable (in this case, standardization) that is applied before the target is passed to the ```KNNRegressor```; predictions are then mapped back to the original scale.
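
To see what the individual steps do, they can be tried in isolation outside the pipeline. The following is only a sketch under the assumption that the same dummy data is used. The names ```encoder```, ```standardizer```, ```z``` and ```restored``` are illustrative, and the target is written as floats here for simplicity:

```julia
using MLJ
using Statistics  # for mean

X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))
height = [178.0, 194.0, 165.0, 173.0, 168.0]

# Step 1 in isolation: one-hot encode the categorical column of X.
encoder = machine(OneHotEncoder(), X)
fit!(encoder, verbosity = 0)
Xt = transform(encoder, X)
schema(Xt)  # :gender is replaced by one indicator column per level

# Step 2 in isolation: learn the standardization of the target,
# apply it, then map the result back to the original scale.
standardizer = machine(UnivariateStandardizer(), height)
fit!(standardizer, verbosity = 0)
z = transform(standardizer, height)
restored = inverse_transform(standardizer, z)
```

The pipeline performs these fits and transformations itself, so none of this is needed in practice; it only illustrates what is being composed.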

Hyperparameters of this pipeline can be accessed (and set) using dot syntax:

In [6]:
pipe.transformed_target_model.model.K = 2

2

In [7]:
pipe.one_hot_encoder.drop_last = true;

A pipe can be evaluated with the ```evaluate``` method (```evaluate!``` is the variant that takes a machine); implicitly, it constructs machines that hold the fitted parameters:

In [8]:
evaluate(pipe, X, height, resampling = Holdout(), measure = rms) |> pprint

│ scitype(y) = AbstractVector{Count}
│ target_scitype(model) = AbstractVector{Continuous}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:140


PerformanceEvaluation(11.5,)
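
The warning in the output above reports that the target has scitype ```Count``` while the model expects ```Continuous```. One possible way to avoid it, sketched here under the assumption that NearestNeighborModels is available in the active project, is to coerce the target first. The same block also shows the machine-based ```evaluate!``` form; the names ```y```, ```mach``` and ```e``` are illustrative:

```julia
using MLJ

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels verbosity=0

X = (age = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))

# Coerce the integer target so its scitype matches what the model expects.
y = coerce([178, 194, 165, 173, 168], Continuous)

pipe = Pipeline(
    coercer = X -> coerce(X, :age => Continuous),
    one_hot_encoder = OneHotEncoder(),
    transformed_target_model = TransformedTargetModel(
        model = KNNRegressor(K = 2),
        target = UnivariateStandardizer()
    )
)

# Bind the pipeline to the data explicitly, then evaluate the machine.
mach = machine(pipe, X, y)
e = evaluate!(mach, resampling = Holdout(), measure = rms, verbosity = 0)
```

With only five rows, the holdout estimate is based on very few test points, so the reported measurement should not be read as a meaningful error estimate.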