Merge pull request #506 from alan-turing-institute/dev
For a 0.11.1 release
ablaom committed Apr 29, 2020
2 parents 3936fd2 + e7b84e4 commit 7bd2ed1
Showing 12 changed files with 173 additions and 102 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -6,4 +6,7 @@
*.bu
.DS_Store
sandbox/
docs/build
docs/build
paper/auto
*.aux
*.bbl
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "MLJ"
uuid = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
authors = ["Anthony D. Blaom <anthony.blaom@gmail.com>"]
version = "0.11.0"
version = "0.11.1"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
13 changes: 8 additions & 5 deletions docs/src/julia_blogpost.md
@@ -1,10 +1,17 @@
!!! warning "Old post"

This post is quite old. For a newer overview of the design of MLJ, see [here](https://github.com/alan-turing-institute/MLJ.jl/blob/master/paper/paper.md)


# Beyond machine learning pipelines with MLJ

Anthony Blaom, Diego Arenas, Franz Kiraly, Yiannis Simillides, Sebastian Vollmer

**May 1st, 2019.** Blog post also posted on the [Julia Language Blog](https://julialang.org/blog/2019/05/beyond-ml-pipelines-with-mlj)




![](img/learningcurves.png) | ![](img/heatmap.png)
------------------------|--------------------------
![](img/wrapped_ridge.png) | ![](img/MLPackages.png)
@@ -33,11 +40,7 @@ composition.

- Video from [London Julia User Group meetup in March 2019](https://www.youtube.com/watch?v=CfHkjNmj1eE) (skip to [demo at 21'39](https://youtu.be/CfHkjNmj1eE?t=21m39s)) &nbsp;

- The MLJ [tour](https://github.com/alan-turing-institute/MLJ.jl/blob/master/docs/src/tour.ipynb)

- Building a [self-tuning random forest](https://github.com/alan-turing-institute/MLJ.jl/blob/master/examples/random_forest.ipynb)

- An MLJ [docker image](https://github.com/ysimillides/mlj-docker) (including tour)
- [MLJ Tutorials](https://alan-turing-institute.github.io/MLJTutorials/)

- Implementing the MLJ interface for a [new model](https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/)

6 changes: 6 additions & 0 deletions docs/src/machines.md
@@ -78,6 +78,12 @@ fitted_params(mach)
report(mach)
```

```@docs
fitted_params
report
```


## Saving machines

To save a machine to file, use the [`MLJ.save`](@ref) command:
112 changes: 81 additions & 31 deletions docs/src/mlj_cheatsheet.md
@@ -19,19 +19,17 @@ for "RidgeRegresssor", which is provided by multiple packages

`models(x -> x.is_supervised && x.is_pure_julia)` lists all supervised models written in pure julia.

**experimental:**
`models(matching(X))` lists all unsupervised models compatible with input `X`.

**experimental!**
`models(matching(X, y))` lists all supervised models compatible with input/target `X/y`.

**experimental!** With additional conditions:
With additional conditions:

```julia
models() do model
matching(model, X, y)) &&
matching(model, X, y) &&
model.prediction_type == :probabilistic &&
model.is_pure_julia
model.is_pure_julia
end
```

@@ -45,7 +43,7 @@ instantiates a model provided by multiple packages

## Scitypes and coercion

`scitype(x)` is the scientific type of `x`. For example `scitype(2.4) = Continuous`
`scitype(x)` is the scientific type of `x`. For example `scitype(2.4) == Continuous`

![scitypes.png](img/scitypes_small.png)

@@ -55,14 +53,16 @@ type | scitype
`Integer` | `Count`
`CategoricalValue` and `CategoricalString` | `Multiclass` or `OrderedFactor`

*Figure and Table for scalar scitypes*
*Figure and Table for common scalar scitypes*

Use `schema(X)` to get the column scitypes of a table `X`

`coerce(y, Multiclass)` attempts coercion of all elements of `y` into scitype `Multiclass`

`coerce(X, :x1 => Continuous, :x2 => OrderedFactor)` to coerce columns `:x1` and `:x2` of table `X`.

`coerce(X, Count => Continuous)` to coerce all columns with `Count` scitype to `Continuous`.
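
A small runnable sketch of these coercion calls (the column names `:age` and `:gender` are illustrative, not from the manual):

```julia
using MLJ

# a column table whose raw types don't reflect the intended interpretation
X = (age = [23, 45, 34], gender = [1, 2, 2])
schema(X)                          # both columns have scitype Count

# coerce column-by-column ...
Xnew = coerce(X, :age => Continuous, :gender => Multiclass)

# ... or every Count column at once
coerce(X, Count => Continuous)

schema(Xnew)                       # :age is Continuous, :gender is Multiclass
```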


## Ingesting data

@@ -74,15 +74,31 @@ channing = dataset("boot", "channing")
y, X = unpack(channing,
==(:Exit), # y is the :Exit column
!=(:Time); # X is the rest, except :Time
:Exit=>Continuous, # correct wrong scitypes
:Exit=>Continuous, # correcting wrong scitypes (optional)
:Entry=>Continuous,
:Cens=>Multiclass)
```
*Warning.* Before Julia 1.2, use `col -> col != :Time` instead of `!=(:Time)`.

Splitting row indices into train/validation/test:

`train, valid, test = partition(eachindex(y), 0.7, 0.2, shuffle=true, rng=1234)` for 70:20:10 ratio

For a stratified split:

`train, test = partition(eachindex(y), 0.8, stratify=y)`

Getting data from [OpenML](https://www.openml.org):

`table = openML.load(91)`

Creating synthetic classification data:

`X, y = make_blobs(100, 2)` (also: `make_moons`, `make_circles`)

Creating synthetic regression data:

`X, y = make_regression(100, 2)`
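
A runnable sketch tying the helpers above together (synthetic data, so nothing needs downloading; the 70:20:10 split mirrors the one-liner above):

```julia
using MLJ

X, y = make_regression(100, 2)     # synthetic regression data

# 70:20:10 split of the row indices, reproducibly shuffled
train, valid, test = partition(eachindex(y), 0.7, 0.2, shuffle=true, rng=1234)

length.((train, valid, test))      # (70, 20, 10)
```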

## Machine construction
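
A rough sketch of the construct/train/predict workflow (assumes DecisionTree.jl is installed so that `@load DecisionTreeClassifier` succeeds; variable names are illustrative):

```julia
using MLJ

X, y = make_blobs(100, 2)                  # synthetic classification data
train, test = partition(eachindex(y), 0.8, shuffle=true, rng=123)

tree = @load DecisionTreeClassifier        # model instance from DecisionTree.jl
mach = machine(tree, X, y)                 # bind the model to the data

fit!(mach, rows=train)                     # train on the training rows
yhat = predict(mach, rows=test)            # probabilistic predictions
predict_mode(mach, rows=test)              # point predictions
```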

@@ -114,79 +130,112 @@ Unsupervised case: `transform(mach, rows=1:100)` or `inverse_transform(mach, rows=1:100)`

`params(model)` gets nested-tuple of all hyperparameters, even nested ones

`info(ConstantRegresssor())`, `info("PCA")`, `info("RidgeRegressor",
`info(ConstantRegressor())`, `info("PCA")`, `info("RidgeRegressor",
pkg="MultivariateStats")` gets all properties (aka traits) of registered models

`info(rms)` gets all properties of a performance measure

`schema(X)` gets column names, types and scitypes, and nrows, of a table `X`

`scitype(model)`, `scitype(rms)`, `scitype(X)` gets scientific type of a model, measure or table (encoding key properties)
`scitype(X)` gets scientific type of a table

`fitted_params(mach)` gets learned parameters of fitted machine

`report(mach)` gets other training results (e.g. feature rankings)
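
Continuing the machine-construction sketch above (the exact fields returned vary by model, so the comments are only indicative):

```julia
fitted_params(mach)      # e.g. the raw fitted tree, for a decision tree model
report(mach)             # e.g. classes seen and other training byproducts

params(tree)             # nested named tuple of hyperparameters
info("DecisionTreeClassifier", pkg="DecisionTree")   # registry traits
```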

## Resampling strategies
`fitted_params(mach).fitted_params_given_machine` returns a dictionary if `mach` is a composite (eg, wraps a `@pipeline` model)

`Holdout(fraction_train=…, shuffle=false)` for simple holdout
`report(mach).report_given_machine` returns a dictionary if `mach` is a composite (eg, wraps a `@pipeline` model)

`CV(nfolds=6, shuffle=false)` for cross-validation

or a list of pairs of row indices:
## Saving and retrieving machines

`MLJ.save("trained_for_five_days.jlso", mach)` to save machine `mach`

`predict_only_mach = machine("trained_for_five_days.jlso")` to deserialize.
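
A save/restore round trip, continuing the same sketch (the `.jlso` filename is arbitrary):

```julia
MLJ.save("tree.jlso", mach)       # serialize the fitted machine

mach2 = machine("tree.jlso")      # deserialize; training data is not included
predict(mach2, X)                 # but it can still predict on new data
```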

`[(train1, eval1), (train2, eval2), ... (traink, evalk)]`

## Performance estimation

`evaluate(model, X, y, resampling=CV(), measure=rms, operation=predict, weights=..., verbosity=1)`

`evaluate!(mach, resampling=Holdout(), measure=[rms, mav], operation=predict, weights=..., verbosity=1)`

`evaluate!(mach, resampling=[(fold1, fold2), (fold2, fold1)], measure=rms)`
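
For example, as a sketch (assumes NearestNeighbors.jl is installed so that `@load KNNRegressor` succeeds):

```julia
using MLJ

X, y = make_regression(100, 2)
knn = @load KNNRegressor

evaluate(knn, X, y,
         resampling=CV(nfolds=6, rng=1234),
         measure=[rms, mae],
         verbosity=0)
```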

## Resampling strategies (`resampling=...`)

`Holdout(fraction_train=0.7, rng=1234)` for simple holdout

`CV(nfolds=6, rng=1234)` for cross-validation

`StratifiedCV(nfolds=6, rng=1234)` for stratified cross-validation

or a list of pairs of row indices:

`[(train1, eval1), (train2, eval2), ... (traink, evalk)]`
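
A sketch of the explicit-pairs form, continuing with `knn`, `X`, `y` from the evaluation sketch above (two pairs amount to two-fold cross-validation):

```julia
mach = machine(knn, X, y)

fold1, fold2 = partition(eachindex(y), 0.5, shuffle=true, rng=42)
evaluate!(mach, resampling=[(fold1, fold2), (fold2, fold1)], measure=rms)
```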

## Tuning

### Ranges for tuning
### Tuning model wrapper

`tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…)`

If `r = range(KNNRegressor(), :K, lower=1, upper = 20, scale=:log)` then `iterator(r, 6) = [1, 2, 3, 6, 11, 20]`
### Ranges for tuning (`range=...`)

Non-numeric ranges: `r = range(model, :parameter, values=…)`.
If `r = range(KNNRegressor(), :K, lower=1, upper = 20, scale=:log)`

then `Grid()` search uses `iterator(r, 6) == [1, 2, 3, 6, 11, 20]`.

`lower=-Inf` and `upper=Inf` are allowed.

Non-numeric ranges: `r = range(model, :parameter, values=…)`

Nested ranges: Use dot syntax, as in `r = range(EnsembleModel(atom=tree), :(atom.max_depth), ...)`

### Tuning strategies
Can specify multiple ranges, as in `range=[r1, r2, r3]`. For more range options do `?Grid` or `?RandomSearch`

`Grid(resolution=10)` for grid search

### Tuning model wrapper
### Tuning strategies

`Grid(resolution=10)` or `Grid(goal=50)` for basic grid search

`tuned_model = TunedModel(model=…, tuning=Grid(), resampling=Holdout(), measure=…, operation=predict, ranges=…, minimize=true, full_report=true)`
`RandomSearch(rng=1234)` for basic random search
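
Putting wrapper, range and strategy together, as a sketch (`knn`, `X`, `y` as in the evaluation sketch above; a tuned model is itself a model, so it is fit like any other):

```julia
r = range(knn, :K, lower=1, upper=20, scale=:log)

tuned_knn = TunedModel(model=knn,
                       tuning=Grid(resolution=10),
                       resampling=CV(nfolds=5),
                       measure=rms,
                       range=r)

mach = machine(tuned_knn, X, y)
fit!(mach, verbosity=0)

fitted_params(mach).best_model     # e.g. the best K found
report(mach)                       # tuning history and best measurement
```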


#### Learning curves

`curve = learning_curve!(mach, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)`
For generating a plot of performance against the parameter specified by `range`:

`curve = learning_curve(mach, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)`

`curve = learning_curve(model, X, y, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)`

If using Plots.jl:

`plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)`
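
A concrete, runnable version of the above, as a sketch (assumes Plots.jl and NearestNeighbors.jl are installed):

```julia
using MLJ, Plots

X, y = make_regression(100, 2)
knn = @load KNNRegressor

r = range(knn, :K, lower=1, upper=20)
curve = learning_curve(knn, X, y, range=r, resampling=CV(nfolds=5),
                       measure=rms, resolution=20)

plot(curve.parameter_values, curve.measurements,
     xlab=curve.parameter_name, ylab="rms")
```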


## Built-in performance measures
## Performance measures (metrics)

`l1`, `l2`, `mav`, `rms`, `rmsl`, `rmslp1`, `rmsp`, `misclassification_rate`, `cross_entropy`
`area_under_curve`, `accuracy`, `balanced_accuracy`, `cross_entropy`, `FScore`, `false_discovery_rate`, `false_negative`, `false_negative_rate`, `false_positive`, `false_positive_rate`, `l1`, `l2`, `mae`, `matthews_correlation`, `misclassification_rate`, `negative_predictive_value`, `positive_predictive_value`, `rms`, `rmsl`, `rmslp1`, `rmsp`, `true_negative`, `true_negative_rate`, `true_positive`, `true_positive_rate`, `BrierScore()`, `confusion_matrix`

`info(rms)` to list properties (aka traits) of the `rms` measure
Available after doing `using LossFunctions`:

`DWDMarginLoss()`, `ExpLoss()`, `L1HingeLoss()`, `L2HingeLoss()`, `L2MarginLoss()`, `LogitMarginLoss()`, `ModifiedHuberLoss()`, `PerceptronLoss()`, `SigmoidLoss()`, `SmoothedL1HingeLoss()`, `ZeroOneLoss()`, `HuberLoss()`, `L1EpsilonInsLoss()`, `L2EpsilonInsLoss()`, `LPDistLoss()`, `LogitDistLoss()`, `PeriodicLoss()`, `QuantileLoss()`

`using LossFunctions` to use more measures
`measures()` to get full list

`info(rms)` to list properties (aka traits) of the `rms` measure
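
Measures are callable on (prediction, ground-truth) pairs; a tiny sketch with hand-entered vectors:

```julia
using MLJ

y    = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 4.3]

rms(yhat, y)      # root mean squared error, a single number
mae(yhat, y)      # mean absolute error
l1(yhat, y)       # unaggregated per-observation absolute errors

info(mae)         # properties (aka traits) of the measure
```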


## Transformers

Built-ins include: `Standardizer`, `OneHotEncoder`, `UnivariateBoxCoxTransformer`, `FeatureSelector`, `UnivariateStandardizer`
Built-ins include: `Standardizer`, `OneHotEncoder`, `UnivariateBoxCoxTransformer`, `FeatureSelector`, `FillImputer`, `UnivariateDiscretizer`, `UnivariateStandardizer`, `ContinuousEncoder`

Externals include: `PCA` (in MultivariateStats), `KMeans`, `KMedoids` (in Clustering).

Full list: do `models(m -> !m[:is_supervised])`
`models(m -> !m.is_supervised)` to get full list
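
An unsupervised transformer is also wrapped in a machine; a brief sketch with `Standardizer`:

```julia
using MLJ

X = (x1 = [1.0, 2.0, 3.0, 4.0], x2 = [10.0, 20.0, 30.0, 40.0])

stand = Standardizer()
mach = machine(stand, X)       # no target for unsupervised models
fit!(mach)

transform(mach, X)             # columns rescaled to zero mean, unit standard deviation
```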


## Ensemble model wrapper
@@ -196,14 +245,15 @@

## Pipelines

With point predictions:
With deterministic (point) predictions:

`pipe = @pipeline MyPipe(hot=OneHotEncoder(), knn=KNNRegressor(K=3), target=UnivariateStandardizer())`

`pipe = @pipeline MyPipe(hot=OneHotEncoder(), knn=KNNRegressor(K=3), target=v->log.(v), inverse=v->exp.(v))`

With probabilistic predictions:

`pipe = @pipeline MyPipe(hot=OneHotEncoder(), knn=KNNRegressor(K=3), target=v->log.(V), inverse=v->exp.(v)) is_probabilistic=true`
`pipe = @pipeline MyPipe(hot=OneHotEncoder(), tree=DecisionTreeClassifier()) is_probabilistic=true`
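
A pipeline built this way is itself a model; a brief usage sketch (reuses `X`, `y`, `train`, `test` from the machine-construction sketch above, and assumes `@load DecisionTreeClassifier` has already been run):

```julia
pipe = @pipeline MyProbPipe(hot=OneHotEncoder(), tree=DecisionTreeClassifier()) is_probabilistic=true

mach = machine(pipe, X, y)     # used like any other model
fit!(mach, rows=train)
predict(mach, rows=test)       # probabilistic predictions
```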

Unsupervised:

4 changes: 0 additions & 4 deletions docs/src/model_search.md
@@ -54,10 +54,6 @@ conjunctively.

## Matching models to data

!!! note
The `matching` method described below is experimental and may
break in subsequent MLJ releases.

Common searches are streamlined with the help of the `matching`
command, defined as follows:

