Merge e5edd4c into 1df34b1
ablaom committed Jun 11, 2020
2 parents 1df34b1 + e5edd4c commit a4a4e33
Showing 10 changed files with 406 additions and 118 deletions.
7 changes: 2 additions & 5 deletions README.md
@@ -24,6 +24,8 @@
</p>
</h2>

**New to MLJ? Start [here](https://alan-turing-institute.github.io/MLJ.jl/stable/)**.

MLJ (Machine Learning in Julia) is a toolbox written in Julia
providing a common interface and meta-algorithms for selecting,
tuning, evaluating, composing and comparing machine learning models written in Julia and other languages. MLJ is released
@@ -39,11 +41,6 @@ Institute](https://www.turing.ac.uk/).
</p>
</br>

**The starting point for the general MLJ user** is MLJ's
[documentation](https://alan-turing-institute.github.io/MLJ.jl/stable/). This
page contains information for developers of MLJ and related packages
in the Julia machine learning eco-system.


### The MLJ Universe

1 change: 1 addition & 0 deletions docs/make.jl
@@ -24,6 +24,7 @@ pages = [
"Introduction" => "index.md",
"Getting Started" => "getting_started.md",
"Common MLJ Workflows" => "common_mlj_workflows.md",
"Working with Categorical Data" => "working_with_categorical_data.md",
"Model Search" => "model_search.md",
"Machines" => "machines.md",
"Evaluating Model Performance" => "evaluating_model_performance.md",
103 changes: 59 additions & 44 deletions docs/src/adding_models_for_general_use.md
@@ -1,6 +1,9 @@

# Adding Models for General Use

!!! warning

    Models implementing the MLJ model interface according to the instructions given here should import MLJModelInterface version 0.3 or higher. This is enforced with a statement such as `MLJModelInterface = "^0.3"` under `[compat]` in the Project.toml file of the package containing the implementation.

This guide outlines the specification of the MLJ model interface
and provides detailed guidelines for implementing the interface for
models intended for general use. See also the more condensed
@@ -408,7 +411,7 @@ ordering of these integers being consistent with that of the pool),
integers back into `CategoricalValue`/`CategoricalString` objects),
and `classes`, for extracting all the `CategoricalValue` or
`CategoricalString` objects sharing the pool of a particular
value/string. Refer to [Convenience methods](@ref) below for important
value. Refer to [Convenience methods](@ref) below for important
details.
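
As a rough sketch of these convenience methods (assuming `MLJModelInterface` is imported as `MMI`, and using hypothetical data):

```julia
using MLJModelInterface, CategoricalArrays
const MMI = MLJModelInterface

v = categorical(["a", "b", "a", "c"])  # hypothetical categorical data
MMI.classes(v[1])      # all classes in the pool of v[1], as CategoricalValues
codes = MMI.int.(v)    # integer codes, ordered consistently with the pool
d = MMI.decoder(v[1])  # a callable decoder built from any element sharing the pool
d.(codes)              # recovers the original CategoricalValue elements
```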

Note that a decoder created during `fit` may need to be bundled with
@@ -456,68 +459,77 @@ must be an `AbstractVector` whose elements are distributions (one distribution
per row of `Xnew`).

Presently, a *distribution* is any object `d` for which
`MMI.isdistribution(::d) = true`, which is currently restricted to
objects subtyping `Distributions.Sampleable` from the package
Distributions.jl.
`MMI.isdistribution(::d) = true`, which is the case for objects of
type `Distributions.Sampleable`.

Use the distribution `MMI.UnivariateFinite` for `Probabilistic` models
predicting a target with `Finite` scitype (classifiers). In this case
the eltype of the training target `y` will be a `CategoricalValue`.

For efficiency, one should not construct `UnivariateFinite`
instances one at a time. Rather, once a probability vector or matrix
is known, construct an instance of `UnivariateFiniteVector <:
AbstractArray{<:UnivariateFinite,1}` to return. Both `UnivariateFinite`
and `UnivariateFiniteVector` objects are constructed using the single
`UnivariateFinite` function.

Use the distribution `MMI.UnivariateFinite` for `Probabilistic`
models predicting a target with `Finite` scitype (classifiers). In
this case each element of the training target `y` is a
`CategoricalValue` or `CategoricalString`, as in this contrived example:
For example, suppose the target `y` arrives as a subsample of some
`ybig` and is missing some classes:

```julia
using CategoricalArrays
y = Any[categorical([:yes, :no, :no, :maybe, :maybe])...]
ybig = categorical([:a, :b, :a, :a, :b, :a, :rare, :a, :b])
y = ybig[1:6]
```

Note that, as in this case, we cannot assume `y` is a
`CategoricalVector`, and we rely on elements for pool information (if
we need it); this is accessible using the convenience method
`MLJ.classes`:
Your fit method has bundled the first element of `y` with the
`fitresult` to make it available to `predict` for purposes of tracking
the complete pool of classes. Let's call this `an_element =
y[1]`. Then, supposing the corresponding probabilities of the observed
classes `[:a, :b]` are stored in an `n x 2` matrix `probs` (where `n` is
the number of rows of `Xnew`), you return

```julia
julia> yes = y[1]
julia> levels = MMI.classes(yes)
3-element Array{CategoricalValue{Symbol,UInt32},1}:
:maybe
:no
:yes
yhat = UnivariateFinite([:a, :b], probs, pool=an_element)
```

Now supposing that, for some new input pattern, the elements `yes =
y[1]` and `no = y[2]` are to be assigned respective probabilities of
0.2 and 0.8. Then the corresponding distribution `d` is constructed as
follows:
This object automatically assigns zero probability to the unseen class
`:rare` (i.e., `pdf.(yhat, :rare)` works and returns a zero
vector). If you would like to assign `:rare` non-zero probabilities,
simply add it to the first vector (the *support*) and supply a larger
`probs` matrix.
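
For instance, a minimal sketch reusing the `probs` matrix introduced above and reserving 10% probability for `:rare`:

```julia
# reserve 10% probability for the unseen class :rare (rows still sum to one)
probs3 = hcat(0.9 .* probs, fill(0.1, size(probs, 1)))  # an n x 3 matrix
yhat = UnivariateFinite([:a, :b, :rare], probs3, pool=an_element)
```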

```julia
julia> d = MMI.UnivariateFinite([yes, no], [0.2, 0.8])
UnivariateFinite(:yes=>0.2, :maybe=>0.0, :no=>0.8)
If instead of raw labels `[:a, :b]` you have the corresponding
`CategoricalElement`s (from, e.g., `filter(cv->cv in unique(y),
classes(y))`) then you can use these instead and drop the `pool`
specifier.

julia> pdf(d, yes)
0.2
In a binary classification problem it suffices to specify a single
vector of probabilities, provided you specify `augment=true`, as in
the following example, *and note carefully that these probabilities are
associated with the* **last** *(second) class you specify in the
constructor:*

julia> maybe = y[4]; pdf(d, maybe)
0.0
```julia
y = categorical([:TRUE, :FALSE, :FALSE, :TRUE, :TRUE])
an_element = y[1]
probs = rand(10)
yhat = UnivariateFinite([:FALSE, :TRUE], probs, augment=true, pool=an_element)
```
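
To sanity-check the convention in this sketch, the supplied probabilities should be recovered for the last class:

```julia
pdf.(yhat, :TRUE) ≈ probs         # probabilities attach to the *last* class
pdf.(yhat, :FALSE) ≈ 1 .- probs   # the first class gets the complement
```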

Alternatively, a dictionary can be passed to the constructor.
The constructor has a lot of options, including passing a dictionary
instead of vectors. See [`UnivariateFinite`](@ref) for details.
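
A hypothetical sketch of the dictionary form (the exact options are documented in [`UnivariateFinite`](@ref)):

```julia
d = UnivariateFinite(Dict(:a => 0.7, :b => 0.3), pool=an_element)
pdf(d, :a)  # probability of class :a, here 0.7
```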

See
[LinearBinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example of a Probabilistic classifier implementation.


```@docs
UnivariateFinite
```

*Important note on binary classifiers.* There is no "Binary" scitype
distinct from `Multiclass{2}` or `OrderedFactor{2}`; `Binary` is just
an alias for `Union{Multiclass{2},OrderedFactor{2}}`. The
`target_scitype` of a binary classifier will generally be
`AbstractVector{<:Binary}` and according to the *mlj* scitype
convention, elements of `y` have type `CategoricalValue` or
`CategoricalString`, and *not* `Bool`. See
convention, elements of `y` have type `CategoricalValue`, and *not*
`Bool`. See
[BinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example.
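
So, for a hypothetical model `MyBinaryClassifier`, a sketch of the corresponding trait declaration reads:

```julia
MMI.target_scitype(::Type{<:MyBinaryClassifier}) = AbstractVector{<:Binary}
```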

@@ -558,8 +570,7 @@ MMI.input_scitype(::Type{<:DecisionTreeClassifier}) = Table(Union{Continuous,Mis
```

Similarly, to ensure the target is an AbstractVector whose elements
have `Finite` scitype (and hence `CategoricalValue` or
`CategoricalString` machine type) we declare
have `Finite` scitype (and hence `CategoricalValue` machine type) we declare

```julia
MMI.target_scitype(::Type{<:DecisionTreeClassifier}) = AbstractVector{<:Finite}
@@ -584,8 +595,7 @@ restricts to tables with continuous or binary (ordered or unordered)
columns.

For predicting variable length sequences of, say, binary values
(`CategoricalValue`s or `CategoricalString`s with some common size-two
pool) we declare
(`CategoricalValue`s with some common size-two pool) we declare

```julia
target_scitype(SomeSupervisedModel) = AbstractVector{<:NTuple{<:Finite{2}}}
@@ -875,6 +885,11 @@ MLJModelInterface.selectrows
MLJModelInterface.selectcols
```

```@docs
UnivariateFinite
```



### Where to place code implementing new models

37 changes: 24 additions & 13 deletions docs/src/getting_started.md
@@ -1,7 +1,7 @@
# Getting Started

For an outline of MLJ's **goals** and **features**, see the
[Introduction](@ref).

This section introduces the most basic MLJ operations and concepts. It
assumes MLJ has been successfully installed. See [Installation](@ref)
@@ -20,12 +20,12 @@ seed!(1234)

To load some demonstration data, add
[RDatasets](https://github.com/JuliaStats/RDatasets.jl) to your load
path and enter

```@repl doda
import RDatasets
iris = RDatasets.dataset("datasets", "iris"); # a DataFrame
```

and then split the data into input and target parts:

@@ -35,15 +35,16 @@ y, X = unpack(iris, ==(:Species), colname -> true);
first(X, 3) |> pretty
```

To list all models available in MLJ's [model
registry](model_search.md):
To list *all* models available in MLJ's [model
registry](model_search.md) do `models()`. Listing the models
compatible with the present data:

```@repl doda
models()
models(matching(X,y))
```

In MLJ a *model* is a struct storing the hyperparameters of the
learning algorithm indicated by the struct name.

Assuming the DecisionTree.jl package is in your load path, we can use
`@load` to load the code defining the `DecisionTreeClassifier` model
@@ -150,6 +151,7 @@ package can be applied to such distributions:

```@repl doda
broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
broadcast(pdf, yhat, y[test])[3:5] # predicted probability of observed class
mode.(yhat[3:5])
```

@@ -160,6 +162,15 @@ Or, one can explicitly get modes by using `predict_mode` instead of
predict_mode(tree, rows=test[3:5])
```

*(MLJ v0.2.7 and higher)* Finally, we note that `pdf()` is
overloaded to allow the retrieval of probabilities for all levels at
once:

```@repl doda
L = levels(y)
pdf(yhat[3:5], L)
```

Unsupervised models have a `transform` method instead of `predict`,
and may optionally implement an `inverse_transform` method:

@@ -333,11 +344,11 @@ are the key features of that convention:

- `String`s and `Char`s are *not* interpreted as `Multiclass` or
`OrderedFactor` (they have scitypes `Textual` and `Unknown`
respectively).

- In particular, *integers* (including `Bool`s) *cannot be used to
represent categorical data.* Use the preceding `coerce` operations
to coerce to a `Finite` scitype.
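
For example, a rough sketch of such a coercion (assuming `MLJ`, which re-exports `coerce` and `scitype`, is loaded):

```julia
using MLJ
v = [1, 2, 2, 3, 1]           # integers masquerading as categorical data
w = coerce(v, OrderedFactor)  # elements become ordered CategoricalValues
scitype(w)                    # AbstractArray{OrderedFactor{3},1}
```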

Use `coerce(v, OrderedFactor)` or `coerce(v, Multiclass)` to coerce a
vector `v` of integers, strings or characters to a vector with an
