big update of docs, including using discrete data WIP
ablaom committed Jun 5, 2020
1 parent 12601de commit f3bfa4d
Showing 3 changed files with 260 additions and 51 deletions.
107 changes: 63 additions & 44 deletions docs/src/adding_models_for_general_use.md
@@ -1,6 +1,13 @@

# Adding Models for General Use

!!! warning

    Models implementing the MLJ model interface according to the instructions
    given here should import MLJModelInterface version 0.3 or higher. This is
    enforced with a statement such as `MLJModelInterface = "^0.3"` under
    `[compat]` in the Project.toml file of the package containing the
    implementation.
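
For example (a sketch; substitute whatever the latest MLJModelInterface release is), the relevant Project.toml entry might read:

```toml
[compat]
MLJModelInterface = "^0.3"
```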

This guide outlines the specification of the MLJ model interface
and provides detailed guidelines for implementing the interface for
models intended for general use. See also the more condensed
@@ -408,7 +415,7 @@ ordering of these integers being consistent with that of the pool),
integers back into `CategoricalValue`/`CategoricalString` objects),
and `classes`, for extracting all the `CategoricalValue` or
`CategoricalString` objects sharing the pool of a particular
value. Refer to [Convenience methods](@ref) below for important
details.
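
As a rough, hypothetical sketch of how these methods fit together (the variable names here are illustrative only):

```julia
yint = MMI.int(y)      # integer codes, ordered consistently with the pool
d = MMI.decoder(y[1])  # callable that inverts `int`, built from a single element
d.(yint) == y          # true: decoding recovers the original categorical elements
```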

Note that a decoder created during `fit` may need to be bundled with
@@ -456,68 +463,77 @@ must be an `AbstractVector` whose elements are distributions (one distribution
per row of `Xnew`).

Presently, a *distribution* is any object `d` for which
`MMI.isdistribution(::d) = true`, which is the case for objects of
type `Distributions.Sampleable`.

Use the distribution `MMI.UnivariateFinite` for `Probabilistic` models
predicting a target with `Finite` scitype (classifiers). In this case
the eltype of the training target `y` will be a `CategoricalValue`.

For efficiency, one should not construct `UnivariateFinite`
instances one at a time. Rather, once a probability vector or matrix
is known, construct an instance of `UnivariateFiniteVector <:
AbstractArray{<:UnivariateFinite,1}` to return. Both `UnivariateFinite`
and `UnivariateFiniteVector` objects are constructed using the single
`UnivariateFinite` function.

For example, suppose the target `y` arrives as a subsample of some
`ybig` and is missing some classes:

```julia
using CategoricalArrays
ybig = categorical([:a, :b, :a, :a, :b, :a, :rare, :a, :b])
y = ybig[1:6]
```

Your fit method has bundled the first element of `y` with the
`fitresult` to make it available to `predict` for purposes of tracking
the complete pool of classes. Let's call this `an_element =
y[1]`. Then, supposing the corresponding probabilities of the observed
classes `[:a, :b]` are in an `n x 2` matrix `probs` (where `n` is the
number of rows of `Xnew`), you return

```julia
yhat = UnivariateFinite([:a, :b], probs, pool=an_element)
```

This object automatically assigns zero probability to the unseen class
`:rare` (i.e., `pdf.(yhat, :rare)` works and returns a zero
vector). If you would like to assign `:rare` non-zero probabilities,
simply add it to the first vector (the *support*) and supply a larger
`probs` matrix.
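
For example, a minimal sketch (the probability values are made up; each row must sum to one):

```julia
# n x 3 matrix: one column per class in the support [:a, :b, :rare]
probs3 = [0.7 0.2 0.1;
          0.5 0.4 0.1]
yhat = UnivariateFinite([:a, :b, :rare], probs3, pool=an_element)
```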

If instead of raw labels `[:a, :b]` you have the corresponding
`CategoricalElement`s (from, e.g., `filter(cv->cv in unique(y),
classes(y))`) then you can use these instead and drop the `pool`
specifier.
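
For example (a sketch, reusing `y` and `probs` from above):

```julia
# CategoricalValue versions of the observed classes :a and :b;
# these carry the full pool, so no `pool` specifier is required:
support = filter(cv -> cv in unique(y), classes(y))
yhat = UnivariateFinite(support, probs)
```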

In a binary classification problem it suffices to specify a single
vector of probabilities, provided you specify `augment=true`, as in
the following example, *and note carefully that these probabilities are
associated with the* **last** *(second) class you specify in the
constructor:*

```julia
y = categorical([:TRUE, :FALSE, :FALSE, :TRUE, :TRUE])
an_element = y[1]
probs = rand(10)
yhat = UnivariateFinite([:FALSE, :TRUE], probs, augment=true, pool=an_element)
```
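
To see that the single probability vector attaches to the *last* class (`:TRUE` here), one can check (a sketch, using broadcast `pdf` as above):

```julia
pdf.(yhat, :TRUE) ≈ probs        # the supplied probabilities
pdf.(yhat, :FALSE) ≈ 1 .- probs  # the first class gets the complement
```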

The constructor has a lot of options, including passing a dictionary
instead of vectors. See [`UnivariateFinite`](@ref) for details.
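
For instance, a single distribution might be specified with a dictionary, as in this sketch (probability values made up; `an_element` is the one from the `[:a, :b, :rare]` example above, supplying the pool):

```julia
d = UnivariateFinite(Dict(:a => 0.8, :b => 0.15, :rare => 0.05), pool=an_element)
pdf(d, :rare)  # 0.05
```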

See
[LinearBinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example of a Probabilistic classifier implementation.

*Important note on binary classifiers.* There is no "Binary" scitype
distinct from `Multiclass{2}` or `OrderedFactor{2}`; `Binary` is just
an alias for `Union{Multiclass{2},OrderedFactor{2}}`. The
`target_scitype` of a binary classifier will generally be
`AbstractVector{<:Binary}` and according to the *mlj* scitype
convention, elements of `y` have type `CategoricalValue`, and *not*
`Bool`. See
[BinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example.

@@ -558,8 +574,7 @@ MMI.input_scitype(::Type{<:DecisionTreeClassifier}) = Table(Union{Continuous,Mis
```

Similarly, to ensure the target is an AbstractVector whose elements
have `Finite` scitype (and hence `CategoricalValue` machine type) we declare

```julia
MMI.target_scitype(::Type{<:DecisionTreeClassifier}) = AbstractVector{<:Finite}
@@ -584,8 +599,7 @@ restricts to tables with continuous or binary (ordered or unordered)
columns.

For predicting variable length sequences of, say, binary values
(`CategoricalValue`s with some common size-two pool) we declare

```julia
target_scitype(SomeSupervisedModel) = AbstractVector{<:NTuple{<:Finite{2}}}
@@ -875,6 +889,11 @@ MLJModelInterface.selectrows
MLJModelInterface.selectcols
```

```@docs
UnivariateFinite
```



### Where to place code implementing new models

20 changes: 13 additions & 7 deletions docs/src/model_search.md
@@ -9,10 +9,10 @@ methods, as detailed below.

## Model metadata

*Terminology.* In this section the word "model" refers to the metadata
entry in the registry of an actual model `struct`, as appearing
elsewhere in the manual. One can obtain such an entry with the `info`
command:
*Terminology.* In this section the word "model" refers to a metadata
entry in the model registry, as opposed to an actual model `struct`
that such an entry represents. One can obtain such an entry with the
`info` command:

```@setup tokai
using MLJ
@@ -38,14 +38,20 @@ localmodels()
localmodels()[2]
```

One can search for models containing specified strings or regular expressions in their `docstring` attributes, as in

```@repl tokai
models("forest")
```

or by specifying a filter (`Bool`-valued function):

```@repl tokai
test(model) = model.is_supervised &&
filter(model) = model.is_supervised &&
model.input_scitype >: MLJ.Table(Continuous) &&
model.target_scitype >: AbstractVector{<:Multiclass{3}} &&
model.prediction_type == :deterministic
models(filter)
```

Multiple test arguments may be passed to `models`, which are applied
184 changes: 184 additions & 0 deletions docs/src/working_with_categorical_data.md
@@ -0,0 +1,184 @@
# Working with Categorical Data

## Scientific types for discrete data

Recall that models articulate their data requirements using scientific
types (see [Getting Started](@ref) or the MLJScientificTypes.jl
[documentation](https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/)). There
are three scientific types discrete data can have: `Count`,
`OrderedFactor` and `Multiclass`.


### Count data

In MLJ you cannot use integers to represent (finite) categorical
data. Integers are reserved for discrete data you want interpreted as
`Count <: Infinite`:

```@example hut
using MLJ # hide
scitype([1, 4, 5, 6])
```

The `Count` scientific type includes things like the number of phone
calls, city populations, and other "frequency" data of a generally
unbounded nature.

That said, you may have data that is theoretically `Count`, but which
you coerce to `OrderedFactor` to enable the use of more models,
trusting to your knowledge of how those models work to inform an
appropriate interpretation.
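
For example, a sketch of such a coercion:

```julia
v = [1, 4, 5, 6]              # scitype: AbstractVector{Count}
w = coerce(v, OrderedFactor)  # now an ordered CategoricalVector
scitype(w)                    # AbstractArray{OrderedFactor{4},1}
```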


### OrderedFactor and Multiclass data

Other integer data, such as the number of an animal's legs or the
number of rooms in a home, are generally coerced to `OrderedFactor <:
Finite`. The other categorical scientific type is `Multiclass <:
Finite`, which is for *unordered* categorical data. Coercing data to
one of these two forms is discussed under [Detecting and coercing
improperly represented categorical data](@ref) below.


### Binary data

There is no separate scientific type for binary data. Binary data is
`OrderedFactor{2}` if ordered, and `Multiclass{2}` otherwise. Data
with type `OrderedFactor{2}` is considered to have an intrinsic
"positive" class, e.g., the outcome of a medical test or the
"pass/fail" outcome of an exam. MLJ measures, such as `true_positive`,
assume the *second* class in the ordering is the "positive"
class. Inspecting and changing order is discussed below.

If data has type `Bool` it is considered `Count` data (as `Bool <:
Integer`) and generally users will want to coerce such data to a
binary (`Finite`) type.
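
For example (a sketch):

```julia
scitype([true, false, true])   # AbstractArray{Count,1}
v = coerce([true, false, true], OrderedFactor)
scitype(v)                     # AbstractArray{OrderedFactor{2},1}
levels(v)                      # [false, true]; `true` is the "positive" class
```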


## Detecting and coercing improperly represented categorical data

One inspects the scientific type of data using `scitype` as shown
above. To inspect all column scientific types in a table
simultaneously, use `schema`. (Tables also have a `scitype`, in which
this information appears in a condensed form more appropriate for type
dispatch.)

```@example hut
using DataFrames
X = DataFrame(
name = ["Siri", "Robo", "Alexa", "Cortana"],
gender = ["male", "male", "Female", "female"],
likes_soup = [true, false, false, true],
height = [152, missing, 148, 163],
rating = [2, 5, 2, 1],
outcome = ["rejected", "accepted", "accepted", "rejected"])
schema(X)
```

Coercing a single column:

```@example hut
X.outcome = coerce(X.outcome, OrderedFactor)
```

Inspecting the order of the levels:

```julia
levels(X.outcome)
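# expected output: ["accepted", "rejected"] (levels default to sorted order)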
```

Since we wish to regard "accepted" as the positive class, it should
appear second, which we correct with the `levels!` function:

```@example hut
levels!(X.outcome, ["rejected", "accepted"]);
```

Coercing all remaining types simultaneously:

```@example hut
Xnew = coerce(X, :gender => Multiclass,
              :likes_soup => OrderedFactor,
              :height => Continuous,
              :rating => OrderedFactor)
schema(Xnew)
```

(For `DataFrame`s there is also in-place coercion using `coerce!`.)


## Tracking all levels

The key property of vectors of scientific type `OrderedFactor` and
`Multiclass` is that the pool of all levels is not lost when
separating out one or more elements:

```@example hut
v = Xnew.rating
```

```@example hut
levels(v)
```

```@example hut
levels(v[1:2])
```

```@example hut
levels(v[2])
```

By tracking all classes in this way, MLJ avoids common pain points
around categorical data, such as a model crashing when evaluated on a
holdout set containing classes that were not seen during training.


## Under the hood: CategoricalValue and CategoricalArray

In MLJ the atomic objects with `OrderedFactor` or `Multiclass`
scientific type are `CategoricalValue`s, from the
[CategoricalArrays.jl](https://juliadata.github.io/CategoricalArrays.jl/stable/)
package. In some sense `CategoricalValue`s are an implementation
detail users can ignore for the most part, as shown above. However,
some users may want a basic understanding of these types, and those
implementing MLJ's model interface for new algorithms will need to
understand them, so we describe them informally now. For the complete
API, see the CategoricalArrays.jl
[documentation](https://juliadata.github.io/CategoricalArrays.jl/stable/).


To construct an `OrderedFactor` or `Multiclass` vector from raw
labels, one uses `categorical`:

```@example hut
using CategoricalArrays # hide
v = categorical([:A, :B, :A, :A, :C])
typeof(v)
```

```@example hut
scitype(v)
```

```@example hut
v = categorical([:A, :B, :A, :A, :C], ordered=true)
scitype(v)
```

When you index a `CategoricalVector` you don't get a raw label, but
instead an instance of `CategoricalValue`. As explained above, this
value knows the complete pool of levels of the vector from which it
came. Use `get(val)` to extract the raw label from a value `val`.
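
For example, a sketch continuing with the vector `v` defined above:

```julia
val = v[1]  # a CategoricalValue, not the raw label
get(val)    # :A
```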

Despite the distinction that exists between a value (element) and a
label, the two are the same from the point of view of `==` and `in`:

```julia
v[1] == :A # true
:A in v # true
```


## Probabilistic predictions of categorical data
