Merge e5edd4c into 1df34b1
ablaom committed Jun 11, 2020
2 parents 1df34b1 + e5edd4c commit a4a4e33
Showing 10 changed files with 406 additions and 118 deletions.
7 changes: 2 additions & 5 deletions README.md
@@ -24,6 +24,8 @@
</p>
</h2>

**New to MLJ? Start [here](https://alan-turing-institute.github.io/MLJ.jl/stable/)**.

MLJ (Machine Learning in Julia) is a toolbox written in Julia
providing a common interface and meta-algorithms for selecting,
tuning, evaluating, composing and comparing machine learning models written in Julia and other languages. MLJ is released
@@ -39,11 +41,6 @@ Institute](https://www.turing.ac.uk/).
</p>
</br>

**The starting point for the general MLJ user** is MLJ's
[documentation](https://alan-turing-institute.github.io/MLJ.jl/stable/). This
page contains information for developers of MLJ and related packages
in the Julia machine learning eco-system.


### The MLJ Universe

1 change: 1 addition & 0 deletions docs/make.jl
@@ -24,6 +24,7 @@ pages = [
"Introduction" => "index.md",
"Getting Started" => "getting_started.md",
"Common MLJ Workflows" => "common_mlj_workflows.md",
"Working with Categorical Data" => "working_with_categorical_data.md",
"Model Search" => "model_search.md",
"Machines" => "machines.md",
"Evaluating Model Performance" => "evaluating_model_performance.md",
103 changes: 59 additions & 44 deletions docs/src/adding_models_for_general_use.md
@@ -1,6 +1,9 @@

# Adding Models for General Use

!!! warning

    Models implementing the MLJ model interface according to the instructions given here should import MLJModelInterface version 0.3 or higher. This is enforced with a statement such as `MLJModelInterface = "^0.3"` under `[compat]` in the Project.toml file of the package containing the implementation.

This guide outlines the specification of the MLJ model interface
and provides detailed guidelines for implementing the interface for
models intended for general use. See also the more condensed
@@ -408,7 +411,7 @@ ordering of these integers being consistent with that of the pool),
integers back into `CategoricalValue`/`CategoricalString` objects),
and `classes`, for extracting all the `CategoricalValue` or
`CategoricalString` objects sharing the pool of a particular
value/string. Refer to [Convenience methods](@ref) below for important
value. Refer to [Convenience methods](@ref) below for important
details.
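
As a rough sketch of these convenience methods (assuming `MLJModelInterface` is imported as `MMI`, and using hypothetical data):

```julia
using MLJModelInterface, CategoricalArrays
const MMI = MLJModelInterface

v = categorical(["a", "b", "a", "c"])  # hypothetical categorical data
MMI.classes(v[1])      # all classes in the pool of v[1], as CategoricalValues
codes = MMI.int.(v)    # integer codes, ordered consistently with the pool
d = MMI.decoder(v[1])  # a callable decoder built from any element sharing the pool
d.(codes)              # recovers the original CategoricalValue elements
```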

Note that a decoder created during `fit` may need to be bundled with
@@ -456,68 +459,77 @@ must be an `AbstractVector` whose elements are distributions (one distribution
per row of `Xnew`).

Presently, a *distribution* is any object `d` for which
`MMI.isdistribution(::d) = true`, which is currently restricted to
objects subtyping `Distributions.Sampleable` from the package
Distributions.jl.
`MMI.isdistribution(::d) = true`, which is the case for objects of
type `Distributions.Sampleable`.

Use the distribution `MMI.UnivariateFinite` for `Probabilistic` models
predicting a target with `Finite` scitype (classifiers). In this case
the eltype of the training target `y` will be a `CategoricalValue`.

For efficiency, one should not construct `UnivariateFinite`
instances one at a time. Rather, once a probability vector or matrix
is known, construct an instance of `UnivariateFiniteVector <:
AbstractArray{<:UnivariateFinite,1}` to return. Both `UnivariateFinite`
and `UnivariateFiniteVector` objects are constructed using the single
`UnivariateFinite` function.

Use the distribution `MMI.UnivariateFinite` for `Probabilistic`
models predicting a target with `Finite` scitype (classifiers). In
this case each element of the training target `y` is a
`CategoricalValue` or `CategoricalString`, as in this contrived example:
For example, suppose the target `y` arrives as a subsample of some
`ybig` and is missing some classes:

```julia
using CategoricalArrays
y = Any[categorical([:yes, :no, :no, :maybe, :maybe])...]
ybig = categorical([:a, :b, :a, :a, :b, :a, :rare, :a, :b])
y = ybig[1:6]
```

Note that, as in this case, we cannot assume `y` is a
`CategoricalVector`, and we rely on elements for pool information (if
we need it); this is accessible using the convenience method
`MLJ.classes`:
Your fit method has bundled the first element of `y` with the
`fitresult` to make it available to `predict` for purposes of tracking
the complete pool of classes. Let's call this `an_element =
y[1]`. Then, supposing the corresponding probabilities of the observed
classes `[:a, :b]` are stored in an `n x 2` matrix `probs` (where `n` is
the number of rows of `Xnew`), you return

```julia
julia> yes = y[1]
julia> levels = MMI.classes(yes)
3-element Array{CategoricalValue{Symbol,UInt32},1}:
:maybe
:no
:yes
yhat = UnivariateFinite([:a, :b], probs, pool=an_element)
```

Now supposing that, for some new input pattern, the elements `yes =
y[1]` and `no = y[2]` are to be assigned respective probabilities of
0.2 and 0.8. Then the corresponding distribution `d` is constructed as
follows:
This object automatically assigns zero probability to the unseen class
`:rare` (i.e., `pdf.(yhat, :rare)` works and returns a zero
vector). If you would like to assign `:rare` non-zero probabilities,
simply add it to the first vector (the *support*) and supply a larger
`probs` matrix.
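
For instance, a minimal sketch reusing the `probs` matrix introduced above and reserving 10% probability for `:rare`:

```julia
# reserve 10% probability for the unseen class :rare (rows still sum to one)
probs3 = hcat(0.9 .* probs, fill(0.1, size(probs, 1)))  # an n x 3 matrix
yhat = UnivariateFinite([:a, :b, :rare], probs3, pool=an_element)
```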

```julia
julia> d = MMI.UnivariateFinite([yes, no], [0.2, 0.8])
UnivariateFinite(:yes=>0.2, :maybe=>0.0, :no=>0.8)
If instead of raw labels `[:a, :b]` you have the corresponding
`CategoricalElement`s (from, e.g., `filter(cv->cv in unique(y),
classes(y))`) then you can use these instead and drop the `pool`
specifier.

julia> pdf(d, yes)
0.2
In a binary classification problem it suffices to specify a single
vector of probabilities, provided you specify `augment=true`, as in
the following example, *and note carefully that these probabilities are
associated with the* **last** *(second) class you specify in the
constructor:*

julia> maybe = y[4]; pdf(d, maybe)
0.0
```julia
y = categorical([:TRUE, :FALSE, :FALSE, :TRUE, :TRUE])
an_element = y[1]
probs = rand(10)
yhat = UnivariateFinite([:FALSE, :TRUE], probs, augment=true, pool=an_element)
```
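
To sanity-check the convention in this sketch, the supplied probabilities should be recovered for the last class:

```julia
pdf.(yhat, :TRUE) ≈ probs         # probabilities attach to the *last* class
pdf.(yhat, :FALSE) ≈ 1 .- probs   # the first class gets the complement
```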

Alternatively, a dictionary can be passed to the constructor.
The constructor has a lot of options, including passing a dictionary
instead of vectors. See [`UnivariateFinite`](@ref) for details.
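
A hypothetical sketch of the dictionary form (the exact options are documented in [`UnivariateFinite`](@ref)):

```julia
d = UnivariateFinite(Dict(:a => 0.7, :b => 0.3), pool=an_element)
pdf(d, :a)  # probability of class :a, here 0.7
```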

See
[LinearBinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example of a Probabilistic classifier implementation.


```@docs
UnivariateFinite
```

*Important note on binary classifiers.* There is no "Binary" scitype
distinct from `Multiclass{2}` or `OrderedFactor{2}`; `Binary` is just
an alias for `Union{Multiclass{2},OrderedFactor{2}}`. The
`target_scitype` of a binary classifier will generally be
`AbstractVector{<:Binary}` and according to the *mlj* scitype
convention, elements of `y` have type `CategoricalValue` or
`CategoricalString`, and *not* `Bool`. See
convention, elements of `y` have type `CategoricalValue`, and *not*
`Bool`. See
[BinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl)
for an example.
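
So, for a hypothetical model `MyBinaryClassifier`, a sketch of the corresponding trait declaration reads:

```julia
MMI.target_scitype(::Type{<:MyBinaryClassifier}) = AbstractVector{<:Binary}
```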

@@ -558,8 +570,7 @@ MMI.input_scitype(::Type{<:DecisionTreeClassifier}) = Table(Union{Continuous,Mis
```

Similarly, to ensure the target is an AbstractVector whose elements
have `Finite` scitype (and hence `CategoricalValue` or
`CategoricalString` machine type) we declare
have `Finite` scitype (and hence `CategoricalValue` machine type) we declare

```julia
MMI.target_scitype(::Type{<:DecisionTreeClassifier}) = AbstractVector{<:Finite}
@@ -584,8 +595,7 @@ restricts to tables with continuous or binary (ordered or unordered)
columns.

For predicting variable length sequences of, say, binary values
(`CategoricalValue`s or `CategoricalString`s with some common size-two
pool) we declare
(`CategoricalValue`s with some common size-two pool) we declare

```julia
target_scitype(SomeSupervisedModel) = AbstractVector{<:NTuple{<:Finite{2}}}
@@ -875,6 +885,11 @@ MLJModelInterface.selectrows
MLJModelInterface.selectcols
```

```@docs
UnivariateFinite
```



### Where to place code implementing new models

37 changes: 24 additions & 13 deletions docs/src/getting_started.md
@@ -1,7 +1,7 @@
# Getting Started

For an outline of MLJ's **goals** and **features**, see the
[Introduction](@ref).

This section introduces the most basic MLJ operations and concepts. It
assumes MLJ has been successfully installed. See [Installation](@ref)
@@ -20,12 +20,12 @@ seed!(1234)

To load some demonstration data, add
[RDatasets](https://github.com/JuliaStats/RDatasets.jl) to your load
path and enter

```@repl doda
import RDatasets
iris = RDatasets.dataset("datasets", "iris"); # a DataFrame
```

and then split the data into input and target parts:

@@ -35,15 +35,16 @@ y, X = unpack(iris, ==(:Species), colname -> true);
first(X, 3) |> pretty
```

To list all models available in MLJ's [model
registry](model_search.md):
To list *all* models available in MLJ's [model
registry](model_search.md) do `models()`. Listing the models
compatible with the present data:

```@repl doda
models()
models(matching(X,y))
```

In MLJ a *model* is a struct storing the hyperparameters of the
learning algorithm indicated by the struct name.

Assuming the DecisionTree.jl package is in your load path, we can use
`@load` to load the code defining the `DecisionTreeClassifier` model
@@ -150,6 +151,7 @@ package can be applied to such distributions:

```@repl doda
broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
broadcast(pdf, yhat, y[test])[3:5] # predicted probability of observed class
mode.(yhat[3:5])
```

@@ -160,6 +162,15 @@ Or, one can explicitly get modes by using `predict_mode` instead of
predict_mode(tree, rows=test[3:5])
```

*(MLJ v0.2.7 and higher)* Finally, we note that `pdf()` is
overloaded to allow the retrieval of probabilities for all levels at
once:

```@repl doda
L = levels(y)
pdf(yhat[3:5], L)
```

Unsupervised models have a `transform` method instead of `predict`,
and may optionally implement an `inverse_transform` method:

@@ -333,11 +344,11 @@ are the key features of that convention:

- `String`s and `Char`s are *not* interpreted as `Multiclass` or
`OrderedFactor` (they have scitypes `Textual` and `Unknown`
respectively).

- In particular, *integers* (including `Bool`s) *cannot be used to
represent categorical data.* Use the preceding `coerce` operations
to coerce to a `Finite` scitype.
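
For example, a rough sketch of such a coercion (assuming `MLJ`, which re-exports `coerce` and `scitype`, is loaded):

```julia
using MLJ
v = [1, 2, 2, 3, 1]           # integers masquerading as categorical data
w = coerce(v, OrderedFactor)  # elements become ordered CategoricalValues
scitype(w)                    # AbstractArray{OrderedFactor{3},1}
```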

Use `coerce(v, OrderedFactor)` or `coerce(v, Multiclass)` to coerce a
vector `v` of integers, strings or characters to a vector with an
