Proposal for metadata #22

Closed · ablaom opened this issue Nov 22, 2018 · 19 comments
Labels: design discussion

@ablaom (Member) commented Nov 22, 2018

I'm inviting feedback on a suggestion for encoding metadata.

We would like to associate certain metadata with models (most of these being defined in external packages). The main purpose of the metadata is so we can mimic the R task interface, which allows a user to match task specifications (e.g., I want a classifier that handles nominal features) to a list of qualifying models.

I expect a local registry will store the model metadata, with a macro call updating the registry each time a model is defined (which means when the user imports the relevant external package, in the case of lazily loaded interfaces).
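
For concreteness, here is a rough sketch of the kind of registry and macro I have in mind; all names here are hypothetical, not a final design:

# a module-level registry, mapping model names to their metadata:
const MODEL_REGISTRY = Dict{Symbol,Dict{Symbol,Any}}()

# to be called when a model type is defined (or when a lazily loaded
# interface is imported):
macro register_model(M, metadata)
    quote
        MODEL_REGISTRY[nameof($(esc(M)))] = $(esc(metadata))
    end
end

# usage sketch:
# @register_model DecisionTreeClassifier Dict(
#     :properties => [MultiClass(), Numeric()],
#     :operations => [predict])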

We suggest metadata consist of:

  • list of model properties (see below for a list)
  • list of supported operations (predict, predict_proba, inverse_transform, etc)
  • the allowed data types for inputs (and target)

Note that, at present, the only subtypes of Model (our abstract type for the hyperparameter containers) are Supervised and Unsupervised; so Regression, Classification and MultiClass are just properties.

In the core code we do something like this:

abstract type Property end   # subtypes are the allowable model properties

""" Models with this property perform regression """
struct Regression <: Property end    
""" Models with this property perform binary classification """
struct Classification <: Property end
""" Models with this property perform binary and multiclass classification """
struct MultiClass <: Property end
""" Models with this property support nominal (categorical) features """
struct Nominal <: Property end
""" Models with this property support features of numeric type (continuous or ordered factor) """
struct Numeric <: Property end
""" Classfication models with this property allow weighting of the target classes """
struct Weights <: Property end
""" Models with this property support features with missing values """ 
struct NAs <: Property end

And model declarations look something like this:

mutable struct DecisionTreeClassifier{T} <: Supervised{DecisionTreeClassifierFitResultType{T}} 
    pruning_purity::Float64 
    max_depth::Int
    min_samples_leaf::Int
    min_samples_split::Int
    min_purity_increase::Float64
    n_subfeatures::Float64
    display_depth::Int
    post_prune::Bool
    merge_purity_threshold::Float64
end

# metadata:
properties(::Type{DecisionTreeClassifier}) = [MultiClass(), Numeric()]
operations(::Type{DecisionTreeClassifier}) = [predict]
type_of_X(::Type{DecisionTreeClassifier}) = Array{Float64,2}
type_of_y(::Type{DecisionTreeClassifier}) = Vector
@darenasc (Collaborator):

Looks good to me.

An additional property for a model could be Ranking.

Also, default values for parameters could be included in a constructor method when the model is created.

@fkiraly (Collaborator) commented Nov 22, 2018

Hm, I'd propose a design which is a bit more structured than mlr's "let's dump all metadata as flags".

For example, metadata indicating a supervised learner (regressor or classifier, single class or multi-class, probabilistic or not) could be defined through the input and output types it supports.

E.g., given through a type theoretic arrow expression
[continuous] {or} [factor] {mapsto} [univariate][factor]
would define a deterministic (univariate) classifier,
[continuous] {or} [NAble][factor] {mapsto} [multivariate][continuous]
would define a multi-target regressor that can deal with NAs in factor variables but not in continuous variables, and
[continuous] {mapsto} [distributionof][univariate][factor]
would define a probabilistic classifier (i.e., one that has predict_proba in scikit-learn terminology) that can take only numerical inputs.

For type theory aficionados, the estimator would have a first-order type "predictor" with argument type given by the formula; {mapsto} is the type arrow, and [...] are types, some of which are first-order types in prefix notation.

Practically, the "holy grail" solution would have this as a model specification language with a rudimentary compiler that would be able to construct standard models from primitives via first-order operations (via term resolution calculus or similar).
The "basic" solution could have a semantic hierarchy that has a bucket for input types and output types, where the output type bucket has flags attached that say "probabilistic" or "multivariate", and the input bucket has flags attached such as "NA" etc.

@fkiraly (Collaborator) commented Nov 22, 2018

Regarding the "holy grail" solution, here is why it would be useful:

Say the user specifies a problem that needs to cope with NAs and specific input types, e.g., [strangetype]. Unfortunately, none of the off-the-shelf strategies can deal with [strangetype] and NAs in the other variables simultaneously.

However, the resolution calculus (e.g., a clippy-style assistant for MLJ) can tell us that the registry has an imputer algorithm for the non-strange NAs, which can be connected with the predictor algorithm to form a composite/network algorithm that can now deal with both the NAs and the [strangetype] - problem solved.

@ablaom (Member, Author) commented Nov 23, 2018

@darenasc The default values are encoded in the keyword constructor (omitted above):

function DecisionTreeClassifier(
    ; target_type=nothing 
    , pruning_purity=1.0
    , max_depth=-1
    , min_samples_leaf=1
    , min_samples_split=2
    , min_purity_increase=0.0
    , n_subfeatures=0
    , display_depth=5
    , post_prune=false
    , merge_purity_threshold=0.9)

    target_type === nothing && error("You must specify target_type=...")

    model = DecisionTreeClassifier{target_type}(
        pruning_purity
        , max_depth
        , min_samples_leaf
        , min_samples_split
        , min_purity_increase
        , n_subfeatures
        , display_depth
        , post_prune
        , merge_purity_threshold)

     <omitted tests for invariants>

     return model
end

There is a small issue in this case, though. I would like to be able to call DecisionTreeClassifier() to get a default object (from which I can get a dictionary of defaults), but I have to specify the type parameter as an argument. Probably best to split this into two: specify a dictionary for the parameters with defaults and use this in the keyword constructor.


Edit: Not an issue. Add the target type to the hyperparameters, and then every parameter has a default value.

@ablaom (Member, Author) commented Nov 23, 2018

@fkiraly Organising the information around supported input and output types makes sense. However, I'm not convinced representing the metadata using a compiled specification language is worth the trouble (and extra abstraction for the user) if we can encode exactly the same information in some (binned) flags.

@fkiraly (Collaborator) commented Nov 23, 2018

@ablaom A meta-spec language only makes sense in a larger ecosystem where enough non-standard cases are supported. For tabular data, I'd agree there's no immediate need, but it would be an interesting design/research study if we had the resources.

Though if we were making a meta-spec language, I'd think it would/should look exactly like that.

My main point is that one would at least like some kind of hierarchy that makes conversion to, or interfacing with, the simplest "arrow"-slash-"mapsto" type easy, i.e., without having to write a stack of glue code encoding the "secret" knowledge of what all the flags are meant to mean.

@ablaom (Member, Author) commented Nov 29, 2018

After some discussion, a slightly more structured form of the metadata has been proposed. Here is an example:

MLJ.properties(::Type{SomeMulticlassClassifier}) = (CanRankFeatures(),)
MLJ.operations(::Type{SomeMulticlassClassifier}) = (predict, predict_proba)
MLJ.inputs_can_be(::Type{SomeMulticlassClassifier}) = (Numeric(), Nominal(), NA())
MLJ.outputs_are(::Type{SomeMulticlassClassifier}) = (Nominal(),)

Available options can be gleaned from this code extract:

# `properties(SomeModelType)` is a tuple of instances of:
""" Classification models with this property allow weighting of the target classes """
struct CanWeightTarget <: Property end
""" Models with this property can provide feature rankings or importance scores """
struct CanRankFeatures <: Property end

# `inputs_can_be(SomeModelType)` and `outputs_are(SomeModelType)` are tuples of
# instances of:
struct Nominal <: Property end
struct Numeric <: Property end
struct NA <: Property end

# additionally, `outputs_are(SomeModelType)` can include:
struct Probabilistic <: Property end
struct Multivariate <: Property end

# for `Model`s with nominal targets (classifiers)
# `outputs_are(SomeModelType)` could also include:
struct Multiclass <: Property end # can handle more than two classes
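
A task-matching query against such traits might then look something like this; localmodels stands for some hypothetical collection of registered model types:

# find all registered model types accepting nominal features with NAs,
# and supporting probabilistic prediction:
filter(localmodels) do M
    Nominal() in MLJ.inputs_can_be(M) &&
        NA() in MLJ.inputs_can_be(M) &&
        predict_proba in MLJ.operations(M)
end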

@ablaom (Member, Author) commented Nov 29, 2018

What suggestions are there for storing, updating and accessing the metadata? The constraints are: (1) external packages are loaded on a user-needs-it basis (lazy loading); and (2) the user wants access to the metadata before loading any packages.

@ablaom (Member, Author) commented Dec 3, 2018

#29 (comment)

@ysimillides added the design discussion label Dec 11, 2018
@ablaom (Member, Author) commented Jan 10, 2019

We are still looking for proposals on how to store metadata of supervised learning models (generally defined in external packages) that implement the MLJ model interface (now defined in MLJBase). Currently metadata is everything defined by the dictionary returned by info(model) or info(SomeSupervisedModelType), for example:

julia> using MLJ
julia> using MultivariateStats
julia> info(RidgeRegressor)
Dict{Symbol,Union{Array{Symbol,1}, String, Symbol}} with 5 entries:
  :is_pure_julia => :yes
  :package_uuid => "6f286f6a-111f-5878-ab1e-185364afe411"
  :package_name => "MultivariateStats"
  :target_is => Symbol[:deterministic, :numeric, :univariate]
  :inputs_can_be => Symbol[:numeric]

(But check adding_new_models.md for the complete spec.)

@fkiraly (Collaborator) commented Jan 12, 2019

Seems very sensible to me for a start - it should be easy to change the number, structure, and type of the dictionary entries in case we come up with a more sophisticated structure.

What I wanted to flag up is that eventually one may like to have a "model registry", that is, a singleton entity which can be queried to retrieve, or suggest, modelling strategies fitting a certain task. E.g., "find me all probabilistic supervised regression models that can deal with NAs in numerical variables and which are not Bayesian MCMC based, unless they are from packages I wrote".

For this kind of query, there would need to be some indexing of primitive strategies which works the other way: i.e., not getting the properties of strategies, but obtaining strategies with certain properties (see the sketch after the list below).

Hence, at least two additional things might make sense:

  • requiring a unique identifier for each primitive strategy
  • some way that later (i.e., not now, since this is not a priority) easily allows look-up. Maybe calling a later-written "register(RidgeRegressor)" by default when loading modules is all that's needed, though, so maybe this second point can be ignored for now.
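
To illustrate the look-up direction (STRATEGY_INDEX, the metadata keys, and find_strategies are all hypothetical):

# an index from unique strategy identifiers to their metadata:
const STRATEGY_INDEX = Dict{Symbol,Dict{Symbol,Any}}()

# obtain strategies with certain properties, rather than properties of a strategy:
find_strategies(query) = [id for (id, meta) in STRATEGY_INDEX if query(meta)]

# e.g., all strategies whose inputs can be numeric:
# find_strategies(meta -> :numeric in meta[:inputs_can_be])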

@fkiraly (Collaborator) commented Jan 12, 2019

Regarding lazy loading with queryable content: that would be solved by a registry - and while that's a very "object oriented" solution, I cannot see how it is solved by a functional design, simply because dispatching requires a struct that's already there.

The registry could call "register" methods from some default packages or interfaces which load model information from other packages (e.g., from a "registry" part) without loading the full package. Further, non-default packages could be asked to provide a "register" file, sub-module, or interface, which could be loaded into the main registry, which would index the contents.
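
Sketching that idea, and building on the hypothetical STRATEGY_INDEX above:

# called from a package's lightweight "registry" file or sub-module, so that
# model information is indexed without loading the full package:
register(id::Symbol, metadata::Dict{Symbol,Any}) = (STRATEGY_INDEX[id] = metadata)

# e.g.:
# register(:RidgeRegressor, Dict{Symbol,Any}(
#     :package_name => "MultivariateStats",
#     :inputs_can_be => [:numeric]))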

@ablaom (Member, Author) commented Jan 14, 2019

@fkiraly Could you clarify for me exactly what you mean by a "primitive" strategy?

@fkiraly (Collaborator) commented Jan 14, 2019

@ablaom You call it atomic - your terminology is perhaps better, since "atomic" means "indivisible" in Greek, which is literally what it is meant to mean here. "Primitive" means "the simplest" in Latin and is slightly more vague, since simplicity is subjective, while compositionality (from atoms) is not.

@nalimilan (Contributor):

Just in case you're not aware of it, StatsBase defines a set of functions for packages to override, which includes predict. There's also a discussion regarding transform and its inverse (which is currently called reconstruct in MultivariateStats). You may want to chime in there.

@ablaom (Member, Author) commented Feb 21, 2019

I have been thinking more seriously about the task interface, and I am revisiting the way we currently organise models and metadata. I am inviting comment before proceeding with some revisions. Unless objections are raised, I will proceed early next week with implementation.

Here's more-or-less my proposed plan:

  1. The existing model type hierarchy remains the same. In particular, we split Supervised into Probabilistic and Deterministic.

  2. For supervised learners, the trait functions input_quantity and output_quantity (taking values :univariate, :multivariate) stay the same (maybe rename output_quantity to target_quantity).

  3. With a more formalised notion of "scientific type" (see Conventions about the representations of scalar data #86 and, in particular, this draft document), we replace the input_kinds trait with an input_scitypes_upperbound trait and, similarly, replace output_kind with target_scitype_upperbound. The return value of input_scitypes_upperbound is any subtype of Union{Missing, Found} and that of target_scitype_upperbound is any subtype of Found. Some details on their planned function:

    • Corresponding to input_scitypes_upperbound is the SupervisedTask field input_scitypes, whose value, inferred from the task data, is the union over all x in the input data (rows and columns) of scitype(x). For example: Union{Missing, Multiclass{3}, Continuous}.

    • Corresponding to target_scitype_upperbound is the SupervisedTask field target_scitype whose value, inferred from the task data, is a tuple of scientific types (subtypes of Found), one for each target column, indicating the union, over all x in the column, of scitype(x). For example: Tuple{OrderedFactorInfinite, Multiclass{7}}.

    • A model M is deemed to match the task T if input_scitypes(T) <: input_scitypes_upperbound(M) and target_scitype(T) <: target_scitype_upperbound(M), and if all the other (simpler) traits match the corresponding values inferred from the task data (a rough sketch of this predicate follows).
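
A rough sketch of the matching predicate, assuming a SupervisedTask type carrying input_scitypes and target_scitype fields as described:

# does model type M match task T on the scitype traits?
function matches(M::Type{<:Supervised}, T::SupervisedTask)
    # the remaining, simpler traits would be checked against the task similarly
    return T.input_scitypes <: input_scitypes_upperbound(M) &&
           T.target_scitype <: target_scitype_upperbound(M)
end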

@fkiraly (Collaborator) commented Feb 21, 2019

Short comments:

  • I assume these are metadata flags which are automatically set (e.g., in an infer_traits method) when a task is created on data?
  • Also, I wouldn't call it "upperbound" - I understand what you want to say (in terms of set containedness for the set of types to necessarily support, right?), but it might sound confusing, as the primary association is an upper bound on the reals.

@ablaom (Member, Author) commented Feb 21, 2019

Short comments:

  • I assume these are metadata flags which are automatically set (e.g., in an infer_traits method) when a task is created on data?

Yes, inferred from the data, not user-defined.

  • Also, I wouldn't call it "upperbound" - I understand what you want to say (in terms of set containedness for the set of types to necessarily support, right?), but it might sound confusing, as the primary association is an upper bound on the reals.

Yes, I mean an upper bound in the sense of the partial order on the set of types by inclusion. Is there a domain-specific alternative name for the notion of upper bound?

@fkiraly (Collaborator) commented Feb 21, 2019

Yes, I mean an upper bound in the sense of the partial order on the set of types by inclusion. Is there a domain-specific alternative name for the notion of upper bound?

My point is just that the common user won't know what an upper bound is in a general order theoretic context, or specifically in the context of the partial order of sets with inclusion.

Something along the lines of "requiressupportforscitypes" or "containsthefollowingscitypes" or "necessaryinputscitypes" but more concise might be better.

Just following the design principle of giving self-evident or descriptive names to user-facing functionality.
The rookie mistake is calling things "a" or "b" or "variable1"; the expert mistake (which I'm most definitely not innocent of, but am always happy to point out in others) is giving things eclectically obscure names that only other experts understand...
