Proposal for metadata #22

Closed · ablaom opened this issue Nov 22, 2018 · 19 comments
Labels: design discussion

@ablaom (Member) commented Nov 22, 2018

I'm inviting feedback on a suggestion for encoding metadata.

We would like to associate certain metadata with models (most of these being defined in external packages). The main purpose of the metadata is so we can mimic the R task interface, which allows a user to match task specifications (e.g., I want a classifier that handles nominal features) to a list of qualifying models.

I expect a local registry will store the model metadata, with a macro call updating the registry each time a model is defined (which means when the user imports the relevant external package, in the case of lazily loaded interfaces).
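
For concreteness, here is a rough sketch of the kind of registry and macro I have in mind; all names here are hypothetical, not a final design:

# a module-level registry, mapping model names to their metadata:
const MODEL_REGISTRY = Dict{Symbol,Dict{Symbol,Any}}()

# to be called when a model type is defined (or when a lazily loaded
# interface is imported):
macro register_model(M, metadata)
    quote
        MODEL_REGISTRY[nameof($(esc(M)))] = $(esc(metadata))
    end
end

# usage sketch:
# @register_model DecisionTreeClassifier Dict(
#     :properties => [MultiClass(), Numeric()],
#     :operations => [predict])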

We suggest metadata consist of:

  • list of model properties (see below for a list)
  • list of supported operations (predict, predict_proba, inverse_transform, etc)
  • the allowed data types for inputs (and target)

Note that, at present, the only subtypes of Model (our abstract type for the hyperparameter containers) are Supervised and Unsupervised; so Regression, Classification and MultiClass are just properties.

In the core code we do something like this:

abstract type Property end   # subtypes are the allowable model properties

""" Models with this property perform regression """
struct Regression <: Property end    
""" Models with this property perform binary classification """
struct Classification <: Property end
""" Models with this property perform binary and multiclass classification """
struct MultiClass <: Property end
""" Models with this property support nominal (categorical) features """
struct Nominal <: Property end
""" Models with this property support features of numeric type (continuous or ordered factor) """
struct Numeric <: Property end
""" Classfication models with this property allow weighting of the target classes """
struct Weights <: Property end
""" Models with this property support features with missing values """ 
struct NAs <: Property end

And model declarations look something like this:

mutable struct DecisionTreeClassifier{T} <: Supervised{DecisionTreeClassifierFitResultType{T}} 
    pruning_purity::Float64 
    max_depth::Int
    min_samples_leaf::Int
    min_samples_split::Int
    min_purity_increase::Float64
    n_subfeatures::Float64
    display_depth::Int
    post_prune::Bool
    merge_purity_threshold::Float64
end

# metadata:
properties(::Type{DecisionTreeClassifier}) = [MultiClass(), Numeric()]
operations(::Type{DecisionTreeClassifier}) = [predict]
type_of_X(::Type{DecisionTreeClassifier}) = Array{Float64,2}
type_of_y(::Type{DecisionTreeClassifier}) = Vector
@darenasc (Collaborator):

Looks good to me.

An additional property for a model could be Ranking.

Also, default values for parameters could be included in a constructor method when the model is created.

@fkiraly (Collaborator) commented Nov 22, 2018

Hm, I'd propose a design which is a bit more structured than mlr's "let's dump all metadata as flags".

For example, metadata indicating a supervised learner (regressor or classifier, single class or multi-class, probabilistic or not) could be defined through the input and output types it supports.

E.g., given through a type theoretic arrow expression
[continuous] {or} [factor] {mapsto} [univariate][factor]
would define a deterministic (univariate) classifier,
[continuous] {or} [NAble][factor] {mapsto} [multivariate][continuous]
would define a multi-target regressor that can deal with NAs in factor variables but not in continuous variables, and
[continuous] {mapsto} [distributionof][univariate][factor]
would define a probabilistic classifier (i.e., one that has predict_proba in scikit-learn terminology) that can take only numerical inputs.

For type theory aficionados, the estimator would have a first-order type "predictor" with argument type given by the formula; {mapsto} is the type arrow, and [...] are types, some of which are first-order types in prefix notation.

Practically, the "holy grail" solution would have this as a model specification language with a rudimentary compiler that would be able to construct standard models from primitives via first-order operations (via term resolution calculus or similar).
The "basic" solution could have a semantic hierarchy that has a bucket for input types and output types, where the output type bucket has flags attached that say "probabilistic" or "multivariate", and the input bucket has flags attached such as "NA" etc.

@fkiraly (Collaborator) commented Nov 22, 2018

Regarding the "holy grail" solution, here is why it would be useful:

Say the user specifies a problem that needs to cope with NAs and specific input types, e.g., [strangetype]. Unfortunately, none of the off-the-shelf strategies can deal with [strangetype] and NAs in the other variables simultaneously.

However, the resolution calculus (e.g., a clippy-style assistant for MLJ) can tell us that the registry has an imputer algorithm for the non-strange NAs, which can be connected with the predictor algorithm to form a composite/network algorithm that can now deal with both the NAs and the [strangetype] - problem solved.

@ablaom (Member, Author) commented Nov 23, 2018

@darenasc The default values are encoded in the keyword constructor (omitted above):

function DecisionTreeClassifier(
    ; target_type=nothing 
    , pruning_purity=1.0
    , max_depth=-1
    , min_samples_leaf=1
    , min_samples_split=2
    , min_purity_increase=0.0
    , n_subfeatures=0
    , display_depth=5
    , post_prune=false
    , merge_purity_threshold=0.9)

    target_type === nothing && error("You must specify target_type=...")

    model = DecisionTreeClassifier{target_type}(
        pruning_purity
        , max_depth
        , min_samples_leaf
        , min_samples_split
        , min_purity_increase
        , n_subfeatures
        , display_depth
        , post_prune
        , merge_purity_threshold)

     <omitted tests for invariants>

     return model
end

There is a small issue in this case, though. I would like to be able to call DecisionTreeClassifier() to get a default object (from which I can get a dictionary of defaults), but I have to specify the type parameter as an argument. Probably best to split this into two: specify a dictionary for the parameters with defaults and use this in the keyword constructor.


Edit: Not an issue. Add the target type to the hyperparameters, and then every parameter has a default value.

@ablaom (Member, Author) commented Nov 23, 2018

@fkiraly Organising the information around supported input and output types makes sense. However, I'm not convinced representing the metadata using a compiled specification language is worth the trouble (and extra abstraction for the user) if we can encode exactly the same information in some (binned) flags.

@fkiraly (Collaborator) commented Nov 23, 2018

@ablaom A meta-spec language only makes sense in a larger ecosystem where enough non-standard cases are supported. For tabular data, I'd agree there's no immediate need, but it would be an interesting design/research study if we had the resources.

Though if we were making a meta-spec language, I'd think it would/should look exactly like that.

My main point is that one would at least like some kind of hierarchy that makes conversion to, or interfacing with, the simplest "arrow"-slash-"mapsto" type easy, i.e., without having to write a stack of glue code encoding the "secret" knowledge of what all the flags are meant to mean.

@ablaom (Member, Author) commented Nov 29, 2018

After some discussion, a slightly more structured form of the metadata has been proposed. Here is an example:

MLJ.properties(::Type{SomeMulticlassClassifier}) = (CanRankFeatures(),)
MLJ.operations(::Type{SomeMulticlassClassifier}) = (predict, predict_proba)
MLJ.inputs_can_be(::Type{SomeMulticlassClassifier}) = (Numeric(), Nominal(), NA())
MLJ.outputs_are(::Type{SomeMulticlassClassifier}) = (Nominal(),)

Available options can be gleaned from this code extract:

# `properties(SomeModelType)` is a tuple of instances of:
""" Classification models with this property allow weighting of the target classes """
struct CanWeightTarget <: Property end
""" Models with this property can provide feature rankings or importance scores """
struct CanRankFeatures <: Property end

# `inputs_can_be(SomeModelType)` and `outputs_are(SomeModelType)` are tuples of
# instances of:
struct Nominal <: Property end
struct Numeric <: Property end
struct NA <: Property end

# additionally, `outputs_are(SomeModelType)` can include:
struct Probabilistic <: Property end
struct Multivariate <: Property end

# for `Model`s with nominal targets (classifiers)
# `outputs_are(SomeModelType)` could also include:
struct Multiclass <: Property end # can handle more than two classes
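
A task-matching query against such traits might then look something like this; localmodels stands for some hypothetical collection of registered model types:

# find all registered model types accepting nominal features with NAs,
# and supporting probabilistic prediction:
filter(localmodels) do M
    Nominal() in MLJ.inputs_can_be(M) &&
        NA() in MLJ.inputs_can_be(M) &&
        predict_proba in MLJ.operations(M)
end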

@ablaom (Member, Author) commented Nov 29, 2018

What suggestions are there for storing, updating and accessing the metadata? The constraints are: (1) external packages are loaded on a user-needs-it basis (lazy loading); and (2) the user wants access to the metadata before loading any packages.

@ablaom (Member, Author) commented Dec 3, 2018

#29 (comment)

@ysimillides added the design discussion label Dec 11, 2018
@ablaom (Member, Author) commented Jan 10, 2019

We are still looking for proposals on how to store metadata of supervised learning models (generally defined in external packages) that implement the MLJ model interface (now defined in MLJBase). Currently metadata is everything defined by the dictionary returned by info(model) or info(SomeSupervisedModelType), for example:

julia> using MLJ
julia> using MultivariateStats
julia> info(RidgeRegressor)
Dict{Symbol,Union{Array{Symbol,1}, String, Symbol}} with 5 entries:
  :is_pure_julia => :yes
  :package_uuid => "6f286f6a-111f-5878-ab1e-185364afe411"
  :package_name => "MultivariateStats"
  :target_is => Symbol[:deterministic, :numeric, :univariate]
  :inputs_can_be => Symbol[:numeric]

(But check adding_new_models.md for the complete spec.)

@fkiraly (Collaborator) commented Jan 12, 2019

Seems very sensible to me for a start - it should be easy to change the number, structure, and type of the dictionary entries in case we come up with a more sophisticated structure.

What I wanted to flag up is that eventually one may like to have a "model registry", that is, a singleton entity which can be queried to retrieve, or suggest, modelling strategies fitting a certain task. E.g., "find me all probabilistic supervised regression models that can deal with NAs in numerical variables and which are not Bayesian MCMC based, unless they are from packages I wrote".

For this kind of query, there would need to be some indexing of primitive strategies which works the other way: i.e., not getting the properties of strategies, but obtaining strategies with certain properties (see the sketch after the list below).

Hence, at least two additional things might make sense:

  • requiring a unique identifier for each primitive strategy
  • some way that later (i.e., not now, since this is not a priority) easily allows look-up. Maybe calling a later-written "register(RidgeRegressor)" by default when loading modules is all that's needed, though, so maybe this second point can be ignored for now.
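
To illustrate the look-up direction (STRATEGY_INDEX, the metadata keys, and find_strategies are all hypothetical):

# an index from unique strategy identifiers to their metadata:
const STRATEGY_INDEX = Dict{Symbol,Dict{Symbol,Any}}()

# obtain strategies with certain properties, rather than properties of a strategy:
find_strategies(query) = [id for (id, meta) in STRATEGY_INDEX if query(meta)]

# e.g., all strategies whose inputs can be numeric:
# find_strategies(meta -> :numeric in meta[:inputs_can_be])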

@fkiraly (Collaborator) commented Jan 12, 2019

Regarding lazy loading with queryable content: that would be solved by a registry - and while that's a very "object oriented" solution, I cannot see how it is solved by a functional design, simply because dispatching requires a struct that's already there.

The registry could call "register" methods from some default packages or interfaces which load model information from other packages (e.g., from a "registry" part) without loading the full package. Further, non-default packages could be asked to provide a "register" file, sub-module, or interface, which could be loaded into the main registry, which would index the contents.
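
Sketching that idea, and building on the hypothetical STRATEGY_INDEX above:

# called from a package's lightweight "registry" file or sub-module, so that
# model information is indexed without loading the full package:
register(id::Symbol, metadata::Dict{Symbol,Any}) = (STRATEGY_INDEX[id] = metadata)

# e.g.:
# register(:RidgeRegressor, Dict{Symbol,Any}(
#     :package_name => "MultivariateStats",
#     :inputs_can_be => [:numeric]))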

@ablaom (Member, Author) commented Jan 14, 2019

@fkiraly Could you clarify for me exactly what you mean by a "primitive" strategy?

@fkiraly (Collaborator) commented Jan 14, 2019

@ablaom You call it atomic - your terminology is perhaps better, since "atomic" means "indivisible" in Greek, which is literally what it is meant to mean here. "Primitive" means "the simplest" in Latin and is slightly more vague, since simplicity is subjective, while compositionality (from atoms) is not.

@nalimilan (Contributor):

Just in case you're not aware of it, StatsBase defines a set of functions for packages to override, which includes predict. There's also a discussion regarding transform and its inverse (which is currently called reconstruct in MultivariateStats). You may want to chime in there.

@ablaom (Member, Author) commented Feb 21, 2019

I have been thinking more seriously about the task interface, and I am revisiting the way we currently organise models and metadata. I am inviting comment before proceeding with some revisions. Unless objections are raised, I will proceed early next week with implementation.

Here's more-or-less my proposed plan:

  1. The existing model type hierarchy remains the same. In particular, we split Supervised into Probabilistic and Deterministic.

  2. For supervised learners, the trait functions input_quantity and output_quantity (taking values :univariate, :multivariate) stay the same (maybe rename output_quantity to target_quantity).

  3. With a more formalised notion of "scientific type" (see Conventions about the representations of scalar data #86 and, in particular, this draft document), we replace the input_kinds trait with an input_scitypes_upperbound trait and, similarly, replace output_kind with target_scitype_upperbound. The return value of input_scitypes_upperbound is any subtype of Union{Missing, Found} and that of target_scitype_upperbound is any subtype of Found. Some details on their planned function:

    • Corresponding to input_scitypes_upperbound is the SupervisedTask field input_scitypes, whose value, inferred from the task data, is the union over all x in the input data (rows and columns) of scitype(x). For example: Union{Missing, Multiclass{3}, Continuous}.

    • Corresponding to target_scitype_upperbound is the SupervisedTask field target_scitype whose value, inferred from the task data, is a tuple of scientific types (subtypes of Found), one for each target column, indicating the union, over all x in the column, of scitype(x). For example: Tuple{OrderedFactorInfinite, Multiclass{7}}.

    • A model M is deemed to match the task T if input_scitypes(T) <: input_scitypes_upperbound(M) and target_scitype(T) <: target_scitype_upperbound(M), and if all the other (simpler) traits match the corresponding values inferred from the task data (a rough sketch of this predicate follows).
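
A rough sketch of the matching predicate, assuming a SupervisedTask type carrying input_scitypes and target_scitype fields as described:

# does model type M match task T on the scitype traits?
function matches(M::Type{<:Supervised}, T::SupervisedTask)
    # the remaining, simpler traits would be checked against the task similarly
    return T.input_scitypes <: input_scitypes_upperbound(M) &&
           T.target_scitype <: target_scitype_upperbound(M)
end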

@fkiraly (Collaborator) commented Feb 21, 2019

Short comments:

  • I assume these are metadata flags which are automatically set (e.g., in an infer_traits method) when a task is created on data?
  • Also, I wouldn't call it "upperbound" - I understand what you want to say (in terms of set containedness for the set of types to necessarily support, right?), but it might sound confusing, as the primary association is an upper bound on the reals.

@ablaom (Member, Author) commented Feb 21, 2019

Short comments:

  • I assume these are metadata flags which are automatically set (e.g., in an infer_traits method) when a task is created on data?

Yes, inferred from the data, not user-defined.

  • Also, I wouldn't call it "upperbound" - I understand what you want to say (in terms of set containedness for the set of types to necessarily support, right?), but it might sound confusing, as the primary association is an upper bound on the reals.

Yes, I mean an upper bound in the sense of the partial order on the set of types by inclusion. Is there a domain-specific alternative name for the notion of upper bound?

@fkiraly (Collaborator) commented Feb 21, 2019

Yes, I mean an upper bound in the sense of the partial order on the set of types by inclusion. Is there a domain-specific alternative name for the notion of upper bound?

My point is just that the common user won't know what an upper bound is in a general order theoretic context, or specifically in the context of the partial order of sets with inclusion.

Something along the lines of "requiressupportforscitypes" or "containsthefollowingscitypes" or "necessaryinputscitypes" but more concise might be better.

Just following the design principle of giving self-evident or descriptive names to user-facing functionality.
The rookie mistake is calling things "a" or "b" or "variable1"; the expert mistake (which I'm most definitely not innocent of, but am always happy to point out in others) is giving things eclectically obscure names that only other experts understand...
