Proposal for metadata #22
Comments
Looks good to me. An additional property for a model could be added. Also, default values for parameters could be included in a constructor method when the model is created.
Hm, I'd propose a design which is a bit more structured than mlr's "let's dump all metadata as flags". For example, metadata indicating a supervised learner (regressor or classifier, single-class or multi-class, probabilistic or not) could be defined through the input and output types it supports, e.g., given through a type-theoretic arrow expression. For type theory aficionados: the estimator would have a first-order type "predictor" with argument type given by the formula; {mapsto} is the type arrow, and [...] are types, some of which are first-order types in prefix notation. Practically, the "holy grail" solution would have this as a model specification language with a rudimentary compiler that would be able to construct standard models from primitives via first-order operations (via a term resolution calculus or similar).
Regarding the "holy grail" solution, why is this useful: say the user specifies a problem that needs to cope with NAs and specific input types, e.g., [strangetype]. Unfortunately, none of the off-the-shelf strategies can deal with [strangetype] and NAs in the other variables simultaneously. However, the resolution calculus (e.g., a clippy assistant for MLJ) can tell us that the registry has an imputer algorithm for the non-strange NAs which can be connected with the predictor algorithm into a composite/network algorithm, which can now deal with both the NAs and the [strangetype]: problem solved.
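A minimal Julia sketch of this composition idea. All names here are hypothetical illustrations, not part of any package: `Imputer`, `Predictor`, `Composite`, and the trait `handles_na` just stand in for registered primitives and their metadata.

```julia
struct Imputer end       # hypothetical primitive that fills NAs
struct Predictor end     # hypothetical predictor that cannot handle NAs

handles_na(::Type{Imputer}) = true
handles_na(::Type{Predictor}) = false

# a composite of two strategies handles NAs if its first stage does
struct Composite{A,B} end
handles_na(::Type{Composite{A,B}}) where {A,B} = handles_na(A)

# "resolution": if the requested strategy cannot handle NAs, prepend an imputer
resolve(P::Type) = handles_na(P) ? P : Composite{Imputer,P}

resolve(Predictor)  # Composite{Imputer, Predictor}
```

The point is that the composite's capabilities are derivable from the capabilities of its parts, which is what a resolution calculus would exploit.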
@darenasc The default values are encoded in the keyword constructor (omitted above):
There is a small issue here in this case, though. I would like to just call it directly. Edit: Not an issue. Add the target type to the hyperparameters, and every parameter has a default value.
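For illustration, such a keyword constructor with defaults might look like this; the model name and hyperparameters are hypothetical, not taken from an actual MLJ package:

```julia
# Hypothetical sketch: a model struct whose keyword constructor supplies
# a default value for every hyperparameter.
mutable struct SomeRegressor
    lambda::Float64
    target_type::DataType
end

# keyword constructor encoding the defaults
SomeRegressor(; lambda=0.1, target_type=Float64) =
    SomeRegressor(lambda, target_type)

SomeRegressor()             # all defaults
SomeRegressor(lambda=1.0)   # override a single hyperparameter
```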
@fkiraly Organising the information around supported input and output types makes sense. However, I'm not convinced that representing the metadata using a compiled specification language is worth the trouble (and the extra abstraction for the user) if we can encode exactly the same information in some (binned) flags.
@ablaom A meta-spec language only makes sense in a larger ecosystem where enough non-standard cases are supported. For tabular data, I'd agree there's no immediate need, but it would be an interesting design/research study if we had the resources. Though if we were making a meta-spec language, I'd think it would/should look exactly like that. My main point is that one would at least like some kind of hierarchy that makes conversion to, or interfacing with, the simplest "arrow"-slash-"mapsto" type easy, i.e., without having to write a stack of glue code encoding the "secret" knowledge of what all the flags are meant to mean.
After some discussions, a slightly more structured form of the metadata has been proposed. Here is an example:

```julia
MLJ.properties(::Type{SomeMulticlassClassifier}) = (CanRankFeatures(),)
MLJ.operations(::Type{SomeMulticlassClassifier}) = (predict, predict_proba)
MLJ.inputs_can_be(::Type{SomeMulticlassClassifier}) = (Numeric(), Nominal(), NA())
MLJ.outputs_are(::Type{DecisionTreeClassifier}) = (Nominal(),)
```

Available options can be gleaned from this code extract:
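With trait functions like these, model discovery reduces to filtering types by their declared traits. A self-contained toy sketch (the trait names follow the proposal above; the property types, model types, and the `MODELS` list are hypothetical):

```julia
# Hypothetical sketch: filter registered model types by declared traits.
abstract type Property end
struct Numeric <: Property end
struct Nominal <: Property end
struct NA <: Property end

struct SomeMulticlassClassifier end
struct SomeRegressor end

inputs_can_be(::Type{SomeMulticlassClassifier}) = (Numeric(), Nominal(), NA())
inputs_can_be(::Type{SomeRegressor}) = (Numeric(),)

const MODELS = [SomeMulticlassClassifier, SomeRegressor]

# all models whose inputs may contain missing values
filter(M -> any(p -> p isa NA, inputs_can_be(M)), MODELS)
# only SomeMulticlassClassifier remains
```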
What suggestions are there for storing, updating and accessing the metadata? The constraints are: (1) external packages are loaded on a user-needs-it basis (lazy loading); and (2) the user wants access to the metadata before loading any packages.
We are still looking for proposals on how to store metadata of supervised learning models (generally defined in external packages) that implement the MLJ model interface (now defined in MLJBase). Currently metadata is everything defined by the dictionary returned by `info`:

```julia
julia> using MLJ

julia> using MultivariateStats

julia> info(RidgeRegressor)
Dict{Symbol,Union{Array{Symbol,1}, String, Symbol}} with 5 entries:
  :is_pure_julia  => :yes
  :package_uuid   => "6f286f6a-111f-5878-ab1e-185364afe411"
  :package_name   => "MultivariateStats"
  :target_is      => Symbol[:deterministic, :numeric, :univariate]
  :inputs_can_be  => Symbol[:numeric]
```

But check
Seems very sensible to me for a start - it should be easy to change the number, structure, and type of the dictionary entries in case we come up with a more sophisticated structure. What I wanted to flag up is that, eventually, one may like to have a "model registry", that is, a singleton entity which can be queried to retrieve, or suggest, modelling strategies fitting a certain task. E.g., "find me all probabilistic supervised regression models that can deal with NAs in numerical variables and which are not Bayesian MCMC based, unless they are from packages I wrote". For this kind of query, there would need to be some indexing of primitive strategies which works the other way around: i.e., not getting the properties of a strategy, but obtaining strategies with certain properties. Hence, at least two additional things might make sense:
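The reverse lookup described above could be sketched as a queryable registry. Everything here is a hypothetical illustration, not a concrete API proposal:

```julia
# Hypothetical sketch of a "model registry": strategies are indexed by their
# properties, so one can query for strategies rather than query a strategy.
const REGISTRY = Dict{Symbol,Dict{Symbol,Any}}()

register!(name::Symbol; properties...) =
    (REGISTRY[name] = Dict{Symbol,Any}(properties))

# reverse lookup: all registered strategies matching the given properties
query(; wanted...) =
    [name for (name, props) in REGISTRY
     if all(get(props, k, nothing) == v for (k, v) in wanted)]

register!(:RidgeRegressor; is_probabilistic=false, handles_na=false)
register!(:SomeBayesianModel; is_probabilistic=true, handles_na=true)

query(is_probabilistic=true, handles_na=true)  # -> [:SomeBayesianModel]
```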
Regarding lazy loading with queryable content: that would be solved by a registry - while that's a very "object-oriented" solution, I cannot see how it is solved by a functional design, simply given that dispatching requires a struct that's already there. The registry could call "register" methods from some default packages or interfaces, which load model information from other packages (e.g., from a "registry" part) without loading the full package. Further, non-default packages could be asked to provide a "register" file, sub-module, or interface, which could be loaded into the main registry, which would index the contents.
@fkiraly Could you clarify for me exactly what you mean by a "primitive" strategy?
@ablaom You call it atomic - your terminology is perhaps better, since "atomic" means "indivisible" in Greek, which is literally what it is meant to mean here. "Primitive" means "the simplest" in Latin and is slightly more vague, since simplicity is subjective, while compositionality (from atoms) is not.
Just in case you're not aware of it, StatsBase defines a set of functions for packages to override, which includes
I have been thinking more seriously about the task interface, and I am revisiting the way we currently organise models and metadata. I am inviting comment before proceeding with some revisions. Unless objections are raised, I will proceed early next week with implementation. Here's more-or-less my proposed plan:
Short comments:
Yes, inferred from the data, not user-defined.
Yes, I mean an upper bound in the sense of the partial order on the set of types given by inclusion. Is there a domain-specific alternative name for the notion of upper bound?
My point is just that the common user won't know what an upper bound is in a general order-theoretic context, or specifically in the context of the partial order of sets with inclusion. Something along the lines of "requiressupportforscitypes" or "containsthefollowingscitypes" or "necessaryinputscitypes", but more concise, might be better. Just following the design principle of giving self-evident or descriptive names to user-facing functionality.
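Concretely, the "upper bound" requirement amounts to a subset check. A toy sketch using plain symbols in place of actual scientific types (the symbol names are illustrative only):

```julia
# Hypothetical sketch: "inputs must have this scitype upper bound" expressed
# as a subset check on the scitypes actually occurring in the data.
supported = Set([:continuous, :count, :multiclass])   # declared by the model
data_scitypes = Set([:continuous, :count])            # inferred from the data

issubset(data_scitypes, supported)  # true: the model can ingest this data
```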
I'm inviting feedback on a suggestion for encoding metadata.
We would like to associate certain metadata with models (most of these being defined in external packages). The main purpose of the metadata is so we can mimic the R task interface, which allows a user to match task specifications (e.g., I want a classifier that handles nominal features) to a list of qualifying models.
I expect a local registry will store the model metadata, with a macro call updating the registry each time a model is defined (which means when the user imports the relevant external package, in the case of lazily loaded interfaces).
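The macro-call-updates-registry idea might look roughly like this. This is entirely hypothetical; the macro name, registry layout, and metadata keys are illustrative, not a settled design:

```julia
# Hypothetical sketch of the registration mechanism: a macro that records
# metadata in a local registry at the point a model interface is defined.
const METADATA = Dict{String,Dict{Symbol,Any}}()

macro register_model(name, pairs...)
    entries = map(esc, pairs)
    quote
        METADATA[$(string(name))] = Dict{Symbol,Any}($(entries...))
    end
end

@register_model(RidgeRegressor,
    :package_name => "MultivariateStats",
    :is_pure_julia => :yes)

METADATA["RidgeRegressor"][:package_name]  # "MultivariateStats"
```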
We suggest metadata consist of:
- the supported operations (`predict`, `predict_proba`, `inverse_transform`, etc.)

Note that at present, the only subtypes of `Model` (our abstract type for the hyperparameter containers) are `Supervised` and `Unsupervised`; so `Regression`, `Classification` and `MultiClass` are just properties.

In the core code we do something like this:
And model declarations look something like this: