
formalize extending common_cfg.definitions (and other things) in extensions? #6314

Closed
yarikoptic opened this issue Dec 14, 2021 · 8 comments · Fixed by #6601

Comments

@yarikoptic (Member)

Looking at http://docs.datalad.org/en/latest/customization.html (filed separately as #6313), and then skimming through the extension template and extensions, it seems we have not formalized how extensions could (or should at all) expand common_cfg.definitions (https://github.com/datalad/datalad/blob/master/datalad/interface/common_cfg.py#L46). ATM it seems to just be a matter of "documenting" those added/used config variables somewhere in docstrings, but IMHO they should be added to the aforementioned definitions. It would be particularly useful in the scope of the configuration command (#6306).

I guess we could just tell people to add entries to that (common_cfg.definitions) dictionary. But I am starting to feel that we need something like a datalad.support.extensions module which would provide formalized and centralized interfaces for extensions to extend datalad. E.g. in the case of extending configuration, it could be

def add_config_variable(name, ui, *, destination='global', default=None, type=None)

if we decide to just mimic the internal structure present ATM (maybe name it add_common_cfg to bring it closer to the current setup). If we later decide to formalize/change the internal config structure or the interface to add config variables, we would be able to provide a deprecation cycle, adjusting the initial (above) version to map into the new internal data structure, etc. So the benefits are multiple.

In the scope of #5963 we could also add an add_downloader.

@mih (Member)

mih commented Dec 15, 2021

Duplicate of #5769

@mih mih marked this as a duplicate of #5769 Dec 15, 2021
@mih (Member)

mih commented Dec 16, 2021

Importing the content from #5769, which partly overlaps with the OP:


Two modes of operations come to my mind:

Do nothing and don't fight extensions

Once imported, extensions could "monkey-patch" this dictionary and announce additional config. This would have minimal impact on performance, as it only happens when it becomes relevant. There are two major downsides: 1) the model would invite 3rd-party code to mess with the internals of datalad-core (fragile), and 2) generic tools like x_configuration would have a hard time picking these up (not impossible, though).
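Mode 1 in miniature — the dict and variable names below are simplified stand-ins for datalad.interface.common_cfg.definitions and a fictional extension setting:

```python
# Simplified stand-in for datalad core's internal registry
definitions = {
    'datalad.log.level': {'default': 'warning'},
}

# Somewhere in the extension's __init__.py, executed on import:
definitions['datalad.myext.endpoint'] = {
    'ui': ('question', {'title': 'Service endpoint'}),
    'default': 'https://example.com',
}

# Fragile: nothing validates the entry's structure, and any tool that
# reads `definitions` before the extension happens to be imported will
# simply never see this setting.
```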

Introduce update functionality that queries extensions (via entrypoint)

This will be slower, but could be implemented in a lazy-loading fashion (likely the simple dict needs to become a class). The big advantage is to have a formal interface for extensions to work with.

@yarikoptic (Member Author)

Well, in the original description of this issue I proposed a third (then, I guess) way:

Define DataLad core API for extensions to use to update internal DataLad "registries"

which is different from both modes listed in the prior comment. Unlike the 2nd (entry points), it would provide an interface in the core (not in extensions), which would allow for easier deprecation cycles if we need to change anything (if it were via entry points, "core" would need to support multiple "flavors" during deprecation and figure out which flavor of the interface an extension uses). I do not think there would be any speed penalty in comparison to "monkey-patch"ing.

@mih (Member)

mih commented Mar 23, 2022

I think a straightforward way to resolve this issue would be to turn datalad.interface.common_cfg.definitions into a class instance of a newly established facility for this purpose (third option).

I screened the code, and this class would need to support the following API subset of dict in order to be a drop-in replacement:

  • __setitem__()
  • __getitem__()
  • __contains__()
  • keys()
  • get()
  • items()

We would be able to easily add additional methods in the spirit of the proposed add_config_variable(), and future changes and deprecations would have a layer to live in.
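A rough sketch of such a drop-in class, exposing exactly the dict subset listed above plus a registration method in the spirit of add_config_variable(); the class name and entry layout are assumptions:

```python
class ConfigDefinitions:
    """Drop-in replacement for the plain definitions dict."""

    def __init__(self, initial=None):
        self._defs = dict(initial or {})

    # -- the dict API subset found in the code survey --------------------
    def __setitem__(self, key, value):
        self._defs[key] = value

    def __getitem__(self, key):
        return self._defs[key]

    def __contains__(self, key):
        return key in self._defs

    def keys(self):
        return self._defs.keys()

    def get(self, key, default=None):
        return self._defs.get(key, default)

    def items(self):
        return self._defs.items()

    # -- the formal extension-facing layer where deprecations could live --
    def add_config_variable(self, name, ui, *, destination='global',
                            default=None, type=None):
        self[name] = dict(ui=ui, destination=destination,
                          default=default, type=type)


definitions = ConfigDefinitions()  # would replace the module-level dict
```

Existing core code keeps using `definitions['...']`, `'...' in definitions`, `.get()` etc. unchanged, while extensions get the formal method.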

@mih (Member)

mih commented Mar 23, 2022

An aspect that still needs a resolution is how and when to trigger extensions to actually deposit their config definitions. In #6584 I tried to explore how fast this could be made, but it remains to be decided how such a mechanism needs to be used.

My conclusion is that unless we trigger the respective extension code centrally, the ability of extensions to configure anything will remain severely limited -- to influence core package behavior only within running extension code. However, this is already possible right now, and requires no changes.

So depending on the outcome of this decision, we can either close this or add a central entrypoint loop somewhere.

@yarikoptic (Member Author)

yarikoptic commented Mar 23, 2022

Indeed, a dedicated class/interface for compute_cfg_defaults (AKA cfg_defs AKA common_cfg.definitions) would be good! And then individual config definitions should also be typed as some @dataclass -- that would allow for type checking etc. My only worry would be if such formalization added notable performance overhead to initialization/use. But all of that is IMHO largely orthogonal to the problem that we do not have centralized interface(s) for extensions to operate on...
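Typing an individual definition as a @dataclass could look like this; the field names mirror the add_config_variable() signature proposed in the OP, and the lazy `default_fn` field is an assumption inspired by the expensive-defaults concern:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ConfigDefinition:
    """Typed definition of one config variable (illustrative fields)."""
    name: str
    ui: Any = None
    destination: str = 'global'
    default: Any = None
    # a callable, so a possibly expensive default stays lazy until needed
    default_fn: Optional[Callable[[], Any]] = None
    type: Optional[type] = None


d = ConfigDefinition('datalad.myext.batch-size', default=10, type=int)
```

With frozen instances, definitions cannot be mutated behind the registry's back, and static type checkers can verify extension code against the fields.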

... this is already possible right now, and requires no changes.

Given that we have not yet had a use case where we needed a "less limited" influence from extensions, I think that

  • it is ok "to influence core package behavior only within running extension code", as long as we add the possibility for extensions to run that code of theirs at datalad interfaces (CLI or datalad.api) import time, or upon an explicitly requested "load all extensions" (e.g. for wtf) if needed. I don't think we should "discover" extensions for a generic datalad import
  • extensions should not (as they need to now) refer to arbitrary nested and ad-hoc structures and interfaces in datalad core submodules to trigger some interface to "register" themselves
  • it would be nice DX if all "changes" an extension could make were conveniently available via "discoverable interfaces" (like it is possible to discover what methods any given class has)

ATM we use

  • entry points (these are good in that we can at least find the relevant code in datalad-core by git grep 'iter_entry_points(')
    • datalad.extensions: expected to provide an ad-hoc tuple structure collecting ad-hoc (tuple) UI interface definitions to be added. Notes: 1. a more logical entry point name could probably be datalad.interfaces; 2. we do not use this entry point in datalad core to specify its own interfaces
    • datalad.metadata.extractors: metadata extractors. Notes: we do populate that entry point even within datalad core's setup.py.
    • datalad.metadata.indexers: a new one for datalad-metalad AFAIK
    • datalad.tests: the one to run datalad tests, but I am not sure it is actually used
    • in ENH: Add ability to expand downloaders from entry points #5963 I tried to add a datalad.downloaders entry point to add more downloaders; didn't finish / forgot the immediate motivation
  • config directly:
  • an agreed-upon location (resources/procedures) taken from the top of the datalad extension's module.

What I was thinking of is a singular entry point for each extension, which would then do all desired registrations of the extension's extra features, something like

def myextension_entry_point(extensions):
    extensions.add_config(name, ui, *, destination='global', default=None, type=None)  # come up with your example ;)
    with extensions.add_interfaces_group("Containerized environments") as g:
        # optional kwargs can all be deduced from the name of the module,
        # or provided explicitly
        g.add_interface('datalad_container.containers_list'[, cls=None, pyinterface=None, cliinterface=None])
        g.add_interface('datalad_container.containers_remove')
        ...
    extensions.add_downloader(...)
    extensions.add_procedures([path='resources/procedures'])  # use the same traversal magic, but now explicitly requested

and "datalad core" would, in a single helper, iterate over all such registered datalad.extensions entry points and trigger the necessary changes so that they take effect within datalad core's internal structures/classes, thus eliminating multiple points of iteration over extensions.
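The core side of that could be sketched as a single loop over a hypothetical 'datalad.extensions' entry point group, handing each extension a registration facade (group, class, and method names here are all assumptions; the `entry_points(group=...)` signature requires Python 3.10+):

```python
from importlib.metadata import entry_points


class ExtensionRegistry:
    """Facade collecting everything an extension registers."""

    def __init__(self):
        self.configs = {}
        self.interfaces = []

    def add_config(self, name, ui=None, **kwargs):
        self.configs[name] = dict(ui=ui, **kwargs)

    def add_interface(self, modpath, **kwargs):
        self.interfaces.append((modpath, kwargs))


def load_extensions(registry):
    # the singular iteration over entry points
    for ep in entry_points(group='datalad.extensions'):
        try:
            # each extension ships one entry point function taking the facade
            ep.load()(registry)
        except Exception as e:
            # one failing extension must not break all the others
            print(f'skipping extension {ep.name}: {e}')
```

Note the per-extension try/except: as discussed in the cons below, the extension itself (or this loop) has to decide how partial registration failures are handled.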

pros:

  • clear interfaces, instead of "documented tuple-based structures"
  • completion on extensions (it would be just a submodule, or maybe some fancy class instance) -- so one can see what one can do for an extension
  • a singular location for an extension to do registration (not spread through setup.{cfg,py}, an assumed location for procedures, etc.)

cons I see in such an approach:

  • registering an extension would register all its capabilities at once, not a "per needed location" iteration over entry points. So maybe it would become slower? But maybe faster, since there is a singular iteration over entry points
  • an error (e.g. due to an incompatibility in updated interfaces of the extensions helpers) in one aspect (e.g. config) would disable importing that extension entirely, but only after it might already have been partially registered (e.g. interfaces added). To some degree this could be taken as a pro if e.g. we recommend registering config before interfaces -- if registration of config fails, most likely the interfaces shouldn't be used either
    • a pro could be that the extension might be able to decide what it wants to do about the error
  • an extension could add arbitrarily complicated logic to test for the presence of other extensions etc. in deciding what to register/do
  • we would need to organize the "internal" functioning and interfaces/structures in such a way that we do not somehow manage to delay extension registration... e.g. add_interface should not modify some internal dict of interfaces while that dict of interface definitions is being iterated over -- probably all interfaces from extensions should be kept in a separate data structure to be traversed in addition

@mih (Member)

mih commented Mar 23, 2022

The use cases are plenty. My mental model for thinking about this is clone. What would have prevented the RIA code from having to enter the core package?

@yarikoptic (Member Author)

Great use case -- clone of RIA. Unfortunately I do not have a mental map of the RIA functionality within datalad core. Is there a list of RIA-specific functionality/configuration to be moved out of "datalad core"? Then we could see what "interfaces" are needed for the datalad-ria extension and where they would need to be "applied".

mih added a commit to mih/datalad that referenced this issue Mar 30, 2022
This also replaces the previously very flexible data structure
for such items with a more rigid set of classes that implement
lazy evaluation of defaults (avoiding their possibly expensive
computation, even when they are not actually needed). They otherwise
implement all elements of the previously used API, such that they
work as a drop-in replacement.

Only register_config() is intended to be used publicly.

Fixed datalad#6314
mih added a commit to mih/datalad that referenced this issue Apr 1, 2022
(same commit message as above)

Fixed datalad#6314