Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add high-level Config and Analysis classes #2323

Merged
merged 14 commits into from Aug 28, 2019

Conversation

@Bultako
Copy link
Member

commented Aug 23, 2019

This PR is the first of a series related with PIG 12: High-level interface

It introduces the Analysis class as a session class. At this moment it only does handling of the setting and validation of the analysis session configuration parameters. The main points to highlight are:

  • jsonschema dependency for validation of configuration parameters against a defined schema
  • validation of astropy quantities and units for the values of the configuration parameters
  • user setting, modification and removal of configuration parameters using the interface and/or providing a user defined configuration file
  • display of the actual configuration parameters of the session
  • reset method to recover the original configuration state
@cdeil cdeil self-assigned this Aug 23, 2019
@cdeil cdeil added the feature label Aug 23, 2019
@cdeil cdeil added this to the 0.14 milestone Aug 23, 2019
@Bultako Bultako force-pushed the Bultako:HLI branch from 40c700b to 20bef78 Aug 23, 2019
@cdeil

This comment has been minimized.

Copy link
Member

commented Aug 23, 2019

@Bultako - Thanks!

This is very hard to review. I see a complex Singleton class, but it's not clear why it's needed or why we would want that, and also other code like the reset method that isn't used or tested and not clear to me if we want that. Can you please delete that, or show a useful docs or test example?

Like I said in the call this morning, I would suggest to always do use-case and test-driven development, and as a first step to try to drive this via the new code:
https://github.com/gammapy/gammapy/blob/master/gammapy/scripts/spectrum_pipe.py
and to migrate the existing test case for that to the new scheme and then to delete that code.
At the start what you get will be a mix of using the config file and hard-coded methods that do stuff based on config file content, I think that's fine and even if we manage to get something better that would be the path forward.

@Bultako - Where do you want to go with this PR? Continue to a point where this drives an analysis (either the spectral analysis, or what gammapy bin cube does now to have a simpler example), or merge this soon, and continue in a new PR?

Copy link
Member

left a comment

Suggest to just take the config dict in __init__ and to do the I/O in a @classmethod.
We do that in most other classes in Gammapy, including SpectrumAnalysisIACT.

This is more flexible, it allows one to create and run analyses via this API without having to write config files to file, one can just put config in a dict (or fetch it from disk or the web or anywhere) and pass it to Analysis. For the common case to read from dict we provide a classmethod. OK?

gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
@cdeil

This comment has been minimized.

Copy link
Member

commented Aug 23, 2019

I didn't look at the design of Fermipy or ctapipe or other packages for config-handling.
Would it make sense to have an AnalysisConfig class that mostly does the schema validation, merging in defaults with user-defined settings etc.?
And then the Analysis class would just make one of those, and it's job would be to orchestrate execution based on a config?
I think this separation of concerns into two classes might be simpler than a single class?

@Bultako

This comment has been minimized.

Copy link
Member Author

commented Aug 24, 2019

@cdeil
Thanks very much for the review.
I think I've addressed most of your comments and suggestions.

I've greatly simplified the Analysis class.

I have completely switched to an optic of seeing users as consenting adults. This has reduced much of the complex code, so it also allow us to not consider (for the moment) having another class for configuration handling. As it is the case with Fermipy, we can instantiate the class with a config file or/and with config parameters passed at that moment. By default the Analysis class will consider a small set of config params declared in the default.yaml file, the user can also decide not to consider the default file, but his own file, also params passed as **kwargs (that will be rewrite those present in the config file) or neither file not params, and in that case she will be filling and modifying the config params of an empty config class during the analysis session. The validation of configuration params has to be done explicitly by the user at any moment with analysis.validate_config(), the implicit validation when setting/modifying params has been removed.

Re: Analysis class as a singleton
This may be an interesting point of discussion. I see Analysis also as a session class, logging actions in a single history, hence I wanted to be sure that no other Analysis class was opened in parallel to not miss any info of the history. But it is also true that we could provide the possibility to use (maybe in scripts) several different Analysis classes in parallel. For now I've removed the singleton constraint, we could face this in the future if need arrives.

Since we do have now a smaller amount of code to review, I would prefer to merge this PR soon before starting a bigger one addressing a real use case, or at least agree on the code we have now.

Copy link
Member

left a comment

@Bultako - Thanks!

I think 100% we should for now write Gammapy with the "consenting adults" philosophy. Another way how I think about it is that we are still in the process of building an MVP (minimum viable product). So we should aim to get a working config-driven analysis, then see how it works, maybe refactor, and once we think it's OK, we can polish and put in more safeties and make it more user friendly. But first we need the functionality & design, and that will need a few iterations, i.e. as little & simple code as possible is what we need.

Concerning the question of multiple analyses in parallel: I don't think we need this for v1.0, we'll introduce parallellism in Gammapy later. Figuring out the solution for logging in that case will be part of the task.

Now concerning the code here, I have a few fundamental comments: I think two classes ("Config" and "Analysis") would work better. The config handling how you do it here has some issues: you do "if" checks where you use a config value, at the moment only for logging. We will access config values in 100s of places, we can't do "if" checks everywhere. I think this can be resolved by having defaults. This guarantees that values are present, no if checks needed.
A second issue is that you have two parallel YAML files: defaults and schema. Basically the two files will have a very parallel structure, no? Then we should probably try to maintain it in a single file?

I've looked around a bit, and I think it would be worth looking at and stealing ideas from here:
https://github.com/tardis-sn/tardis/tree/1baa780cbe9327376f63f93e562f870b4dc8c07e/tardis/io

Specifically you can see here how they have YAML schema files that include the default value and a description string for every item.

Another example: https://github.com/gammapy/gamma-cat/tree/master/input/schemas

And another example of a config system where items have defaults and descriptions, and it's possible to programatically create & change config, and then do I/O to a config file any time:
https://docs.astropy.org/en/stable/config/index.html?highlight=configuration#adding-new-configuration-items

You need to write a setup_package.py file that declares the YAML files to be part of the Gammapy package data, so that it's copied when Gammapy is installed to site-packages. This will fix this error:
https://travis-ci.org/gammapy/gammapy/jobs/576178076#L3661
See example here or we also have a few of those in Gammapy.

Another important question is how the default config should interact with the user-defined config.
Usually this is done using an approach like https://docs.python.org/3/library/collections.html#collections.ChainMap or https://confuse.readthedocs.io/en/latest/#view-theory, i.e. to have a list of config dicts, and to look up first in user-defined config dict, then the default config dict. This is a data structure that allows to keep track of user config separately from default, i.e. it's possible to e.g. write just the user config for provenance to the output file (usually 5-10 values), instead of having a single config where it's merged in with default config (we'll have 100+ values, hard to find what parameters the user actually changed).

So I think for config what we'll end up with is either using some config package or roll our own, copying the approach from Astropy, Tardis, Fermipy, ... e.g. I don't know yet at which point we want to insert our special validators and object mappers for things like quantities, times, energies, skycoords, regions, ... I think this really should go in a separate Config class, and then the Analysis class and all other code in Gammapy doesn't have to worry about any validation or defaults etc, it can just get (and in some cases set?) config items. Note that it's likely that we would pass either the whole config, or sub-config objects to many Gammapy maker classes, i.e. we would interact with config from all over the place, not just from the Analysis class eventually, another reason why I think having a separate Config class makes sense. Also: all other existing codes (Fermipy, Tardis, ...) have a separate config class.

I would have not started with this work on config, but just with Analysis and taken a config dict on init. But OK, in the end it doesn't matter if we work on config or analysis first, OK to continue with this here. I do think it would be useful to continue with the mininal approach, and instead of adding two large YAML files with a lot of content that's untested or unreviewed, start minimal and e.g. just have a config with a handful of options, but to have them tested and working. E.g. logging verbosity you already have, keep that. The next most important item would probably be source position like here?

@Bultako - if you just want to merge this in, and then continue in new PRs, that's fine with me as well (although I don't think it's the efficient way to proceed). Just resolve the CI fail by declaring the package_data files. I just wanted to make some comments to clarify where I think eventually we'll have to go with the whole config system.

gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Show resolved Hide resolved
@cdeil

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

@Bultako - This is another project worth looking at IMO: https://github.com/mahmoud/glom
You could use it if you want, or just as an example.
Probably if we have defaults, we don't need any of the fallback logic, we can just write:

config["global"]["logging"]["level"]

and it's guaranteed to be there, because it's in the default config and we don't support users messing with the config, if they do, it can just fail with KeyError.

Having this I guess would be nice to have, but also causes problems because it's limited to string keys, so not really needed for now?

config.global.logging.level
config["global.logging.level"]

I think if you want to get good validation of config for setting e.g. for

config.global.logging.level = "spam"

at runtime isn't easy to achieve (but also not very hard), we'd need to have a ConfigItem proxy class or somehow trigger validation on setting. I think this is something we can develop later, for now we just have to try and get the config getting API right.

I also see that you use **kwargs in ``Analysis.initto set config. I suggest to not use that coding pattern, but to take an explicitconfig` dict argument instead. The problem with `**kwargs` is that then often what happens is that there's a mix of config things, and other arguments, and in the method complex logic to sort out the two accumulates. It's just as easy and cleaner to take this:

config = {...}
analysis = Analysis(config=config)
@cdeil

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

@Bultako - Merge conflict in setup.py. See #2279 (comment)

Could you please rebase this branch?

I tried locally, and found that git rebase --continue didn't work for some reason and I had to use git rebase --skip instead. See console log and result.

@Bultako

This comment has been minimized.

Copy link
Member Author

commented Aug 27, 2019

@cdeil
Thanks for all those suggestions.
I summarize below the main actions of what I've done:

  • Added setup_package.py to install the schema YAML file
  • Added a Config separated class dealing with parameters/settings handling.
  • Removed default.yaml file and use only one schema.yaml file to validate the settings, and also to declare default values. This has been done extending the jsonschema validator to use our own, that also implements validation of astropy quantities (I've added a test for this)
  • Stored settings in three different internal properties of the Config class (default settings, settings from file, user defined settings) at the moment of instantiation of Analysis
    Here are different examples on how to create an `Analysis` session class:

    >>> from gammapy.scripts import Analysis
    >>> cfg = {"general": {"out_folder": "myfolder"}}
    >>> analysis = Analysis(configfile="myfile.yaml", config=cfg)
    >>> analysis = Analysis(config=cfg)
    >>> analysis = Analysis()

I use a small _update_settings function of my own to not overwrite all the default settings by a user supplied dictionary of settings, but only those specified by the user in the config param or in his/her own configuration file. Yes, I think we might have a look at more fancy solutions for parameter handling, but I prefer to go ahead with a simple solution now and come back to this later when working with complex/real uses cases rise this need.

@Bultako Bultako force-pushed the Bultako:HLI branch from 2cae05d to 4dc6abc Aug 27, 2019
@Bultako Bultako force-pushed the Bultako:HLI branch from 4dc6abc to f6128aa Aug 27, 2019
Copy link
Member

left a comment

@Bultako - This is coming along nicely, thanks!

I left a few more comments inline.

I'm not sure if the settings member and accessing that from the outside is a good idea. I was expecting that config would at the very least behave like a dict where I can read config like this:
config["logging"]..., even if that's backed by 2 or 3 dicts behind the scenes. But I guess that's something we'll have to get experience with and can refactor. Could you please add a @property or something so that settings is described with a docstring if you make it the main public API we use from everywhere to get config? It's not obvious to me how it relates to the 3 internal config data members.
Can we get rid of the _file_settings? Usually a config is created either from file, or from an in-memory Python object like a dict, but I don't see how it's useful to keep both around on the config class? -> Keep it simple and remove _file_settings`?

gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/analysis.py Show resolved Hide resolved
gammapy/scripts/analysis.py Outdated Show resolved Hide resolved
gammapy/scripts/setup_package.py Outdated Show resolved Hide resolved
@Bultako

This comment has been minimized.

Copy link
Member Author

commented Aug 27, 2019

@cdeil
I think I've done all your suggestions, please check.
I write some examples to clarify what we have now for param settings handling :)

>>> analysis = Analysis()
>>> analysis.settings
{'general': {'logging': {'level': 'INFO'}, 'out_folder': '.'}}

>>> cfg = {"general": {"out_folder": "myfolder"}}
>>> analysis = Analysis(cfg)
>>> analysis.settings
{'general': {'logging': {'level': 'INFO'}, 'out_folder': 'myfolder'}}

Note how the default logging level info is still there in analysis.settings["general"], only the out_folder setting has been modified.

We can access and modify the settings from the analysis object, analysis has-a configuration and settings, so we can use analysis.settings and analysis.configuration.validate() to validate the config.

>>> analysis.settings["general"]["logging"]["level"] = "WARNING"
>>> analysis.configuration.validate()

But we need to proceed step-by-step to define nested properties

>>> analysis.settings["observations"]["datastore"]= "$GAMMAPY_DATA/cta/"
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-adc372c1634c> in <module>
----> 1 analysis.settings["observations"]["datastore"]= "$GAMMAPY_DATA/cta-1dc/"

KeyError: 'observations'
>>> analysis.settings["observations"] = {}
>>> analysis.settings["observations"]["datastore"] = "$GAMMAPY_DATA/cta-1dc/"
Copy link
Member

left a comment

One last CI fail:
https://travis-ci.org/gammapy/gammapy/jobs/577482103#L2692

I think we want to put jsonschema as a required dependency, just as pyyaml:
https://github.com/gammapy/gammapy/pull/2218/files#diff-6e73f737e53f1e0bf0a6eb0acb0b4526R108
You need to list it as required dependency in setup.cfg.

gammapy/scripts/tests/test_analysis.py Outdated Show resolved Hide resolved
@Bultako Bultako force-pushed the Bultako:HLI branch from 0ef3348 to 83b6f9a Aug 28, 2019
@cdeil

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

@Bultako - Thanks!

I've done some minor code formatting in f3ba360. (ran make black and make isort and renamed some variables and arguments to config instead of cfg or configuration to have less variation, with the goal for users to more easily remember the API.

I'm merging this now.

But FYI: I still think we should simplify and change the Config, to work like a collections.ChainMap with user settings before default settings on get, but still as separate objects so that defaults never get changed, but user settings can change. And we need tests for that. Once we have some usage and real-world experience with this class in the coming weeks, let's come back to this and try to simplify and improve the design.

@cdeil
cdeil approved these changes Aug 28, 2019
@cdeil cdeil changed the title High-level interface session class Add high-level Conf and Analysis classes Aug 28, 2019
@cdeil cdeil merged commit 90e0ad8 into gammapy:master Aug 28, 2019
7 of 9 checks passed
7 of 9 checks passed
Codacy/PR Quality Review Hang in there, Codacy is reviewing your Pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
Scrutinizer Analysis: 14 updated code elements – Tests: passed
Details
gammapy.gammapy Build #20190828.3 succeeded
Details
gammapy.gammapy (DevDocs) DevDocs succeeded
Details
gammapy.gammapy (Lint) Lint succeeded
Details
gammapy.gammapy (Test Python36) Test Python36 succeeded
Details
gammapy.gammapy (Test Windows36) Test Windows36 succeeded
Details
gammapy.gammapy (Test Windows37) Test Windows37 succeeded
Details
@Bultako Bultako deleted the Bultako:HLI branch Aug 28, 2019
@Bultako

This comment has been minimized.

Copy link
Member Author

commented Aug 28, 2019

@cdeil
Thanks !

Re: user settings and default settings

I've tried doing the settings handling with collections.ChainMap to have the behavior that you expose. The problem is that we work with nested dictionaries, so with ChainMap the whole upper level dict is replaced by what it is defined in the user config dict, and the default values are lost.

i.e.

>>> analysis = Analysis()
>>> analysis.settings
{'general': {'logging': {'level': 'INFO'}, 'outdir': '.'}}

>>> config = {"general": {"outdir": "myfolder"}}
>>> analysis = Analysis(config)
>>> analysis.settings
{'general': {'outdir': 'myfolder'}}

That's why I wrote my own function, but I agree we will need to get back to this.
I also would like to have a better way to set the config with analysis.settings, instead of writing a list of brackets and avoiding defining empty values for deep undefined nested levels.

>>> analysis.settings["observations"]["datastore"]= "$GAMMAPY_DATA/cta/"
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-adc372c1634c> in <module>
----> 1 analysis.settings["observations"]["datastore"]= "$GAMMAPY_DATA/cta-1dc/"

KeyError: 'observations'
>>> analysis.settings["observations"] = {}
>>> analysis.settings["observations"]["datastore"] = "$GAMMAPY_DATA/cta-1dc/"

I will revisit some of the links references you wrote above for inspiration.. But, yes parameter handling is still in my TODO list.

@cdeil cdeil changed the title Add high-level Conf and Analysis classes Add high-level Config and Analysis classes Sep 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.