Absorption database abstraction, Dynaconf migration #397

Merged
merged 12 commits into from
Apr 10, 2024

leroyvn
Member

@leroyvn leroyvn commented Apr 7, 2024

Description

Experimental, do not merge!

This PR brings two major changes:

  • Transition of configuration to Dynaconf. The environ-config framework is replaced by Dynaconf. The main motivation is the need for a solution that makes it easy to specify configuration using rc files. Other bonus features: hierarchical settings, and environment-dependent defaults (not yet implemented, but all the building blocks are in place).
  • Addition of an abstraction to handle large molecular absorption databases. This is long overdue and finally allows us to manage the convenience vs access performance vs memory footprint tradeoff. In particular, access performance (almost) no longer depends on database size. Other bonus features: simplified atmosphere setup specification, and more concise code for improved maintainability.

Transition of configuration to Dynaconf

The new configuration framework is based on Dynaconf. We retain the main features of the previous system based on environ-config:

  • Settings can still be configured using environment variables.
  • Settings still go through conversion and validation.

In addition, the new system brings the following new features:

  • Hierarchical settings are now possible. This feature is the primary reason for this migration: to make the error handling behaviour for absorption database interpolation errors configurable, we needed hierarchical settings (using a flat setup would have been confusing and messy).
  • Settings can now be configured using rc files written in the TOML language. For those wondering why we settled on TOML instead of YAML (already used in some parts of our codebase): TOML has a better specification and is usually seen as better suited for configuration.
  • Eradiate now loads a base configuration from a default settings file that can be replaced. In practice, this allows us to offer contextual defaults, e.g. differentiating between development, testing and production environments.
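
To illustrate why hierarchical settings help here, the following is a minimal, standard-library-only sketch of layered configuration: nested defaults overridden by prefixed environment variables, where a double underscore descends one level (the convention Dynaconf uses for nested keys). It is not Dynaconf's implementation, and the setting names (`absorption_database`, `error_handling`, `bounds`) and values are hypothetical placeholders, not Eradiate's actual keys.

```python
import copy

# Hypothetical defaults; the real Eradiate setting names and values may differ.
DEFAULTS = {
    "progress": "kernel",
    "absorption_database": {"error_handling": {"bounds": "raise"}},
}


def apply_env_overrides(defaults, environ, prefix="ERADIATE_"):
    """Return a copy of `defaults` updated from prefixed environment variables.

    A double underscore in the variable name descends one level in the
    settings hierarchy.
    """
    settings = copy.deepcopy(defaults)
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = settings
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return settings


# A plain dict stands in for os.environ to keep the example deterministic.
env = {"ERADIATE_ABSORPTION_DATABASE__ERROR_HANDLING__BOUNDS": "ignore"}
settings = apply_env_overrides(DEFAULTS, env)
print(settings["absorption_database"]["error_handling"]["bounds"])
```

With a flat layout, each of these nested options would need its own long, prefixed name, which is the confusion the hierarchical setup avoids.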

The interface, although very similar to the previous one, has changed a little and requires some adaptations:

  • The primary access point is now named eradiate.config.settings. This data structure implements a mapping interface, but also still allows attribute access. Access is also case insensitive. Therefore, access such as

    eradiate.config.progress = "spectral_loop"

    becomes one of

    eradiate.config.settings.progress = "spectral_loop"
    eradiate.config.settings.PROGRESS = "spectral_loop"
    eradiate.config.settings["progress"] = "spectral_loop"
    eradiate.config.settings["PROGRESS"] = "spectral_loop"
  • The SOURCE_DIR variable is no longer a setting, but an environment variable only. It is configured exclusively through the ERADIATE_SOURCE_DIR environment variable and accessed as eradiate.config.SOURCE_DIR in Python.

  • A new ERADIATE_ENV environment variable is now available. The intent is to allow the user to specify the type of environment they are running Eradiate in (e.g. development, production, testing, etc.) to fine-tune some behaviour dynamically. This feature is not used yet.
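
The access semantics described above (mapping interface, attribute access, case insensitivity) can be pictured with a toy container. This is only an illustrative stand-in written for this description: the class name `CaseInsensitiveSettings` is hypothetical, and the real `eradiate.config.settings` object is provided by Dynaconf, which does considerably more.

```python
class CaseInsensitiveSettings:
    """Toy container: keys are normalized to lower case on every access."""

    def __init__(self, **values):
        # Bypass our own __setattr__ to create the backing dict.
        object.__setattr__(self, "_data", {})
        for key, value in values.items():
            self[key] = value

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __setitem__(self, key, value):
        self._data[key.lower()] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self._data[name.lower()]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self._data[name.lower()] = value


settings = CaseInsensitiveSettings(progress="placeholder")
settings.PROGRESS = "spectral_loop"

# All four spellings address the same underlying entry.
assert settings.progress == "spectral_loop"
assert settings.PROGRESS == "spectral_loop"
assert settings["progress"] == "spectral_loop"
assert settings["PROGRESS"] == "spectral_loop"
```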

Addition of an abstraction to handle large molecular absorption databases

This is the main topic of this PR. We introduce a new AbsorptionDatabase hierarchy to provide efficient access to our absorption coefficient databases. This new handling component solidifies the ad-hoc data structures we had been using before. The main limitations we suffered were:

  • Data loading and access complexity was tied to the number of loaded files. For large databases (with thousands of files or more, like our CKD databases), this resulted in an enormous penalty that slowed down pre-processing to the point where it became the dominant operation in some workflows. We overcame this by caching, in the form of index tables, the information needed to navigate multi-file, heterogeneous dataset aggregates.
  • Data access checks were thorough but costly, with a complexity that grew faster than the number of dimensions in the datasets. We overcame this by making the checks simpler and more generic. If the detailed checks prove useful for debugging in the future, we will reintroduce them as a post-failure diagnostic rather than as a prior check.
  • Data access was unconditionally eager. Requesting access to a file in the database loaded it into memory in full, even if the file was large and the user needed only a small part of it. We overcame this by adding a flexible access policy to the database handling component.
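
The index table idea can be sketched as follows: spectral coverage is recorded once per file, and a query is resolved by a binary search over the table without opening any file. This is a minimal illustration under assumed conventions, not the actual `AbsorptionDatabase` code; the table layout, the file names and the `lookup_file` helper are hypothetical.

```python
import bisect

# Hypothetical index: one row per file, sorted by lower spectral bound (nm).
INDEX = [
    (250.0, 500.0, "part_0.nc"),
    (500.0, 1000.0, "part_1.nc"),
    (1000.0, 3000.0, "part_2.nc"),
]
LOWER_BOUNDS = [row[0] for row in INDEX]


def lookup_file(wavelength):
    """Return the name of the file covering `wavelength`, using the index only."""
    # Rightmost row whose lower bound is <= wavelength.
    i = bisect.bisect_right(LOWER_BOUNDS, wavelength) - 1
    if i < 0 or wavelength > INDEX[i][1]:
        raise ValueError(f"wavelength {wavelength} nm is outside the database range")
    return INDEX[i][2]


print(lookup_file(550.0))
```

Because the search is logarithmic in the number of rows, the lookup cost grows very slowly with database size, which is consistent with the claim that access performance (almost) no longer depends on it.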

The resulting components now provide great flexibility to balance convenience, performance and memory usage:

  • Upon access, file contents are cached for reuse. Therefore, if accessed repeatedly, a dataset is not reloaded. The cache size can be configured by the user.
  • The database can be configured to load file contents eagerly (default behaviour, suitable for databases with small files) or lazily (suitable for databases with large files).
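
The caching behaviour can be pictured with a small least-recently-used cache whose size is configurable. This sketch uses only the standard library and is not the code from this PR (which adds cachetools as a dependency); `FileCache` and `fake_loader` are hypothetical names, and the loader stands in for real file I/O, which could either read a file in full (eager) or return a lazy handle.

```python
from collections import OrderedDict


class FileCache:
    """Toy LRU cache keyed by file name; `loader` stands in for real file I/O."""

    def __init__(self, loader, maxsize=2):
        self._loader = loader
        self._maxsize = maxsize
        self._entries = OrderedDict()

    def get(self, filename):
        if filename in self._entries:
            self._entries.move_to_end(filename)  # mark as most recently used
            return self._entries[filename]
        data = self._loader(filename)
        self._entries[filename] = data
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict the least recently used
        return data


loads = []  # records every call that actually hits the loader


def fake_loader(filename):
    loads.append(filename)
    return f"contents of {filename}"


cache = FileCache(fake_loader, maxsize=2)
for name in ["a.nc", "a.nc", "b.nc", "c.nc", "a.nc"]:
    cache.get(name)

# The second access to "a.nc" is served from the cache; the last one is not,
# because "a.nc" was evicted when "c.nc" came in.
print(loads)
```

A larger cache trades memory for fewer reloads, which is the tradeoff the configurable cache size exposes.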

In addition, data handling is much more convenient:

  • Spectral databases can now be instantiated on their own using the AbsorptionDatabase.from_directory() and AbsorptionDatabase.from_name() constructors.
  • Upon atmosphere configuration, the spectral database can now be specified using a single keyword, e.g.: MolecularAtmosphere(absorption_data="monotropa"). The previous interface still works, but emits a deprecation warning and will be removed in a future version.

The new system does not, however, work well with our current data handling strategy: files must be on the hard drive before the database can load gracefully. This means that built-in databases will not behave well if the data is not downloaded in advance. To address this issue, the CLI entry point eradiate data fetch can now be used to download the files that make up a database. For instance:

eradiate data fetch komodo monotropa

Other minor updates

  • The default download location is now a .eradiate_downloads directory, located either at the Eradiate source root if ERADIATE_SOURCE_DIR is specified, or in the current working directory in a production setup.
  • Several fixtures were updated to leverage the new atmospheric data handling.
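
The download location rule above can be sketched as a small function. This is an illustration of the stated rule only, not the code from this PR; the helper name `default_download_dir` is hypothetical, and the environment and working directory are passed in explicitly so that the example is deterministic.

```python
from pathlib import Path


def default_download_dir(environ, cwd):
    """Pick the download directory following the rule described above."""
    source_dir = environ.get("ERADIATE_SOURCE_DIR")
    # Development setup: source root is known. Production setup: fall back to cwd.
    root = Path(source_dir) if source_dir else Path(cwd)
    return root / ".eradiate_downloads"


print(default_download_dir({"ERADIATE_SOURCE_DIR": "/opt/eradiate"}, "/tmp/work"))
print(default_download_dir({}, "/tmp/work"))
```

In real use, `os.environ` and `Path.cwd()` would be the natural arguments.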

To do / caveats:

  • Clean up old code.
  • Write tests for new database.
  • Downloads are kind of broken (need a little interface to make downloading a database easy).
  • Documentation of environment variables and configuration is broken.
  • Documentation of download_dir setting is missing.
  • All docstrings for updated classes need a full pass.
  • Add a prefix to the standard atmosphere dictionary fixture names.
  • Need dependency review (added Dynaconf and cachetools) and lock update.
  • Tutorials need a pass.

Checklist

  • The code follows the relevant coding guidelines
  • The code generates no new warnings
  • The code is appropriately documented
  • The code is tested to prove its function
  • The feature branch is rebased on the current state of the main branch
  • I updated the change log if relevant
  • I give permission that the Eradiate project may redistribute my contributions under the terms of its license

@leroyvn leroyvn changed the title Abs db Absorption database abstract, Dynaconf migration Apr 8, 2024
@leroyvn leroyvn changed the title Absorption database abstract, Dynaconf migration Absorption database abstraction, Dynaconf migration Apr 10, 2024
@leroyvn leroyvn changed the base branch from main to next April 10, 2024 10:31