Absorption database abstraction, Dynaconf migration #397

leroyvn · 2024-04-07T21:21:45Z

Description

Experimental, do not merge!

This PR brings two major changes:

Transition of configuration to Dynaconf. The environ-config framework is replaced by Dynaconf. This is motivated mainly by the need for a solution that makes it easy to specify configuration using rc files. Other bonus features: hierarchical settings and environment-dependent defaults (not yet implemented, but everything needed is in place).
Addition of an abstraction to handle large molecular absorption databases. This is long overdue and finally allows us to manage the convenience vs access performance vs memory footprint tradeoff. In particular, access performance (almost) no longer depends on database size. Other bonus features: simplified atmosphere setup specification, and more concise code for improved maintainability.

Transition of configuration to Dynaconf

The new configuration framework is based on Dynaconf. We retain the main features of the previous system based on environ-config:

Settings can still be configured using environment variables.
Conversion and validation protocols are still applied to settings.

In addition, the new system brings the following new features:

Hierarchical settings are now possible. This feature is the primary reason for this migration: to make the error handling behaviour for absorption database interpolation errors configurable, we needed hierarchical settings (using a flat setup would have been confusing and messy).
Settings can now be configured using rc files written in the TOML language. For those wondering why we settled on TOML instead of YAML (already used in some parts of our codebase): TOML has a better specification and is usually seen as better suited for configuration.
Eradiate now loads a base configuration from a default setting file that can be replaced. In practice, this allows us to offer contextual defaults, e.g. differentiating between development, testing and production environments.
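To make the layering concrete, here is a minimal, standard-library sketch of how a base configuration, an rc file and environment variables could be merged into one hierarchical structure. It is not Dynaconf's implementation, and the setting names (`absorption_database`, `error_handling`, `on_bounds`) and values are hypothetical. The double underscore used to address nested keys mirrors the convention Dynaconf uses for environment variables.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` recursively updated with `override`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def env_overrides(environ: dict, prefix: str = "ERADIATE_") -> dict:
    """Turn PREFIX_A__B=value variables into {"a": {"b": value}}."""
    result: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return result


# Hypothetical setting names and values, for illustration only.
defaults = {
    "progress": "kernel",
    "absorption_database": {"error_handling": {"on_bounds": "raise"}},
}
rc_file = {"absorption_database": {"error_handling": {"on_bounds": "warn"}}}
environ = {"ERADIATE_PROGRESS": "spectral_loop", "HOME": "/home/user"}

# Later layers take precedence: defaults < rc file < environment.
settings = deep_merge(deep_merge(defaults, rc_file), env_overrides(environ))
```

With these inputs, the nested `on_bounds` entry comes from the rc file and `progress` comes from the environment, while `defaults` itself is left untouched.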

The interface, although very similar to the previous one, has changed a little and requires some adaptations:

The primary access point is now named eradiate.config.settings. This data structure implements a mapping interface, but also still allows attribute access. Access is also case insensitive. Therefore, access such as

eradiate.config.progress = "spectral_loop"

becomes one of

eradiate.config.settings.progress = "spectral_loop"
eradiate.config.settings.PROGRESS = "spectral_loop"
eradiate.config.settings["progress"] = "spectral_loop"
eradiate.config.settings["PROGRESS"] = "spectral_loop"

The SOURCE_DIR variable is no longer a setting, but an environment variable only. It is configured exclusively through the ERADIATE_SOURCE_DIR environment variable and accessed as eradiate.config.SOURCE_DIR in Python.
A new ERADIATE_ENV environment variable is now available. The intent is to allow the user to specify the type of environment they are running Eradiate in (e.g. development, production, testing, etc.) to fine-tune some behaviour dynamically. This feature is not used yet.
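The four equivalent access forms listed above can be illustrated with a small stand-in class. This is only a sketch of the access semantics (mapping interface, attribute access, case insensitivity), not the Dynaconf settings object; the `Settings` class below is hypothetical.

```python
class Settings:
    """Toy container with case-insensitive item and attribute access."""

    def __init__(self, **values):
        # Bypass our own __setattr__ to create the backing store.
        object.__setattr__(self, "_store", {})
        for key, value in values.items():
            self[key] = value

    def __getitem__(self, key):
        return self._store[key.lower()]

    def __setitem__(self, key, value):
        self._store[key.lower()] = value

    def __getattr__(self, name):
        # Only called when regular attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._store[name.lower()]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self._store[name.lower()] = value


settings = Settings(progress="kernel")
settings.PROGRESS = "spectral_loop"    # attribute access, upper case
via_item = settings["progress"]        # item access, lower case
settings["PROGRESS"] = "none"          # item access, upper case
via_attribute = settings.progress      # attribute access, lower case
```

All four spellings read and write the same underlying entry.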

Addition of an abstraction to handle large molecular absorption databases

This is the main topic of this PR. We introduce a new AbsorptionDatabase hierarchy to provide efficient access to our absorption coefficient databases. This new handling component solidifies the ad-hoc data structures we had been using before. The main limitations we suffered were:

Data loading and access complexity was tied to the number of loaded files. For large databases (with thousands of files and more, like our CKD databases), this resulted in an enormous penalty that would slow down pre-processing to a point where it would become the dominant operation in some workflows. We overcame this by caching information useful to navigate multi-file, heterogeneous dataset aggregates in the form of index tables.
Data access checks were thorough but costly, also with a complexity higher than the number of dimensions in the datasets. We overcame this by making checks simpler and more generic. If the former checks prove useful for debugging in the future, we will reintroduce them, not as a prior check, but as a post-failure diagnostic.
Data access was unconditionally eager. Requesting access to a file in the database loaded it into memory in full, even if the file was large and the user needed only a small part of it. We overcame this by adding a flexible access policy to the database handling component.
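As an illustration of the index table idea from the first item, the sketch below maps a spectral coordinate to the file that covers it using a sorted table and a binary search, so lookup cost grows only logarithmically with the number of files. The `SpectralIndex` class, the file names and the spectral bounds are hypothetical; the actual index tables may store different information.

```python
import bisect


class SpectralIndex:
    """Sorted lookup table from spectral intervals to file names."""

    def __init__(self, entries):
        # entries: iterable of (w_min, w_max, filename), assumed non-overlapping.
        self._entries = sorted(entries)
        self._lower_bounds = [entry[0] for entry in self._entries]

    def lookup(self, w):
        """Return the name of the file whose interval contains `w`."""
        # Index of the last interval whose lower bound is <= w.
        i = bisect.bisect_right(self._lower_bounds, w) - 1
        if i < 0:
            raise KeyError(f"{w} is below the covered spectral range")
        w_min, w_max, filename = self._entries[i]
        if w > w_max:
            raise KeyError(f"{w} is not covered by any file")
        return filename


index = SpectralIndex(
    [
        (520.0, 530.0, "c.nc"),
        (500.0, 510.0, "a.nc"),
        (510.0, 520.0, "b.nc"),
    ]
)
covering_file = index.lookup(515.0)
```

The table is built once (and could be cached on disk); after that, no file needs to be opened just to find out which one holds the requested data.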

The resulting components now provide great flexibility to balance convenience, performance and memory usage:

Upon access, file contents are cached for reuse. Therefore, if accessed repeatedly, a dataset is not reloaded. The cache size can be configured by the user.
The database can be configured to load file contents eagerly (default behaviour, suitable for databases with small files) or lazily (suitable for databases with large files).
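The caching behaviour described above can be sketched with a small least-recently-used cache written with the standard library only. The `CachedFileStore` class and its `loader` and `maxsize` parameters are hypothetical names. The eager or lazy policy is represented by the loader callable, which could either read the full file or return a lazy handle.

```python
from collections import OrderedDict


class CachedFileStore:
    """Keep at most `maxsize` loaded files, evicting the least recently used."""

    def __init__(self, loader, maxsize=8):
        self._loader = loader      # callable: filename -> data (eager or lazy)
        self._maxsize = maxsize    # user-configurable cache size
        self._cache = OrderedDict()
        self.load_count = 0        # number of actual loads, for illustration

    def get(self, filename):
        if filename in self._cache:
            # Cache hit: mark as most recently used, do not reload.
            self._cache.move_to_end(filename)
            return self._cache[filename]
        data = self._loader(filename)
        self.load_count += 1
        self._cache[filename] = data
        if len(self._cache) > self._maxsize:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return data


# A stand-in loader; a real one would open a data file.
store = CachedFileStore(loader=lambda name: name.upper(), maxsize=2)
store.get("a.nc")
store.get("a.nc")                  # served from the cache
loads_after_repeat = store.load_count
store.get("b.nc")
store.get("c.nc")                  # evicts "a.nc"
store.get("a.nc")                  # has to be loaded again
```

A small `maxsize` bounds the memory footprint at the price of more reloads; a large one does the opposite.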

In addition, data handling is much more convenient:

Spectral databases can now be instantiated on their own using the AbsorptionDatabase.from_directory() and AbsorptionDatabase.from_name() constructors.
Upon atmosphere configuration, spectral database specification can now be done using a single keyword, e.g.: MolecularAtmosphere(absorption_data="monotropa"). The prior interface still works, but emits a deprecation warning and will be removed in a future version.
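For readers unfamiliar with the pattern, the following is a generic sketch of how a legacy keyword can keep working while emitting a deprecation warning. It does not reproduce the MolecularAtmosphere constructor: `make_atmosphere` and `legacy_spec` are hypothetical stand-ins, and only the `absorption_data` keyword comes from this PR.

```python
import warnings


def make_atmosphere(absorption_data=None, legacy_spec=None):
    """Toy factory accepting both the new and a (hypothetical) legacy keyword."""
    if legacy_spec is not None:
        warnings.warn(
            "'legacy_spec' is deprecated, use 'absorption_data' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if absorption_data is None:
            # Fall back to the legacy value only if the new keyword is unset.
            absorption_data = legacy_spec
    return {"absorption_data": absorption_data}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_style = make_atmosphere(absorption_data="monotropa")
    old_style = make_atmosphere(legacy_spec="monotropa")
```

Both calls produce the same configuration; only the second one records a warning.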

The new system, however, does not fit well with our data handling strategy: files have to be on the hard drive before the database can load gracefully. This means that built-in databases will not behave well if the data is not downloaded in advance. To address this issue, the CLI entry point eradiate data fetch can now be used to download the files making up a database. For instance:

eradiate data fetch komodo monotropa

Other minor updates

The default download location is now a .eradiate_downloads directory, located either at the Eradiate source root if ERADIATE_SOURCE_DIR is specified, or in the current working directory in a production setup.
Several fixtures were updated to leverage the new atmospheric data handling.
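The rule for the default download location can be written down as a short function. This is a sketch of the rule as stated above, not the code from this PR; the `default_download_dir` function and its parameters are hypothetical.

```python
import os
from pathlib import Path


def default_download_dir(environ=None, cwd=None) -> Path:
    """Resolve the default download directory.

    Uses the source root if ERADIATE_SOURCE_DIR is set (development setup),
    otherwise the current working directory (production setup).
    """
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    source_dir = environ.get("ERADIATE_SOURCE_DIR")
    root = Path(source_dir) if source_dir else cwd
    return root / ".eradiate_downloads"


dev_dir = default_download_dir({"ERADIATE_SOURCE_DIR": "/opt/eradiate"}, cwd="/tmp")
prod_dir = default_download_dir({}, cwd="/tmp/work")
```

Passing `environ` and `cwd` explicitly keeps the function easy to test; calling it without arguments uses the real process environment.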

To do / caveats:

Clean up old code.
Write tests for new database.
Downloads are kind of broken (need a little interface to make downloading a database easy).
Documentation of environment variables and configuration is broken.
Documentation of download_dir setting is missing.
All docstrings for updated classes need a full pass.
Add a prefix to the standard atmosphere dictionary fixture names.
Need dependency review (added Dynaconf and cachetools) and lock update.
Tutorials need a pass.

Checklist

The code follows the relevant coding guidelines
The code generates no new warnings
The code is appropriately documented
The code is tested to prove its function
The feature branch is rebased on the current state of the main branch
I updated the change log if relevant
I give permission that the Eradiate project may redistribute my contributions under the terms of its license

leroyvn force-pushed the abs_db branch from 887cd99 to 28997a5 Compare April 7, 2024 21:22

leroyvn changed the title ~~Abs db~~ Absorption database abstract, Dynaconf migration Apr 8, 2024

leroyvn changed the title ~~Absorption database abstract, Dynaconf migration~~ Absorption database abstraction, Dynaconf migration Apr 10, 2024

leroyvn changed the base branch from main to next April 10, 2024 10:31

leroyvn added 12 commits April 10, 2024 12:33

Add absorption database abstraction, migrate settings to Dynaconf

6218951

More sensible defaults for default databases

08a6c65

data: Add convenience aliases to download molecular absorption coefficient data

54a9457

Docs: Revive configuration documentation

7c7c626

Docs: Add AbsorptionDatabase API docs

3f06820

CLI: Fix eradiate show and display loaded setting files

c45b9e1

Do not pretty-format default configuration

65865b7

Docs: Document default setting file

c465ab1

Docs: Docstring pass

a169cb7

Tests: Rename fixtures for clarity

f6507c4

Reqs: Update locks

654d6b0

Docs: Update release notes

9e6e384

leroyvn force-pushed the abs_db branch from 1e639e3 to 9e6e384 Compare April 10, 2024 10:34

leroyvn merged commit 1f17b4b into next Apr 10, 2024

leroyvn deleted the abs_db branch April 10, 2024 12:22
