Mock data loading interface by kacpnowak · Pull Request #336 · ecmwf/WeatherGenerator

kacpnowak · 2025-06-13T09:22:44Z

No description provided.

* Implement mock IO (#336) * Adapt score class score class (#339) * Implement mock IO * Adapt score class * Removing unused file (#349) * remove database folder (#355) * Small change - CI - pinning the version of formatting (#361) * changes * changes * Update INSTALL.md * Update INSTALL.md * Fixed Exxx lint issues (#284) * Rebased to the latest changes and linted new changes * addressed review comments * addressed review comments * Linted the latest changes. * corrected the formating * corrected the formating * configured ruff to use LF line endings in pyproject.toml * [357] Sub-package for evaluation (#359) * working * changes * removing deps from non-core project * changes * fixes * comments * Iluise quick fix stac (#374) * remove database folder * fix database * Simplifying workflow for plot_training (#368) * Simplifying workflow for plot_training * Ruffed * Working on implementing exclude_source * Remove unused code * Fixed ruff issue * Fixing bug in lat handling (377) (#378) * Fixing bug in lat handling * Added comment --------- Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> * recover num_ranks from previous run to calculate epoch_base (#317) * recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style * changes (#385) Linter rule so np.ndarray is not used as type * changed the script name from evaluate to inference as it simply gener… (#376) * changed the script name from evaluate to inference as it simply generate infer samples * changed evaluate to inference in the main scripts and corresponding calls in the config * update the main function for the inference script * changed evaluate to inference also in docstring, unit test scripts, and integration test scripts --------- Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> * Introduce tuples instead for strings to avoid TypeError (#392) * Exclude channels from src / target (#363) * 
Exclude channels from src / target * Simplified code and added comment that pattern matching is used * Adding new stream config * Fixing bug that led to error when accessing self.ds when dataset is empty * Wokign on exlcude_source * work in progress * Fixing incorrect formating for logger (#388) * Ruffed * Refactored and cleaned up channel selection. Also added check that channels are not empty * Cleaned channel parsing and selection * Adjustments * Removing asserts incompatible with empty dataset --------- Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int> * add embed_dropout_rate to config v1 (#358) * [402] adds checks to the pull request (#403) * chanegs * mistake * mistake * mistake * changes * doc * Introduce masking class and incorporate in TokenizerMasking (#383) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * Ruffed * Added warning to capture corner case, likely due to incorrect user settings. 
* Fixed incorrect call twice * Fixed missing conditional for logger statement * Required changes for better handling of rngs * Improved handling of rngs * Improved handling of rng --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Implement per-channel logging (#283) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff --------- Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * [346] Passing options through the slurm script (#400) * changes * fixes * refactor `validation_io.write_validation` to make it more readable * remove legacy code `validation_io.read_validation` * encapsulate artifact path logic in config module * remove redundant attribute `Trainer.path_run` * use config to look up base_path in `write_validation` * remove unused `write_validation` args: `base_path`, `rank` * ensure correct type for pathes * remove streams initialization from `Trainer` * remove path logic from `Trainer.save_model` * simplify conditional * rename mock io module * update uv to include dask * Implement io module to support reading/writing model output * implement new validation_io routine * use new write_validation routine * remove unused code * rename output routine to `write_output` * ruffed and added comments * fixed annotation * use simple __init__ method for `OutputItem` instead of dataclasses magic * address reviewers comments * rename method * add simple docstrings * ruffed * typehint fixes * refactor names * update comments and typehints, dont import pytorch * remove `__post_init__` methods, cache properties * fixes and integration test * final fixes :) * changes * changes * changes * changes * 
changes * more work * changes * changes * changes * ruffed * ruffed * improve logging and comments * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Add doc-string to call-method and provide example usage for efficient graph-construction. * Some fixes to score-class. * Some fixes to handling aggregation dimension. * Add missing import of MockIO. * changes * changes * removing the scores * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes --------- Co-authored-by: Kacper Nowak <kacper.nowak@awi.de> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com> Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com> Co-authored-by: ankitpatnala <ankitpatnala@gmail.com> Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int> Co-authored-by: Till Hauer <till@web-hauer.de> Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de> Co-authored-by: Michael <m.langguth@fz-juelich.de>

* Revert "Implement per-channel logging (#283)" (#434) This reverts commit 989ab6e1d6e8c0f69594414c7733adf30acd1c54. * Fix FESOM datareader and int overflow (#417) * Fix indexing in DataReaderFesom * Enforce using only int64 in data loading * ruff * ruff2 * Review * Change int64 back to int32 * changes (#462) * Fix incorrect handling of empty window (which triggered problem in IO writing code). (#447) * Update default_config.yml (#446) analysis_streams_output is missing, which leads to error with val_initial=True and log_validation > 0. * Re-enabled option to run plot_training as script and fixed -rf argument (#444) * Re-enabled option to runplot_training as script and removed relative path as default from mutually-exclusive argument -rf. * Ruffed code. * Ruff check fix. * Rename flags for parsing configuration and fixed default handling for standard config YAML-file. * fix era5 config (#473) Adding z back in * [251] Merge new IO class (#469) * Implement mock IO (#336) * Adapt score class score class (#339) * Implement mock IO * Adapt score class * Removing unused file (#349) * remove database folder (#355) * Small change - CI - pinning the version of formatting (#361) * changes * changes * Update INSTALL.md * Update INSTALL.md * Fixed Exxx lint issues (#284) * Rebased to the latest changes and linted new changes * addressed review comments * addressed review comments * Linted the latest changes. 
* corrected the formating * corrected the formating * configured ruff to use LF line endings in pyproject.toml * [357] Sub-package for evaluation (#359) * working * changes * removing deps from non-core project * changes * fixes * comments * Iluise quick fix stac (#374) * remove database folder * fix database * Simplifying workflow for plot_training (#368) * Simplifying workflow for plot_training * Ruffed * Working on implementing exclude_source * Remove unused code * Fixed ruff issue * Fixing bug in lat handling (377) (#378) * Fixing bug in lat handling * Added comment --------- Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> * recover num_ranks from previous run to calculate epoch_base (#317) * recover num_ranks from previous run to calculate epoch_base * set email settings for commits * addressing Tim's comment * make ruff happy * improve style * changes (#385) Linter rule so np.ndarray is not used as type * changed the script name from evaluate to inference as it simply gener… (#376) * changed the script name from evaluate to inference as it simply generate infer samples * changed evaluate to inference in the main scripts and corresponding calls in the config * update the main function for the inference script * changed evaluate to inference also in docstring, unit test scripts, and integration test scripts --------- Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> * Introduce tuples instead for strings to avoid TypeError (#392) * Exclude channels from src / target (#363) * Exclude channels from src / target * Simplified code and added comment that pattern matching is used * Adding new stream config * Fixing bug that led to error when accessing self.ds when dataset is empty * Wokign on exlcude_source * work in progress * Fixing incorrect formating for logger (#388) * Ruffed * Refactored and cleaned up channel selection. 
Also added check that channels are not empty * Cleaned channel parsing and selection * Adjustments * Removing asserts incompatible with empty dataset --------- Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int> * add embed_dropout_rate to config v1 (#358) * [402] adds checks to the pull request (#403) * chanegs * mistake * mistake * mistake * changes * doc * Introduce masking class and incorporate in TokenizerMasking (#383) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * Ruffed * Added warning to capture corner case, likely due to incorrect user settings. 
* Fixed incorrect call twice * Fixed missing conditional for logger statement * Required changes for better handling of rngs * Improved handling of rngs * Improved handling of rng --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Implement per-channel logging (#283) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff --------- Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * [346] Passing options through the slurm script (#400) * changes * fixes * refactor `validation_io.write_validation` to make it more readable * remove legacy code `validation_io.read_validation` * encapsulate artifact path logic in config module * remove redundant attribute `Trainer.path_run` * use config to look up base_path in `write_validation` * remove unused `write_validation` args: `base_path`, `rank` * ensure correct type for pathes * remove streams initialization from `Trainer` * remove path logic from `Trainer.save_model` * simplify conditional * rename mock io module * update uv to include dask * Implement io module to support reading/writing model output * implement new validation_io routine * use new write_validation routine * remove unused code * rename output routine to `write_output` * ruffed and added comments * fixed annotation * use simple __init__ method for `OutputItem` instead of dataclasses magic * address reviewers comments * rename method * add simple docstrings * ruffed * typehint fixes * refactor names * update comments and typehints, dont import pytorch * remove `__post_init__` methods, cache properties * fixes and integration test * final fixes :) * changes * changes * changes * changes * 
changes * more work * changes * changes * changes * ruffed * ruffed * improve logging and comments * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Update to score-class according to internal discussions and feedback in PR. * Add license header. * Ruffed code. * Add doc-string to call-method and provide example usage for efficient graph-construction. * Some fixes to score-class. * Some fixes to handling aggregation dimension. * Add missing import of MockIO. * changes * changes * removing the scores * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes --------- Co-authored-by: Kacper Nowak <kacper.nowak@awi.de> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com> Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com> Co-authored-by: ankitpatnala <ankitpatnala@gmail.com> Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int> Co-authored-by: Till Hauer <till@web-hauer.de> Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de> Co-authored-by: Michael <m.langguth@fz-juelich.de> * [459] Attempt to fix ruff differences (#463) * changes * debug * changes * changes * Update pyproject.toml (#457) * Continue training through slurm script (#395) * train_continue via slurm * using __main__ as entry point for slurm script * reverting config files to match base branch * reverting config files to match base branch * removing param_sum control logging before and after loading of model weights * run ruff * check whether from_run_id is in arguments * 
trigger PR check * remove block to set reuse_run_id=True --------- Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> * added the .python_version file set to python 3.12 (#482) Co-authored-by: Kerem Can Tezcan <ktezcan0@login07.leonardo.local> * script (#489) * Remove print statements for logging (#421) (#439) * first change * removed all prints * changed model.py back * adding comments and fixes@ * added ruff fixes * reverting files for PR * ruff fixes * removing run_id.py * formatting changes * changing comments in check_gh_issue script --------- Co-authored-by: owens1 <owens1@jwlogin09.juwels> Co-authored-by: Timothy Hunter <tim.hunter@ecmwf.int> * Rename batchsize to batchsize_per_gpu (#475) * Rename batchsize to batchsize_per_gpu * Fix ruff stuff * fix (#490) * add polar orbiters and abi-goes to the stac database (#426) * testing adding metopa and metopb as placeholder drafts to stac database * added the actual json files because I think we have to * updated metopa metopb jsons and ets * add fy3 and update metops * updated names of metops * updated metopb untarred size inodes and end date * update names to instrument, satellite * add untarred data size and inodes for metopa * updated to oscar naming, with format platform, instrument, and added fengyun satellites * update size and inodes of fy3c mwhs * add fengyun jsons, missing before, and update unique ids of metopa and b * add processing_level field to metopa as a test * adding processing level field * fix up processing level * updated jsons and jsonnets for provenance * actually include provenance * updated to include processor and provider, remove provenance * add abi-goes * fix abi goes geometry * fix latitude and longitude * fix typo * hopefully this time lat is right.. 
* update catalogue json for develop * check catalogue on this branch * jsonneted for develop --------- Co-authored-by: iluise <luise.ilaria@gmail.com> * Added naming convention checks to lint (#501) * Added naming convention checks to lint * Implemented python naming conventions and corrected code accordingly --------- Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx> * Correct the in-code-names for rotation matrices (#516) * Added naming convention checks to lint * Implemented python naming conventions and corrected code accordingly * Corrected renaming of rotation matrices from R to rot instead of to r --------- Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx> * extend format string and timedelta to days (#499) * extend format string and timedelta to days * replace with pd.to_timedelta * import pandas * ruff * enforce "HH:MM:SS" format * ruff * Mlangguth/develop/issue 251 (#495) * Add score-class to evaluate-package. * Add score-class to evaluate-package. * Lintered and ruffed code. * Add fix to io.py and update dependencies in common. * Several small fixes to score-class and fast evaluation. * Add utils for evaluate. * Moved to_list to utils and improved doc-strings. * Improve several doc-strings, avoid formatting of logger and other changes from PR review. * Add xhistogram and xskillscore to dependencies of evaluate. * Ruffed code. * Lintered code. * Fix incorrect retrieval of validation batch size in validation IO. * Final minor changes to argument-names * changes (#471) * Updated to camel case. (#445) * Updated to camel case. * Fixed formatting. * Revert "Updated to camel case. (#445)" (#530) This reverts commit 4a8bd49067d86c8c9dd2930544d52cb9db8577af. * [327] Script to create the links to output directories (results, ...) 
(#528) * changes * fixes * slash * slash * checks * checks * Update config parameters lr and grad_clip (#545) * updated lr and grad_clip in config * modify lr to 1e-4 * Fixed randomization problem with masking (#510) * Fixed randomization problem with masking (needs to be verified) * Making sure the seed is ok * Fixed problem with seed init. * More improvements. But problem still seems to be there. * Clean up of rng handling. Re-initalization is passed through to masker, which was the issue. * - Fixed prime numbers - Cleaned up unnecessary rng init and added further comments. --------- Co-authored-by: clessig <christian.lessig@ecwmf.int> * Sophiex/dev/upper bound targets (#526) * recovering my stash * Fix bug * Clean up pull request * Clessig/develop/fix forecasting 448 (#449) * Removed (second) residual connection for forecasting * Added init to forecasting engine to small values * Default values for forecasting experiments * Updated settings * Setting local engine to empty * Fix z settings. * Revised defaults with larger net * Revised defaults with larger config * Restoring default config * Restoring * Restoring default --------- Co-authored-by: clessig <christian.lessig@ecwmf.int> * Restore self.size_time_embedding in tokenizer_forecast.py (#548) * Restore self.size_time_embedding in tokenizer_forecast.py Fixes #547 * Remove empty line for ruff Remove line for ruff? 
* Replace cf.rank==0 with utils.distributed.is_root (#535) Co-authored-by: wang85 <wang85@jwlogin22.juwels> * Fixed handling of empty streams in plot_train (#552) * Fixed handling of empty streams * Fixed --------- Co-authored-by: clessig <christian.lessig@ecwmf.int> * Fix train_continue (#556) * add DocStrings to model (#268) * added DocStrings for class ModelParams * added DocStrings for class Model * Docstring cleanup v1 * Docstring cleanup v2 * Docstring cleanup v3 * Docstring corrections v1 * Docstring corrections v2 * Docstring corrections v3 * ruff check v1 * ruff check v2 * ruff check v3 --------- Co-authored-by: th3002s <till.hauer@alumni.fh-aachen.de> * Revised structure in metric JSON-file (#549) * Update score-class to support groupby-operations for per-sample evaluation. * Update of fast evaluation pipeline to track metrics sample-wise and dump them into the newly structured JSON-files. * Changes according to PR review and fix for handling situations with a single sample. * Changes according to PR review and fix to filter channels for score-calculation. * Fixed handling of empty source/target channels (#558) Co-authored-by: clessig <christian.lessig@ecwmf.int> * Fix to peel_tar_channels to allow situations where no data for fstep=0 is present. (#572) * Update era5.yml: token size 8 (#583) * [DRAFT] CLI for scoring and plotting (#522) * first insterface * working version * save json * add omegaconf * address comment and clean up interface * add config * update scoring class * Fix to allow for channel-selection in get_data and efficiency improvement to plot_data. * Avoid circulra dependency issues with to_list-function. * Fix data selection issues. * Enable proper handling of lists from omegaconf. * update to mlangguth89 fork * refactor forecast step * ruffed * add printing summary * add ZarrData class * adjust size of the plots * attempt to solve sorting issue * Rename model to run in config and in code. * Fixes to Michael's review comments. 
* Ruffed code. * resync with mlangguth89 + add plot titles * revert mixed --------- Co-authored-by: Michael <m.langguth@fz-juelich.de> * 'Handle list input to forecast_steps (Closes #573)' (#581) * 'fixed bug not handling list input to forecast step #573' * linted * replace error with assert * lint * roll-back accidental lint --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * remove plot config (#597) * first insterface * working version * save json * add omegaconf * address comment and clean up interface * add config * update scoring class * Fix to allow for channel-selection in get_data and efficiency improvement to plot_data. * Avoid circulra dependency issues with to_list-function. * Fix data selection issues. * Enable proper handling of lists from omegaconf. * update to mlangguth89 fork * refactor forecast step * ruffed * add printing summary * add ZarrData class * adjust size of the plots * attempt to solve sorting issue * Rename model to run in config and in code. * Fixes to Michael's review comments. * Ruffed code. * resync with mlangguth89 + add plot titles * revert mixed * remove plot config + style addition to evaluation package * ruffed --------- Co-authored-by: Michael <m.langguth@fz-juelich.de> * integrate IFS scores from Quaver into FastEvaluation (#600) * first insterface * working version * save json * add omegaconf * address comment and clean up interface * add config * update scoring class * Fix to allow for channel-selection in get_data and efficiency improvement to plot_data. * Avoid circulra dependency issues with to_list-function. * Fix data selection issues. * Enable proper handling of lists from omegaconf. * update to mlangguth89 fork * refactor forecast step * ruffed * add printing summary * add ZarrData class * adjust size of the plots * attempt to solve sorting issue * Rename model to run in config and in code. * Fixes to Michael's review comments. * Ruffed code. 
* resync with mlangguth89 + add plot titles * revert mixed * remove plot config + style addition to evaluation package * ruffed * add option to comment out plotting * resync utils to develop --------- Co-authored-by: Michael <m.langguth@fz-juelich.de> * [569] Load eagerly the stream content in order (#585) * changes * change * changes * Remove loading of streams also from inference. --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * [DRAFT][590] Rename metrics file (#601) * Implemented backward-compatible function to read and write `{RUN-ID}_train_metrics.json` (new) or `metrics.json` (old) * Quick fix for #553 NaT from encode_times_target, move offset to before trigs (#589) * quick fix for 553 NaT from encode_times_target, move offset * change offset to 10 minutes... * ruffed * apply hotfix to deltas_sec * ruffed * fix: associate output stream names with correct index (#519) * fix: associate output stream names with correct index * ruffed * fix: iteration over output items * address comments * fix: correctly index channels * fix stream indexing logic, add asserts * fix: extraction of data/coordinates for sources * fix assert * Clessig/develop/channel logging 282 (#615) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff * Implement sending tensors of different shapes * ruff * Fix merge * Fix docstring * rerun workflow * Review * Change default columns name * Fix merge * - Added ddp_average_nan that is robust to NaN/0 entries when computing mean - Switched from all_gather to this function in trainer to robustly average - Some code cleanup * use all_to_all communication * Fixing problem with 
single-worker (non-DDP) training * Ruffed * Re-enabled validation loss output in terminal * Simplified handling of dist initialized --------- Co-authored-by: Kacper Nowak <kacper.nowak@awi.de> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> Co-authored-by: clessig <christian.lessig@ecwmf.int> * Fix bug in corner case of data reading (#621) * Changed logging level for some messages. * Fix bug in data reading and add assert to better detect these problems. * Loss class refactoring (#533) * Fix bug with seed being divided by 0 for worker ID=0 * Fix bug causing crash when secrets aren't in private config * Implement logging losses per channel * Fix issue with empty targets * Rework loss logging * ruff * Remove computing max_channels * Change variables names * ruffed * Remove redundant enumerations * Use stages for logging * Add type hints * Apply the review * ruff * fix * Fix type hints * ruff * Implement sending tensors of different shapes * ruff * Fix merge * Fix docstring * rerun workflow * creating loss class * Adapted varnames in new compute_loss function to match LossModule * comments and loss_fcts refactoring * Suggested a separation of mask creation and loss computation * first working version of LossModule; added unit test * Modifications and TODOs after meeting with Christian and Julian * Added Christian's comments and updated code partially * Julian & Matze further advances to understand shapes * New mask_t computations. 
Not yet correct, thus commented * Resolved reshaping of tensors for loss computation * small changes in _prepare_logging * J&M first refactoring version finished, 2 tests ok * First round of resolving PR comments * add ModelLoss dataclass, rearrange mask and loss computation * Integrating new LossCalculator into trainer.py and adding docstrings * J&M resolved temp.item() error * Second round of PR comments integrated * - Fixed loss accumulation - Cleaned up variable names * Renamed weight * Removed unused vars * Inspected loss normalization for logging * Minor clean-up * Removing unused code. * More refactoring: breaking code down in smaller pieces * Fix * Adding missing copyright * Adding missing copyright * Fixing incorrect indent * Fix --------- Co-authored-by: Kacper Nowak <kacper.nowak@awi.de> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: clessig <christian.lessig@ecwmf.int> * Update momentum (#633) * Update momentum * Remove final GELU in MLP * Adding assert to catch inconsistent config params (#630) * Update default_config.yml (#641) Fix incorrect stream * Backward compatibility of 'loss_avg_mean' metric name (#637) Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Iluise/develop/plotting issues (#635) * fix plotted timestamp * fix crashing when a run is plot only * ruffed * implement comments * Mlangguth/develop/issue 586 (#625) * Add options to configure the marker size, the marker type and enable marker-scaling with latitude for map-plots * Update doc-strings to follow standard format. * Ruffed code. * Changes due to review comments. * Less verbose logging and improved handling of setting to plot histograms. * Corrected error-message in plot_data. 
* [DRAFT]: Prediction head architecture clean-up (#481) * - Avoid time encoding is 0 - eps in layer norms to 10^-3 - bf16 * Make the attention dtype and norm eps configurable * Fix gitignore and add config files * Shuffle config files into sensible folders * Implement first attempt at new prediction heads * Fix some bugs * Fix trainer compile + fsdp * Fix trainer and better defaults * Choose AdaLN * Correlate predictions per cell Previously this pr treated as independent * Make things more parameter efficient * Revert "Make things more parameter efficient" It made things way worse This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757. * Improve the prediction heads at small sizes * Improve the stability of training Two main changes: better beta 1 and beta 2 values in adam w and remove gelu * Adding some more regularisation In particular to prevent training divergences and overfitting * Forgot the dropout in MLPs * Tune the learning rate * Add the original prediction heads CAREFUL: Untested!!! 
* Fix bugs and ruff * Restore old version last part * Start fixing the defaults * Deleting hpc specific configs * Deleting hpc specific configs * Defaults and documentation * Apply ruff * Clean up code * Add one more comment --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Fix bug in logging buffer reset (#651) Co-authored-by: clessig <christian.lessig@ecwmf.int> * use config dropout_rate in EmbeddingEngine (#646) * Make numpy argsort version resilient (#645) * Fix backward compatibility (#655) * Implement global and per-cell channel masking (#496) * creating masking class and adapting tokenizer_masking to use this class * minor changes to masking.py and tokenizer_masking * removed old tokenizer_masking * include masking_strategy in default_config * change ValueError to assert * linting formatting changes files * further linting of docstrings * create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements * linted masking, tokenizer_masking * modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class * remove check if all masked, not masked * remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source * update tokenizer utils with description of idx_ord_lens in comment * remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked) * working implementation of healpix level masking in Masker, with too many prints and hardcoded hl_mask and hl_data * adding masking_strategy: to config * remove unused mentions of masking_combination * removed comment about streams * changed assert to check self perm_sel is not None * ruff masking, tokenizer_masking * implementation of healpix masking code with lots of printing * removed print statements from masking.py * minor line change * remove default for strategy_kwargs * add strategy_kwargs to 
config, and pass through masker to pass masking strategy specific args * vectorise child indices calcs, implement masking_rate_sampling, minorly updated docs * remove print statements * cf.strategy_kwargs passed to Masker in multi_stream_data_sampler * masking_strategy random and strategy kwargs passed to config * ruffed * pass cf.get(strategy_kwargs or {}) to the Masker and update masking to reflect this * update config so it does not include strategy_kwargs, no longer needed * move asserts for healpix to constructor, rename to masking_strategy_config, update config with example of healpix * test working version, understanding what is happening * revert breaking develop merge and conflict in config * default config put channel masking * reverting the accidental revert... * small change to config * implemented global and per-cell per channel masking in masking, change to config * remove print statements from multistream * updated config for compatibility to run immediately * cleaned code, assert to fail for different number of source and target streams * updated default config to match latest * fixed _generate_channel_mask to handle empty cells of data * fixed docstring of masker * ruffed linted * rename l in token_lens * lint ruff, remove prints * add assert for source and target channels must be the same * fix config to develop, new assert, remove assert * revert assert statement for readability * clip the values in masking_rate_sampling to 0.01 and 0.99 * revert cell name to tl * remove empty lines from model * remove empty line from embeddings * remove empty line tokenizer_masking * ruff masking, tokenizer_masking * update config again to develop version * update config comment for masking strategies * update channel masking to handle non-data channels for new loss * ruffed * Implemented check that for channel masking source and target channel have to be identical * Minor code improvements * Fixed incorrect return type for special case * Ruffed + and reduced 
magic constants * Minor fixes to _generate_healpix_mask * Cleaned up and optimized mask generation for channel masking * changed to use mode global or per_cell, improved docstring for masking strategies * added documented valid examples for masking_strategy_config to default_config * ruffed * update example masking_strategy_config in default * Minor adjustments to default settings * remove mention of hl_data in masking_strat_config --------- Co-authored-by: clessig <christian.lessig@ecwmf.int> * Removed that checkpoint is saved at the first batch (#663) * Clessig/develop/fix data reading anemoi missing date 671 (#672) * Changed logging level for some messages. * Fixed unhandled exception with missing dates. * Fixed debug message * Make compare_run_config.py usable again (#661) * Update compare_run_config.py to use existing functions from current repo. * Ruffed code. * [595] Changes for running a notebook script (#598) * Changes * Changes * work * change * changes * changes * changes * changes * changes * changes * changes * changes * reverse old changes * linter * Implement regional evaluation (#652) * Add RegionBoundingBox data class to score-utils to handle evaluation for different regions. * Implement region-specific evaluation in plot_inference.py. * Adapted utils. * Introduction of clean RegionLibrary in score_utils.py. * Ruffed code. * Updates following reviewer comments. * Ruffed code. * Clessig/develop/fix loss 678 (#679) * Changed logging level for some messages. * Fixing bug with incorrect counting * using config results path instead of fixed path (#631) * using config results path instead of fixed path * ruff * Add forgotten LayerNorm (#687) * Add forgotten LayerNorm * Apply ruff --------- Co-authored-by: Sophie Xhonneux <sxhonneux@clariden-ln001.cscs.ch> * Fix performance degradation in loss computation (#690) * Changed logging level for some messages. * Refactored loss computation to improve performance. 
* Working around ruff issue * - Refactored code to improve structure and readability - Fixed problem with incomplete normalization over loss functions - Solved problem with mse_weighted as loss function when mse is specified * Fixed problems with multi-worker training * Fixed indentation bug and bug in assert * [DRAFT] Rename plot_inference.py and entrypoint for evaluation (#683) * Rename plot_inference.py. * Rename of main-method and move parsing of arguments for entrypoint. * Introduce entrypoints to fast evaluation. * Fix to call of main in run_evaluation.py. * Rename entrypoint and add dependency to weathergen-evaluate. * Add missing comma in pyproject.toml. * Option for non-linear output layer in prediction head (#673) * Add score-class to evaluate-package. * Add score-class to evaluate-package. * Lintered and ruffed code. * Add fix to io.py and update dependencies in common. * Several small fixes to score-class and fast evaluation. * Add utils for evaluate. * Moved to_list to utils and improved doc-strings. * Improve several doc-strings, avoid formatting of logger and other changes from PR review. * Add xhistogram and xskillscore to dependencies of evaluate. * Ruffed code. * Lintered code. * Add helper function to get custom last activation. * Add option to control stream-specific non-linear output layer. * Controlling print-statement to model.py. * Corrected handling of config for prediction head. * Add support for stream-specific, optional non-linear output activation function. * Provision of ActivationFactory. * Ruffed. * Changes following review comments. * Fix in parsing final_activation-argument. * Clessig/develop/fix empty 647 (#675) * Changed logging level for some messages. * Removed checks that requires non-empty channels * Adding warning * Fixed convergence of training (#696) * Restored old prediction head functionality. Other adjustments/reverts, in particular in attention. 
* Ruff'ed * Addressed reviewer comments and cleaned up minor details * Fixed bug in obs data reading (#698) * Restored old prediction head functionality. Other adjustments/reverts, in particular in attention. * Ruff'ed * Fixed bug in obs data reading so that data violated window * Fix * Update data_reader_obs.py * Restoring to develop * Fix * Ruffed * Clessig/develop/fix logging verbosity 564 (#619) * Changed logging level for some messages. * Added support for more fine grained output control. * Changed logging setting for inference. * Minor improvement to doc string * include run_id in debug log file * ruff --------- Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> * Refactor path-setting for 'model' and 'results' to be dynamic (no relative paths) (Closes #591) (#677) * temp commit wip * change model_path and run_path setting to dynamic (independent of HPC) (untested) * removed unnecessary set_paths references * linted * remove commented code * removed commented lines * Enable plot_train with dynamic paths * lint --------- Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int> * Fix (#715) * modified evaluation api, callable as python function (#713) * Fixed bug for degenerate streams (#723) NaN-robust min/max computation. 
* Fixed (#725) Resolves config loading error when passing a `model_dir` * Fix on loading model config (#726) * Small fix on loading model config * minor change * Detect if channels for plotting differ from JSON and recompute if necessary (Closes #701) (#718) * new branch * detecting changes in channel spec * style changes * style changes * Delete config/plot_config.yml * incorporated PR feedback * added run_evaluation (again) * Clessig/develop/fix logging 719 (#720) * Cleaned up to use proper logger * Cleaned up to use proper logger * Fix logging: needs to be registered per output stream and not per logging level * Set logging level consistently with debug to file * Fixes * Added FSDP-sharding after loading model for train continue (#729) * Added FSDP-sharding after loading model for train continue * Improved consistency * Fixed resetting FSDP after checkpoint saving * Update handling of `run_path` and `model_path` (Closes #716) (#732) * proposed solution, untested * assert instead of error * lint * incorporating PR feedback * lint * added explicit argument passing * lint * Make cartopy map resources a shared asset to prevent downloading from… (#731) * Make cartopy map resources a shared asset to prevent downloading from the internet which is not always possible * Replaced print by logger statement --------- Co-authored-by: xhonneux2 <xhonneux2@jwlogin22.juwels> Co-authored-by: karlbauer1 <karlbauer1@jwlogin21.juwels> * Clessig/develop/fixes hackathon (#736) * Fixed some comments that generated warnings * Added to create path for log files if it doesn't exist --------- Co-authored-by: Christian Lessig <christian.lessig@ovgu.de> * Revised path defaults and output directory structure for fast evaluation (#681) * First changes to path-handling. * Consistent path for maps and histograms. * Update of evaluation scripts for proper path defaults and directory structures. * Make root-path to repo available via common-package. 
* Introduce proper defaults to plot_inference.py and set-up desired directory structure for evaluation output. * Rename of results_dir-parameter to results_base_dir * Ruffed code. * Allow for run-specific results-paths and use config to get defaults. * Several fixes and consistency improvements. * Remove manual default usage in plotter.py * Ruffed code. * Update __init__.py Remove _REPO_ROOT. --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Mk/develop/fix plot train 727 (#738) * Load model_path from private config if not provided * Use existing function to get private model path * Incorporated PR comments * Fix problems with rel paths in logging files (#742) * Fixed relative path handling for logging files. * Adding default argument to _load_private_conf() * Implement first function for latitude weighting (#705) * Changed logging level for some messages. * Refactored loss computation to improve performance. * Working around ruff issue * - Refactored code to improve structure and readability - Fixed problem with incomplete normalization over loss functions - Solved problem with mse_weighted as loss function when mse is specified * Fixed problems with multi-worker training * add location weights, first commit * assertion on mask and len(location_weights) * restructuring of location weights and fixes in mse_channel_location_weighted function * fix coords_raw dependency on offset and fstep * ruff * addressing review commits and fixing bug * rm location_weight from default stream config --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> * Fix failure for notebooks. 
(#750) * add proper error message for source_include not equal to target_include (#767) * Implemented fractional target selection (#751) * implemented fractional target selection * ruffed * fix up configs and <= to accept target_fraction 0.0 * revert to simple implementation of per stream sampling_rate_target * restore configs * Corrected formula for L2-error in score-class. (#721) * Corrected formula for L2-error in score-class. * Introduced option to get the original or the squared L2-norm. * Added doc-string for L2-norm. * Fix sampling rate (#773) Co-authored-by: clessig <christian.lessig@ecwmf.int> * Update default_config.yml (#776) * Adding the animations feature, fixing stable colorbars (not per stream) (#692) * Adding the animations feature * keep only the animations and max-min functions. * Sophiex/dev/name modules (#754) * Add names to modules as prep for freezing * Add functionality to freeze modules based on added names * Ruff * Clean up * Wrong import path * Ruff * Fix animations bug with paths (#781) * Fix another bug in animations (#783) * Work around to allow for model freezing (#785) * Work around to allow for model freezing. 
* Ruff * fix to avoid whole model element of named_modules and hence freeze whole model --------- Co-authored-by: clessig <christian.lessig@ecwmf.int> Co-authored-by: Sebastian Hickman <seb.hickman@ecmwf.int> Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> * Fast evaluation for integration tests (#770) * rename module level constant * split inference into own method * use proper fast evaluation pipeline for `evaluate_results` * ruffed * remove assert => different bug * adjust tests for new plot template * Update checking the value of plot_histograms and plot_animations (#788) * pass StreamData instances to io.py (#779) * Rename anemoi directories and built backward compatibility (Closes #709) (#771) * renamed anemoi dirs and built backward compatibility * ruff * removed stream directories and updated logging * renamed all streams * ruff * seviri file name change * cerra_seviri folder update * cerra path update --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Fix IO when targets/preds are empty. (#760) * Modify DataReaderObs to get base_yyyy... from stream config (#794) * modify DataReaderObs to get base_yyyy... from stream config, and set it in the ctor, with default of 19700101. Use it in _setup_sample_index. Remove loading obs_id attr. Add igra.yml with example usage. 
* add license to igra config * update to ISO base_datetime, parse to read idx from zarr * fix integration tests (#796) * Fixed bug for empty source (#800) Co-authored-by: clessig <christian.lessig@ecwmf.int> * Train continue function with arguments (#803) * add train_continue_from_args to call with arguments --------- Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> * remove module common/mock_io (#809) * Update data_reader_obs.py removing asserts (#817) * Sgrasse/develop/issue 616 (#648) * encapsulate extraction of source data * bundle offsetting of key attributes * consolidate calculation of datapoints indices into method * encapsulate extraction of coordinate axis in function. * replace attribute `channels` by `target_channels` and `source_channels` * ruffed * ruffed * fixes * address michas comments * reactivate assert * fix typo / renaming * small fix * uncomment source_n_empty and target_n_empty unused variables * fix unit tests (#814) * Plot substeps (#789) * Create subplots with grouping by valid_time. * Create histograms at substeps with grouping by valid_time. * Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required. * Add helper function to get values of keys from stream configs. * Corrected loading of model config. * Ruffed code and turning message-level on cartopy path to debug. * Revisions following reviewer comments. * fix histograms * ruffed --------- Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local> Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * Add the possibility of common ranges in plots per variable and stream (#801) * Create subplots with grouping by valid_time. * Create histograms at substeps with grouping by valid_time. * Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required. 
* Add helper function to get values of keys from stream configs. * Corrected loading of model config. * Ruffed code and turning message-level on cartopy path to debug. * Add the possibility of common ranges in plots per variable and stream * Revisions following reviewer comments. * fix histograms * ruffed * update utils --------- Co-authored-by: Michael <m.langguth@fz-juelich.de> Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local> Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * Fix to io problems. (#820) * Enable histograms for data with some NaNs (#823) * Fix to filter NaNs before histogram creation. * Removed unused code lines and correct for bug in marker scaling in plotter.py. * Clessig/develop/fix empty io 819 2 (#822) * Fix to io problems. * Fix issues in input * Iluise/fix empty io 819 plotting (#826) * Fix to io problems. * Fix issues in input * fix plotting * ruffed --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * fix plotting for partially filled first forecast steps (#828) Co-authored-by: luise1 <luise1@jrc0288.jureca> * Fix calculation of scores per fstep (#853) * fix calculation of scores per fstep * simplified syntax --------- Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * Fix Issue 835 (#841) Enable freezing the target coord embedding when it is just a simple layer * Improve r3tos2 (#744) * vectorized r3tos2 * revise comment --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Sophiex/dev/latent noise (#594) * - Avoid time encoding is 0 - eps in layer norms to 10^-3 - bf16 * Make the attention dtype and norm eps configurable * Fix gitignore and add config files * Shuffle config files into sensible folders * Implement first attempt at new prediction heads * Fix some bugs * Fix trainer compile + fsdp * Fix trainer and better defaults * Choose AdaLN * Correlate 
predictions per cell Previously this pr treated as independent * Make things more parameter efficient * Revert "Make things more parameter efficient" It made things way worse This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757. * Improve the prediction heads at small sizes * Improve the stability of training Two main changes: better beta 1 and beta 2 values in adam w and remove gelu * Adding some more regularisation In particular to prevent training divergences and overfitting * Create classes for latent noise * Add the latent noise after the local engine * Add the KL loss * Formatting * Clean up * Use the same for loop as before * Prepare branch for merge * Remove superfluous configs * Restore default configs * Mistake in the merge fixed * Final beauty changes * Final clean up * Ruff --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * [Hotfix] Fix crash when using list of forecasting steps (#824) * Fix crash when using list of forecasting steps * Ruff * Grammar fix * Fix grammar * Add checking forecast steps list * Review * Allow 0 as forecast step * Add list length check * Assert non-negative forecast step integer, added assertion messages * Ruff * ruff * Move check to config * what the ruff --------- Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int> * add tokenizer base class (#815) * add tokenizer base class * ruffed * ruffed v2 * move calculation of centroids to base_class * move size_time_embedding initialization * remove ABC from tokenizer base_class * renaming * ruffed * ruffed v2 * add return value to compute_source_centroids --------- Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> * vectorize s2tor3 (#745) * vectorize s2tor3 * ruff code --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * Remove cleaning stream name when logging loss (#763) * Combine masking strategies during training, with appropriate masking_… (#756) * combine masking strategies during training, with 
appropriate masking_strategy_config * restore config samples per validation * restore configs, and add to masking_strategy_config * clarify pass to per batch per stream * updated combination masking to support same masking strategy for all streams in the batch. Strategy resampled for every batch. * rename so we have masking_strategy and masking_strategy_per_batch * ruffed * clean, default to different strategy per batch for combination * ruff * remove unused variable * updated docstrings (#875) Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> * Enable correct reading of channels, forecast_step, sample variables in plot config file (Closes #717) (#755) * adjusted run_evaluation and utils code to take into account forecast_step variable from config (cherry picked from commit 26c26a923cabc5777bc75ef911f0fc3c61397e1a) * print statement change * catching error when fstep not present in zarr file * upgrades based on PR feedback * intermediate commit * intermediate commit * new functions _get_channels_fsteps_samples and check_metric * edited plotting * inter commit * fixed bug in get_data * self review * refactor * dummy commit * inter commit * feedback applied * incorporate review feedback * removed sorting of fsteps_final * remove comments --------- Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * Implement causal masking as MTM strategy (#798) * first rough implementation of causal masking * incorporated combine masking strategies * include per stream sampling rate target in tokenizer_masking based on other PR * clean up implementation of causal masking * remove TODO * remove old causal masking function * add latest error message for channel * change if to elif for causal masking * if to elif in mask_target * cleaned up causal masking code * tokenizer_masking small change * updated config * fix up config * restore era5 config * ruffed * update config and masking.py with causal masking specific masking rate, and some comments * ruffed * roll back 
causal_masking_rate changes, return to just use masking_rate * faster version of causal masking, vectorise where possible. Need list comprehension for variable length tokens * ruffed * add log scale and refactor plot_summary (#865) * add log scale and refactor plot_summary * add plot_utils * add grid * ruffed * fix marker size * fix global plotting options * add types * ruffed * Fixed stream name factoring (#534) * Updated to camel case. * Fixed formatting. * to reflect upstream develop * got rid of regex and changed formatting of str names * pulled recent changes from upstream develop * Removed refactoring of lf_name. * clean_name with the new changes * Fetched latest changes to the branch * Fixed linting * Fixed stream name without touching the losses dict * fixed type annotation * add srun to integration-test in actions script (#886) * add srun to integration-test in actions script * add --offline flag to integration-test in actions.sh * Merge compare_run_configs.py with markdown table version (#699) * initial comments to outline implementation * Refactor config comparison script to support YAML input and enhance output formatting * remove unused code * Add example configuration for model run IDs and display patterns * shorten * Add 'tabulate' dependency to enhance table formatting capabilities * add instructions to config * restore option for command line run ids and model dirs * ruff * fix arg parsing * add option to show specific or all parameters in config comparison * ruff * Remove 'tabulate' from dependencies Removed 'tabulate' dependency from project requirements. 
* logging, imports and dependency in compare_run_configs.py * fix logging and dependencies * ruff * set default * fix arg order and checks * improve model directory handling and add exception when there is not latest model * refactor error handling in main function to omit exception details in logs * ruff * ruff * add weathergen dependency * make file executable * add private home argument * revert to default model path argument assuming symlink to shared folder is set * ruff * Implement splitting zarr and regex filenames (#524) * Implement splitting zarr and regex filenames * Optimize dask reading operations * Ruff * Review * Ruff * Remove stream name cleaning when logging loss * Add tolerance to setting std to 1.0 * Implement input column reordering and channel exclusion * Update stream config * Ruff * Add config file * Implement variable state persistence * ruffffff * ruffa * ruf ruf * Add select channels method * updating .gitignore file to include all development directories without / (#900) * [804] vectorize tokenize_window_space (with test) (#893) * vectorize tokenize_window_space * import pad_sequence from torch * change vari names, remove device, add comments, ruff code * changes * changes * changes * simplify * small change * unit tests * unit tests * unit tests --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * [812] efficient tcs computation (with tests) (#894) * efficient tcs computation * revise vectorize tcs_optimized, ruff code * add typing * changes * changes * merge --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * [811] Improve perf locs to cell coords ctrs (#895) * optimize locs_to_cell_coords_ctrs * revise get_target_coords_local_ffast for new optimized locs_to_cell_coords_ctrs * changes * changes * tests --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Sophie X <24638638+sophie-xhonneux@users.noreply.github.com> * [810] optimize locs_to_ctr_coords (with tests) (#896) * 
optimize locs_to_ctr_coords * changes * changes * changes * changes * merge * changes --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * Migrated Config to common (#607) * Updated to camel case. * Fixed formatting. * to reflect upstream develop * got rid of regex and changed formatting of str names * pulled recent changes from upstream develop * migrated config to common * fixed lint issues * Corrected all the changes * syntax err fixed * Fixed import * Latest upstream changes pulled * Fixed Linting errors * Lint fix * Pulled latest * fixed other occurences * Fix compare_config after config went to common (#903) * fix after config went to common * Change argument type for --show option from int to str in main function * Update default config path in main function to compare_config_list.yml * Iluise/develop/add io reader (#891) * first implementation of reader class for evaluation package * add io reader * move check_availability to reader * update to develop * fix retrive results * address comments * Fix minor bug in modules (#909) --------- Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> * [908] Harmonize the linter check between the CI and our CLI (#910) * changes * changes * [554] Updates the PR template (#912) * changes * changes * changes * comments * [906] Bug fix in tokenizer (#907) * changes * changes * changes * changes * changes * changes * changes * changes * cleanups * changes * comments --------- Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com> * Add levels (#916) * changes to include discrete levels in colormap if needed * Change slightly the position of the feature * Lint * changes (#920) * [926][evaluation] weatherGen reader for evaluation package (#927) * weatherGen reader for evaluation package * ruffed * [939] Fix CI (#940) * changes * changes * Implement forecast activity metrics (#892) * Add forecast activity calculations and update fstep handling in utils * Add forecast rate 
of change metrics (froct, troct) to score calculations * update description * add next data to verified data * move cases for kwargs to score * refactor froct adn troct to use calc_change_rate * remove metric specific kwargs in calc_scores_per_stream * calc_change_rate now gives NaN array when next step is None * fix nans --------- Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com> Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> * added IFS-FESOM streams and updated all stac files using jsonnet (#934) * added IFS-FESOM streams and updated all stac files using jsonnet * changes according to comments by Ilaria * resolved by using providers from common.jsonnet file * changed refererrence to ecmwf and develop branch --------- Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de> * [939] Catches failures of labeling CI job (#950) * changes * more permissions * more permissions * [880] Informative type checks in the CI (#915) * attempt * fixing pyrefly * changes * changes * changes * Sets dropout rate to 0 in eval mode for flash_attn (#923) * added check for train/eval for setting dropout_p value * ruff * rm ceil, conj, floor, and matmul from annotations.json (#951) Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * Add the calculation of 10ff for ERA5 and CERRA (#914) * Add the calculation of 10ff * Caring for cases were 10ff cannot be calculated * Create a new script for derived_channels, minor changes to reader_io * Remove stream specific settings add regex * Add more datetime formats (#962) * fix error when global_plotting_opt does not exist (#964) * fix error when global_plotting_opt does not exist * fix linter * changes (#1009) * Revert "changes (#1009)" (#1012) This reverts commit 2af1c09a11e6dd027d247b670737bbac0cd1a766. 
* sorcha/dev/500 (#1001) * lint reformatting + fixing get_channels * debug messages * [datasets] move to new cerra and new era5 (#995) * move to new cerra and new era5 * fix cerra * Removed the method freeze_weights_forecast and all forecast_freeze_model flag occurences (#924) * Add Coordinate System Conversion to DataReaderFesom (#1024) * Add coordinates conversion * Ruff * Add check for longitude * Sophiex/dev/fsdp2 fix (#959) * Save current state * Save current state * Barebone FSDP2 prototype TODO save checkpoints * First version of saving model * Fix save_model * Log everything and log to files * Remove redundant path creation * Allow for both slurm and torchrun + fewer log files * Cleaning up init_ddp * Ruff * Attempt to avoid duplicate logging * FSDP2 with mixed precision policy * Ruff * Clean up and logging * Try to get loggers to behave as we want * Makes ruff unhappy but works * Fixed ruff issue * Fixed problems with multi-node training. * Fix for interactive/non-DDP runs * No idea why, but this seems to work so far Committing simply so it is saved, obviously needs cleanup * Still works! So which is it memory or the grad scaler? * Also still works, I now strongly suspect the amp.gradscaler * This still works, I have no clue anymore why but whatever it works now.... * Enable loading model from absolute paths * Enable loading for 1 GPU only * Fix 1 GPU train continue * Appease ruff * Fix saving the model more regularly and perf logging * Fixed problem when training with 2 nodes. * Fix data loader seed * Appease ruff * Shouldn't overwrite with_fsdp like this * Potential fix for FSDP2 issue with different ranks using different model parts * Fix loss scaling and logging of dummy data loss * Clean up * Appease ruff * Fixed problem when source channels are empty (i.e. with diagnostic trainings). 
* Update io.py * FSDP2 suggestions from Tim (#1015) * comments * sophie's comments * removed logger suggestions * Clean up deadcode etc * Removing unused imports that the linter didn't like --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * Ensure sample coordinate is repeated along ipoint for single sample cases in WeatherGenReader (#1026) Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> * Fix change rate calculation by aligning s1 with s0 (#1007) * Fix change rate calculation by aligning s1 with s0 * Refactor score calculation to remove unnecessary alignment and add sorting function for coordinates * use .values option * Optimize pos enc harmonic (#1033) * add device & dtype * ruffed --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * [evaluation] fix score computation with empty cerra samples (#1039) * fix samples * riffed * answer comments * ruffed * Add score cards plotting feature (#1041) * add the feature of score_cards * Refactor, fix error when sample are different for each run, linting * fix bug, fix sizes when skill difference is huge * linting * changes on comments * linting * Clessig/develop/fix inference 1049 (#1053) * add device & dtype * ruffed * Fix inference --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * Channel weighting in loss computation (#753) * introducing channel weights * tested channel weighting * adding target_channel_weights to data_reader_base * uncomment target channel parsing in anemoi dataset * remove channel weights from default stream config * Adds default config for run_evaluation (#1028) * adding default config + changing yml path locations * linter checks * linter checks * revert .yml file * updates --------- Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> * Fix CERRA eval breaking with coord sorting in `froct` (#1057) * Move coord sorting inside score function to be 
metric-specific * Linting: Removed unused import * Return nan data array to prevent crash * fix nans shape in calc_change_rate --------- Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * [1059][eval] Fix eval crash of inference models from other HPCs (#1060) * Read model path from private repo instead of inference config * Linting: Organized imports * Interface improvements * [1022] Getting WG to work on santis (#1023) * working pytorch * changes * Fix for code to work on Alps-Santis * changes * cleanups * changes * reverting change * having issues with the latest branch on santis * changes * changes * changes * override with cpu * working for cpu * flash-attn moved to gpu * remove contstraint * simplifying * trying * working on atos * changes * macos * chanegs * cleanups * actions * actions * actions * actions * changes --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * fix crash in case of missing streams (#1058) * fix issue with empty region * fix non existing stream * fix channel order in evaluation (#1066) * New templates for issues (#1017) * changes * changes * Revert "New templates for issues (#1017)" (#1071) This reverts commit 3a6e7b826b7b29a6df4af27b6771567474302fb3. 
* [1002] Template for issues try 2 (#1072) * changes * changes * issue with template * updates * changes * issue * issue * [1073][model] Adds latent noise imputation (#1074) * Add latent noise imputation to model.py with backwards compatibility * Linted * Resetting default_config, except for new flag * Resetting default_config 2nd try * add modules to annotations.json (#1035) Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> * Jk/develop/gamma decay (#998) * Update to develop, prepare for new experiment series * gamma decay over fsteps first commit * add gamma decay factor to config * working gamma decay weighting * rm breakpoint * rm eval and plot configs * reverting default config --------- Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int> Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int> * Add materialisation of new modules before loading checkpoint (#1030) * Add materialisation of new modules before loading checkpoint * Initialize new modules in load_model * Fix adding new embedding networks * Clessig/develop/fix kcrps 1077 (#1078) * Improved robustness for loss fcts where ch loss does not make sense * Re-enabled kernel CRPS and added weighting options * Fixes * Improved tensor reordering * Sgrasse/develop/issue 898 checkpoint freq conf (#905) * add new/changed parameters in default_config * implement backward compatibility * remove `train_log.log_interval` from default config * use new configuration arguments in Trainer * fix: wrong variable name * ruffed * Rework method structure * fix bug * rename `log_intevals` to `train_log_freq` * fix integration tests * fix forgot renaming * fix rebasing artifact * Sorcha/dev/571 (#957) * debug for netcdf pipeline * zarr_netcdf first draft * fixing pipeline * linter checks * removing debug prints from io.py * refactoring, found issue with forecast_ref_time * deleting unnecessary lines * proper docstrings * moving filepaths * linting * multithread processing added * debug info * debugging * 
refactoring * linting * fstep as argument * change assert --------- Co-authored-by: owens1 <owens1@jrlogin09.jureca> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: ilaria luise <luise.ilaria@gmail.com> * pyproject.toml checks (#1042) * adding chceks for toml files into actions * lint fixes * lint checks * link fixes * changes * disabling ruff check * change to path info instead * adding E501 and E721 to be ignored for now --------- Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * Implement EMA of the model (#1005) * Save current state * Save current state * Barebone FSDP2 prototype TODO save checkpoints * First version of saving model * Fix save_model * Log everything and log to files * Remove redundant path creation * Allow for both slurm and torchrun + fewer log files * Cleaning up init_ddp * Ruff * Attempt to avoid duplicate logging * FSDP2 with mixed precision policy * Ruff * Clean up and logging * Try to get loggers to behave as we want * Makes ruff unhappy but works * Fixed ruff issue * Fixed problems with multi-node training. * Fix for interactive/non-DDP runs * No idea why, but this seems to work so far Committing simply so it is saved, obviously needs cleanup * Still works! So which is it memory or the grad scaler? * Also still works, I now strongly suspect the amp.gradscaler * This still works, I have no clue anymore why but whatever it works now.... * Enable loading model from absolute paths * Enable loading for 1 GPU only * Fix 1 GPU train continue * Appease ruff * Fix saving the model more regularly and perf logging * Fixed problem when training with 2 nodes. * Fix data loader seed * Appease ruff * Shouldn't overwrite with_fsdp like this * Potential fix for FSDP2 issue with different ranks using different model parts * Fix loss scaling and logging of dummy data loss * Clean up * Appease ruff * Start implementing EMA, works for 1 GPU * Make EMA model multi-gpu compatible …

Implement mock IO

4205a0a

github-project-automation bot added this to WeatherGen-dev Jun 13, 2025

grassesi merged commit 98acd2c into ecmwf:grassesi/dev/hackathon_evaluation Jun 13, 2025
1 check passed

github-project-automation bot moved this to Done in WeatherGen-dev Jun 13, 2025

grassesi pushed a commit that referenced this pull request Jun 16, 2025

Implement mock IO (#336)

34ef488

grassesi pushed a commit that referenced this pull request Jun 16, 2025

Implement mock IO (#336)

4767a18

grassesi merged 1 commit into ecmwf:grassesi/dev/hackathon_evaluation from kacpnowak:kacpnowak/develop/score_class

kacpnowak commented Jun 13, 2025
