Mock data loading interface #336

Merged
grassesi merged 1 commit into ecmwf:grassesi/dev/hackathon_evaluation from kacpnowak:kacpnowak/develop/score_class
Jun 13, 2025

Conversation

@kacpnowak
Contributor

No description provided.

@grassesi grassesi merged commit 98acd2c into ecmwf:grassesi/dev/hackathon_evaluation Jun 13, 2025
1 check passed
grassesi pushed a commit that referenced this pull request Jun 16, 2025
grassesi pushed a commit that referenced this pull request Jun 16, 2025
tjhunter added a commit that referenced this pull request Jul 8, 2025
* Implement mock IO (#336)

* Adapt score class (#339)

* Implement mock IO

* Adapt score class

* Removing unused file (#349)

* remove database folder (#355)

* Small change - CI - pinning the version of formatting (#361)

* changes

* changes

* Update INSTALL.md

* Update INSTALL.md

* Fixed Exxx lint issues (#284)

* Rebased to the latest changes and linted new changes

* addressed review comments

* addressed review comments

* Linted the latest changes.

* corrected the formatting

* corrected the formatting

* configured ruff to use LF line endings in pyproject.toml

* [357] Sub-package for evaluation (#359)

* working

* changes

* removing deps from non-core project

* changes

* fixes

* comments

* Iluise quick fix stac (#374)

* remove database folder

* fix database

* Simplifying workflow for plot_training (#368)

* Simplifying workflow for plot_training

* Ruffed

* Working on implementing exclude_source

* Remove unused code

* Fixed ruff issue

* Fixing bug in lat handling (377) (#378)

* Fixing bug in lat handling

* Added comment

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* recover num_ranks from previous run to calculate epoch_base (#317)

* recover num_ranks from previous run to calculate epoch_base

* set email settings for commits

* addressing Tim's comment

* make ruff happy

* improve style

* changes (#385)

Linter rule so np.ndarray is not used as type

* changed the script name from evaluate to inference as it simply gener… (#376)

* changed the script name from evaluate to inference as it simply generates inference samples

* changed evaluate to inference in the main scripts and corresponding calls in the config

* update the main function for the inference script

* changed evaluate to inference also in docstring, unit test scripts, and integration test scripts

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Introduce tuples instead of strings to avoid TypeError (#392)

* Exclude channels from src / target (#363)

* Exclude channels from src / target

* Simplified code and added comment that pattern matching is used

* Adding new stream config

* Fixing bug that led to error when accessing self.ds when dataset is empty

* Working on exclude_source

* work in progress

* Fixing incorrect formatting for logger (#388)

* Ruffed

* Refactored and cleaned up channel selection. Also added check that channels are not empty

* Cleaned channel parsing and selection

* Adjustments

* Removing asserts incompatible with empty dataset

---------

Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>

* add embed_dropout_rate to config v1 (#358)

* [402] adds checks to the pull request (#403)

* changes

* mistake

* mistake

* mistake

* changes

* doc

* Introduce masking class and incorporate in TokenizerMasking (#383)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* Ruffed

* Added warning to capture corner case, likely due to incorrect user settings.

* Fixed incorrect call twice

* Fixed missing conditional for logger statement

* Required changes for better handling of rngs

* Improved handling of rngs

* Improved handling of rng

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Implement per-channel logging (#283)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* [346] Passing options through the slurm script (#400)

* changes

* fixes

* refactor `validation_io.write_validation` to make it more readable

* remove legacy code `validation_io.read_validation`

* encapsulate artifact path logic in config module

* remove redundant attribute `Trainer.path_run`

* use config to look up base_path in `write_validation`

* remove unused `write_validation` args: `base_path`, `rank`

* ensure correct type for paths

* remove streams initialization from `Trainer`

* remove path logic from `Trainer.save_model`

* simplify conditional

* rename mock io module

* update uv to include dask

* Implement io module to support reading/writing model output

* implement new validation_io routine

* use new write_validation routine

* remove unused code

* rename output routine to `write_output`

* ruffed and added comments

* fixed annotation

* use simple __init__ method for `OutputItem` instead of dataclasses magic

* address reviewers comments

* rename method

* add simple docstrings

* ruffed

* typehint fixes

* refactor names

* update comments and typehints, dont import pytorch

* remove `__post_init__` methods, cache properties

* fixes and integration test

* final fixes :)

* changes

* changes

* changes

* changes

* changes

* more work

* changes

* changes

* changes

* ruffed

* ruffed

* improve logging and comments

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Add doc-string to call-method and provide example usage for efficient graph-construction.

* Some fixes to score-class.

* Some fixes to handling aggregation dimension.

* Add missing import of MockIO.

* changes

* changes

* removing the scores

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>
Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com>
Co-authored-by: ankitpatnala <ankitpatnala@gmail.com>
Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>
Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>
Co-authored-by: Till Hauer <till@web-hauer.de>
Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de>
Co-authored-by: Michael <m.langguth@fz-juelich.de>
jpolz pushed a commit to jpolz/WeatherGenerator that referenced this pull request Jul 10, 2025
* Implement mock IO (ecmwf#336)

* Adapt score class (ecmwf#339)

* Implement mock IO

* Adapt score class

* Removing unused file (ecmwf#349)

* remove database folder (ecmwf#355)

* Small change - CI - pinning the version of formatting (ecmwf#361)

* changes

* changes

* Update INSTALL.md

* Update INSTALL.md

* Fixed Exxx lint issues (ecmwf#284)

* Rebased to the latest changes and linted new changes

* addressed review comments

* addressed review comments

* Linted the latest changes.

* corrected the formatting

* corrected the formatting

* configured ruff to use LF line endings in pyproject.toml

* [357] Sub-package for evaluation (ecmwf#359)

* working

* changes

* removing deps from non-core project

* changes

* fixes

* comments

* Iluise quick fix stac (ecmwf#374)

* remove database folder

* fix database

* Simplifying workflow for plot_training (ecmwf#368)

* Simplifying workflow for plot_training

* Ruffed

* Working on implementing exclude_source

* Remove unused code

* Fixed ruff issue

* Fixing bug in lat handling (377) (ecmwf#378)

* Fixing bug in lat handling

* Added comment

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* recover num_ranks from previous run to calculate epoch_base (ecmwf#317)

* recover num_ranks from previous run to calculate epoch_base

* set email settings for commits

* addressing Tim's comment

* make ruff happy

* improve style

* changes (ecmwf#385)

Linter rule so np.ndarray is not used as type

* changed the script name from evaluate to inference as it simply gener… (ecmwf#376)

* changed the script name from evaluate to inference as it simply generates inference samples

* changed evaluate to inference in the main scripts and corresponding calls in the config

* update the main function for the inference script

* changed evaluate to inference also in docstring, unit test scripts, and integration test scripts

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Introduce tuples instead of strings to avoid TypeError (ecmwf#392)

* Exclude channels from src / target (ecmwf#363)

* Exclude channels from src / target

* Simplified code and added comment that pattern matching is used

* Adding new stream config

* Fixing bug that led to error when accessing self.ds when dataset is empty

* Working on exclude_source

* work in progress

* Fixing incorrect formatting for logger (ecmwf#388)

* Ruffed

* Refactored and cleaned up channel selection. Also added check that channels are not empty

* Cleaned channel parsing and selection

* Adjustments

* Removing asserts incompatible with empty dataset

---------

Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>

* add embed_dropout_rate to config v1 (ecmwf#358)

* [402] adds checks to the pull request (ecmwf#403)

* changes

* mistake

* mistake

* mistake

* changes

* doc

* Introduce masking class and incorporate in TokenizerMasking (ecmwf#383)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* Ruffed

* Added warning to capture corner case, likely due to incorrect user settings.

* Fixed incorrect call twice

* Fixed missing conditional for logger statement

* Required changes for better handling of rngs

* Improved handling of rngs

* Improved handling of rng

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Implement per-channel logging (ecmwf#283)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* [346] Passing options through the slurm script (ecmwf#400)

* changes

* fixes

* refactor `validation_io.write_validation` to make it more readable

* remove legacy code `validation_io.read_validation`

* encapsulate artifact path logic in config module

* remove redundant attribute `Trainer.path_run`

* use config to look up base_path in `write_validation`

* remove unused `write_validation` args: `base_path`, `rank`

* ensure correct type for paths

* remove streams initialization from `Trainer`

* remove path logic from `Trainer.save_model`

* simplify conditional

* rename mock io module

* update uv to include dask

* Implement io module to support reading/writing model output

* implement new validation_io routine

* use new write_validation routine

* remove unused code

* rename output routine to `write_output`

* ruffed and added comments

* fixed annotation

* use simple __init__ method for `OutputItem` instead of dataclasses magic

* address reviewers comments

* rename method

* add simple docstrings

* ruffed

* typehint fixes

* refactor names

* update comments and typehints, dont import pytorch

* remove `__post_init__` methods, cache properties

* fixes and integration test

* final fixes :)

* changes

* changes

* changes

* changes

* changes

* more work

* changes

* changes

* changes

* ruffed

* ruffed

* improve logging and comments

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Add doc-string to call-method and provide example usage for efficient graph-construction.

* Some fixes to score-class.

* Some fixes to handling aggregation dimension.

* Add missing import of MockIO.

* changes

* changes

* removing the scores

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>
Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com>
Co-authored-by: ankitpatnala <ankitpatnala@gmail.com>
Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>
Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>
Co-authored-by: Till Hauer <till@web-hauer.de>
Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de>
Co-authored-by: Michael <m.langguth@fz-juelich.de>
tjhunter added a commit that referenced this pull request Nov 18, 2025
* Revert "Implement per-channel logging (#283)" (#434)

This reverts commit 989ab6e1d6e8c0f69594414c7733adf30acd1c54.

* Fix FESOM datareader and int overflow  (#417)

* Fix indexing in DataReaderFesom

* Enforce using only int64 in data loading

* ruff

* ruff2

* Review

* Change int64 back to int32

* changes (#462)

* Fix incorrect handling of empty window (which triggered problem in IO writing code). (#447)

* Update default_config.yml (#446)

analysis_streams_output is missing, which leads to error with val_initial=True and log_validation > 0.

* Re-enabled option to run plot_training as script and fixed -rf argument (#444)

* Re-enabled option to run plot_training as script and removed relative path as default from mutually-exclusive argument -rf.

* Ruffed code.

* Ruff check fix.

* Rename flags for parsing configuration and fixed default handling for standard config YAML-file.

* fix era5 config (#473)

Adding z back in

* [251] Merge new IO class (#469)

* Implement mock IO (#336)

* Adapt score class (#339)

* Implement mock IO

* Adapt score class

* Removing unused file (#349)

* remove database folder (#355)

* Small change - CI - pinning the version of formatting (#361)

* changes

* changes

* Update INSTALL.md

* Update INSTALL.md

* Fixed Exxx lint issues (#284)

* Rebased to the latest changes and linted new changes

* addressed review comments

* addressed review comments

* Linted the latest changes.

* corrected the formatting

* corrected the formatting

* configured ruff to use LF line endings in pyproject.toml

* [357] Sub-package for evaluation (#359)

* working

* changes

* removing deps from non-core project

* changes

* fixes

* comments

* Iluise quick fix stac (#374)

* remove database folder

* fix database

* Simplifying workflow for plot_training (#368)

* Simplifying workflow for plot_training

* Ruffed

* Working on implementing exclude_source

* Remove unused code

* Fixed ruff issue

* Fixing bug in lat handling (377) (#378)

* Fixing bug in lat handling

* Added comment

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* recover num_ranks from previous run to calculate epoch_base (#317)

* recover num_ranks from previous run to calculate epoch_base

* set email settings for commits

* addressing Tim's comment

* make ruff happy

* improve style

* changes (#385)

Linter rule so np.ndarray is not used as type

* changed the script name from evaluate to inference as it simply gener… (#376)

* changed the script name from evaluate to inference as it simply generates inference samples

* changed evaluate to inference in the main scripts and corresponding calls in the config

* update the main function for the inference script

* changed evaluate to inference also in docstring, unit test scripts, and integration test scripts

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Introduce tuples instead of strings to avoid TypeError (#392)

* Exclude channels from src / target (#363)

* Exclude channels from src / target

* Simplified code and added comment that pattern matching is used

* Adding new stream config

* Fixing bug that led to error when accessing self.ds when dataset is empty

* Working on exclude_source

* work in progress

* Fixing incorrect formatting for logger (#388)

* Ruffed

* Refactored and cleaned up channel selection. Also added check that channels are not empty

* Cleaned channel parsing and selection

* Adjustments

* Removing asserts incompatible with empty dataset

---------

Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>

* add embed_dropout_rate to config v1 (#358)

* [402] adds checks to the pull request (#403)

* changes

* mistake

* mistake

* mistake

* changes

* doc

* Introduce masking class and incorporate in TokenizerMasking (#383)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* Ruffed

* Added warning to capture corner case, likely due to incorrect user settings.

* Fixed incorrect call twice

* Fixed missing conditional for logger statement

* Required changes for better handling of rngs

* Improved handling of rngs

* Improved handling of rng

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Implement per-channel logging (#283)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* [346] Passing options through the slurm script (#400)

* changes

* fixes

* refactor `validation_io.write_validation` to make it more readable

* remove legacy code `validation_io.read_validation`

* encapsulate artifact path logic in config module

* remove redundant attribute `Trainer.path_run`

* use config to look up base_path in `write_validation`

* remove unused `write_validation` args: `base_path`, `rank`

* ensure correct type for paths

* remove streams initialization from `Trainer`

* remove path logic from `Trainer.save_model`

* simplify conditional

* rename mock io module

* update uv to include dask

* Implement io module to support reading/writing model output

* implement new validation_io routine

* use new write_validation routine

* remove unused code

* rename output routine to `write_output`

* ruffed and added comments

* fixed annotation

* use simple __init__ method for `OutputItem` instead of dataclasses magic

* address reviewers comments

* rename method

* add simple docstrings

* ruffed

* typehint fixes

* refactor names

* update comments and typehints, dont import pytorch

* remove `__post_init__` methods, cache properties

* fixes and integration test

* final fixes :)

* changes

* changes

* changes

* changes

* changes

* more work

* changes

* changes

* changes

* ruffed

* ruffed

* improve logging and comments

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Add doc-string to call-method and provide example usage for efficient graph-construction.

* Some fixes to score-class.

* Some fixes to handling aggregation dimension.

* Add missing import of MockIO.

* changes

* changes

* removing the scores

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>
Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com>
Co-authored-by: ankitpatnala <ankitpatnala@gmail.com>
Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>
Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>
Co-authored-by: Till Hauer <till@web-hauer.de>
Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de>
Co-authored-by: Michael <m.langguth@fz-juelich.de>

* [459] Attempt to fix ruff differences (#463)

* changes

* debug

* changes

* changes

* Update pyproject.toml (#457)

* Continue training through slurm script (#395)

* train_continue via slurm

* using __main__ as entry point for slurm script

* reverting config files to match base branch

* reverting config files to match base branch

* removing param_sum control logging before and after loading of model weights

* run ruff

* check whether from_run_id is in arguments

* trigger PR check

* remove block to set reuse_run_id=True

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* added the .python_version file set to python 3.12 (#482)

Co-authored-by: Kerem Can Tezcan <ktezcan0@login07.leonardo.local>

* script (#489)

* Remove print statements for logging (#421) (#439)

* first change

* removed all prints

* changed model.py back

* adding comments and fixes

* added ruff fixes

* reverting files for PR

* ruff fixes

* removing run_id.py

* formatting changes

* changing comments in check_gh_issue script

---------

Co-authored-by: owens1 <owens1@jwlogin09.juwels>
Co-authored-by: Timothy Hunter <tim.hunter@ecmwf.int>

* Rename batchsize to batchsize_per_gpu (#475)

* Rename batchsize to batchsize_per_gpu

* Fix ruff stuff

* fix (#490)

* add polar orbiters and abi-goes to the stac database (#426)

* testing adding metopa and metopb as placeholder drafts to stac database

* added the actual json files because I think we have to

* updated metopa metopb jsons and ets

* add fy3 and update metops

* updated names of metops

* updated metopb untarred size inodes and end date

* update names to instrument, satellite

* add untarred data size and inodes for metopa

* updated to oscar naming, with format platform, instrument, and added fengyun satellites

* update size and inodes of fy3c mwhs

* add fengyun jsons, missing before, and update unique ids of metopa and b

* add processing_level field to metopa as a test

* adding processing level field

* fix up processing level

* updated jsons and jsonnets for provenance

* actually include provenance

* updated to include processor and provider, remove provenance

* add abi-goes

* fix abi goes geometry

* fix latitude and longitude

* fix typo

* hopefully this time lat is right..

* update catalogue json for develop

* check catalogue on this branch

* jsonneted for develop

---------

Co-authored-by: iluise <luise.ilaria@gmail.com>

* Added naming convention checks to lint (#501)

* Added naming convention checks to lint

* Implemented python naming conventions and corrected code accordingly

---------

Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx>

* Correct the in-code-names for rotation matrices (#516)

* Added naming convention checks to lint

* Implemented python naming conventions and corrected code accordingly

* Corrected renaming of rotation matrices from R to rot instead of to r

---------

Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx>

* extend format string and timedelta to days (#499)

* extend format string and timedelta to days

* replace with pd.to_timedelta

* import pandas

* ruff

* enforce "HH:MM:SS" format

* ruff

* Mlangguth/develop/issue 251 (#495)

* Add score-class to evaluate-package.

* Add score-class to evaluate-package.

* Linted and ruffed code.

* Add fix to io.py and update dependencies in common.

* Several small fixes to score-class and fast evaluation.

* Add utils for evaluate.

* Moved to_list to utils and improved doc-strings.

* Improve several doc-strings, avoid formatting of logger and other changes from PR review.

* Add xhistogram and xskillscore to dependencies of evaluate.

* Ruffed code.

* Linted code.

* Fix incorrect retrieval of validation batch size in validation IO.

* Final minor changes to argument-names

* changes (#471)

* Updated to camel case. (#445)

* Updated to camel case.

* Fixed formatting.

* Revert "Updated to camel case. (#445)" (#530)

This reverts commit 4a8bd49067d86c8c9dd2930544d52cb9db8577af.

* [327] Script to create the links to output directories (results, ...) (#528)

* changes

* fixes

* slash

* slash

* checks

* checks

* Update config parameters lr and grad_clip (#545)

* updated lr and grad_clip in config

* modify lr to 1e-4

* Fixed randomization problem with masking  (#510)

* Fixed randomization problem with masking (needs to be verified)

* Making sure the seed is ok

* Fixed problem with seed init.

* More improvements. But problem still seems to be there.

* Clean up of rng handling. Re-initialization is passed through to masker, which was the issue.

* - Fixed prime numbers
- Cleaned up unnecessary rng init and added further comments.

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Sophiex/dev/upper bound targets (#526)

* recovering my stash

* Fix bug

* Clean up pull request

* Clessig/develop/fix forecasting 448 (#449)

* Removed (second) residual connection for forecasting

* Added init to forecasting engine to small values

* Default values for forecasting experiments

* Updated settings

* Setting local engine to empty

* Fix z settings.

* Revised defaults with larger net

* Revised defaults with larger config

* Restoring default config

* Restoring

* Restoring default

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Restore self.size_time_embedding in tokenizer_forecast.py (#548)

* Restore self.size_time_embedding in tokenizer_forecast.py

Fixes #547

* Remove empty line for ruff

Remove line for ruff?

* Replace cf.rank==0 with utils.distributed.is_root (#535)

Co-authored-by: wang85 <wang85@jwlogin22.juwels>

* Fixed handling of empty streams in plot_train (#552)

* Fixed handling of empty streams

* Fixed

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix train_continue (#556)

* add DocStrings to model (#268)

* added DocStrings for class ModelParams

* added DocStrings for class Model

* Docstring cleanup v1

* Docstring cleanup v2

* Docstring cleanup v3

* Docstring corrections v1

* Docstring corrections v2

* Docstring corrections v3

* ruff check v1

* ruff check v2

* ruff check v3

---------

Co-authored-by: th3002s <till.hauer@alumni.fh-aachen.de>

* Revised structure in metric JSON-file (#549)

* Update score-class to support groupby-operations for per-sample evaluation.

* Update of fast evaluation pipeline to track metrics sample-wise and dump them into the newly structured JSON-files.

* Changes according to PR review and fix for handling situations with a single sample.

* Changes according to PR review and fix to filter channels for score-calculation.

* Fixed handling of empty source/target channels (#558)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix to peel_tar_channels to allow situations where no data for fstep=0 is present. (#572)

* Update era5.yml: token size 8 (#583)

* [DRAFT] CLI for scoring and plotting  (#522)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list-function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* 'Handle list input to forecast_steps (Closes #573)' (#581)

* 'fixed bug not handling list input to forecast step #573'

* linted

* replace error with assert

* lint

* roll-back accidental lint

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* remove plot config  (#597)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list-function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

* remove plot config + style addition to evaluation package

* ruffed

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* integrate IFS scores from Quaver into FastEvaluation (#600)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list-function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

* remove plot config + style addition to evaluation package

* ruffed

* add option to comment out plotting

* resync utils to develop

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* [569] Load eagerly the stream content in order (#585)

* changes

* change

* changes

* Remove loading of streams also from inference.

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* [DRAFT][590] Rename metrics file (#601)

* Implemented backward-compatible function to read and write `{RUN-ID}_train_metrics.json` (new) or `metrics.json` (old)
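A minimal sketch of the fallback described above, assuming the metrics file holds a single JSON document and lives directly in the run directory. The function names `metrics_path` and `read_metrics` are hypothetical and are not taken from the repository; only the two file names come from the commit message.

```python
import json
import tempfile
from pathlib import Path


def metrics_path(run_dir: Path, run_id: str) -> Path:
    """Pick the metrics file for a run (hypothetical helper).

    The new name `{run_id}_train_metrics.json` is preferred. The old name
    `metrics.json` is used only when it exists and the new one does not.
    """
    new_style = run_dir / f"{run_id}_train_metrics.json"
    old_style = run_dir / "metrics.json"
    if new_style.exists() or not old_style.exists():
        return new_style
    return old_style


def read_metrics(run_dir: Path, run_id: str) -> dict:
    """Read metrics from whichever file `metrics_path` selects."""
    return json.loads(metrics_path(run_dir, run_id).read_text())


# Usage: a run directory that only contains the old-style file.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "metrics.json").write_text(json.dumps({"loss": 0.5}))
    legacy = read_metrics(run_dir, "abc123")
    chosen_name = metrics_path(run_dir, "abc123").name
```

Writing always targets the new name unless only the old file is present, so existing runs keep a single metrics file rather than gaining a second one.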

* Quick fix for #553 NaT from encode_times_target, move offset to before trigs (#589)

* quick fix for 553 NaT from encode_times_target, move offset

* change offset to 10 minutes...

* ruffed

* apply hotfix to deltas_sec

* ruffed

* fix: associate output stream names with correct index (#519)

* fix: associate output stream names with correct index

* ruffed

* fix: iteration over output items

* address comments

* fix: correctly index channels

* fix stream indexing logic, add asserts

* fix: extraction of data/coordinates for sources

* fix assert

* Clessig/develop/channel logging 282 (#615)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

* Implement sending tensors of different shapes

* ruff

* Fix merge

* Fix docstring

* rerun workflow

* Review

* Change default columns name

* Fix merge

* - Added ddp_average_nan that is robust to NaN/0 entries when computing mean
- Switched from all_gather to this function in trainer to robustly average
- Some code cleanup

* use all_to_all communication

* Fixing problem with single-worker (non-DDP) training

* Ruffed

* Re-enabled validation loss output in terminal

* Simplified handling of dist initialized

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>
Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix bug in corner case of data reading (#621)

* Changed logging level for some messages.

* Fix bug in data reading and add assert to better detect these problems.

* Loss class refactoring (#533)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

* Implement sending tensors of different shapes

* ruff

* Fix merge

* Fix docstring

* rerun workflow

* creating loss class

* Adapted varnames in new compute_loss function to match LossModule

* comments and loss_fcts refactoring

* Suggested a separation of mask creation and loss computation

* first working version of LossModule; added unit test

* Modifications and TODOs after meeting with Christian and Julian

* Added Christian's comments and updated code partially

* Julian & Matze further advances to understand shapes

* New mask_t computations. Not yet correct, thus commented

* Resolved reshaping of tensors for loss computation

* small changes in _prepare_logging

* J&M first refactoring version finished, 2 tests ok

* First round of resolving PR comments

* add ModelLoss dataclass, rearrange mask and loss computation

* Integrating new LossCalculator into trainer.py and adding docstrings

* J&M resolved temp.item() error

* Second round of PR comments integrated

* - Fixed loss accumulation
- Cleaned up variable names

* Renamed weight

* Removed unused vars

* Inspected loss normalization for logging

* Minor clean-up

* Removing unused code.

* More refactoring: breaking code down in smaller pieces

* Fix

* Adding missing copyright

* Adding missing copyright

* Fixing incorrect indent

* Fix

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Update momentum (#633)

* Update momentum

* Remove final GELU in MLP

* Adding assert to catch inconsistent config params (#630)

* Update default_config.yml (#641)

Fix incorrect stream

* Backward compatibility of 'loss_avg_mean' metric name (#637)

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Iluise/develop/plotting issues (#635)

* fix plotted timestamp

* fix crashing when a run is plot only

* ruffed

* implement comments

* Mlangguth/develop/issue 586 (#625)

* Add options to configure the marker size, the marker type and enable marker-scaling with latitude for map-plots

* Update doc-strings to follow standard format.

* Ruffed code.

* Changes due to review comments.

* Less verbose logging and improved handling of setting to plot histograms.

* Corrected error-message in plot_data.

* [DRAFT]: Prediction head architecture clean-up (#481)

* - Avoid time encoding being 0
- eps in layer norms to 10^-3
- bf16

* Make the attention dtype and norm eps configurable

* Fix gitignore and add config files

* Shuffle config files into sensible folders

* Implement first attempt at new prediction heads

* Fix some bugs

* Fix trainer compile + fsdp

* Fix trainer and better defaults

* Choose AdaLN

* Correlate predictions per cell

Previously this PR treated them as independent

* Make things more parameter efficient

* Revert "Make things more parameter efficient"

It made things way worse

This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757.

* Improve the prediction heads at small sizes

* Improve the stability of training

Two main changes: better beta 1 and beta 2 values in AdamW and remove
GELU

* Adding some more regularisation

In particular to prevent training divergences and overfitting

* Forgot the dropout in MLPs

* Tune the learning rate

* Add the original prediction heads

CAREFUL: Untested!!!

* Fix bugs and ruff

* Restore old version last part

* Start fixing the defaults

* Deleting hpc specific configs

* Deleting hpc specific configs

* Defaults and documentation

* Apply ruff

* Clean up code

* Add one more comment

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Fix bug in logging buffer reset (#651)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* use config dropout_rate in EmbeddingEngine (#646)

* Make numpy argsort version resilient (#645)

* Fix backward compatibility (#655)

* Implement global and per-cell channel masking (#496)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* working implementation of healpix level masking in Masker, with too many prints and hardcoded hl_mask and hl_data

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* implementation of healpix masking code with lots of printing

* removed print statements from masking.py

* minor line change

* remove default for strategy_kwargs

* add strategy_kwargs to config, and pass through masker to pass masking strategy specific args

* vectorise child indices calcs, implement masking_rate_sampling, minorly updated docs

* remove print statements

* cf.strategy_kwargs passed to Masker in multi_stream_data_sampler

* masking_strategy random and strategy kwargs passed to config

* ruffed

* pass cf.get(strategy_kwargs or {}) to the Masker and update masking to reflect this

* update config so it does not include strategy_kwargs, no longer needed

* move asserts for healpix to constructor, rename to masking_strategy_config, update config with example of healpix

* test working version, understanding what is happening

* revert breaking develop merge and conflict in config

* default config put channel masking

* reverting the accidental revert...

* small change to config

* implemented global and per-cell per channel masking in masking, change to config

* remove print statements from multistream

* updated config for compatibility to run immediately

* cleaned code, assert to fail for different number of source and target streams

* updated default config to match latest

* fixed _generate_channel_mask to handle empty cells of data

* fixed docstring of masker

* ruffed linted

* rename l in token_lens

* lint ruff, remove prints

* add assert for source and target channels must be the same

* fix config to develop, new assert, remove assert

* revert assert statement for readability

* clip the values in masking_rate_sampling to 0.01 and 0.99

* revert cell name to tl

* remove empty lines from model

* remove empty line from embeddings

* remove empty line tokenizer_masking

* ruff masking, tokenizer_masking

* update config again to develop version

* update config comment for masking strategies

* update channel masking to handle non-data channels for new loss

* ruffed

* Implemented check that for channel masking source and target channel have to be identical

* Minor code improvements

* Fixed incorrect return type for special case

* Ruffed + and reduced magic constants

* Minor fixes to _generate_healpix_mask

* Cleaned up and optimized mask generation for channel masking

* changed to use mode global or per_cell, improved docstring for masking strategies

* added documented valid examples for masking_strategy_config to default_config

* ruffed

* update example masking_strategy_config in default

* Minor adjustments to default settings

* remove mention of hl_data in masking_strat_config

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Removed that checkpoint is saved at the first batch (#663)

* Clessig/develop/fix data reading anemoi missing date 671 (#672)

* Changed logging level for some messages.

* Fixed unhandled exception with missing dates.

* Fixed debug message

* Make compare_run_config.py usable again (#661)

* Update compare_run_config.py to use existing functions from current repo.

* Ruffed code.

* [595] Changes for running a notebook script  (#598)

* Changes

* Changes

* work

* change

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* reverse old changes

* linter

* Implement regional evaluation  (#652)

* Add RegionBoundingBox data class to score-utils to handle evaluation for different regions.

* Implement region-specific evaluation in plot_inference.py.

* Adapted utils.

* Introduction of clean RegionLibrary in score_utils.py.

* Ruffed code.

* Updates following reviewer comments.

* Ruffed code.

* Clessig/develop/fix loss 678 (#679)

* Changed logging level for some messages.

* Fixing bug with incorrect counting

* using config results path instead of fixed path (#631)

* using config results path instead of fixed path

* ruff

* Add forgotten LayerNorm (#687)

* Add forgotten LayerNorm

* Apply ruff

---------

Co-authored-by: Sophie Xhonneux <sxhonneux@clariden-ln001.cscs.ch>

* Fix performance degradation in loss computation (#690)

* Changed logging level for some messages.

* Refactored loss computation to improve performance.

* Working around ruff issue

* - Refactored code to improve structure and readability
- Fixed problem with incomplete normalization over loss functions
- Solved problem with mse_weighted as loss function when mse is specified

* Fixed problems with multi-worker training

* Fixed indentation bug and bug in assert

* [DRAFT] Rename plot_inference.py and entrypoint for evaluation (#683)

* Rename plot_inference.py.

* Rename of main-method and move parsing of arguments for entrypoint.

* Introduce entrypoints to fast evaluation.

* Fix to call of main in run_evaluation.py.

* Rename entrypoint and add dependency to weathergen-evaluate.

* Add missing comma in pyproject.toml.

* Option for non-linear output layer in prediction head (#673)

* Add score-class to evaluate-package.

* Add score-class to evaluate-package.

* Linted and ruffed code.

* Add fix to io.py and update dependencies in common.

* Several small fixes to score-class and fast evaluation.

* Add utils for evaluate.

* Moved to_list to utils and improved doc-strings.

* Improve several doc-strings, avoid formatting of logger and other changes from PR review.

* Add xhistogram and xskillscore to dependencies of evaluate.

* Ruffed code.

* Linted code.

* Add helper function to get custom last activation.

* Add option to control stream-specific non-linear output layer.

* Controlling print-statement to model.py.

* Corrected handling of config for prediction head.

* Add support for stream-specific, optional non-linear output activation function.

* Provision of ActivationFactory.

* Ruffed.

* Changes following review comments.

* Fix in parsing final_activation-argument.

* Clessig/develop/fix empty 647 (#675)

* Changed logging level for some messages.

* Removed checks that requires non-empty channels

* Adding warning

* Fixed convergence of training (#696)

* Restored old prediction head functionality. Other adjustments/reverts, in particular in attention.

* Ruff'ed

* Addressed reviewer comments and cleaned up minor details

* Fixed bug in obs data reading (#698)

* Restored old prediction head functionality. Other adjustments/reverts, in particular in attention.

* Ruff'ed

* Fixed bug in obs data reading that caused data to violate the window

* Fix

* Update data_reader_obs.py

* Restoring to develop

* Fix

* Ruffed

* Clessig/develop/fix logging verbosity 564 (#619)

* Changed logging level for some messages.

* Added support for more fine grained output control.

* Changed logging setting for inference.

* Minor improvement to doc string

* include run_id in debug log file

* ruff

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* Refactor path-setting for 'model' and 'results' to be dynamic (no relative paths) (Closes #591) (#677)

* temp commit wip

* change model_path and run_path setting to dynamic (independent of HPC) (untested)

* removed unnecessary set_paths references

* linted

* remove commented code

* removed commented lines

* Enable plot_train with dynamic paths

* lint

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

* Fix (#715)

* modified evaluation api, callable as python function (#713)

* Fixed bug for degenerate streams (#723)

NaN-robust min/max computation.

* Fixed (#725)

Resolves config loading error when passing a `model_dir`

* Fix on loading model config (#726)

* Small fix on loading model config

* minor change

* Detect if channels for plotting differ from JSON and recompute if necessary (Closes #701) (#718)

* new branch

* detecting changes in channel spec

* style changes

* style changes

* Delete config/plot_config.yml

* incorporated PR feedback

* added run_evaluation (again)

* Clessig/develop/fix logging 719 (#720)

* Cleaned up to use proper logger

* Cleaned up to use proper logger

* Fix logging: needs to be registered per output stream and not per logging level

* Set logging level consistently with debug to file

* Fixes

* Added FSDP-sharding after loading model for train continue (#729)

* Added FSDP-sharding after loading model for train continue

* Improved consistency

* Fixed resetting FSDP after checkpoint saving

* Update handling of `run_path` and `model_path` (Closes #716) (#732)

* proposed solution, untested

* assert instead of error

* lint

* incorporating PR feedback

* lint

* added explicit argument passing

* lint

* Make cartopy map resources a shared asset to prevent downloading from… (#731)

* Make cartopy map resources a shared asset to prevent downloading from the internet which is not always possible

* Replaced print by logger statement

---------

Co-authored-by: xhonneux2 <xhonneux2@jwlogin22.juwels>
Co-authored-by: karlbauer1 <karlbauer1@jwlogin21.juwels>

* Clessig/develop/fixes hackathon (#736)

* Fixed some comments that generated warnings

* Added to create path for log files if it doesn't exist

---------

Co-authored-by: Christian Lessig <christian.lessig@ovgu.de>

* Revised path defaults and output directory structure for fast evaluation (#681)

* First changes to path-handling.

* Consistent path for maps and histograms.

* Update of evaluation scripts for proper path defaults and directory structures.

* Make root-path to repo available via common-package.

* Introduce proper defaults to plot_inference.py and set-up desired directory structure for evaluation output.

* Rename of results_dir-parameter to results_base_dir

* Ruffed code.

* Allow for run-specific results-paths and use config to get defaults.

* Several fixes and consistency improvements.

* Remove manual default usage in plotter.py

* Ruffed code.

* Update __init__.py

Remove _REPO_ROOT.

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Mk/develop/fix plot train 727 (#738)

* Load model_path from private config if not provided

* Use existing function to get private model path

* Incorporated PR comments

* Fix problems with rel paths in logging files (#742)

* Fixed relative path handling for logging files.

* Adding default argument to _load_private_conf()

* Implement first function for latitude weighting (#705)

* Changed logging level for some messages.

* Refactored loss computation to improve performance.

* Working around ruff issue

* - Refactored code to improve structure and readability
- Fixed problem with incomplete normalization over loss functions
- Solved problem with mse_weighted as loss function when mse is specified

* Fixed problems with multi-worker training

* add location weights, first commit

* assertion on mask and len(location_weights)

* restructuring of location weights and fixes in mse_channel_location_weighted function

* fix coords_raw dependency on offset and fstep

* ruff

* addressing review commits and fixing bug

* rm location_weight from default stream config

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* Fix failure for notebooks. (#750)

* add proper error message for source_include not equal to target_include (#767)

* Implemented fractional target selection  (#751)

* implemented fractional target selection

* ruffed

* fix up configs and <= to accept target_fraction 0.0

* revert to simple implementation of per stream sampling_rate_target

* restore configs

* Corrected formula for L2-error in score-class. (#721)

* Corrected formula for L2-error in score-class.

* Introduced option to get the original or the squared L2-norm.

* Added doc-string for L2-norm.
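For orientation, the standard definition that the option above distinguishes can be sketched as follows. This is a plain-Python illustration of the textbook L2 norm of the error, not the score class itself; the name `l2_error` and the `squared` flag are assumptions made for this sketch.

```python
import math


def l2_error(pred, target, squared=False):
    """L2 norm of (pred - target); returns the squared norm if `squared` is True.

    Hypothetical helper: the real score class may operate on arrays and
    apply averaging or weighting that is not shown here.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    sum_sq = sum((p - t) ** 2 for p, t in zip(pred, target))
    return sum_sq if squared else math.sqrt(sum_sq)


# Usage: the error vector is (3, -4), so the norm is 5 and its square is 25.
norm = l2_error([3.0, 0.0], [0.0, 4.0])
norm_sq = l2_error([3.0, 0.0], [0.0, 4.0], squared=True)
```

The squared variant skips the square root, which matters when scores are later averaged: the mean of squared norms is not the square of the mean norm.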

* Fix sampling rate (#773)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Update default_config.yml (#776)

* Adding the animations feature, fixing stable colorbars (not per stream) (#692)

* Adding the animations feature
* keep only the animations and max-min functions.

* Sophiex/dev/name modules (#754)

* Add names to modules as prep for freezing

* Add functionality to freeze modules based on added names

* Ruff

* Clean up

* Wrong import path

* Ruff

* Fix animations bug with paths (#781)

* Fix another bug in animations (#783)

* Work around to allow for model freezing (#785)

* Work around to allow for model freezing.

* Ruff

* fix to avoid whole model element of named_modules and hence freeze whole model

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>
Co-authored-by: Sebastian Hickman <seb.hickman@ecmwf.int>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* Fast evaluation for integration tests (#770)

* rename module level constant

* split inference into own method

* use proper fast evaluation pipeline for `evaluate_results`

* ruffed

* remove assert => different bug

* adjust tests for new plot template

* Update checking the value of plot_histograms and plot_animations (#788)

* pass StreamData instances to io.py (#779)

* Rename anemoi directories and built backward compatibility (Closes #709) (#771)

* renamed anemoi dirs and built backward compatibility

* ruff

* removed stream directories and updated logging

* renamed all streams

* ruff

* seviri file name change

* cerra_seviri folder update

* cerra path update

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Fix IO when targets/preds are empty. (#760)

* Modify DataReaderObs to get base_yyyy... from stream config (#794)

* modify DataReaderObs to get base_yyyy... from stream config, and set it in the ctor, with default of 19700101. Use it in _setup_sample_index. Remove loading obs_id attr. Add igra.yml with example usage.

* add license to igra config

* update to ISO base_datetime, parse to read idx from zarr

* fix integration tests (#796)

* Fixed bug for empty source (#800)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Train continue function with arguments (#803)

* add train_continue_from_args to call with arguments


---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* remove module common/mock_io (#809)

* Update data_reader_obs.py removing asserts (#817)

* Sgrasse/develop/issue 616 (#648)

* encapsulate extraction of source data

* bundle offsetting of key attributes

* consolidate calculation of datapoints indices into method

* encapsulate extraction of coordinate axis in function.

* replace attribute `channels` by `target_channels` and `source_channels`

* ruffed

* ruffed

* fixes

* address michas comments

* reactivate assert

* fix typo / renaming

* small fix

* uncomment source_n_empty and target_n_empty unused variables

* fix unit tests (#814)

* Plot substeps (#789)

* Create subplots with grouping by valid_time.

* Create histograms at substeps with grouping by valid_time.

* Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required.

* Add helper function to get values of keys from stream configs.

* Corrected loading of model config.

* Ruffed code and turning message-level on cartopy path to debug.

* Revisions following reviewer comments.

* fix histograms

* ruffed

---------

Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Add the possibility of common ranges in plots per variable and stream (#801)

* Create subplots with grouping by valid_time.

* Create histograms at substeps with grouping by valid_time.

* Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required.

* Add helper function to get values of keys from stream configs.

* Corrected loading of model config.

* Ruffed code and turning message-level on cartopy path to debug.

* Add the possibility of common ranges in plots per variable and stream

* Revisions following reviewer comments.

* fix histograms

* ruffed

* update utils

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>
Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Fix to io problems. (#820)

* Enable histograms for data with some NaNs (#823)

* Fix to filter NaNs before histogram creation.

* Removed unused code lines and correct for bug in marker scaling in plotter.py.
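The NaN filtering mentioned above can be illustrated with a small standard-library sketch. It assumes the bin range is derived from the data, which is where unfiltered NaNs cause trouble because `min` and `max` give unreliable results when NaNs are present. The function `histogram` here is hypothetical and is not the plotting code from the repository.

```python
import math


def histogram(values, n_bins):
    """Count values into `n_bins` equal-width bins, ignoring NaN entries."""
    # Drop NaNs first so that the range and the bin indices are well defined.
    finite = [v for v in values if not math.isnan(v)]
    counts = [0] * n_bins
    if not finite:
        return counts
    lo, hi = min(finite), max(finite)
    # Guard against a zero-width range when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    for v in finite:
        # The maximum value falls on the upper edge; clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Usage: one NaN among four values; only the three valid values are counted.
counts = histogram([0.0, float("nan"), 1.0, 0.5], n_bins=2)
```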

* Clessig/develop/fix empty io 819 2 (#822)

* Fix to io problems.

* Fix issues in input

* Iluise/fix empty io 819 plotting (#826)

* Fix to io problems.

* Fix issues in input

* fix plotting

* ruffed

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* fix plotting for partially filled first forecast steps (#828)

Co-authored-by: luise1 <luise1@jrc0288.jureca>

* Fix calculation of scores per fstep (#853)

* fix calculation of scores per fstep

* simplified syntax

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Fix Issue 835 (#841)

Enable freezing the target coord embedding when it is just a simple
layer

* Improve r3tos2 (#744)

* vectorized r3tos2

* revise comment

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Sophiex/dev/latent noise (#594)

* - Avoid time encoding being 0
- eps in layer norms to 10^-3
- bf16

* Make the attention dtype and norm eps configurable

* Fix gitignore and add config files

* Shuffle config files into sensible folders

* Implement first attempt at new prediction heads

* Fix some bugs

* Fix trainer compile + fsdp

* Fix trainer and better defaults

* Choose AdaLN

* Correlate predictions per cell

Previously this PR treated them as independent

* Make things more parameter efficient

* Revert "Make things more parameter efficient"

It made things way worse

This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757.

* Improve the prediction heads at small sizes

* Improve the stability of training

Two main changes: better beta 1 and beta 2 values in AdamW and remove
GELU

* Adding some more regularisation

In particular to prevent training divergences and overfitting

* Create classes for latent noise

* Add the latent noise after the local engine

* Add the KL loss

* Formatting

* Clean up

* Use the same for loop as before

* Prepare branch for merge

* Remove superfluous configs

* Restore default configs

* Mistake in the merge fixed

* Final beauty changes

* Final clean up

* Ruff

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* [Hotfix] Fix crash when using list of forecasting steps (#824)

* Fix crash when using list of forecasting steps

* Ruff

* Grammar fix

* Fix grammar

* Add checking forecast steps list

* Review

* Allow 0 as forecast step

* Add list length check

* Assert non-negative forecast step integer, added assertion messages

* Ruff

* ruff

* Move check to config

* what the ruff

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

* add tokenizer base class (#815)

* add tokenizer base class

* ruffed

* ruffed v2

* move calculation of centroids to base_class

* move size_time_embedding initialization

* remove ABC from tokenizer base_class

* renaming

* ruffed

* ruffed v2

* add return value to compute_source_centroids

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* vectorize s2tor3 (#745)

* vectorize s2tor3

* ruff code

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Remove cleaning stream name when logging loss (#763)

* Combine masking strategies during training, with appropriate masking_… (#756)

* combine masking strategies during training, with appropriate masking_strategy_config

* restore config samples per validation

* restore configs, and add to masking_strategy_config

* clarify pass to per batch per stream

* updated combination masking to support same masking strategy for all streams in the batch. Strategy resampled for every batch.

* rename so we have masking_strategy and masking_strategy_per_batch

* ruffed

* clean, default to different strategy per batch for combination

* ruff

* remove unused variable

* updated docstrings (#875)

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Enable correct reading of channels, forecast_step, sample variables in plot config file (Closes #717) (#755)

* adjusted run_evaluation and utils code to take into account forecast_step variable from config

(cherry picked from commit 26c26a923cabc5777bc75ef911f0fc3c61397e1a)

* print statement change

* catching error when fstep not present in zarr file

* upgrades based on PR feedback

* intermediate commit

* intermediate commit

* new functions _get_channels_fsteps_samples and check_metric

* edited plotting

* inter commit

* fixed bug in get_data

* self review

* refactor

* dummy commit

* inter commit

* feedback applied

* incorporate review feedback

* removed sorting of fsteps_final

* remove comments

---------

Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Implement causal masking as MTM strategy (#798)

* first rough implementation of causal masking

* incorporated combine masking strategies

* include per stream sampling rate target in tokenizer_masking based on other PR

* clean up implementation of causal masking

* remove TODO

* remove old causal masking function

* add latest error message for channel

* change if to elif for causal masking

* if to elif in mask_target

* cleaned up causal masking code

* tokenizer_masking small change

* updated config

* fix up config

* restore era5 config

* ruffed

* update config and masking.py with causal masking specific masking rate, and some comments

* ruffed

* roll back causal_masking_rate changes, return to just use masking_rate

* faster version of causal masking, vectorise where possible. Need list comprehension for variable length tokens

* ruffed

* add log scale and refactor plot_summary (#865)

* add log scale and refactor plot_summary

* add plot_utils

* add grid

* ruffed

* fix marker size

* fix global plotting options

* add types

* ruffed

* Fixed stream name factoring (#534)

* Updated to camel case.

* Fixed formatting.

* to reflect upstream develop

* got rid of regex and changed formatting of str names

* pulled recent changes from upstream develop

* Removed refactoring of lf_name.

* clean_name with the new changes

* Fetched latest changes to the branch

* Fixed linting

* Fixed stream name without touching the losses dict

* fixed type annotation

* add srun to integration-test in actions script (#886)

* add srun to integration-test in actions script

* add --offline flag to integration-test in actions.sh

* Merge compare_run_configs.py with markdown table version (#699)

* initial comments to outline implementation

* Refactor config comparison script to support YAML input and enhance output formatting

* remove unused code

* Add example configuration for model run IDs and display patterns

* shorten

* Add 'tabulate' dependency to enhance table formatting capabilities

* add instructions to config

* restore option for command line run ids and model dirs

* ruff

* fix arg parsing

* add option to show specific or all parameters in config comparison

* ruff

* Remove 'tabulate' from dependencies

Removed 'tabulate' dependency from project requirements.

* logging, imports and dependency in compare_run_configs.py

* fix logging and dependencies

* ruff

* set default

* fix arg order and checks

* improve model directory handling and add exception when there is no latest model

* refactor error handling in main function to omit exception details in logs

* ruff

* ruff

* add weathergen dependency

* make file executable

* add private home argument

* revert to default model path argument assuming symlink to shared folder is set

* ruff

* Implement splitting zarr and regex filenames (#524)

* Implement splitting zarr and regex filenames

* Optimize dask reading operations

* Ruff

* Review

* Ruff

* Remove stream name cleaning when logging loss

* Add tolerance to setting std to 1.0

* Implement input column reordering and channel exclusion

* Update stream config

* Ruff

* Add config file

* Implement variable state persistence

* ruffffff

* ruffa

* ruf ruf

* Add select channels method

* updating .gitignore file to include all development directories without / (#900)

* [804] vectorize tokenize_window_space (with test) (#893)

* vectorize tokenize_window_space

* import pad_sequence from torch

* change variable names, remove device, add comments, ruff code

* changes

* changes

* changes

* simplify

* small change

* unit tests

* unit tests

* unit tests

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [812] efficient tcs computation (with tests) (#894)

* efficient tcs computation

* revise vectorize tcs_optimized, ruff code

* add typing

* changes

* changes

* merge

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [811] Improve perf locs to cell coords ctrs (#895)

* optimize locs_to_cell_coords_ctrs

* revise get_target_coords_local_ffast for new optimized locs_to_cell_coords_ctrs

* changes

* changes

* tests

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>
Co-authored-by: Sophie X <24638638+sophie-xhonneux@users.noreply.github.com>

* [810] optimize locs_to_ctr_coords (with tests) (#896)

* optimize locs_to_ctr_coords

* changes

* changes

* changes

* changes

* merge

* changes

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Migrated Config to common (#607)

* Updated to camel case.

* Fixed formatting.

* to reflect upstream develop

* got rid of regex and changed formatting of str names

* pulled recent changes from upstream develop

* migrated config to common

* fixed lint issues

* Corrected all the changes

* syntax err fixed

* Fixed import

* Latest upstream changes pulled

* Fixed Linting errors

* Lint fix

* Pulled latest

* fixed other occurrences

* Fix compare_config after config went to common (#903)

* fix after config went to common

* Change argument type for --show option from int to str in main function

* Update default config path in main function to compare_config_list.yml

* Iluise/develop/add io reader (#891)

* first implementation of reader class for evaluation package

* add io reader

* move check_availability to reader

* update to develop

* fix retrieve results

* address comments

* Fix minor bug in modules (#909)

---------

Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>

* [908] Harmonize the linter check between the CI and our CLI (#910)

* changes

* changes

* [554] Updates the PR template (#912)

* changes

* changes

* changes

* comments

* [906] Bug fix in tokenizer (#907)

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* cleanups

* changes

* comments

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* Add levels (#916)

* changes to include discrete levels in colormap if needed

* Change slightly the position of the feature

* Lint

* changes (#920)

* [926][evaluation] weatherGen reader for evaluation package (#927)

* weatherGen reader for evaluation package

* ruffed

* [939] Fix CI (#940)

* changes

* changes

* Implement forecast activity metrics (#892)

* Add forecast activity calculations and update fstep handling in utils

* Add forecast rate of change metrics (froct, troct) to score calculations

* update description

* add next data to verified data

* move cases for kwargs to score

* refactor froct and troct to use calc_change_rate

* remove metric specific kwargs in calc_scores_per_stream

* calc_change_rate now gives NaN array when next step is None

* fix nans

---------

Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com>
Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com>

* added IFS-FESOM streams and updated all stac files using jsonnet (#934)

* added IFS-FESOM streams and updated all stac files using jsonnet

* changes according to comments by Ilaria

* resolved by using providers from common.jsonnet file

* changed reference to ecmwf and develop branch

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* [939] Catches failures of labeling CI job (#950)

* changes

* more permissions

* more permissions

* [880] Informative type checks in the CI (#915)

* attempt

* fixing pyrefly

* changes

* changes

* changes

* Sets dropout rate to 0 in eval mode for flash_attn (#923)

* added check for train/eval for setting dropout_p value

* ruff

* rm ceil, conj, floor, and matmul from annotations.json (#951)

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Add the calculation of 10ff for ERA5 and CERRA  (#914)

* Add the calculation of 10ff

* Caring for cases where 10ff cannot be calculated

* Create a new script for derived_channels, minor changes to reader_io

* Remove stream specific settings add regex

* Add more datetime formats (#962)

* fix error when global_plotting_opt does not exist (#964)

* fix error when global_plotting_opt does not exist

* fix linter

* changes (#1009)

* Revert "changes (#1009)" (#1012)

This reverts commit 2af1c09a11e6dd027d247b670737bbac0cd1a766.

* sorcha/dev/500 (#1001)

* lint reformatting + fixing get_channels

* debug messages

* [datasets] move to new cerra and new era5 (#995)

* move to new cerra and new era5

* fix cerra

* Removed the method freeze_weights_forecast and all forecast_freeze_model flag occurrences (#924)

* Add Coordinate System Conversion to DataReaderFesom (#1024)

* Add coordinates conversion

* Ruff

* Add check for longitude

* Sophiex/dev/fsdp2 fix (#959)

* Save current state

* Save current state

* Barebone FSDP2 prototype TODO save checkpoints

* First version of saving model

* Fix save_model

* Log everything and log to files

* Remove redundant path creation

* Allow for both slurm and torchrun + fewer log files

* Cleaning up init_ddp

* Ruff

* Attempt to avoid duplicate logging

* FSDP2 with mixed precision policy

* Ruff

* Clean up and logging

* Try to get loggers to behave as we want

* Makes ruff unhappy but works

* Fixed ruff issue

* Fixed problems with multi-node training.

* Fix for interactive/non-DDP runs

* No idea why, but this seems to work so far

Committing simply so it is saved, obviously needs cleanup

* Still works! So which is it memory or the grad scaler?

* Also still works, I now strongly suspect the amp.gradscaler

* This still works, I have no clue anymore why but whatever it works now....

* Enable loading model from absolute paths

* Enable loading for 1 GPU only

* Fix 1 GPU train continue

* Appease ruff

* Fix saving the model more regularly and perf logging

* Fixed problem when training with 2 nodes.

* Fix data loader seed

* Appease ruff

* Shouldn't overwrite with_fsdp like this

* Potential fix for FSDP2 issue

with different ranks using different model parts

* Fix loss scaling and logging of dummy data loss

* Clean up

* Appease ruff

* Fixed problem when source channels are empty (i.e. with diagnostic trainings).

* Update io.py

* FSDP2 suggestions from Tim (#1015)

* comments

* sophie's comments

* removed logger suggestions

* Clean up deadcode etc

* Removing unused imports that the linter didn't like

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* Ensure sample coordinate is repeated along ipoint for single sample cases in WeatherGenReader (#1026)

Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>

* Fix change rate calculation by aligning s1 with s0 (#1007)

* Fix change rate calculation by aligning s1 with s0

* Refactor score calculation to remove unnecessary alignment and add sorting function for coordinates

* use .values option

* Optimize pos enc harmonic (#1033)

* add device & dtype

* ruffed

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [evaluation] fix score computation with empty cerra samples (#1039)

* fix samples

* ruffed

* answer comments

* ruffed

* Add score cards plotting feature (#1041)

* add the feature of score_cards

* Refactor, fix error when sample are different for each run, linting

* fix bug, fix sizes when skill difference is huge

* linting

* changes on comments

* linting

* Clessig/develop/fix inference 1049 (#1053)

* add device & dtype

* ruffed

* Fix inference

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Channel weighting in loss computation (#753)

* introducing channel weights

* tested channel weighting

* adding target_channel_weights to data_reader_base

* uncomment target channel parsing in anemoi dataset

* remove channel weights from default stream config

* Adds default config for run_evaluation (#1028)

* adding default config + changing yml path locations

* linter checks

* linter checks

* revert .yml file

* updates

---------

Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>

* Fix CERRA eval breaking with coord sorting in `froct` (#1057)

* Move coord sorting inside score function to be metric-specific

* Linting: Removed unused import

* Return nan data array to prevent crash

* fix nans shape in calc_change_rate

---------

Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* [1059][eval] Fix eval crash of inference models from other HPCs (#1060)

* Read model path from private repo instead of inference config

* Linting: Organized imports

* Interface improvements

* [1022] Getting WG to work on santis (#1023)

* working pytorch

* changes

* Fix for code to work on Alps-Santis

* changes

* cleanups

* changes

* reverting change

* having issues with the latest branch on santis

* changes

* changes

* changes

* override with cpu

* working for cpu

* flash-attn moved to gpu

* remove constraint

* simplifying

* trying

* working on atos

* changes

* macos

* changes

* cleanups

* actions

* actions

* actions

* actions

* changes

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* fix crash in case of missing streams  (#1058)

* fix issue with empty region

* fix non existing stream

* fix channel order in evaluation (#1066)

* New templates for issues (#1017)

* changes

* changes

* Revert "New templates for issues (#1017)" (#1071)

This reverts commit 3a6e7b826b7b29a6df4af27b6771567474302fb3.

* [1002] Template for issues try 2 (#1072)

* changes

* changes

* issue with template

* updates

* changes

* issue

* issue

* [1073][model] Adds latent noise imputation (#1074)

* Add latent noise imputation to model.py with backwards compatibility

* Linted

* Resetting default_config, except for new flag

* Resetting default_config 2nd try

* add modules to annotations.json (#1035)

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Jk/develop/gamma decay (#998)

* Update to develop, prepare for new experiment series

* gamma decay over fsteps first commit

* add gamma decay factor to config

* working gamma decay weighting

* rm breakpoint

* rm eval and plot configs

* reverting default config

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* Add materialisation of new modules before loading checkpoint (#1030)

* Add materialisation of new modules before loading checkpoint

* Initialize new modules in load_model

* Fix adding new embedding networks

* Clessig/develop/fix kcrps 1077 (#1078)

* Improved robustness for loss fcts where ch loss does not make sense

* Re-enabled kernel CRPS and added weighting options

* Fixes

* Improved tensor reordering

* Sgrasse/develop/issue 898 checkpoint freq conf (#905)

* add new/changed parameters in default_config

* implement backward compatibility

* remove `train_log.log_interval` from default config

* use new configuration arguments in Trainer

* fix: wrong variable name

* ruffed

* Rework method structure

* fix bug

* rename `log_intevals` to `train_log_freq`

* fix integration tests

* fix forgot renaming

* fix rebasing artifact

* Sorcha/dev/571 (#957)

* debug for netcdf pipeline

* zarr_netcdf first draft

* fixing pipeline

* linter checks

* removing debug prints from io.py

* refactoring, found issue with forecast_ref_time

* deleting unnecessary lines

* proper docstrings

* moving filepaths

* linting

* multithread processing added

* debug info

* debugging

* refactoring

* linting

* fstep as argument

* change assert

---------

Co-authored-by: owens1 <owens1@jrlogin09.jureca>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* pyproject.toml checks (#1042)

* adding checks for toml files into actions

* lint fixes

* lint checks

* link fixes

* changes

* disabling ruff check

* change to path info instead

* adding E501 and E721 to be ignored for now

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* Implement EMA of the model (#1005)

* Save current state

* Save current state

* Barebone FSDP2 prototype TODO save checkpoints

* First version of saving model

* Fix save_model

* Log everything and log to files

* Remove redundant path creation

* Allow for both slurm and torchrun + fewer log files

* Cleaning up init_ddp

* Ruff

* Attempt to avoid duplicate logging

* FSDP2 with mixed precision policy

* Ruff

* Clean up and logging

* Try to get loggers to behave as we want

* Makes ruff unhappy but works

* Fixed ruff issue

* Fixed problems with multi-node training.

* Fix for interactive/non-DDP runs

* No idea why, but this seems to work so far

Committing simply so it is saved, obviously needs cleanup

* Still works! So which is it memory or the grad scaler?

* Also still works, I now strongly suspect the amp.gradscaler

* This still works, I have no clue anymore why but whatever it works now....

* Enable loading model from absolute paths

* Enable loading for 1 GPU only

* Fix 1 GPU train continue

* Appease ruff

* Fix saving the model more regularly and perf logging

* Fixed problem when training with 2 nodes.

* Fix data loader seed

* Appease ruff

* Shouldn't overwrite with_fsdp like this

* Potential fix for FSDP2 issue

with different ranks using different model parts

* Fix loss scaling and logging of dummy data loss

* Clean up

* Appease ruff

* Start implementing EMA, works for 1 GPU

* Make EMA model multi-gpu compatible

…