
Refactoring changes in the training routine #243

Merged
wiederm merged 37 commits into main from ref-training
Aug 31, 2024
@wiederm
Member

@wiederm wiederm commented Aug 22, 2024

Description

This PR introduces a few improvements/changes to the training routine and its tests:

  • improve logging messages
  • improve the fixture in conftest so that a single batch can be obtained from different datasets and with different batch sizes without hardcoding any of the values
  • provide training toml file that sets parameters for force training
  • the Error classes now return the error per molecule instead of the mean error, which makes it easier to log and to use torchmetrics
  • the loss is now also tracked with torchmetrics instead of a custom torch.nn.Module
  • allow caching of the processed dataset. If the .pt file is present, we assume that another process has already processed the dataset and, unless cache regeneration is requested, we use this file and skip the prepare_dataset operation.
  • lock the prepare_dataset method: only a single instance should execute this method per dataset.
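The caching and locking behaviour described in the last two items can be sketched as follows. This is a minimal, standard-library-only illustration, not the code in this PR: `file_lock`, `load_or_prepare`, the `regenerate_cache` flag and the `.lock` suffix are hypothetical names, and `pickle` stands in for whatever format the real `.pt` cache file uses.

```python
import os
import pickle
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, poll_interval=0.05):
    """Hold an exclusive lock by atomically creating `lock_path` (hypothetical helper)."""
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process at a time can create the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_interval)  # another process holds the lock
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def load_or_prepare(cache_path, process, regenerate_cache=False):
    """Return the cached dataset if present, otherwise build and cache it."""
    with file_lock(cache_path + ".lock"):
        if os.path.exists(cache_path) and not regenerate_cache:
            # Another process (or an earlier run) already did the work.
            with open(cache_path, "rb") as fh:
                return pickle.load(fh)
        data = process()  # the expensive preparation step
        with open(cache_path, "wb") as fh:
            pickle.dump(data, fh)
        return data
```

Because the cache check happens inside the lock, a second process that waited for the lock finds the finished file and loads it instead of repeating the preparation. A production version would also need to handle stale lock files left behind by a crashed process, which this sketch does not attempt.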

This PR also includes bugfixes:

  • fix a bug in the force loss calculation
  • only retain/create the graph for the force calculation when in training mode (not in evaluation or test mode)
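The second fix can be illustrated with a small PyTorch sketch. It is an assumption about the general pattern rather than the modelforge code: `energy_and_forces` and the quadratic toy energy are hypothetical, and only `torch.autograd.grad` with its `create_graph` and `retain_graph` arguments is real API.

```python
import torch


def energy_and_forces(positions, training):
    """Toy example: forces as the negative gradient of a stand-in energy."""
    # Work on a fresh leaf tensor so gradients with respect to positions are defined.
    positions = positions.detach().requires_grad_(True)
    energy = (positions ** 2).sum()  # placeholder for a model's predicted energy

    # During training the force enters the loss, so the gradient itself must stay
    # differentiable (create_graph) and the energy graph must be kept (retain_graph).
    # In evaluation or test mode neither is needed, which saves memory.
    (grad,) = torch.autograd.grad(
        energy,
        positions,
        create_graph=training,
        retain_graph=training,
    )
    return energy, -grad
```

With `training=True` the returned forces carry a `grad_fn`, so a force loss can be backpropagated into model parameters. With `training=False` the forces are plain tensors and the intermediate graph is freed right after the gradient is computed.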

Status

  • Ready to go

@wiederm wiederm self-assigned this Aug 27, 2024
@wiederm wiederm requested a review from chrisiacovella August 27, 2024 07:27
@wiederm wiederm added the bug (Something isn't working) and refactoring (Improve the quality of the code without functional changes) labels Aug 27, 2024
@wiederm wiederm changed the title from "Small refactoring changes in the training routine" to "Refactoring changes in the training routine" Aug 27, 2024
@chrisiacovella
Member

The CI was failing because it ran out of memory (the CI runner is capped at 16 GB). See #246. The code will skip training SAKE with forces.

Comment thread modelforge/train/training.py
Comment thread scripts/config.toml Outdated
@wiederm wiederm merged commit 769d6a8 into main Aug 31, 2024
@wiederm wiederm deleted the ref-training branch August 31, 2024 08:36
Labels

bug (Something isn't working), refactoring (Improve the quality of the code without functional changes)

2 participants