@bghira bghira commented Feb 1, 2026

Closes #2523

This pull request fixes how validation steps are determined when training uses a dynamic epoch schedule. The main change is new logic that computes the step within the current epoch even when the number of steps per epoch varies due to dataset scheduling, so validation triggers at the intended steps rather than drifting after the first epoch. New and updated tests verify the behavior.

Validation logic improvements:

  • Added a new _epoch_relative_step method in validation.py to accurately compute the current step within an epoch, accounting for dynamic epoch_batches_schedule and gradient_accumulation_steps settings. This method helps ensure validation occurs at the correct epoch boundaries.
  • Updated should_perform_intermediary_validation to use the new _epoch_relative_step method, improving correctness when epochs have varying lengths.
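
The idea behind an epoch-relative step can be sketched as follows. This is a minimal illustration, not the code from validation.py: the function name `epoch_relative_step`, its return shape, the assumption that `epoch_batches_schedule` is a list of per-epoch batch counts, and the assumption that the last entry repeats once the schedule runs out are all hypothetical.

```python
import math


def epoch_relative_step(global_step, epoch_batches_schedule, gradient_accumulation_steps=1):
    """Map a 1-based global optimizer step to (epoch, step_in_epoch, steps_in_epoch).

    Assumptions (not taken from the PR): the schedule is a non-empty list of
    batch counts per epoch, and one optimizer step consumes
    `gradient_accumulation_steps` batches, rounded up per epoch.
    """
    if global_step < 1:
        raise ValueError("global_step must be >= 1")
    if not epoch_batches_schedule:
        raise ValueError("epoch_batches_schedule must not be empty")

    remaining = global_step
    epoch = 0
    for epoch, batches in enumerate(epoch_batches_schedule, start=1):
        # Optimizer steps in this epoch, at least one.
        steps = max(1, math.ceil(batches / gradient_accumulation_steps))
        if remaining <= steps:
            return epoch, remaining, steps
        remaining -= steps

    # Past the end of the schedule: assume the last epoch length repeats.
    last_steps = max(1, math.ceil(epoch_batches_schedule[-1] / gradient_accumulation_steps))
    extra_epochs, offset = divmod(remaining - 1, last_steps)
    return epoch + 1 + extra_epochs, offset + 1, last_steps


# With 8, 4, and 6 batches per epoch and accumulation of 2, the epochs
# contain 4, 2, and 3 optimizer steps, so global step 6 closes epoch 2.
epoch, step_in_epoch, steps_in_epoch = epoch_relative_step(6, [8, 4, 6], 2)
at_epoch_end = step_in_epoch == steps_in_epoch
```

Under these assumptions, an end-of-epoch validation check compares `step_in_epoch` with `steps_in_epoch` instead of taking the global step modulo a fixed epoch length, which is what goes wrong when epoch lengths differ.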

Test enhancements:

  • Updated the test_epoch_2_validation_at_correct_step test to include a dynamic epoch_batches_schedule and gradient_accumulation_steps, ensuring it covers the new logic.
  • Added a new test, test_validation_uses_epoch_start_step_with_schedule, to verify that validation triggers correctly at the end of epochs when using dynamic schedules, covering scenarios from related issues.
  • Modified test_validation_aligns_with_checkpoints to include epoch_batches_schedule in its setup, further validating the updated logic.

@bghira bghira merged commit 945c9ea into main Feb 1, 2026
2 checks passed
@bghira bghira deleted the bugfix/2523 branch February 1, 2026 12:29
Successfully merging this pull request may close these issues.

[bug, follow up2] Validation image generation and eval doesn't respect epoch change #2508
