Validate across multiple GPUs #1881

Merged
bghira merged 1 commit into main from feature/batch-parallel-validation on Oct 31, 2025

bghira (Owner) commented on Oct 30, 2025

This pull request introduces distributed multi-GPU support for validation, refactors validation prompt handling, and improves serialization of validation results. The changes enable efficient parallel validation across processes and simplify the code for preparing and processing validation prompts.

Distributed Validation & Multi-Process Utilities

  • Added new utility functions to multi_process.py for broadcasting objects (broadcast_object_from_main), splitting data across processes (split_across_processes), and gathering results (gather_across_processes), enabling distributed validation workflows.
  • Integrated distributed broadcast of validation prompt metadata in trainer.py so all processes receive consistent validation data.
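
The split-and-gather pattern described above can be sketched in plain Python. This is a minimal illustration, not the code from multi_process.py: the helper names `shard_for_rank` and `merge_shards`, their signatures, and the strided assignment of items to ranks are all assumptions, and the ranks are simulated in a loop rather than run as separate processes.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Sequence[T], num_processes: int, process_index: int) -> List[T]:
    """Return the strided slice of `items` owned by one process (hypothetical helper)."""
    if not 0 <= process_index < num_processes:
        raise ValueError("process_index must be in [0, num_processes)")
    return list(items[process_index::num_processes])


def merge_shards(shards: Sequence[Sequence[T]]) -> List[T]:
    """Interleave per-rank shards back into the original item order (hypothetical helper)."""
    merged: List[T] = []
    longest = max((len(shard) for shard in shards), default=0)
    for position in range(longest):
        for shard in shards:
            if position < len(shard):
                merged.append(shard[position])
    return merged


# Simulate two ranks locally; a real run would have one process per GPU.
prompts = ["a cat", "a dog", "a fox", "an owl", "a bee"]
world_size = 2
shards = [shard_for_rank(prompts, world_size, rank) for rank in range(world_size)]
restored = merge_shards(shards)
```

In an actual distributed run the per-rank shards would be exchanged through a collective such as `torch.distributed.all_gather_object` instead of a local list, and the shared input would be sent from the main process with something like `torch.distributed.broadcast_object_list`.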

Validation Prompt Handling Refactor

  • Introduced _ValidationWorkItem dataclass and refactored prompt preparation in validation.py, making prompt handling more robust and extensible.
  • Replaced manual prompt unpacking with structured work items and distributed splitting in the validation process, supporting both single and multi-GPU modes.
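
To illustrate what a structured work item buys over unpacking tuples by hand, here is a small sketch. It does not reproduce `_ValidationWorkItem`: the class name `WorkItem`, its fields (`index`, `shortname`, `prompt`), and the `build_work_items` helper are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WorkItem:
    """One validation prompt plus the metadata needed to place its result (hypothetical fields)."""

    index: int       # position in the original prompt list, used to restore order
    shortname: str   # short identifier for logs and output file names
    prompt: str      # full prompt text


def build_work_items(prompts: Dict[str, str]) -> List[WorkItem]:
    """Turn a shortname -> prompt mapping into ordered work items."""
    return [
        WorkItem(index=i, shortname=name, prompt=text)
        for i, (name, text) in enumerate(prompts.items())
    ]


items = build_work_items({"cat": "a photo of a cat", "dog": "a photo of a dog"})

# Results can arrive out of order from several processes; the stored index restores it.
shuffled = [items[1], items[0]]
ordered = sorted(shuffled, key=lambda item: item.index)
```

Carrying the original index on each item means the main process can reorder gathered results without relying on which rank produced them.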

Serialization & Aggregation of Validation Results

  • Implemented serialization/deserialization methods for validation results, allowing images and other media to be safely transferred between processes and aggregated on the main process.
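
A round trip of this kind can be sketched with the standard library alone. The example assumes each result's media has already been encoded to bytes (for instance a PNG written to an in-memory buffer); the `Result` class and the `serialize_result` / `deserialize_result` names are hypothetical and are not the methods added in this pull request.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class Result:
    """A validation output reduced to transferable parts (hypothetical structure)."""

    shortname: str
    media_bytes: bytes  # assumed to be already-encoded media, e.g. PNG bytes


def serialize_result(result: Result) -> str:
    """Encode a result as a JSON string that is safe to send between processes."""
    payload = {
        "shortname": result.shortname,
        "media": base64.b64encode(result.media_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def deserialize_result(text: str) -> Result:
    """Rebuild a result from the JSON string produced by serialize_result."""
    payload = json.loads(text)
    return Result(
        shortname=payload["shortname"],
        media_bytes=base64.b64decode(payload["media"]),
    )


original = Result(shortname="cat", media_bytes=b"\x89PNG\r\n\x1a\n fake image bytes")
wire = serialize_result(original)
restored = deserialize_result(wire)
```

Reducing each result to plain strings and bytes before transfer avoids depending on whether a library's image or tensor objects can be pickled across processes.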

Improved Validation Control & Logging

  • Added logic to select validation execution mode (single-GPU or batch-parallel) based on configuration and accelerator state, ensuring correct behavior in distributed environments.
  • Enhanced logging and webhook notification logic to avoid duplicate notifications and provide clear progress updates in distributed runs.
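
The mode selection and notification gating might look roughly like the sketch below. The mode strings, the fallback rule, and both function names are assumptions made for illustration; the pull request description does not spell out the actual configuration values or conditions.

```python
VALID_MODES = {"single-gpu", "batch-parallel"}  # hypothetical mode names


def select_validation_mode(requested_mode: str, num_processes: int) -> str:
    """Pick an execution mode from the configured value and the process count."""
    if requested_mode not in VALID_MODES:
        raise ValueError(f"unknown validation mode: {requested_mode!r}")
    # Assumption: with a single process there is nothing to parallelise,
    # so fall back to single-GPU regardless of the configured mode.
    if num_processes <= 1:
        return "single-gpu"
    return requested_mode


def should_notify(process_index: int) -> bool:
    """Only the main process (rank 0) sends webhook notifications."""
    return process_index == 0


mode_on_one_gpu = select_validation_mode("batch-parallel", num_processes=1)
mode_on_four_gpus = select_validation_mode("batch-parallel", num_processes=4)
notifying_ranks = [rank for rank in range(4) if should_notify(rank)]
```

Gating notifications on the main process is one simple way to ensure a run with several GPUs reports each validation event only once.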

bghira merged commit 2a983ce into main on Oct 31, 2025
1 check passed
bghira deleted the feature/batch-parallel-validation branch on October 31, 2025 03:07