[Bugfix] Fix CUDA/CPU mismatch in threaded training #6245
Conversation
Updated tensor creation in torch_policy.py and utils.py to explicitly use the default device, ensuring consistency across devices (CPU/GPU). Also set torch config in TrainerController to use the default device. This improves device management and prevents potential device mismatch errors.
Updated tensor creation in optimizers, reward providers, and network normalization to explicitly use the configured default_device. Removed redundant set_torch_config call in trainer_controller to avoid interfering with PyTorch's global device context. These changes improve device consistency and prevent device mismatch errors in multi-threaded or multi-device training scenarios.
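Below is a minimal sketch of the explicit-device pattern described above. The helper name `default_device()` matches the utility referenced in the PR, but the body shown here is an illustrative stand-in, not the project's actual implementation:

```python
# Sketch of the explicit-device pattern (illustrative, not the exact ml-agents source).
from typing import List
import numpy as np
import torch


def default_device() -> torch.device:
    # Stand-in for the trainer's configured device accessor.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def list_to_tensor(
    ndarray_list: List[np.ndarray], dtype: torch.dtype = torch.float32
) -> torch.Tensor:
    # Before the fix, torch.as_tensor(...) inherited whatever device the calling
    # thread happened to default to; now the device is passed explicitly.
    return torch.as_tensor(
        np.asanyarray(ndarray_list), dtype=dtype, device=default_device()
    )
```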
Pull Request Overview
This PR fixes CUDA/CPU device mismatch errors that occur during threaded training on Windows by making tensor device placement explicit and consistent across the codebase.
- Ensures all tensor creation operations use default_device() to maintain device consistency
- Fixes issues where tensors were implicitly created on different devices in multi-threaded environments
- Updates utilities, networks, optimizers, and reward providers to use explicit device placement
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| torch_entities/utils.py | Updates tensor creation utilities to allocate on the default device |
| torch_entities/networks.py | Fixes device placement in vector input normalization |
| torch_entities/components/reward_providers/gail_reward_provider.py | Ensures GAIL reward provider tensors use the correct device |
| policy/torch_policy.py | Makes device placement explicit for masks, observations, and RNN memories |
| poca/optimizer_torch.py | Initializes zero RNN memories on the default device |
| optimizer/torch_optimizer.py | Fixes RNN memory initialization device placement |
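As a hedged illustration of the RNN-memory change listed for poca/optimizer_torch.py and optimizer/torch_optimizer.py, the zero memories are allocated on the configured device rather than the calling thread's implicit default. The shape and `memory_size` parameter below are assumptions for the sketch:

```python
import torch


def make_initial_memories(batch_size: int, memory_size: int) -> torch.Tensor:
    # Allocating the zero memories on the configured device keeps them on the
    # same device as the network parameters, even when this runs in a worker
    # thread whose implicit default would otherwise be the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.zeros((1, batch_size, memory_size), device=device)
```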
Hi, please fix the black reformatting issue and then you should be good to merge the PR. Thanks!
Proposed change(s)
On Windows, running with `threaded: true` produced "tensors on different devices" errors. Threaded trainers create tensors from multiple threads, so implicit CPU allocations (or per-thread changes to the default device) led to CPU/CUDA mixing and PyTorch mode-stack corruption. Making device placement explicit and consistent prevents both classes of errors.
- Create action masks, observations, and RNN memories on default_device() during inference.
- ModelUtils.list_to_tensor() and list_to_tensor_list() now allocate on default_device().
- VectorInput.update_normalization() now uses device-correct tensors (see the sketch after this list).
- Initialize zero RNN memories on default_device().
- Ensure DONE tensors, epsilons, and accumulators allocate on the correct device.
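A minimal sketch of the update_normalization() device handling mentioned above. Buffer names mirror the PR description, but the exact ml-agents internals may differ and the running-statistics math is simplified:

```python
import torch


class VectorInputSketch(torch.nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(input_size))
        self.register_buffer("normalization_steps", torch.tensor(1.0))

    def update_normalization(self, inputs: torch.Tensor) -> None:
        # The device fix: move the incoming batch onto the buffers' device
        # rather than relying on the calling thread's implicit default device.
        inputs = inputs.to(self.running_mean.device)
        batch_mean = inputs.mean(dim=0)
        # Incremental running-mean update (variance update omitted; the device
        # handling is the point of this sketch).
        total = float(self.normalization_steps + inputs.shape[0])
        self.running_mean.lerp_(batch_mean, inputs.shape[0] / total)
        self.normalization_steps += inputs.shape[0]
```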
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
Types of change(s)