feat(synthetic-sf): handle GPU memory contention in parallel batch processing #241

@coderabbitai

Summary

When process_batch in src/sampleworks/eval/generate_synthetic_sf.py is run with n_jobs > 1 and a CUDA device, multiple loky worker processes each initialise their own CUDA context, competing for GPU memory. This can cause OOM errors or severe performance degradation.

Background

Raised during review of PR #234 (see the review comment on that PR).

Even with multiple workers, jobs likely converge on one or a small number of GPUs, so a more targeted fix may involve specifying the device more precisely per worker rather than simply forcing n_jobs=1.
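One way to picture per-worker device targeting is to decide the device for each task up front and pass it in as an argument, since joblib/loky does not tell a task which worker it will land on. The sketch below is illustrative only: `assign_devices` is a hypothetical helper, not something that exists in the repository, and it assumes the GPU count is already known (for example from `torch.cuda.device_count()`).

```python
def assign_devices(n_tasks: int, n_gpus: int) -> list[str]:
    """Return one device string per task, cycling over the available GPUs.

    Hypothetical helper for illustration; falls back to CPU when no GPU
    is available.
    """
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    if n_gpus <= 0:
        return ["cpu"] * n_tasks
    # Task i is pinned to GPU (i mod n_gpus), i.e. plain round-robin.
    return [f"cuda:{i % n_gpus}" for i in range(n_tasks)]


if __name__ == "__main__":
    print(assign_devices(5, 2))
```

Round-robin by task spreads the load across devices but does not by itself guarantee that two concurrently running tasks never share a GPU, because tasks finish at different times. Keeping n_jobs at or below the GPU count reduces that overlap.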

Suggested approaches

  • Detect device.type == "cuda" and, depending on available memory, either warn and cap n_jobs or assign each worker an explicit GPU device index.
  • Consider exposing a per-worker device assignment strategy.
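The "warn and cap n_jobs" option could look roughly like the following. This is a minimal sketch under stated assumptions: `cap_n_jobs` and its parameters are hypothetical names, the per-worker memory estimate has to come from the caller, and the free-memory figure is passed in rather than queried (in practice it could be read with `torch.cuda.mem_get_info()`, which returns free and total bytes).

```python
import warnings


def cap_n_jobs(
    n_jobs: int,
    device_type: str,
    free_bytes: int,
    est_bytes_per_worker: int,
) -> int:
    """Limit n_jobs to the number of workers that fit in free GPU memory.

    Hypothetical helper for illustration. Non-CUDA devices and serial
    runs are returned unchanged.
    """
    if est_bytes_per_worker <= 0:
        raise ValueError("est_bytes_per_worker must be positive")
    if device_type != "cuda" or n_jobs <= 1:
        return n_jobs
    # Always allow at least one worker, even if the estimate exceeds free memory.
    fit = max(1, free_bytes // est_bytes_per_worker)
    if fit < n_jobs:
        warnings.warn(
            f"n_jobs={n_jobs} may exceed GPU memory; capping to {fit}",
            RuntimeWarning,
            stacklevel=2,
        )
        return fit
    return n_jobs


if __name__ == "__main__":
    gib = 2**30
    print(cap_n_jobs(8, "cuda", free_bytes=10 * gib, est_bytes_per_worker=4 * gib))
```

The estimate is deliberately left to the caller: each CUDA context also carries a fixed overhead of its own, so a conservative per-worker figure is safer than one based on tensor sizes alone.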

Requested by

@marcuscollins
