Summary
When using PyTorch with multiprocessing in OpenEvolve (e.g. KernelBench's `eval_kernel_against_ref`), CUDA requires the `spawn` start method to avoid the well-known "Cannot re-initialize CUDA in forked subprocess" error (see the PyTorch docs).
However, after setting this in `cli.py`:

```python
# cli.py
import asyncio
import multiprocessing as mp

def main() -> int:
    """
    Main entry point

    Returns:
        Exit code
    """
    # BUGFIX: allow CUDA in multiprocessing workers
    mp.set_start_method("spawn", force=True)
    return asyncio.run(main_async())
```

I observed that worker processes do not release CUDA resources after each `evaluate_program` call. As a result, GPU memory grows continuously and is never freed until the entire evolution run stops.
This is consistent with PyTorch's documentation and with general Python multiprocessing behavior: a CUDA context lives for the lifetime of the worker process, so it is not released until the worker exits.
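This worker-lifetime behavior can be demonstrated without a GPU. In the sketch below (my own minimal example, not OpenEvolve code), a default `ProcessPoolExecutor` reuses the same worker process for every task, which is exactly why per-process state such as a CUDA context survives across evaluations. The `fork` context is used only so the snippet runs at module level; worker lifetime behaves the same under `spawn`.

```python
import os
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def worker_pid(_):
    # Report which OS process executed this task
    return os.getpid()

# "fork" keeps this snippet runnable at top level; OpenEvolve needs "spawn"
# for CUDA, but worker longevity is the same under either start method.
ctx = multiprocessing.get_context("fork")

with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
    pids = list(pool.map(worker_pid, range(3)))

# All three tasks ran in the same long-lived worker process,
# so any per-process state (e.g. a CUDA context) persisted across them.
assert len(set(pids)) == 1
```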
What I tried
To force worker processes to restart after each task (so that CUDA contexts are destroyed), I modified the executor initialization in `process_parallel.py`:

```python
# process_parallel.py
def start(self) -> None:
    self.executor = ProcessPoolExecutor(
        max_workers=self.num_workers,
        initializer=_worker_init,
        initargs=(config_dict, self.evaluation_file, current_env),
        # BUGFIX: recycle each worker after a single task so its CUDA
        # context is destroyed on exit (requires Python 3.11+)
        max_tasks_per_child=1,
    )
```

This ensures that each evaluation runs in a fresh process, so GPU memory is properly reclaimed.
Question
Before submitting a PR, I would like to confirm:
Does adding `max_tasks_per_child=1` introduce any risks for OpenEvolve's generality or execution model?
Specifically:
- Are there parts of OpenEvolve that depend on long-lived worker processes?
- Is there any state that workers are expected to preserve across tasks?
- Would forcing a full process respawn per task degrade performance in some scenarios (e.g., expensive model initialization)?
- Would a user-settable configuration be preferable?
Why this matters
Without recycling workers after each task, CUDA contexts accumulate indefinitely, especially with inline kernels or any operation that initializes PyTorch's CUDA state. This leads to GPU memory leaks and stale CUDA contexts that are never reset.
If this aligns with the project’s architecture, I am happy to open a PR to add this behavior (or make it configurable).