
Issue: CUDA workers do not release resources after each evaluate_program when using spawn start method #330

@yuxuan-z19

Description


Summary

When using PyTorch with multiprocessing in OpenEvolve (e.g. KernelBench's eval_kernel_against_ref), CUDA requires the spawn start method to avoid the well-known "Cannot re-initialize CUDA in forked subprocess" error (see the PyTorch docs).

However, after setting the start method in cli.py:

# cli.py
import asyncio
import multiprocessing as mp


def main() -> int:
    """
    Main entry point

    Returns:
        Exit code
    """

    # BUGFIX: allow CUDA in mp
    mp.set_start_method("spawn", force=True)

    return asyncio.run(main_async())

I observed that worker processes do not release CUDA resources after each evaluate_program call. This causes GPU memory to continuously grow and never be freed until the entire evolution run stops.

This is consistent with PyTorch’s documentation and with general Python multiprocessing behavior: CUDA contexts live for the lifetime of the worker process, so they are not released until the worker exits.
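The same per-process accumulation can be demonstrated without a GPU. In this stdlib-only sketch (illustrative, not OpenEvolve code), a module-level counter stands in for the CUDA context: a default ProcessPoolExecutor keeps its worker alive across tasks, so the counter keeps growing.

```python
# Sketch (stdlib only): workers in a default ProcessPoolExecutor are
# long-lived, so per-process state (here a counter, in the real scenario
# a CUDA context) survives from one task to the next.
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

_calls = 0  # per-process state, standing in for a CUDA context


def task(_: int) -> int:
    global _calls
    _calls += 1
    return _calls


def run_in_one_worker(n: int) -> list:
    # spawn, as OpenEvolve needs for CUDA; one worker to make reuse obvious
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as ex:
        return [ex.submit(task, i).result() for i in range(n)]


if __name__ == "__main__":
    print(run_in_one_worker(3))  # [1, 2, 3]: state accumulates in the worker
```

With a real CUDA workload, each increment of that counter corresponds to more context/allocator state pinned inside the worker process.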

What I tried

To force worker processes to restart after each task (so that CUDA contexts are destroyed), I modified the executor initialization in process_parallel.py:

# process_parallel.py
def start(self) -> None:
    self.executor = ProcessPoolExecutor(
        max_workers=self.num_workers,
        initializer=_worker_init,
        initargs=(config_dict, self.evaluation_file, current_env),
        # BUGFIX: allow CUDA in mp -- recycle each worker after one task
        # (requires Python >= 3.11; incompatible with the "fork" start method)
        max_tasks_per_child=1,
    )

This ensures that each evaluation runs in a fresh process, and GPU memory is properly reclaimed.
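The effect of `max_tasks_per_child=1` can be verified with a standalone stdlib sketch (again illustrative, not OpenEvolve code): by default one worker serves every task, while with the flag each task gets a brand-new PID, so nothing created by one evaluation can outlive it.

```python
# Sketch (stdlib only, Python >= 3.11): compare worker PIDs with and
# without max_tasks_per_child=1.
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


def worker_pid(_: int) -> int:
    return os.getpid()


def pids_per_task(n: int, max_tasks_per_child=None) -> list:
    # max_tasks_per_child is incompatible with the "fork" start method,
    # so use spawn explicitly (matching OpenEvolve's CUDA requirement)
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(
        max_workers=1, mp_context=ctx, max_tasks_per_child=max_tasks_per_child
    ) as ex:
        return [ex.submit(worker_pid, i).result() for i in range(n)]


if __name__ == "__main__":
    print(len(set(pids_per_task(3))))                          # 1: one long-lived worker
    print(len(set(pids_per_task(3, max_tasks_per_child=1))))   # 3: fresh process per task
```

Each fresh process starts with no CUDA context, which is exactly why GPU memory is reclaimed between evaluations.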

Question

Before submitting a PR, I would like to confirm:

Does adding max_tasks_per_child=1 introduce any risks for OpenEvolve’s generality or execution model?

Specifically:

  • Are there parts of OpenEvolve that depend on long-lived worker processes?
  • Is there any state that workers are expected to preserve across tasks?
  • Would forcing a full process respawn per task degrade performance in some scenarios (e.g., expensive model initialization)?
  • Would a user-settable configuration be preferable?
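On the last point, one possible shape (field and class names invented here for illustration; OpenEvolve's actual config classes may differ) would be an optional config field that start() forwards straight to ProcessPoolExecutor, so the default behavior stays unchanged:

```python
# Hypothetical sketch: a user-settable knob for worker recycling.
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluatorConfig:
    num_workers: int = 4
    # None keeps today's long-lived workers; 1 forces a fresh process per
    # evaluation so CUDA contexts are torn down after every task.
    max_tasks_per_child: Optional[int] = None
```

start() would then pass `max_tasks_per_child=self.config.max_tasks_per_child`, letting CUDA users opt in without changing the default for CPU-only evaluations.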

Why this matters

Without recycling workers after each task, CUDA contexts accumulate indefinitely, especially with inline kernels or any operation that initializes PyTorch CUDA state. This leads to potential GPU memory leaks and stale CUDA contexts that are never reset.


If this aligns with the project’s architecture, I am happy to open a PR to add this behavior (or make it configurable).

Labels: enhancement (New feature or request)