Skip to content

ArrayBuffer Lazy-assignment for GPU Context#5076

Merged
connorjward merged 36 commits into
connorjward/pyop3from
SamSJackson/pyop3-outer
May 13, 2026
Merged

ArrayBuffer Lazy-assignment for GPU Context#5076
connorjward merged 36 commits into
connorjward/pyop3from
SamSJackson/pyop3-outer

Conversation

@SamSJackson
Copy link
Copy Markdown

Description

This PR introduces a context manager to allow dynamic assignment of PyOP3 arrays on a given device.
Devices are defined within the new device.py module, where internal array management is also contained.

Key modifications:

  • Introduce device.py module to represent offloading devices as offloading context manager
  • Change buffer.py to lazily-evaluate data with respect to the current context (i.e. host or offloading device)
    • _lazy_data is now a dictionary object, mapping between Device objects and respective arrays.
    • _data property lazily-evaluates to appropriate data for the given context. If the data is not up-to-date, as per state property, the data is copied.
  • All data is copied lazily. Entering and exiting the context window will not automatically transfer data between devices.
  • Buffers are maintained between context windows - exiting a context window will not release the memory on the device.
  • All device-specific array management is kept within device.py so buffer.py can remain device/gpu-agnostic - apart from some type-hinting, i.e. cp.ndarray.

Notable issues:

  • CuPy has no support for writeable flag. It will throw away the flag when converting from NumPy objects (and it will not return if converted back)
  • Defaultdict is used for our state dictionary. Previously discussed that neither Connor or I like this but I cannot think of another approach.
    • Main issue: If an array is assigned, it does not have any knowledge of other devices beyond that in its current context. As such, there is also no state counter assigned to it. Hence, it is difficult to allow the user to check the buffer on other devices, even if they are initialised. Example can be seen in ./pyop3_gpu_demo.py from asserts before entering context manager.
    • Potential approaches:
      • Dictionary wrapper so we can create a more strict defaultdict (bit extreme and needless maintenance but should work).
      • Do not allow users to check state of array on device for which it has not entered context.
    • Open to any advice or solutions for this.

@SamSJackson SamSJackson requested a review from connorjward May 4, 2026 10:38
Copy link
Copy Markdown
Contributor

@connorjward connorjward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been strict but in general this is fantastic, thank you

Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py
Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/device.py Outdated
Comment thread pyop3/device.py Outdated
Comment thread pyop3/device.py Outdated
Comment thread pyop3/device.py Outdated
Comment thread pyop3/buffer.py Outdated
SamSJackson added 6 commits May 4, 2026 14:04
- defaultdict gives -1 as default value if device object does not exist
- constant property was lost between cupy/numpy conversions
- fixed by passing kwarg that is disregarded for cupy but used for numpy
- using @Property for last_updated_device, known by state, does not need
  to be variable.
- duplicate method only copies most up-to-date copy and non-copy
  duplicate only copies for current device
- initialisation adds None as optional for data input
- v3.24.5 was failing to compile petsc4py due to PETSc no support for
  PCPatchSetComputeFunctionExteriorFacets
@connorjward
Copy link
Copy Markdown
Contributor

For any trivial changes please go ahead and resolve them. Saves me cross checking things.

SamSJackson added 3 commits May 6, 2026 11:28
- Wrappped all SF comms with on_host decorator to ensure happens on host
    - This increases number of copies but all MPI happening on host atm
- Executable requires receiving a pointer to array. Providing a
  conditional singledispatch register - in case user does not have cupy.
SamSJackson added 3 commits May 6, 2026 18:07
bug regarding:
    if an array is modified as the first action when offloaded:
    1. state for new device is updated and incremented (as
       modifying)
    2. then buffer attempts to sync from most up-to-date device
    3. this may be current device as state was incremented first
    4. then device tries to copy non-existent array to itself - bad
solution:
    adapt record-modified decorator to increment state after wrapped
    function
    - simple try-finally wrapped trick
basic unit tests introduced for gpu context as ground truth

bug: if AxisTree for FunctionSpace (or sim. object) not cached
until on device, it attempts to `compile` - this is not implemented for
GPU yet.

fix: this has been fixed with some patches for now, to be removed.
- `firedrake/functionspaceimpl::make_dat` is wrapped in on_host
- `pyop3/tree/axis_tree/tree.py::_buffer_indices` is wrapped in on_host
- `pyop3/buffer.py` has necessary sync in duplicate (in scenario that
  user assigns and immediately copies)

With current fix, this does mean that if the AxisTree properties are not cached,
the assign operation will happen on CPU - consequent calls work on GPU.
@SamSJackson
Copy link
Copy Markdown
Author

Update on progress:

Unit test cases have been introduced - also uncovered some bugs.

Bug: AxisTree uses cached properties. If cached properties are not available whilst in GPU, this will cause a Segfault.
Generation of cached properties requires compile strategy for arrays, not yet implemented for devices.

Fix:
The fix is to force some portions to take place on host until compile strategy is available.
The following functions have been wrapped in on_host decorators:

  • firedrake/functionspaceimpl.py::make_dat
  • pyop3/tree/axis_tree/tree.py::_buffer_indices

This means that if properties are not cached, assign takes place on host. If available, functions as normal.
As small side-effect, buffer.py::duplicate has a manual sync because current device would, correctly, suggest GPU but it was temporarily forced through host. Again, removable once compile works.

Also introduced a error catch in insn/exec.py to avoid the segfault taking place and enabling easier traceback.

Otherwise, I am quite happy with things.

@SamSJackson SamSJackson marked this pull request as ready for review May 12, 2026 12:29
Copy link
Copy Markdown
Contributor

@connorjward connorjward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close now

Comment thread pyop3/buffer.py Outdated
try:
return func(self, *args, **kwargs)
finally:
self.inc_state()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Seems like it might be a good idea but want to make it's fully thought out.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to solve a previous bug. Essentially the equivalent of a C loop doing ++state vs state++.

Before, if an array was generated on host and then the first action on GPU is to modify, the state is updated before the sync happens. Then the sync would happen and the state update would be forgotten.

Skipped this problem by making the state update happen afterwards.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is telling us that this decorator is being applied in the wrong place. I think we want to have

self._data = ...  # i.e. we modify the data on *exactly* this line
self.inc_state()

I.e. we want to have the inc_state happen at the exact point that we actually modify the data. If we decouple them then I think we're just going to get very confused.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is telling us that this decorator is being applied in the wrong place. I think we want to have

self._data = ...  # i.e. we modify the data on *exactly* this line
self.inc_state()

I.e. we want to have the inc_state happen at the exact point that we actually modify the data. If we decouple them then I think we're just going to get very confused.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be hard to be more precise with a decorator pattern, where it can only happen at start or end.

The record_modified wraps around only data_wo and data_rw. AIUI, we are not explicitly modifying the data ourselves, we serve a modifiable object to the user. As such, it would be hard to increment after an exact line - as we do not have one.

Although, current implementation is almost exactly as you want, just hidden in the wrapper.
User calls data_rw/data_wo -> self._data property is called and synced -> record_modified wrapper updates state -> returns self._data

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works well, I think getting rid of decorator is necessary.

Do you want me to deprecate ._data or should we be fine if it is internal?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just delete _data. I think I use it in a few places to get access to the numpy array without tweaking the state (pyop3/insn/exec.py in particular) but I think you can just directly replace it with get_array, the intent should be available.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually ._data is used quite a lot internally in buffer, where intent is not explicitly available.

In properties like size or methods like reduce_leaves_to_roots_begin.
The _data calls can be replaced by get_array("ro") but don't think we want hanging literals like that.

Can't just use .data_ro as it is recursive through reduce_leaves_to_roots. We could default get_array() for intent as "ro"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right thing to do is probably replace them with self._lazy_data[current_device]. If we are doing things internal to the class I think we are allowed to work directly on the data - the state tracking is for managing how people interact with the ArrayBuffer.

But defaulting get_array to ro also seems sensible.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, just pushed changes for this. Requires a review as quite a few lines changed.

Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py Outdated
Comment thread pyop3/buffer.py Outdated
Comment thread tests/pyop3/unit/test_gpu_context.py Outdated
Comment thread tests/pyop3/unit/test_gpu_context.py Outdated
Comment thread tests/pyop3/unit/test_gpu_context.py Outdated
Comment thread pyop3_gpu_demo.py
Comment thread tests/pyop3/unit/test_gpu_context.py
Copy link
Copy Markdown
Contributor

@connorjward connorjward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close now

@connorjward connorjward merged commit 389f9f8 into connorjward/pyop3 May 13, 2026
3 of 7 checks passed
@connorjward connorjward deleted the SamSJackson/pyop3-outer branch May 13, 2026 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants