ArrayBuffer Lazy-assignment for GPU Context by SamSJackson · Pull Request #5076 · firedrakeproject/firedrake

SamSJackson · 2026-05-04T10:38:32Z

Description

This PR introduces a context manager to allow dynamic assignment of PyOP3 arrays on a given device.
Devices are defined within the new device.py module, where internal array management is also contained.

Key modifications:

Introduce the device.py module, which represents offloading devices as offloading context managers.
Change buffer.py to lazily evaluate data with respect to the current context (i.e. host or offloading device).
- _lazy_data is now a dictionary mapping Device objects to their respective arrays.
- The _data property lazily evaluates to the appropriate data for the given context. If the data is not up to date, as per the state property, the data is copied.
All data is copied lazily. Entering and exiting the context window does not automatically transfer data between devices.
Buffers are maintained between context windows: exiting a context window does not release the memory on the device.
All device-specific array management is kept within device.py so that buffer.py can remain device/GPU-agnostic, apart from some type hinting (e.g. cp.ndarray).

Notable issues:

CuPy does not support the writeable flag. The flag is discarded when converting from NumPy objects, and it is not restored when converting back.
A defaultdict is used for our state dictionary. As previously discussed, neither Connor nor I like this, but I cannot think of another approach.
- Main issue: when an array is assigned, it has no knowledge of devices other than the one in its current context, so no state counter is assigned for them. Hence it is difficult to let the user check the buffer on other devices, even if they are initialised. An example can be seen in ./pyop3_gpu_demo.py, in the asserts before entering the context manager.
- Potential approaches:
  - Dictionary wrapper so we can create a more strict defaultdict (bit extreme and needless maintenance but should work).
  - Do not allow users to check state of array on device for which it has not entered context.
- Open to any advice or solutions for this.
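
For discussion, the "dictionary wrapper" approach listed above could look roughly like the sketch below. `StateDict` and its `known_devices` argument are hypothetical and not part of this PR. The sketch relies on `dict.__missing__`, which Python calls from `__getitem__` on dict subclasses when a key is absent.

```python
class StateDict(dict):
    """Hypothetical state mapping: device -> modification counter."""

    UNINITIALISED = -1

    def __init__(self, known_devices):
        super().__init__()
        self._known = set(known_devices)

    def __missing__(self, device):
        # Reached only when `device` has no stored counter.
        if device not in self._known:
            raise KeyError(f"unknown device: {device!r}")
        # Report "never initialised" without storing a key, unlike
        # defaultdict, which inserts the default on first lookup.
        return self.UNINITIALISED


state = StateDict({"host", "gpu0"})
state["host"] = 0
```

Note that `dict.get` does not go through `__missing__`, so `state.get("gpu0")` would still return `None` in this sketch.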

connorjward

I've been strict but in general this is fantastic, thank you

connorjward · 2026-05-04T15:31:19Z

For any trivial changes please go ahead and resolve them. Saves me cross checking things.

- Wrapped all SF comms with the on_host decorator to ensure they happen on host.
  - This increases the number of copies, but all MPI happens on host at the moment.
- Executable requires receiving a pointer to the array. Provided a conditional singledispatch register, in case the user does not have cupy.
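
A conditional singledispatch registration of the kind mentioned above might be sketched as follows. `array_pointer` is a hypothetical name, and `array.array` is used as a standard-library stand-in for the host array type (for a NumPy array the address would come from `arr.ctypes.data`). The CuPy branch is only registered if the import succeeds.

```python
import array
from functools import singledispatch

try:
    import cupy as cp  # optional dependency
except ImportError:
    cp = None


@singledispatch
def array_pointer(arr):
    """Return the integer address of the array's underlying memory."""
    raise TypeError(f"no pointer conversion for {type(arr).__name__}")


@array_pointer.register(array.array)
def _(arr):
    # Host stand-in: buffer_info() returns (address, number_of_items).
    address, _length = arr.buffer_info()
    return address


if cp is not None:

    @array_pointer.register(cp.ndarray)
    def _(arr):
        # cupy.ndarray.data is a MemoryPointer; .ptr is the device address.
        return arr.data.ptr


host = array.array("d", [1.0, 2.0])
host_address = array_pointer(host)
```

Keeping the CuPy registration behind the `cp is not None` check means the module still imports on machines without CuPy.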

Bug: if an array is modified as the first action when offloaded:
1. the state for the new device is updated and incremented (as it is being modified)
2. the buffer then attempts to sync from the most up-to-date device
3. this may be the current device, as its state was incremented first
4. the device then tries to copy a non-existent array to itself

- bad solution: adapt the record-modified decorator to increment state after the wrapped function
- simple try-finally wrapped trick

Basic unit tests introduced for the GPU context as ground truth.

Bug: if the AxisTree for a FunctionSpace (or similar object) is not cached until on device, it attempts to `compile` - this is not implemented for GPU yet.

Fix: this has been fixed with some patches for now, to be removed.
- `firedrake/functionspaceimpl::make_dat` is wrapped in on_host
- `pyop3/tree/axis_tree/tree.py::_buffer_indices` is wrapped in on_host
- `pyop3/buffer.py` has a necessary sync in duplicate (for the scenario where the user assigns and immediately copies)

With the current fix, this does mean that if the AxisTree properties are not cached, the assign operation will happen on CPU - subsequent calls work on GPU.

SamSJackson · 2026-05-12T12:21:23Z

Update on progress:

Unit test cases have been introduced - also uncovered some bugs.

Bug: AxisTree uses cached properties. If the cached properties are not available while on the GPU, this causes a segfault.
Generating the cached properties requires a compile strategy for arrays, which is not yet implemented for devices.

Fix:
The fix is to force some portions to take place on host until compile strategy is available.
The following functions have been wrapped in on_host decorators:

firedrake/functionspaceimpl.py::make_dat
pyop3/tree/axis_tree/tree.py::_buffer_indices

This means that if the properties are not cached, assign takes place on host. If they are available, it functions as normal.
As a small side effect, buffer.py::duplicate has a manual sync, because the current device would correctly report GPU while execution was temporarily forced through host. Again, this is removable once compile works.

Also introduced an error catch in insn/exec.py to avoid the segfault and make the traceback easier to follow.

Otherwise, I am quite happy with things.

connorjward

Looks close now

connorjward · 2026-05-12T12:43:01Z

+        try:
+            return func(self, *args, **kwargs)
+        finally:
+            self.inc_state()


Why this change? Seems like it might be a good idea but I want to make sure it's fully thought out.

It was to solve a previous bug. Essentially the equivalent of a C loop doing ++state vs state++.

Before, if an array was generated on host and the first action on GPU was to modify it, the state was updated before the sync happened. The sync would then happen and the state update would be forgotten.

Skipped this problem by making the state update happen afterwards.
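
The ordering can be seen in a small standalone sketch built around the try/finally shown in the diff above. The `Buffer` class and its `seen` list are hypothetical and exist only to record which state value the wrapped accessor observes.

```python
from functools import wraps


def record_modified(func):
    """Increment the state only after the wrapped accessor has run."""

    @wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        finally:
            self.inc_state()

    return wrapper


class Buffer:
    def __init__(self):
        self.state = 0
        self.seen = []  # state values observed while the accessor runs

    def inc_state(self):
        self.state += 1

    @record_modified
    def data_rw(self):
        # Stand-in for the sync: it sees the state before the increment.
        self.seen.append(self.state)
        return "array"


buf = Buffer()
result = buf.data_rw()
```

One consequence of using `finally` is that the state is also incremented if the wrapped accessor raises.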

I wonder if this is telling us that this decorator is being applied in the wrong place. I think we want to have

    self._data = ...  # i.e. we modify the data on *exactly* this line
    self.inc_state()

I.e. we want to have the inc_state happen at the exact point that we actually modify the data. If we decouple them then I think we're just going to get very confused.

I think it would be hard to be more precise with a decorator pattern, where it can only happen at start or end.

The record_modified wraps around only data_wo and data_rw. AIUI, we are not explicitly modifying the data ourselves, we serve a modifiable object to the user. As such, it would be hard to increment after an exact line - as we do not have one.

Although, current implementation is almost exactly as you want, just hidden in the wrapper.
User calls data_rw/data_wo -> self._data property is called and synced -> record_modified wrapper updates state -> returns self._data

This works well, I think getting rid of decorator is necessary.

Do you want me to deprecate ._data or should we be fine if it is internal?

We can just delete _data. I think I use it in a few places to get access to the numpy array without tweaking the state (pyop3/insn/exec.py in particular) but I think you can just directly replace it with get_array, the intent should be available.

Actually ._data is used quite a lot internally in buffer, where intent is not explicitly available.

In properties like size or methods like reduce_leaves_to_roots_begin.
The _data calls can be replaced by get_array("ro") but don't think we want hanging literals like that.

Can't just use .data_ro as it is recursive through reduce_leaves_to_roots. We could default get_array() for intent as "ro"?

I think the right thing to do is probably replace them with self._lazy_data[current_device]. If we are doing things internal to the class I think we are allowed to work directly on the data - the state tracking is for managing how people interact with the ArrayBuffer.

But defaulting get_array to ro also seems sensible.
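
A rough sketch of what defaulting the intent to "ro" could look like is given below. `ArrayBufferSketch` is hypothetical and reduces the buffer to intent handling only; a list stands in for the array, and the intent strings are assumed to be "ro", "rw" and "wo" to match the data_ro/data_rw/data_wo accessors.

```python
class ArrayBufferSketch:
    """Hypothetical reduction of the buffer to show intent handling only."""

    def __init__(self, data):
        self._array = list(data)
        self.state = 0

    def get_array(self, intent="ro"):
        if intent not in {"ro", "rw", "wo"}:
            raise ValueError(f"unknown intent: {intent!r}")
        array = self._array  # a real buffer would sync to the current device here
        if intent != "ro":
            self.state += 1  # the caller may modify the returned array
        return array

    @property
    def size(self):
        # Internal read: no literal intent needed and no state change.
        return len(self.get_array())


buf = ArrayBufferSketch([1, 2, 3])
n = buf.size                  # read-only by default
buf.get_array("rw")[0] = 5    # explicit write intent bumps the state
```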

Okay, just pushed changes for this. Requires a review as quite a few lines changed.

SamSJackson added 18 commits April 27, 2026 12:39

introduce device.py and set up branch

8ceb830

const parameter for host device

ba6de9b

include gpu demo and update petsc version as 3.24.5 misaligned for PC…

6a9fd21

…Patch*ExteriorFacets

noting areas for change

fb3d0b7

introducing context variable and lazy cupy

e39f215

lazy evaluation of arrays

2a54e84

passes basic script functionality

98d6010

tofix: revised approach ensuring explicit choice of GPU device

b044c7a

implicit transfer and defaultdict implementation (pub/sub eager copy …

1de6320

…model)

explicit check if GPU available on init

6a26334

cudagpu and fix incoming re: remove eager copying/register & dev syncing

8d1f967

move conversion logic to device.py

9a729e9

managing buffer duplicate

21bac65

cleanup unnecessary todos/notes

67fe76a

removing notes and cleaning

1568315

fix: added copy to avoid weak reference

2517994

- new state object (int -> dict) leads the reassignment being a weak reference.

test: data_wo access works in context

9e7c6ad

add flatten from prev logic

cb5f28e

SamSJackson requested a review from connorjward May 4, 2026 10:38

connorjward requested changes May 4, 2026

SamSJackson added 6 commits May 4, 2026 14:04

context function as global function and def state

36d2b07

- defaultdict gives -1 as default value if device object does not exist

fix: maintaining constant array property

e5a0107

- constant property was lost between cupy/numpy conversions - fixed by passing kwarg that is disregarded for cupy but used for numpy

pr review: removing unused variables

39d0acd

remove dispatch to allow no-import cupy

e48d698

pr: fix property, duplicate, init

7d582d6

- using @Property for last_updated_device, known by state, does not need to be variable. - duplicate method only copies most up-to-date copy and non-copy duplicate only copies for current device - initialisation adds None as optional for data input

fix: change petsc config version to v3.25.0

0528e76

- v3.24.5 was failing to compile petsc4py due to PETSc no support for PCPatchSetComputeFunctionExteriorFacets

SamSJackson added 3 commits May 6, 2026 11:28

include cupy callable pointer

e07e6c6

basic GPU unit tests covering context

6f64d83

SamSJackson added 3 commits May 6, 2026 18:07

test cases for gpu - no fixtures due to bug

c8db3e2

cleaning comments

f13342a

SamSJackson marked this pull request as ready for review May 12, 2026 12:29

connorjward requested changes May 12, 2026

SamSJackson added 5 commits May 12, 2026 16:23

pr: resolving comments and adding docstrings

5fbbb8d

pr: clean tests and remove gpu demo

1e7755f

removing gpu demo

891e2c7

more descriptive docstring

5bd5243

pr: changes to remove _data for get_array

74eb11f

connorjward approved these changes May 13, 2026

connorjward merged commit 389f9f8 into connorjward/pyop3 May 13, 2026
3 of 7 checks passed

connorjward deleted the SamSJackson/pyop3-outer branch May 13, 2026 10:55
