ArrayBuffer Lazy-assignment for GPU Context#5076
Conversation
…Patch*ExteriorFacets
- new state object (int -> dict) leads the reassignment being a weak reference.
connorjward
left a comment
There was a problem hiding this comment.
I've been strict but in general this is fantastic, thank you
- defaultdict gives -1 as default value if device object does not exist
- constant property was lost between cupy/numpy conversions - fixed by passing kwarg that is disregarded for cupy but used for numpy
- using @Property for last_updated_device, known by state, does not need to be variable. - duplicate method only copies most up-to-date copy and non-copy duplicate only copies for current device - initialisation adds None as optional for data input
- v3.24.5 was failing to compile petsc4py due to PETSc no support for PCPatchSetComputeFunctionExteriorFacets
|
For any trivial changes please go ahead and resolve them. Saves me cross checking things. |
- Wrappped all SF comms with on_host decorator to ensure happens on host
- This increases number of copies but all MPI happening on host atm
- Executable requires receiving a pointer to array. Providing a
conditional singledispatch register - in case user does not have cupy.
bug regarding:
if an array is modified as the first action when offloaded:
1. state for new device is updated and incremented (as
modifying)
2. then buffer attempts to sync from most up-to-date device
3. this may be current device as state was incremented first
4. then device tries to copy non-existent array to itself - bad
solution:
adapt record-modified decorator to increment state after wrapped
function
- simple try-finally wrapped trick
basic unit tests introduced for gpu context as ground truth bug: if AxisTree for FunctionSpace (or sim. object) not cached until on device, it attempts to `compile` - this is not implemented for GPU yet. fix: this has been fixed with some patches for now, to be removed. - `firedrake/functionspaceimpl::make_dat` is wrapped in on_host - `pyop3/tree/axis_tree/tree.py::_buffer_indices` is wrapped in on_host - `pyop3/buffer.py` has necessary sync in duplicate (in scenario that user assigns and immediately copies) With current fix, this does mean that if the AxisTree properties are not cached, the assign operation will happen on CPU - consequent calls work on GPU.
|
Update on progress: Unit test cases have been introduced - also uncovered some bugs. Bug: AxisTree uses cached properties. If cached properties are not available whilst in GPU, this will cause a Segfault. Fix:
This means that if properties are not cached, assign takes place on host. If available, functions as normal. Also introduced a error catch in Otherwise, I am quite happy with things. |
| try: | ||
| return func(self, *args, **kwargs) | ||
| finally: | ||
| self.inc_state() |
There was a problem hiding this comment.
Why this change? Seems like it might be a good idea but want to make it's fully thought out.
There was a problem hiding this comment.
It was to solve a previous bug. Essentially the equivalent of a C loop doing ++state vs state++.
Before, if an array was generated on host and then the first action on GPU is to modify, the state is updated before the sync happens. Then the sync would happen and the state update would be forgotten.
Skipped this problem by making the state update happen afterwards.
There was a problem hiding this comment.
I wonder if this is telling us that this decorator is being applied in the wrong place. I think we want to have
self._data = ... # i.e. we modify the data on *exactly* this line
self.inc_state()I.e. we want to have the inc_state happen at the exact point that we actually modify the data. If we decouple them then I think we're just going to get very confused.
There was a problem hiding this comment.
I wonder if this is telling us that this decorator is being applied in the wrong place. I think we want to have
self._data = ... # i.e. we modify the data on *exactly* this line
self.inc_state()I.e. we want to have the inc_state happen at the exact point that we actually modify the data. If we decouple them then I think we're just going to get very confused.
There was a problem hiding this comment.
I think it would be hard to be more precise with a decorator pattern, where it can only happen at start or end.
The record_modified wraps around only data_wo and data_rw. AIUI, we are not explicitly modifying the data ourselves, we serve a modifiable object to the user. As such, it would be hard to increment after an exact line - as we do not have one.
Although, current implementation is almost exactly as you want, just hidden in the wrapper.
User calls data_rw/data_wo -> self._data property is called and synced -> record_modified wrapper updates state -> returns self._data
There was a problem hiding this comment.
This works well, I think getting rid of decorator is necessary.
Do you want me to deprecate ._data or should we be fine if it is internal?
There was a problem hiding this comment.
We can just delete _data. I think I use it in a few places to get access to the numpy array without tweaking the state (pyop3/insn/exec.py in particular) but I think you can just directly replace it with get_array, the intent should be available.
There was a problem hiding this comment.
Actually ._data is used quite a lot internally in buffer, where intent is not explicitly available.
In properties like size or methods like reduce_leaves_to_roots_begin.
The _data calls can be replaced by get_array("ro") but don't think we want hanging literals like that.
Can't just use .data_ro as it is recursive through reduce_leaves_to_roots. We could default get_array() for intent as "ro"?
There was a problem hiding this comment.
I think the right thing to do is probably replace them with self._lazy_data[current_device]. If we are doing things internal to the class I think we are allowed to work directly on the data - the state tracking is for managing how people interact with the ArrayBuffer.
But defaulting get_array to ro also seems sensible.
There was a problem hiding this comment.
Okay, just pushed changes for this. Requires a review as quite a few lines changed.
Description
This PR introduces a context manager to allow dynamic assignment of PyOP3 arrays on a given device.
Devices are defined within the new
device.pymodule, where internal array management is also contained.Key modifications:
device.pymodule to represent offloading devices as offloading context managerbuffer.pyto lazily-evaluate data with respect to the current context (i.e. host or offloading device)_lazy_datais now a dictionary object, mapping between Device objects and respective arrays._dataproperty lazily-evaluates to appropriate data for the given context. If the data is not up-to-date, as per state property, the data is copied.device.pysobuffer.pycan remain device/gpu-agnostic - apart from some type-hinting, i.e.cp.ndarray.Notable issues:
writeableflag. It will throw away the flag when converting from NumPy objects (and it will not return if converted back)./pyop3_gpu_demo.pyfrom asserts before entering context manager.