compiler: Avoid int32 overflow in linearized host-device transfer size#2939
compiler: Avoid int32 overflow in linearized host-device transfer size#2939gaoflow wants to merge 2 commits into
Conversation
When a host-device data transfer is linearized, its array section size is emitted as a product of the Function's per-dimension sizes, e.g. `copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])`. The `size[i]` fields are 32-bit C ints, so for a Function with more than ~2**31 elements the product overflows `int` before it is used as the transfer bound, yielding a bogus size and a corrupt/failed device transfer. Cast each factor of the product to a 64-bit integer so the multiplication is carried out in 64-bit arithmetic. Casting the whole product would be too late (the overflow would already have occurred), so each factor is cast individually. Non-product bounds (a single size, an offset, a constant) cannot overflow and are left untouched, as are non-transfer expressions. Fixes devitocodes#2777
| return self._make_parallel(graph, sync_mapper=graph.sync_mapper) | ||
|
|
||
|
|
||
| def _avoid_overflow(expr): |
There was a problem hiding this comment.
there's an as_long helper in symbolics/manipulation.py that you should use instead of this function (which, btw, wouldn't even logically belong to this module -- besides its implementation being quite hacky)
There was a problem hiding this comment.
Thanks! Done — switched to as_long and dropped _avoid_overflow.
One thing worth flagging: as_long as written was a no-op here. The transfer size factors (vec->size[i]) are IndexedPointers, but as_long substituted only what retrieve_symbols returns (plain Symbols), so the product came out un-cast and the overflow remained. I changed as_long to use retrieve_terminals instead, so Indexed/IndexedPointer leaves are cast too (single-Symbol callers are unaffected, and numeric literals aren't touched). I kept the call scoped to Mul products in PragmaTransfer._generate so non-linearized multi-dimensional sections aren't needlessly upcast. test_symbolics/test_gpu_common green locally; the OpenMP exact-string expectations were updated for the new factor order.
Address review: replace the ad-hoc _avoid_overflow helper with the existing as_long. as_long only substituted plain Symbols (retrieve_symbols), so it was a no-op on the IndexedPointer size factors (vec->size[i]) of a linearized transfer bound; extend it to retrieve_terminals so Indexed/IndexedPointer leaves are cast too. Keep the cast scoped to Mul products in PragmaTransfer so non-linearized multi-dimensional sections are not needlessly upcast.
Description
Fixes #2777.
When a host↔device data transfer is linearized, its array section size is emitted as a product of the
Function's per-dimension sizes, for example:#pragma acc enter data copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])The
u_vec->size[i]fields are 32-bit Cints, so the productsize[0]*size[1]*size[2]*size[3]is evaluated in 32-bit arithmetic. For aFunctionwith more than~2**31elements (e.g. the reporter's1295**3 ≈ 2.17e9points, ~24.5 GB) the product overflowsintbefore it is used as the transfer bound, producing a bogus size (the reporter saw18446744065653020036) and a corrupt / failed device transfer — independent ofindex-mode=int64/linearize=True, because those control the kernel index type, not the type of the transfer-clause arithmetic.As @mloubout noted on the issue, the fix is to perform the size multiplication in 64-bit. This casts each factor of a product section bound to a 64-bit integer:
#pragma acc enter data copyin(u[0:(long)(u_vec->size[0])*(long)(u_vec->size[1])*(long)(u_vec->size[2])*(long)(u_vec->size[3])])Casting the whole product (
(long)(a*b*c)) would be too late — the overflow would already have happened in 32-bit — so each factor is cast individually, which forces every multiplication to be 64-bit regardless of operand ordering.The change lives in
PragmaTransfer._generate, so it is scoped to host-device transfer clauses only. Non-product bounds (a single dimension size, an offset, a constant) cannot overflow and are left untouched, addressing the concern that there is "no reason to uselongfor all of those". Non-transfer expressions (e.g. free-space guards, TMA descriptors) are unaffected.Reproduction
Before:
After:
Verification
openacccopyin/copyout/deleteandopenmpmap(to:/release:)) and to 2D/3D Functions.[0:s0][0:s1]...) is unchanged — there is no product there, hence no overflow.TestPassesOptional::test_linearize_transfer_no_overflowasserting eachsize[i]factor of a linearized transfer is cast tolongand that no bare 32-bit product remains.test_gpu_openmp.pyexpectations (test_basic,test_multiple_eqns) whose OpenMP transfers use the flattened product form.linearize=True, build and run unchanged (no device transfers emitted).flake8clean on the changed files.