Skip apply_gufunc compensation when post-rechunk block fits array.chunk-size#12360

Closed
thodson-usgs wants to merge 1 commit into
dask:mainfrom
thodson-usgs:fix/apply-gufunc-overcompensation

@thodson-usgs thodson-usgs commented Apr 23, 2026

  • Tests added
  • Passes pre-commit run --all-files

Follow-up to #11683. The loop-dim compensation block there unconditionally shrinks loop chunks after rechunking a core dim to -1, even when the original block was already small. Result: task graphs explode without any memory benefit.

import dask.array as da

src = da.zeros((200, 400), chunks=(2, 400))           # 100 input chunks, ~6 KB each
out = da.apply_gufunc(
    lambda x, c: x[:1], "(i),(i)->(j)",
    src, da.arange(200.0),
    axes=[(0,), (0,), (0,)], output_sizes={"j": 50},
    allow_rechunk=True, output_dtypes=float,
)
len(out.__dask_graph__())
# before: 20,701       after: 404

Fix: guard the compensation with an array.chunk-size budget. Only shrink loop dims when the post-rechunk block would actually exceed the limit. The memory-protection branch from #11683 still fires when it should — new test test_gufunc_chunksizes_adjustment_above_limit forces the budget down and pins that behavior.
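The guard described above can be sketched in plain Python. This is a minimal illustration of the idea, not the code from the PR: `block_nbytes`, `should_compensate`, and the 64 KiB "forced" budget are hypothetical names and values chosen for the example, and the real implementation would presumably read the budget from dask's `array.chunk-size` config rather than take it as an argument.

```python
import math

# 128 MiB: the default the PR cites for array.chunk-size.
DEFAULT_BUDGET = 128 * 2**20


def block_nbytes(chunk_shape, itemsize):
    """Bytes held by one block with the given chunk shape."""
    return math.prod(chunk_shape) * itemsize


def should_compensate(shape, chunk_shape, core_axes, itemsize, budget=DEFAULT_BUDGET):
    """Return True only if rechunking the core axes to -1 pushes a block past the budget."""
    # Rechunking a core axis to -1 makes its chunk span the whole axis;
    # loop axes keep their current chunk length.
    post = tuple(
        shape[ax] if ax in core_axes else chunk_shape[ax]
        for ax in range(len(shape))
    )
    return block_nbytes(post, itemsize) > budget


# Repro layout from the PR: 200x400 float64, chunks (2, 400), core axis 0.
small = should_compensate((200, 400), (2, 400), core_axes={0}, itemsize=8)

# Same layout with the budget forced down (64 KiB is an assumed value).
forced = should_compensate(
    (200, 400), (2, 400), core_axes={0}, itemsize=8, budget=64 * 1024
)

print(small, forced)  # False True
```

With the default budget the 640,000-byte post-rechunk block is left alone; only the artificially small budget triggers the shrink, which is the split the two tests are meant to pin.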

Related: pydata/xarray#9907 (the original report motivating #11683; already closed), pydata/xarray#10130 (open out-of-memory report where the over-firing compensation contributes to graph blowup).

Companion xarray PR (pydata/xarray#11312) addresses the architectural side for interp(method="linear"|"nearest"); this one helps every apply_gufunc(allow_rechunk=True) caller.

cc @phofl @crusaderky

The compensation block introduced in dask#11683 unconditionally shrinks loop
dimensions after rechunking core dims to -1, to preserve per-block memory.
But it never checks whether the original block was already small — so it
over-splits already-small loop dims, producing huge task graphs with no
memory benefit.

Concrete repro (from pydata/xarray#9907 follow-up):
  src = da.zeros((200, 400), chunks=(2, 400))   # 100 input chunks
  da.apply_gufunc(..., allow_rechunk=True)      # core axis = 0

  Before: latitude → 1 chunk (good), longitude split to 100 chunks of 4
          (bad — 20,701 tasks in graph, loop dim was already one chunk)
  After:  latitude → 1 chunk, longitude stays one chunk (404 tasks)
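
The "100 chunks of 4" figure follows from simple arithmetic if the compensation is modelled as "shrink the loop chunk by the factor the core chunk grew". That model is an assumption made for illustration (it reproduces the numbers quoted above, but it is not copied from dask), and `compensated_loop_chunk` is a hypothetical helper:

```python
import math


def compensated_loop_chunk(core_len, core_chunk, loop_chunk):
    """Shrink a loop chunk by the factor the core chunk grew when rechunked to -1."""
    growth = math.ceil(core_len / core_chunk)   # 200 / 2 -> 100
    return max(1, loop_chunk // growth)         # 400 // 100 -> 4


core_len, core_chunk = 200, 2     # latitude: core axis, 100 chunks of 2
loop_len, loop_chunk = 400, 400   # longitude: loop axis, already one chunk

new_loop_chunk = compensated_loop_chunk(core_len, core_chunk, loop_chunk)
n_loop_blocks = math.ceil(loop_len / new_loop_chunk)

itemsize = 8  # float64
before_bytes = core_chunk * loop_chunk * itemsize     # original block
merged_bytes = core_len * loop_chunk * itemsize       # core rechunked, no compensation
split_bytes = core_len * new_loop_chunk * itemsize    # core rechunked, compensated

print(new_loop_chunk, n_loop_blocks)            # 4 100
print(before_bytes, merged_bytes, split_bytes)  # 6400 640000 6400
```

Under this model the compensation restores the original 6,400-byte block size exactly, at the cost of 100 loop blocks, while the uncompensated block is only 640,000 bytes.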

Downstream impact on xarray.interp(method="linear"|"nearest") with
dask-chunked input: ~100x speedup on a 200x400 -> 50x100 interp with
100 chunks (1554 ms -> 14 ms; task graph 21,731 -> 413).

Fix: guard the compensation with a chunk-size budget check. Only shrink
loop dims when the post-rechunk block would actually exceed
``array.chunk-size`` (default 128 MiB). For the xarray#9907 scenario
the blocks still get split (verified by an explicit test with a reduced
limit).

Split the existing ``test_gufunc_chunksizes_adjustment`` into two tests
covering both branches — below limit (no compensation), above limit
(compensation fires).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/xarray that referenced this pull request Apr 23, 2026
Routes ``xarray.interp(method="linear"|"nearest"|"slinear")`` on a
dask-chunked core dim through a per-chunk dispatch instead of
``apply_ufunc(..., allow_rechunk=True)``. For each target point, look up
the source chunk that contains its coord value and run the interpolator
over that chunk plus a size-1 halo. Per-task memory scales with
``source_chunk + halo`` rather than the full interp axis.
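
The routing step described above (find the source chunk for each target, then widen it by a size-1 halo) can be sketched with the standard library alone. This is an illustrative sketch rather than the xarray implementation: `chunk_bounds` and `route_targets` are hypothetical names, and the source coordinate is assumed to be 1D and monotonically increasing.

```python
from bisect import bisect_right


def chunk_bounds(chunk_sizes):
    """Start/stop index pairs for consecutive chunks along one axis."""
    bounds, start = [], 0
    for size in chunk_sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


def route_targets(source_coord, chunk_sizes, targets, halo=1):
    """Map each source chunk to (halo-padded index window, targets that fall in it)."""
    bounds = chunk_bounds(chunk_sizes)
    # First coordinate of every chunk after the first; bisecting over
    # these values selects the chunk that contains a target.
    edges = [source_coord[start] for start, _ in bounds[1:]]
    routed = {}
    for t in targets:
        k = bisect_right(edges, t)
        start, stop = bounds[k]
        # Pad the chunk by `halo` points on each side, clipped to the axis.
        window = (max(0, start - halo), min(len(source_coord), stop + halo))
        routed.setdefault(k, (window, []))[1].append(t)
    return routed


coord = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # monotonic source coord
routed = route_targets(coord, chunk_sizes=[4, 4], targets=[0.5, 3.5, 6.25])
print(routed)
# {0: ((0, 5), [0.5, 3.5]), 1: ((3, 8), [6.25])}
```

The target 3.5 lies between the last point of chunk 0 and the first point of chunk 1; the halo is what makes both neighbours available inside a single window, so each task only needs one chunk plus its padding.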

Fall-back path preserves the existing behavior for cubic, multi-dim
interpn, non-monotonic source coord, empty target, and numpy input.

Verified against the existing apply_ufunc path on 200x400 -> 50x100 for
several source-chunk layouts (bit-identical), on a 3D time-chunked input
(time chunking preserved), and on the memory-constrained 6000x5000 case
where the new path beats ``apply_ufunc`` by ~10x.

The per-chunk path materializes 1D source coords (searchsorted-based
routing); data stays lazy. ``test_dataset_interp_datetime_dask`` bumped
its ``raise_if_dask_computes`` budget to account for this.

Related: :issue:`9907` (already closed; same root cause) and
:issue:`10130` (open; partial overlap — single-chunk-source cases still
use the existing path, better addressed by the dask-side guard in
dask/dask#12360).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     21 files  ± 0       21 suites  ±0   5h 25m 50s ⏱️ + 9m 34s
 18 294 tests + 1   16 935 ✅ ± 0   1 272 💤 ±0  87 ❌ +1 
317 539 runs  +19  273 745 ✅ +16  43 707 💤 +2  87 ❌ +1 

For more details on these failures, see this check.

Results for commit e9e8dba. ± Comparison against base commit c6a85f3.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
Removed: dask.array.tests.test_gufunc ‑ test_gufunc_chunksizes_adjustment
Added: dask.array.tests.test_gufunc ‑ test_gufunc_chunksizes_adjustment_above_limit
Added: dask.array.tests.test_gufunc ‑ test_gufunc_chunksizes_adjustment_below_limit

thodson-usgs added a commit to thodson-usgs/xarray that referenced this pull request Apr 23, 2026
thodson-usgs added a commit to thodson-usgs/xarray that referenced this pull request Apr 24, 2026
thodson-usgs added a commit to thodson-usgs/xarray that referenced this pull request Apr 24, 2026
@thodson-usgs
Author

Closing this. After benchmarking, the regime this fix targets (post-rechunk blocks well below array.chunk-size) is narrow: at normal chunk sizes there's no measurable difference, and the specific case that motivated it (pydata/xarray#9907, pydata/xarray#10130) is being addressed from the xarray side in pydata/xarray#11312. Happy to revisit if someone hits the same symptom through a different entry point.

[This is Claude Code on behalf of Tim Hodson]
