Zarr multipart upload tune ups: condition on server capability, fix progress reporting, harmonize variable names (#1846)

Merged: jjnesbitt merged 4 commits into zarr-multipart-upload from zarr-multipart-upload-enhs on Apr 30, 2026
Conversation

yarikoptic (Member) commented Apr 30, 2026

Individual commits would have more of 'claudetail' ;-)

Target:

yarikoptic and others added 4 commits April 29, 2026 19:51
…_upload->singlepart_items

The split of a batch into "items above the 5 GiB threshold" and "items at
or below it" is really a split between items routed through the multipart
upload endpoint and items uploaded with a single signed PUT.  Naming
them after the upload mechanism rather than the size makes the
subsequent capability-gating logic read more naturally.

Pure rename, no behavior change.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
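The rename described above can be sketched roughly as follows. Only the names ``ZARR_LARGE_CHUNK_THRESHOLD``, ``multipart_items``, and ``singlepart_items`` come from the PR; the ``Item`` dataclass and the helper function are illustrative stand-ins, not the real code:

```python
from dataclasses import dataclass

# The actual 5 GiB threshold used by the S3 single-PUT limit.
ZARR_LARGE_CHUNK_THRESHOLD = 5 * 1024**3


@dataclass
class Item:
    """Illustrative stand-in for the real zarr upload item type."""
    path: str
    size: int


def split_by_upload_mechanism(all_items):
    # Split by how each item will be uploaded: items over the threshold
    # go through the multipart upload endpoint, the rest through a
    # single signed PUT.
    multipart_items = [it for it in all_items if it.size > ZARR_LARGE_CHUNK_THRESHOLD]
    singlepart_items = [it for it in all_items if it.size <= ZARR_LARGE_CHUNK_THRESHOLD]
    return multipart_items, singlepart_items
```

With names like these, a later gate such as "raise if ``multipart_items`` is non-empty and the server lacks the endpoint" reads directly off the mechanism rather than an incidental size comparison.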
dandi-archive#2784 (still open) introduces the
``/uploads/zarr/initialize/`` endpoint required to upload zarr chunks
larger than 5 GiB.  Until that lands and is deployed, posting to the
endpoint returns 404 and the multi-part code path here would surface
that as an opaque ``HTTP404Error`` from inside ``_multipart_upload``.

Add a lazy ``DandiAPIClient.supports_zarr_multipart_upload`` cached
property that probes the endpoint once with an empty POST body: 404
means the server lacks the feature, anything else (400 for the bogus
payload, 401/403 for auth) means it is present.

In ``ZarrAsset.iter_upload``, right after the batch is split into
``multipart_items`` and ``singlepart_items``, raise an ``UploadError``
when there are multipart items and the server does not support them.
The error reports the count and aggregate size of the oversized chunks
plus the path of the largest one, mirroring the wording proposed in
#1827.

Marked with a TODO: the check (and the cached property) can be removed
once all supported servers ship dandi-archive#2784 (> 0.23.0).

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
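The probe described above could look something like this sketch. The property name and the 404-vs-anything-else logic follow the commit message; the class shape and the injected ``session`` object are assumptions for illustration (the real ``DandiAPIClient`` uses its own request helpers):

```python
from functools import cached_property


class DandiAPIClientSketch:
    """Illustrative sketch, not the real DandiAPIClient."""

    def __init__(self, api_url, session):
        self.api_url = api_url
        self.session = session  # anything with a .post(url, json=...) method

    @cached_property
    def supports_zarr_multipart_upload(self) -> bool:
        # Probe once with an empty POST body: 404 means the server lacks
        # the endpoint; anything else (400 for the bogus payload, 401/403
        # for auth) means it is present.  cached_property makes this a
        # one-shot check per client instance.
        r = self.session.post(f"{self.api_url}/uploads/zarr/initialize/", json={})
        return r.status_code != 404
```

Because ``cached_property`` stores the result on first access, repeated gating checks during a large upload cost no extra round trips.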
…umulative

``_multipart_upload`` (used here to upload zarr chunks larger than 5
GiB) yields progress dicts whose ``current`` is the number of bytes
uploaded *within the chunk currently being transferred*.  The
surrounding ``ZarrAsset.iter_upload`` loop reports ``current`` as the
number of bytes uploaded *across the whole zarr*.  Yielded as-is, the
inner per-file values caused ``current`` to jump backwards each time a
new multipart chunk started, breaking progress bars driven from that
field.

Wrap the inner generator: when an "uploading" status carries a
``current`` field, translate it by the cumulative bytes uploaded so
far before re-yielding; pass other status dicts through unchanged.
The trailing post-loop yield is dropped — the last per-part
translated yield already reports the full cumulative size for that
chunk.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
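The wrapping described above amounts to re-basing each per-chunk ``current`` onto the whole-zarr byte count. A minimal sketch, assuming status dicts with ``status`` and ``current`` keys as described in the commit message (the helper name is hypothetical):

```python
def translate_progress(inner, cumulative_before):
    """Re-base per-chunk 'current' values onto whole-zarr byte counts.

    ``inner`` yields status dicts in which 'current' counts bytes within
    the chunk currently being transferred; this wrapper shifts them by
    the bytes already uploaded before this chunk started, so the outer
    'current' stays monotonic.  Other status dicts pass through as-is.
    """
    for status in inner:
        if status.get("status") == "uploading" and "current" in status:
            status = {**status, "current": cumulative_before + status["current"]}
        yield status
```

The last translated yield for a chunk already reports ``cumulative_before`` plus the full chunk size, which is why the commit can drop the trailing post-loop yield.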
The two ``ai_generated`` tests added in 9bc2976 patch
``ZARR_LARGE_CHUNK_THRESHOLD`` down so chunks are routed through
``_multipart_upload`` against whatever server backs ``new_dandiset``.
The local docker dandi-archive image does not yet ship
dandi-archive#2784, so the new ``supports_zarr_multipart_upload``
capability check raises ``UploadError`` before ``_multipart_upload``
is ever called and the tests fail without exercising what they claim
to.

Probe the live client for capability and ``pytest.skip`` only when
the endpoint is missing — keeps the tests effective on setups whose
server already carries dandi-archive#2784.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
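The gating described above might look like the following sketch. The PR presumably inlines this probe in the two ``ai_generated`` tests; the helper name here is hypothetical, and only ``supports_zarr_multipart_upload`` and the skip-on-404 behavior come from the commit message:

```python
import pytest


def skip_unless_zarr_multipart(client) -> None:
    """Skip the calling test when the server lacks dandi-archive#2784.

    Hypothetical helper; probes the live client's cached capability
    property and skips (rather than fails) on servers, such as the
    local docker dandi-archive image, that do not yet ship the
    /uploads/zarr/initialize/ endpoint.
    """
    if not client.supports_zarr_multipart_upload:
        pytest.skip("server lacks /uploads/zarr/initialize/ (dandi-archive#2784)")
```

On a server that does carry the endpoint, the call is a no-op and the test proceeds to exercise ``_multipart_upload`` for real.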
@yarikoptic yarikoptic requested a review from jjnesbitt April 30, 2026 01:27
Comment thread on dandi/files/zarr.py:

```diff
     it for it in all_items if it.size > ZARR_LARGE_CHUNK_THRESHOLD
 ]
-items_to_upload = [
+singlepart_items = [
```
yarikoptic (Member Author):
Without the name changes the code would be hard(er) to read going forward. Cleaner names at least make more sense IMHO for our future selves.

Comment thread on dandi/files/zarr.py:

```python
    # stays monotonic for downstream consumers.
    for it in multipart_items:
        cumulative_before = bytes_uploaded
        for status in _multipart_upload(
```
yarikoptic (Member Author):
Potentially we could change what record we yield in that helper to make things easier here, e.g. it could take a "bytes_uploaded" offset to fold into its progress reporting or alike -- I will leave the choice to you.

Comment thread on dandi/tests/test_files.py:

```python
    dandi_file,
    find_dandi_files,
)
from ..files.bases import _multipart_upload as real_multipart_upload
```
yarikoptic (Member Author):
did you have pre-commit installed for your local clone of dandi-cli @jjnesbitt ?

codecov bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 51.51515% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.99%. Comparing base (871294b) to head (684da5c).
⚠️ Report is 5 commits behind head on zarr-multipart-upload.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dandi/files/zarr.py | 44.44% | 10 Missing ⚠️ |
| dandi/dandiapi.py | 40.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@                    Coverage Diff                     @@
##           zarr-multipart-upload    #1846       +/-   ##
==========================================================
+ Coverage                  52.51%   75.99%   +23.47%     
==========================================================
  Files                         87       87               
  Lines                      12540    12562       +22     
==========================================================
+ Hits                        6586     9546     +2960     
+ Misses                      5954     3016     -2938     
| Flag | Coverage | Δ |
| --- | --- | --- |
| unittests | 75.99% <51.51%> | +23.47% ⬆️ |


@yarikoptic yarikoptic added the patch Increment the patch version when merged label Apr 30, 2026
jjnesbitt (Member) left a comment


CI is green by simply not running the new tests, why didn't I think of that 👍

@jjnesbitt jjnesbitt merged commit 56126dd into zarr-multipart-upload Apr 30, 2026
41 of 43 checks passed
@jjnesbitt jjnesbitt deleted the zarr-multipart-upload-enhs branch April 30, 2026 15:51
yarikoptic (Member Author):

> CI is green by simply not running the new tests, why didn't I think of that 👍

;) but it might turn red on your PR in dandi-archive if you test against that instance ... as I noted before -- ideally we should add support for such a matrix run pointing at PRs in dandi-archive (#1847?) and aim for both the stock run and the dandi-archive-PR run to stay green!
