Zarr multipart upload tune-ups: condition on server capability, fix progress reporting, harmonize variable names #1846
Conversation
…_upload->singlepart_items

The split of a batch into "items above the 5 GiB threshold" and "items at or below it" is really a split between items routed through the multipart upload endpoint and items uploaded with a single signed PUT. Naming them after the upload mechanism rather than the size makes the subsequent capability-gating logic read more naturally. Pure rename, no behavior change.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dandi-archive#2784 (still open) introduces the ``/uploads/zarr/initialize/`` endpoint required to upload zarr chunks larger than 5 GiB. Until that lands and is deployed, posting to the endpoint returns 404, and the multipart code path here would surface that as an opaque ``HTTP404Error`` from inside ``_multipart_upload``.

Add a lazy ``DandiAPIClient.supports_zarr_multipart_upload`` cached property that probes the endpoint once with an empty POST body: 404 means the server lacks the feature; anything else (400 for the bogus payload, 401/403 for auth) means it is present. In ``ZarrAsset.iter_upload``, right after the batch is split into ``multipart_items`` and ``singlepart_items``, raise an ``UploadError`` when there are multipart items and the server does not support them. The error reports the count and aggregate size of the oversized chunks plus the path of the largest one, mirroring the wording proposed in #1827.

Marked with a TODO: the check (and the cached property) can be removed once all supported servers ship dandi-archive#2784 (> 0.23.0).

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
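A minimal sketch of what such a probe could look like, assuming a plain ``requests``-backed client; everything here except the property name and the endpoint path is an assumption, not the real dandi-cli code:

```python
# Hypothetical sketch only; the real DandiAPIClient differs.
from functools import cached_property

import requests


class DandiAPIClient:  # illustrative subset of the real client
    api_url = "https://api.dandiarchive.org/api"

    @cached_property
    def supports_zarr_multipart_upload(self) -> bool:
        """Probe /uploads/zarr/initialize/ once and cache the answer."""
        try:
            # An empty body is deliberately invalid: a server that has the
            # endpoint rejects it (400, or 401/403 before auth), while a
            # server without dandi-archive#2784 answers 404.
            r = requests.post(f"{self.api_url}/uploads/zarr/initialize/", json={})
            r.raise_for_status()
        except requests.HTTPError as e:
            return e.response.status_code != 404
        return True  # the bogus payload was accepted; the endpoint exists
```

Because ``cached_property`` memoizes the result, the extra round trip is paid at most once per client instance.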
…umulative

``_multipart_upload`` (used here to upload zarr chunks larger than 5 GiB) yields progress dicts whose ``current`` is the number of bytes uploaded *within the chunk currently being transferred*, while the surrounding ``ZarrAsset.iter_upload`` loop reports ``current`` as the number of bytes uploaded *across the whole zarr*. Yielded as-is, the inner per-file values caused ``current`` to jump backwards each time a new multipart chunk started, breaking progress bars driven from that field.

Wrap the inner generator: when an "uploading" status carries a ``current`` field, translate it by the cumulative bytes uploaded so far before re-yielding; pass other status dicts through unchanged. The trailing post-loop yield is dropped: the last translated per-part yield already reports the full cumulative size for that chunk.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
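A sketch of the wrapper the commit describes; the status-dict shape and the "uploading" key follow the commit text, while the function name and offset parameter are made up for illustration:

```python
from typing import Any, Dict, Iterator


def _offset_progress(
    statuses: Iterator[Dict[str, Any]], cumulative_before: int
) -> Iterator[Dict[str, Any]]:
    """Shift per-chunk 'current' values into whole-zarr byte counts."""
    for status in statuses:
        if status.get("status") == "uploading" and "current" in status:
            # The inner generator counts bytes within the current chunk only;
            # adding the bytes already uploaded for earlier chunks keeps
            # 'current' monotonic for downstream progress bars.
            status = {**status, "current": status["current"] + cumulative_before}
        yield status
```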
The two ``ai_generated`` tests added in 9bc2976 patch ``ZARR_LARGE_CHUNK_THRESHOLD`` down so that chunks are routed through ``_multipart_upload`` against whatever server backs ``new_dandiset``. The local docker dandi-archive image does not yet ship dandi-archive#2784, so the new ``supports_zarr_multipart_upload`` capability check raises ``UploadError`` before ``_multipart_upload`` is ever called, and the tests fail without exercising what they claim to.

Probe the live client for the capability and ``pytest.skip`` only when the endpoint is missing; this keeps the tests effective on setups whose server already carries dandi-archive#2784.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
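The guard could look roughly like this; the ``new_dandiset`` fixture name comes from the commit text, and the ``client`` attribute and test body are placeholders:

```python
import pytest


def test_zarr_chunk_above_threshold(new_dandiset) -> None:
    client = new_dandiset.client  # assumed fixture attribute; illustrative only
    if not client.supports_zarr_multipart_upload:
        pytest.skip(
            "server lacks /uploads/zarr/initialize/ (dandi-archive#2784)"
        )
    # ... proceed to patch ZARR_LARGE_CHUNK_THRESHOLD and upload as before
```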
```diff
          it for it in all_items if it.size > ZARR_LARGE_CHUNK_THRESHOLD
      ]
-     items_to_upload = [
+     singlepart_items = [
```
Without the name changes the code would be hard(er) to read going forward. The cleaner names at least make more sense, IMHO, for our future selves.
```python
# stays monotonic for downstream consumers.
for it in multipart_items:
    cumulative_before = bytes_uploaded
    for status in _multipart_upload(
```
Potentially we could change what record we yield in that helper to make things easier here, e.g. it could accept a "bytes_uploaded" offset to be taken into account for progress reporting, or alike (rough sketch below) -- I will leave it to you to choose.
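One way that alternative could look, purely illustrative (the real ``_multipart_upload`` signature differs, and the part sizes here are fake):

```python
from typing import Any, Dict, Iterable, Iterator


def _multipart_upload_demo(
    part_sizes: Iterable[int], bytes_uploaded: int = 0
) -> Iterator[Dict[str, Any]]:
    """Yield whole-upload progress by taking the caller's offset up front."""
    current = bytes_uploaded
    for size in part_sizes:
        current += size  # pretend this part just finished transferring
        yield {"status": "uploading", "current": current}
```

With the offset pushed into the helper, callers would no longer need a translating wrapper generator at all.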
```python
    dandi_file,
    find_dandi_files,
)
from ..files.bases import _multipart_upload as real_multipart_upload
```
Did you have pre-commit installed for your local clone of dandi-cli, @jjnesbitt?
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@                 Coverage Diff                  @@
##           zarr-multipart-upload    #1846       +/-   ##
==========================================================
+ Coverage                  52.51%   75.99%   +23.47%
==========================================================
  Files                         87       87
  Lines                      12540    12562       +22
==========================================================
+ Hits                        6586     9546     +2960
+ Misses                      5954     3016     -2938
```
Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
jjnesbitt left a comment
CI is green by simply not running the new tests, why didn't I think of that 👍
;) but it might turn red on your PR in dandi-archive if you test against that instance ... as I noted before -- ideally we should add support for such a matrix run pointing to PRs in dandi-archive (#1847?) and aim for both the stock run and the dandi-archive PR run to stay green!
Individual commits would have more of 'claudetail' ;-)