
Conversation

@TheGreatAlgo (Collaborator) commented Jun 10, 2025

Summary by CodeRabbit

  • New Features

    • Sharded Zarr store with memory-bounded shard cache, Flat Zarr store option, and a converter + CLI to migrate HAMT stores to sharded layout; new sharded APIs exported.
  • Enhancements

    • Partial byte-range reads supported (range/suffix/offset) and CAS gateway Range support; RPC-based pin/unpin/update/list for pinned content.
  • Tests

    • Extensive new integration, conversion, resizing, grafting, deletion, partial-read, benchmark, and coverage tests.
  • Documentation

    • Added CLAUDE.md with setup, testing (IPFS) guidance, and architecture notes.
  • Chores

    • Pre-commit/ruff config and example CID/script tweaks.

@coderabbitai bot (Contributor) commented Jun 10, 2025

Walkthrough

Adds sharded and flat Zarr store implementations, a HAMT→Sharded converter + CLI, partial-read support end-to-end (CAS→HAMT→Zarr), RPC pin management in KuboCAS, many async tests/benchmarks, small export/example/tooling updates, and new contributor docs (CLAUDE.md).

Changes

Cohort / File(s): Summary

  • Documentation (CLAUDE.md): New contributor guide covering setup, IPFS test options, lint/format/type commands, and architecture and design notes.
  • Package exports (py_hamt/__init__.py): Export ShardedZarrStore, convert_hamt_to_sharded, and sharded_converter_cli; remove the startup print; update __all__.
  • HAMT core (py_hamt/hamt.py): Extend the HAMT.get signature with offset, length, and suffix, forwarded to CAS.load for partial/suffix reads.
  • Zarr HAMT integration (py_hamt/zarr_hamt_store.py): Map Zarr ByteRequest→(offset, length, suffix); bypass the metadata cache for partial reads; implement concurrent get_partial_values; add exception logging.
  • Sharded Zarr backend (py_hamt/sharded_zarr_store.py): New ShardedZarrStore and MemoryBoundedLRUCache: DAG-CBOR shard index, per-shard locks, async get/set/delete/list, partial reads, grafting, resizing, flush, read-only handling, caching, and concurrency logic.
  • HAMT→Sharded converter & CLI (py_hamt/hamt_to_sharded_converter.py): New async convert_hamt_to_sharded and sharded_converter_cli to migrate HAMT stores to ShardedZarrStore.
  • Flat reference store (py_hamt/reference/flat_zarr_store.py): New FlatZarrStore: flat CID index layout, async ops, partial reads via concurrent gets, flush/delete/listing, equality.
  • CAS / HTTP client / pinning (py_hamt/store_httpx.py): ContentAddressedStore.load extended for slicing; abstract pin RPC methods added; InMemoryCAS.load supports slicing; KuboCAS manages per-event-loop httpx clients, Range requests, pin_cid/unpin_cid/pin_update/pin_ls, and pin_on_add.
  • Tests, sharded store & coverage (tests/test_sharded_zarr_store*.py, tests/test_sharded_zarr_store_coverage.py): Large async test suites exercising ShardedZarrStore: write/read, concurrency, append/resize, metadata, partial reads, error paths, and extensive failure-mode coverage (mocks used).
  • Tests, converter & CLI (tests/test_converter.py): Tests for conversion correctness and CLI behavior (success, defaults, invalid CID).
  • Tests, deleting / grafting / resizing (tests/test_sharded_store_deleting.py, tests/test_sharded_store_grafting.py, tests/test_sharded_store_resizing.py): Targeted tests for delete semantics, grafting scenarios (concurrent/overlap/invalid), resizing (store & variable), persistence, and concurrency.
  • Tests, partial reads & IPFS (tests/test_zarr_ipfs_partial.py, modified tests/test_zarr_ipfs*.py): New partial-read integration tests (range/offset/suffix) for IPFS/HAMT/InMemoryCAS; removed NotImplemented expectations and adjusted typing.
  • Benchmarks / compare (tests/test_benchmark_stores.py, tests/test_cpc_compare.py): Async benchmarks for ShardedZarrStore and ZarrHAMTStore measuring write/read timings and throughput.
  • Tests, Kubo pin (tests/test_kubo_pin.py): New test_pinning exercising pin/unpin/pin_update/pin_ls via the KuboCAS RPC.
  • Examples / scripts (fsgs.py, public_gateway_example.py, run-checks.sh): Updated example CIDs; run-checks.sh adjusted the pre-commit invocation (removed --show-diff-on-failure).
  • Tooling (.pre-commit-config.yaml, pyproject.toml): Bumped the ruff hook version; enabled Ruff preview in pyproject.toml.
  • Small formatting (tests/testing_utils.py): Minor reformatting of a hypothesis strategy call (no behavior change).
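The partial-read semantics summarized above (offset, length, suffix on ContentAddressedStore.load and HAMT.get) can be illustrated with a self-contained slicing helper. The parameter names come from this summary, but the slicing rules below are an assumption modeled on HTTP Range behavior, not the library's actual code:

```python
# Hypothetical sketch of the partial-read parameters this PR adds.
# (offset, length, suffix) semantics assumed from the summary above.

def load_slice(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Return a byte range of `data` following (offset, length, suffix) rules."""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]  # bounded range
        return data[offset:]                     # open-ended range
    if suffix is not None:
        return data[-suffix:]                    # last `suffix` bytes
    return data                                  # full load

blob = b"0123456789"
assert load_slice(blob, offset=2, length=3) == b"234"
assert load_slice(blob, offset=7) == b"789"
assert load_slice(blob, suffix=4) == b"6789"
```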

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client as Client
  participant Zarr as ZarrHAMTStore
  participant HAMT as HAMT
  participant CAS as ContentAddressedStore / KuboCAS
  participant Gateway as IPFS Gateway

  Client->>Zarr: get(key, prototype, byte_range)
  Zarr->>Zarr: _map_byte_request(byte_range) → (offset,length,suffix)
  Zarr->>HAMT: get(key, offset, length, suffix)
  HAMT->>CAS: load(cid, offset, length, suffix)
  alt partial read
    CAS->>Gateway: HTTP GET (Range header)
    Gateway-->>CAS: 206 Partial Content
  else full read
    CAS->>Gateway: HTTP GET
    Gateway-->>CAS: 200 OK
  end
  CAS-->>HAMT: bytes
  HAMT-->>Zarr: bytes/decoded
  Zarr-->>Client: buffer
```
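The _map_byte_request step in the diagram above plausibly works like the sketch below. The three request variants mirror zarr 3's ByteRequest types, but they are redefined here as stand-in dataclasses so the mapping is self-contained; the store's real implementation may differ:

```python
# Stand-in dataclasses mirroring zarr 3's ByteRequest variants; the
# mapping to (offset, length, suffix) is an illustration of the
# diagram's _map_byte_request step, not the store's actual code.
from dataclasses import dataclass

@dataclass
class RangeByteRequest:   # bytes [start, end)
    start: int
    end: int

@dataclass
class OffsetByteRequest:  # bytes [offset, ...)
    offset: int

@dataclass
class SuffixByteRequest:  # last `suffix` bytes
    suffix: int

def map_byte_request(req):
    """Return an (offset, length, suffix) triple for CAS.load."""
    if isinstance(req, RangeByteRequest):
        return req.start, req.end - req.start, None
    if isinstance(req, OffsetByteRequest):
        return req.offset, None, None
    if isinstance(req, SuffixByteRequest):
        return None, None, req.suffix
    return None, None, None  # no request means a full read

assert map_byte_request(RangeByteRequest(4, 10)) == (4, 6, None)
assert map_byte_request(SuffixByteRequest(16)) == (None, None, 16)
```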
```mermaid
sequenceDiagram
  autonumber
  participant User as User / CLI
  participant Conv as convert_hamt_to_sharded
  participant CAS as KuboCAS
  participant HStore as HAMT (ro)
  participant SStore as ShardedZarrStore (rw)

  User->>Conv: start(hamt_root_cid, chunks_per_shard)
  Conv->>HStore: open(read_only, hamt_root_cid)
  Conv->>SStore: open(rw, array_shape, chunk_shape, chunks_per_shard)
  loop each key
    Conv->>HStore: list/get pointer
    Conv->>SStore: set_pointer(key, CID)
  end
  Conv->>SStore: flush()
  SStore-->>Conv: sharded_root_cid
  Conv-->>User: sharded_root_cid
```
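The converter loop above reduces to: iterate the source store's keys, copy each key → CID pointer into the destination, and flush once for the new root. The dict-backed stores and method names below are stand-ins for illustration; the real convert_hamt_to_sharded operates on HAMT and ShardedZarrStore:

```python
# Essence of the conversion flow diagrammed above, with dict-backed
# stand-ins for the HAMT (source) and ShardedZarrStore (destination).
import asyncio

class DictStore:
    """Dict-backed stand-in store; not py_hamt's actual API."""
    def __init__(self, pointers=None):
        self._pointers = dict(pointers or {})
        self.flushed = False

    async def keys(self):
        for k in list(self._pointers):
            yield k

    async def get_pointer(self, key):
        return self._pointers[key]

    async def set_pointer(self, key, cid):
        self._pointers[key] = cid

    async def flush(self):
        self.flushed = True
        return "sharded-root-cid"  # placeholder root CID

async def convert(source: DictStore, dest: DictStore) -> str:
    # Copy every key -> CID pointer, then flush once for the new root CID.
    async for key in source.keys():
        await dest.set_pointer(key, await source.get_pointer(key))
    return await dest.flush()

src = DictStore({"temp/c/0/0/0": "cid-a", "zarr.json": "cid-b"})
dst = DictStore()
root = asyncio.run(convert(src, dst))
assert dst._pointers == src._pointers and root == "sharded-root-cid"
```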
```mermaid
sequenceDiagram
  autonumber
  participant Dev as Dev
  participant Kubo as KuboCAS
  participant RPC as IPFS RPC

  Dev->>Kubo: pin_cid(CID)
  Kubo->>RPC: /api/v0/pin/add?arg=CID
  RPC-->>Kubo: OK
  Dev->>Kubo: pin_ls()
  Kubo->>RPC: /api/v0/pin/ls
  RPC-->>Kubo: {Pins}
  Dev->>Kubo: pin_update(old,new)
  Kubo->>RPC: /api/v0/pin/update?arg=old&arg=new
  RPC-->>Kubo: OK
  Dev->>Kubo: unpin_cid(CID)
  Kubo->>RPC: /api/v0/pin/rm?arg=CID
  RPC-->>Kubo: OK
```
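The pin RPC exchanges above all hit Kubo's /api/v0/pin/* endpoints with repeated arg query parameters (Kubo expects these as POST requests). A sketch of the URL construction; the helper and its name are illustrative, not KuboCAS's actual code:

```python
# Build Kubo pin RPC URLs as shown in the diagram above. Endpoint
# paths are Kubo's /api/v0/pin/* routes; this helper is a sketch.
from urllib.parse import urlencode

def pin_rpc_url(base: str, op: str, *args: str) -> str:
    """Build a Kubo RPC URL, e.g. /api/v0/pin/add?arg=<cid>."""
    query = urlencode([("arg", a) for a in args])  # repeated arg params
    url = f"{base}/api/v0/pin/{op}"
    return f"{url}?{query}" if query else url

base = "http://127.0.0.1:5001"
assert pin_rpc_url(base, "add", "bafyCID") == \
    "http://127.0.0.1:5001/api/v0/pin/add?arg=bafyCID"
assert pin_rpc_url(base, "update", "oldCID", "newCID") == \
    "http://127.0.0.1:5001/api/v0/pin/update?arg=oldCID&arg=newCID"
assert pin_rpc_url(base, "ls") == "http://127.0.0.1:5001/api/v0/pin/ls"
```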

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • abidsikder
  • Faolain
  • 0xSwego

Poem

"I nibble through CIDs, shard by shard,
hopping past HAMTs both neat and hard.
I fetch a slice, then pin the prize,
convert and flush beneath bright skies.
Thump—new roots bloom; carrot-coded highs." 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 41.46%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title "feat: Sharding" is concise and directly describes the primary change in the PR: adding sharding support (a new ShardedZarrStore, conversion tooling, and related tests). It follows a conventional-commit prefix and is clear enough for a teammate scanning history to understand the main intent without extraneous detail.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

🧹 Nitpick comments (9)
tests/test_cpc_compare.py (1)

1-129: Consider removing or properly integrating this benchmark file.

This entire file is commented out. If these benchmarks are intended for future use, consider:

  1. Moving them to a separate benchmarks directory
  2. Adding proper documentation explaining when/how to use them
  3. If keeping them, address the security concern with the empty API key on line 28
py_hamt/hamt_to_sharded_converter.py (2)

34-34: Remove the unnecessary consolidated=True parameter.

The consolidated=True parameter in xr.open_zarr is typically used when reading consolidated metadata. Since you're opening a HAMT store that doesn't use Zarr's consolidation feature, this parameter has no effect and should be removed for clarity.

Apply this diff:

-    source_dataset = xr.open_zarr(store=source_store, consolidated=True)
+    source_dataset = xr.open_zarr(store=source_store)

39-44: Simplify array and chunk shape extraction.

The intermediate variables array_shape_tuple and chunk_shape_tuple are redundant.

Apply this diff to reduce local variables:

     ordered_dims = list(source_dataset.dims)
-    array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
-    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
-    array_shape = array_shape_tuple
-    chunk_shape = chunk_shape_tuple
+    array_shape = tuple(source_dataset.dims[dim] for dim in ordered_dims)
+    chunk_shape = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
tests/test_benchmark_stores.py (1)

1-313: Consider removing or properly integrating this benchmark file.

Like test_cpc_compare.py, this entire file is commented out. If these benchmarks are needed:

  1. Uncomment and ensure they work with the current test infrastructure
  2. Move to a dedicated benchmarks directory if they're not part of regular CI
  3. Remove if they're no longer needed
tests/test_converter.py (2)

152-152: Remove unnecessary consolidated=True parameter.

Similar to the converter module, the consolidated=True parameter has no effect with HAMT stores.

Apply this diff:

-        test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+        test_ds.to_zarr(store=source_hamt_store, mode="w")

201-201: Remove unnecessary consolidated=True parameter.

Apply this diff:

-        test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+        test_ds.to_zarr(store=source_hamt_store, mode="w")
py_hamt/store.py (1)

85-96: Simplify control flow by removing unnecessary else

The partial read logic is correct, but the control flow can be simplified.

Apply this diff to improve readability:

 if offset is not None:
     start = offset
     if length is not None:
         end = start + length
         return data[start:end]
-    else:
-        return data[start:]
+    return data[start:]
 elif suffix is not None:  # If only length is given, assume start from 0
     return data[-suffix:]
-else:  # Full load
-    return data
+# Full load
+return data
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

py_hamt/sharded_zarr_store.py (2)

4-4: Remove unused import

The cast import from typing is not used.

-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
🧰 Tools
🪛 Ruff (0.11.9)

4-4: typing.cast imported but unused

Remove unused import: typing.cast

(F401)


190-191: Consider making excluded array prefixes configurable

The hardcoded exclusion list might limit flexibility for different use cases.

Consider making this configurable:

-excluded_array_prefixes = {"time", "lat", "lon", "latitude", "longitude"}
+# Add as class attribute or constructor parameter
+excluded_array_prefixes = self._excluded_array_prefixes or {"time", "lat", "lon", "latitude", "longitude"}

This would allow users to customize which arrays are excluded from sharding.
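A hypothetical sketch of that suggestion: accept the exclusion set in the constructor, defaulting to the current hardcoded values. The class and attribute names here are illustrative, not ShardedZarrStore's real API:

```python
# Configurable excluded-array prefixes, as the suggestion proposes.
# Defaults mirror the hardcoded set quoted above; everything else
# here is an assumed, illustrative design.
DEFAULT_EXCLUDED = frozenset({"time", "lat", "lon", "latitude", "longitude"})

class ShardedStoreConfig:
    def __init__(self, excluded_array_prefixes=None):
        self.excluded_array_prefixes = (
            frozenset(excluded_array_prefixes)
            if excluded_array_prefixes is not None
            else DEFAULT_EXCLUDED
        )

    def is_excluded(self, array_key: str) -> bool:
        # Compare against the leading path component of the Zarr key.
        return array_key.split("/")[0] in self.excluded_array_prefixes

assert ShardedStoreConfig().is_excluded("lat/c/0")
assert not ShardedStoreConfig({"time"}).is_excluded("lat/c/0")
```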

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 342e545 and 8476806.

📒 Files selected for processing (16)
  • CLAUDE.md (1 hunks)
  • py_hamt/__init__.py (1 hunks)
  • py_hamt/flat_zarr_store.py (1 hunks)
  • py_hamt/hamt.py (2 hunks)
  • py_hamt/hamt_to_sharded_converter.py (1 hunks)
  • py_hamt/sharded_zarr_store.py (1 hunks)
  • py_hamt/store.py (5 hunks)
  • py_hamt/zarr_hamt_store.py (3 hunks)
  • tests/test_benchmark_stores.py (1 hunks)
  • tests/test_converter.py (1 hunks)
  • tests/test_cpc_compare.py (1 hunks)
  • tests/test_sharded_zarr_store.py (1 hunks)
  • tests/test_zarr_ipfs.py (0 hunks)
  • tests/test_zarr_ipfs_encrypted.py (4 hunks)
  • tests/test_zarr_ipfs_partial.py (1 hunks)
  • tests/testing_utils.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code Graph Analysis (5)
py_hamt/hamt.py (4)
py_hamt/zarr_hamt_store.py (1)
  • get (116-149)
py_hamt/encryption_hamt_store.py (1)
  • get (146-170)
tests/test_zarr_ipfs_partial.py (1)
  • load (275-285)
py_hamt/store.py (3)
  • load (33-40)
  • load (58-95)
  • load (286-315)
py_hamt/hamt_to_sharded_converter.py (4)
py_hamt/hamt.py (3)
  • HAMT (288-705)
  • build (386-395)
  • keys (673-687)
py_hamt/store.py (1)
  • KuboCAS (98-315)
py_hamt/sharded_zarr_store.py (4)
  • ShardedZarrStore (13-632)
  • open (62-89)
  • set_pointer (447-489)
  • flush (323-360)
py_hamt/zarr_hamt_store.py (1)
  • ZarrHAMTStore (12-289)
py_hamt/__init__.py (3)
py_hamt/store.py (3)
  • ContentAddressedStore (11-40)
  • InMemoryCAS (43-95)
  • KuboCAS (98-315)
py_hamt/sharded_zarr_store.py (1)
  • ShardedZarrStore (13-632)
py_hamt/hamt_to_sharded_converter.py (2)
  • convert_hamt_to_sharded (12-83)
  • sharded_converter_cli (86-123)
py_hamt/zarr_hamt_store.py (2)
py_hamt/encryption_hamt_store.py (1)
  • get (146-170)
py_hamt/hamt.py (1)
  • get (593-608)
py_hamt/store.py (3)
tests/test_zarr_ipfs_partial.py (1)
  • load (275-285)
py_hamt/hamt.py (4)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
py_hamt/zarr_hamt_store.py (1)
  • get (116-149)
🪛 Ruff (0.11.9)
py_hamt/hamt_to_sharded_converter.py

3-3: json imported but unused

Remove unused import: json

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.Buffer imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.BufferPrototype imported but unused

Remove unused import

(F401)

tests/test_converter.py

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


6-6: aiohttp imported but unused

Remove unused import: aiohttp

(F401)


171-171: f-string without any placeholders

Remove extraneous f prefix

(F541)


241-241: Local variable kubo_cas is assigned to but never used

Remove assignment to unused variable kubo_cas

(F841)

py_hamt/sharded_zarr_store.py

4-4: typing.cast imported but unused

Remove unused import: typing.cast

(F401)

tests/test_sharded_zarr_store.py

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


2-2: math imported but unused

Remove unused import: math

(F401)


12-12: py_hamt.HAMT imported but unused

Remove unused import: py_hamt.HAMT

(F401)


13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused

Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore

(F401)


25-25: Local variable precip is assigned to but never used

Remove assignment to unused variable precip

(F841)


525-525: Local variable root_cid is assigned to but never used

Remove assignment to unused variable root_cid

(F841)


602-602: Redefinition of unused test_sharded_zarr_store_init_invalid_shapes from line 490

(F811)


637-637: Local variable root_cid is assigned to but never used

Remove assignment to unused variable root_cid

(F841)

🪛 Pylint (3.3.7)
py_hamt/hamt_to_sharded_converter.py

[refactor] 12-12: Too many local variables (20/15)

(R0914)

tests/test_converter.py

[refactor] 46-46: Too many local variables (21/15)

(R0914)


[refactor] 135-135: Too many local variables (16/15)

(R0914)


[refactor] 184-184: Too many local variables (17/15)

(R0914)

py_hamt/store.py

[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

tests/test_zarr_ipfs_partial.py

[refactor] 62-62: Too many local variables (34/15)

(R0914)


[refactor] 188-188: Redundant comparison - zhs_read == zhs_read

(R0124)


[refactor] 62-62: Too many statements (90/50)

(R0915)


[refactor] 229-229: Too few public methods (0/2)

(R0903)


[refactor] 238-238: Too many local variables (38/15)

(R0914)


[refactor] 238-238: Too many statements (100/50)

(R0915)

py_hamt/sharded_zarr_store.py

[refactor] 13-13: Too many instance attributes (14/7)

(R0902)


[refactor] 62-62: Too many arguments (8/5)

(R0913)


[refactor] 186-186: Too many return statements (8/6)

(R0911)


[refactor] 274-277: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 363-363: Too many local variables (17/15)

(R0914)


[refactor] 363-363: Too many return statements (7/6)

(R0911)


[refactor] 363-363: Too many branches (14/12)

(R0912)


[refactor] 552-557: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

tests/test_sharded_zarr_store.py

[refactor] 162-162: Too many local variables (16/15)

(R0914)


[refactor] 212-212: Too many local variables (17/15)

(R0914)


[refactor] 270-270: Too many local variables (18/15)

(R0914)


[refactor] 313-313: Too many local variables (18/15)

(R0914)


[error] 602-602: function already defined line 490

(E0102)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/hamt_to_sharded_converter.py

[error] 1-12: Ruff lint errors: I001 import block un-sorted; multiple F401 unused imports (json, typing.Dict, typing.Any, zarr.core.buffer.Buffer, BufferPrototype).

tests/test_converter.py

[error] 1-241: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, aiohttp); F541 f-string without placeholders; F841 assigned but unused local variable 'kubo_cas'.

py_hamt/__init__.py

[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.

py_hamt/sharded_zarr_store.py

[error] 1-1: Mypy type errors: multiple argument type incompatibilities and unsupported operand types involving 'None' and 'int' or 'tuple[int, ...] | None'.


[error] 93-678: Mypy type errors: Argument and operator type mismatches, index errors, and union attribute errors detected in multiple lines.


[error] 1-10: Ruff lint errors: I001 import block un-sorted; F401 unused import 'typing.cast'.

tests/test_sharded_zarr_store.py

[error] 550-687: Mypy errors: dict indexing on possibly None type and redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.


[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.

🔇 Additional comments (14)
tests/testing_utils.py (1)

172-172:

❓ Verification inconclusive

Consider using a pinned version instead of "master-latest" for test stability.

Using "ipfs/kubo:master-latest" introduces non-deterministic behavior in tests as the underlying IPFS version can change between test runs. This can lead to:

  • Flaky tests due to upstream breaking changes
  • Difficulty reproducing test failures
  • Inconsistent behavior across different environments

Consider using a specific version tag:

-    image = "ipfs/kubo:master-latest"
+    image = "ipfs/kubo:v0.35.0"  # or latest stable version

If you need to use the latest version, please verify that this change doesn't introduce test instability by running the test suite multiple times:


🏁 Script executed:

#!/bin/bash
# Verify test stability with new IPFS version
for i in {1..5}; do
  echo "Test run $i"
  pytest tests/ --ipfs -v
  if [ $? -ne 0 ]; then
    echo "Test failed on run $i"
    exit 1
  fi
done
echo "All test runs passed"



Pin the IPFS Docker image for stable tests

Switching from a moving “master-latest” tag to a fixed release prevents non-deterministic failures in your integration suite.

• File: tests/testing_utils.py
Line: 172
• Current:

    image = "ipfs/kubo:master-latest"

• Suggested change:

-    image = "ipfs/kubo:master-latest"
+    image = "ipfs/kubo:v0.35.0"  # or the latest stable release

Please run your full test suite in an environment where pytest is available to confirm there are no regressions or flakiness.

tests/test_zarr_ipfs_encrypted.py (3)

8-8: LGTM! Import change improves type annotations.

The addition of IPLDKind import supports better type annotations throughout the test file.


93-93: LGTM! Explicit type annotation improves code clarity.

The explicit type annotation makes the variable's type clear and aligns with the import of IPLDKind.


109-109: LGTM! Type ignore comments are appropriately used.

The # type: ignore comments suppress legitimate mypy warnings where the type checker cannot infer the correct types for Zarr store operations.

Also applies to: 196-196

py_hamt/__init__.py (2)

3-3: LGTM! Exposing InMemoryCAS in the public API.

Adding InMemoryCAS to the public exports makes sense for testing scenarios and provides users with an in-memory storage option alongside KuboCAS.

Also applies to: 14-14


5-6: LGTM! New sharding functionality properly exposed.

The addition of ShardedZarrStore, convert_hamt_to_sharded, and sharded_converter_cli to the public API enables users to leverage the new sharding capabilities introduced in this PR.

Also applies to: 16-18

🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding

[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.

py_hamt/hamt.py (2)

11-11: LGTM! Import addition supports new optional parameters.

The Optional import is needed for the new type annotations in the enhanced get method.


593-604: LGTM! Well-implemented partial read support.

The enhancement to the get method properly implements partial read functionality:

  • Backward compatibility: All new parameters are optional with None defaults
  • Clear interface: Parameters follow standard byte range conventions (offset, length, suffix)
  • Proper forwarding: Parameters are correctly passed to the underlying cas.load call
  • Type safety: Method signature includes proper type annotations

This aligns well with the corresponding changes in py_hamt/store.py where the ContentAddressedStore.load method was enhanced with the same parameters.

CLAUDE.md (1)

1-88: Excellent documentation for AI-assisted development.

This documentation file provides comprehensive guidance for working with the repository:

  • Clear development workflow: Covers setup, testing, linting, and type checking
  • Accurate architecture description: Well-documented overview of CAS, HAMT, and Zarr integration layers
  • Helpful testing guidance: Explains IPFS integration requirements and test execution options
  • Design pattern documentation: Captures key architectural decisions like async operations and content addressing

This will be valuable for AI-assisted development and new contributors understanding the codebase.

py_hamt/zarr_hamt_store.py (1)

151-163: Well-implemented concurrent partial value retrieval.

The implementation correctly uses asyncio.gather for efficient concurrent fetching of multiple byte ranges. Good addition for supporting partial reads.

py_hamt/store.py (2)

33-39: Good API extension for partial data retrieval

The addition of optional parameters for partial loading is well-designed and maintains backward compatibility with default values.


298-311: Well-implemented HTTP Range header support

The Range header construction correctly follows the HTTP standard with proper handling of all three cases: range with start and end, open-ended range, and suffix range.
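The three cases described can be sketched as a small helper. The function name is illustrative, not KuboCAS's code, but the header syntax follows the standard HTTP byte-range format (note the bounded form's end index is inclusive):

```python
# The three Range header cases noted above: bounded range,
# open-ended range, and suffix range. Illustrative helper only.
from typing import Optional

def build_range_header(offset: Optional[int], length: Optional[int],
                       suffix: Optional[int]) -> Optional[str]:
    if offset is not None and length is not None:
        return f"bytes={offset}-{offset + length - 1}"  # bounded; end inclusive
    if offset is not None:
        return f"bytes={offset}-"                       # open-ended
    if suffix is not None:
        return f"bytes=-{suffix}"                       # last `suffix` bytes
    return None                                         # full GET, no header

assert build_range_header(10, 5, None) == "bytes=10-14"
assert build_range_header(10, None, None) == "bytes=10-"
assert build_range_header(None, None, 8) == "bytes=-8"
```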

tests/test_zarr_ipfs_partial.py (1)

431-461: Excellent test coverage for partial reads

The test comprehensively covers all partial read scenarios including edge cases and error conditions. Well structured!

py_hamt/sharded_zarr_store.py (1)

323-361: Well-structured flush implementation

The flush method properly handles saving dirty shards first, then the root object, with appropriate error handling and state management.
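The ordering described (dirty shards persisted first, then the root object referencing them) can be sketched with dict-backed stand-ins; none of the names below are ShardedZarrStore's actual internals:

```python
# Flush ordering sketch: save dirty shards, clear the dirty set,
# then save a root object that references the shard CIDs.
import asyncio

class FakeCAS:
    """Stand-in content-addressed store handing out fake CIDs."""
    def __init__(self):
        self.blocks = {}

    async def save(self, data: bytes) -> str:
        cid = f"cid-{len(self.blocks)}"
        self.blocks[cid] = data
        return cid

async def flush(cas: FakeCAS, shards: dict, dirty: set) -> str:
    shard_cids = {}
    for shard_id in sorted(shards):
        if shard_id in dirty:                 # persist dirty shards first
            shard_cids[shard_id] = await cas.save(shards[shard_id])
    dirty.clear()
    root = repr(sorted(shard_cids.items())).encode()
    return await cas.save(root)               # root object written last

cas = FakeCAS()
root_cid = asyncio.run(flush(cas, {0: b"s0", 1: b"s1"}, {0, 1}))
assert root_cid == "cid-2" and len(cas.blocks) == 3
```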

Comment on lines 1 to 37
import asyncio
import math

import numpy as np
import pandas as pd
import pytest
import xarray as xr
from zarr.abc.store import RangeByteRequest
import zarr.core.buffer
import dag_cbor

from py_hamt import HAMT, KuboCAS, ShardedZarrStore
from py_hamt.zarr_hamt_store import ZarrHAMTStore


@pytest.fixture(scope="module")
def random_zarr_dataset():
    """Creates a random xarray Dataset for benchmarking."""
    # Using a slightly larger dataset for a more meaningful benchmark
    times = pd.date_range("2024-01-01", periods=100)
    lats = np.linspace(-90, 90, 18)
    lons = np.linspace(-180, 180, 36)

    temp = np.random.randn(len(times), len(lats), len(lons))
    precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))

    ds = xr.Dataset(
        {
            "temp": (["time", "lat", "lon"], temp),
        },
        coords={"time": times, "lat": lats, "lon": lons},
    )

    # Define chunking for the store
    ds = ds.chunk({"time": 20, "lat": 18, "lon": 36})
    yield ds


🛠️ Refactor suggestion

Remove unused imports and variables

Several imports and variables are not used in the test file.

Apply these changes to clean up the code:

-import asyncio
-import math
 
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
 
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore

Also remove the unused precip variable:

 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))
 
 ds = xr.Dataset(
     {
         "temp": (["time", "lat", "lon"], temp),
     },
     coords={"time": times, "lat": lats, "lon": lons},
 )
🧰 Tools
🪛 Ruff (0.11.9)

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


2-2: math imported but unused

Remove unused import: math

(F401)


12-12: py_hamt.HAMT imported but unused

Remove unused import: py_hamt.HAMT

(F401)


13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused

Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore

(F401)


25-25: Local variable precip is assigned to but never used

Remove assignment to unused variable precip

(F841)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding

[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.

🤖 Prompt for AI Agents
In tests/test_sharded_zarr_store.py lines 1 to 37, remove the unused imports
asyncio, math, py_hamt.HAMT, and py_hamt.zarr_hamt_store.ZarrHAMTStore. Also
delete the unused variable precip in the random_zarr_dataset fixture to clean
up the code and avoid unnecessary clutter.

Comment on lines 1 to 375
# Functional Flat Store for Zarr using IPFS. This is ultimately the "Best" way to do it
# but only if the metadata is small enough to fit in a single CBOR object.
# This is commented out because we are not using it. However I did not want to delete it
# because it is a good example of how to implement a Zarr store using a flat index

# import asyncio
# import math
# from collections.abc import AsyncIterator, Iterable
# from typing import Optional, cast

# import dag_cbor
# import zarr.abc.store
# import zarr.core.buffer
# from zarr.core.common import BytesLike

# from .store import ContentAddressedStore


# class FlatZarrStore(zarr.abc.store.Store):
#     """
#     Implements the Zarr Store API using a flat, predictable layout for chunk CIDs.
#
#     This store bypasses the need for a HAMT, offering direct, calculated
#     access to chunk data based on a mathematical formula. It is designed for
#     dense, multi-dimensional arrays where chunk locations are predictable.
#
#     The store is structured around a single root CBOR object. This root object contains:
#     1. A dictionary mapping metadata keys (like 'zarr.json') to their CIDs.
#     2. A single CID pointing to a large, contiguous block of bytes (the "flat index").
#        This flat index is a concatenation of the CIDs of all data chunks.
#
#     Accessing a chunk involves:
#     1. Loading the root object (if not cached).
#     2. Calculating the byte offset of the chunk's CID within the flat index.
#     3. Fetching that specific CID using a byte-range request on the flat index.
#     4. Fetching the actual chunk data using the retrieved CID.
#
#     ### Sample Code
#     ```python
#     import xarray as xr
#     import numpy as np
#     from py_hamt import KuboCAS, FlatZarrStore
#
#     # --- Write ---
#     ds = xr.Dataset(
#         {"data": (("t", "y", "x"), np.arange(24).reshape(2, 3, 4))},
#     )
#     cas = KuboCAS()
#     # When creating, must provide array shape and chunk shape
#     store_write = await FlatZarrStore.open(
#         cas,
#         read_only=False,
#         array_shape=ds.data.shape,
#         chunk_shape=ds.data.encoding['chunks']
#     )
#     ds.to_zarr(store=store_write, mode="w")
#     root_cid = await store_write.flush()  # IMPORTANT: flush to get final root CID
#     print(f"Finished writing. Root CID: {root_cid}")
#
#     # --- Read ---
#     store_read = await FlatZarrStore.open(cas, read_only=True, root_cid=root_cid)
#     ds_read = xr.open_zarr(store=store_read)
#     print("Read back dataset:")
#     print(ds_read)
#     xr.testing.assert_identical(ds, ds_read)
#     ```
#     """
#
#     def __init__(
#         self, cas: ContentAddressedStore, read_only: bool, root_cid: Optional[str]
#     ):
#         """Use the async `open()` classmethod to instantiate this class."""
#         super().__init__(read_only=read_only)
#         self.cas = cas
#         self._root_cid = root_cid
#         self._root_obj: Optional[dict] = None
#         self._flat_index_cache: Optional[bytearray] = None
#         self._cid_len: Optional[int] = None
# self._array_shape: Optional[tuple[int, ...]] = None
# self._chunk_shape: Optional[tuple[int, ...]] = None
# self._chunks_per_dim: Optional[tuple[int, ...]] = None
# self._dirty = False

# @classmethod
# async def open(
# cls,
# cas: ContentAddressedStore,
# read_only: bool,
# root_cid: Optional[str] = None,
# *,
# array_shape: Optional[tuple[int, ...]] = None,
# chunk_shape: Optional[tuple[int, ...]] = None,
# cid_len: int = 59, # Default for base32 v1 CIDs like bafy...
# ) -> "FlatZarrStore":
# """
# Asynchronously opens an existing FlatZarrStore or initializes a new one.

# Args:
# cas: The Content Addressed Store (e.g., KuboCAS).
# read_only: If True, the store is in read-only mode.
# root_cid: The root CID of an existing store to open. Required for read_only.
# array_shape: The full shape of the Zarr array. Required for a new writeable store.
# chunk_shape: The shape of a single chunk. Required for a new writeable store.
# cid_len: The expected byte length of a CID string.
# """
# store = cls(cas, read_only, root_cid)
# if root_cid:
# await store._load_root_from_cid()
# elif not read_only:
# if not all([array_shape, chunk_shape]):
# raise ValueError(
# "array_shape and chunk_shape must be provided for a new store."
# )
# store._initialize_new_root(array_shape, chunk_shape, cid_len)
# else:
# raise ValueError("root_cid must be provided for a read-only store.")
# return store

# def _initialize_new_root(
# self,
# array_shape: tuple[int, ...],
# chunk_shape: tuple[int, ...],
# cid_len: int,
# ):
# self._array_shape = array_shape
# self._chunk_shape = chunk_shape
# self._cid_len = cid_len
# self._chunks_per_dim = tuple(
# math.ceil(a / c) for a, c in zip(array_shape, chunk_shape)
# )
# self._root_obj = {
# "manifest_version": "flat_zarr_v1",
# "metadata": {},
# "chunks": {
# "cid": None, # Will be filled on first flush
# "array_shape": list(self._array_shape),
# "chunk_shape": list(self._chunk_shape),
# "cid_byte_length": self._cid_len,
# },
# }
# self._dirty = True

# async def _load_root_from_cid(self):
# if not self._root_cid:
# raise ValueError("Cannot load root without a root_cid.")
# root_bytes = await self.cas.load(self._root_cid)
# self._root_obj = dag_cbor.decode(root_bytes)
# chunk_info = self._root_obj.get("chunks", {})
# self._array_shape = tuple(chunk_info["array_shape"])
# self._chunk_shape = tuple(chunk_info["chunk_shape"])
# self._cid_len = chunk_info["cid_byte_length"]
# self._chunks_per_dim = tuple(
# math.ceil(a / c) for a, c in zip(self._array_shape, self._chunk_shape)
# )

# def _parse_chunk_key(self, key: str) -> Optional[tuple[int, ...]]:
# if not self._array_shape or not key.startswith("c/"):
# return None
# parts = key.split("/")
# if len(parts) != len(self._array_shape) + 1:
# return None
# try:
# return tuple(map(int, parts[1:]))
# except (ValueError, IndexError):
# return None

# async def set_partial_values(
# self, key_start_values: Iterable[tuple[str, int, BytesLike]]
# ) -> None:
# """@private"""
# raise NotImplementedError("Partial writes are not supported by this store.")

# async def get_partial_values(
# self,
# prototype: zarr.core.buffer.BufferPrototype,
# key_ranges: Iterable[tuple[str, zarr.abc.store.ByteRequest | None]],
# ) -> list[zarr.core.buffer.Buffer | None]:
# """
# Retrieves multiple keys or byte ranges concurrently.
# """
# tasks = [self.get(key, prototype, byte_range) for key, byte_range in key_ranges]
# results = await asyncio.gather(*tasks)
# return results

# def __eq__(self, other: object) -> bool:
# """@private"""
# if not isinstance(other, FlatZarrStore):
# return NotImplemented
# return self._root_cid == other._root_cid

# def _get_chunk_offset(self, chunk_coords: tuple[int, ...]) -> int:
# linear_index = 0
# multiplier = 1
# for i in reversed(range(len(self._chunks_per_dim))):
# linear_index += chunk_coords[i] * multiplier
# multiplier *= self._chunks_per_dim[i]
# return linear_index * self._cid_len

# async def flush(self) -> str:
# """
# Writes all pending changes (metadata and chunk index) to the CAS
# and returns the new root CID. This MUST be called after all writes are complete.
# """
# if self.read_only or not self._dirty:
# return self._root_cid

# if self._flat_index_cache is not None:
# flat_index_cid_obj = await self.cas.save(
# bytes(self._flat_index_cache), codec="raw"
# )
# self._root_obj["chunks"]["cid"] = str(flat_index_cid_obj)

# root_obj_bytes = dag_cbor.encode(self._root_obj)
# new_root_cid_obj = await self.cas.save(root_obj_bytes, codec="dag-cbor")
# self._root_cid = str(new_root_cid_obj)
# self._dirty = False
# return self._root_cid

# async def get(
# self,
# key: str,
# prototype: zarr.core.buffer.BufferPrototype,
# byte_range: zarr.abc.store.ByteRequest | None = None,
# ) -> zarr.core.buffer.Buffer | None:
# """@private"""
# if self._root_obj is None:
# await self._load_root_from_cid()

# chunk_coords = self._parse_chunk_key(key)
# try:
# # Metadata request
# if chunk_coords is None:
# metadata_cid = self._root_obj["metadata"].get(key)
# if metadata_cid is None:
# return None
# data = await self.cas.load(metadata_cid)
# return prototype.buffer.from_bytes(data)

# # Chunk data request
# flat_index_cid = self._root_obj["chunks"]["cid"]
# if flat_index_cid is None:
# return None

# offset = self._get_chunk_offset(chunk_coords)
# chunk_cid_bytes = await self.cas.load(
# flat_index_cid, offset=offset, length=self._cid_len
# )

# if all(b == 0 for b in chunk_cid_bytes):
# return None # Chunk doesn't exist

# chunk_cid = chunk_cid_bytes.decode("ascii")
# data = await self.cas.load(chunk_cid)
# return prototype.buffer.from_bytes(data)

# except (KeyError, IndexError):
# return None

# async def set(self, key: str, value: zarr.core.buffer.Buffer) -> None:
# """@private"""
# if self.read_only:
# raise ValueError("Cannot write to a read-only store.")
# if self._root_obj is None:
# raise RuntimeError("Store not initialized for writing.")

# self._dirty = True
# raw_bytes = value.to_bytes()
# value_cid_obj = await self.cas.save(raw_bytes, codec="raw")
# value_cid = str(value_cid_obj)

# if len(value_cid) != self._cid_len:
# raise ValueError(
# f"Inconsistent CID length. Expected {self._cid_len}, got {len(value_cid)}"
# )

# chunk_coords = self._parse_chunk_key(key)

# if chunk_coords is None: # Metadata
# self._root_obj["metadata"][key] = value_cid
# return

# # Chunk Data
# if self._flat_index_cache is None:
# num_chunks = math.prod(self._chunks_per_dim)
# self._flat_index_cache = bytearray(num_chunks * self._cid_len)

# offset = self._get_chunk_offset(chunk_coords)
# self._flat_index_cache[offset : offset + self._cid_len] = value_cid.encode(
# "ascii"
# )

# # --- Other required zarr.abc.store methods ---

# async def exists(self, key: str) -> bool:
# """@private"""
# # A more efficient version might check for null bytes in the flat index
# # but this is functionally correct.

# # TODO: Optimize this check
# return True


# # return (await self.get(key, zarr.core.buffer.Buffer.prototype, None)) is not None

# @property
# def supports_writes(self) -> bool:
# """@private"""
# return not self.read_only

# @property
# def supports_partial_writes(self) -> bool:
# """@private"""
# return False # Each chunk is an immutable object

# @property
# def supports_deletes(self) -> bool:
# """@private"""
# return not self.read_only

# async def delete(self, key: str) -> None:
# if self.read_only:
# raise ValueError("Cannot delete from a read-only store.")
# if self._root_obj is None:
# await self._load_root_from_cid()
# chunk_coords = self._parse_chunk_key(key)
# if chunk_coords is None:
# if key in self._root_obj["metadata"]:
# del self._root_obj["metadata"][key]
# self._dirty = True
# return
# else:
# raise KeyError(f"Metadata key '{key}' not found.")
# flat_index_cid = self._root_obj["chunks"]["cid"]
# if self._flat_index_cache is None:
# if not flat_index_cid:
# raise KeyError(f"Chunk key '{key}' not found in non-existent index.")
# self._flat_index_cache = bytearray(await self.cas.load(flat_index_cid))
# offset = self._get_chunk_offset(chunk_coords)
# if all(b == 0 for b in self._flat_index_cache[offset : offset + self._cid_len]):
# raise KeyError(f"Chunk key '{key}' not found.")
# self._flat_index_cache[offset : offset + self._cid_len] = bytearray(self._cid_len)
# self._dirty = True

# @property
# def supports_listing(self) -> bool:
# """@private"""
# return True

# async def list(self) -> AsyncIterator[str]:
# """@private"""
# if self._root_obj is None:
# await self._load_root_from_cid()
# for key in self._root_obj["metadata"]:
# yield key
# # Note: Listing all chunk keys without reading the index is non-trivial.
# # A full implementation might need an efficient way to iterate non-null chunks.
# # This basic version only lists metadata.

# async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
# """@private"""
# async for key in self.list():
# if key.startswith(prefix):
# yield key

# async def list_dir(self, prefix: str) -> AsyncIterator[str]:
# """@private"""
# # This simplified version only works for the root.
# if prefix == "":
# seen = set()
# async for key in self.list():
# name = key.split('/')[0]
# if name not in seen:
# seen.add(name)
# yield name
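The offset arithmetic in `_get_chunk_offset` above is plain row-major linearisation of the chunk grid. A standalone sketch of that calculation (the helper name here is illustrative, not part of the store's API):

```python
import math

# Row-major linear index -> byte offset into the flat CID index,
# mirroring the _get_chunk_offset logic in the commented store above.
def chunk_offset(coords, chunks_per_dim, cid_len):
    linear = 0
    multiplier = 1
    for i in reversed(range(len(chunks_per_dim))):
        linear += coords[i] * multiplier
        multiplier *= chunks_per_dim[i]
    return linear * cid_len

# A 2x3x4 array with 1x1x2 chunks gives a 2x3x2 chunk grid.
chunks_per_dim = tuple(math.ceil(a / c) for a, c in zip((2, 3, 4), (1, 1, 2)))
print(chunks_per_dim)                               # (2, 3, 2)
print(chunk_offset((1, 2, 1), chunks_per_dim, 59))  # (1*6 + 2*2 + 1) * 59 = 649
```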

🛠️ Refactor suggestion

Consider removing commented-out code

The entire file consists of commented-out code. While the header explains it's kept as a reference implementation, it's better to remove it and rely on version control history.

Consider one of these approaches:

  1. Remove the file entirely and reference it in documentation or a README if needed
  2. Move it to a examples/ or reference/ directory if it has educational value
  3. If parts are useful, extract them into a documentation file

Keeping large blocks of commented code in the main codebase can cause confusion and maintenance overhead.

🤖 Prompt for AI Agents
In py_hamt/flat_zarr_store.py lines 1 to 375, the entire file is commented-out
code serving as a reference implementation. To reduce confusion and maintenance
overhead, remove this commented-out code from the main codebase. Instead, either
delete the file and keep it accessible via version control history, move it to
an examples or reference directory if it has educational value, or extract
useful parts into a separate documentation file.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
py_hamt/store.py (1)

58-96: 🛠️ Refactor suggestion

length-only reads and value checking are missing; logic disagrees with comment

  1. Calling load(id, length=8) currently falls through to the full load path instead of returning the first 8 bytes.
  2. The comment “If only length is given, assume start from 0” is therefore incorrect.
  3. Negative values or a simultaneous offset + suffix are silently accepted, which can produce surprising slices.
  4. There is an unnecessary elif/else chain after early returns (pylint R1705).

Tighten the control flow and handle all combinations explicitly:

@@
-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if (offset is not None) and (suffix is not None):
+            raise ValueError("Specify either offset/length or suffix, not both")
+
+        if offset is not None:
+            if offset < 0:
+                raise ValueError("offset must be ≥ 0")
+            if length is None:
+                return data[offset:]
+            if length < 0:
+                raise ValueError("length must be ≥ 0")
+            return data[offset : offset + length]
+
+        if suffix is not None:
+            if suffix < 0:
+                raise ValueError("suffix must be ≥ 0")
+            return data[-suffix:]
+
+        if length is not None:        # implicit offset == 0
+            if length < 0:
+                raise ValueError("length must be ≥ 0")
+            return data[:length]
+
+        return data                   # full blob
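A quick runnable sketch of the proposed slicing semantics, including the length-only case the current code misses (the helper name is illustrative, not py_hamt's API):

```python
# Mirrors the validated offset/length/suffix slicing proposed above.
def slice_blob(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    if offset is not None and suffix is not None:
        raise ValueError("Specify either offset/length or suffix, not both")
    if offset is not None:
        return data[offset:] if length is None else data[offset:offset + length]
    if suffix is not None:
        return data[-suffix:]
    if length is not None:  # implicit offset == 0
        return data[:length]
    return data  # full blob

blob = b"0123456789"
print(slice_blob(blob, offset=2, length=3))  # b'234'
print(slice_blob(blob, length=4))            # b'0123' (falls through to a full load today)
print(slice_blob(blob, suffix=3))            # b'789'
```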
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🧹 Nitpick comments (2)
py_hamt/store.py (2)

33-40: Docstring/method contract not updated for new slice parameters

ContentAddressedStore.load() now accepts offset, length, and suffix, but the docstring (“Retrieve data.”) was left untouched. Add parameter descriptions and clarify precedence/valid-value ranges (e.g. that negative numbers are rejected and that offset & suffix are mutually exclusive).


85-95: Minor: remove superfluous elif/else branches after early returns

Once a branch returns, the following elif/else is redundant; flattening the logic (see previous diff) resolves the pylint R1705 warning.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8476806 and 50fd6e6.

📒 Files selected for processing (1)
  • py_hamt/store.py (5 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
py_hamt/store.py

[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

py_hamt/store.py Outdated
Comment on lines 307 to 333
async def load(
self,
id: IPLDKind,
offset: Optional[int] = None,
length: Optional[int] = None,
suffix: Optional[int] = None,
) -> bytes:
"""@private"""
cid = cast(CID, id) # CID is definitely in the IPLDKind type
cid = cast(CID, id)
url: str = self.gateway_base_url + str(cid)
headers: dict[str, str] = {}

# Construct the Range header if required
if offset is not None:
start = offset
if length is not None:
# Standard HTTP Range: bytes=start-end (inclusive)
end = start + length - 1
headers["Range"] = f"bytes={start}-{end}"
else:
# Standard HTTP Range: bytes=start- (from start to end)
headers["Range"] = f"bytes={start}-"
elif suffix is not None:
# Standard HTTP Range: bytes=-N (last N bytes)
headers["Range"] = f"bytes=-{suffix}"

async with self._sem: # throttle gateway

⚠️ Potential issue

Partial-range header builder omits the length-only case and lacks validation

KuboCAS.load() mirrors the in-memory logic, so the same edge cases apply:

  • length without offset results in a full GET (no Range header).
    Expected header: Range: bytes=0-{length-1}.
  • Missing checks for negative values and for mutually-exclusive offset+suffix.

Proposed incremental fix:

@@
-    if offset is not None:
+    if (offset is not None) and (suffix is not None):
+        raise ValueError("Specify either offset/length or suffix, not both")
+
+    if offset is not None:
         start = offset
         if length is not None:
@@
             headers["Range"] = f"bytes={start}-"
-    elif suffix is not None:
+    elif suffix is not None:
         # Standard HTTP Range: bytes=-N (last N bytes)
         headers["Range"] = f"bytes=-{suffix}"
+    elif length is not None:
+        # Range from byte 0 to length-1
+        headers["Range"] = f"bytes=0-{length-1}"

Consider also asserting non-negative integers for all three parameters.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def load(
self,
id: IPLDKind,
offset: Optional[int] = None,
length: Optional[int] = None,
suffix: Optional[int] = None,
) -> bytes:
"""@private"""
cid = cast(CID, id) # CID is definitely in the IPLDKind type
cid = cast(CID, id)
url: str = self.gateway_base_url + str(cid)
headers: dict[str, str] = {}
# Construct the Range header if required
if offset is not None:
start = offset
if length is not None:
# Standard HTTP Range: bytes=start-end (inclusive)
end = start + length - 1
headers["Range"] = f"bytes={start}-{end}"
else:
# Standard HTTP Range: bytes=start- (from start to end)
headers["Range"] = f"bytes={start}-"
elif suffix is not None:
# Standard HTTP Range: bytes=-N (last N bytes)
headers["Range"] = f"bytes=-{suffix}"
async with self._sem: # throttle gateway
async def load(
self,
id: IPLDKind,
offset: Optional[int] = None,
length: Optional[int] = None,
suffix: Optional[int] = None,
) -> bytes:
"""@private"""
cid = cast(CID, id)
url: str = self.gateway_base_url + str(cid)
headers: dict[str, str] = {}
# Construct the Range header if required
if (offset is not None) and (suffix is not None):
raise ValueError("Specify either offset/length or suffix, not both")
if offset is not None:
start = offset
if length is not None:
# Standard HTTP Range: bytes=start-end (inclusive)
end = start + length - 1
headers["Range"] = f"bytes={start}-{end}"
else:
# Standard HTTP Range: bytes=start- (from start to end)
headers["Range"] = f"bytes={start}-"
elif suffix is not None:
# Standard HTTP Range: bytes=-N (last N bytes)
headers["Range"] = f"bytes=-{suffix}"
elif length is not None:
# Range from byte 0 to length-1
headers["Range"] = f"bytes=0-{length-1}"
async with self._sem: # throttle gateway
🤖 Prompt for AI Agents
In py_hamt/store.py around lines 307 to 333, the load method's Range header
construction misses handling the case where length is provided without offset,
which should set the Range header to bytes=0-(length-1). Additionally, add
validation to ensure offset, length, and suffix are non-negative integers and
enforce that offset and suffix are not both provided simultaneously. Update the
logic to include these checks and construct the Range header accordingly to
cover all edge cases.
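To make the mapping concrete, here is a minimal sketch of the header builder with the length-only branch added (the function name is illustrative; HTTP Range ends are inclusive per the spec):

```python
# Maps (offset, length, suffix) to an HTTP Range header value, or None for a full GET.
def build_range_header(offset=None, length=None, suffix=None):
    if offset is not None and suffix is not None:
        raise ValueError("Specify either offset/length or suffix, not both")
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"
    if suffix is not None:
        return f"bytes=-{suffix}"  # last N bytes
    if length is not None:
        return f"bytes=0-{length - 1}"  # the length-only case the fix adds
    return None  # full GET, no Range header

print(build_range_header(offset=10, length=5))  # bytes=10-14
print(build_range_header(length=8))             # bytes=0-7
print(build_range_header(suffix=16))            # bytes=-16
```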

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tests/test_cpc_compare.py (1)

1-128: 🛠️ Refactor suggestion

Dead-code test module — delete or convert to a real test

Every line in this file is commented out, so the test runner will never execute it. Carrying around disabled test code adds noise and sends mixed signals about intended coverage.
• If you still want the benchmark, move it to an active pytest test (mark it with @pytest.mark.benchmark or similar and gate it behind an opt-in flag).
• Otherwise, delete the file to avoid confusion.

♻️ Duplicate comments (8)
py_hamt/hamt_to_sharded_converter.py (1)

3-10: Remove the unused imports flagged by Ruff (duplicate reminder)

json, Dict, Any, Buffer, and BufferPrototype are not referenced anywhere in this module. Keeping them causes F401 warnings and breaks the linting gate.

-import json
-from typing import Dict, Any
-from zarr.core.buffer import Buffer, BufferPrototype
🧰 Tools
🪛 Ruff (0.11.9)

3-3: json imported but unused

Remove unused import: json

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.Buffer imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.BufferPrototype imported but unused

Remove unused import

(F401)

tests/test_converter.py (2)

1-7: Drop unused imports and sort the remaining block (duplicate reminder)

asyncio and aiohttp are unused, triggering F401. Remove them and keep standard-lib imports first.

-import asyncio
 ...
-from unittest.mock import patch
-import aiohttp
+from unittest.mock import patch
🧰 Tools
🪛 Ruff (0.11.9)

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


6-6: aiohttp imported but unused

Remove unused import: aiohttp

(F401)


182-182: Remove the unnecessary f-string prefix

No interpolation happens here.

-        assert f"New ShardedZarrStore Root CID" in captured.out
+        assert "New ShardedZarrStore Root CID" in captured.out
🧰 Tools
🪛 Ruff (0.11.9)

182-182: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/test_sharded_zarr_store.py (2)

1-2: Remove unused imports and variables

Several imports and the precip variable are not used in the test file.

Apply this diff to clean up the code:

-import asyncio
-import math
 
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
 
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore

Also remove the unused precip variable:

 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))
 
 ds = xr.Dataset(

Also applies to: 12-13, 25-25

🧰 Tools
🪛 Ruff (0.11.9)

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


2-2: math imported but unused

Remove unused import: math

(F401)


647-732: ⚠️ Potential issue

Remove duplicate function definition

The function test_sharded_zarr_store_init_invalid_shapes is defined twice. The commented-out version (lines 509-596) should be removed along with this duplicate definition.

🧰 Tools
🪛 Ruff (0.11.9)

689-689: Local variable root_cid is assigned to but never used

Remove assignment to unused variable root_cid

(F841)

py_hamt/store.py (1)

334-346: ⚠️ Potential issue

Partial-range header builder omits the length-only case and lacks validation

KuboCAS.load() has a bug where length without offset results in a full GET (no Range header). It should set Range: bytes=0-{length-1}.

Apply this fix:

         # Construct the Range header if required
-        if offset is not None:
+        if (offset is not None) and (suffix is not None):
+            raise ValueError("Specify either offset/length or suffix, not both")
+        
+        if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
         elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
+        elif length is not None:
+            # Range from byte 0 to length-1
+            headers["Range"] = f"bytes=0-{length-1}"
py_hamt/sharded_zarr_store.py (2)

325-334: Add type assertion for better type safety

After checking that _cid_len is not None, add an assertion to help mypy understand the type.

         if self._cid_len is None:  # Should be set
             raise RuntimeError(
                 "Store not initialized: _cid_len is None for shard initialization."
             )
+        assert self._cid_len is not None  # Type assertion for mypy
         if self._chunks_per_shard is None:
             raise RuntimeError(
                 "Store not initialized: _chunks_per_shard is None for shard initialization."
             )
+        assert self._chunks_per_shard is not None  # Type assertion for mypy
         # New shard or shard not yet written, initialize with zeros
         shard_size_bytes = self._chunks_per_shard * self._cid_len

488-498: ⚠️ Potential issue

Add explicit ByteRequest type validation in get()

The code assumes byte_range has .start and .end attributes, which will fail for OffsetByteRequest or SuffixByteRequest types.

Replace the existing byte range handling:

-        req_offset = byte_range.start if byte_range else None
-        req_length = None
-        if byte_range:
-            if byte_range.end is not None:
-                if (
-                    byte_range.start > byte_range.end
-                ):  # Zarr allows start == stop for 0 length
-                    raise ValueError(
-                        f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
-                    )
-                req_length = byte_range.end - byte_range.start
+        from zarr.abc.store import RangeByteRequest, OffsetByteRequest, SuffixByteRequest
+
+        req_offset: Optional[int] = None
+        req_length: Optional[int] = None
+        if byte_range is not None:
+            if isinstance(byte_range, RangeByteRequest):
+                req_offset = byte_range.start
+                if byte_range.end is not None:
+                    if byte_range.start > byte_range.end:
+                        raise ValueError(f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})")
+                    req_length = byte_range.end - byte_range.start
+            elif isinstance(byte_range, OffsetByteRequest):
+                req_offset = byte_range.offset
+            elif isinstance(byte_range, SuffixByteRequest):
+                raise TypeError("SuffixByteRequest is not supported by ShardedZarrStore.get()")
+            else:
+                raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
🧹 Nitpick comments (12)
py_hamt/hamt_to_sharded_converter.py (1)

50-58: Missing async context manager / explicit close for dest_store

ShardedZarrStore.open() likely allocates buffers and holds a CAS reference.
Wrap it in an async context manager (if available) or explicitly call await dest_store.close() after flush() to avoid descriptor leaks.

py_hamt/manage_pins.py (1)

62-64: Propagate the target RPC explicitly for clarity

The tests call pin_entire_dataset(target_rpc=rpc_base_url), but the CLI path omits the parameter. Passing it makes the behaviour explicit and avoids surprises if ShardedZarrStore later defaults to a different endpoint.

-        await store.pin_entire_dataset()
+        await store.pin_entire_dataset(target_rpc=args.rpc_url)

Apply the same change for unpin_entire_dataset.

tests/test_sharded_zarr_pinning.py (2)

119-134: Fixed sleep can make the test flaky

Relying on await asyncio.sleep(1) assumes the Kubo daemon updates pin state within one second, which is optimistic on busy nodes.

Refactor to poll with a timeout:

async def _wait_until(predicate, timeout=30, interval=0.5):
    # predicate must be a coroutine function: `await` is not allowed in a lambda
    start = asyncio.get_event_loop().time()
    while not await predicate():
        if asyncio.get_event_loop().time() - start > timeout:
            pytest.fail("Pin state did not converge in time")
        await asyncio.sleep(interval)

async def _pins_converged():
    return expected_cids <= await get_pinned_cids(rpc_base_url)

await _wait_until(_pins_converged)

12-20: Use GET for pin/ls for better compatibility

/api/v0/pin/ls accepts GET; POST works on recent Kubo but older nodes reject it. Switching to GET maximises compatibility.

-            async with session.post(url, params={"type": "all"}) as resp:
+            async with session.get(url, params={"type": "all"}) as resp:
tests/test_converter.py (1)

249-252: kubo_cas is created but unused

The context manager is only needed to satisfy type checks; drop the binding to silence F841.

-    async with KuboCAS(
-        rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
-    ) as kubo_cas:
+    async with KuboCAS(
+        rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
+    ):
🧰 Tools
🪛 Ruff (0.11.9)

251-251: Local variable kubo_cas is assigned to but never used

Remove assignment to unused variable kubo_cas

(F841)

tests/test_sharded_zarr_store.py (1)

689-689: Remove unused variable assignment

The root_cid variable is assigned but never used.

-        root_cid = await store.flush()
+        await store.flush()
🧰 Tools
🪛 Ruff (0.11.9)

689-689: Local variable root_cid is assigned to but never used

Remove assignment to unused variable root_cid

(F841)

py_hamt/store.py (1)

94-104: Simplify control flow by removing unnecessary else

The else block after return can be removed to reduce nesting.

         if offset is not None:
             start = offset
             if length is not None:
                 end = start + length
                 return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
+            return data[start:]
+        if suffix is not None:
             return data[-suffix:]
-        else:  # Full load
-            return data
+        # Full load
+        return data
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

py_hamt/sharded_zarr_store.py (5)

4-4: Remove unused import

The cast import is not used in this file.

-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
🧰 Tools
🪛 Ruff (0.11.9)

4-4: typing.cast imported but unused

Remove unused import: typing.cast

(F401)


828-833: Use or remove unused exception variable

The exception is assigned to e but never used. Either use it in error handling or remove the assignment.

                         try:
                             await self.cas.unpin_cid(
                                 chunk_cid_str, target_rpc=target_rpc
                             )
-                        except Exception as e:
-                            # ignore
+                        except Exception:
+                            # Ignore failures - chunk might already be unpinned
                             continue
🧰 Tools
🪛 Ruff (0.11.9)

831-831: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


844-847: Use exception information in warning message

The exception should be included in the warning message for better debugging.

             try:
                 await self.cas.unpin_cid(str(shard_cid), target_rpc=target_rpc)
             except Exception as e:
-                print(f"Warning: Could not unpin shard {str(shard_cid)}")
+                print(f"Warning: Could not unpin shard {str(shard_cid)}: {e}")
🧰 Tools
🪛 Ruff (0.11.9)

845-845: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


854-858: Include exception details in warning message

Add the exception information to the warning for better debugging.

                 try:
                     await self.cas.unpin_cid(cid, target_rpc=target_rpc)
                     print(f"Unpinned metadata CID {cid} from {target_rpc}...")
                 except Exception as e:
                     print(
-                        f"Warning: Could not unpin metadata CID {cid}. Likely already unpinned."
+                        f"Warning: Could not unpin metadata CID {cid}: {e}"
                     )
🧰 Tools
🪛 Ruff (0.11.9)

855-855: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


863-868: Include exception details in warning message

Add the exception information to improve debugging.

             try:
                 await self.cas.unpin_cid(self._root_cid, target_rpc=target_rpc)
                 print(f"Unpinned root CID {self._root_cid} from {target_rpc}...")
             except Exception as e:
                 print(
-                    f"Warning: Could not unpin root CID {self._root_cid}. Likely already unpinned."
+                    f"Warning: Could not unpin root CID {self._root_cid}: {e}"
                 )
🧰 Tools
🪛 Ruff (0.11.9)

865-865: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50fd6e6 and b0268ad.

📒 Files selected for processing (10)
  • py_hamt/flat_zarr_store.py (1 hunks)
  • py_hamt/hamt_to_sharded_converter.py (1 hunks)
  • py_hamt/manage_pins.py (1 hunks)
  • py_hamt/sharded_zarr_store.py (1 hunks)
  • py_hamt/store.py (7 hunks)
  • tests/test_benchmark_stores.py (1 hunks)
  • tests/test_converter.py (1 hunks)
  • tests/test_cpc_compare.py (1 hunks)
  • tests/test_sharded_zarr_pinning.py (1 hunks)
  • tests/test_sharded_zarr_store.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • py_hamt/flat_zarr_store.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_benchmark_stores.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
py_hamt/store.py (4)
tests/test_zarr_ipfs_partial.py (1)
  • load (275-285)
py_hamt/hamt.py (4)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
py_hamt/sharded_zarr_store.py (1)
  • get (413-501)
py_hamt/zarr_hamt_store.py (1)
  • get (116-149)
🪛 Ruff (0.11.9)
tests/test_converter.py

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


6-6: aiohttp imported but unused

Remove unused import: aiohttp

(F401)


182-182: f-string without any placeholders

Remove extraneous f prefix

(F541)


251-251: Local variable kubo_cas is assigned to but never used

Remove assignment to unused variable kubo_cas

(F841)

tests/test_sharded_zarr_store.py

1-1: asyncio imported but unused

Remove unused import: asyncio

(F401)


2-2: math imported but unused

Remove unused import: math

(F401)


12-12: py_hamt.HAMT imported but unused

Remove unused import: py_hamt.HAMT

(F401)


13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused

Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore

(F401)


25-25: Local variable precip is assigned to but never used

Remove assignment to unused variable precip

(F841)


689-689: Local variable root_cid is assigned to but never used

Remove assignment to unused variable root_cid

(F841)

py_hamt/sharded_zarr_store.py

4-4: typing.cast imported but unused

Remove unused import: typing.cast

(F401)


831-831: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


845-845: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


855-855: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


865-865: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

py_hamt/hamt_to_sharded_converter.py

3-3: json imported but unused

Remove unused import: json

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.Buffer imported but unused

Remove unused import

(F401)


10-10: zarr.core.buffer.BufferPrototype imported but unused

Remove unused import

(F401)

🪛 Pylint (3.3.7)
tests/test_converter.py

[refactor] 52-52: Too many local variables (21/15)

(R0914)


[refactor] 145-145: Too many local variables (16/15)

(R0914)


[refactor] 196-196: Too many local variables (17/15)

(R0914)

tests/test_sharded_zarr_pinning.py

[refactor] 83-83: Too many local variables (16/15)

(R0914)

tests/test_sharded_zarr_store.py

[refactor] 160-160: Too many local variables (16/15)

(R0914)


[refactor] 210-210: Too many local variables (17/15)

(R0914)


[refactor] 268-268: Too many local variables (18/15)

(R0914)


[refactor] 311-311: Too many local variables (18/15)

(R0914)

py_hamt/store.py

[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

py_hamt/sharded_zarr_store.py

[refactor] 13-13: Too many instance attributes (14/7)

(R0902)


[refactor] 68-68: Too many arguments (8/5)

(R0913)


[refactor] 204-204: Too many return statements (8/6)

(R0911)


[refactor] 296-299: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 289-289: Too many branches (13/12)

(R0912)


[refactor] 413-413: Too many local variables (17/15)

(R0914)


[refactor] 413-413: Too many return statements (7/6)

(R0911)


[refactor] 413-413: Too many branches (14/12)

(R0912)


[refactor] 638-643: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 795-795: Too many branches (15/12)

(R0912)

py_hamt/hamt_to_sharded_converter.py

[refactor] 13-13: Too many local variables (20/15)

(R0914)

🔇 Additional comments (1)
py_hamt/hamt_to_sharded_converter.py (1)

64-74:

❓ Verification inconclusive

Verify CID length before writing into shards

cid_base32_str is written verbatim, but cid_len is hard-coded to 59. A multibase-encoded CIDv1 can be longer than 59 chars (e.g. long multihash algorithms). Writing a longer string will silently truncate or corrupt the shard layout.

Would you run a quick search to see whether any CID in real datasets exceeds 59 characters?


🏁 Script executed:

#!/bin/bash
# Scan existing shards (if any) for long CIDs
fd -e zarr | xargs -I{} grep -oE 'b[a-z0-9]{60,}' {} | head

Length of output: 65


Potential CID length mismatch risk
The converter assumes a fixed cid_len = 59 for base32-encoded CIDv1, but longer multihash algorithms (e.g. sha2-512) will produce CIDs >59 characters and be silently truncated, corrupting shard layout. Please verify whether your data ever generates CIDs longer than 59 characters or update the code to handle variable-length CIDs.

• File: py_hamt/hamt_to_sharded_converter.py (lines 64–74)
• Suggested fix – add a length guard before writing the pointer:

- await dest_store.set_pointer(key, cid_base32_str)
+ if len(cid_base32_str) > cid_len:
+     raise ValueError(f"CID length {len(cid_base32_str)} exceeds expected {cid_len}")
+ await dest_store.set_pointer(key, cid_base32_str)
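The 59-character figure itself is easy to sanity-check: a base32 CIDv1 is a one-character multibase prefix plus the base32 encoding of a 4-byte header (version, codec varint, hash-code varint, digest-length varint, each a single byte for the common codecs) followed by the digest. A standalone stdlib sketch, not code from this PR:

```python
import base64

def cidv1_base32_len(digest_size: int) -> int:
    # CIDv1 binary layout: <0x01 version><codec varint><hash-code varint><digest-len varint><digest>
    # For dag-cbor + sha2-256/sha2-512, each varint fits in one byte -> 4 header bytes.
    raw = b"\x00" * (4 + digest_size)
    b32 = base64.b32encode(raw).decode().rstrip("=")  # multibase base32 is unpadded
    return 1 + len(b32)  # +1 for the leading multibase prefix "b"

print(cidv1_base32_len(32))  # sha2-256 digest -> 59, the hard-coded cid_len
print(cidv1_base32_len(64))  # sha2-512 digest -> 110, which would overflow a 59-char slot
```

So the fixed-width layout is safe only while every CID uses a 32-byte digest, which is exactly why the length guard above is worth adding.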

Comment on lines 36 to 45
# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")

# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple

Contributor

🛠️ Refactor suggestion

Chunk-shape derivation fails for unchunked datasets

source_dataset.chunks is None for unchunked arrays, so the comprehension will raise TypeError.
Consider a safer fallback:

-    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+    if source_dataset.chunks is None:  # unchunked source
+        chunk_shape_tuple = array_shape_tuple
+    else:
+        chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    # 2. Introspect the source array to get its configuration
    print("Reading metadata from source store...")

    # Read the stores metadata to get array shape and chunk shape
    ordered_dims = list(source_dataset.dims)
    array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
    if source_dataset.chunks is None:  # unchunked source
        chunk_shape_tuple = array_shape_tuple
    else:
        chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
    array_shape = array_shape_tuple
    chunk_shape = chunk_shape_tuple
🤖 Prompt for AI Agents
In py_hamt/hamt_to_sharded_converter.py around lines 36 to 45, the code assumes
source_dataset.chunks is always defined, but it can be None for unchunked
datasets, causing a TypeError. Modify the chunk_shape_tuple assignment to first
check if source_dataset.chunks is not None; if it is None, provide a fallback
such as using the full dimension size as the chunk size or a default chunk size.
This ensures safe handling of unchunked datasets without errors.
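The fallback can also be factored into a small pure function, which makes the unchunked case easy to unit-test. The helper name derive_shapes and its argument shapes (mirroring xarray's per-dimension sizes and its chunks mapping of dim -> tuple of chunk lengths) are illustrative, not names from the PR:

```python
def derive_shapes(ordered_dims, dim_sizes, chunks):
    """Return (array_shape, chunk_shape); chunks may be None for unchunked data."""
    array_shape = tuple(dim_sizes[d] for d in ordered_dims)
    if chunks is None:  # unchunked source: treat the whole array as one chunk
        chunk_shape = array_shape
    else:
        chunk_shape = tuple(chunks[d][0] for d in ordered_dims)
    return array_shape, chunk_shape

# chunked dataset
print(derive_shapes(["time", "lat"], {"time": 100, "lat": 50},
                    {"time": (10, 10), "lat": (25, 25)}))
# unchunked dataset -> chunk shape falls back to the full array shape
print(derive_shapes(["time", "lat"], {"time": 100, "lat": 50}, None))
```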

@codecov-commenter
Copy link

codecov-commenter commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (de33efe) to head (78cf72d).

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #68    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            5         7     +2     
  Lines          707      1310   +603     
==========================================
+ Hits           707      1310   +603     

☔ View full report in Codecov by Sentry.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)

496-515: Align pin_ls to return a list of CID strings

After reviewing both the stub (line 57) and the full implementation (lines 496–515) in py_hamt/store_httpx.py, and inspecting all call sites (notably in tests/test_kubo_pin.py which uses assert str(cid) in listed_pins), it’s clear that:

  • The current signature -> list[Dict[str, Any]] is incorrect.
  • The code actually returns a dict of { cid: { "Type": … }, … }, which happens to support cid in pins but fails type expectations.
  • Tests only check membership, so returning a flat list[str] of CIDs both satisfies callers and keeps the API surface clean.

Please apply the following mandatory refactor:

• At line 57 (stub):

-    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+    async def pin_ls(self, target_rpc: str) -> list[str]:

• At lines 496–515 (implementation):

-    async def pin_ls(
-        self, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> list[Dict[str, Any]]:
+    async def pin_ls(
+        self, target_rpc: str = "http://127.0.0.1:5001"
+    ) -> list[str]:
@@
-        Returns:
-            List[CID]: A list of pinned CIDs.
+        Returns:
+            list[str]: A list of pinned CID strings.
@@
-        pins = response.json().get("Keys", [])
-        return pins
+        keys_map = response.json().get("Keys", {}) or {}
+        return list(keys_map.keys())

• Adjust both docstrings to reflect the new return type (list[str]).

This change preserves existing test behavior, aligns type annotations, and avoids future confusion around the raw RPC payload shape.
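For reference, Kubo's pin/ls RPC returns the pins as a "Keys" object keyed by CID string, so the flat list falls out of the map's keys. A minimal sketch of the extraction (the CIDs below are placeholders, not real pins):

```python
# Example shape of a Kubo /api/v0/pin/ls JSON response (placeholder CIDs)
response_json = {
    "Keys": {
        "bafyexamplecid1": {"Type": "recursive"},
        "bafyexamplecid2": {"Type": "direct"},
    }
}

keys_map = response_json.get("Keys", {}) or {}  # tolerate an absent/null Keys field
pins = list(keys_map.keys())
print(sorted(pins))
```

Membership checks like `str(cid) in pins` keep working, while the annotation `list[str]` now matches what is actually returned.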

♻️ Duplicate comments (5)
py_hamt/zarr_hamt_store.py (1)

175-177: Replace print with structured logging

Use the logging module instead of print for production error reporting.

-        except Exception as e:
-            print(f"Error getting key '{key}' with range {byte_range}: {e}")
-            raise
+        except Exception as e:
+            logger.error("Error getting key %r with range %r: %s", key, byte_range, e)
+            raise

Add once at the top of the module:

+import logging
+logger = logging.getLogger(__name__)
py_hamt/store_httpx.py (4)

77-115: Support length-only reads and simplify control flow in InMemoryCAS.load

Length without offset should return the first N bytes. This was flagged earlier and remains unhandled.

     key = cast(bytes, id)
@@
-        data: bytes
+        data: bytes
         try:
             data = self.store[key]
         except KeyError as exc:
             raise KeyError("Object not found in in-memory store") from exc

-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if length == 0:
+            return b""
+        if offset is not None:
+            start = offset
+            if length is not None:
+                end = start + length
+                return data[start:end]
+            return data[start:]
+        if length is not None:
+            # length-only → first N bytes
+            return data[:length]
+        if suffix is not None:
+            # trailing N bytes
+            return data[-suffix:]
+        return data  # full blob
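The precedence rules in the revised load() reduce to plain byte slicing; a self-contained sketch of just that logic, separate from the store class:

```python
def partial(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Mirror the revised partial-read rules: offset wins, then length-only, then suffix."""
    if length == 0:
        return b""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]
        return data[offset:]
    if length is not None:   # length-only -> first N bytes
        return data[:length]
    if suffix is not None:   # trailing N bytes
        return data[-suffix:]
    return data              # full blob

data = b"0123456789"
print(partial(data, offset=2, length=3))  # b"234"
print(partial(data, length=4))            # b"0123"
print(partial(data, suffix=2))            # b"89"
print(partial(data, offset=7))            # b"789"
```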

435-451: Remove nonexistent name parameter from pin_cid docstring

Docstring mentions name, which is not in the signature and not sent.

     """
     Pins a CID to the local Kubo node via the RPC API.

     This call is recursive by default, pinning all linked objects.

     Args:
         cid (CID): The Content ID to pin.
-        name (Optional[str]): An optional name for the pin.
+        target_rpc (str): The RPC URL of the Kubo node.
     """

Consider adding a recursive: bool = True keyword if you need to toggle recursion in the future.


43-60: Make pin hook methods fail-fast instead of silently passing

These are effectively abstract hooks. Leaving them as pass hides misuse at runtime.

-    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Pin a CID in the storage."""
+        raise NotImplementedError  # pragma: no cover
@@
-    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Unpin a CID in the storage."""
+        raise NotImplementedError  # pragma: no cover
@@
-    async def pin_update(
+    async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError  # pragma: no cover
@@
-    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+    async def pin_ls(self, target_rpc: str) -> Dict[str, Any]:
+        """List pinned CIDs in the storage (shape matches Kubo's pin/ls Keys map)."""
+        raise NotImplementedError  # pragma: no cover

Note: This also updates the return type of pin_ls to match Kubo (see later comment).


403-434: Handle length-only HTTP Range requests; guard zero-length/suffix=0

load() still ignores length when offset is None. HTTP Range semantics expect "bytes=0-{length-1}". Also, length==0 and suffix==0 should short-circuit to avoid sending invalid ranges.

     async def load(
         self,
         id: IPLDKind,
         offset: Optional[int] = None,
         length: Optional[int] = None,
         suffix: Optional[int] = None,
     ) -> bytes:
         """@private"""
         cid = cast(CID, id)
         url: str = f"{self.gateway_base_url + str(cid)}"
         headers: Dict[str, str] = {}

+        # Zero-length/suffix guard: avoid invalid HTTP Range values
+        if length == 0 or suffix == 0:
+            return b""
+
         # Construct the Range header if required
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
+        elif length is not None:
+            # length only → first N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"

         async with self._sem:  # throttle gateway
             client = self._loop_client()
             response = await client.get(url, headers=headers or None)
             response.raise_for_status()
             return response.content
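The full offset/length/suffix to Range mapping, including the new length-only branch, can be isolated as a pure helper (range_header is an illustrative name, not code from the PR); note the -1 because HTTP Range ends are inclusive:

```python
def range_header(offset=None, length=None, suffix=None):
    """Build an HTTP Range header value, or None when no Range header should be sent."""
    if length == 0 or suffix == 0:
        return None  # caller should short-circuit to b"" instead of sending a range
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"                           # from offset to EOF
    if suffix is not None:
        return f"bytes=-{suffix}"                           # last N bytes
    if length is not None:
        return f"bytes=0-{length - 1}"                      # length-only -> first N bytes
    return None                                             # full read

print(range_header(offset=10, length=5))  # bytes=10-14
print(range_header(suffix=7))             # bytes=-7
print(range_header(length=4))             # bytes=0-3
```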
🧹 Nitpick comments (4)
py_hamt/zarr_hamt_store.py (1)

184-191: Tidy gather usage and consider throttling very large fan-out

The comment says "Set return_exceptions=True" but the call sets it to False (default). Remove the stale note. If key_ranges can be large, consider bounding concurrency with a semaphore to avoid scheduling thousands of tasks at once.

-        results = await asyncio.gather(
-            *tasks, return_exceptions=False
-        )  # Set return_exceptions=True for debugging
+        results = await asyncio.gather(*tasks)
         return results

Optional: introduce a local semaphore to cap concurrent get()s.

py_hamt/store_httpx.py (3)

34-41: Clarify partial-read semantics in the base interface

Document that:

  • offset is 0-based; length is a byte count; end is exclusive when derived from offset+length.
  • suffix returns the last N bytes.
  • negative values are invalid; length==0 should return empty bytes.
-    ) -> bytes:
-        """Retrieve data."""
+    ) -> bytes:
+        """Retrieve data.
+
+        Semantics:
+        - offset: 0‑based start position.
+        - length: number of bytes to read; when combined with offset, the "end" is exclusive (HTTP Range end is inclusive and derived as start+length-1).
+        - suffix: last N bytes of the object.
+        - Negative values are invalid; length == 0 returns b"".
+        """

190-193: pinOnAdd: URL building looks good; consider URL-encoding chunker/hash

If chunker ever contains characters requiring encoding (future formats), assemble the RPC URL using params= on client.post(...) rather than manual string concatenation.

Example:

response = await client.post(
    f"{rpc_base_url}/api/v0/add",
    params={"hash": self.hasher, "chunker": self.chunker, "pin": str(pinOnAdd).lower()},
    files=files,
)

Also applies to: 258-260


343-383: Destructor performs async work; keep best‑effort but avoid surprising side effects

__del__ attempts to schedule/drive async cleanup. This is best‑effort but can surprise callers, especially during interpreter shutdown. Consider limiting __del__ to clearing references and documenting that aclose() (or async with) is required for deterministic cleanup.

No code change required now; just a heads‑up. If you want code, I can propose a minimal __del__ that avoids asyncio.run() altogether.
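For deterministic cleanup, the usual pattern is an async context manager whose __aexit__ awaits aclose(), leaving nothing for __del__ to do. A minimal sketch with an illustrative GatewayClient class (not the real KuboCAS):

```python
import asyncio

class GatewayClient:
    """Illustrative stand-in for an HTTP client that owns async resources."""
    def __init__(self):
        self.closed = False

    async def aclose(self):
        # release sockets/sessions here
        self.closed = True

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.aclose()  # cleanup runs deterministically, not in __del__

async def main():
    async with GatewayClient() as client:
        assert not client.closed  # still open inside the block
    return client.closed          # closed on exit

print(asyncio.run(main()))  # True
```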

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5b1bad8 and bfb4a41.

📒 Files selected for processing (3)
  • py_hamt/store_httpx.py (8 hunks)
  • py_hamt/zarr_hamt_store.py (3 hunks)
  • tests/test_zarr_ipfs.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/zarr_hamt_store.py (2)
py_hamt/encryption_hamt_store.py (1)
  • get (159-183)
py_hamt/hamt.py (2)
  • get (593-608)
  • HAMT (287-696)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (5)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
  • HAMT (287-696)
py_hamt/zarr_hamt_store.py (1)
  • get (144-177)
py_hamt/sharded_zarr_store.py (1)
  • get (319-374)
tests/test_kubo_cas.py (1)
  • test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Create project environment, run all checks (3.12)

Comment on lines +84 to +106
    def _map_byte_request(
        self, byte_range: Optional[zarr.abc.store.ByteRequest]
    ) -> tuple[Optional[int], Optional[int], Optional[int]]:
        """Helper to map Zarr ByteRequest to offset, length, suffix."""
        offset: Optional[int] = None
        length: Optional[int] = None
        suffix: Optional[int] = None

        if byte_range:
            if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
                offset = byte_range.start
                length = byte_range.end - byte_range.start
                if length is not None and length < 0:
                    raise ValueError("End must be >= start for RangeByteRequest")
            elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
                offset = byte_range.offset
            elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
                suffix = byte_range.suffix
            else:
                raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")

        return offset, length, suffix

Contributor

🛠️ Refactor suggestion

Fix RangeByteRequest with open-ended ranges and validate inputs

_map_byte_request unconditionally computes end - start. If end is None (open-ended ranges are valid in Zarr v3), this raises a TypeError. Also, negative start and zero-length ranges should be handled explicitly to avoid downstream invalid HTTP ranges.

Apply:

 def _map_byte_request(
     self, byte_range: Optional[zarr.abc.store.ByteRequest]
 ) -> tuple[Optional[int], Optional[int], Optional[int]]:
     """Helper to map Zarr ByteRequest to offset, length, suffix."""
     offset: Optional[int] = None
     length: Optional[int] = None
     suffix: Optional[int] = None

     if byte_range:
         if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
-            offset = byte_range.start
-            length = byte_range.end - byte_range.start
-            if length is not None and length < 0:
-                raise ValueError("End must be >= start for RangeByteRequest")
+            if byte_range.start < 0:
+                raise ValueError("start must be >= 0 for RangeByteRequest")
+            offset = byte_range.start
+            if byte_range.end is not None:
+                if byte_range.end < byte_range.start:
+                    raise ValueError(
+                        "end must be >= start for RangeByteRequest"
+                    )
+                length = byte_range.end - byte_range.start  # end is exclusive
+            else:
+                # open-ended: request from offset to end of blob
+                length = None
         elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
             offset = byte_range.offset
         elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
             suffix = byte_range.suffix
         else:
             raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")

     return offset, length, suffix

Optionally, short-circuit zero-length reads at the call-site (see next comment) to avoid sending a bytes=start-(start-1) HTTP range.

📝 Committable suggestion

Suggested change

    def _map_byte_request(
        self, byte_range: Optional[zarr.abc.store.ByteRequest]
    ) -> tuple[Optional[int], Optional[int], Optional[int]]:
        """Helper to map Zarr ByteRequest to offset, length, suffix."""
        offset: Optional[int] = None
        length: Optional[int] = None
        suffix: Optional[int] = None

        if byte_range:
            if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
                if byte_range.start < 0:
                    raise ValueError("start must be >= 0 for RangeByteRequest")
                offset = byte_range.start
                if byte_range.end is not None:
                    if byte_range.end < byte_range.start:
                        raise ValueError(
                            "end must be >= start for RangeByteRequest"
                        )
                    length = byte_range.end - byte_range.start  # end is exclusive
                else:
                    # open-ended: request from offset to end of blob
                    length = None
            elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
                offset = byte_range.offset
            elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
                suffix = byte_range.suffix
            else:
                raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")

        return offset, length, suffix

Comment on lines +158 to 169
             if is_metadata and byte_range is None and key in self.metadata_read_cache:
                 val = self.metadata_read_cache[key]
             else:
                 offset, length, suffix = self._map_byte_request(byte_range)
                 val = cast(
-                    bytes, await self.hamt.get(key)
+                    bytes,
+                    await self.hamt.get(
+                        key, offset=offset, length=length, suffix=suffix
+                    ),
                 )  # We know values received will always be bytes since we only store bytes in the HAMT
-                if is_metadata:
+                if is_metadata and byte_range is None:
                     self.metadata_read_cache[key] = val
Contributor

🛠️ Refactor suggestion

Short-circuit zero-length ranges to avoid invalid HTTP Range headers

If a caller supplies RangeByteRequest(start, end) where start == end, length == 0. Forwarding this to the CAS will build end = start + length - 1, i.e. start-1 (invalid). Return an empty buffer locally for this edge case.

-            else:
-                offset, length, suffix = self._map_byte_request(byte_range)
+            else:
+                offset, length, suffix = self._map_byte_request(byte_range)
+                # Zero-length range → empty result, avoid invalid HTTP Range
+                if length == 0:
+                    return prototype.buffer.from_bytes(b"")
                 val = cast(
                     bytes,
                     await self.hamt.get(
                         key, offset=offset, length=length, suffix=suffix
                     ),
                 )  # We know values received will always be bytes since we only store bytes in the HAMT
                 if is_metadata and byte_range is None:
                     self.metadata_read_cache[key] = val
📝 Committable suggestion

Suggested change

            if is_metadata and byte_range is None and key in self.metadata_read_cache:
                val = self.metadata_read_cache[key]
            else:
                offset, length, suffix = self._map_byte_request(byte_range)
                # Zero-length range → empty result, avoid invalid HTTP Range
                if length == 0:
                    return prototype.buffer.from_bytes(b"")
                val = cast(
                    bytes,
                    await self.hamt.get(
                        key, offset=offset, length=length, suffix=suffix
                    ),
                )  # We know values received will always be bytes since we only store bytes in the HAMT
                if is_metadata and byte_range is None:
                    self.metadata_read_cache[key] = val
🤖 Prompt for AI Agents
In py_hamt/zarr_hamt_store.py around lines 158 to 169, when a RangeByteRequest
has start==end the computed length will be 0 and building an HTTP Range header
for the CAS becomes invalid; after calling self._map_byte_request(byte_range)
check if length == 0 and if so set val = b"" (return an empty bytes object)
instead of calling await self.hamt.get(...), and still populate
self.metadata_read_cache[key] when is_metadata and byte_range is None or when
appropriate for cached reads; this short-circuits the zero-length case and
avoids creating invalid Range headers.
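As a self-contained illustration of the zero-length short-circuit described above (the `RangeByteRequest` stand-in and helper names here are hypothetical, not the library's actual types):

```python
from typing import NamedTuple, Optional, Tuple

class RangeByteRequest(NamedTuple):
    # Hypothetical stand-in for zarr's range request: [start, end) half-open.
    start: int
    end: int

def map_byte_request(
    br: Optional[RangeByteRequest],
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Map a byte request to (offset, length, suffix) for a CAS read."""
    if br is None:
        return None, None, None  # full read
    return br.start, br.end - br.start, None

def fetch(blob: bytes, br: Optional[RangeByteRequest]) -> bytes:
    offset, length, suffix = map_byte_request(br)
    # start == end → length 0: return empty bytes instead of building
    # an invalid HTTP Range header like "bytes=2-1"
    if length == 0:
        return b""
    if offset is None:
        return blob
    return blob[offset : offset + length]

blob = b"0123456789"
print(fetch(blob, RangeByteRequest(2, 2)))  # b''
print(fetch(blob, RangeByteRequest(2, 5)))  # b'234'
```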

TheGreatAlgo and others added 2 commits August 21, 2025 08:30
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
*,
headers: dict[str, str] | None = None,
auth: Tuple[str, str] | None = None,
pinOnAdd: bool = False,
Collaborator
Wrong casing:

Suggested change
-    pinOnAdd: bool = False,
+    pin_on_add: bool = False,

Comment on lines 258 to 259
pinString: str = "true" if pinOnAdd else "false"
self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
Collaborator
Also here, casing to be fixed:

Suggested change
-        pinString: str = "true" if pinOnAdd else "false"
-        self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
+        pin_string: str = "true" if pin_on_add else "false"
+        self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pin_string}"

Collaborator
Just a couple of casing issues; for the rest, it LGTM

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
py_hamt/store_httpx.py (3)

264-271: Bug: calling get_running_loop() in `__init__` when a user supplies a client

Constructing KuboCAS from sync code with a prebuilt AsyncClient raises RuntimeError. Don't touch the loop in `__init__`; stash and return the external client in _loop_client().

-        if client is not None:
-            # A client was supplied by the user. We don't own it.
-            self._owns_client = False
-            self._client_per_loop = {asyncio.get_running_loop(): client}
+        if client is not None:
+            # A client was supplied by the user. We don't own it.
+            self._owns_client = False
+            # Do not access the running loop in __init__
+            self._external_client = client
+            self._client_per_loop = {}
         else:
             # No client supplied. We will own any clients we create.
             self._owns_client = True
             self._client_per_loop = {}

281-312: Return user-supplied client regardless of loop; lazily create owned clients per loop

This complements the previous fix. If the user provided a client, just return it. For owned clients, keep the per-loop behavior.

     def _loop_client(self) -> httpx.AsyncClient:
         """Get or create a client for the current event loop.
@@
         if self._closed:
             if not self._owns_client:
                 raise RuntimeError("KuboCAS is closed; create a new instance")
             # We previously closed all internally-owned clients. Reset the
             # state so that new clients can be created lazily.
             self._closed = False
             self._client_per_loop = {}
 
+        # External client → return as-is, independent of loop.
+        if not self._owns_client:
+            if self._external_client is None:
+                raise RuntimeError("External client missing (internal error)")
+            return self._external_client
+
         loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
         try:
             return self._client_per_loop[loop]
         except KeyError:
             # Create a new client
             client = httpx.AsyncClient(
                 timeout=60.0,
                 headers=self._default_headers,
                 auth=self._default_auth,
                 limits=httpx.Limits(max_connections=64, max_keepalive_connections=32),
                 # Uncomment when they finally support Robust HTTP/2 GOAWAY responses
                 # http2=True,
             )
             self._client_per_loop[loop] = client
             return client

496-515: pin_ls return type/docstring mismatch and incorrect shape

The RPC returns {"Keys": { "": {...}, ... }}, not a list. The annotation says list[Dict[str, Any]], docstring says List[CID]. Return List[CID] (decoded) for a stable API, or change both to Dict[str, Any]. Below converts to List[CID].

-    async def pin_ls(
-        self, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> list[Dict[str, Any]]:
+    async def pin_ls(
+        self, target_rpc: str = "http://127.0.0.1:5001"
+    ) -> list[CID]:
@@
-        Returns:
-            List[CID]: A list of pinned CIDs.
+        Returns:
+            list[CID]: The pinned CIDs.
@@
-        pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
-        async with self._sem:  # throttle RPC
-            client = self._loop_client()
-            response = await client.post(pin_ls_url_base)
-            response.raise_for_status()
-            pins = response.json().get("Keys", [])
-            return pins
+        pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
+        async with self._sem:  # throttle RPC
+            client = self._loop_client()
+            response = await client.post(pin_ls_url_base)
+            response.raise_for_status()
+            payload = response.json()
+            keys_map = payload.get("Keys", {}) or {}
+            return [CID.decode(k) for k in keys_map.keys()]
♻️ Duplicate comments (4)
py_hamt/store_httpx.py (4)

439-475: Pin API docstrings align with behavior (recursive=true); nice cleanup

Docs now match the code and removed the non-existent name parameter. Looks good.


43-60: Base-class pin hooks silently no-op; surface misuse with NotImplementedError or mark abstract

Leaving these as pass/[] hides bugs at runtime when a caller expects pinning to work but the concrete CAS doesn't implement it. Raise NotImplementedError to fail fast (or mark @abstractmethod if all CASes must implement).

-    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid not implemented in this CAS")

-    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid not implemented in this CAS")

-    async def pin_update(
-        self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
-    ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+    async def pin_update(
+        self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
+    ) -> None:
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update not implemented in this CAS")

-    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls not implemented in this CAS")
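A runnable sketch of the fail-fast pattern proposed above, using hypothetical class names rather than the actual py_hamt base class:

```python
import asyncio

class ContentAddressedStoreSketch:
    """Hypothetical base class: pin hooks fail fast instead of silently no-op."""

    async def pin_cid(self, cid: str, target_rpc: str) -> None:
        raise NotImplementedError("pin_cid not implemented in this CAS")

class InMemoryCASSketch(ContentAddressedStoreSketch):
    pass  # deliberately does not override pin_cid

async def main() -> str:
    try:
        await InMemoryCASSketch().pin_cid("bafy-example", "http://127.0.0.1:5001")
    except NotImplementedError as exc:
        return str(exc)
    return "no error"

print(asyncio.run(main()))  # pin_cid not implemented in this CAS
```

With the silent default, the same call would simply return None and the caller would assume the pin succeeded; the exception surfaces the misuse immediately.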

104-114: Handle length-only loads and remove redundant else-after-return

When only length is provided (offset is None), callers expect the first N bytes. Also, the comment on Line 111 references length under the suffix branch. Tidy the control flow.

-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if offset is not None:
+            start = offset
+            if length is not None:
+                end = start + length
+                return data[start:end]
+            return data[start:]
+        elif length is not None:
+            # length only → first N bytes
+            return data[:length]
+        elif suffix is not None:
+            # trailing N bytes
+            return data[-suffix:]
+        return data  # full blob
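The slicing semantics in that diff can be exercised in isolation; this is a sketch of the proposed behavior, not the library's load() itself:

```python
def partial_read(data, offset=None, length=None, suffix=None):
    # Mirrors the branch order proposed in the review diff.
    if offset is not None:
        if length is not None:
            return data[offset : offset + length]
        return data[offset:]
    if length is not None:
        return data[:length]   # length only → first N bytes
    if suffix is not None:
        return data[-suffix:]  # trailing N bytes
    return data                # full blob

blob = b"abcdefgh"
print(partial_read(blob, offset=2, length=3))  # b'cde'
print(partial_read(blob, length=3))            # b'abc'
print(partial_read(blob, suffix=2))            # b'gh'
```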

415-428: Support length-only HTTP range requests (first N bytes)

When length is provided without offset, set Range to "bytes=0-{length-1}". Today that path falls through and fetches the whole blob.

         # Construct the Range header if required
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
+        elif length is not None:
+            # length only → first N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
         elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
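The Range-header construction, including the length-only branch added above, condenses into a small helper (a sketch; the header strings follow standard HTTP byte-range syntax):

```python
def range_header(offset=None, length=None, suffix=None):
    """Return a Range header value, or None for a full read."""
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"                           # from offset to EOF
    if length is not None:
        return f"bytes=0-{length - 1}"                      # first N bytes
    if suffix is not None:
        return f"bytes=-{suffix}"                           # last N bytes
    return None

print(range_header(offset=10, length=5))  # bytes=10-14
print(range_header(length=4))             # bytes=0-3
print(range_header(suffix=7))             # bytes=-7
```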
🧹 Nitpick comments (4)
py_hamt/store_httpx.py (4)

190-192: Nit: prefer snake_case parameter names (pinOnAdd → pin_on_add)

PEP 8 consistency. This public API change is breaking, so consider an alias/deprecation path if you decide to adjust.


343-383: Avoid complex async orchestration in `__del__`

Scheduling tasks or calling asyncio.run() in a destructor is fragile and can produce noisy warnings. Since you already expose async context-manager and aclose(), consider simplifying `__del__` to only clear references when we own clients and log a warning for leaked clients.


489-494: Be explicit with repeated query params for pin/update

httpx will encode list values as repeated keys, but being explicit avoids ambiguity and IDE type nags; also mirrors curl usage clearly.

-        params = {"arg": [str(old_id), str(new_id)]}
+        params = [("arg", str(old_id)), ("arg", str(new_id))]
         pin_update_url_base: str = f"{target_rpc}/api/v0/pin/update"
         async with self._sem:  # throttle RPC
             client = self._loop_client()
             response = await client.post(pin_update_url_base, params=params)
             response.raise_for_status()
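The repeated-key encoding can be checked with the standard library alone; httpx accepts the same list-of-tuples shape for params:

```python
from urllib.parse import urlencode

# A list of ("arg", value) tuples encodes to repeated keys, matching
# POST /api/v0/pin/update?arg=<old-cid>&arg=<new-cid>
# (the CID values here are placeholders).
params = [("arg", "bafyOLD"), ("arg", "bafyNEW")]
query = urlencode(params)
print(query)  # arg=bafyOLD&arg=bafyNEW
```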

33-41: Nit: avoid using id as a parameter name

It shadows Python’s built-in id(). Consider object_id or content_id in new APIs; not worth churning existing ones.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7d826fa and 559a7ff.

📒 Files selected for processing (1)
  • py_hamt/store_httpx.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (3)
py_hamt/hamt.py (5)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
  • HAMT (287-696)
py_hamt/sharded_zarr_store.py (1)
  • get (361-416)
py_hamt/zarr_hamt_store.py (1)
  • get (144-177)

Comment on lines +229 to +234
self._owns_client: bool = False
self._closed: bool = True
self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
self._default_headers = headers
self._default_auth = auth

Contributor

🛠️ Refactor suggestion

⚠️ Potential issue

Prepare for external client without touching the loop in `__init__`

Accessing asyncio.get_running_loop() in `__init__` breaks creation from sync contexts. Stash the external client and bind it on first use.

         self._owns_client: bool = False
         self._closed: bool = True
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None
🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 229 to 234, __init__ is calling
asyncio.get_running_loop() (or otherwise touching the loop) when an external
httpx.AsyncClient may be passed, which breaks sync-side construction; instead,
store the provided client object and do not access the event loop in __init__;
set self._owns_client to False when an external client is provided and defer
creating or binding a per-loop AsyncClient until first use (e.g., in a private
_ensure_client() or _get_client() method) where you can safely call
asyncio.get_running_loop(), create or attach the client to that loop, and set
self._owns_client/ self._closed accordingly.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)

496-515: Align pin_ls() return type, implementation, and tests

The current implementation of pin_ls() (py_hamt/store_httpx.py lines 496–515) calls the IPFS RPC and does:

pins = response.json().get("Keys", [])
return pins

– at runtime this returns a dict mapping CIDs to metadata, but the signature is -> list[Dict[str, Any]] and the docstring says it returns a list of CIDs. Meanwhile, the tests in tests/test_kubo_pin.py (at lines 26, 34, 48, 59) do:

assert str(cid) in listed_pins
# …
assert str(cid) not in listed_pins_after_unpin

which relies on listed_pins being an iterable of CID strings (dict membership in Python tests the keys, so it happens to work today).

We need to pick one consistent API shape and apply the corresponding refactors:

• Approach A – Return a list of {cid, ...} dicts
Implementation (replace lines 512–515):
-        pins = response.json().get("Keys", [])
-        return pins
+        keys = response.json().get("Keys") or {}
+        # Convert {"<cid>": {...}, …} to [{"cid": "<cid>", **metadata}, …]
+        return [{"cid": k, **(v or {})} for k, v in keys.items()]
Signature & docstring (py_hamt/store_httpx.py lines 496–498 & doc):
```diff
- async def pin_ls(...) -> list[Dict[str, Any]]:
+ async def pin_ls(...) -> list[Dict[str, Any]]:
"""
Lists all pinned CIDs on the local Kubo node via the RPC API.

    Returns:
-       List[CID]: A list of pinned CIDs.
+       list[dict]: Each entry has the form {"cid": "<cid>", "Type": ..., "Metadata": ...}.
    """
```

Tests (tests/test_kubo_pin.py at lines 26, 34, 48, 59): change membership checks to look inside the returned dicts, e.g.:
-    assert str(cid) in listed_pins
+    assert any(p["cid"] == str(cid) for p in listed_pins), f"CID {cid} was not pinned"
and similarly for the other three assertions (listed_pins_after_unpin, listed_pins_after_update, listed_pins_after_unpin_update).

• Approach B – Keep returning the raw dict
Signature & docstring:
diff - async def pin_ls(...) -> list[Dict[str, Any]]: + async def pin_ls(...) -> Dict[str, Dict[str, Any]]: """ - Returns: - List[CID]: A list of pinned CIDs. + Returns: + Dict[str, dict]: Mapping from each pinned CID string to its metadata. """
Implementation: leave as is (it already returns the dict).
Tests: no change necessary.

Please select the intended API contract and update the code (py_hamt/store_httpx.py), docstring, signature, and—if you choose Approach A—the four assertions in tests/test_kubo_pin.py accordingly.
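Approach A's payload normalization can be sketched and unit-tested without a running Kubo node (function name and sample CIDs are hypothetical):

```python
def normalize_pin_ls(payload: dict) -> list[dict]:
    """Flatten Kubo's {"Keys": {"<cid>": {...}}} shape into a list of dicts."""
    keys = payload.get("Keys") or {}
    return [{"cid": k, **(v or {})} for k, v in keys.items()]

payload = {"Keys": {"bafyaaa": {"Type": "recursive"}, "bafybbb": {"Type": "direct"}}}
pins = normalize_pin_ls(payload)
print(pins)
```

The `or {}` guards against a missing or null "Keys" field, so an empty response normalizes to an empty list rather than raising.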

♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)

43-60: Surface misuse early: make base pin methods fail fast.

These default “no-op” implementations hide bugs at runtime. Prefer raising NotImplementedError to catch accidental calls on CAS backends that don’t implement pinning. This was noted previously.

 async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid is not supported by this CAS")

 async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid is not supported by this CAS")

 async def pin_update(
     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
 ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update is not supported by this CAS")

 async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls is not supported by this CAS")

229-235: Avoid touching the running loop in `__init__` when an external client is provided.

Constructors should be usable from sync contexts; calling asyncio.get_running_loop() here forces an active loop. Stash the external client and bind lazily in _loop_client() instead. This was suggested earlier.

         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None

Outside this range (illustrative change):

# __init__ (replace the current external-client block)
if client is not None:
    self._owns_client = False
    self._external_client = client
else:
    self._owns_client = True

# _loop_client (first lines)
loop = asyncio.get_running_loop()
if loop in self._client_per_loop:
    return self._client_per_loop[loop]
if self._external_client is not None:
    self._client_per_loop[loop] = self._external_client
    return self._external_client
# ... create internal client as you do now
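The per-loop binding logic sketched in that snippet generalizes to any loop-scoped resource; here object() stands in for httpx.AsyncClient (all names hypothetical):

```python
import asyncio

class LoopScopedResource:
    """Cache one resource per running event loop; pass through an external one."""

    def __init__(self, factory, external=None):
        self._factory = factory
        self._external = external  # user-supplied, loop-independent
        self._per_loop = {}

    def get(self):
        if self._external is not None:
            return self._external              # external resource → as-is
        loop = asyncio.get_running_loop()      # safe: only called from async code
        if loop not in self._per_loop:
            self._per_loop[loop] = self._factory()
        return self._per_loop[loop]

async def demo() -> bool:
    scoped = LoopScopedResource(factory=object)
    return scoped.get() is scoped.get()  # same loop → same cached instance

print(asyncio.run(demo()))  # True
```

Because `__init__` never touches the loop, the object can be constructed from synchronous code and only binds (or creates) a client once a coroutine actually uses it.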

435-459: Expose a recursive flag for pin/unpin for flexibility.

Docstrings say “recursive by default” and you correctly send "true". Exposing a recursive: bool = True kwarg adds control without breaking defaults. Same for unpin_cid.

 async def pin_cid(
     self,
     cid: CID,
-    target_rpc: str = "http://127.0.0.1:5001",
+    target_rpc: str = "http://127.0.0.1:5001",
+    *,
+    recursive: bool = True,
 ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}

And for unpin_cid(...):

-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}

77-84: Support length-only reads and tighten slice validation in InMemoryCAS.load().

When only length is supplied, callers expect the first N bytes. Also remove “else after return” and validate non-negative inputs. This was raised earlier and still applies.

@@
-        if offset is not None:
+        if length is not None and offset is None:
+            # length only → first N bytes
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            return data[:length]
+        if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
-                end = start + length
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length
                 return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
+            return data[start:]
+        if suffix is not None:
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
             return data[-suffix:]
-        else:  # Full load
-            return data
+        return data  # full blob

Also applies to: 104-115


403-409: HTTP Range: add length-only requests, validate inputs.

load() currently ignores the case where only length is provided; gateway should receive Range: bytes=0-{length-1}. Also reject negative values. This was previously flagged.

@@
-        # Construct the Range header if required
-        if offset is not None:
+        # Construct the Range header if required
+        if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
-                end = start + length - 1
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
-            headers["Range"] = f"bytes=-{suffix}"
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
+            headers["Range"] = f"bytes=-{suffix}"
@@
-            response = await client.get(url, headers=headers or None)
+            response = await client.get(url, headers=headers or None)

Also applies to: 411-428, 431-431

🧹 Nitpick comments (1)
py_hamt/store_httpx.py (1)

34-41: Define and document range-parameter contract (mutual exclusivity and non-negatives).

The extended load(...) signature is good. Please explicitly document and enforce:

  • offset >= 0, length >= 0, suffix > 0.
  • suffix must not be used together with offset/length.
    This avoids ambiguous requests and undefined slices across implementations.

Would you like me to add a brief docstring here plus input validation in both implementations?
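One way to enforce that contract up front, as a sketch of the suggested validation (not the library's current behavior):

```python
def validate_range_args(offset=None, length=None, suffix=None):
    """Reject ambiguous or negative partial-read parameters."""
    if suffix is not None and (offset is not None or length is not None):
        raise ValueError("suffix is mutually exclusive with offset/length")
    if offset is not None and offset < 0:
        raise ValueError("offset must be >= 0")
    if length is not None and length < 0:
        raise ValueError("length must be >= 0")
    if suffix is not None and suffix <= 0:
        raise ValueError("suffix must be > 0")

validate_range_args(offset=0, length=10)  # OK: no exception raised
```

Calling this at the top of both load() implementations keeps the two backends' error behavior consistent.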

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 559a7ff and e537389.

📒 Files selected for processing (1)
  • py_hamt/store_httpx.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (4)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
py_hamt/zarr_hamt_store.py (1)
  • get (144-177)
py_hamt/sharded_zarr_store.py (1)
  • get (361-416)
tests/test_kubo_cas.py (1)
  • test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/store_httpx.py (1)

190-190: Nice: explicit pin-on-add toggle in the add RPC URL.

Good feature gate and correct boolean casing in query string.

Also applies to: 258-260

Collaborator

@0xSwego 0xSwego left a comment

LGTM

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)

94-99: Fix length-only handling and add input validation in InMemoryCAS.load

length without offset currently ignored; also redundant else after return. Add basic validation and implement length-only semantics.

         key = cast(bytes, id)
         if not isinstance(key, (bytes, bytearray)):  # defensive guard
             raise TypeError(
                 f"InMemoryCAS only supports byte‐hash keys; got {type(id).__name__}"
             )
-        data: bytes
+        # Validate args
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        data: bytes
@@
-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if offset is not None:
+            start = offset
+            if length is not None:
+                end = start + length
+                return data[start:end]
+            return data[start:]
+        elif length is not None:
+            # length only → first N bytes
+            return data[:length]
+        elif suffix is not None:
+            # trailing N bytes
+            return data[-suffix:]
+        return data

Also applies to: 105-116


233-238: Don’t touch the event loop in `__init__`; defer binding external clients

Accessing asyncio.get_running_loop() during construction breaks sync contexts. Stash user client and return it from _loop_client without loop lookups; only create per-loop clients when you own them.

-        self._owns_client: bool = False
-        self._closed: bool = True
-        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
+        self._owns_client: bool = False
+        self._closed: bool = True
+        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
+        self._external_client: httpx.AsyncClient | None = None
         self._default_headers = headers
         self._default_auth = auth
@@
-        if client is not None:
-            # A client was supplied by the user. We don't own it.
-            self._owns_client = False
-            self._client_per_loop = {asyncio.get_running_loop(): client}
-        else:
-            # No client supplied. We will own any clients we create.
-            self._owns_client = True
-            self._client_per_loop = {}
+        if client is not None:
+            self._owns_client = False
+            self._external_client = client
+            self._client_per_loop = {}
+        else:
+            self._owns_client = True
+            self._client_per_loop = {}
@@
     def _loop_client(self) -> httpx.AsyncClient:
         """Get or create a client for the current event loop.
@@
-        if self._closed:
+        if self._closed:
             if not self._owns_client:
                 raise RuntimeError("KuboCAS is closed; create a new instance")
             # We previously closed all internally-owned clients. Reset the
             # state so that new clients can be created lazily.
             self._closed = False
             self._client_per_loop = {}
 
+        # If user supplied a client, just reuse it.
+        if self._external_client is not None:
+            return self._external_client
+
         loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
         try:
             return self._client_per_loop[loop]
         except KeyError:
             # Create a new client
             client = httpx.AsyncClient(

Optional: use WeakKeyDictionary for _client_per_loop to avoid retaining closed loops.

Also applies to: 268-276, 299-330


510-526: Expose recursive flag in pin/unpin and align params

Docstrings say recursive by default; make it configurable and wire through.

-    async def pin_cid(
-        self,
-        cid: CID,
-        target_rpc: str = "http://127.0.0.1:5001",
-    ) -> None:
+    async def pin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
@@
-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}

Also applies to: 532-547


44-61: Make pin_* hooks fail-fast instead of silently no-op

Silent pass/[] hides misuse. Raise NotImplementedError so unsupported backends surface errors early.

-    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+    async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid not implemented for this CAS")

-    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+    async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid not implemented for this CAS")

-    async def pin_update(
+    async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update not implemented for this CAS")

-    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+    async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls not implemented for this CAS")

456-471: Support length-only HTTP Range and validate inputs in KuboCAS.load

Handle length without offset and reject invalid combos/negatives. Return b"" for zero-length requests.

-        # Construct the Range header if required
+        # Validate and construct the Range header if required
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        if length == 0:
+            return b""
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            # First N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
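The proposed (offset, length, suffix) to Range-header mapping can be sketched as a pure function; the function name is mine, and the zero-length short-circuit mirrors the "return b\"\" for zero-length requests" suggestion above.

```python
def range_header(offset=None, length=None, suffix=None):
    """Sketch of the proposed byte-range -> HTTP Range header mapping."""
    if length is not None and suffix is not None:
        raise ValueError("length and suffix are mutually exclusive")
    if length == 0:
        return None  # caller should short-circuit to b"" instead of issuing a request
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"  # from offset to end of object
    if length is not None:
        return f"bytes=0-{length - 1}"  # length-only: first N bytes
    if suffix is not None:
        return f"bytes=-{suffix}"  # last N bytes
    return None  # no range: fetch the whole object
```

For example, `range_header(offset=10, length=5)` yields `"bytes=10-14"`, matching the inclusive-end semantics of HTTP ranges.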
🧹 Nitpick comments (5)
py_hamt/store_httpx.py (5)

35-41: Clarify and validate range-argument contract in the base interface

Please document precedence and mutual exclusivity of offset/length/suffix (e.g., offset+suffix invalid; length-only means first N bytes) so all backends implement the same behavior. Consider adding lightweight validation in the abstract method’s docstring and enforcing it in concrete implementations.


278-279: Remove duplicate assignment of _default_headers/_default_auth

These are already set at Lines 236–237.

-        self._default_headers = headers
-        self._default_auth = auth

361-401: Be cautious running async cleanup from `__del__`

Scheduling tasks or calling asyncio.run in destructors can fire at interpreter shutdown and be brittle. Consider limiting `__del__` to clearing references and rely on aclose()/context manager for real cleanup.


548-566: Type consistency for pin_update

Align with other pin methods by accepting CID for old_id/new_id (runtime behavior unchanged).

-    async def pin_update(
-        self,
-        old_id: IPLDKind,
-        new_id: IPLDKind,
+    async def pin_update(
+        self,
+        old_id: CID,
+        new_id: CID,
         target_rpc: str = "http://127.0.0.1:5001",
     ) -> None:

568-586: pin_ls return type/docstring mismatch

Kubo returns {"Keys": {...}}; you return that mapping, not a list. Fix typing, docstring, and default.

-    async def pin_ls(
-        self, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> list[Dict[str, Any]]:
+    async def pin_ls(
+        self, target_rpc: str = "http://127.0.0.1:5001"
+    ) -> dict[str, Any]:
@@
-        Returns:
-            List[CID]: A list of pinned CIDs.
+        Returns:
+            dict[str, Any]: The raw "Keys" mapping from Kubo pin/ls.
@@
-            pins = response.json().get("Keys", [])
+            pins = response.json().get("Keys", {})
             return pins
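For context, Kubo's pin/ls endpoint returns a `"Keys"` mapping rather than a list, which is why the typing fix above matters. A small sketch of consuming that shape (the CIDs below are placeholders, not real pins):

```python
# Illustrative /api/v0/pin/ls response body (placeholder CIDs):
response_json = {
    "Keys": {
        "bafyExampleCid1": {"Type": "recursive"},
        "bafyExampleCid2": {"Type": "direct"},
    }
}

# pin_ls would hand back this mapping, so callers filter by pin type:
pins = response_json.get("Keys", {})
recursive_pins = [cid for cid, info in pins.items() if info["Type"] == "recursive"]
```

Returning the mapping preserves the pin type per CID, which a flat list would lose.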
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e537389 and efb3c64.

📒 Files selected for processing (1)
  • py_hamt/store_httpx.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (5)
py_hamt/hamt.py (4)
  • load (138-139)
  • load (159-168)
  • load (276-285)
  • get (593-608)
py_hamt/sharded_zarr_store.py (1)
  • get (361-416)
py_hamt/zarr_hamt_store.py (1)
  • get (144-177)
tests/test_public_gateway.py (1)
  • test_fix_kubocas_load (270-316)
tests/test_kubo_cas.py (2)
  • test_kubo_urls_explicit (52-71)
  • test_kubo_cas (130-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/store_httpx.py (2)

262-264: LGTM: pin_on_add wiring

Correctly threads pin flag into the add URL.


405-445: LGTM: save() retries and jitter

Solid retry/backoff with jitter and clear error propagation.

"InMemoryCAS",
"KuboCAS",
"ZarrHAMTStore",
"InMemoryCAS",
Contributor

@Faolain Faolain Sep 10, 2025

minor nit but duplicate export of InMemoryCAS, should remove

self._resize_complete.set()
self._shard_locks: DefaultDict[int, asyncio.Lock] = defaultdict(asyncio.Lock)

self._shard_data_cache: Dict[int, list[Optional[CID]]] = {}
Contributor

@Faolain Faolain Sep 10, 2025

ChatGPT had a good point here

  1. Memory bounds
  • _shard_data_cache has no capacity limit and may grow without bound as clients hit many shards in long‑lived processes. Consider
    an LRU for shard lists (configurable size) to bound memory.

Instead of a plain dict, could we use an LRU (Python's standard library has one)? Our apps using this are long-lived processes, so without an eviction policy the cache would grow unbounded forever.

I asked Claude Opus for a suggestion (can crosscheck with chatgpt 5)
Basic:
Screenshot 2025-09-10 at 2 49 36 AM

Advanced
Screenshot 2025-09-10 at 2 50 07 AM
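In the spirit of those suggestions, a minimal memory-bounded LRU for shard lists could look like the sketch below. The class name and API are mine for illustration; the PR's eventual `MemoryBoundedLRUCache` may differ (e.g., it also protects dirty shards and uses async locks).

```python
import sys
from collections import OrderedDict

class BoundedShardCache:
    """Sketch: LRU cache for shard lists, bounded by estimated memory usage."""

    def __init__(self, max_bytes=64 * 1024 * 1024):
        self._max_bytes = max_bytes
        self._data = OrderedDict()  # insertion order doubles as recency order
        self._used = 0

    def get(self, shard_idx):
        value = self._data.get(shard_idx)
        if value is not None:
            self._data.move_to_end(shard_idx)  # mark as most recently used
        return value

    def put(self, shard_idx, shard):
        if shard_idx in self._data:
            self._used -= sys.getsizeof(self._data[shard_idx])
        self._data[shard_idx] = shard
        self._data.move_to_end(shard_idx)
        self._used += sys.getsizeof(shard)
        # Evict least recently used entries until under budget (keep at least one).
        while self._used > self._max_bytes and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self._used -= sys.getsizeof(evicted)
```

Note that `sys.getsizeof` on a list only measures the pointer array, not the referenced CID objects, so the bound is an estimate rather than an exact memory ceiling.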

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
py_hamt/hamt_to_sharded_converter.py (1)

44-44: Chunk shape derivation may fail for unchunked datasets.

The function logic is correct. However, source_dataset.chunks can be None for unchunked arrays, causing a TypeError when accessing source_dataset.chunks[dim][0].

Add a safety check and fallback:

-    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+    if source_dataset.chunks is None:  # unchunked source
+        chunk_shape_tuple = array_shape_tuple  # Use full array dims as chunk shape
+    else:
+        chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store.py (4)

1-1: Remove unused import.

The asyncio import is not used in the test file.

-import asyncio
-
 import dag_cbor

183-183: Remove print statement from test code.

Test code should not contain print statements.

-        print(f"Data written to ShardedZarrStore with root CID: {final_cid}")
         assert final_cid is not None

200-200: Remove print statement from test code.

-        print("\n✅ Append test successful! Data verified.")

752-752: Remove debugging print statement.

Multiple print statements should be removed from the test code.

-        print(f"Testing with chunk key: {chunk_key}")

Also applies to lines 759, 792, 799, 806, 812, 822.

py_hamt/sharded_zarr_store.py (2)

195-195: Add explicit strict parameter to zip().

Python 3.10+ recommends explicit strict parameter for zip() to catch length mismatches.

-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)

Also applies to lines 767 and 806.


809-814: Simplify conditional logic for num_shards calculation.

The conditional check for self._num_shards is not None is unnecessary since it's initialized to 0 in __init__.

-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards
         self._num_shards = (
             (self._total_chunks + self._chunks_per_shard - 1) // self._chunks_per_shard
             if self._total_chunks > 0
             else 0
         )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41143f9 and e55a3e3.

📒 Files selected for processing (7)
  • py_hamt/__init__.py (2 hunks)
  • py_hamt/hamt_to_sharded_converter.py (1 hunks)
  • py_hamt/sharded_zarr_store.py (1 hunks)
  • tests/test_sharded_store_deleting.py (1 hunks)
  • tests/test_sharded_store_grafting.py (1 hunks)
  • tests/test_sharded_zarr_store.py (1 hunks)
  • tests/testing_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/testing_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • py_hamt/__init__.py
  • tests/test_sharded_store_deleting.py
  • tests/test_sharded_store_grafting.py
🧰 Additional context used
🧬 Code graph analysis (3)
py_hamt/sharded_zarr_store.py (1)
py_hamt/store_httpx.py (10)
  • ContentAddressedStore (13-60)
  • load (35-42)
  • load (78-115)
  • load (446-505)
  • save (28-32)
  • save (73-76)
  • save (405-444)
  • InMemoryCAS (39-73)
  • ContentAddressedStore (13-36)
  • KuboCAS (76-429)
tests/test_sharded_zarr_store.py (2)
py_hamt/store_httpx.py (4)
  • KuboCAS (118-586)
  • save (28-32)
  • save (73-76)
  • save (405-444)
py_hamt/sharded_zarr_store.py (20)
  • ShardedZarrStore (137-870)
  • open (205-236)
  • flush (519-562)
  • _load_or_initialize_shard_cache (398-453)
  • get (46-53)
  • get (564-619)
  • exists (682-693)
  • list_prefix (737-740)
  • delete (707-727)
  • with_read_only (471-510)
  • list_dir (859-870)
  • get_partial_values (462-469)
  • _parse_chunk_key (343-382)
  • MemoryBoundedLRUCache (20-134)
  • put (55-96)
  • cache_size (127-129)
  • estimated_memory_usage (122-124)
  • dirty_cache_size (132-134)
  • mark_clean (104-107)
  • clear (109-115)
py_hamt/hamt_to_sharded_converter.py (4)
py_hamt/hamt.py (4)
  • HAMT (288-705)
  • keys (673-687)
  • get_pointer (610-624)
  • HAMT (287-696)
py_hamt/sharded_zarr_store.py (4)
  • ShardedZarrStore (137-870)
  • open (205-236)
  • set_pointer (660-680)
  • flush (519-562)
py_hamt/store_httpx.py (2)
  • KuboCAS (118-586)
  • KuboCAS (76-429)
py_hamt/zarr_hamt_store.py (3)
  • ZarrHAMTStore (12-322)
  • read_only (108-111)
  • ZarrHAMTStore (11-285)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py

189-189: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


195-195: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


226-228: Avoid specifying long messages outside the exception class

(TRY003)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


269-271: Abstract raise to an inner function

(TRY301)


269-271: Avoid specifying long messages outside the exception class

(TRY003)


273-273: Prefer TypeError exception for invalid type

(TRY004)


273-273: Abstract raise to an inner function

(TRY301)


273-273: Avoid specifying long messages outside the exception class

(TRY003)


274-274: Do not catch blind exception: Exception

(BLE001)


275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


275-275: Avoid specifying long messages outside the exception class

(TRY003)


279-281: Avoid specifying long messages outside the exception class

(TRY003)


291-293: Avoid specifying long messages outside the exception class

(TRY003)


316-316: Abstract raise to an inner function

(TRY301)


316-316: Avoid specifying long messages outside the exception class

(TRY003)


322-322: Consider moving this statement to an else block

(TRY300)


335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


335-337: Avoid specifying long messages outside the exception class

(TRY003)


379-381: Avoid specifying long messages outside the exception class

(TRY003)


428-430: Avoid specifying long messages outside the exception class

(TRY003)


436-436: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


436-436: Avoid specifying long messages outside the exception class

(TRY003)


439-439: Avoid specifying long messages outside the exception class

(TRY003)


452-452: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Docstring contains ambiguous `‑` (NON-BREAKING HYPHEN). Did you mean `-` (HYPHEN-MINUS)?

(RUF002)


508-508: Comment contains ambiguous `‑` (NON-BREAKING HYPHEN). Did you mean `-` (HYPHEN-MINUS)?

(RUF003)


527-527: Avoid specifying long messages outside the exception class

(TRY003)


577-579: Avoid specifying long messages outside the exception class

(TRY003)


608-610: Avoid specifying long messages outside the exception class

(TRY003)


623-623: Avoid specifying long messages outside the exception class

(TRY003)


636-636: Avoid specifying long messages outside the exception class

(TRY003)


656-656: Do not catch blind exception: Exception

(BLE001)


657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


657-657: Avoid specifying long messages outside the exception class

(TRY003)


709-709: Avoid specifying long messages outside the exception class

(TRY003)


716-716: Avoid specifying long messages outside the exception class

(TRY003)


744-744: Avoid specifying long messages outside the exception class

(TRY003)


767-767: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


790-790: Avoid specifying long messages outside the exception class

(TRY003)


797-797: Avoid specifying long messages outside the exception class

(TRY003)


799-801: Avoid specifying long messages outside the exception class

(TRY003)


806-806: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


833-833: Avoid specifying long messages outside the exception class

(TRY003)


839-841: Avoid specifying long messages outside the exception class

(TRY003)

py_hamt/hamt_to_sharded_converter.py

125-125: Do not catch blind exception: Exception

(BLE001)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/sharded_zarr_store.py

[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).

🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)

20-135: LGTM! Excellent memory-bounded cache implementation.

The MemoryBoundedLRUCache class effectively addresses the unbounded memory growth concern raised in previous reviews. Key strengths:

  • Proper LRU eviction when memory limit is exceeded
  • Protection of dirty shards from eviction
  • Thread-safe async operations with locks
  • Accurate memory tracking using sys.getsizeof()

Comment on lines +125 to +126
except Exception as e:
print(f"\nAn error occurred: {e}")
Contributor

@coderabbitai coderabbitai bot Sep 10, 2025

🛠️ Refactor suggestion

Avoid catching blind exception.

The broad Exception catch could hide important errors and make debugging difficult.

Catch specific exceptions that are expected during conversion:

-        except Exception as e:
-            print(f"\nAn error occurred: {e}")
+        except (ValueError, KeyError, RuntimeError) as e:
+            print(f"\nAn error occurred: {e}")
+            import sys
+            sys.exit(1)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception as e:
print(f"\nAn error occurred: {e}")
except (ValueError, KeyError, RuntimeError) as e:
print(f"\nAn error occurred: {e}")
import sys
sys.exit(1)
🧰 Tools
🪛 Ruff (0.12.2)

125-125: Do not catch blind exception: Exception

(BLE001)

Contributor

Screenshot 2025-09-15 at 1 37 31 AM

maybe something to consider here @TheGreatAlgo ?

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines 274 to 276
except Exception as e:
raise ValueError(f"Failed to decode root object: {e}")
self._root_obj = dag_cbor.decode(root_bytes)
Contributor

@coderabbitai coderabbitai bot Sep 10, 2025

⚠️ Potential issue

Duplicate decode operation and broad exception handling.

The root object is decoded twice (line 267 and 276), and the broad Exception catch could hide important errors.

         try:
             self._root_obj = dag_cbor.decode(root_bytes)
             if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
                 raise ValueError(
                     "Root object is not a valid dictionary with 'chunks' key."
                 )
             if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
                 raise ValueError("shard_cids is not a list.")
-        except Exception as e:
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
             raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception as e:
raise ValueError(f"Failed to decode root object: {e}")
self._root_obj = dag_cbor.decode(root_bytes)
try:
self._root_obj = dag_cbor.decode(root_bytes)
if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
raise ValueError(
"Root object is not a valid dictionary with 'chunks' key."
)
if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
raise ValueError("shard_cids is not a list.")
except (ValueError, dag_cbor.CBORDecodeError) as e:
raise ValueError(f"Failed to decode root object: {e}")
🧰 Tools
🪛 Ruff (0.12.2)

274-274: Do not catch blind exception: Exception

(BLE001)


275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


275-275: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 274 to 276, remove the duplicate
dag_cbor.decode call and the overly broad except Exception; instead catch the
specific decode-related exception(s) (e.g., ValueError/CBORDecodeError or the
library's decode error), assign the decoded result to self._root_obj from the
earlier successful decode (or decode once and reuse its variable), and re-raise
a ValueError with the original exception chained/preserved so you don't swallow
important error details.

Contributor

Hmm, it appears to be decoded twice; is this correct?

`self._root_obj = dag_cbor.decode(root_bytes)` shows up here twice?

Contributor

@Faolain, you're absolutely correct! This is indeed a bug - dag_cbor.decode(root_bytes) is being called twice:

  1. Line 267: Inside the try block, decodes and assigns to self._root_obj
  2. Line 276: Outside the try block, decodes again and overwrites self._root_obj

This is problematic because:

  • The validation performed inside the try block becomes meaningless since the result gets overwritten
  • If the second decode fails, it would raise an unhandled exception
  • It's wasteful to decode the same data twice

The correct fix is to simply remove line 276 since the decoding and assignment already happened successfully inside the try block:

async def _load_root_from_cid(self):
    root_bytes = await self.cas.load(self._root_cid)
    try:
        self._root_obj = dag_cbor.decode(root_bytes)
        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
            raise ValueError(
                "Root object is not a valid dictionary with 'chunks' key."
            )
        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
            raise ValueError("shard_cids is not a list.")
    except (ValueError, dag_cbor.CBORDecodeError) as e:
        raise ValueError(f"Failed to decode root object: {e}") from e
    # Remove this duplicate line:
    # self._root_obj = dag_cbor.decode(root_bytes)

Good catch! This would have caused the validation logic to be bypassed.

Contributor

@TheGreatAlgo this too

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines 332 to 334
print(
f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
)
Contributor

🛠️ Refactor suggestion

Replace print statements with proper logging.

Print statements should use the logging module for better control in production environments.

+import logging
+
+logger = logging.getLogger(__name__)
+
                     # Log the failure and raise a specific error
-                    print(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
+                    logger.error(
+                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
+                    )

Also applies to line 340 - replace print(f"Error fetching shard {shard_idx}: {e}") with logger.error(...).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 332-334 (and also the similar
occurrence at line 340), replace the print statements with proper logging calls:
use the module logger (ensure a logger is created/imported at the top, e.g.,
logger = logging.getLogger(__name__)) and change the prints to logger.error with
a clear message including shard_idx and max_retries, and pass exception info
(exc_info=True) where appropriate so stack traces are recorded; keep messages
concise and consistent with other log lines in the file.

Comment on lines +656 to +657
except Exception as e:
raise RuntimeError(f"Failed to save data for key {key}: {e}")
Contributor

🛠️ Refactor suggestion

Improve exception handling specificity.

The broad Exception catch could hide important errors during CAS save operations.

         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
             await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+        except (ConnectionError, TimeoutError, RuntimeError) as e:
+            raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception as e:
raise RuntimeError(f"Failed to save data for key {key}: {e}")
try:
data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
await self.set_pointer(key, str(data_cid_obj))
except (ConnectionError, TimeoutError, RuntimeError) as e:
raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
🧰 Tools
🪛 Ruff (0.12.2)

656-656: Do not catch blind exception: Exception

(BLE001)


657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


657-657: Avoid specifying long messages outside the exception class

(TRY003)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding

[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 656-657, the broad `except
Exception as e` hides specific errors during CAS save; replace it by catching
only the expected exceptions from the CAS save operation (e.g., the CAS client’s
specific exception types, OSError/IOError, ValueError, etc.), let unexpected
exceptions propagate, and when re-raising wrap with `raise RuntimeError(...)
from e` to preserve the original traceback; if you don't know exact CAS
exception classes, remove the blanket except and/or import and catch the CAS
library's specific error types and include `from e` when raising.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
tests/test_sharded_zarr_store_coverage.py (2)

19-23: Delete dead, commented-out blocks

These commented fragments age poorly and confuse intent. Remove them.

Also applies to: 85-89, 180-186, 232-236, 251-263


529-553: Assert retry failures via logging instead of patching print

Production code should log, not print. Switch assertions to caplog.

-    # This should retry and eventually succeed (testing retry logic lines 325-329)
-    with patch("builtins.print") as mock_print:
-        shard_data = await store._load_or_initialize_shard_cache(0)
-        assert shard_data is not None
-        assert len(shard_data) == 4
+    # This should retry and eventually succeed
+    import logging
+    with caplog.at_level(logging.ERROR):
+        shard_data = await store._load_or_initialize_shard_cache(0)
+    assert shard_data is not None
+    assert len(shard_data) == 4
@@
-    # This should hit lines 332-337 (failure after max retries)
-    with patch("builtins.print") as mock_print:
-        with pytest.raises(
-            RuntimeError, match="Failed to fetch shard 0 after 3 attempts"
-        ):
-            await store_fail._load_or_initialize_shard_cache(0)
-        # Should print failure message (line 333)
-        mock_print.assert_called()
+    # This should fail after max retries and log an error
+    with caplog.at_level(logging.ERROR):
+        with pytest.raises(RuntimeError, match="Failed to fetch shard 0 after 3 attempts"):
+            await store_fail._load_or_initialize_shard_cache(0)
+    assert any("Failed to fetch shard 0" in rec.message for rec in caplog.records)
py_hamt/sharded_zarr_store.py (3)

317-347: Replace prints with logging; narrow exceptions and chain causes

Use logging, exponential backoff stays. Chain causes on failure; avoid blind except.

+        import logging
+        logger = logging.getLogger(__name__)
@@
-            except (ConnectionError, TimeoutError) as e:
+            except (ConnectionError, TimeoutError) as e:
                 # Handle transient errors (e.g., network issues)
                 if attempt < max_retries - 1:
                     await asyncio.sleep(
                         retry_delay * (2**attempt)
                     )  # Exponential backoff
                     continue
                 else:
-                    # Log the failure and raise a specific error
-                    print(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
-                    raise RuntimeError(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
-            except Exception as e:
-                # Handle non-transient errors immediately
-                print(f"Error fetching shard {shard_idx}: {e}")
-                raise
+                    logger.error("Failed to fetch shard %s after %s attempts: %s",
+                                 shard_idx, max_retries, e)
+                    raise RuntimeError(
+                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts"
+                    ) from e
+            except (ValueError, TypeError, dag_cbor.CBORDecodeError) as e:
+                logger.error("Error fetching or decoding shard %s: %s", shard_idx, e)
+                raise
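The retry-with-exponential-backoff pattern in the diff above can be distilled into a generic helper; the function name and `fetch` callable are illustrative assumptions, not the PR's actual code.

```python
import asyncio

async def fetch_with_retry(fetch, shard_idx, max_retries=3, retry_delay=0.1):
    """Sketch: retry a transient-failure-prone async fetch with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return await fetch(shard_idx)
        except (ConnectionError, TimeoutError) as e:
            if attempt < max_retries - 1:
                # Backoff doubles each attempt: retry_delay, 2x, 4x, ...
                await asyncio.sleep(retry_delay * (2 ** attempt))
                continue
            # Chain the cause so the original network error is preserved.
            raise RuntimeError(
                f"Failed to fetch shard {shard_idx} after {max_retries} attempts"
            ) from e
```

Non-transient errors (decode failures, type errors) fall outside the `except` and propagate immediately, matching the split the diff proposes.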

555-567: Unify IPLD ID handling; avoid coercing to str

Coercing IDs breaks generic CAS backends (e.g., InMemoryCAS expects bytes). Keep IDs as returned by cas.save/load and only decode strings when input truly is a string.

-        if self._dirty_root:
+        if self._dirty_root:
             # Ensure all metadata CIDs are CID objects for correct encoding
             self._root_obj["metadata"] = {
                 k: (CID.decode(v) if isinstance(v, str) else v)
                 for k, v in self._root_obj["metadata"].items()
             }
             root_obj_bytes = dag_cbor.encode(self._root_obj)
             new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-            self._root_cid = str(new_root_cid)
+            self._root_cid = new_root_cid
             self._dirty_root = False
@@
-            data = await self.cas.load(str(metadata_cid_obj))
+            data = await self.cas.load(metadata_cid_obj)
             return prototype.buffer.from_bytes(data)
@@
-            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+            await self.set_pointer(key, data_cid_obj)
+        except (ConnectionError, TimeoutError, RuntimeError, ValueError) as e:
+            raise RuntimeError(f"Failed to save data for key {key}") from e
@@
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | bytes | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
 
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = CID.decode(pointer) if isinstance(pointer, str) else pointer  # type: ignore[assignment]

Follow-up: if you want flush() to return a string CID, return str(self._root_cid) at the end.

Also applies to: 585-586, 659-662, 665-673


269-282: Decode once and raise with cause; remove duplicate decode and broad catch

_load_root_from_cid() decodes twice and catches Exception. Decode once, validate, and re-raise with from e.

-        root_bytes = await self.cas.load(self._root_cid)
-        try:
-            self._root_obj = dag_cbor.decode(root_bytes)
-            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-                raise ValueError(
-                    "Root object is not a valid dictionary with 'chunks' key."
-                )
-            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-                raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
+        root_bytes = await self.cas.load(self._root_cid)
+        try:
+            self._root_obj = dag_cbor.decode(root_bytes)
+            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+                raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+                raise ValueError("shard_cids is not a list.")
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
+            raise ValueError("Failed to decode root object") from e
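As a small illustration of why `from e` matters, the original failure survives as `__cause__` after being wrapped. `decode` here is a stand-in for `dag_cbor.decode`, not the real library call:

```python
def decode(data: bytes) -> dict:
    # Stand-in for dag_cbor.decode; always fails for this sketch.
    raise TypeError("not valid CBOR")

def load_root(root_bytes: bytes) -> dict:
    try:
        return decode(root_bytes)
    except (ValueError, TypeError) as e:
        # Chaining keeps the decoder's error attached for debugging.
        raise ValueError("Failed to decode root object") from e

caught = None
try:
    load_root(b"\x00")
except ValueError as err:
    caught = err

assert caught is not None
assert isinstance(caught.__cause__, TypeError)
assert str(caught.__cause__) == "not valid CBOR"
```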
🧹 Nitpick comments (3)
tests/test_sharded_zarr_store_coverage.py (1)

132-136: Remove unused caplog argument

caplog is unused in this test. Drop it to satisfy ARG001.

-async def test_sharded_zarr_store_shard_handling_exceptions(
-    create_ipfs: tuple[str, str], caplog
-):
+async def test_sharded_zarr_store_shard_handling_exceptions(
+    create_ipfs: tuple[str, str]
+):
py_hamt/sharded_zarr_store.py (2)

771-776: Make zip strict in graft/resize math

Avoid silent truncation when offsets/ranks mismatch.

-            global_coords = tuple(
-                c_local + c_offset
-                for c_local, c_offset in zip(local_coords, chunk_offset)
-            )
+            global_coords = tuple(
+                c_local + c_offset
+                for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
+            )
@@
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )

Also applies to: 809-812


122-125: Don’t implement __contains__ as async

The `in` operator won't await an async `__contains__`; membership checks truth-test the un-awaited coroutine object, which is always truthy. Rename it to `contains()` or make it synchronous.

-    async def __contains__(self, shard_idx: int) -> bool:
-        async with self._cache_lock:
-            return shard_idx in self._cache
+    def contains(self, shard_idx: int) -> bool:
+        # Best-effort check; not acquiring lock is acceptable for a hint
+        return shard_idx in self._cache
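A minimal repro of why the async version misbehaves (`AsyncCache` is a hypothetical stand-in for the real cache):

```python
class AsyncCache:
    def __init__(self):
        self._cache = {1: b"data"}

    async def __contains__(self, shard_idx):
        # Never awaited by the `in` operator.
        return shard_idx in self._cache

cache = AsyncCache()
# `in` calls __contains__ synchronously and truth-tests the result.
# The un-awaited coroutine object is always truthy, so even missing
# keys appear to be present.
result = 999 in cache
assert result is True  # wrong: key 999 is not in the cache
```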
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e55a3e3 and ac17139.

📒 Files selected for processing (2)
  • py_hamt/sharded_zarr_store.py (1 hunks)
  • tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/zarr_hamt_store.py (1)
  • ZarrHAMTStore (11-285)
py_hamt/store_httpx.py (3)
  • InMemoryCAS (39-73)
  • ContentAddressedStore (13-36)
  • KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (3)
py_hamt/store_httpx.py (7)
  • KuboCAS (118-586)
  • save (28-32)
  • save (73-76)
  • save (405-444)
  • load (35-42)
  • load (78-115)
  • load (446-505)
tests/testing_utils.py (1)
  • create_ipfs (149-195)
py_hamt/zarr_hamt_store.py (3)
  • read_only (108-111)
  • set_partial_values (225-229)
  • exists (193-199)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py

194-194: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


200-200: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


231-233: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Avoid specifying long messages outside the exception class

(TRY003)


274-276: Abstract raise to an inner function

(TRY301)


274-276: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Prefer TypeError exception for invalid type

(TRY004)


278-278: Abstract raise to an inner function

(TRY301)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


279-279: Do not catch blind exception: Exception

(BLE001)


280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


284-286: Avoid specifying long messages outside the exception class

(TRY003)


296-298: Avoid specifying long messages outside the exception class

(TRY003)


321-321: Abstract raise to an inner function

(TRY301)


321-321: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Consider moving this statement to an else block

(TRY300)


340-342: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


340-342: Avoid specifying long messages outside the exception class

(TRY003)


384-386: Avoid specifying long messages outside the exception class

(TRY003)


433-435: Avoid specifying long messages outside the exception class

(TRY003)


441-441: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


441-441: Avoid specifying long messages outside the exception class

(TRY003)


444-444: Avoid specifying long messages outside the exception class

(TRY003)


457-457: Avoid specifying long messages outside the exception class

(TRY003)


479-479: Docstring contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)


513-513: Comment contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF003)


532-532: Avoid specifying long messages outside the exception class

(TRY003)


582-584: Avoid specifying long messages outside the exception class

(TRY003)


613-615: Avoid specifying long messages outside the exception class

(TRY003)


628-628: Avoid specifying long messages outside the exception class

(TRY003)


641-641: Avoid specifying long messages outside the exception class

(TRY003)


661-661: Do not catch blind exception: Exception

(BLE001)


662-662: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


662-662: Avoid specifying long messages outside the exception class

(TRY003)


714-714: Avoid specifying long messages outside the exception class

(TRY003)


721-721: Avoid specifying long messages outside the exception class

(TRY003)


749-749: Avoid specifying long messages outside the exception class

(TRY003)


772-772: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


795-795: Avoid specifying long messages outside the exception class

(TRY003)


802-802: Avoid specifying long messages outside the exception class

(TRY003)


804-806: Avoid specifying long messages outside the exception class

(TRY003)


811-811: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


838-838: Avoid specifying long messages outside the exception class

(TRY003)


844-846: Avoid specifying long messages outside the exception class

(TRY003)

tests/test_sharded_zarr_store_coverage.py

133-133: Unused function argument: caplog

(ARG001)


348-348: Unused method argument: cid

(ARG002)


390-390: Unused method argument: cid

(ARG002)


431-431: Unused method argument: cid

(ARG002)


457-457: Unused method argument: cid

(ARG002)


496-496: Unused method argument: cid

(ARG002)


499-499: Avoid specifying long messages outside the exception class

(TRY003)


505-505: Unused method argument: data

(ARG002)


505-505: Unused method argument: codec

(ARG002)


600-600: Unused method argument: cid

(ARG002)


600-600: Unused method argument: offset

(ARG002)


600-600: Unused method argument: length

(ARG002)


600-600: Unused method argument: suffix

(ARG002)


673-673: Unused method argument: data

(ARG002)


673-673: Unused method argument: codec

(ARG002)


675-675: Avoid specifying long messages outside the exception class

(TRY003)


710-710: Unused method argument: data

(ARG002)


710-710: Unused method argument: codec

(ARG002)


765-765: Unused method argument: cid

(ARG002)


784-784: Unused function argument: shard_idx

(ARG001)


818-818: Unused method argument: cid

(ARG002)


844-844: Unused function argument: timeout

(ARG001)


877-877: Unused method argument: cid

(ARG002)


900-900: Unused function argument: timeout

(ARG001)


908-908: Unused function argument: shard_idx

(ARG001)

🔇 Additional comments (4)
tests/test_sharded_zarr_store_coverage.py (2)

266-328: LRU cache edge-cases well covered

Nice coverage for empty shard sizing, update of existing entries, and eviction when only dirty shards exist.


801-858: Timeout cleanup scenario is precise

Good use of a manual pending event and asyncio.wait_for patch to validate cleanup of _pending_shard_loads.

py_hamt/sharded_zarr_store.py (2)

29-45: MemoryBoundedLRUCache looks solid

Size accounting, dirty protection, and eviction loop behavior are clean and deterministic.

Also applies to: 55-103


736-787: Grafting path is correct and lock-scoped

Good lock granularity per-target-shard, and dirty marking only on change.

Comment on lines +193 to +201
if not all(cs > 0 for cs in self._chunk_shape):
raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
raise ValueError("All array_shape dimensions must be non-negative.")

self._chunks_per_dim = tuple(
math.ceil(a / c) if c > 0 else 0
for a, c in zip(self._array_shape, self._chunk_shape)
)

🛠️ Refactor suggestion

Add rank check and use zip(..., strict=True)

Guard mismatched shapes early and make zip intent explicit.

-        if not all(cs > 0 for cs in self._chunk_shape):
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+        if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
 
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not all(cs > 0 for cs in self._chunk_shape):
raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
raise ValueError("All array_shape dimensions must be non-negative.")
self._chunks_per_dim = tuple(
math.ceil(a / c) if c > 0 else 0
for a, c in zip(self._array_shape, self._chunk_shape)
)
if len(self._array_shape) != len(self._chunk_shape):
raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
if not all(cs > 0 for cs in self._chunk_shape):
raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
raise ValueError("All array_shape dimensions must be non-negative.")
self._chunks_per_dim = tuple(
math.ceil(a / c) if c > 0 else 0
for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
)
🧰 Tools
🪛 Ruff (0.12.2)

194-194: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


200-200: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 193 to 201, add an explicit rank
check to ensure len(self._array_shape) == len(self._chunk_shape) and raise a
ValueError with a clear message if they differ, then compute
self._chunks_per_dim using zip(self._array_shape, self._chunk_shape,
strict=True) (keeping the existing math.ceil(a / c) if c > 0 else 0 logic) so
mismatched shapes are caught early and the zip intent is explicit.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (3)
py_hamt/sharded_zarr_store.py (3)

269-282: Decode-once and narrow rethrow; remove duplicate decode

Root is decoded twice and exceptions aren’t chained.

-        try:
-            self._root_obj = dag_cbor.decode(root_bytes)
-            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-                raise ValueError(
-                    "Root object is not a valid dictionary with 'chunks' key."
-                )
-            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-                raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
+        try:
+            self._root_obj = dag_cbor.decode(root_bytes)
+            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+                raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+                raise ValueError("shard_cids is not a list.")
+        except Exception as e:
+            raise ValueError("Failed to decode root object") from e

549-561: Do not coerce root CID to string; only stringify on return

This prevents incompatibilities with non-string IPLD kinds (e.g., InMemoryCAS). Return a string for API consistency.

-            new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-            self._root_cid = str(new_root_cid)
+            new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
+            self._root_cid = new_root_cid
             self._dirty_root = False
-
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)  # unified return type

190-201: Add rank check and use strict zip in geometry

Catch shape rank mismatches early; make zip intent explicit.

-        if not all(cs > 0 for cs in self._chunk_shape):
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+        if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
 
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store_coverage.py (3)

132-134: Remove unused caplog parameter

caplog is not used. Drop it to satisfy linters and reduce noise.

-async def test_sharded_zarr_store_shard_handling_exceptions(
-    create_ipfs: tuple[str, str], caplog
-):
+async def test_sharded_zarr_store_shard_handling_exceptions(
+    create_ipfs: tuple[str, str],
+):

346-359: Prefix unused mock arguments with underscores

Silence ARG001/ARG002 by prefixing unused parameters (cid, data, codec, timeout) with _. Repeat pattern for all mocks in this file.

-        async def load(self, cid):
+        async def load(self, _cid):
@@
-        async def load(self, cid):
+        async def load(self, _cid):
@@
-        async def load(self, cid):
+        async def load(self, _cid):
@@
-        async def load(self, cid):
+        async def load(self, _cid):
@@
-        async def save(self, data, codec=None):
+        async def save(self, _data, _codec=None):
@@
-        async def load(self, cid, offset=None, length=None, suffix=None):
+        async def load(self, _cid, offset=None, length=None, suffix=None):
@@
-        async def save(self, data, codec=None):
+        async def save(self, _data, _codec=None):
@@
-        async def save(self, data, codec=None):
+        async def save(self, _data, _codec=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_cache_get(shard_idx):
+    async def mock_cache_get(_shard_idx):

Also applies to: 387-399, 427-429, 452-464, 487-500, 583-586, 654-657, 690-693, 745-751, 797-801, 854-858, 823-827, 877-884, 885-888


329-335: Avoid brittle line-number references in test docstrings

Docstrings that mention source line numbers will rot quickly. Rephrase to describe behavior/intent instead.

Also applies to: 569-575

py_hamt/sharded_zarr_store.py (3)

21-27: Clarify that size is an estimate; consider naming and doc tweak

sys.getsizeof undercounts nested objects; it’s an estimate. Update docstring accordingly to avoid misleading claims.

-    Memory usage is calculated using sys.getsizeof for accurate sizing.
+    Memory usage is estimated using sys.getsizeof on the list and elements.

Also applies to: 37-45
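The undercount is easy to demonstrate; the shard list below is illustrative:

```python
import sys

# A shard slot list holding two 10 KB payloads.
shard = [b"x" * 10_000, None, b"y" * 10_000]

container_only = sys.getsizeof(shard)  # list header + element pointers only
with_elements = container_only + sum(
    sys.getsizeof(e) for e in shard if e is not None
)

assert container_only < 1_000    # tiny: payload bytes are not counted
assert with_elements > 20_000    # the payloads dominate real usage
```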


46-53: Use OrderedDict.move_to_end for LRU bump

Simpler and faster than pop/assign.

-            shard_data = self._cache.pop(shard_idx)
-            self._cache[shard_idx] = shard_data
+            shard_data = self._cache[shard_idx]
+            self._cache.move_to_end(shard_idx, last=True)
             return shard_data
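For reference, the `move_to_end` idiom keeps eviction order intact without a pop/re-insert round trip:

```python
from collections import OrderedDict

cache = OrderedDict([(0, b"a"), (1, b"b"), (2, b"c")])

_ = cache[0]          # read shard 0 ...
cache.move_to_end(0)  # ... and mark it most-recently-used
assert list(cache) == [1, 2, 0]

# Eviction pops from the least-recently-used end.
evicted_idx, _ = cache.popitem(last=False)
assert evicted_idx == 1
```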

776-781: Backpressure when all shards are dirty

If many shards are dirty, memory can exceed the bound since dirty shards aren’t evicted. Consider optional auto-flush or a high-watermark callback to shed memory.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac17139 and 6cac0f5.

📒 Files selected for processing (2)
  • py_hamt/sharded_zarr_store.py (1 hunks)
  • tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (10)
  • ContentAddressedStore (13-60)
  • load (35-42)
  • load (78-115)
  • load (446-505)
  • save (28-32)
  • save (73-76)
  • save (405-444)
  • InMemoryCAS (39-73)
  • ContentAddressedStore (13-36)
  • KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (12)
  • load (346-358)
  • load (387-388)
  • load (427-428)
  • load (452-463)
  • load (487-494)
  • load (583-584)
  • load (745-748)
  • load (797-799)
  • load (854-857)
  • save (496-501)
  • save (654-656)
  • save (690-693)
tests/test_sharded_zarr_store_coverage.py (2)
py_hamt/store_httpx.py (7)
  • KuboCAS (118-586)
  • save (28-32)
  • save (73-76)
  • save (405-444)
  • load (35-42)
  • load (78-115)
  • load (446-505)
py_hamt/sharded_zarr_store.py (13)
  • ShardedZarrStore (142-869)
  • _load_or_initialize_shard_cache (397-452)
  • get (46-53)
  • get (563-618)
  • resize_store (782-824)
  • MemoryBoundedLRUCache (20-139)
  • _get_shard_size (37-44)
  • put (55-101)
  • dirty_cache_size (137-139)
  • _load_root_from_cid (269-298)
  • with_read_only (470-509)
  • exists (681-692)
  • flush (518-561)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py

194-194: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


200-200: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


231-233: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Avoid specifying long messages outside the exception class

(TRY003)


274-276: Abstract raise to an inner function

(TRY301)


274-276: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Prefer TypeError exception for invalid type

(TRY004)


278-278: Abstract raise to an inner function

(TRY301)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


279-279: Do not catch blind exception: Exception

(BLE001)


280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


284-286: Avoid specifying long messages outside the exception class

(TRY003)


296-298: Avoid specifying long messages outside the exception class

(TRY003)


321-321: Abstract raise to an inner function

(TRY301)


321-321: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Consider moving this statement to an else block

(TRY300)


336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


336-338: Avoid specifying long messages outside the exception class

(TRY003)


340-340: Use raise without specifying exception name

Remove exception name

(TRY201)


378-380: Avoid specifying long messages outside the exception class

(TRY003)


427-429: Avoid specifying long messages outside the exception class

(TRY003)


435-435: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


435-435: Avoid specifying long messages outside the exception class

(TRY003)


438-438: Avoid specifying long messages outside the exception class

(TRY003)


451-451: Avoid specifying long messages outside the exception class

(TRY003)


473-473: Docstring contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)


507-507: Comment contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF003)


526-526: Avoid specifying long messages outside the exception class

(TRY003)


576-578: Avoid specifying long messages outside the exception class

(TRY003)


607-609: Avoid specifying long messages outside the exception class

(TRY003)


622-622: Avoid specifying long messages outside the exception class

(TRY003)


635-635: Avoid specifying long messages outside the exception class

(TRY003)


655-655: Do not catch blind exception: Exception

(BLE001)


656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


656-656: Avoid specifying long messages outside the exception class

(TRY003)


708-708: Avoid specifying long messages outside the exception class

(TRY003)


715-715: Avoid specifying long messages outside the exception class

(TRY003)


743-743: Avoid specifying long messages outside the exception class

(TRY003)


766-766: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


789-789: Avoid specifying long messages outside the exception class

(TRY003)


796-796: Avoid specifying long messages outside the exception class

(TRY003)


798-800: Avoid specifying long messages outside the exception class

(TRY003)


805-805: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


832-832: Avoid specifying long messages outside the exception class

(TRY003)


838-840: Avoid specifying long messages outside the exception class

(TRY003)

tests/test_sharded_zarr_store_coverage.py

133-133: Unused function argument: caplog

(ARG001)


346-346: Unused method argument: cid

(ARG002)


387-387: Unused method argument: cid

(ARG002)


427-427: Unused method argument: cid

(ARG002)


452-452: Unused method argument: cid

(ARG002)


487-487: Unused method argument: cid

(ARG002)


490-490: Avoid specifying long messages outside the exception class

(TRY003)


496-496: Unused method argument: data

(ARG002)


496-496: Unused method argument: codec

(ARG002)


583-583: Unused method argument: cid

(ARG002)


583-583: Unused method argument: offset

(ARG002)


583-583: Unused method argument: length

(ARG002)


583-583: Unused method argument: suffix

(ARG002)


654-654: Unused method argument: data

(ARG002)


654-654: Unused method argument: codec

(ARG002)


656-656: Avoid specifying long messages outside the exception class

(TRY003)


690-690: Unused method argument: data

(ARG002)


690-690: Unused method argument: codec

(ARG002)


745-745: Unused method argument: cid

(ARG002)


764-764: Unused function argument: shard_idx

(ARG001)


797-797: Unused method argument: cid

(ARG002)


823-823: Unused function argument: timeout

(ARG001)


854-854: Unused method argument: cid

(ARG002)


877-877: Unused function argument: timeout

(ARG001)


885-885: Unused function argument: shard_idx

(ARG001)

🔇 Additional comments (3)
tests/test_sharded_zarr_store_coverage.py (1)

304-326: LRU eviction behavior tests look solid

Good coverage of “no clean shards to evict” path; verifies dirty shard protection works as intended.

py_hamt/sharded_zarr_store.py (2)

681-693: exists() looks good

Specific exception handling and shard on-demand load is clean.


470-509: with_read_only shallow clone is efficient and clear

Sharing locks/cache/pending-loads while flipping the flag is appropriate.

Comment on lines +157 to +164
def __init__(
self,
cas: ContentAddressedStore,
read_only: bool,
root_cid: Optional[str] = None,
*,
max_cache_memory_bytes: int = 100 * 1024 * 1024, # 100MB default
):

🛠️ Refactor suggestion

Normalize root CID typing and equality

Stop forcing string conversions internally; accept IPLDKind on input, store as-is, and compare by string form for equality to avoid type mismatches.

-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
-        return self._root_cid == other._root_cid
+        return str(self._root_cid) == str(other._root_cid)

Also applies to: 209-220, 511-516

Comment on lines +440 to +446
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
self._pending_shard_loads[shard_idx] = asyncio.Event()
shard_cid_str = str(shard_cid_obj)
await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
else:
empty_shard = [None] * self._chunks_per_shard

🛠️ Refactor suggestion

Avoid str() on shard CIDs; pass IPLDKind to loader

Keep types consistent with CAS expectations.

-        if shard_cid_obj:
+        if shard_cid_obj:
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # type: ignore[arg-type]
📝 Committable suggestion


Suggested change
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
self._pending_shard_loads[shard_idx] = asyncio.Event()
shard_cid_str = str(shard_cid_obj)
await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
else:
empty_shard = [None] * self._chunks_per_shard
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
self._pending_shard_loads[shard_idx] = asyncio.Event()
await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj) # type: ignore[arg-type]
else:
empty_shard = [None] * self._chunks_per_shard
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 440 to 446, avoid calling str() on
shard CID objects — the CAS expects an IPLDKind (CID) not a string. Replace
shard_cid_str = str(shard_cid_obj) with using the original shard_cid_obj (or
explicitly cast to the IPLDKind/CID type expected by
_fetch_and_cache_full_shard) and pass that value into await
self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj); ensure no string
conversion is performed and that the loader signature accepts the IPLDKind.

Comment on lines +579 to +581
data = await self.cas.load(str(metadata_cid_obj))
return prototype.buffer.from_bytes(data)
# Chunk data request
🛠️ Refactor suggestion

Pass CID objects to CAS.load; don’t convert to str

Keeps IPLD types intact and avoids assumptions about CAS backends.

-            data = await self.cas.load(str(metadata_cid_obj))
+            data = await self.cas.load(metadata_cid_obj)
-        chunk_cid_str = str(chunk_cid_obj)
@@
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )

Also applies to: 596-617

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 579-581 (and similarly 596-617),
the code converts CID objects to strings before calling self.cas.load; change
those calls to pass the CID objects directly (e.g.,
self.cas.load(metadata_cid_obj) instead of self.cas.load(str(metadata_cid_obj)))
so IPLD/CID types are preserved for backends that expect native CID objects;
update all occurrences in the specified ranges to remove str(...) wrapping and
adjust any type hints or variable names if needed to reflect that a CID object
(not a str) is being passed.

Comment on lines +649 to +657
raw_data_bytes = value.to_bytes()
# Save the data to CAS first to get its CID.
# Metadata is often saved as 'raw', chunks as well unless compressed.
try:
data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
await self.set_pointer(key, str(data_cid_obj))
except Exception as e:
raise RuntimeError(f"Failed to save data for key {key}: {e}")
return None # type: ignore[return-value]
🛠️ Refactor suggestion

Avoid stringifying pointers; accept CID|str in set_pointer and chain exceptions

This removes needless encode/decode cycles and fixes incompatibility with non-string IPLD kinds.

-        try:
-            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+        try:
+            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+            await self.set_pointer(key, data_cid_obj)  # pass IPLDKind directly
+        except Exception as e:
+            raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
 
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)

Also applies to: 659-667, 669-679

🧰 Tools
🪛 Ruff (0.12.2)

655-655: Do not catch blind exception: Exception

(BLE001)


656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


656-656: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 649-657 (also apply same changes
to blocks at 659-667 and 669-679), the code stringifies CID objects before
calling set_pointer and raises a new RuntimeError without chaining the original
exception; modify the code to pass the CID object (or accept CID|str) directly
to set_pointer instead of calling str(), update set_pointer's
signature/implementation to accept either a CID object or a string (or normalize
input inside set_pointer), and re-raise the RuntimeError using exception
chaining (raise ... from e) so the original exception context is preserved.
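The chaining pattern asked for above can be exercised in isolation. The `save_chunk` helper and the simulated `ConnectionError` below are illustrative stand-ins, not py_hamt code:

```python
def save_chunk(key: str) -> None:
    """Stand-in for a store write whose CAS call fails."""
    try:
        raise ConnectionError("CAS backend unreachable")  # simulated failure
    except (ConnectionError, TimeoutError) as e:
        # "from e" attaches the original failure as __cause__, so logs and
        # debuggers show both the wrapper and the root cause.
        raise RuntimeError(f"Failed to save data for key {key}") from e

try:
    save_chunk("c/0/0")
except RuntimeError as err:
    assert isinstance(err.__cause__, ConnectionError)
```

Without `from e`, the original traceback is still reachable via `__context__`, but marking the cause explicitly distinguishes deliberate wrapping from an accident inside the `except` block, which is exactly what Ruff's B904 checks for.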

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)

336-338: Add exception chaining for better debugging.

When retries are exhausted, the original exception should be chained.

                 else:
                     raise RuntimeError(
                         f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
+                    ) from e

441-442: Pass CID object directly to avoid unnecessary conversion.

The CAS expects IPLDKind (CID), not a string.

             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)

Also update the method signature:

     async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:

652-654: Improve exception handling and pass CID directly.

Catch specific exceptions and pass the CID object without stringification.

         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            await self.set_pointer(key, data_cid_obj)
+        except (ConnectionError, TimeoutError, RuntimeError) as e:
+            raise RuntimeError(f"Failed to save data for key {key}: {e}") from e

Also update the set_pointer signature:

-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
 
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)

853-853: Store CID object directly, not as string.

Metadata CIDs should be stored as CID objects for consistency.

-        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid
+        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid

Note: This line already stores the CID object correctly - no change needed.


193-200: Add rank check and use zip(..., strict=True).

Validate that shapes have the same dimensionality and make zip behavior explicit.

+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
         if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
 
         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )

279-281: Remove duplicate decode and improve exception handling.

The root object is decoded twice (line 272 and 281), and the broad Exception catch could hide important errors.

         try:
             self._root_obj = dag_cbor.decode(root_bytes)
             if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
                 raise ValueError(
                     "Root object is not a valid dictionary with 'chunks' key."
                 )
             if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
                 raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
+            raise ValueError(f"Failed to decode root object: {e}") from e

577-577: Pass CID objects directly to CAS.load.

Converting CIDs to strings breaks compatibility with CAS backends that expect native CID objects.

-            data = await self.cas.load(str(metadata_cid_obj))
+            data = await self.cas.load(metadata_cid_obj)

594-615: Avoid stringifying chunk CID.

Pass the CID object directly to maintain type consistency.

-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None

         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
                     req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
         data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
         )
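The byte-range mapping being reviewed can be sketched standalone. The three request classes below are simplified stand-ins for zarr's `RangeByteRequest`, `OffsetByteRequest`, and `SuffixByteRequest`, not imports from zarr:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeByteRequest:  # stand-in for zarr's type
    start: int
    end: Optional[int]

@dataclass
class OffsetByteRequest:  # stand-in for zarr's type
    offset: int

@dataclass
class SuffixByteRequest:  # stand-in for zarr's type
    suffix: int

def to_cas_args(byte_range):
    """Map a Zarr byte request onto (offset, length, suffix) CAS kwargs."""
    offset = length = suffix = None
    if isinstance(byte_range, RangeByteRequest):
        offset = byte_range.start
        if byte_range.end is not None:
            if byte_range.start > byte_range.end:
                raise ValueError("byte range start cannot exceed end")
            length = byte_range.end - byte_range.start
    elif isinstance(byte_range, OffsetByteRequest):
        offset = byte_range.offset
    elif isinstance(byte_range, SuffixByteRequest):
        suffix = byte_range.suffix
    return offset, length, suffix

assert to_cas_args(RangeByteRequest(10, 30)) == (10, 20, None)
assert to_cas_args(SuffixByteRequest(16)) == (None, None, 16)
```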
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)

762-765: Use strict=True for zip to catch mismatched dimensions.

Make the zip behavior explicit to catch potential bugs.

             global_coords = tuple(
                 c_local + c_offset
-                for c_local, c_offset in zip(local_coords, chunk_offset)
+                for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
             )

801-803: Use strict=True for zip consistency.

Make the zip behavior explicit as recommended in other locations.

         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )

624-629: Improve metadata key detection logic.

The current check not key == "zarr.json" with length-based detection is brittle. Consider a more explicit approach.

         if (
             key.endswith("zarr.json")
             and not key.startswith("time/")
             and not key.startswith(("lat/", "latitude/"))
             and not key.startswith(("lon/", "longitude/"))
-            and not key == "zarr.json"
+            and key != "zarr.json"  # Exclude root zarr.json
         ):
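The key filter under discussion, extracted into a standalone helper (the name and the coordinate prefixes are illustrative, taken from the condition above):

```python
def is_data_var_metadata(key: str) -> bool:
    """True for a data variable's zarr.json, excluding the root document
    and well-known coordinate arrays."""
    coord_prefixes = ("time/", "lat/", "latitude/", "lon/", "longitude/")
    return (
        key.endswith("zarr.json")
        and key != "zarr.json"  # exclude root zarr.json
        and not key.startswith(coord_prefixes)
    )

assert is_data_var_metadata("precip/zarr.json")
assert not is_data_var_metadata("zarr.json")
assert not is_data_var_metadata("lat/zarr.json")
```

Pulling the condition into a named predicate makes the intent testable, though the prefix allowlist remains brittle for datasets with other coordinate names.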
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cac0f5 and 7529c2f.

📒 Files selected for processing (1)
  • py_hamt/sharded_zarr_store.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
  • ContentAddressedStore (13-60)
  • load (35-42)
  • load (78-115)
  • load (446-505)
  • save (28-32)
  • save (73-76)
  • save (405-444)
tests/test_sharded_zarr_store_coverage.py (12)
  • load (346-358)
  • load (387-388)
  • load (427-428)
  • load (452-463)
  • load (487-494)
  • load (583-584)
  • load (745-748)
  • load (797-799)
  • load (854-857)
  • save (496-501)
  • save (654-656)
  • save (690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py

194-194: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


200-200: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


231-233: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Avoid specifying long messages outside the exception class

(TRY003)


274-276: Abstract raise to an inner function

(TRY301)


274-276: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Prefer TypeError exception for invalid type

(TRY004)


278-278: Abstract raise to an inner function

(TRY301)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


279-279: Do not catch blind exception: Exception

(BLE001)


280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


284-286: Avoid specifying long messages outside the exception class

(TRY003)


296-298: Avoid specifying long messages outside the exception class

(TRY003)


321-321: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Consider moving this statement to an else block

(TRY300)


336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


336-338: Avoid specifying long messages outside the exception class

(TRY003)


376-378: Avoid specifying long messages outside the exception class

(TRY003)


425-427: Avoid specifying long messages outside the exception class

(TRY003)


433-433: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


433-433: Avoid specifying long messages outside the exception class

(TRY003)


436-436: Avoid specifying long messages outside the exception class

(TRY003)


449-449: Avoid specifying long messages outside the exception class

(TRY003)


471-471: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)


505-505: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF003)


524-524: Avoid specifying long messages outside the exception class

(TRY003)


574-576: Avoid specifying long messages outside the exception class

(TRY003)


605-607: Avoid specifying long messages outside the exception class

(TRY003)


620-620: Avoid specifying long messages outside the exception class

(TRY003)


633-633: Avoid specifying long messages outside the exception class

(TRY003)


653-653: Do not catch blind exception: Exception

(BLE001)


654-654: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


654-654: Avoid specifying long messages outside the exception class

(TRY003)


706-706: Avoid specifying long messages outside the exception class

(TRY003)


713-713: Avoid specifying long messages outside the exception class

(TRY003)


741-741: Avoid specifying long messages outside the exception class

(TRY003)


764-764: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


787-787: Avoid specifying long messages outside the exception class

(TRY003)


794-794: Avoid specifying long messages outside the exception class

(TRY003)


796-798: Avoid specifying long messages outside the exception class

(TRY003)


803-803: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


830-830: Avoid specifying long messages outside the exception class

(TRY003)


836-838: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/sharded_zarr_store.py (2)

20-140: Excellent memory-bounded LRU cache implementation!

The MemoryBoundedLRUCache is well-designed with:

  • Accurate memory sizing using sys.getsizeof
  • Protection of dirty shards from eviction
  • Thread-safe operations with asyncio.Lock
  • Proper LRU eviction strategy

This addresses the concern raised in PR comments about unbounded cache growth.
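A stripped-down sketch of the same pattern (not the actual MemoryBoundedLRUCache; names, sizing policy, and the O(n) total recomputation are illustrative only):

```python
import asyncio
import sys
from collections import OrderedDict

class TinyLRUCache:
    """Memory-bounded LRU that never evicts dirty (unflushed) entries."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._cache: "OrderedDict[int, list]" = OrderedDict()
        self._dirty: set = set()
        self._lock = asyncio.Lock()

    @staticmethod
    def _size(value: list) -> int:
        # sys.getsizeof on a list excludes its elements; add them in.
        return sys.getsizeof(value) + sum(sys.getsizeof(v) for v in value)

    async def put(self, key: int, value: list, dirty: bool = False) -> None:
        async with self._lock:
            if dirty:
                self._dirty.add(key)
            self._cache[key] = value
            self._cache.move_to_end(key)  # mark as most recently used
            # Recomputing the total each put is O(n); fine for a sketch.
            total = sum(self._size(v) for v in self._cache.values())
            for k in list(self._cache):
                if total <= self.max_bytes:
                    break
                if k in self._dirty:  # never evict unflushed shards
                    continue
                total -= self._size(self._cache.pop(k))

    async def get(self, key: int):
        async with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)
                return self._cache[key]
            return None
```

The key design point mirrored here is the dirty-set check inside the eviction loop: a shard that has pending writes stays resident until a flush clears its dirty flag, even if that temporarily pushes the cache past its byte budget.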


548-556: LGTM - Proper CID handling in flush.

The flush method correctly ensures metadata CIDs are CID objects before encoding and properly marks shards as clean after successful writes.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)

193-201: Add rank check and use zip(strict=True) to catch shape mismatches early.

-        if not all(cs > 0 for cs in self._chunk_shape):
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+        if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
 
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )

269-281: Decode once; narrow exceptions and chain them.

-        try:
-            self._root_obj = dag_cbor.decode(root_bytes)
-            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-                raise ValueError(
-                    "Root object is not a valid dictionary with 'chunks' key."
-                )
-            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-                raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
+        try:
+            self._root_obj = dag_cbor.decode(root_bytes)
+            if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+                raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+            if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+                raise ValueError("shard_cids is not a list.")
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
+            raise ValueError("Failed to decode root object.") from e

437-445: Don’t stringify CIDs when loading shards; pass IPLDKind through.

-        if shard_cid_obj:
+        if shard_cid_obj:
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # pass CID/bytes directly

560-577: Metadata path: avoid str() on CIDs.

-            data = await self.cas.load(str(metadata_cid_obj))
+            data = await self.cas.load(metadata_cid_obj)

508-513: Root CID equality should normalize type to avoid CID vs bytes vs str mismatches.

-        return self._root_cid == other._root_cid
+        return str(self._root_cid) == str(other._root_cid)

546-559: Avoid forcing root CID to str; keep IPLDKind internally and return str from flush().

-            root_obj_bytes = dag_cbor.encode(self._root_obj)
-            new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-            self._root_cid = str(new_root_cid)
+            root_obj_bytes = dag_cbor.encode(self._root_obj)
+            new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
+            self._root_cid = new_root_cid
             self._dirty_root = False
 
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)

646-654: Pass CID to set_pointer and chain the exception; don’t stringify.

-        try:
-            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+        try:
+            data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+            await self.set_pointer(key, data_cid_obj)
+        except Exception as e:
+            raise RuntimeError(f"Failed to save data for key {key}") from e

656-676: set_pointer should accept CID|str and normalize; current str-only contract is brittle.

-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
 
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)

779-810: Simplify num_shards math; no need for None-guard; add zip(strict=True) above.

Minor cleanup: _num_shards is always int; remove the is not None guard.

-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards

20-45: Note: sys.getsizeof underestimates composite memory; OK as heuristic.

If precise caps matter, consider tracking element counts and average sizes, or use a sampling approach.
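A quick stdlib check of the underestimate: `sys.getsizeof` on a container reports only the container object itself (header plus pointer array), not the payload it references.

```python
import sys

shard = [b"x" * 1024 for _ in range(100)]  # ~100 KB of payload
container_only = sys.getsizeof(shard)      # list header + pointer slots
with_elements = container_only + sum(sys.getsizeof(b) for b in shard)

assert container_only < 10_000     # the list object alone is ~1 KB
assert with_elements > 100_000     # the real footprint is two orders larger
```

So a cache cap computed from `sys.getsizeof` of shard lists alone would drastically undercount; summing element sizes (as the cache does) or sampling is needed for the heuristic to be meaningful.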


76-101: Eviction stalls when all shards are dirty; cap dirty set or allow forced evict.

As written, memory can exceed max_memory_bytes if everything is dirty. Consider a soft cap on dirty entries or a “force flush before exceed” hook.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7529c2f and 78cf72d.

📒 Files selected for processing (1)
  • py_hamt/sharded_zarr_store.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
  • ContentAddressedStore (13-60)
  • load (35-42)
  • load (78-115)
  • load (446-505)
  • save (28-32)
  • save (73-76)
  • save (405-444)
tests/test_sharded_zarr_store_coverage.py (12)
  • load (346-358)
  • load (387-388)
  • load (427-428)
  • load (452-463)
  • load (487-494)
  • load (583-584)
  • load (745-748)
  • load (797-799)
  • load (854-857)
  • save (496-501)
  • save (654-656)
  • save (690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py

194-194: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


200-200: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


231-233: Avoid specifying long messages outside the exception class

(TRY003)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Avoid specifying long messages outside the exception class

(TRY003)


274-276: Abstract raise to an inner function

(TRY301)


274-276: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Prefer TypeError exception for invalid type

(TRY004)


278-278: Abstract raise to an inner function

(TRY301)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


279-279: Do not catch blind exception: Exception

(BLE001)


280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


283-285: Avoid specifying long messages outside the exception class

(TRY003)


295-297: Avoid specifying long messages outside the exception class

(TRY003)


320-320: Avoid specifying long messages outside the exception class

(TRY003)


326-326: Consider moving this statement to an else block

(TRY300)


335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


335-337: Avoid specifying long messages outside the exception class

(TRY003)


375-377: Avoid specifying long messages outside the exception class

(TRY003)


424-426: Avoid specifying long messages outside the exception class

(TRY003)


432-432: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


432-432: Avoid specifying long messages outside the exception class

(TRY003)


435-435: Avoid specifying long messages outside the exception class

(TRY003)


448-448: Avoid specifying long messages outside the exception class

(TRY003)


470-470: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)


504-504: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF003)


523-523: Avoid specifying long messages outside the exception class

(TRY003)


573-575: Avoid specifying long messages outside the exception class

(TRY003)


604-606: Avoid specifying long messages outside the exception class

(TRY003)


619-619: Avoid specifying long messages outside the exception class

(TRY003)


632-632: Avoid specifying long messages outside the exception class

(TRY003)


652-652: Do not catch blind exception: Exception

(BLE001)


653-653: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


653-653: Avoid specifying long messages outside the exception class

(TRY003)


705-705: Avoid specifying long messages outside the exception class

(TRY003)


712-712: Avoid specifying long messages outside the exception class

(TRY003)


740-740: Avoid specifying long messages outside the exception class

(TRY003)


763-763: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


786-786: Avoid specifying long messages outside the exception class

(TRY003)


793-793: Avoid specifying long messages outside the exception class

(TRY003)


795-797: Avoid specifying long messages outside the exception class

(TRY003)


802-802: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


829-829: Avoid specifying long messages outside the exception class

(TRY003)


835-837: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)

839-846: Fix JSON decode bug: json.loads expects str, not bytes.

This path will raise TypeError. Decode the bytes first.

-        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
-        zarr_metadata_json = json.loads(old_zarr_metadata_bytes)
+        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
+        zarr_metadata_json = json.loads(old_zarr_metadata_bytes.decode("utf-8"))

Likely an incorrect or invalid review comment.

Comment on lines +122 to +125
async def __contains__(self, shard_idx: int) -> bool:
async with self._cache_lock:
return shard_idx in self._cache

🛠️ Refactor suggestion

⚠️ Potential issue

async contains is invalid Python; replace with an async method (deadlock/logic risk).

Magic methods cannot be async; using in cache would return a coroutine, not bool. Provide an explicit async API instead.

-    async def __contains__(self, shard_idx: int) -> bool:
-        async with self._cache_lock:
-            return shard_idx in self._cache
+    async def contains(self, shard_idx: int) -> bool:
+        async with self._cache_lock:
+            return shard_idx in self._cache
📝 Committable suggestion


Suggested change
async def __contains__(self, shard_idx: int) -> bool:
async with self._cache_lock:
return shard_idx in self._cache
async def contains(self, shard_idx: int) -> bool:
async with self._cache_lock:
return shard_idx in self._cache
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 122-125, remove the async magic
method definition (async def __contains__) because dunder methods cannot be
async; instead implement an explicit async API like async def
contains_shard(self, shard_idx: int) -> bool that acquires self._cache_lock and
returns shard_idx in self._cache, and replace or keep a synchronous
__contains__(self, shard_idx: int) -> bool that performs a non-blocking
membership check (return shard_idx in self._cache) if you still want Python's
"in" to work (noting it will be racy); update all call sites to use await
contains_shard(...) where a locked, async check is required.

Comment on lines +299 to +337
async def _fetch_and_cache_full_shard(
self,
shard_idx: int,
shard_cid: str,
max_retries: int = 3,
retry_delay: float = 1.0,
) -> None:
"""
Fetch a shard from CAS and cache it, with retry logic for transient errors.
Args:
shard_idx: The index of the shard to fetch.
shard_cid: The CID of the shard.
max_retries: Maximum number of retry attempts for transient errors.
retry_delay: Delay between retry attempts in seconds.
"""
for attempt in range(max_retries):
try:
shard_data_bytes = await self.cas.load(shard_cid)
decoded_shard = dag_cbor.decode(shard_data_bytes)
if not isinstance(decoded_shard, list):
raise TypeError(f"Shard {shard_idx} did not decode to a list.")
await self._shard_data_cache.put(shard_idx, decoded_shard)
# Always set the Event to unblock waiting coroutines
if shard_idx in self._pending_shard_loads:
self._pending_shard_loads[shard_idx].set()
del self._pending_shard_loads[shard_idx]
return # Success
except (ConnectionError, TimeoutError) as e:
# Handle transient errors (e.g., network issues)
if attempt < max_retries - 1:
await asyncio.sleep(
retry_delay * (2**attempt)
) # Exponential backoff
continue
else:
raise RuntimeError(
f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
)
🛠️ Refactor suggestion

⚠️ Potential issue

Ensure pending waiters are always released; accept IPLDKind and handle decode errors.

Without signaling on failure, other tasks can hang for 60s. Also, avoid forcing CIDs to str and catch CBOR decode errors explicitly.

-    async def _fetch_and_cache_full_shard(
+    async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID | bytes | str,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:
@@
-        for attempt in range(max_retries):
-            try:
-                shard_data_bytes = await self.cas.load(shard_cid)
-                decoded_shard = dag_cbor.decode(shard_data_bytes)
-                if not isinstance(decoded_shard, list):
-                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
-                await self._shard_data_cache.put(shard_idx, decoded_shard)
-                # Always set the Event to unblock waiting coroutines
-                if shard_idx in self._pending_shard_loads:
-                    self._pending_shard_loads[shard_idx].set()
-                    del self._pending_shard_loads[shard_idx]
-                return  # Success
-            except (ConnectionError, TimeoutError) as e:
+        for attempt in range(max_retries):
+            try:
+                shard_data_bytes = await self.cas.load(shard_cid)
+                decoded_shard = dag_cbor.decode(shard_data_bytes)
+                if not isinstance(decoded_shard, list):
+                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
+                await self._shard_data_cache.put(shard_idx, decoded_shard)
+                return  # Success
+            except dag_cbor.CBORDecodeError as e:
+                # Non-transient: don't retry CBOR decode failures
+                raise ValueError(f"Failed to decode shard {shard_idx}.") from e
+            except (ConnectionError, TimeoutError) as e:
                 # Handle transient errors (e.g., network issues)
                 if attempt < max_retries - 1:
                     await asyncio.sleep(
                         retry_delay * (2**attempt)
                     )  # Exponential backoff
                     continue
                 else:
-                    raise RuntimeError(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
+                    raise RuntimeError(
+                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts."
+                    ) from e
+            finally:
+                # Always release waiters if present
+                if shard_idx in self._pending_shard_loads:
+                    self._pending_shard_loads[shard_idx].set()
+                    self._pending_shard_loads.pop(shard_idx, None)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

320-320: Avoid specifying long messages outside the exception class (TRY003)
326-326: Consider moving this statement to an else block (TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
335-337: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 299-337, ensure pending waiters
are always released and decoding errors are handled explicitly: keep shard_cid
as-is (do not force to str), catch dag_cbor.DecodingError (or the library's
specific CBOR decode exception) in addition to TypeError, and treat valid
decoded shard types as list/IPLD list-kind rather than only python list; on any
failure (decode error, non-list kind, or exhausted retries) set and remove the
Event in self._pending_shard_loads[shard_idx] so waiting coroutines are
unblocked, and re-raise a clear RuntimeError after cleaning up. Also use
exponential backoff as already implemented for transient
ConnectionError/TimeoutError.
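The release-on-any-exit pattern the review asks for can be sketched in isolation (the function and `pending` dict here are hypothetical stand-ins for the store's `_pending_shard_loads` machinery):

```python
import asyncio


async def fetch_with_waiter_release(
    pending: dict[int, asyncio.Event],
    shard_idx: int,
    loader,
) -> list:
    """Always set and remove the pending Event, whether the load succeeds
    or raises, so coroutines waiting on it never hang."""
    try:
        data = await loader(shard_idx)  # may raise a transient or decode error
        if not isinstance(data, list):
            raise TypeError("shard did not decode to a list")
        return data
    finally:
        # Unblock waiters on every exit path, including exceptions.
        event = pending.pop(shard_idx, None)
        if event is not None:
            event.set()
```

Putting the signal in `finally` is what guarantees the waiters are released even when retries are exhausted or decoding fails.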

Comment on lines +593 to +615
chunk_cid_str = str(chunk_cid_obj)

req_offset = None
req_length = None
req_suffix = None

if byte_range:
if isinstance(byte_range, RangeByteRequest):
req_offset = byte_range.start
if byte_range.end is not None:
if byte_range.start > byte_range.end:
raise ValueError(
f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
)
req_length = byte_range.end - byte_range.start
elif isinstance(byte_range, OffsetByteRequest):
req_offset = byte_range.offset
elif isinstance(byte_range, SuffixByteRequest):
req_suffix = byte_range.suffix
data = await self.cas.load(
chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
)
return prototype.buffer.from_bytes(data)
⚠️ Potential issue

Zero-length Range must return empty buffer; also avoid str() on chunk CID.

Current logic computes length 0, yielding an invalid HTTP Range (end = start - 1). Return empty bytes instead and pass CID directly.

-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None
 
         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
-                    req_length = byte_range.end - byte_range.start
+                    if byte_range.end == byte_range.start:
+                        return prototype.buffer.from_bytes(b"")
+                    req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )
         return prototype.buffer.from_bytes(data)
🧰 Tools
🪛 Ruff (0.12.2)

604-606: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 593 to 615, the code currently
converts the chunk CID to string and calls cas.load even for zero-length ranges
which produces an invalid HTTP Range; instead, do not call str() on
chunk_cid_obj — pass the CID object directly to cas.load — and detect a
zero-length RangeByteRequest (end == start) returning
prototype.buffer.from_bytes(b"") immediately without calling cas.load; only set
and pass req_length when length > 0 and skip cas.load entirely for the
zero-length case.

@Faolain merged commit 743741d into main Sep 15, 2025
3 checks passed