feat: Sharding #68
Conversation
WalkthroughAdds sharded and flat Zarr store implementations, a HAMT→Sharded converter + CLI, partial-read support end-to-end (CAS→HAMT→Zarr), RPC pin management in KuboCAS, many async tests/benchmarks, small export/example/tooling updates, and new contributor docs ( Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client as Client
  participant Zarr as ZarrHAMTStore
  participant HAMT as HAMT
  participant CAS as ContentAddressedStore / KuboCAS
  participant Gateway as IPFS Gateway
  Client->>Zarr: get(key, prototype, byte_range)
  Zarr->>Zarr: _map_byte_request(byte_range) → (offset, length, suffix)
  Zarr->>HAMT: get(key, offset, length, suffix)
  HAMT->>CAS: load(cid, offset, length, suffix)
  alt partial read
    CAS->>Gateway: HTTP GET (Range header)
    Gateway-->>CAS: 206 Partial Content
  else full read
    CAS->>Gateway: HTTP GET
    Gateway-->>CAS: 200 OK
  end
  CAS-->>HAMT: bytes
  HAMT-->>Zarr: bytes/decoded
  Zarr-->>Client: buffer
```
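The `_map_byte_request` step in the diagram normalizes a Zarr byte request into the `(offset, length, suffix)` triple that flows down through the HAMT to the CAS. A minimal sketch of that mapping, using hypothetical stand-ins for Zarr's `ByteRequest` variants (the real store works with `zarr.abc.store` request objects, so the class names here are illustrative only):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-ins for Zarr's ByteRequest variants.
@dataclass
class RangeRequest:      # bytes [start, end)
    start: int
    end: int

@dataclass
class OffsetRequest:     # everything from `offset` onward
    offset: int

@dataclass
class SuffixRequest:     # the last `suffix` bytes
    suffix: int

def map_byte_request(req) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Normalize a byte request into (offset, length, suffix).

    (None, None, None) means a full read, matching the diagram's else branch.
    """
    if req is None:
        return (None, None, None)
    if isinstance(req, RangeRequest):
        return (req.start, req.end - req.start, None)
    if isinstance(req, OffsetRequest):
        return (req.offset, None, None)
    if isinstance(req, SuffixRequest):
        return (None, None, req.suffix)
    raise TypeError(f"Unsupported byte request: {req!r}")
```

Only one of the three modes is ever populated, which is what lets the CAS layer decide between a plain GET and a ranged GET.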
```mermaid
sequenceDiagram
  autonumber
  participant User as User / CLI
  participant Conv as convert_hamt_to_sharded
  participant CAS as KuboCAS
  participant HStore as HAMT (ro)
  participant SStore as ShardedZarrStore (rw)
  User->>Conv: start(hamt_root_cid, chunks_per_shard)
  Conv->>HStore: open(read_only, hamt_root_cid)
  Conv->>SStore: open(rw, array_shape, chunk_shape, chunks_per_shard)
  loop each key
    Conv->>HStore: list/get pointer
    Conv->>SStore: set_pointer(key, CID)
  end
  Conv->>SStore: flush()
  SStore-->>Conv: sharded_root_cid
  Conv-->>User: sharded_root_cid
```
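The conversion loop above is essentially a pointer copy: every chunk key in the read-only HAMT is re-registered in the sharded store, which groups pointers into fixed-size shards before `flush()` persists them. A toy, synchronous simulation of that grouping (plain dicts stand in for the HAMT and the sharded store; the helper name and shapes are illustrative, not the real async API):

```python
def convert_pointers(hamt: dict, chunks_per_shard: int) -> list:
    """Copy key -> CID pointers out of a HAMT-like mapping and group them
    into shards of at most `chunks_per_shard` (key, CID) pairs each.

    Returns the shard list a flush() would then persist.
    """
    pointers = sorted(hamt.items())  # deterministic order for reproducible shards
    return [
        pointers[i : i + chunks_per_shard]
        for i in range(0, len(pointers), chunks_per_shard)
    ]

# Toy HAMT contents: chunk keys mapped to (fake) CIDs.
hamt = {f"temp/c/{i}": f"bafy...{i}" for i in range(5)}
shards = convert_pointers(hamt, chunks_per_shard=2)
```

With five chunks and two chunks per shard this yields three shards, the last holding the remainder.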
```mermaid
sequenceDiagram
  autonumber
  participant Dev as Dev
  participant Kubo as KuboCAS
  participant RPC as IPFS RPC
  Dev->>Kubo: pin_cid(CID)
  Kubo->>RPC: /api/v0/pin/add?arg=CID
  RPC-->>Kubo: OK
  Dev->>Kubo: pin_ls()
  Kubo->>RPC: /api/v0/pin/ls
  RPC-->>Kubo: {Pins}
  Dev->>Kubo: pin_update(old,new)
  Kubo->>RPC: /api/v0/pin/update?arg=old&arg=new
  RPC-->>Kubo: OK
  Dev->>Kubo: unpin_cid(CID)
  Kubo->>RPC: /api/v0/pin/rm?arg=CID
  RPC-->>Kubo: OK
```
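All four pin operations hit Kubo's HTTP RPC under `/api/v0/pin/…`, with CIDs passed as repeated `arg` query parameters (`pin/update` takes two, as the diagram shows). A small sketch of the URL construction only — the real `KuboCAS` issues these as async HTTP requests, and the helper name here is illustrative:

```python
def pin_rpc_url(rpc_base: str, op: str, *cids: str) -> str:
    """Build a Kubo pin RPC URL; op is one of 'add', 'ls', 'rm', 'update'.

    Each CID becomes a repeated `arg` query parameter, matching the
    endpoints shown in the sequence diagram.
    """
    url = f"{rpc_base}/api/v0/pin/{op}"
    if cids:
        url += "?" + "&".join(f"arg={cid}" for cid in cids)
    return url

base = "http://127.0.0.1:5001"  # default Kubo RPC address
```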
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests

Tip: 👮 Agentic pre-merge checks are now available in preview. Pro plan users can enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment
          variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be
          documented in the "Breaking Change" section of the PR description and in
          CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported
          from package entry points or explicitly marked as internal).
```
Actionable comments posted: 12
🧹 Nitpick comments (9)
tests/test_cpc_compare.py (1)
1-129: Consider removing or properly integrating this benchmark file. This entire file is commented out. If these benchmarks are intended for future use, consider:
- Moving them to a separate benchmarks directory
- Adding proper documentation explaining when/how to use them
- If keeping them, address the security concern with the empty API key on line 28
py_hamt/hamt_to_sharded_converter.py (2)
34-34: Remove the unnecessary `consolidated=True` parameter. The `consolidated=True` parameter in `xr.open_zarr` is typically used when reading consolidated metadata. Since you're opening a HAMT store that doesn't use Zarr's consolidation feature, this parameter has no effect and should be removed for clarity. Apply this diff:

```diff
- source_dataset = xr.open_zarr(store=source_store, consolidated=True)
+ source_dataset = xr.open_zarr(store=source_store)
```
39-44: Simplify array and chunk shape extraction. The intermediate variables `array_shape_tuple` and `chunk_shape_tuple` are redundant. Apply this diff to reduce local variables:

```diff
  ordered_dims = list(source_dataset.dims)
- array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
- array_shape = array_shape_tuple
- chunk_shape = chunk_shape_tuple
+ array_shape = tuple(source_dataset.dims[dim] for dim in ordered_dims)
+ chunk_shape = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
```

tests/test_benchmark_stores.py (1)
1-313: Consider removing or properly integrating this benchmark file. Like `test_cpc_compare.py`, this entire file is commented out. If these benchmarks are needed:
- Uncomment and ensure they work with the current test infrastructure
- Move to a dedicated benchmarks directory if they're not part of regular CI
- Remove if they're no longer needed
tests/test_converter.py (2)
152-152: Remove unnecessary `consolidated=True` parameter. Similar to the converter module, the `consolidated=True` parameter has no effect with HAMT stores. Apply this diff:

```diff
- test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+ test_ds.to_zarr(store=source_hamt_store, mode="w")
```

201-201: Remove unnecessary `consolidated=True` parameter. Apply this diff:

```diff
- test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+ test_ds.to_zarr(store=source_hamt_store, mode="w")
```

py_hamt/store.py (1)
85-96: Simplify control flow by removing unnecessary else. The partial read logic is correct, but the control flow can be simplified. Apply this diff to improve readability:

```diff
 if offset is not None:
     start = offset
     if length is not None:
         end = start + length
         return data[start:end]
-    else:
-        return data[start:]
+    return data[start:]
 elif suffix is not None:
     # If only length is given, assume start from 0
     return data[-suffix:]
-else:  # Full load
-    return data
+# Full load
+return data
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
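The simplified control flow above implements three read modes over an in-memory blob. As a standalone, synchronous sketch of those semantics (hedged: the real method is async and lives on the in-memory CAS; the function name here is illustrative):

```python
def slice_load(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Apply (offset, length, suffix) partial-read semantics to full bytes."""
    if offset is not None:
        start = offset
        if length is not None:
            return data[start : start + length]  # bounded range
        return data[start:]                      # open-ended read from offset
    if suffix is not None:
        return data[-suffix:]                    # last `suffix` bytes
    return data                                  # full load
```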
py_hamt/sharded_zarr_store.py (2)
4-4: Remove unused import. The `cast` import from typing is not used.

```diff
-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
```
🪛 Ruff (0.11.9)
4-4: `typing.cast` imported but unused. Remove unused import: `typing.cast`
(F401)
190-191: Consider making excluded array prefixes configurable. The hardcoded exclusion list might limit flexibility for different use cases. Consider making this configurable:

```diff
-excluded_array_prefixes = {"time", "lat", "lon", "latitude", "longitude"}
+# Add as class attribute or constructor parameter
+excluded_array_prefixes = self._excluded_array_prefixes or {"time", "lat", "lon", "latitude", "longitude"}
```

This would allow users to customize which arrays are excluded from sharding.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- CLAUDE.md (1 hunks)
- py_hamt/__init__.py (1 hunks)
- py_hamt/flat_zarr_store.py (1 hunks)
- py_hamt/hamt.py (2 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- py_hamt/store.py (5 hunks)
- py_hamt/zarr_hamt_store.py (3 hunks)
- tests/test_benchmark_stores.py (1 hunks)
- tests/test_converter.py (1 hunks)
- tests/test_cpc_compare.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
- tests/test_zarr_ipfs.py (0 hunks)
- tests/test_zarr_ipfs_encrypted.py (4 hunks)
- tests/test_zarr_ipfs_partial.py (1 hunks)
- tests/testing_utils.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code Graph Analysis (5)
py_hamt/hamt.py (4)
- py_hamt/zarr_hamt_store.py (1): get (116-149)
- py_hamt/encryption_hamt_store.py (1): get (146-170)
- tests/test_zarr_ipfs_partial.py (1): load (275-285)
- py_hamt/store.py (3): load (33-40), load (58-95), load (286-315)

py_hamt/hamt_to_sharded_converter.py (4)
- py_hamt/hamt.py (3): HAMT (288-705), build (386-395), keys (673-687)
- py_hamt/store.py (1): KuboCAS (98-315)
- py_hamt/sharded_zarr_store.py (4): ShardedZarrStore (13-632), open (62-89), set_pointer (447-489), flush (323-360)
- py_hamt/zarr_hamt_store.py (1): ZarrHAMTStore (12-289)

py_hamt/__init__.py (3)
- py_hamt/store.py (3): ContentAddressedStore (11-40), InMemoryCAS (43-95), KuboCAS (98-315)
- py_hamt/sharded_zarr_store.py (1): ShardedZarrStore (13-632)
- py_hamt/hamt_to_sharded_converter.py (2): convert_hamt_to_sharded (12-83), sharded_converter_cli (86-123)

py_hamt/zarr_hamt_store.py (2)
- py_hamt/encryption_hamt_store.py (1): get (146-170)
- py_hamt/hamt.py (1): get (593-608)

py_hamt/store.py (3)
- tests/test_zarr_ipfs_partial.py (1): load (275-285)
- py_hamt/hamt.py (4): load (138-139), load (159-168), load (276-285), get (593-608)
- py_hamt/zarr_hamt_store.py (1): get (116-149)
🪛 Ruff (0.11.9)
py_hamt/hamt_to_sharded_converter.py
3-3: json imported but unused
Remove unused import: json
(F401)
5-5: typing.Dict imported but unused
Remove unused import
(F401)
5-5: typing.Any imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.Buffer imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.BufferPrototype imported but unused
Remove unused import
(F401)
tests/test_converter.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
6-6: aiohttp imported but unused
Remove unused import: aiohttp
(F401)
171-171: f-string without any placeholders
Remove extraneous f prefix
(F541)
241-241: Local variable kubo_cas is assigned to but never used
Remove assignment to unused variable kubo_cas
(F841)
py_hamt/sharded_zarr_store.py
4-4: typing.cast imported but unused
Remove unused import: typing.cast
(F401)
tests/test_sharded_zarr_store.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
525-525: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
602-602: Redefinition of unused test_sharded_zarr_store_init_invalid_shapes from line 490
(F811)
637-637: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
🪛 Pylint (3.3.7)
py_hamt/hamt_to_sharded_converter.py
[refactor] 12-12: Too many local variables (20/15)
(R0914)
tests/test_converter.py
[refactor] 46-46: Too many local variables (21/15)
(R0914)
[refactor] 135-135: Too many local variables (16/15)
(R0914)
[refactor] 184-184: Too many local variables (17/15)
(R0914)
py_hamt/store.py
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
tests/test_zarr_ipfs_partial.py
[refactor] 62-62: Too many local variables (34/15)
(R0914)
[refactor] 188-188: Redundant comparison - zhs_read == zhs_read
(R0124)
[refactor] 62-62: Too many statements (90/50)
(R0915)
[refactor] 229-229: Too few public methods (0/2)
(R0903)
[refactor] 238-238: Too many local variables (38/15)
(R0914)
[refactor] 238-238: Too many statements (100/50)
(R0915)
py_hamt/sharded_zarr_store.py
[refactor] 13-13: Too many instance attributes (14/7)
(R0902)
[refactor] 62-62: Too many arguments (8/5)
(R0913)
[refactor] 186-186: Too many return statements (8/6)
(R0911)
[refactor] 274-277: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 363-363: Too many local variables (17/15)
(R0914)
[refactor] 363-363: Too many return statements (7/6)
(R0911)
[refactor] 363-363: Too many branches (14/12)
(R0912)
[refactor] 552-557: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
tests/test_sharded_zarr_store.py
[refactor] 162-162: Too many local variables (16/15)
(R0914)
[refactor] 212-212: Too many local variables (17/15)
(R0914)
[refactor] 270-270: Too many local variables (18/15)
(R0914)
[refactor] 313-313: Too many local variables (18/15)
(R0914)
[error] 602-602: function already defined line 490
(E0102)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/hamt_to_sharded_converter.py
[error] 1-12: Ruff lint errors: I001 import block un-sorted; multiple F401 unused imports (json, typing.Dict, typing.Any, zarr.core.buffer.Buffer, BufferPrototype).
tests/test_converter.py
[error] 1-241: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, aiohttp); F541 f-string without placeholders; F841 assigned but unused local variable 'kubo_cas'.
py_hamt/__init__.py
[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.
py_hamt/sharded_zarr_store.py
[error] 1-1: Mypy type errors: multiple argument type incompatibilities and unsupported operand types involving 'None' and 'int' or 'tuple[int, ...] | None'.
[error] 93-678: Mypy type errors: Argument and operator type mismatches, index errors, and union attribute errors detected in multiple lines.
[error] 1-10: Ruff lint errors: I001 import block un-sorted; F401 unused import 'typing.cast'.
tests/test_sharded_zarr_store.py
[error] 550-687: Mypy errors: dict indexing on possibly None type and redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
🔇 Additional comments (14)
tests/testing_utils.py (1)
172-172: ❓ Verification inconclusive
Consider using a pinned version instead of "master-latest" for test stability.
Using `"ipfs/kubo:master-latest"` introduces non-deterministic behavior in tests as the underlying IPFS version can change between test runs. This can lead to:
- Flaky tests due to upstream breaking changes
- Difficulty reproducing test failures
- Inconsistent behavior across different environments
Consider using a specific version tag:

```diff
- image = "ipfs/kubo:master-latest"
+ image = "ipfs/kubo:v0.35.0"  # or latest stable version
```

If you need to use the latest version, please verify that this change doesn't introduce test instability by running the test suite multiple times:
🏁 Script executed:
```shell
#!/bin/bash
# Verify test stability with new IPFS version
for i in {1..5}; do
  echo "Test run $i"
  pytest tests/ --ipfs -v
  if [ $? -ne 0 ]; then
    echo "Test failed on run $i"
    exit 1
  fi
done
echo "All test runs passed"
```

Length of output: 251
Pin the IPFS Docker image for stable tests
Switching from a moving “master-latest” tag to a fixed release prevents non-deterministic failures in your integration suite.
• File: tests/testing_utils.py, line 172
• Current: `image = "ipfs/kubo:master-latest"`
• Suggested change:

```diff
- image = "ipfs/kubo:master-latest"
+ image = "ipfs/kubo:v0.35.0"  # or the latest stable release
```

Please run your full test suite in an environment where pytest is available to confirm there are no regressions or flakiness.
tests/test_zarr_ipfs_encrypted.py (3)
8-8: LGTM! Import change improves type annotations. The addition of the `IPLDKind` import supports better type annotations throughout the test file.

93-93: LGTM! Explicit type annotation improves code clarity. The explicit type annotation makes the variable's type clear and aligns with the import of `IPLDKind`.

109-109: LGTM! Type ignore comments are appropriately used. The `# type: ignore` comments suppress legitimate mypy warnings where the type checker cannot infer the correct types for Zarr store operations. Also applies to: 196-196
py_hamt/__init__.py (2)
3-3: LGTM! Exposing InMemoryCAS in the public API. Adding `InMemoryCAS` to the public exports makes sense for testing scenarios and provides users with an in-memory storage option alongside `KuboCAS`. Also applies to: 14-14

5-6: LGTM! New sharding functionality properly exposed. The addition of `ShardedZarrStore`, `convert_hamt_to_sharded`, and `sharded_converter_cli` to the public API enables users to leverage the new sharding capabilities introduced in this PR. Also applies to: 16-18
🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.
py_hamt/hamt.py (2)
11-11: LGTM! Import addition supports new optional parameters. The `Optional` import is needed for the new type annotations in the enhanced `get` method.

593-604: LGTM! Well-implemented partial read support. The enhancement to the `get` method properly implements partial read functionality:
- Backward compatibility: All new parameters are optional with `None` defaults
- Clear interface: Parameters follow standard byte range conventions (`offset`, `length`, `suffix`)
- Proper forwarding: Parameters are correctly passed to the underlying `cas.load` call
- Type safety: Method signature includes proper type annotations

This aligns well with the corresponding changes in `py_hamt/store.py` where the `ContentAddressedStore.load` method was enhanced with the same parameters.

CLAUDE.md (1)
1-88: Excellent documentation for AI-assisted development. This documentation file provides comprehensive guidance for working with the repository:
- Clear development workflow: Covers setup, testing, linting, and type checking
- Accurate architecture description: Well-documented overview of CAS, HAMT, and Zarr integration layers
- Helpful testing guidance: Explains IPFS integration requirements and test execution options
- Design pattern documentation: Captures key architectural decisions like async operations and content addressing
This will be valuable for AI-assisted development and new contributors understanding the codebase.
py_hamt/zarr_hamt_store.py (1)
151-163: Well-implemented concurrent partial value retrieval. The implementation correctly uses `asyncio.gather` for efficient concurrent fetching of multiple byte ranges. Good addition for supporting partial reads.
33-39: Good API extension for partial data retrieval. The addition of optional parameters for partial loading is well-designed and maintains backward compatibility with default values.

298-311: Well-implemented HTTP Range header support. The Range header construction correctly follows the HTTP standard with proper handling of all three cases: range with start and end, open-ended range, and suffix range.
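The three cases map onto `Range` header values roughly as follows. This is an illustrative sketch, not the exact `KuboCAS` code; note that HTTP byte ranges are end-inclusive per RFC 9110, hence the `- 1`:

```python
from typing import Optional

def range_header(offset: Optional[int] = None,
                 length: Optional[int] = None,
                 suffix: Optional[int] = None) -> Optional[str]:
    """Return a Range header value, or None when a full (200 OK) read is wanted."""
    if offset is not None and length is not None:
        return f"bytes={offset}-{offset + length - 1}"  # closed range, end-inclusive
    if offset is not None:
        return f"bytes={offset}-"                       # open-ended range
    if suffix is not None:
        return f"bytes=-{suffix}"                       # suffix range: last N bytes
    return None                                         # full read
```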
tests/test_zarr_ipfs_partial.py (1)
431-461: Excellent test coverage for partial reads. The test comprehensively covers all partial read scenarios including edge cases and error conditions. Well structured!
py_hamt/sharded_zarr_store.py (1)
323-361: Well-structured flush implementation. The flush method properly handles saving dirty shards first, then the root object, with appropriate error handling and state management.
tests/test_sharded_zarr_store.py

```python
import asyncio
import math

import numpy as np
import pandas as pd
import pytest
import xarray as xr
from zarr.abc.store import RangeByteRequest
import zarr.core.buffer
import dag_cbor

from py_hamt import HAMT, KuboCAS, ShardedZarrStore
from py_hamt.zarr_hamt_store import ZarrHAMTStore


@pytest.fixture(scope="module")
def random_zarr_dataset():
    """Creates a random xarray Dataset for benchmarking."""
    # Using a slightly larger dataset for a more meaningful benchmark
    times = pd.date_range("2024-01-01", periods=100)
    lats = np.linspace(-90, 90, 18)
    lons = np.linspace(-180, 180, 36)

    temp = np.random.randn(len(times), len(lats), len(lons))
    precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))

    ds = xr.Dataset(
        {
            "temp": (["time", "lat", "lon"], temp),
        },
        coords={"time": times, "lat": lats, "lon": lons},
    )

    # Define chunking for the store
    ds = ds.chunk({"time": 20, "lat": 18, "lon": 36})
    yield ds
```
🛠️ Refactor suggestion
Remove unused imports and variables
Several imports and variables are not used in the test file.
Apply these changes to clean up the code:
```diff
-import asyncio
-import math
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore
```

Also remove the unused precip variable:
```diff
 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))

 ds = xr.Dataset(
     {
         "temp": (["time", "lat", "lon"], temp),
     },
     coords={"time": times, "lat": lats, "lon": lons},
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import numpy as np
import pandas as pd
import pytest
import xarray as xr
from zarr.abc.store import RangeByteRequest
import zarr.core.buffer
import dag_cbor

from py_hamt import KuboCAS, ShardedZarrStore


@pytest.fixture(scope="module")
def random_zarr_dataset():
    """Creates a random xarray Dataset for benchmarking."""
    # Using a slightly larger dataset for a more meaningful benchmark
    times = pd.date_range("2024-01-01", periods=100)
    lats = np.linspace(-90, 90, 18)
    lons = np.linspace(-180, 180, 36)

    temp = np.random.randn(len(times), len(lats), len(lons))

    ds = xr.Dataset(
        {
            "temp": (["time", "lat", "lon"], temp),
        },
        coords={"time": times, "lat": lats, "lon": lons},
    )

    # Define chunking for the store
    ds = ds.chunk({"time": 20, "lat": 18, "lon": 36})
    yield ds
```
🧰 Tools
🪛 Ruff (0.11.9)
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
🤖 Prompt for AI Agents
In tests/test_sharded_zarr_store.py lines 1 to 37, remove the unused imports
math, RangeByteRequest, zarr.core.buffer, dag_cbor, HAMT, KuboCAS,
ShardedZarrStore, and ZarrHAMTStore. Also delete the unused variable precip in
the random_zarr_dataset fixture to clean up the code and avoid unnecessary
clutter.
py_hamt/flat_zarr_store.py
Outdated
| # Functional Flat Store for Zarr using IPFS. This is ultimately the "Best" way to do it | ||
| # but only if the metadata is small enough to fit in a single CBOR object. | ||
| # This is commented out because we are not using it. However I did not want to delete it | ||
| # because it is a good example of how to implement a Zarr store using a flat index | ||
|
|
||
| # import asyncio | ||
| # import math | ||
| # from collections.abc import AsyncIterator, Iterable | ||
| # from typing import Optional, cast | ||
|
|
||
| # import dag_cbor | ||
| # import zarr.abc.store | ||
| # import zarr.core.buffer | ||
| # from zarr.core.common import BytesLike | ||
|
|
||
| # from .store import ContentAddressedStore | ||
|
|
||
|
|
||
| # class FlatZarrStore(zarr.abc.store.Store): | ||
| # """ | ||
| # Implements the Zarr Store API using a flat, predictable layout for chunk CIDs. | ||
|
|
||
| # This store bypasses the need for a HAMT, offering direct, calculated | ||
| # access to chunk data based on a mathematical formula. It is designed for | ||
| # dense, multi-dimensional arrays where chunk locations are predictable. | ||
|
|
||
| # The store is structured around a single root CBOR object. This root object contains: | ||
| # 1. A dictionary mapping metadata keys (like 'zarr.json') to their CIDs. | ||
| # 2. A single CID pointing to a large, contiguous block of bytes (the "flat index"). | ||
| # This flat index is a concatenation of the CIDs of all data chunks. | ||
|
|
||
| # Accessing a chunk involves: | ||
| # 1. Loading the root object (if not cached). | ||
| # 2. Calculating the byte offset of the chunk's CID within the flat index. | ||
| # 3. Fetching that specific CID using a byte-range request on the flat index. | ||
| # 4. Fetching the actual chunk data using the retrieved CID. | ||
|
|
||
| # ### Sample Code | ||
| # ```python | ||
| # import xarray as xr | ||
| # import numpy as np | ||
| # from py_hamt import KuboCAS, FlatZarrStore | ||
|
|
||
| # # --- Write --- | ||
| # ds = xr.Dataset( | ||
| # {"data": (("t", "y", "x"), np.arange(24).reshape(2, 3, 4))}, | ||
| # ) | ||
| # cas = KuboCAS() | ||
| # # When creating, must provide array shape and chunk shape | ||
| # store_write = await FlatZarrStore.open( | ||
| # cas, | ||
| # read_only=False, | ||
| # array_shape=ds.data.shape, | ||
| # chunk_shape=ds.data.encoding['chunks'] | ||
| # ) | ||
| # ds.to_zarr(store=store_write, mode="w") | ||
| # root_cid = await store_write.flush() # IMPORTANT: flush to get final root CID | ||
| # print(f"Finished writing. Root CID: {root_cid}") | ||
|
|
||
|
|
||
| # # --- Read --- | ||
| # store_read = await FlatZarrStore.open(cas, read_only=True, root_cid=root_cid) | ||
| # ds_read = xr.open_zarr(store=store_read) | ||
| # print("Read back dataset:") | ||
| # print(ds_read) | ||
| # xr.testing.assert_identical(ds, ds_read) | ||
| # ``` | ||
| # """ | ||
|
|
||
| # def __init__( | ||
| # self, cas: ContentAddressedStore, read_only: bool, root_cid: Optional[str] | ||
| # ): | ||
| # """Use the async `open()` classmethod to instantiate this class.""" | ||
| # super().__init__(read_only=read_only) | ||
| # self.cas = cas | ||
| # self._root_cid = root_cid | ||
| # self._root_obj: Optional[dict] = None | ||
| # self._flat_index_cache: Optional[bytearray] = None | ||
| # self._cid_len: Optional[int] = None | ||
| # self._array_shape: Optional[tuple[int, ...]] = None | ||
| # self._chunk_shape: Optional[tuple[int, ...]] = None | ||
| # self._chunks_per_dim: Optional[tuple[int, ...]] = None | ||
| # self._dirty = False | ||
|
|
||
| # @classmethod | ||
| # async def open( | ||
| # cls, | ||
| # cas: ContentAddressedStore, | ||
| # read_only: bool, | ||
| # root_cid: Optional[str] = None, | ||
| # *, | ||
| # array_shape: Optional[tuple[int, ...]] = None, | ||
| # chunk_shape: Optional[tuple[int, ...]] = None, | ||
| # cid_len: int = 59, # Default for base32 v1 CIDs like bafy... | ||
| # ) -> "FlatZarrStore": | ||
| # """ | ||
| # Asynchronously opens an existing FlatZarrStore or initializes a new one. | ||
|
|
||
| # Args: | ||
| # cas: The Content Addressed Store (e.g., KuboCAS). | ||
| # read_only: If True, the store is in read-only mode. | ||
| # root_cid: The root CID of an existing store to open. Required for read_only. | ||
| # array_shape: The full shape of the Zarr array. Required for a new writeable store. | ||
| # chunk_shape: The shape of a single chunk. Required for a new writeable store. | ||
| # cid_len: The expected byte length of a CID string. | ||
| # """ | ||
| # store = cls(cas, read_only, root_cid) | ||
| # if root_cid: | ||
| # await store._load_root_from_cid() | ||
| # elif not read_only: | ||
| # if not all([array_shape, chunk_shape]): | ||
| # raise ValueError( | ||
| # "array_shape and chunk_shape must be provided for a new store." | ||
| # ) | ||
| # store._initialize_new_root(array_shape, chunk_shape, cid_len) | ||
| # else: | ||
| # raise ValueError("root_cid must be provided for a read-only store.") | ||
| # return store | ||
|
|
||
| # def _initialize_new_root( | ||
| # self, | ||
| # array_shape: tuple[int, ...], | ||
| # chunk_shape: tuple[int, ...], | ||
| # cid_len: int, | ||
| # ): | ||
| # self._array_shape = array_shape | ||
| # self._chunk_shape = chunk_shape | ||
| # self._cid_len = cid_len | ||
| # self._chunks_per_dim = tuple( | ||
| # math.ceil(a / c) for a, c in zip(array_shape, chunk_shape) | ||
| # ) | ||
| # self._root_obj = { | ||
| # "manifest_version": "flat_zarr_v1", | ||
| # "metadata": {}, | ||
| # "chunks": { | ||
| # "cid": None, # Will be filled on first flush | ||
| # "array_shape": list(self._array_shape), | ||
| # "chunk_shape": list(self._chunk_shape), | ||
| # "cid_byte_length": self._cid_len, | ||
| # }, | ||
| # } | ||
| # self._dirty = True | ||
|
|
||
| # async def _load_root_from_cid(self): | ||
| # if not self._root_cid: | ||
| # raise ValueError("Cannot load root without a root_cid.") | ||
| # root_bytes = await self.cas.load(self._root_cid) | ||
| # self._root_obj = dag_cbor.decode(root_bytes) | ||
| # chunk_info = self._root_obj.get("chunks", {}) | ||
| # self._array_shape = tuple(chunk_info["array_shape"]) | ||
| # self._chunk_shape = tuple(chunk_info["chunk_shape"]) | ||
| # self._cid_len = chunk_info["cid_byte_length"] | ||
| # self._chunks_per_dim = tuple( | ||
| # math.ceil(a / c) for a, c in zip(self._array_shape, self._chunk_shape) | ||
| # ) | ||
|
|
||
| # def _parse_chunk_key(self, key: str) -> Optional[tuple[int, ...]]: | ||
| # if not self._array_shape or not key.startswith("c/"): | ||
| # return None | ||
| # parts = key.split("/") | ||
| # if len(parts) != len(self._array_shape) + 1: | ||
| # return None | ||
| # try: | ||
| # return tuple(map(int, parts[1:])) | ||
| # except (ValueError, IndexError): | ||
| # return None | ||
|
|
||
| # async def set_partial_values( | ||
| # self, key_start_values: Iterable[tuple[str, int, BytesLike]] | ||
| # ) -> None: | ||
| # """@private""" | ||
| # raise NotImplementedError("Partial writes are not supported by this store.") | ||
|
|
||
| # async def get_partial_values( | ||
| # self, | ||
| # prototype: zarr.core.buffer.BufferPrototype, | ||
| # key_ranges: Iterable[tuple[str, zarr.abc.store.ByteRequest | None]], | ||
| # ) -> list[zarr.core.buffer.Buffer | None]: | ||
| # """ | ||
| # Retrieves multiple keys or byte ranges concurrently. | ||
| # """ | ||
| # tasks = [self.get(key, prototype, byte_range) for key, byte_range in key_ranges] | ||
| # results = await asyncio.gather(*tasks) | ||
| # return results | ||
|
|
||
| # def __eq__(self, other: object) -> bool: | ||
| # """@private""" | ||
| # if not isinstance(other, FlatZarrStore): | ||
| # return NotImplemented | ||
| # return self._root_cid == other._root_cid | ||
|
|
||
| # def _get_chunk_offset(self, chunk_coords: tuple[int, ...]) -> int: | ||
| # linear_index = 0 | ||
| # multiplier = 1 | ||
| # for i in reversed(range(len(self._chunks_per_dim))): | ||
| # linear_index += chunk_coords[i] * multiplier | ||
| # multiplier *= self._chunks_per_dim[i] | ||
| # return linear_index * self._cid_len | ||
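The offset computation above is a plain row-major (C-order) linearization of the chunk grid. A self-contained sketch, using an illustrative 59-byte CID slot width:

```python
import math

def chunk_offset(coords: tuple[int, ...],
                 chunks_per_dim: tuple[int, ...],
                 cid_len: int) -> int:
    """Byte offset of a chunk's CID slot in the flat index (row-major order)."""
    linear_index = 0
    multiplier = 1
    for i in reversed(range(len(chunks_per_dim))):
        linear_index += coords[i] * multiplier
        multiplier *= chunks_per_dim[i]
    return linear_index * cid_len

# A 4x3 grid of chunks with 59-byte CID slots (59 is illustrative).
grid = (4, 3)
assert chunk_offset((0, 0), grid, 59) == 0
assert chunk_offset((0, 2), grid, 59) == 2 * 59   # last chunk of first row
assert chunk_offset((1, 0), grid, 59) == 3 * 59   # one full row of 3 chunks
assert chunk_offset((3, 2), grid, 59) == 11 * 59  # final slot
assert math.prod(grid) * 59 == 12 * 59            # total flat-index size
```

Because every slot has a fixed width, a single `(offset, length)` range read against the flat index retrieves exactly one chunk's CID.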
|
|
||
| # async def flush(self) -> str: | ||
| # """ | ||
| # Writes all pending changes (metadata and chunk index) to the CAS | ||
| # and returns the new root CID. This MUST be called after all writes are complete. | ||
| # """ | ||
| # if self.read_only or not self._dirty: | ||
| # return self._root_cid | ||
|
|
||
| # if self._flat_index_cache is not None: | ||
| # flat_index_cid_obj = await self.cas.save( | ||
| # bytes(self._flat_index_cache), codec="raw" | ||
| # ) | ||
| # self._root_obj["chunks"]["cid"] = str(flat_index_cid_obj) | ||
|
|
||
| # root_obj_bytes = dag_cbor.encode(self._root_obj) | ||
| # new_root_cid_obj = await self.cas.save(root_obj_bytes, codec="dag-cbor") | ||
| # self._root_cid = str(new_root_cid_obj) | ||
| # self._dirty = False | ||
| # return self._root_cid | ||
|
|
||
| # async def get( | ||
| # self, | ||
| # key: str, | ||
| # prototype: zarr.core.buffer.BufferPrototype, | ||
| # byte_range: zarr.abc.store.ByteRequest | None = None, | ||
| # ) -> zarr.core.buffer.Buffer | None: | ||
| # """@private""" | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
|
|
||
| # chunk_coords = self._parse_chunk_key(key) | ||
| # try: | ||
| # # Metadata request | ||
| # if chunk_coords is None: | ||
| # metadata_cid = self._root_obj["metadata"].get(key) | ||
| # if metadata_cid is None: | ||
| # return None | ||
| # data = await self.cas.load(metadata_cid) | ||
| # return prototype.buffer.from_bytes(data) | ||
|
|
||
| # # Chunk data request | ||
| # flat_index_cid = self._root_obj["chunks"]["cid"] | ||
| # if flat_index_cid is None: | ||
| # return None | ||
|
|
||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # chunk_cid_bytes = await self.cas.load( | ||
| # flat_index_cid, offset=offset, length=self._cid_len | ||
| # ) | ||
|
|
||
| # if all(b == 0 for b in chunk_cid_bytes): | ||
| # return None # Chunk doesn't exist | ||
|
|
||
| # chunk_cid = chunk_cid_bytes.decode("ascii") | ||
| # data = await self.cas.load(chunk_cid) | ||
| # return prototype.buffer.from_bytes(data) | ||
|
|
||
| # except (KeyError, IndexError): | ||
| # return None | ||
|
|
||
| # async def set(self, key: str, value: zarr.core.buffer.Buffer) -> None: | ||
| # """@private""" | ||
| # if self.read_only: | ||
| # raise ValueError("Cannot write to a read-only store.") | ||
| # if self._root_obj is None: | ||
| # raise RuntimeError("Store not initialized for writing.") | ||
|
|
||
| # self._dirty = True | ||
| # raw_bytes = value.to_bytes() | ||
| # value_cid_obj = await self.cas.save(raw_bytes, codec="raw") | ||
| # value_cid = str(value_cid_obj) | ||
|
|
||
| # if len(value_cid) != self._cid_len: | ||
| # raise ValueError( | ||
| # f"Inconsistent CID length. Expected {self._cid_len}, got {len(value_cid)}" | ||
| # ) | ||
|
|
||
| # chunk_coords = self._parse_chunk_key(key) | ||
|
|
||
| # if chunk_coords is None: # Metadata | ||
| # self._root_obj["metadata"][key] = value_cid | ||
| # return | ||
|
|
||
| # # Chunk Data | ||
| # if self._flat_index_cache is None: | ||
| # num_chunks = math.prod(self._chunks_per_dim) | ||
| # self._flat_index_cache = bytearray(num_chunks * self._cid_len) | ||
|
|
||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # self._flat_index_cache[offset : offset + self._cid_len] = value_cid.encode( | ||
| # "ascii" | ||
| # ) | ||
|
|
||
| # # --- Other required zarr.abc.store methods --- | ||
|
|
||
| # async def exists(self, key: str) -> bool: | ||
| # """@private""" | ||
| # # A more efficient version might check for null bytes in the flat index | ||
| # # but this is functionally correct. | ||
|
|
||
| # # TODO: Optimize this check | ||
| # return True | ||
|
|
||
|
|
||
| # # return (await self.get(key, zarr.core.buffer.Buffer.prototype, None)) is not None | ||
|
|
||
| # @property | ||
| # def supports_writes(self) -> bool: | ||
| # """@private""" | ||
| # return not self.read_only | ||
|
|
||
| # @property | ||
| # def supports_partial_writes(self) -> bool: | ||
| # """@private""" | ||
| # return False # Each chunk is an immutable object | ||
|
|
||
| # @property | ||
| # def supports_deletes(self) -> bool: | ||
| # """@private""" | ||
| # return not self.read_only | ||
|
|
||
| # async def delete(self, key: str) -> None: | ||
| # if self.read_only: | ||
| # raise ValueError("Cannot delete from a read-only store.") | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
| # chunk_coords = self._parse_chunk_key(key) | ||
| # if chunk_coords is None: | ||
| # if key in self._root_obj["metadata"]: | ||
| # del self._root_obj["metadata"][key] | ||
| # self._dirty = True | ||
| # return | ||
| # else: | ||
| # raise KeyError(f"Metadata key '{key}' not found.") | ||
| # flat_index_cid = self._root_obj["chunks"]["cid"] | ||
| # if self._flat_index_cache is None: | ||
| # if not flat_index_cid: | ||
| # raise KeyError(f"Chunk key '{key}' not found in non-existent index.") | ||
| # self._flat_index_cache = bytearray(await self.cas.load(flat_index_cid)) | ||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # if all(b == 0 for b in self._flat_index_cache[offset : offset + self._cid_len]): | ||
| # raise KeyError(f"Chunk key '{key}' not found.") | ||
| # self._flat_index_cache[offset : offset + self._cid_len] = bytearray(self._cid_len) | ||
| # self._dirty = True | ||
|
|
||
| # @property | ||
| # def supports_listing(self) -> bool: | ||
| # """@private""" | ||
| # return True | ||
|
|
||
| # async def list(self) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
| # for key in self._root_obj["metadata"]: | ||
| # yield key | ||
| # # Note: Listing all chunk keys without reading the index is non-trivial. | ||
| # # A full implementation might need an efficient way to iterate non-null chunks. | ||
| # # This basic version only lists metadata. | ||
|
|
||
| # async def list_prefix(self, prefix: str) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # async for key in self.list(): | ||
| # if key.startswith(prefix): | ||
| # yield key | ||
|
|
||
| # async def list_dir(self, prefix: str) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # # This simplified version only works for the root. | ||
| # if prefix == "": | ||
| # seen = set() | ||
| # async for key in self.list(): | ||
| # name = key.split('/')[0] | ||
| # if name not in seen: | ||
| # seen.add(name) | ||
| # yield name No newline at end of file |
🛠️ Refactor suggestion
Consider removing commented-out code
The entire file consists of commented-out code. While the header explains it's kept as a reference implementation, it's better to remove it and rely on version control history.
Consider one of these approaches:
- Remove the file entirely and reference it in documentation or a README if needed
- Move it to an `examples/` or `reference/` directory if it has educational value
- If parts are useful, extract them into a documentation file
Keeping large blocks of commented code in the main codebase can cause confusion and maintenance overhead.
🤖 Prompt for AI Agents
In py_hamt/flat_zarr_store.py lines 1 to 375, the entire file is commented-out
code serving as a reference implementation. To reduce confusion and maintenance
overhead, remove this commented-out code from the main codebase. Instead, either
delete the file and keep it accessible via version control history, move it to
an examples or reference directory if it has educational value, or extract
useful parts into a separate documentation file.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
py_hamt/store.py (1)
58-96: 🛠️ Refactor suggestion
length-only reads and value checking are missing; logic disagrees with comment
- Calling `load(id, length=8)` currently falls through to the full load path instead of returning the first 8 bytes.
- The comment “If only length is given, assume start from 0” is therefore incorrect.
- Negative values or a simultaneous `offset` + `suffix` are silently accepted, which can produce surprising slices.
- There is an unnecessary `elif`/`else` chain after early `return`s (pylint R1705).
Tighten the control flow and handle all combinations explicitly:
@@ - if offset is not None: - start = offset - if length is not None: - end = start + length - return data[start:end] - else: - return data[start:] - elif suffix is not None: # If only length is given, assume start from 0 - return data[-suffix:] - else: # Full load - return data + if (offset is not None) and (suffix is not None): + raise ValueError("Specify either offset/length or suffix, not both") + + if offset is not None: + if offset < 0: + raise ValueError("offset must be ≥ 0") + if length is None: + return data[offset:] + if length < 0: + raise ValueError("length must be ≥ 0") + return data[offset : offset + length] + + if suffix is not None: + if suffix < 0: + raise ValueError("suffix must be ≥ 0") + return data[-suffix:] + + if length is not None: # implicit offset == 0 + if length < 0: + raise ValueError("length must be ≥ 0") + return data[:length] + + return data # full blob🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
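For clarity, the semantics this comment asks for can be captured as a small pure function. This is a sketch of the *proposed* behaviour, not the library's current code; the name `slice_bytes` is illustrative:

```python
def slice_bytes(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Partial-read semantics: offset/length window, last-N suffix, or full blob."""
    if offset is not None and suffix is not None:
        raise ValueError("Specify either offset/length or suffix, not both")
    for name, value in (("offset", offset), ("length", length), ("suffix", suffix)):
        if value is not None and value < 0:
            raise ValueError(f"{name} must be >= 0")
    if offset is not None:
        return data[offset:] if length is None else data[offset : offset + length]
    if suffix is not None:
        return data[-suffix:]
    if length is not None:  # implicit offset == 0, the previously-missing case
        return data[:length]
    return data

blob = bytes(range(10))
assert slice_bytes(blob, offset=2, length=3) == blob[2:5]
assert slice_bytes(blob, length=8) == blob[:8]
assert slice_bytes(blob, suffix=4) == blob[-4:]
assert slice_bytes(blob) == blob
```

Flattening the branches this way also resolves the R1705 warning, since no `elif`/`else` follows an early `return`.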
🧹 Nitpick comments (2)
py_hamt/store.py (2)
33-40: Docstring/method contract not updated for new slice parameters
`ContentAddressedStore.load()` now accepts `offset`, `length`, and `suffix`, but the docstring (“Retrieve data.”) was left untouched. Add parameter descriptions and clarify precedence/valid-value ranges (e.g. that negative numbers are rejected and that `offset` & `suffix` are mutually exclusive).
85-95: Minor: remove superfluous `elif`/`else` branches after early `return`s
Once a branch returns, the following `elif`/`else` is redundant; flattening the logic (see previous diff) resolves the pylint R1705 warning.
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
py_hamt/store.py (5 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
py_hamt/store.py
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/store.py
Outdated
| async def load( | ||
| self, | ||
| id: IPLDKind, | ||
| offset: Optional[int] = None, | ||
| length: Optional[int] = None, | ||
| suffix: Optional[int] = None, | ||
| ) -> bytes: | ||
| """@private""" | ||
| cid = cast(CID, id) # CID is definitely in the IPLDKind type | ||
| cid = cast(CID, id) | ||
| url: str = self.gateway_base_url + str(cid) | ||
| headers: dict[str, str] = {} | ||
|
|
||
| # Construct the Range header if required | ||
| if offset is not None: | ||
| start = offset | ||
| if length is not None: | ||
| # Standard HTTP Range: bytes=start-end (inclusive) | ||
| end = start + length - 1 | ||
| headers["Range"] = f"bytes={start}-{end}" | ||
| else: | ||
| # Standard HTTP Range: bytes=start- (from start to end) | ||
| headers["Range"] = f"bytes={start}-" | ||
| elif suffix is not None: | ||
| # Standard HTTP Range: bytes=-N (last N bytes) | ||
| headers["Range"] = f"bytes=-{suffix}" | ||
|
|
||
| async with self._sem: # throttle gateway |
Partial-range header builder omits the length-only case and lacks validation
KuboCAS.load() mirrors the in-memory logic, so the same edge cases apply:
- `length` without `offset` results in a full GET (no `Range` header). Expected header: `Range: bytes=0-{length-1}`.
- Missing checks for negative values and for mutually-exclusive `offset` + `suffix`.
Proposed incremental fix:
@@
- if offset is not None:
+ if (offset is not None) and (suffix is not None):
+ raise ValueError("Specify either offset/length or suffix, not both")
+
+ if offset is not None:
start = offset
if length is not None:
@@
headers["Range"] = f"bytes={start}-"
- elif suffix is not None:
+ elif suffix is not None:
# Standard HTTP Range: bytes=-N (last N bytes)
headers["Range"] = f"bytes=-{suffix}"
+ elif length is not None:
+ # Range from byte 0 to length-1
+ headers["Range"] = f"bytes=0-{length-1}"Consider also asserting non-negative integers for all three parameters.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def load( | |
| self, | |
| id: IPLDKind, | |
| offset: Optional[int] = None, | |
| length: Optional[int] = None, | |
| suffix: Optional[int] = None, | |
| ) -> bytes: | |
| """@private""" | |
| cid = cast(CID, id) # CID is definitely in the IPLDKind type | |
| cid = cast(CID, id) | |
| url: str = self.gateway_base_url + str(cid) | |
| headers: dict[str, str] = {} | |
| # Construct the Range header if required | |
| if offset is not None: | |
| start = offset | |
| if length is not None: | |
| # Standard HTTP Range: bytes=start-end (inclusive) | |
| end = start + length - 1 | |
| headers["Range"] = f"bytes={start}-{end}" | |
| else: | |
| # Standard HTTP Range: bytes=start- (from start to end) | |
| headers["Range"] = f"bytes={start}-" | |
| elif suffix is not None: | |
| # Standard HTTP Range: bytes=-N (last N bytes) | |
| headers["Range"] = f"bytes=-{suffix}" | |
| async with self._sem: # throttle gateway | |
| async def load( | |
| self, | |
| id: IPLDKind, | |
| offset: Optional[int] = None, | |
| length: Optional[int] = None, | |
| suffix: Optional[int] = None, | |
| ) -> bytes: | |
| """@private""" | |
| cid = cast(CID, id) | |
| url: str = self.gateway_base_url + str(cid) | |
| headers: dict[str, str] = {} | |
| # Construct the Range header if required | |
| if (offset is not None) and (suffix is not None): | |
| raise ValueError("Specify either offset/length or suffix, not both") | |
| if offset is not None: | |
| start = offset | |
| if length is not None: | |
| # Standard HTTP Range: bytes=start-end (inclusive) | |
| end = start + length - 1 | |
| headers["Range"] = f"bytes={start}-{end}" | |
| else: | |
| # Standard HTTP Range: bytes=start- (from start to end) | |
| headers["Range"] = f"bytes={start}-" | |
| elif suffix is not None: | |
| # Standard HTTP Range: bytes=-N (last N bytes) | |
| headers["Range"] = f"bytes=-{suffix}" | |
| elif length is not None: | |
| # Range from byte 0 to length-1 | |
| headers["Range"] = f"bytes=0-{length-1}" | |
| async with self._sem: # throttle gateway |
🤖 Prompt for AI Agents
In py_hamt/store.py around lines 307 to 333, the load method's Range header
construction misses handling the case where length is provided without offset,
which should set the Range header to bytes=0-(length-1). Additionally, add
validation to ensure offset, length, and suffix are non-negative integers and
enforce that offset and suffix are not both provided simultaneously. Update the
logic to include these checks and construct the Range header accordingly to
cover all edge cases.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/test_cpc_compare.py (1)
1-128: 🛠️ Refactor suggestion
Dead-code test module — delete or convert to a real test
Every line in this file is commented out, so the test runner will never execute it. Carrying around disabled test code adds noise and sends mixed signals about intended coverage.
• If you still want the benchmark, move it to an active `pytest` test (mark it with `@pytest.mark.benchmark` or similar and gate it behind an opt-in flag).
• Otherwise, delete the file to avoid confusion.
♻️ Duplicate comments (8)
py_hamt/hamt_to_sharded_converter.py (1)
3-10: Remove the unused imports flagged by Ruff (duplicate reminder)
`json`, `Dict`, `Any`, `Buffer`, and `BufferPrototype` are not referenced anywhere in this module. Keeping them causes F401 warnings and breaks the linting gate.
-import json
-from typing import Dict, Any
-from zarr.core.buffer import Buffer, BufferPrototype
🪛 Ruff (0.11.9)
3-3:
jsonimported but unusedRemove unused import:
json(F401)
5-5:
typing.Dictimported but unusedRemove unused import
(F401)
5-5:
typing.Anyimported but unusedRemove unused import
(F401)
10-10:
zarr.core.buffer.Bufferimported but unusedRemove unused import
(F401)
10-10:
zarr.core.buffer.BufferPrototypeimported but unusedRemove unused import
(F401)
tests/test_converter.py (2)
1-7: Drop unused imports and sort the remaining block (duplicate reminder)
`asyncio` and `aiohttp` are unused, triggering F401. Remove them and keep standard-lib imports first.
-import asyncio
...
-from unittest.mock import patch
-import aiohttp
+from unittest.mock import patch
🪛 Ruff (0.11.9)
1-1:
asyncioimported but unusedRemove unused import:
asyncio(F401)
6-6:
aiohttpimported but unusedRemove unused import:
aiohttp(F401)
182-182: Remove the unnecessary f-string prefix
No interpolation happens here.
- assert f"New ShardedZarrStore Root CID" in captured.out
+ assert "New ShardedZarrStore Root CID" in captured.out
🪛 Ruff (0.11.9)
182-182: f-string without any placeholders
Remove extraneous
fprefix(F541)
tests/test_sharded_zarr_store.py (2)
1-2: Remove unused imports and variables
Several imports and the `precip` variable are not used in the test file. Apply this diff to clean up the code:
-import asyncio
-import math
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore
Also remove the unused `precip` variable:
 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))
 ds = xr.Dataset(
Also applies to: 12-13, 25-25
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
asyncioimported but unusedRemove unused import:
asyncio(F401)
2-2:
mathimported but unusedRemove unused import:
math(F401)
647-732: ⚠️ Potential issue: Remove duplicate function definition
The function `test_sharded_zarr_store_init_invalid_shapes` is defined twice. The commented-out version (lines 509-596) should be removed along with this duplicate definition.
🪛 Ruff (0.11.9)
689-689: Local variable
root_cidis assigned to but never usedRemove assignment to unused variable
root_cid(F841)
py_hamt/store.py (1)
334-346: ⚠️ Potential issue: Partial-range header builder omits the `length`-only case and lacks validation
`KuboCAS.load()` has a bug where `length` without `offset` results in a full GET (no `Range` header). It should set `Range: bytes=0-{length-1}`. Apply this fix:
# Construct the Range header if required - if offset is not None: + if (offset is not None) and (suffix is not None): + raise ValueError("Specify either offset/length or suffix, not both") + + if offset is not None: start = offset if length is not None: # Standard HTTP Range: bytes=start-end (inclusive) end = start + length - 1 headers["Range"] = f"bytes={start}-{end}" else: # Standard HTTP Range: bytes=start- (from start to end) headers["Range"] = f"bytes={start}-" elif suffix is not None: # Standard HTTP Range: bytes=-N (last N bytes) headers["Range"] = f"bytes=-{suffix}" + elif length is not None: + # Range from byte 0 to length-1 + headers["Range"] = f"bytes=0-{length-1}"py_hamt/sharded_zarr_store.py (2)
325-334: Add type assertion for better type safety
After checking that `_cid_len` is not None, add an assertion to help mypy understand the type.
 if self._cid_len is None:  # Should be set
     raise RuntimeError(
         "Store not initialized: _cid_len is None for shard initialization."
     )
+assert self._cid_len is not None  # Type assertion for mypy
 if self._chunks_per_shard is None:
     raise RuntimeError(
         "Store not initialized: _chunks_per_shard is None for shard initialization."
     )
+assert self._chunks_per_shard is not None  # Type assertion for mypy
 # New shard or shard not yet written, initialize with zeros
 shard_size_bytes = self._chunks_per_shard * self._cid_len
488-498: ⚠️ Potential issue: Add explicit ByteRequest type validation in `get()`
The code assumes `byte_range` has `.start` and `.end` attributes, which will fail for `OffsetByteRequest` or `SuffixByteRequest` types. Replace the existing byte range handling:
- req_offset = byte_range.start if byte_range else None
- req_length = None
- if byte_range:
-     if byte_range.end is not None:
-         if (
-             byte_range.start > byte_range.end
-         ):  # Zarr allows start == stop for 0 length
-             raise ValueError(
-                 f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
-             )
-         req_length = byte_range.end - byte_range.start
+ from zarr.abc.store import RangeByteRequest, OffsetByteRequest, SuffixByteRequest
+
+ req_offset: Optional[int] = None
+ req_length: Optional[int] = None
+ if byte_range is not None:
+     if isinstance(byte_range, RangeByteRequest):
+         req_offset = byte_range.start
+         if byte_range.end is not None:
+             if byte_range.start > byte_range.end:
+                 raise ValueError(f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})")
+             req_length = byte_range.end - byte_range.start
+     elif isinstance(byte_range, OffsetByteRequest):
+         req_offset = byte_range.offset
+     elif isinstance(byte_range, SuffixByteRequest):
+         raise TypeError("SuffixByteRequest is not supported by ShardedZarrStore.get()")
+     else:
+         raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
🧹 Nitpick comments (12)
py_hamt/hamt_to_sharded_converter.py (1)
50-58: Missing async context manager / explicit close for `dest_store`
`ShardedZarrStore.open()` likely allocates buffers and holds a CAS reference.
Wrap it in an async context manager (if available) or explicitly call `await dest_store.close()` after `flush()` to avoid descriptor leaks.
py_hamt/manage_pins.py (1)
62-64: Propagate the target RPC explicitly for clarity
The tests call `pin_entire_dataset(target_rpc=rpc_base_url)`, but the CLI path omits the parameter. Passing it makes the behaviour explicit and avoids surprises if `ShardedZarrStore` later defaults to a different endpoint.
- await store.pin_entire_dataset()
+ await store.pin_entire_dataset(target_rpc=args.rpc_url)
Apply the same change for `unpin_entire_dataset`.
tests/test_sharded_zarr_pinning.py (2)
119-134: Fixed sleep can make the test flaky
Relying on `await asyncio.sleep(1)` assumes the Kubo daemon updates pin state within one second, which is optimistic on busy nodes.
Refactor to poll with a timeout:
async def _wait_until(predicate, timeout=30, interval=0.5):
    start = asyncio.get_event_loop().time()
    while not predicate():
        if asyncio.get_event_loop().time() - start > timeout:
            pytest.fail("Pin state did not converge in time")
        await asyncio.sleep(interval)

await _wait_until(lambda: expected_cids <= await get_pinned_cids(rpc_base_url))
12-20: Use GET for `pin/ls` for better compatibility
`/api/v0/pin/ls` accepts GET; POST works on recent Kubo but older nodes reject it. Switching to GET maximises compatibility.
- async with session.post(url, params={"type": "all"}) as resp:
+ async with session.get(url, params={"type": "all"}) as resp:
249-252: `kubo_cas` is created but unused
The context manager is only needed to satisfy type checks; drop the binding to silence F841.
- async with KuboCAS(
-     rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
- ) as kubo_cas:
+ async with KuboCAS(
+     rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
+ ):
🪛 Ruff (0.11.9)
251-251: Local variable
kubo_casis assigned to but never usedRemove assignment to unused variable
kubo_cas(F841)
tests/test_sharded_zarr_store.py (1)
689-689: Remove unused variable assignment
The `root_cid` variable is assigned but never used.
- root_cid = await store.flush()
+ await store.flush()
🪛 Ruff (0.11.9)
689-689: Local variable
root_cidis assigned to but never usedRemove assignment to unused variable
root_cid(F841)
py_hamt/store.py (1)
94-104: Simplify control flow by removing unnecessary else
The else block after return can be removed to reduce nesting.
 if offset is not None:
     start = offset
     if length is not None:
         end = start + length
         return data[start:end]
-    else:
-        return data[start:]
+    return data[start:]
-elif suffix is not None:  # If only length is given, assume start from 0
+if suffix is not None:
     return data[-suffix:]
-else:  # Full load
-    return data
+# Full load
+return data
🪛 Pylint (3.3.7)
[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/sharded_zarr_store.py (5)
4-4: Remove unused import
The `cast` import is not used in this file.
-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
🪛 Ruff (0.11.9)
4-4:
typing.castimported but unusedRemove unused import:
typing.cast(F401)
828-833: Use or remove unused exception variable
The exception is assigned to `e` but never used. Either use it in error handling or remove the assignment.
 try:
     await self.cas.unpin_cid(
         chunk_cid_str, target_rpc=target_rpc
     )
-except Exception as e:
-    # ignore
+except Exception:
+    # Ignore failures - chunk might already be unpinned
     continue
🪛 Ruff (0.11.9)
831-831: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
844-847: Use exception information in warning message
The exception should be included in the warning message for better debugging.
try: await self.cas.unpin_cid(str(shard_cid), target_rpc=target_rpc) except Exception as e: - print(f"Warning: Could not unpin shard {str(shard_cid)}") + print(f"Warning: Could not unpin shard {str(shard_cid)}: {e}")🧰 Tools
🪛 Ruff (0.11.9)
845-845: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
854-858: Include exception details in warning messageAdd the exception information to the warning for better debugging.
try: await self.cas.unpin_cid(cid, target_rpc=target_rpc) print(f"Unpinned metadata CID {cid} from {target_rpc}...") except Exception as e: print( - f"Warning: Could not unpin metadata CID {cid}. Likely already unpinned." + f"Warning: Could not unpin metadata CID {cid}: {e}" )🧰 Tools
🪛 Ruff (0.11.9)
855-855: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
863-868: Include exception details in warning messageAdd the exception information to improve debugging.
try: await self.cas.unpin_cid(self._root_cid, target_rpc=target_rpc) print(f"Unpinned root CID {self._root_cid} from {target_rpc}...") except Exception as e: print( - f"Warning: Could not unpin root CID {self._root_cid}. Likely already unpinned." + f"Warning: Could not unpin root CID {self._root_cid}: {e}" )🧰 Tools
🪛 Ruff (0.11.9)
865-865: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- py_hamt/flat_zarr_store.py (1 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/manage_pins.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- py_hamt/store.py (7 hunks)
- tests/test_benchmark_stores.py (1 hunks)
- tests/test_converter.py (1 hunks)
- tests/test_cpc_compare.py (1 hunks)
- tests/test_sharded_zarr_pinning.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- py_hamt/flat_zarr_store.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_benchmark_stores.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
py_hamt/store.py (4)
tests/test_zarr_ipfs_partial.py (1)
load (275-285)
py_hamt/hamt.py (4)
load (138-139), load (159-168), load (276-285), get (593-608)
py_hamt/sharded_zarr_store.py (1)
get (413-501)
py_hamt/zarr_hamt_store.py (1)
get (116-149)
🪛 Ruff (0.11.9)
tests/test_converter.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
6-6: aiohttp imported but unused
Remove unused import: aiohttp
(F401)
182-182: f-string without any placeholders
Remove extraneous f prefix
(F541)
251-251: Local variable kubo_cas is assigned to but never used
Remove assignment to unused variable kubo_cas
(F841)
tests/test_sharded_zarr_store.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
689-689: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
py_hamt/sharded_zarr_store.py
4-4: typing.cast imported but unused
Remove unused import: typing.cast
(F401)
831-831: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
845-845: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
855-855: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
865-865: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
py_hamt/hamt_to_sharded_converter.py
3-3: json imported but unused
Remove unused import: json
(F401)
5-5: typing.Dict imported but unused
Remove unused import
(F401)
5-5: typing.Any imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.Buffer imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.BufferPrototype imported but unused
Remove unused import
(F401)
🪛 Pylint (3.3.7)
tests/test_converter.py
[refactor] 52-52: Too many local variables (21/15)
(R0914)
[refactor] 145-145: Too many local variables (16/15)
(R0914)
[refactor] 196-196: Too many local variables (17/15)
(R0914)
tests/test_sharded_zarr_pinning.py
[refactor] 83-83: Too many local variables (16/15)
(R0914)
tests/test_sharded_zarr_store.py
[refactor] 160-160: Too many local variables (16/15)
(R0914)
[refactor] 210-210: Too many local variables (17/15)
(R0914)
[refactor] 268-268: Too many local variables (18/15)
(R0914)
[refactor] 311-311: Too many local variables (18/15)
(R0914)
py_hamt/store.py
[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/sharded_zarr_store.py
[refactor] 13-13: Too many instance attributes (14/7)
(R0902)
[refactor] 68-68: Too many arguments (8/5)
(R0913)
[refactor] 204-204: Too many return statements (8/6)
(R0911)
[refactor] 296-299: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 289-289: Too many branches (13/12)
(R0912)
[refactor] 413-413: Too many local variables (17/15)
(R0914)
[refactor] 413-413: Too many return statements (7/6)
(R0911)
[refactor] 413-413: Too many branches (14/12)
(R0912)
[refactor] 638-643: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 795-795: Too many branches (15/12)
(R0912)
py_hamt/hamt_to_sharded_converter.py
[refactor] 13-13: Too many local variables (20/15)
(R0914)
🔇 Additional comments (1)
py_hamt/hamt_to_sharded_converter.py (1)
64-74: ❓ Verification inconclusive
Verify CID length before writing into shards
`cid_base32_str` is written verbatim, but `cid_len` is hard-coded to 59. A multibase-encoded CIDv1 can be longer than 59 chars (e.g. long multihash algorithms). Writing a longer string will silently truncate or corrupt the shard layout. Would you run a quick search to see whether any CID in real datasets exceeds 59 characters?
🏁 Script executed:
#!/bin/bash
# Scan existing shards (if any) for long CIDs
fd -e zarr | xargs -I{} grep -oE 'b[a-z0-9]{60,}' {} | head

Length of output: 65
Potential CID length mismatch risk
The converter assumes a fixed `cid_len = 59` for base32-encoded CIDv1, but longer multihash algorithms (e.g. sha2-512) will produce CIDs longer than 59 characters that are silently truncated, corrupting the shard layout. Please verify whether your data ever generates CIDs longer than 59 characters, or update the code to handle variable-length CIDs.
• File: `py_hamt/hamt_to_sharded_converter.py` (lines 64–74)
• Suggested fix – add a length guard before writing the pointer:

- await dest_store.set_pointer(key, cid_base32_str)
+ if len(cid_base32_str) > cid_len:
+     raise ValueError(f"CID length {len(cid_base32_str)} exceeds expected {cid_len}")
+ await dest_store.set_pointer(key, cid_base32_str)
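A quick way to see why 59 is not universal: the base32 string length of a CIDv1 follows directly from the multihash digest size. A minimal sketch, assuming single-byte varints for the version, codec, multihash code, and digest length (which holds for common codecs like dag-cbor and hashes like sha2-256/sha2-512):

```python
import math

def cidv1_base32_len(digest_len: int) -> int:
    """Expected string length of a base32 (multibase 'b') CIDv1.

    Byte layout: 1 version byte + 1 codec varint byte
    + 1 multihash code byte + 1 digest-length byte + digest bytes.
    Unpadded base32 encodes 5 bits per character; the 'b' prefix adds one.
    """
    total_bytes = 1 + 1 + 1 + 1 + digest_len
    return 1 + math.ceil(total_bytes * 8 / 5)

print(cidv1_base32_len(32))  # sha2-256 digest -> 59, the hard-coded cid_len
print(cidv1_base32_len(64))  # sha2-512 digest -> 110, would overflow a 59-char slot
```

So the 59-character assumption is really an assumption of sha2-256 with a single-byte codec; any wider digest breaks it.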
# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")

# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple
|
🛠️ Refactor suggestion
Chunk-shape derivation fails for unchunked datasets
source_dataset.chunks is None for unchunked arrays, so the comprehension will raise TypeError.
Consider a safer fallback:
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+ if source_dataset.chunks is None: # unchunked source
+ chunk_shape_tuple = array_shape_tuple
+ else:
+ chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")
# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple

Suggested change:

# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")
# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+ if source_dataset.chunks is None:  # unchunked source
+     chunk_shape_tuple = array_shape_tuple
+ else:
+     chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple
🤖 Prompt for AI Agents
In py_hamt/hamt_to_sharded_converter.py around lines 36 to 45, the code assumes
source_dataset.chunks is always defined, but it can be None for unchunked
datasets, causing a TypeError. Modify the chunk_shape_tuple assignment to first
check if source_dataset.chunks is not None; if it is None, provide a fallback
such as using the full dimension size as the chunk size or a default chunk size.
This ensures safe handling of unchunked datasets without errors.
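The fallback can be expressed as a small helper. The `dims` and `chunks` arguments below mimic the shapes xarray exposes (`chunks` is `None` for unchunked datasets, otherwise a mapping of dim name to a tuple of per-dim chunk sizes); the helper name is illustrative, not part of the converter:

```python
def derive_shapes(dims, chunks):
    """Return (array_shape, chunk_shape) tuples in a stable dim order.

    Falls back to the full array shape when the dataset is unchunked
    (xarray reports chunks=None in that case), so each dim becomes
    one chunk spanning its whole extent.
    """
    ordered = list(dims)
    array_shape = tuple(dims[d] for d in ordered)
    if chunks is None:  # unchunked source
        chunk_shape = array_shape
    else:
        chunk_shape = tuple(chunks[d][0] for d in ordered)
    return array_shape, chunk_shape

print(derive_shapes({"time": 10, "lat": 4}, {"time": (5, 5), "lat": (4,)}))
# ((10, 4), (5, 4))
print(derive_shapes({"time": 10, "lat": 4}, None))
# ((10, 4), (10, 4))
```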
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #68 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 5 7 +2
Lines 707 1310 +603
==========================================
+ Hits 707 1310 +603

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)
496-515: Align `pin_ls` to return a list of CID strings

After reviewing both the stub (line 57) and the full implementation (lines 496–515) in `py_hamt/store_httpx.py`, and inspecting all call sites (notably in `tests/test_kubo_pin.py`, which uses `assert str(cid) in listed_pins`), it's clear that:

- The current signature `-> list[Dict[str, Any]]` is incorrect.
- The code actually returns a dict of `{ cid: { "Type": … }, … }`, which happens to support `cid in pins` but fails type expectations.
- Tests only check membership, so returning a flat `list[str]` of CIDs both satisfies callers and keeps the API surface clean.

Please apply the following mandatory refactor:

• At line 57 (stub):

- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+ async def pin_ls(self, target_rpc: str) -> list[str]:

• At lines 496–515 (implementation):

- async def pin_ls(
-     self, target_rpc: str = "http://127.0.0.1:5001"
- ) -> list[Dict[str, Any]]:
+ async def pin_ls(
+     self, target_rpc: str = "http://127.0.0.1:5001"
+ ) -> list[str]:
@@
-     Returns:
-         List[CID]: A list of pinned CIDs.
+     Returns:
+         list[str]: A list of pinned CID strings.
@@
-     pins = response.json().get("Keys", [])
-     return pins
+     keys_map = response.json().get("Keys", {}) or {}
+     return list(keys_map.keys())

• Adjust both docstrings to reflect the new return type (`list[str]`).
♻️ Duplicate comments (5)
py_hamt/zarr_hamt_store.py (1)
175-177: Replace print with structured logging

Use the logging module instead of print:

- except Exception as e:
-     print(f"Error getting key '{key}' with range {byte_range}: {e}")
-     raise
+ except Exception as e:
+     logger.error("Error getting key %r with range %r: %s", key, byte_range, e)
+     raise

Add once at the top of the module:

+import logging
+logger = logging.getLogger(__name__)

py_hamt/store_httpx.py (4)
77-115: Support length-only reads and simplify control flow in InMemoryCAS.load

Length without offset should return the first N bytes. This was flagged earlier and remains unhandled.

  key = cast(bytes, id)
@@
- data: bytes
+ data: bytes
  try:
      data = self.store[key]
  except KeyError as exc:
      raise KeyError("Object not found in in-memory store") from exc
- if offset is not None:
-     start = offset
-     if length is not None:
-         end = start + length
-         return data[start:end]
-     else:
-         return data[start:]
- elif suffix is not None:  # If only length is given, assume start from 0
-     return data[-suffix:]
- else:  # Full load
-     return data
+ if length == 0:
+     return b""
+ if offset is not None:
+     start = offset
+     if length is not None:
+         end = start + length
+         return data[start:end]
+     return data[start:]
+ if length is not None:
+     # length-only → first N bytes
+     return data[:length]
+ if suffix is not None:
+     # trailing N bytes
+     return data[-suffix:]
+ return data  # full blob
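The intended slicing semantics are easy to check against plain bytes; a minimal sketch of the branch order the diff above proposes (function name is illustrative, not the store's API):

```python
from typing import Optional

def partial_read(data: bytes,
                 offset: Optional[int] = None,
                 length: Optional[int] = None,
                 suffix: Optional[int] = None) -> bytes:
    """Slice like the proposed InMemoryCAS.load: zero-length first,
    then offset(+length), then length-only (first N bytes),
    then suffix (last N bytes), else the full blob."""
    if length == 0:
        return b""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]
        return data[offset:]
    if length is not None:
        return data[:length]
    if suffix is not None:
        return data[-suffix:]
    return data

blob = b"0123456789"
print(partial_read(blob, offset=2, length=3))  # b'234'
print(partial_read(blob, length=4))            # b'0123'
print(partial_read(blob, suffix=2))            # b'89'
```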
435-451: Remove nonexistent `name` parameter from pin_cid docstring

Docstring mentions `name`, which is not in the signature and not sent.

  """
  Pins a CID to the local Kubo node via the RPC API.
  This call is recursive by default, pinning all linked objects.
  Args:
      cid (CID): The Content ID to pin.
-     name (Optional[str]): An optional name for the pin.
+     target_rpc (str): The RPC URL of the Kubo node.
  """

Consider adding a `recursive: bool = True` keyword if you need to toggle recursion in the future.
43-60: Make pin hook methods fail-fast instead of silently passing

These are effectively abstract hooks. Leaving them as `pass` hides misuse at runtime.

- async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Pin a CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Pin a CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Unpin a CID in the storage."""
-     pass  # pragma: no cover
+ async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Unpin a CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def pin_update(
-     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
- ) -> None:
-     """Update the pinned CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_update(
+     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
+ ) -> None:
+     """Update the pinned CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-     """List all pinned CIDs in the storage."""
-     return []  # pragma: no cover
+ async def pin_ls(self, target_rpc: str) -> Dict[str, Any]:
+     """List pinned CIDs in the storage (shape matches Kubo's pin/ls Keys map)."""
+     raise NotImplementedError  # pragma: no cover

Note: This also updates the return type of `pin_ls` to match Kubo (see later comment).
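The fail-fast behavior argued for above can be demonstrated in isolation; a minimal sketch of the hook pattern (class names here are illustrative stand-ins, not the library's):

```python
import asyncio

class BaseCAS:
    """Sketch of a CAS base class whose optional pin hooks fail fast."""

    async def pin_cid(self, cid: str, target_rpc: str) -> None:
        # Subclasses that support pinning must override this.
        raise NotImplementedError("pin_cid not implemented in this CAS")

class NoPinCAS(BaseCAS):
    """A CAS without pinning support: calls now fail loudly, not silently."""

async def main() -> None:
    await NoPinCAS().pin_cid("bafyexamplecid", "http://127.0.0.1:5001")

try:
    asyncio.run(main())
except NotImplementedError as exc:
    print(exc)  # pin_cid not implemented in this CAS
```

With a silent `pass`, the same call would "succeed" and the caller would only discover much later that nothing was ever pinned.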
403-434: Handle length-only HTTP Range requests; guard zero-length/suffix=0

`load()` still ignores `length` when `offset` is `None`. HTTP Range semantics expect `"bytes=0-{length-1}"`. Also, `length == 0` and `suffix == 0` should short-circuit to avoid sending invalid ranges.

  async def load(
      self,
      id: IPLDKind,
      offset: Optional[int] = None,
      length: Optional[int] = None,
      suffix: Optional[int] = None,
  ) -> bytes:
      """@private"""
      cid = cast(CID, id)
      url: str = f"{self.gateway_base_url + str(cid)}"
      headers: Dict[str, str] = {}
+     # Zero-length/suffix guard: avoid invalid HTTP Range values
+     if length == 0 or suffix == 0:
+         return b""
      # Construct the Range header if required
      if offset is not None:
          start = offset
          if length is not None:
              # Standard HTTP Range: bytes=start-end (inclusive)
              end = start + length - 1
              headers["Range"] = f"bytes={start}-{end}"
          else:
              # Standard HTTP Range: bytes=start- (from start to end)
              headers["Range"] = f"bytes={start}-"
-     elif suffix is not None:
+     elif suffix is not None:
          # Standard HTTP Range: bytes=-N (last N bytes)
          headers["Range"] = f"bytes=-{suffix}"
+     elif length is not None:
+         # length only → first N bytes
+         end = length - 1
+         headers["Range"] = f"bytes=0-{end}"
      async with self._sem:  # throttle gateway
          client = self._loop_client()
          response = await client.get(url, headers=headers or None)
          response.raise_for_status()
          return response.content
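The full set of Range cases, including the length-only branch, fits in a small helper; this is a sketch of the intended header semantics, not the KuboCAS code itself (the function name is hypothetical):

```python
from typing import Optional

def build_range_header(offset: Optional[int] = None,
                       length: Optional[int] = None,
                       suffix: Optional[int] = None) -> Optional[str]:
    """Build an HTTP Range header value; None means 'fetch the full blob'.

    Zero-length and zero-suffix reads should be short-circuited by the
    caller (returning b'' locally) before a header is ever built.
    """
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"                           # offset to EOF
    if suffix is not None:
        return f"bytes=-{suffix}"                           # last N bytes
    if length is not None:
        return f"bytes=0-{length - 1}"                      # first N bytes
    return None

print(build_range_header(offset=10, length=5))  # bytes=10-14
print(build_range_header(length=5))             # bytes=0-4
print(build_range_header(suffix=8))             # bytes=-8
```

Note the off-by-one the review keeps circling: HTTP Range ends are inclusive, so a byte count of `length` maps to `start + length - 1`.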
🧹 Nitpick comments (4)
py_hamt/zarr_hamt_store.py (1)
184-191: Tidy gather usage and consider throttling very large fan-out

The comment says "Set return_exceptions=True" but the call sets it to False (default). Remove the stale note. If `key_ranges` can be large, consider bounding concurrency with a semaphore to avoid scheduling thousands of tasks at once.

- results = await asyncio.gather(
-     *tasks, return_exceptions=False
- )  # Set return_exceptions=True for debugging
+ results = await asyncio.gather(*tasks)
  return results

Optional: introduce a local semaphore to cap concurrent `get()`s.

py_hamt/store_httpx.py (3)
34-41: Clarify partial-read semantics in the base interface

Document that:
- offset is 0-based; length is a byte count; end is exclusive when derived from offset+length.
- suffix returns the last N bytes.
- negative values are invalid; length==0 should return empty bytes.
- ) -> bytes:
-     """Retrieve data."""
+ ) -> bytes:
+     """Retrieve data.
+
+     Semantics:
+     - offset: 0‑based start position.
+     - length: number of bytes to read; when combined with offset, the "end" is exclusive (HTTP Range end is inclusive and derived as start+length-1).
+     - suffix: last N bytes of the object.
+     - Negative values are invalid; length == 0 returns b"".
+     """
190-193: pinOnAdd: URL building looks good; consider URL-encoding chunker/hash

If `chunker` ever contains characters requiring encoding (future formats), assemble the RPC URL using `params=` on `client.post(...)` rather than manual string concatenation.

Example:

response = await client.post(
    f"{rpc_base_url}/api/v0/add",
    params={"hash": self.hasher, "chunker": self.chunker, "pin": str(pinOnAdd).lower()},
    files=files,
)

Also applies to: 258-260
343-383: Destructor performs async work; keep best‑effort but avoid surprising side effects
`__del__` attempts to schedule/drive async cleanup. This is best‑effort but can surprise callers, especially in interpreter shutdown. Consider limiting `__del__` to clearing references and documenting that `aclose()` (or `async with`) is required for deterministic cleanup.

No code change required now; just a heads‑up. If you want code, I can propose a minimal `__del__` that avoids `asyncio.run()` altogether.
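The deterministic alternative the comment points at is the async context-manager protocol; a minimal sketch with a stand-in class (not KuboCAS itself):

```python
import asyncio

class Client:
    """Stand-in resource with explicit async cleanup."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        # Deterministic cleanup point; no reliance on garbage collection.
        self.closed = True

    async def __aenter__(self) -> "Client":
        return self

    async def __aexit__(self, *exc) -> None:
        await self.aclose()

async def demo() -> bool:
    async with Client() as c:
        assert not c.closed  # still open inside the block
    return c.closed          # aclose() ran on scope exit

print(asyncio.run(demo()))  # True
```

Cleanup happens exactly when the `async with` block exits, rather than whenever (or whether) the garbage collector runs `__del__`.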
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- py_hamt/store_httpx.py (8 hunks)
- py_hamt/zarr_hamt_store.py (3 hunks)
- tests/test_zarr_ipfs.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/zarr_hamt_store.py (2)
py_hamt/encryption_hamt_store.py (1)
get (159-183)

py_hamt/hamt.py (2)

get (593-608)
HAMT (287-696)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (5)
load (138-139)
load (159-168)
load (276-285)
get (593-608)
HAMT (287-696)

py_hamt/zarr_hamt_store.py (1)

get (144-177)

py_hamt/sharded_zarr_store.py (1)

get (319-374)

tests/test_kubo_cas.py (1)

test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None

    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            offset = byte_range.start
            length = byte_range.end - byte_range.start
            if length is not None and length < 0:
                raise ValueError("End must be >= start for RangeByteRequest")
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")

    return offset, length, suffix
|
🛠️ Refactor suggestion
Fix RangeByteRequest with open-ended ranges and validate inputs
_map_byte_request unconditionally computes end - start. If end is None (open-ended ranges are valid in Zarr v3), this raises a TypeError. Also, negative start and zero-length ranges should be handled explicitly to avoid downstream invalid HTTP ranges.
Apply:
def _map_byte_request(
self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
"""Helper to map Zarr ByteRequest to offset, length, suffix."""
offset: Optional[int] = None
length: Optional[int] = None
suffix: Optional[int] = None
if byte_range:
if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
- offset = byte_range.start
- length = byte_range.end - byte_range.start
- if length is not None and length < 0:
- raise ValueError("End must be >= start for RangeByteRequest")
+ if byte_range.start < 0:
+ raise ValueError("start must be >= 0 for RangeByteRequest")
+ offset = byte_range.start
+ if byte_range.end is not None:
+ if byte_range.end < byte_range.start:
+ raise ValueError(
+ "end must be >= start for RangeByteRequest"
+ )
+ length = byte_range.end - byte_range.start # end is exclusive
+ else:
+ # open-ended: request from offset to end of blob
+ length = None
elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
offset = byte_range.offset
elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
suffix = byte_range.suffix
else:
raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
        return offset, length, suffix

Optionally, short-circuit zero-length reads at the call-site (see next comment) to avoid sending a bytes=start-(start-1) HTTP range.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None
    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            offset = byte_range.start
            length = byte_range.end - byte_range.start
            if length is not None and length < 0:
                raise ValueError("End must be >= start for RangeByteRequest")
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
    return offset, length, suffix

Suggested change:

def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None
    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            if byte_range.start < 0:
                raise ValueError("start must be >= 0 for RangeByteRequest")
            offset = byte_range.start
            if byte_range.end is not None:
                if byte_range.end < byte_range.start:
                    raise ValueError(
                        "end must be >= start for RangeByteRequest"
                    )
                length = byte_range.end - byte_range.start  # end is exclusive
            else:
                # open-ended: request from offset to end of blob
                length = None
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
    return offset, length, suffix
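Independent of zarr's classes, the mapping the suggestion above implements reduces to three cases; a self-contained sketch using plain tuples in place of the zarr ByteRequest types (the string "kinds" are hypothetical stand-ins, not zarr's API):

```python
def map_byte_request(kind, *args):
    """Map a byte request to (offset, length, suffix).

    kind is one of:
      "range"  -> (start, end) with end exclusive, or end=None for open-ended
      "offset" -> (start,) read from start to the end of the blob
      "suffix" -> (n,) read the last n bytes
    """
    if kind == "range":
        start, end = args
        if start < 0:
            raise ValueError("start must be >= 0")
        if end is None:           # open-ended: offset to end of blob
            return start, None, None
        if end < start:
            raise ValueError("end must be >= start")
        return start, end - start, None
    if kind == "offset":
        return args[0], None, None
    if kind == "suffix":
        return None, None, args[0]
    raise TypeError(f"Unsupported request kind: {kind}")

print(map_byte_request("range", 10, 20))   # (10, 10, None)
print(map_byte_request("range", 10, None)) # (10, None, None)
print(map_byte_request("suffix", 8))       # (None, None, 8)
```

This makes the open-ended case explicit: `end=None` must not reach the `end - start` subtraction, which is exactly the TypeError the review flags.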
  if is_metadata and byte_range is None and key in self.metadata_read_cache:
      val = self.metadata_read_cache[key]
  else:
      offset, length, suffix = self._map_byte_request(byte_range)
      val = cast(
-         bytes, await self.hamt.get(key)
+         bytes,
+         await self.hamt.get(
+             key, offset=offset, length=length, suffix=suffix
+         ),
      )  # We know values received will always be bytes since we only store bytes in the HAMT
-     if is_metadata:
+     if is_metadata and byte_range is None:
          self.metadata_read_cache[key] = val
🛠️ Refactor suggestion
Short-circuit zero-length ranges to avoid invalid HTTP Range headers
If a caller supplies RangeByteRequest(start, end) where start == end, length == 0. Forwarding this to the CAS will build end = start + length - 1, i.e. start-1 (invalid). Return an empty buffer locally for this edge case.
- else:
- offset, length, suffix = self._map_byte_request(byte_range)
+ else:
+ offset, length, suffix = self._map_byte_request(byte_range)
+ # Zero-length range → empty result, avoid invalid HTTP Range
+ if length == 0:
+ return prototype.buffer.from_bytes(b"")
val = cast(
bytes,
await self.hamt.get(
key, offset=offset, length=length, suffix=suffix
),
) # We know values received will always be bytes since we only store bytes in the HAMT
if is_metadata and byte_range is None:
self.metadata_read_cache[key] = val📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if is_metadata and byte_range is None and key in self.metadata_read_cache:
    val = self.metadata_read_cache[key]
else:
    offset, length, suffix = self._map_byte_request(byte_range)
    val = cast(
        bytes,
        await self.hamt.get(
            key, offset=offset, length=length, suffix=suffix
        ),
    )  # We know values received will always be bytes since we only store bytes in the HAMT
    if is_metadata and byte_range is None:
        self.metadata_read_cache[key] = val

Suggested change:

if is_metadata and byte_range is None and key in self.metadata_read_cache:
    val = self.metadata_read_cache[key]
else:
    offset, length, suffix = self._map_byte_request(byte_range)
+   # Zero-length range → empty result, avoid invalid HTTP Range
+   if length == 0:
+       return prototype.buffer.from_bytes(b"")
    val = cast(
        bytes,
        await self.hamt.get(
            key, offset=offset, length=length, suffix=suffix
        ),
    )  # We know values received will always be bytes since we only store bytes in the HAMT
    if is_metadata and byte_range is None:
        self.metadata_read_cache[key] = val
🤖 Prompt for AI Agents
In py_hamt/zarr_hamt_store.py around lines 158 to 169, when a RangeByteRequest
has start==end the computed length will be 0 and building an HTTP Range header
for the CAS becomes invalid; after calling self._map_byte_request(byte_range)
check if length == 0 and if so set val = b"" (return an empty bytes object)
instead of calling await self.hamt.get(...), and still populate
self.metadata_read_cache[key] when is_metadata and byte_range is None or when
appropriate for cached reads; this short-circuits the zero-length case and
avoids creating invalid Range headers.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
py_hamt/store_httpx.py
Outdated
    *,
    headers: dict[str, str] | None = None,
    auth: Tuple[str, str] | None = None,
    pinOnAdd: bool = False,
Wrong casing:

- pinOnAdd: bool = False,
+ pin_on_add: bool = False,
py_hamt/store_httpx.py
Outdated
pinString: str = "true" if pinOnAdd else "false"
self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
Also here, casing to be fixed:

- pinString: str = "true" if pinOnAdd else "false"
- self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
+ pin_string: str = "true" if pin_on_add else "false"
+ self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pin_string}"
Just a couple of casing issues; for the rest, it LGTM.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
py_hamt/store_httpx.py (3)
264-271: Bug: calling get_running_loop() in __init__ when a user supplies a client

Constructing KuboCAS from sync code with a prebuilt AsyncClient raises RuntimeError. Don't touch the loop in __init__; stash and return the external client in _loop_client().

- if client is not None:
-     # A client was supplied by the user. We don't own it.
-     self._owns_client = False
-     self._client_per_loop = {asyncio.get_running_loop(): client}
+ if client is not None:
+     # A client was supplied by the user. We don't own it.
+     self._owns_client = False
+     # Do not access the running loop in __init__
+     self._external_client = client
+     self._client_per_loop = {}
  else:
      # No client supplied. We will own any clients we create.
      self._owns_client = True
      self._client_per_loop = {}
281-312: Return user-supplied client regardless of loop; lazily create owned clients per loop

This complements the previous fix. If the user provided a client, just return it. For owned clients, keep the per-loop behavior.

  def _loop_client(self) -> httpx.AsyncClient:
      """Get or create a client for the current event loop.
@@
      if self._closed:
          if not self._owns_client:
              raise RuntimeError("KuboCAS is closed; create a new instance")
          # We previously closed all internally-owned clients. Reset the
          # state so that new clients can be created lazily.
          self._closed = False
          self._client_per_loop = {}
+     # External client → return as-is, independent of loop.
+     if not self._owns_client:
+         if self._external_client is None:
+             raise RuntimeError("External client missing (internal error)")
+         return self._external_client
      loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
      try:
          return self._client_per_loop[loop]
      except KeyError:
          # Create a new client
          client = httpx.AsyncClient(
              timeout=60.0,
              headers=self._default_headers,
              auth=self._default_auth,
              limits=httpx.Limits(max_connections=64, max_keepalive_connections=32),
              # Uncomment when they finally support Robust HTTP/2 GOAWAY responses
              # http2=True,
          )
          self._client_per_loop[loop] = client
          return client
496-515: pin_ls return type/docstring mismatch and incorrect shape

The RPC returns {"Keys": { "<cid>": {...}, ... }}, not a list. The annotation says list[Dict[str, Any]], docstring says List[CID]. Return List[CID] (decoded) for a stable API, or change both to Dict[str, Any]. Below converts to List[CID].

- async def pin_ls(
-     self, target_rpc: str = "http://127.0.0.1:5001"
- ) -> list[Dict[str, Any]]:
+ async def pin_ls(
+     self, target_rpc: str = "http://127.0.0.1:5001"
+ ) -> list[CID]:
@@
-     Returns:
-         List[CID]: A list of pinned CIDs.
+     Returns:
+         list[CID]: The pinned CIDs.
@@
-     pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
-     async with self._sem:  # throttle RPC
-         client = self._loop_client()
-         response = await client.post(pin_ls_url_base)
-         response.raise_for_status()
-         pins = response.json().get("Keys", [])
-         return pins
+     pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
+     async with self._sem:  # throttle RPC
+         client = self._loop_client()
+         response = await client.post(pin_ls_url_base)
+         response.raise_for_status()
+         payload = response.json()
+         keys_map = payload.get("Keys", {}) or {}
+         return [CID.decode(k) for k in keys_map.keys()]
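The shape difference is easy to see with a stubbed RPC payload; this sketch mirrors the `{"Keys": {...}}` map Kubo's pin/ls endpoint returns and the flat list the refactor asks for (the payload contents are illustrative):

```python
def pins_from_payload(payload: dict) -> list:
    """Flatten a Kubo pin/ls response of the form
    {"Keys": {"<cid>": {"Type": ...}, ...}} into a list of CID strings,
    tolerating a missing or empty Keys map."""
    keys_map = payload.get("Keys", {}) or {}
    return list(keys_map.keys())

payload = {"Keys": {"bafyexample1": {"Type": "recursive"},
                    "bafyexample2": {"Type": "direct"}}}
print(pins_from_payload(payload))  # ['bafyexample1', 'bafyexample2']
print(pins_from_payload({}))       # []
```

The `or {}` guard matters: `.get("Keys", {})` alone still yields `None` when the server explicitly sends `"Keys": null`.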
♻️ Duplicate comments (4)
py_hamt/store_httpx.py (4)
439-475: Pin API docstrings align with behavior (recursive=true); nice cleanup

Docs now match the code and removed the non-existent name parameter. Looks good.
43-60: Base-class pin hooks silently no-op; surface misuse with NotImplementedError or mark abstract

Leaving these as pass/[] hides bugs at runtime when a caller expects pinning to work but the concrete CAS doesn't implement it. Raise NotImplementedError to fail fast (or mark @abstractmethod if all CASes must implement).

- async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Pin a CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Pin a CID in the storage."""
+     raise NotImplementedError("pin_cid not implemented in this CAS")

- async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Unpin a CID in the storage."""
-     pass  # pragma: no cover
+ async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Unpin a CID in the storage."""
+     raise NotImplementedError("unpin_cid not implemented in this CAS")

- async def pin_update(
-     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
- ) -> None:
-     """Update the pinned CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_update(
+     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
+ ) -> None:
+     """Update the pinned CID in the storage."""
+     raise NotImplementedError("pin_update not implemented in this CAS")

- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-     """List all pinned CIDs in the storage."""
-     return []  # pragma: no cover
+ async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+     """List all pinned CIDs in the storage."""
+     raise NotImplementedError("pin_ls not implemented in this CAS")
104-114: Handle length-only loads and remove redundant else-after-return

When only length is provided (offset is None), callers expect the first N bytes. Also, the comment on Line 111 references length under the suffix branch. Tidy the control flow.

- if offset is not None:
-     start = offset
-     if length is not None:
-         end = start + length
-         return data[start:end]
-     else:
-         return data[start:]
- elif suffix is not None:  # If only length is given, assume start from 0
-     return data[-suffix:]
- else:  # Full load
-     return data
+ if offset is not None:
+     start = offset
+     if length is not None:
+         end = start + length
+         return data[start:end]
+     return data[start:]
+ elif length is not None:
+     # length only → first N bytes
+     return data[:length]
+ elif suffix is not None:
+     # trailing N bytes
+     return data[-suffix:]
+ return data  # full blob
415-428: Support length-only HTTP range requests (first N bytes)

When `length` is provided without `offset`, set Range to `"bytes=0-{length-1}"`. Today that path falls through and fetches the whole blob.
```diff
         # Construct the Range header if required
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
+        elif length is not None:
+            # length only → first N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
         elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
```
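As a sanity check, the proposed mapping can be sketched as a standalone helper (the name `build_range_header` is hypothetical; the real code inlines this logic in `load()`):

```python
def build_range_header(offset=None, length=None, suffix=None):
    """Map (offset, length, suffix) to an HTTP Range header value, or None."""
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"          # from offset to end of blob
    if length is not None:
        return f"bytes=0-{length - 1}"     # length only → first N bytes
    if suffix is not None:
        return f"bytes=-{suffix}"          # trailing N bytes
    return None                            # full load, no header needed

print(build_range_header(offset=10, length=5))  # bytes=10-14
print(build_range_header(length=100))           # bytes=0-99
print(build_range_header(suffix=16))            # bytes=-16
```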
🧹 Nitpick comments (4)
py_hamt/store_httpx.py (4)
190-192: Nit: prefer snake_case parameter names (`pinOnAdd` → `pin_on_add`)

PEP 8 consistency. This public API change is breaking, so consider an alias/deprecation path if you decide to adjust.
343-383: Avoid complex async orchestration in `__del__`

Scheduling tasks or calling asyncio.run() in a destructor is fragile and can produce noisy warnings. Since you already expose the async context-manager and aclose(), consider simplifying `__del__` to only clear references when we own clients and log a warning for leaked clients.
489-494: Be explicit with repeated query params for pin/update

httpx will encode list values as repeated keys, but being explicit avoids ambiguity and IDE type nags; it also mirrors curl usage clearly.
```diff
-        params = {"arg": [str(old_id), str(new_id)]}
+        params = [("arg", str(old_id)), ("arg", str(new_id))]
         pin_update_url_base: str = f"{target_rpc}/api/v0/pin/update"
         async with self._sem:  # throttle RPC
             client = self._loop_client()
             response = await client.post(pin_update_url_base, params=params)
             response.raise_for_status()
```
33-41: Nit: avoid using `id` as a parameter name

It shadows Python's built-in `id()`. Consider `object_id` or `content_id` in new APIs; not worth churning existing ones.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
py_hamt/store_httpx.py(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (3)
py_hamt/hamt.py (5)
- load (138-139), load (159-168), load (276-285), get (593-608), HAMT (287-696)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)
```python
        self._owns_client: bool = False
        self._closed: bool = True
        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
        self._default_headers = headers
        self._default_auth = auth
```
🛠️ Refactor suggestion
Prepare for external client without touching the loop in `__init__`

Accessing asyncio.get_running_loop() in `__init__` breaks creation from sync contexts. Stash the external client and bind it on first use.
```diff
         self._owns_client: bool = False
         self._closed: bool = True
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        self._owns_client: bool = False
        self._closed: bool = True
        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
        self._default_headers = headers
        self._default_auth = auth
        self._external_client: httpx.AsyncClient | None = None
```
🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 229 to 234, __init__ is calling
asyncio.get_running_loop() (or otherwise touching the loop) when an external
httpx.AsyncClient may be passed, which breaks sync-side construction; instead,
store the provided client object and do not access the event loop in __init__;
set self._owns_client to False when an external client is provided and defer
creating or binding a per-loop AsyncClient until first use (e.g., in a private
_ensure_client() or _get_client() method) where you can safely call
asyncio.get_running_loop(), create or attach the client to that loop, and set
self._owns_client/ self._closed accordingly.
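The deferred-binding pattern this prompt describes can be sketched independently of httpx (here `DummyClient` stands in for `httpx.AsyncClient`, and the class name `LazyClientCAS` is illustrative, not the project's API):

```python
import asyncio

class DummyClient:
    """Stand-in for httpx.AsyncClient; this sketch avoids real HTTP."""

class LazyClientCAS:
    def __init__(self, client=None):
        # No event-loop access here, so construction is safe from sync code.
        self._external_client = client
        self._owns_client = client is None
        self._client_per_loop = {}

    def _loop_client(self):
        # The loop is only touched once we are actually running inside one.
        loop = asyncio.get_running_loop()
        if loop in self._client_per_loop:
            return self._client_per_loop[loop]
        client = self._external_client if self._external_client is not None else DummyClient()
        self._client_per_loop[loop] = client
        return client

cas = LazyClientCAS()  # constructed with no running loop: no error

async def main():
    return cas._loop_client()

client = asyncio.run(main())
print(type(client).__name__)  # DummyClient
```

The key point is that `__init__` never calls `asyncio.get_running_loop()`; the loop lookup is deferred to `_loop_client`, which is only invoked from async code.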
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)
496-515: Align `pin_ls()` return type, implementation, and tests

The current implementation of `pin_ls()` (py_hamt/store_httpx.py lines 496–515) calls the IPFS RPC and does:

```python
pins = response.json().get("Keys", [])
return pins
```

– at runtime this returns a dict mapping CIDs to metadata, but the signature is `-> list[Dict[str, Any]]` and the docstring says it returns a list of CIDs. Meanwhile, the tests in tests/test_kubo_pin.py (at lines 26, 34, 48, 59) do:

```python
assert str(cid) in listed_pins
# …
assert str(cid) not in listed_pins_after_unpin
```

which relies on `listed_pins` being an iterable of CID strings (dict membership in Python tests keys, so it happens to work today). We need to pick one consistent API shape and apply the corresponding refactors:
• Approach A – Return a list of `{cid, ...}` dicts

– Implementation (replace lines 512–515):

```diff
-        pins = response.json().get("Keys", [])
-        return pins
+        keys = response.json().get("Keys") or {}
+        # Convert {"<cid>": {...}, …} to [{"cid": "<cid>", **metadata}, …]
+        return [{"cid": k, **(v or {})} for k, v in keys.items()]
```

– Signature & docstring (py_hamt/store_httpx.py lines 496–498 & doc):

```diff
-    async def pin_ls(...) -> list[Dict[str, Any]]:
+    async def pin_ls(...) -> list[Dict[str, Any]]:
         """
         Lists all pinned CIDs on the local Kubo node via the RPC API.

         Returns:
-            List[CID]: A list of pinned CIDs.
+            list[dict]: Each entry has the form {"cid": "<cid>", "Type": ..., "Metadata": ...}.
         """
```
diff - assert str(cid) in listed_pins + assert any(p["cid"] == str(cid) for p in listed_pins), f"CID {cid} was not pinned"
and similarly for the other three assertions (listed_pins_after_unpin,listed_pins_after_update,listed_pins_after_unpin_update).• Approach B – Keep returning the raw dict
– Signature & docstring:

```diff
-    async def pin_ls(...) -> list[Dict[str, Any]]:
+    async def pin_ls(...) -> Dict[str, Dict[str, Any]]:
         """
         Returns:
-            List[CID]: A list of pinned CIDs.
+            Dict[str, dict]: Mapping from each pinned CID string to its metadata.
         """
```

– Implementation: leave as is (it already returns the dict).
– Tests: no change necessary.
– Tests: no change necessary.Please select the intended API contract and update the code (py_hamt/store_httpx.py), docstring, signature, and—if you choose Approach A—the four assertions in tests/test_kubo_pin.py accordingly.
♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)
43-60: Surface misuse early: make base pin methods fail fast.

These default "no-op" implementations hide bugs at runtime. Prefer raising `NotImplementedError` to catch accidental calls on CAS backends that don't implement pinning. This was noted previously.

```diff
     async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid is not supported by this CAS")

     async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid is not supported by this CAS")

     async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update is not supported by this CAS")

     async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls is not supported by this CAS")
```
229-235: Avoid touching the running loop in `__init__` when an external client is provided.

Constructors should be usable from sync contexts; calling `asyncio.get_running_loop()` here forces an active loop. Stash the external client and bind lazily in `_loop_client()` instead. This was suggested earlier.

```diff
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None
```

Outside this range (illustrative change):

```python
# __init__ (replace the current external-client block)
if client is not None:
    self._owns_client = False
    self._external_client = client
else:
    self._owns_client = True

# _loop_client (first lines)
loop = asyncio.get_running_loop()
if loop in self._client_per_loop:
    return self._client_per_loop[loop]
if self._external_client is not None:
    self._client_per_loop[loop] = self._external_client
    return self._external_client
# ... create internal client as you do now
```
435-459: Expose a `recursive` flag for pin/unpin for flexibility.

Docstrings say "recursive by default" and you correctly send `"true"`. Exposing a `recursive: bool = True` kwarg adds control without breaking defaults. Same for `unpin_cid`.

```diff
     async def pin_cid(
         self,
         cid: CID,
         target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
     ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```

And for `unpin_cid(...)`:

```diff
-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```
77-84: Support length-only reads and tighten slice validation in `InMemoryCAS.load()`.

When only `length` is supplied, callers expect the first N bytes. Also remove "else after return" and validate non-negative inputs. This was raised earlier and still applies.

```diff
@@
-        if offset is not None:
+        if length is not None and offset is None:
+            # length only → first N bytes
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            return data[:length]
+        if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
-                end = start + length
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length
                 return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
+            return data[start:]
+        if suffix is not None:
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
             return data[-suffix:]
-        else:  # Full load
-            return data
+        return data  # full blob
```

Also applies to: 104-115
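The intended slicing semantics are easy to pin down with a pure-bytes model (the `slice_blob` helper below is hypothetical, mirroring the proposed branch order without the validation):

```python
def slice_blob(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Model of the proposed InMemoryCAS.load() slicing semantics."""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]
        return data[offset:]           # offset only → rest of the blob
    if length is not None:
        return data[:length]           # length only → first N bytes
    if suffix is not None:
        return data[-suffix:]          # trailing N bytes
    return data                        # full blob

blob = b"0123456789"
print(slice_blob(blob, offset=2, length=3))  # b'234'
print(slice_blob(blob, length=4))            # b'0123'
print(slice_blob(blob, suffix=2))            # b'89'
```

Note these byte slices line up one-to-one with the HTTP Range forms the gateway path builds, which keeps the in-memory and gateway backends behaviorally identical.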
403-409: HTTP Range: add length-only requests, validate inputs.

`load()` currently ignores the case where only `length` is provided; the gateway should receive `Range: bytes=0-{length-1}`. Also reject negative values. This was previously flagged.

```diff
@@
         # Construct the Range header if required
         if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
-                end = start + length - 1
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
-            headers["Range"] = f"bytes=-{suffix}"
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
+            headers["Range"] = f"bytes=-{suffix}"
@@
-        response = await client.get(url, headers=headers or None)
+        response = await client.get(url, headers=headers or None)
```

Also applies to: 411-428, 431-431
🧹 Nitpick comments (1)
py_hamt/store_httpx.py (1)
34-41: Define and document the range-parameter contract (mutual exclusivity and non-negatives).

The extended `load(...)` signature is good. Please explicitly document and enforce:
- `offset >= 0`, `length >= 0`, `suffix > 0`.
- `suffix` must not be used together with `offset`/`length`.

This avoids ambiguous requests and undefined slices across implementations. Would you like me to add a brief docstring here plus input validation in both implementations?
📒 Files selected for processing (1)
py_hamt/store_httpx.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (4)
- load (138-139), load (159-168), load (276-285), get (593-608)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

tests/test_kubo_cas.py (1)
- test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/store_httpx.py (1)
190-190: Nice: explicit pin-on-add toggle in the add RPC URL.

Good feature gate and correct boolean casing in the query string.
Also applies to: 258-260
0xSwego
left a comment
LGTM
Actionable comments posted: 0
♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)
94-99: Fix length-only handling and add input validation in `InMemoryCAS.load`

`length` without `offset` is currently ignored; also redundant else after return. Add basic validation and implement length-only semantics.

```diff
         key = cast(bytes, id)
         if not isinstance(key, (bytes, bytearray)):  # defensive guard
             raise TypeError(
                 f"InMemoryCAS only supports byte‐hash keys; got {type(id).__name__}"
             )
-        data: bytes
+        # Validate args
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        data: bytes
@@
-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if offset is not None:
+            start = offset
+            if length is not None:
+                end = start + length
+                return data[start:end]
+            return data[start:]
+        elif length is not None:
+            # length only → first N bytes
+            return data[:length]
+        elif suffix is not None:
+            # trailing N bytes
+            return data[-suffix:]
+        return data
```

Also applies to: 105-116
233-238: Don't touch the event loop in `__init__`; defer binding external clients

Accessing asyncio.get_running_loop() during construction breaks sync contexts. Stash the user client and return it from `_loop_client` without loop lookups; only create per-loop clients when you own them.

```diff
         self._owns_client: bool = False
         self._closed: bool = True
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
+        self._external_client: httpx.AsyncClient | None = None
         self._default_headers = headers
         self._default_auth = auth
@@
-        if client is not None:
-            # A client was supplied by the user. We don't own it.
-            self._owns_client = False
-            self._client_per_loop = {asyncio.get_running_loop(): client}
-        else:
-            # No client supplied. We will own any clients we create.
-            self._owns_client = True
-            self._client_per_loop = {}
+        if client is not None:
+            self._owns_client = False
+            self._external_client = client
+            self._client_per_loop = {}
+        else:
+            self._owns_client = True
+            self._client_per_loop = {}
@@
     def _loop_client(self) -> httpx.AsyncClient:
         """Get or create a client for the current event loop.
@@
         if self._closed:
             if not self._owns_client:
                 raise RuntimeError("KuboCAS is closed; create a new instance")
             # We previously closed all internally-owned clients. Reset the
             # state so that new clients can be created lazily.
             self._closed = False
             self._client_per_loop = {}
+        # If user supplied a client, just reuse it.
+        if self._external_client is not None:
+            return self._external_client
+
         loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
         try:
             return self._client_per_loop[loop]
         except KeyError:
             # Create a new client
             client = httpx.AsyncClient(
```

Optional: use WeakKeyDictionary for `_client_per_loop` to avoid retaining closed loops.

Also applies to: 268-276, 299-330
510-526: Expose recursive flag in pin/unpin and align params

Docstrings say recursive by default; make it configurable and wire it through.

```diff
-    async def pin_cid(
-        self,
-        cid: CID,
-        target_rpc: str = "http://127.0.0.1:5001",
-    ) -> None:
+    async def pin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
@@
-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```

Also applies to: 532-547
44-61: Make pin_* hooks fail-fast instead of silently no-op

Silent `pass`/`[]` hides misuse. Raise NotImplementedError so unsupported backends surface errors early.

```diff
     async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid not implemented for this CAS")

     async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid not implemented for this CAS")

     async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update not implemented for this CAS")

     async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls not implemented for this CAS")
```
456-471: Support length-only HTTP Range and validate inputs in `KuboCAS.load`

Handle `length` without `offset` and reject invalid combos/negatives. Return `b""` for zero-length requests.

```diff
-        # Construct the Range header if required
+        # Validate and construct the Range header if required
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        if length == 0:
+            return b""
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            # First N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
```
🧹 Nitpick comments (5)
py_hamt/store_httpx.py (5)
35-41: Clarify and validate the range-argument contract in the base interface

Please document precedence and mutual exclusivity of offset/length/suffix (e.g., offset+suffix invalid; length-only means first N bytes) so all backends implement the same behavior. Consider documenting the contract in the abstract method's docstring and enforcing it with lightweight validation in the concrete implementations.
278-279: Remove duplicate assignment of `_default_headers`/`_default_auth`

These are already set at Lines 236–237.

```diff
-        self._default_headers = headers
-        self._default_auth = auth
```
361-401: Be cautious running async cleanup from `__del__`

Scheduling tasks or calling asyncio.run in destructors can fire at interpreter shutdown and be brittle. Consider limiting `__del__` to clearing references and rely on aclose()/context manager for real cleanup.
548-566: Type consistency for `pin_update`

Align with other pin methods by accepting CID for old_id/new_id (runtime behavior unchanged).

```diff
-    async def pin_update(
-        self,
-        old_id: IPLDKind,
-        new_id: IPLDKind,
+    async def pin_update(
+        self,
+        old_id: CID,
+        new_id: CID,
         target_rpc: str = "http://127.0.0.1:5001",
     ) -> None:
```
568-586: `pin_ls` return type/docstring mismatch

Kubo returns `{"Keys": {...}}`; you return that mapping, not a list. Fix typing, docstring, and default.

```diff
-    async def pin_ls(
-        self, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> list[Dict[str, Any]]:
+    async def pin_ls(
+        self, target_rpc: str = "http://127.0.0.1:5001"
+    ) -> dict[str, Any]:
@@
-        Returns:
-            List[CID]: A list of pinned CIDs.
+        Returns:
+            dict[str, Any]: The raw "Keys" mapping from Kubo pin/ls.
@@
-        pins = response.json().get("Keys", [])
+        pins = response.json().get("Keys", {})
         return pins
```
📒 Files selected for processing (1)
py_hamt/store_httpx.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (5)
py_hamt/hamt.py (4)
- load (138-139), load (159-168), load (276-285), get (593-608)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)

tests/test_public_gateway.py (1)
- test_fix_kubocas_load (270-316)

tests/test_kubo_cas.py (2)
- test_kubo_urls_explicit (52-71), test_kubo_cas (130-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/store_httpx.py (2)
262-264: LGTM: pin_on_add wiring

Correctly threads the pin flag into the add URL.
405-445: LGTM: save() retries and jitter

Solid retry/backoff with jitter and clear error propagation.
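The pattern being praised looks roughly like this in isolation (function names, retry count, and delays below are illustrative, not the actual KuboCAS code):

```python
import asyncio
import random

async def save_with_retry(do_save, retries: int = 3, base_delay: float = 0.01):
    """Retry an async save with exponential backoff plus random jitter (sketch)."""
    for attempt in range(retries):
        try:
            return await do_save()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: propagate the error
            # backoff: base * 2**attempt, scaled by up to 50% random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.5)
            await asyncio.sleep(delay)

# Demo: a save that fails twice, then succeeds on the third attempt
attempts = 0

async def flaky_save():
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient")
    return "cid-ok"

result = asyncio.run(save_with_retry(flaky_save))
print(result, attempts)  # cid-ok 3
```

The jitter spreads out retries from concurrent writers so they do not hammer the RPC endpoint in lockstep.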
py_hamt/__init__.py
Outdated
```python
    "InMemoryCAS",
    "KuboCAS",
    "ZarrHAMTStore",
    "InMemoryCAS",
```
minor nit but duplicate export of InMemoryCAS, should remove
py_hamt/sharded_zarr_store.py
Outdated
```python
        self._resize_complete.set()
        self._shard_locks: DefaultDict[int, asyncio.Lock] = defaultdict(asyncio.Lock)

        self._shard_data_cache: Dict[int, list[Optional[CID]]] = {}
```
ChatGPT had a good point here
- Memory bounds: `_shard_data_cache` has no capacity limit and may grow without bound as clients hit many shards in long-lived processes. Consider an LRU for shard lists (configurable size) to bound memory.
Instead of a plain dict, can we use an LRU that ships with Python itself? Our apps using this are long-lived processes, so without an eviction policy their cache would grow unbounded forever.
I asked Claude Opus for a suggestion (can crosscheck with chatgpt 5)
Basic:

Actionable comments posted: 4
♻️ Duplicate comments (1)
py_hamt/hamt_to_sharded_converter.py (1)
44-44: Chunk shape derivation may fail for unchunked datasets.

The function logic is correct. However, `source_dataset.chunks` can be `None` for unchunked arrays, causing a `TypeError` when accessing `source_dataset.chunks[dim][0]`.

Add a safety check and fallback:

```diff
-    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+    if source_dataset.chunks is None:  # unchunked source
+        chunk_shape_tuple = array_shape_tuple  # Use full array dims as chunk shape
+    else:
+        chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
```
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store.py (4)
1-1: Remove unused import.

The `asyncio` import is not used in the test file.

```diff
-import asyncio
-
 import dag_cbor
```
183-183: Remove print statement from test code.

Test code should not contain print statements.

```diff
-    print(f"Data written to ShardedZarrStore with root CID: {final_cid}")
     assert final_cid is not None
```
200-200: Remove print statement from test code.

```diff
-    print("\n✅ Append test successful! Data verified.")
```
752-752: Remove debugging print statement.

Multiple print statements should be removed from the test code.

```diff
-    print(f"Testing with chunk key: {chunk_key}")
```
py_hamt/sharded_zarr_store.py (2)
195-195: Add explicit `strict` parameter to `zip()`.

Python 3.10+ recommends an explicit `strict` parameter for `zip()` to catch length mismatches.

```diff
-        for a, c in zip(self._array_shape, self._chunk_shape)
+        for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
```

Also applies to lines 767 and 806.
809-814: Simplify conditional logic for num_shards calculation.

The conditional check for `self._num_shards is not None` is unnecessary since it's initialized to 0 in `__init__`.

```diff
-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards
         self._num_shards = (
             (self._total_chunks + self._chunks_per_shard - 1) // self._chunks_per_shard
             if self._total_chunks > 0
             else 0
         )
```
📒 Files selected for processing (7)
- py_hamt/__init__.py (2 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_store_deleting.py (1 hunks)
- tests/test_sharded_store_grafting.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
- tests/testing_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/testing_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
- py_hamt/__init__.py
- tests/test_sharded_store_deleting.py
- tests/test_sharded_store_grafting.py
🧰 Additional context used
🧬 Code graph analysis (3)
py_hamt/sharded_zarr_store.py (1)
py_hamt/store_httpx.py (10)
- ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444), InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store.py (2)
py_hamt/store_httpx.py (4)
- KuboCAS (118-586), save (28-32), save (73-76), save (405-444)

py_hamt/sharded_zarr_store.py (20)
- ShardedZarrStore (137-870), open (205-236), flush (519-562), _load_or_initialize_shard_cache (398-453), get (46-53), get (564-619), exists (682-693), list_prefix (737-740), delete (707-727), with_read_only (471-510), list_dir (859-870), get_partial_values (462-469), _parse_chunk_key (343-382), MemoryBoundedLRUCache (20-134), put (55-96), cache_size (127-129), estimated_memory_usage (122-124), dirty_cache_size (132-134), mark_clean (104-107), clear (109-115)
py_hamt/hamt_to_sharded_converter.py (4)
py_hamt/hamt.py (4)
- HAMT (288-705), keys (673-687), get_pointer (610-624), HAMT (287-696)

py_hamt/sharded_zarr_store.py (4)
- ShardedZarrStore (137-870), open (205-236), set_pointer (660-680), flush (519-562)

py_hamt/store_httpx.py (2)
- KuboCAS (118-586), KuboCAS (76-429)

py_hamt/zarr_hamt_store.py (3)
- ZarrHAMTStore (12-322), read_only (108-111), ZarrHAMTStore (11-285)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
189-189: Avoid specifying long messages outside the exception class
(TRY003)
191-191: Avoid specifying long messages outside the exception class
(TRY003)
195-195: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
226-228: Avoid specifying long messages outside the exception class
(TRY003)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
235-235: Avoid specifying long messages outside the exception class
(TRY003)
269-271: Abstract raise to an inner function
(TRY301)
269-271: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Prefer TypeError exception for invalid type
(TRY004)
273-273: Abstract raise to an inner function
(TRY301)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
279-281: Avoid specifying long messages outside the exception class
(TRY003)
291-293: Avoid specifying long messages outside the exception class
(TRY003)
316-316: Abstract raise to an inner function
(TRY301)
316-316: Avoid specifying long messages outside the exception class
(TRY003)
322-322: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
379-381: Avoid specifying long messages outside the exception class
(TRY003)
428-430: Avoid specifying long messages outside the exception class
(TRY003)
436-436: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
436-436: Avoid specifying long messages outside the exception class
(TRY003)
439-439: Avoid specifying long messages outside the exception class
(TRY003)
452-452: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
508-508: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
527-527: Avoid specifying long messages outside the exception class
(TRY003)
577-579: Avoid specifying long messages outside the exception class
(TRY003)
608-610: Avoid specifying long messages outside the exception class
(TRY003)
623-623: Avoid specifying long messages outside the exception class
(TRY003)
636-636: Avoid specifying long messages outside the exception class
(TRY003)
656-656: Do not catch blind exception: Exception
(BLE001)
657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
657-657: Avoid specifying long messages outside the exception class
(TRY003)
709-709: Avoid specifying long messages outside the exception class
(TRY003)
716-716: Avoid specifying long messages outside the exception class
(TRY003)
744-744: Avoid specifying long messages outside the exception class
(TRY003)
767-767: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
790-790: Avoid specifying long messages outside the exception class
(TRY003)
797-797: Avoid specifying long messages outside the exception class
(TRY003)
799-801: Avoid specifying long messages outside the exception class
(TRY003)
806-806: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
833-833: Avoid specifying long messages outside the exception class
(TRY003)
839-841: Avoid specifying long messages outside the exception class
(TRY003)
py_hamt/hamt_to_sharded_converter.py
125-125: Do not catch blind exception: Exception
(BLE001)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/sharded_zarr_store.py
[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).
🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)
20-135: LGTM! Excellent memory-bounded cache implementation.

The MemoryBoundedLRUCache class effectively addresses the unbounded memory growth concern raised in previous reviews. Key strengths:
- Proper LRU eviction when memory limit is exceeded
- Protection of dirty shards from eviction
- Thread-safe async operations with locks
- Accurate memory tracking using sys.getsizeof()
except Exception as e:
    print(f"\nAn error occurred: {e}")
🛠️ Refactor suggestion
Avoid catching blind exception.
The broad Exception catch could hide important errors and make debugging difficult.
Catch specific exceptions that are expected during conversion:
- except Exception as e:
- print(f"\nAn error occurred: {e}")
+ except (ValueError, KeyError, RuntimeError) as e:
+ print(f"\nAn error occurred: {e}")
+ import sys
+ sys.exit(1)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- except Exception as e:
-     print(f"\nAn error occurred: {e}")
+ except (ValueError, KeyError, RuntimeError) as e:
+     print(f"\nAn error occurred: {e}")
+     import sys
+     sys.exit(1)
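As an aside, the `raise ... from e` form mentioned in the lint output matters because it preserves the original error as `__cause__` instead of swallowing it. A small self-contained illustration (the `decode` helper is invented, not the converter's API):

```python
def decode(raw: bytes) -> dict:
    # Narrow catch plus chaining: the original ValueError survives
    # on the new exception as __cause__, so tracebacks show both.
    try:
        if not raw:
            raise ValueError("empty payload")
        return {"ok": True}
    except ValueError as e:
        raise RuntimeError("decode failed") from e
```

Without `from e`, Python still records the context implicitly, but `from e` makes the causal link explicit and survives re-raising patterns that would otherwise lose it.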
🧰 Tools
🪛 Ruff (0.12.2)
125-125: Do not catch blind exception: Exception
(BLE001)
maybe something to consider here @TheGreatAlgo ?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
py_hamt/sharded_zarr_store.py
Outdated
except Exception as e:
    raise ValueError(f"Failed to decode root object: {e}")
self._root_obj = dag_cbor.decode(root_bytes)
Duplicate decode operation and broad exception handling.
The root object is decoded twice (line 267 and 276), and the broad Exception catch could hide important errors.
  try:
      self._root_obj = dag_cbor.decode(root_bytes)
      if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
          raise ValueError(
              "Root object is not a valid dictionary with 'chunks' key."
          )
      if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
          raise ValueError("shard_cids is not a list.")
- except Exception as e:
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
      raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)

📝 Committable suggestion
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError(
+             "Root object is not a valid dictionary with 'chunks' key."
+         )
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
+     raise ValueError(f"Failed to decode root object: {e}")
🧰 Tools
🪛 Ruff (0.12.2)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 274 to 276, remove the duplicate
dag_cbor.decode call and the overly broad except Exception; instead catch the
specific decode-related exception(s) (e.g., ValueError/CBORDecodeError or the
library's decode error), assign the decoded result to self._root_obj from the
earlier successful decode (or decode once and reuse its variable), and re-raise
a ValueError with the original exception chained/preserved so you don't swallow
important error details.
hmm it appears to be decoded twice is this correct?
self._root_obj = dag_cbor.decode(root_bytes) on here twice?
@Faolain, you're absolutely correct! This is indeed a bug - dag_cbor.decode(root_bytes) is being called twice:
- Line 267: Inside the try block, decodes and assigns to self._root_obj
- Line 276: Outside the try block, decodes again and overwrites self._root_obj
This is problematic because:
- The validation performed inside the try block becomes meaningless since the result gets overwritten
- If the second decode fails, it would raise an unhandled exception
- It's wasteful to decode the same data twice
The correct fix is to simply remove line 276 since the decoding and assignment already happened successfully inside the try block:
async def _load_root_from_cid(self):
    root_bytes = await self.cas.load(self._root_cid)
    try:
        self._root_obj = dag_cbor.decode(root_bytes)
        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
            raise ValueError(
                "Root object is not a valid dictionary with 'chunks' key."
            )
        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
            raise ValueError("shard_cids is not a list.")
    except (ValueError, dag_cbor.CBORDecodeError) as e:
        raise ValueError(f"Failed to decode root object: {e}") from e
    # Remove this duplicate line:
    # self._root_obj = dag_cbor.decode(root_bytes)

Good catch! This would have caused the validation logic to be bypassed.
@TheGreatAlgo this too
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
py_hamt/sharded_zarr_store.py
Outdated
print(
    f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
)
🛠️ Refactor suggestion
Replace print statements with proper logging.
Print statements should use the logging module for better control in production environments.
+import logging
+
+logger = logging.getLogger(__name__)
+
# Log the failure and raise a specific error
- print(
- f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
- )
+ logger.error(
+ f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
+ )

Also applies to line 340 - replace print(f"Error fetching shard {shard_idx}: {e}") with logger.error(...).
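A minimal sketch of the module-level logger pattern suggested above. The `report_shard_failure` helper is hypothetical; note that `exc_info` accepts an exception instance, so the traceback is recorded without relying on an active `except` block:

```python
import logging

# One logger per module, named after the module, so callers can
# route, filter, or silence output via the logging hierarchy.
logger = logging.getLogger(__name__)


def report_shard_failure(shard_idx: int, max_retries: int, err: Exception) -> None:
    # %-style lazy formatting defers string building until a handler
    # actually emits the record; exc_info=err attaches the traceback.
    logger.error(
        "Failed to fetch shard %s after %s attempts: %s",
        shard_idx,
        max_retries,
        err,
        exc_info=err,
    )
```

Unlike `print`, this respects log levels and handlers configured by the application embedding the library.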
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 332-334 (and also the similar
occurrence at line 340), replace the print statements with proper logging calls:
use the module logger (ensure a logger is created/imported at the top, e.g.,
logger = logging.getLogger(__name__)) and change the prints to logger.error with
a clear message including shard_idx and max_retries, and pass exception info
(exc_info=True) where appropriate so stack traces are recorded; keep messages
concise and consistent with other log lines in the file.
except Exception as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}")
🛠️ Refactor suggestion
Improve exception handling specificity.
The broad Exception catch could hide important errors during CAS save operations.
  try:
      data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
      await self.set_pointer(key, str(data_cid_obj))
- except Exception as e:
-     raise RuntimeError(f"Failed to save data for key {key}: {e}")
+ except (ConnectionError, TimeoutError, RuntimeError) as e:
+     raise RuntimeError(f"Failed to save data for key {key}: {e}") from e

📝 Committable suggestion
try:
    data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
    await self.set_pointer(key, str(data_cid_obj))
except (ConnectionError, TimeoutError, RuntimeError) as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
🧰 Tools
🪛 Ruff (0.12.2)
656-656: Do not catch blind exception: Exception
(BLE001)
657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
657-657: Avoid specifying long messages outside the exception class
(TRY003)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 656-657, the broad `except
Exception as e` hides specific errors during CAS save; replace it by catching
only the expected exceptions from the CAS save operation (e.g., the CAS client’s
specific exception types, OSError/IOError, ValueError, etc.), let unexpected
exceptions propagate, and when re-raising wrap with `raise RuntimeError(...)
from e` to preserve the original traceback; if you don't know exact CAS
exception classes, remove the blanket except and/or import and catch the CAS
library's specific error types and include `from e` when raising.
Actionable comments posted: 1
♻️ Duplicate comments (5)
tests/test_sharded_zarr_store_coverage.py (2)
19-23: Delete dead, commented-out blocks

These commented fragments age poorly and confuse intent. Remove them.
Also applies to: 85-89, 180-186, 232-236, 251-263
529-553: Assert retry failures via logging instead of patching

Production code should log, not print. Switch assertions to caplog.

- # This should retry and eventually succeed (testing retry logic lines 325-329)
- with patch("builtins.print") as mock_print:
-     shard_data = await store._load_or_initialize_shard_cache(0)
-     assert shard_data is not None
-     assert len(shard_data) == 4
+ # This should retry and eventually succeed
+ import logging
+ with caplog.at_level(logging.ERROR):
+     shard_data = await store._load_or_initialize_shard_cache(0)
+     assert shard_data is not None
+     assert len(shard_data) == 4
@@
- # This should hit lines 332-337 (failure after max retries)
- with patch("builtins.print") as mock_print:
-     with pytest.raises(
-         RuntimeError, match="Failed to fetch shard 0 after 3 attempts"
-     ):
-         await store_fail._load_or_initialize_shard_cache(0)
-     # Should print failure message (line 333)
-     mock_print.assert_called()
+ # This should fail after max retries and log an error
+ with caplog.at_level(logging.ERROR):
+     with pytest.raises(RuntimeError, match="Failed to fetch shard 0 after 3 attempts"):
+         await store_fail._load_or_initialize_shard_cache(0)
+     assert any("Failed to fetch shard 0" in rec.message for rec in caplog.records)
317-347: Replace prints with logging; narrow exceptions and chain causes

Use logging; the exponential backoff stays. Chain causes on failure; avoid blind except.

+ import logging
+ logger = logging.getLogger(__name__)
@@
  except (ConnectionError, TimeoutError) as e:
      # Handle transient errors (e.g., network issues)
      if attempt < max_retries - 1:
          await asyncio.sleep(
              retry_delay * (2**attempt)
          )  # Exponential backoff
          continue
      else:
-         # Log the failure and raise a specific error
-         print(
-             f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-         )
-         raise RuntimeError(
-             f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-         )
- except Exception as e:
-     # Handle non-transient errors immediately
-     print(f"Error fetching shard {shard_idx}: {e}")
-     raise
+         logger.error("Failed to fetch shard %s after %s attempts: %s",
+                      shard_idx, max_retries, e)
+         raise RuntimeError(
+             f"Failed to fetch shard {shard_idx} after {max_retries} attempts"
+         ) from e
+ except (ValueError, TypeError, dag_cbor.CBORDecodeError) as e:
+     logger.error("Error fetching or decoding shard %s: %s", shard_idx, e)
+     raise
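For context, the retry-with-exponential-backoff shape preserved in the diff above can be sketched as a standalone helper. `fetch_with_retry` is a hypothetical name, not the store's actual API; it retries only transient error types and chains the final failure:

```python
import asyncio


async def fetch_with_retry(fetch, max_retries: int = 3, retry_delay: float = 0.01):
    """Call an async fetch(), retrying transient failures with
    exponential backoff: delay, 2*delay, 4*delay, ..."""
    for attempt in range(max_retries):
        try:
            return await fetch()
        except (ConnectionError, TimeoutError) as e:
            if attempt < max_retries - 1:
                # Back off before the next attempt.
                await asyncio.sleep(retry_delay * (2**attempt))
                continue
            # Out of attempts: surface a specific error, keeping the
            # original exception chained as __cause__.
            raise RuntimeError(
                f"failed after {max_retries} attempts"
            ) from e
```

Non-transient errors (anything outside the caught tuple) propagate immediately, matching the review's advice to narrow the except clause.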
555-567: Unify IPLD ID handling; avoid coercing to str

Coercing IDs breaks generic CAS backends (e.g., InMemoryCAS expects bytes). Keep IDs as returned by cas.save/load and only decode strings when input truly is a string.

- if self._dirty_root:
+ if self._dirty_root:
      # Ensure all metadata CIDs are CID objects for correct encoding
      self._root_obj["metadata"] = {
          k: (CID.decode(v) if isinstance(v, str) else v)
          for k, v in self._root_obj["metadata"].items()
      }
      root_obj_bytes = dag_cbor.encode(self._root_obj)
      new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-     self._root_cid = str(new_root_cid)
+     self._root_cid = new_root_cid
      self._dirty_root = False
@@
- data = await self.cas.load(str(metadata_cid_obj))
+ data = await self.cas.load(metadata_cid_obj)
  return prototype.buffer.from_bytes(data)
@@
- data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
- await self.set_pointer(key, str(data_cid_obj))
- except Exception as e:
-     raise RuntimeError(f"Failed to save data for key {key}: {e}")
+ data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+ await self.set_pointer(key, data_cid_obj)
+ except (ConnectionError, TimeoutError, RuntimeError, ValueError) as e:
+     raise RuntimeError(f"Failed to save data for key {key}") from e
@@
- async def set_pointer(self, key: str, pointer: str) -> None:
+ async def set_pointer(self, key: str, pointer: CID | bytes | str) -> None:
      chunk_coords = self._parse_chunk_key(key)
-     pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+     pointer_cid_obj = CID.decode(pointer) if isinstance(pointer, str) else pointer  # type: ignore[assignment]

Follow-up: if you want flush() to return a string CID, return str(self._root_cid) at the end.

Also applies to: 585-586, 659-662, 665-673
269-282: Decode once and raise with cause; remove duplicate decode and broad catch

_load_root_from_cid() decodes twice and catches Exception. Decode once, validate, and re-raise with from e.

- root_bytes = await self.cas.load(self._root_cid)
- try:
-     self._root_obj = dag_cbor.decode(root_bytes)
-     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-         raise ValueError(
-             "Root object is not a valid dictionary with 'chunks' key."
-         )
-     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-         raise ValueError("shard_cids is not a list.")
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ root_bytes = await self.cas.load(self._root_cid)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
+     raise ValueError("Failed to decode root object") from e
🧹 Nitpick comments (3)
tests/test_sharded_zarr_store_coverage.py (1)
132-136: Remove unused caplog argument

caplog is unused in this test. Drop it to satisfy ARG001.

- async def test_sharded_zarr_store_shard_handling_exceptions(
-     create_ipfs: tuple[str, str], caplog
- ):
+ async def test_sharded_zarr_store_shard_handling_exceptions(
+     create_ipfs: tuple[str, str]
+ ):

py_hamt/sharded_zarr_store.py (2)
771-776: Make zip strict in graft/resize math

Avoid silent truncation when offsets/ranks mismatch.

- global_coords = tuple(
-     c_local + c_offset
-     for c_local, c_offset in zip(local_coords, chunk_offset)
- )
+ global_coords = tuple(
+     c_local + c_offset
+     for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
+ )
@@
- self._chunks_per_dim = tuple(
-     math.ceil(a / c) if c > 0 else 0
-     for a, c in zip(self._array_shape, self._chunk_shape)
- )
+ self._chunks_per_dim = tuple(
+     math.ceil(a / c) if c > 0 else 0
+     for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+ )

Also applies to: 809-812
122-125: Don't implement __contains__ as async

`in` won't use an async __contains__. Rename to contains() or make it sync.

- async def __contains__(self, shard_idx: int) -> bool:
-     async with self._cache_lock:
-         return shard_idx in self._cache
+ def contains(self, shard_idx: int) -> bool:
+     # Best-effort check; not acquiring lock is acceptable for a hint
+     return shard_idx in self._cache
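To see why the async `__contains__` is a trap: the `in` operator calls `__contains__` synchronously and truth-tests whatever comes back, so an async version hands it a coroutine object, which is always truthy. A hypothetical demonstration (these tiny classes are not the store's cache):

```python
import inspect


class AsyncCache:
    """Anti-pattern: `in` never awaits this, so every membership
    test would effectively evaluate to True."""

    def __init__(self):
        self._items: set[int] = set()

    async def __contains__(self, key: int) -> bool:
        return key in self._items


class SyncCache:
    """Fix: keep __contains__ synchronous (or expose a separately
    named async contains() coroutine instead)."""

    def __init__(self):
        self._items = {1}

    def __contains__(self, key: int) -> bool:
        return key in self._items


bad = AsyncCache()
# This is the value `in` would truth-test: a coroutine, not a bool.
result = type(bad).__contains__(bad, 999)
assert inspect.iscoroutine(result)
result.close()  # avoid a "coroutine was never awaited" warning
```

A closed coroutine object is still truthy, which is exactly why `999 in bad` would wrongly report membership.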
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/zarr_hamt_store.py (1)
  ZarrHAMTStore (11-285)
py_hamt/store_httpx.py (3)
  InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (3)
py_hamt/store_httpx.py (7)
  KuboCAS (118-586), save (28-32), save (73-76), save (405-444), load (35-42), load (78-115), load (446-505)
tests/testing_utils.py (1)
  create_ipfs (149-195)
py_hamt/zarr_hamt_store.py (3)
  read_only (108-111), set_partial_values (225-229), exists (193-199)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Abstract raise to an inner function
(TRY301)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
340-342: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
340-342: Avoid specifying long messages outside the exception class
(TRY003)
384-386: Avoid specifying long messages outside the exception class
(TRY003)
433-435: Avoid specifying long messages outside the exception class
(TRY003)
441-441: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
441-441: Avoid specifying long messages outside the exception class
(TRY003)
444-444: Avoid specifying long messages outside the exception class
(TRY003)
457-457: Avoid specifying long messages outside the exception class
(TRY003)
479-479: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
513-513: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
532-532: Avoid specifying long messages outside the exception class
(TRY003)
582-584: Avoid specifying long messages outside the exception class
(TRY003)
613-615: Avoid specifying long messages outside the exception class
(TRY003)
628-628: Avoid specifying long messages outside the exception class
(TRY003)
641-641: Avoid specifying long messages outside the exception class
(TRY003)
661-661: Do not catch blind exception: Exception
(BLE001)
662-662: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
662-662: Avoid specifying long messages outside the exception class
(TRY003)
714-714: Avoid specifying long messages outside the exception class
(TRY003)
721-721: Avoid specifying long messages outside the exception class
(TRY003)
749-749: Avoid specifying long messages outside the exception class
(TRY003)
772-772: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
795-795: Avoid specifying long messages outside the exception class
(TRY003)
802-802: Avoid specifying long messages outside the exception class
(TRY003)
804-806: Avoid specifying long messages outside the exception class
(TRY003)
811-811: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
838-838: Avoid specifying long messages outside the exception class
(TRY003)
844-846: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_sharded_zarr_store_coverage.py
133-133: Unused function argument: caplog
(ARG001)
348-348: Unused method argument: cid
(ARG002)
390-390: Unused method argument: cid
(ARG002)
431-431: Unused method argument: cid
(ARG002)
457-457: Unused method argument: cid
(ARG002)
496-496: Unused method argument: cid
(ARG002)
499-499: Avoid specifying long messages outside the exception class
(TRY003)
505-505: Unused method argument: data
(ARG002)
505-505: Unused method argument: codec
(ARG002)
600-600: Unused method argument: cid
(ARG002)
600-600: Unused method argument: offset
(ARG002)
600-600: Unused method argument: length
(ARG002)
600-600: Unused method argument: suffix
(ARG002)
673-673: Unused method argument: data
(ARG002)
673-673: Unused method argument: codec
(ARG002)
675-675: Avoid specifying long messages outside the exception class
(TRY003)
710-710: Unused method argument: data
(ARG002)
710-710: Unused method argument: codec
(ARG002)
765-765: Unused method argument: cid
(ARG002)
784-784: Unused function argument: shard_idx
(ARG001)
818-818: Unused method argument: cid
(ARG002)
844-844: Unused function argument: timeout
(ARG001)
877-877: Unused method argument: cid
(ARG002)
900-900: Unused function argument: timeout
(ARG001)
908-908: Unused function argument: shard_idx
(ARG001)
🔇 Additional comments (4)
tests/test_sharded_zarr_store_coverage.py (2)
266-328: LRU cache edge-cases well covered

Nice coverage for empty shard sizing, update of existing entries, and eviction when only dirty shards exist.
801-858: Timeout cleanup scenario is precise

Good use of a manual pending event and an asyncio.wait_for patch to validate cleanup of _pending_shard_loads.
29-45: MemoryBoundedLRUCache looks solid

Size accounting, dirty protection, and eviction loop behavior are clean and deterministic.
Also applies to: 55-103
736-787: Grafting path is correct and lock-scoped

Good lock granularity per target shard, and dirty marking only on change.
if not all(cs > 0 for cs in self._chunk_shape):
    raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
    raise ValueError("All array_shape dimensions must be non-negative.")

self._chunks_per_dim = tuple(
    math.ceil(a / c) if c > 0 else 0
    for a, c in zip(self._array_shape, self._chunk_shape)
)
🛠️ Refactor suggestion
Add rank check and use zip(..., strict=True)
Guard mismatched shapes early and make zip intent explicit.
- if not all(cs > 0 for cs in self._chunk_shape):
+ if len(self._array_shape) != len(self._chunk_shape):
+     raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+ if not all(cs > 0 for cs in self._chunk_shape):
      raise ValueError("All chunk_shape dimensions must be positive.")
  if not all(s >= 0 for s in self._array_shape):
      raise ValueError("All array_shape dimensions must be non-negative.")
- self._chunks_per_dim = tuple(
-     math.ceil(a / c) if c > 0 else 0
-     for a, c in zip(self._array_shape, self._chunk_shape)
- )
+ self._chunks_per_dim = tuple(
+     math.ceil(a / c) if c > 0 else 0
+     for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+ )

📝 Committable suggestion
if len(self._array_shape) != len(self._chunk_shape):
    raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
if not all(cs > 0 for cs in self._chunk_shape):
    raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
    raise ValueError("All array_shape dimensions must be non-negative.")
self._chunks_per_dim = tuple(
    math.ceil(a / c) if c > 0 else 0
    for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
)
🧰 Tools
🪛 Ruff (0.12.2)
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 193 to 201, add an explicit rank
check to ensure len(self._array_shape) == len(self._chunk_shape) and raise a
ValueError with a clear message if they differ, then compute
self._chunks_per_dim using zip(self._array_shape, self._chunk_shape,
strict=True) (keeping the existing math.ceil(a / c) if c > 0 else 0 logic) so
mismatched shapes are caught early and the zip intent is explicit.
Actionable comments posted: 5
♻️ Duplicate comments (3)
py_hamt/sharded_zarr_store.py (3)
269-282: Decode-once and narrow rethrow; remove duplicate decode

Root is decoded twice and exceptions aren't chained.

- try:
-     self._root_obj = dag_cbor.decode(root_bytes)
-     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-         raise ValueError(
-             "Root object is not a valid dictionary with 'chunks' key."
-         )
-     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-         raise ValueError("shard_cids is not a list.")
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except Exception as e:
+     raise ValueError("Failed to decode root object") from e
549-561: Do not coerce root CID to string; only stringify on return

This prevents incompatibilities with non-string IPLD kinds (e.g., InMemoryCAS). Return a string for API consistency.
```diff
-        new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-        self._root_cid = str(new_root_cid)
+        new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
+        self._root_cid = new_root_cid
         self._dirty_root = False
-
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)  # unified return type
```
190-201: Add rank check and use strict zip in geometry

Catch shape rank mismatches early; make the `zip` intent explicit.

```diff
-        if not all(cs > 0 for cs in self._chunk_shape):
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+        if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )
```
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store_coverage.py (3)
132-134: Remove unused caplog parameter
`caplog` is not used. Drop it to satisfy linters and reduce noise.

```diff
-async def test_sharded_zarr_store_shard_handling_exceptions(
-    create_ipfs: tuple[str, str], caplog
-):
+async def test_sharded_zarr_store_shard_handling_exceptions(
+    create_ipfs: tuple[str, str],
+):
```
346-359: Prefix unused mock arguments with underscores

Silence ARG001/ARG002 by prefixing unused parameters (`cid`, `data`, `codec`, `timeout`) with `_`. Repeat the pattern for all mocks in this file.

```diff
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def load(self, cid, offset=None, length=None, suffix=None):
+    async def load(self, _cid, offset=None, length=None, suffix=None):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_cache_get(shard_idx):
+    async def mock_cache_get(_shard_idx):
```

Also applies to: 387-399, 427-429, 452-464, 487-500, 583-586, 654-657, 690-693, 745-751, 797-801, 854-858, 823-827, 877-884, 885-888
329-335: Avoid brittle line-number references in test docstrings

Docstrings that mention source line numbers will rot quickly. Rephrase to describe behavior/intent instead.
Also applies to: 569-575
py_hamt/sharded_zarr_store.py (3)
21-27: Clarify that size is an estimate; consider naming and doc tweak
`sys.getsizeof` undercounts nested objects; it's an estimate. Update the docstring accordingly to avoid misleading claims.

```diff
-    Memory usage is calculated using sys.getsizeof for accurate sizing.
+    Memory usage is estimated using sys.getsizeof on the list and elements.
```

Also applies to: 37-45
46-53: Use OrderedDict.move_to_end for LRU bump

Simpler and faster than pop/assign.

```diff
-        shard_data = self._cache.pop(shard_idx)
-        self._cache[shard_idx] = shard_data
+        shard_data = self._cache[shard_idx]
+        self._cache.move_to_end(shard_idx, last=True)
         return shard_data
```
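A minimal sketch of the suggested pattern: `move_to_end` bumps a key to the most-recently-used position in O(1) without re-hashing the value. Class and field names here are hypothetical, not py_hamt's actual cache:

```python
from collections import OrderedDict

class TinyLRU:
    """Toy LRU cache illustrating OrderedDict.move_to_end for the recency bump."""

    def __init__(self, capacity: int):
        self._cache: OrderedDict[int, bytes] = OrderedDict()
        self._capacity = capacity

    def get(self, key: int):
        if key not in self._cache:
            return None
        # O(1) bump to most-recently-used; no pop/reinsert needed.
        self._cache.move_to_end(key, last=True)
        return self._cache[key]

    def put(self, key: int, value: bytes) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key, last=True)
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least-recently-used
```

Usage: a `get` on an old key protects it from the next eviction, which is exactly the behavior the pop/assign idiom was emulating.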
776-781: Backpressure when all shards are dirty

If many shards are dirty, memory can exceed the bound since dirty shards aren't evicted. Consider optional auto-flush or a high-watermark callback to shed memory.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (10)
ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444), InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (12)
load (346-358), load (387-388), load (427-428), load (452-463), load (487-494), load (583-584), load (745-748), load (797-799), load (854-857), save (496-501), save (654-656), save (690-693)
tests/test_sharded_zarr_store_coverage.py (2)
py_hamt/store_httpx.py (7)
KuboCAS (118-586), save (28-32), save (73-76), save (405-444), load (35-42), load (78-115), load (446-505)
py_hamt/sharded_zarr_store.py (13)
ShardedZarrStore (142-869), _load_or_initialize_shard_cache (397-452), get (46-53), get (563-618), resize_store (782-824), MemoryBoundedLRUCache (20-139), _get_shard_size (37-44), put (55-101), dirty_cache_size (137-139), _load_root_from_cid (269-298), with_read_only (470-509), exists (681-692), flush (518-561)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Abstract raise to an inner function
(TRY301)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
336-338: Avoid specifying long messages outside the exception class
(TRY003)
340-340: Use raise without specifying exception name
Remove exception name
(TRY201)
378-380: Avoid specifying long messages outside the exception class
(TRY003)
427-429: Avoid specifying long messages outside the exception class
(TRY003)
435-435: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
435-435: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Avoid specifying long messages outside the exception class
(TRY003)
451-451: Avoid specifying long messages outside the exception class
(TRY003)
473-473: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
507-507: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
526-526: Avoid specifying long messages outside the exception class
(TRY003)
576-578: Avoid specifying long messages outside the exception class
(TRY003)
607-609: Avoid specifying long messages outside the exception class
(TRY003)
622-622: Avoid specifying long messages outside the exception class
(TRY003)
635-635: Avoid specifying long messages outside the exception class
(TRY003)
655-655: Do not catch blind exception: Exception
(BLE001)
656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
708-708: Avoid specifying long messages outside the exception class
(TRY003)
715-715: Avoid specifying long messages outside the exception class
(TRY003)
743-743: Avoid specifying long messages outside the exception class
(TRY003)
766-766: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
789-789: Avoid specifying long messages outside the exception class
(TRY003)
796-796: Avoid specifying long messages outside the exception class
(TRY003)
798-800: Avoid specifying long messages outside the exception class
(TRY003)
805-805: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
832-832: Avoid specifying long messages outside the exception class
(TRY003)
838-840: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_sharded_zarr_store_coverage.py
133-133: Unused function argument: caplog
(ARG001)
346-346: Unused method argument: cid
(ARG002)
387-387: Unused method argument: cid
(ARG002)
427-427: Unused method argument: cid
(ARG002)
452-452: Unused method argument: cid
(ARG002)
487-487: Unused method argument: cid
(ARG002)
490-490: Avoid specifying long messages outside the exception class
(TRY003)
496-496: Unused method argument: data
(ARG002)
496-496: Unused method argument: codec
(ARG002)
583-583: Unused method argument: cid
(ARG002)
583-583: Unused method argument: offset
(ARG002)
583-583: Unused method argument: length
(ARG002)
583-583: Unused method argument: suffix
(ARG002)
654-654: Unused method argument: data
(ARG002)
654-654: Unused method argument: codec
(ARG002)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
690-690: Unused method argument: data
(ARG002)
690-690: Unused method argument: codec
(ARG002)
745-745: Unused method argument: cid
(ARG002)
764-764: Unused function argument: shard_idx
(ARG001)
797-797: Unused method argument: cid
(ARG002)
823-823: Unused function argument: timeout
(ARG001)
854-854: Unused method argument: cid
(ARG002)
877-877: Unused function argument: timeout
(ARG001)
885-885: Unused function argument: shard_idx
(ARG001)
🔇 Additional comments (3)
tests/test_sharded_zarr_store_coverage.py (1)
304-326: LRU eviction behavior tests look solid

Good coverage of the "no clean shards to evict" path; verifies dirty shard protection works as intended.
py_hamt/sharded_zarr_store.py (2)
681-693: exists() looks good

Specific exception handling and on-demand shard loading are clean.
470-509: with_read_only shallow clone is efficient and clear

Sharing locks/cache/pending-loads while flipping the flag is appropriate.
```python
def __init__(
    self,
    cas: ContentAddressedStore,
    read_only: bool,
    root_cid: Optional[str] = None,
    *,
    max_cache_memory_bytes: int = 100 * 1024 * 1024,  # 100MB default
):
```
🛠️ Refactor suggestion
Normalize root CID typing and equality
Stop forcing string conversions internally; accept IPLDKind on input, store as-is, and compare by string form for equality to avoid type mismatches.
```diff
-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
```

```diff
-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
```

```diff
-        return self._root_cid == other._root_cid
+        return str(self._root_cid) == str(other._root_cid)
```

Also applies to: 209-220, 511-516
```python
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
    self._pending_shard_loads[shard_idx] = asyncio.Event()
    shard_cid_str = str(shard_cid_obj)
    await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
else:
    empty_shard = [None] * self._chunks_per_shard
```
🛠️ Refactor suggestion
Avoid str() on shard CIDs; pass IPLDKind to loader
Keep types consistent with CAS expectations.
```diff
-        if shard_cid_obj:
+        if shard_cid_obj:
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # type: ignore[arg-type]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
    self._pending_shard_loads[shard_idx] = asyncio.Event()
    await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # type: ignore[arg-type]
else:
    empty_shard = [None] * self._chunks_per_shard
```
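The `_pending_shard_loads` dictionary of `asyncio.Event` objects above is a request-deduplication pattern: concurrent readers of the same shard wait on one fetch rather than issuing duplicates. A self-contained sketch of that pattern, with hypothetical names rather than the store's real API:

```python
import asyncio

class ShardLoader:
    """Dedupes concurrent loads of the same shard via asyncio.Event."""

    def __init__(self):
        self._cache: dict[int, bytes] = {}
        self._pending: dict[int, asyncio.Event] = {}
        self.fetch_count = 0

    async def _fetch(self, idx: int) -> bytes:
        self.fetch_count += 1
        await asyncio.sleep(0.01)  # simulated network latency
        return b"shard-%d" % idx

    async def get(self, idx: int) -> bytes:
        if idx in self._cache:
            return self._cache[idx]
        if idx in self._pending:
            # Another task is already fetching; wait for it instead.
            await self._pending[idx].wait()
            return self._cache[idx]
        event = self._pending[idx] = asyncio.Event()
        try:
            self._cache[idx] = await self._fetch(idx)
        finally:
            event.set()
            del self._pending[idx]
        return self._cache[idx]
```

Five concurrent `get(7)` calls would trigger a single `_fetch`; the other four simply await the event and read the populated cache.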
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 440 to 446, avoid calling str() on
shard CID objects — the CAS expects an IPLDKind (CID) not a string. Replace
shard_cid_str = str(shard_cid_obj) with using the original shard_cid_obj (or
explicitly cast to the IPLDKind/ CID type expected by
_fetch_and_cache_full_shard) and pass that value into await
self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj); ensure no string
conversion is performed and that the loader signature accepts the IPLDKind.
```python
data = await self.cas.load(str(metadata_cid_obj))
return prototype.buffer.from_bytes(data)
# Chunk data request
```
🛠️ Refactor suggestion
Pass CID objects to CAS.load; don’t convert to str
Keeps IPLD types intact and avoids assumptions about CAS backends.
```diff
-        data = await self.cas.load(str(metadata_cid_obj))
+        data = await self.cas.load(metadata_cid_obj)
@@
-        chunk_cid_str = str(chunk_cid_obj)
@@
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )
```

Also applies to: 596-617
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 579-581 (and similarly 596-617),
the code converts CID objects to strings before calling self.cas.load; change
those calls to pass the CID objects directly (e.g.,
self.cas.load(metadata_cid_obj) instead of self.cas.load(str(metadata_cid_obj)))
so IPLD/CID types are preserved for backends that expect native CID objects;
update all occurrences in the specified ranges to remove str(...) wrapping and
adjust any type hints or variable names if needed to reflect that a CID object
(not a str) is being passed.
```python
raw_data_bytes = value.to_bytes()
# Save the data to CAS first to get its CID.
# Metadata is often saved as 'raw', chunks as well unless compressed.
try:
    data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
    await self.set_pointer(key, str(data_cid_obj))
except Exception as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}")
return None  # type: ignore[return-value]
```
🛠️ Refactor suggestion
Avoid stringifying pointers; accept CID|str in set_pointer and chain exceptions
This removes needless encode/decode cycles and fixes incompatibility with non-string IPLD kinds.
```diff
-    try:
-        data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-        await self.set_pointer(key, str(data_cid_obj))
-    except Exception as e:
-        raise RuntimeError(f"Failed to save data for key {key}: {e}")
+    try:
+        data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+        await self.set_pointer(key, data_cid_obj)  # pass IPLDKind directly
+    except Exception as e:
+        raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
```

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```

Also applies to: 659-667, 669-679
🧰 Tools
🪛 Ruff (0.12.2)
655-655: Do not catch blind exception: Exception
(BLE001)
656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 649-657 (also apply same changes
to blocks at 659-667 and 669-679), the code stringifies CID objects before
calling set_pointer and raises a new RuntimeError without chaining the original
exception; modify the code to pass the CID object (or accept CID|str) directly
to set_pointer instead of calling str(), update set_pointer's
signature/implementation to accept either a CID object or a string (or normalize
input inside set_pointer), and re-raise the RuntimeError using exception
chaining (raise ... from e) so the original exception context is preserved.
Actionable comments posted: 0
♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)
336-338: Add exception chaining for better debugging.

When retries are exhausted, the original exception should be chained.

```diff
         else:
             raise RuntimeError(
                 f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-            )
+            ) from e
```
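Why chaining matters: `raise ... from e` preserves the underlying failure as `__cause__`, so the network error stays visible in tracebacks. A minimal standalone illustration (the retry loop here is a toy, not the store's actual code):

```python
def fetch_with_retries(attempts: int = 3):
    last_exc = None
    for i in range(attempts):
        try:
            raise ConnectionError(f"attempt {i} failed")  # simulated transient failure
        except ConnectionError as e:
            last_exc = e
    # Chaining keeps the original network error attached as __cause__.
    raise RuntimeError(f"Failed after {attempts} attempts") from last_exc

try:
    fetch_with_retries()
except RuntimeError as e:
    print(type(e.__cause__).__name__)  # ConnectionError
```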
441-442: Pass the CID object directly to avoid unnecessary conversion.

The CAS expects IPLDKind (CID), not a string.

```diff
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)
```

Also update the method signature:

```diff
     async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:
```
652-654: Improve exception handling and pass the CID directly.

Catch specific exceptions and pass the CID object without stringification.

```diff
         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            await self.set_pointer(key, data_cid_obj)
+        except (ConnectionError, TimeoutError, RuntimeError) as e:
+            raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
```

Also update the set_pointer signature:

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```
853-853: Store CID object directly, not as string.

Metadata CIDs should be stored as CID objects for consistency.

```diff
-        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid
+        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid
```

Note: This line already stores the CID object correctly; no change needed.
193-200: Add rank check and use `zip(..., strict=True)`.

Validate that shapes have the same dimensionality and make zip behavior explicit.

```diff
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
         if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )
```
279-281: Remove duplicate decode and improve exception handling.

The root object is decoded twice (lines 272 and 281), and the broad `Exception` catch could hide important errors.

```diff
         try:
             self._root_obj = dag_cbor.decode(root_bytes)
             if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
                 raise ValueError(
                     "Root object is not a valid dictionary with 'chunks' key."
                 )
             if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
                 raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
+            raise ValueError(f"Failed to decode root object: {e}") from e
```
577-577: Pass CID objects directly to CAS.load.

Converting CIDs to strings breaks compatibility with CAS backends that expect native CID objects.

```diff
-        data = await self.cas.load(str(metadata_cid_obj))
+        data = await self.cas.load(metadata_cid_obj)
```
594-615: Avoid stringifying the chunk CID.

Pass the CID object directly to maintain type consistency.

```diff
-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None
         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
                     req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
         data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
         )
```
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)
762-765: Use strict=True for zip to catch mismatched dimensions.

Make the zip behavior explicit to catch potential bugs.

```diff
         global_coords = tuple(
             c_local + c_offset
-            for c_local, c_offset in zip(local_coords, chunk_offset)
+            for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
         )
```
801-803: Use strict=True for zip consistency.

Make the zip behavior explicit as recommended in other locations.

```diff
         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )
```
624-629: Improve metadata key detection logic.

The current check `not key == "zarr.json"` with length-based detection is brittle. Consider a more explicit approach.

```diff
         if (
             key.endswith("zarr.json")
             and not key.startswith("time/")
             and not key.startswith(("lat/", "latitude/"))
             and not key.startswith(("lon/", "longitude/"))
-            and not key == "zarr.json"
+            and key != "zarr.json"  # Exclude root zarr.json
         ):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- py_hamt/sharded_zarr_store.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444)
tests/test_sharded_zarr_store_coverage.py (12)
load (346-358), load (387-388), load (427-428), load (452-463), load (487-494), load (583-584), load (745-748), load (797-799), load (854-857), save (496-501), save (654-656), save (690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
336-338: Avoid specifying long messages outside the exception class
(TRY003)
376-378: Avoid specifying long messages outside the exception class
(TRY003)
425-427: Avoid specifying long messages outside the exception class
(TRY003)
433-433: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
433-433: Avoid specifying long messages outside the exception class
(TRY003)
436-436: Avoid specifying long messages outside the exception class
(TRY003)
449-449: Avoid specifying long messages outside the exception class
(TRY003)
471-471: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
505-505: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
524-524: Avoid specifying long messages outside the exception class
(TRY003)
574-576: Avoid specifying long messages outside the exception class
(TRY003)
605-607: Avoid specifying long messages outside the exception class
(TRY003)
620-620: Avoid specifying long messages outside the exception class
(TRY003)
633-633: Avoid specifying long messages outside the exception class
(TRY003)
653-653: Do not catch blind exception: Exception
(BLE001)
654-654: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
654-654: Avoid specifying long messages outside the exception class
(TRY003)
706-706: Avoid specifying long messages outside the exception class
(TRY003)
713-713: Avoid specifying long messages outside the exception class
(TRY003)
741-741: Avoid specifying long messages outside the exception class
(TRY003)
764-764: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
787-787: Avoid specifying long messages outside the exception class
(TRY003)
794-794: Avoid specifying long messages outside the exception class
(TRY003)
796-798: Avoid specifying long messages outside the exception class
(TRY003)
803-803: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
830-830: Avoid specifying long messages outside the exception class
(TRY003)
836-838: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/sharded_zarr_store.py (2)
20-140: Excellent memory-bounded LRU cache implementation!

The `MemoryBoundedLRUCache` is well-designed with:
- Accurate memory sizing using `sys.getsizeof`
- Protection of dirty shards from eviction
- Thread-safe operations with asyncio.Lock
- Proper LRU eviction strategy
This addresses the concern raised in PR comments about unbounded cache growth.
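A compact sketch of the design points listed above, so the eviction-protection idea is concrete. All names here are hypothetical, not py_hamt's actual `MemoryBoundedLRUCache`; sizing uses `sys.getsizeof` as a rough estimate only:

```python
import sys
from collections import OrderedDict

class BoundedCache:
    """Toy memory-bounded cache that never evicts dirty (unflushed) entries."""

    def __init__(self, max_bytes: int):
        self._cache: OrderedDict[int, list] = OrderedDict()
        self._dirty: set[int] = set()
        self._max_bytes = max_bytes
        self._size = 0

    def _sizeof(self, value) -> int:
        # Rough estimate: container plus top-level elements only.
        return sys.getsizeof(value) + sum(sys.getsizeof(v) for v in value)

    def put(self, key: int, value: list, dirty: bool = False) -> None:
        if key in self._cache:
            self._size -= self._sizeof(self._cache.pop(key))
        self._cache[key] = value
        self._size += self._sizeof(value)
        if dirty:
            self._dirty.add(key)
        self._evict()

    def mark_clean(self, key: int) -> None:
        self._dirty.discard(key)
        self._evict()

    def _evict(self) -> None:
        # Walk oldest-first; skip dirty entries, whose contents are unflushed.
        for key in list(self._cache):
            if self._size <= self._max_bytes:
                break
            if key in self._dirty:
                continue
            self._size -= self._sizeof(self._cache.pop(key))
```

Note the trade-off flagged elsewhere in this review: when every entry is dirty, `_evict` cannot free anything and the bound is exceeded until a flush marks entries clean.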
548-556: LGTM - Proper CID handling in flush.

The flush method correctly ensures metadata CIDs are CID objects before encoding and properly marks shards as clean after successful writes.
Actionable comments posted: 3
♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)
193-201: Add rank check and use zip(strict=True) to catch shape mismatches early.

```diff
-    if not all(cs > 0 for cs in self._chunk_shape):
+    if len(self._array_shape) != len(self._chunk_shape):
+        raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+    if not all(cs > 0 for cs in self._chunk_shape):
         raise ValueError("All chunk_shape dimensions must be positive.")
     if not all(s >= 0 for s in self._array_shape):
         raise ValueError("All array_shape dimensions must be non-negative.")
-    self._chunks_per_dim = tuple(
-        math.ceil(a / c) if c > 0 else 0
-        for a, c in zip(self._array_shape, self._chunk_shape)
-    )
+    self._chunks_per_dim = tuple(
+        math.ceil(a / c) if c > 0 else 0
+        for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+    )
```
269-281: Decode once; narrow exceptions and chain them.

```diff
-    try:
-        self._root_obj = dag_cbor.decode(root_bytes)
-        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-            raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
-        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-            raise ValueError("shard_cids is not a list.")
-    except Exception as e:
-        raise ValueError(f"Failed to decode root object: {e}")
+    try:
+        self._root_obj = dag_cbor.decode(root_bytes)
+        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+            raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+            raise ValueError("shard_cids is not a list.")
+    except (ValueError, dag_cbor.CBORDecodeError) as e:
+        raise ValueError("Failed to decode root object.") from e
```
437-445: Don't stringify CIDs when loading shards; pass IPLDKind through.

```diff
     if shard_cid_obj:
         self._pending_shard_loads[shard_idx] = asyncio.Event()
-        shard_cid_str = str(shard_cid_obj)
-        await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+        await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # pass CID/bytes directly
```
560-577: Metadata path: avoid str() on CIDs.

```diff
-    data = await self.cas.load(str(metadata_cid_obj))
+    data = await self.cas.load(metadata_cid_obj)
```
508-513: Root CID equality should normalize type to avoid CID vs bytes vs str mismatches.

```diff
-    return self._root_cid == other._root_cid
+    return str(self._root_cid) == str(other._root_cid)
```
546-559: Avoid forcing the root CID to str; keep IPLDKind internally and return str from `flush()`.

```diff
         root_obj_bytes = dag_cbor.encode(self._root_obj)
         new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-        self._root_cid = str(new_root_cid)
+        self._root_cid = new_root_cid
         self._dirty_root = False
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)
```
646-654: Pass the CID to `set_pointer` and chain the exception; don’t stringify.

```diff
         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
+            await self.set_pointer(key, data_cid_obj)
         except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            raise RuntimeError(f"Failed to save data for key {key}") from e
```
656-676: `set_pointer` should accept `CID | str` and normalize; the current str-only contract is brittle.

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)
779-810: Simplify `num_shards` math; no need for a None-guard; add `zip(strict=True)` above.

Minor cleanup: `_num_shards` is always an int; remove the `is not None` guard.

```diff
-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards
```
20-45: Note: `sys.getsizeof` underestimates composite memory; OK as a heuristic.

If precise caps matter, consider tracking element counts and average sizes, or use a sampling approach.
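A quick illustration of the caveat, using only the standard library (`deep_sizeof` is an illustrative helper, not part of the store): `sys.getsizeof` on a list counts only the pointer array, not the payloads it references.

```python
import sys

payloads = [b"x" * 1024 for _ in range(100)]  # ~100 KiB of chunk data
shallow = sys.getsizeof(payloads)  # counts only the list's pointer array

def deep_sizeof(obj: object) -> int:
    """Naive recursive estimate; double-counts shared objects, ignores cycles."""
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_sizeof(item) for item in obj)
    elif isinstance(obj, dict):
        size += sum(deep_sizeof(k) + deep_sizeof(v) for k, v in obj.items())
    return size

deep = deep_sizeof(payloads)  # also counts the referenced byte payloads
```

The shallow figure misses roughly 100 KiB of payload bytes here, which is why `sys.getsizeof` is only safe as a heuristic.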
76-101: Eviction stalls when all shards are dirty; cap the dirty set or allow a forced evict.

As written, memory can exceed `max_memory_bytes` if everything is dirty. Consider a soft cap on dirty entries or a "force flush before exceed" hook.
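One possible shape for such a soft cap, as a standalone sketch (`DirtyAwareLRU` and its methods are illustrative names, not the store's API): evict the oldest clean entries first and refuse to drop dirty ones, so callers must flush before memory can be reclaimed.

```python
from collections import OrderedDict

class DirtyAwareLRU:
    """Sketch: evict least-recently-used *clean* entries; never drop dirty ones."""

    def __init__(self, max_items: int) -> None:
        self.max_items = max_items
        self._cache: OrderedDict[int, object] = OrderedDict()
        self._dirty: set[int] = set()

    def put(self, key: int, value: object, dirty: bool = False) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key)
        if dirty:
            self._dirty.add(key)
        self._evict()

    def mark_clean(self, key: int) -> None:
        """Typically called after a flush persists the shard."""
        self._dirty.discard(key)
        self._evict()

    def _evict(self) -> None:
        # Walk from oldest to newest, skipping entries that still need a flush.
        for key in list(self._cache):
            if len(self._cache) <= self.max_items:
                break
            if key in self._dirty:
                continue  # cannot evict unflushed data
            del self._cache[key]
```

Note the cap stays soft: if every entry is dirty, the cache still grows past `max_items` until something is marked clean.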
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`py_hamt/sharded_zarr_store.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
`ContentAddressedStore` (13-60), `load` (35-42, 78-115, 446-505), `save` (28-32, 73-76, 405-444)

tests/test_sharded_zarr_store_coverage.py (12)

`load` (346-358, 387-388, 427-428, 452-463, 487-494, 583-584, 745-748, 797-799, 854-857), `save` (496-501, 654-656, 690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
283-285: Avoid specifying long messages outside the exception class
(TRY003)
295-297: Avoid specifying long messages outside the exception class
(TRY003)
320-320: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
375-377: Avoid specifying long messages outside the exception class
(TRY003)
424-426: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
432-432: Avoid specifying long messages outside the exception class
(TRY003)
435-435: Avoid specifying long messages outside the exception class
(TRY003)
448-448: Avoid specifying long messages outside the exception class
(TRY003)
470-470: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
504-504: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
523-523: Avoid specifying long messages outside the exception class
(TRY003)
573-575: Avoid specifying long messages outside the exception class
(TRY003)
604-606: Avoid specifying long messages outside the exception class
(TRY003)
619-619: Avoid specifying long messages outside the exception class
(TRY003)
632-632: Avoid specifying long messages outside the exception class
(TRY003)
652-652: Do not catch blind exception: Exception
(BLE001)
653-653: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
653-653: Avoid specifying long messages outside the exception class
(TRY003)
705-705: Avoid specifying long messages outside the exception class
(TRY003)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
740-740: Avoid specifying long messages outside the exception class
(TRY003)
763-763: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
786-786: Avoid specifying long messages outside the exception class
(TRY003)
793-793: Avoid specifying long messages outside the exception class
(TRY003)
795-797: Avoid specifying long messages outside the exception class
(TRY003)
802-802: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
829-829: Avoid specifying long messages outside the exception class
(TRY003)
835-837: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)
839-846: Fix JSON decode bug: `json.loads` expects str, not bytes.

This path will raise `TypeError`. Decode the bytes first.

```diff
-        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
-        zarr_metadata_json = json.loads(old_zarr_metadata_bytes)
+        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
+        zarr_metadata_json = json.loads(old_zarr_metadata_bytes.decode("utf-8"))
```

Likely an incorrect or invalid review comment.
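For what it's worth, the premise can be checked directly: since Python 3.6, `json.loads` accepts `bytes`/`bytearray` (it auto-detects UTF-8/16/32), so the explicit decode is a style preference rather than a crash fix.

```python
import json

payload = b'{"zarr_format": 3, "node_type": "array"}'

# json.loads has accepted bytes/bytearray since Python 3.6 (it sniffs the
# UTF encoding), so decoding first is optional -- though more explicit.
from_bytes = json.loads(payload)
from_str = json.loads(payload.decode("utf-8"))
```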
```python
    async def __contains__(self, shard_idx: int) -> bool:
        async with self._cache_lock:
            return shard_idx in self._cache
```
🛠️ Refactor suggestion
async `__contains__` is invalid Python; replace it with an explicit async method (deadlock/logic risk).

Magic methods cannot be async; using `in` on the cache would return a coroutine, not a bool. Provide an explicit async API instead.
```diff
-    async def __contains__(self, shard_idx: int) -> bool:
-        async with self._cache_lock:
-            return shard_idx in self._cache
+    async def contains(self, shard_idx: int) -> bool:
+        async with self._cache_lock:
+            return shard_idx in self._cache
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    async def contains(self, shard_idx: int) -> bool:
        async with self._cache_lock:
            return shard_idx in self._cache
```
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 122-125, remove the async magic
method definition (async def __contains__) because dunder methods cannot be
async; instead implement an explicit async API like async def
contains_shard(self, shard_idx: int) -> bool that acquires self._cache_lock and
returns shard_idx in self._cache, and replace or keep a synchronous
__contains__(self, shard_idx: int) -> bool that performs a non-blocking
membership check (return shard_idx in self._cache) if you still want Python's
"in" to work (noting it will be racy); update all call sites to use await
contains_shard(...) where a locked, async check is required.
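A minimal reproduction of the trap (class names here are illustrative): Python's `in` operator calls `__contains__` synchronously and coerces the returned coroutine to `True`, even for missing keys.

```python
import asyncio
import warnings

warnings.simplefilter("ignore", RuntimeWarning)  # silence "never awaited"

class BadCache:
    def __init__(self) -> None:
        self._cache = {1: "shard"}

    # Python invokes __contains__ synchronously, so this returns a
    # coroutine object -- which `in` then coerces to True.
    async def __contains__(self, shard_idx: int) -> bool:
        return shard_idx in self._cache

class GoodCache:
    def __init__(self) -> None:
        self._cache = {1: "shard"}
        self._lock = asyncio.Lock()

    async def contains(self, shard_idx: int) -> bool:
        async with self._lock:
            return shard_idx in self._cache

async def main() -> tuple[bool, bool]:
    wrong = 999 in BadCache()  # True, despite the missing key!
    right = await GoodCache().contains(999)  # False, as expected
    return wrong, right

wrong, right = asyncio.run(main())
```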
```python
    async def _fetch_and_cache_full_shard(
        self,
        shard_idx: int,
        shard_cid: str,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        """
        Fetch a shard from CAS and cache it, with retry logic for transient errors.

        Args:
            shard_idx: The index of the shard to fetch.
            shard_cid: The CID of the shard.
            max_retries: Maximum number of retry attempts for transient errors.
            retry_delay: Delay between retry attempts in seconds.
        """
        for attempt in range(max_retries):
            try:
                shard_data_bytes = await self.cas.load(shard_cid)
                decoded_shard = dag_cbor.decode(shard_data_bytes)
                if not isinstance(decoded_shard, list):
                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
                await self._shard_data_cache.put(shard_idx, decoded_shard)
                # Always set the Event to unblock waiting coroutines
                if shard_idx in self._pending_shard_loads:
                    self._pending_shard_loads[shard_idx].set()
                    del self._pending_shard_loads[shard_idx]
                return  # Success
            except (ConnectionError, TimeoutError) as e:
                # Handle transient errors (e.g., network issues)
                if attempt < max_retries - 1:
                    await asyncio.sleep(
                        retry_delay * (2**attempt)
                    )  # Exponential backoff
                    continue
                else:
                    raise RuntimeError(
                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
                    )
```
🛠️ Refactor suggestion
Ensure pending waiters are always released; accept IPLDKind and handle decode errors.
Without signaling on failure, other tasks can hang for 60s. Also, avoid forcing CIDs to str and catch CBOR decode errors explicitly.
```diff
     async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID | bytes | str,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:
@@
-        for attempt in range(max_retries):
-            try:
-                shard_data_bytes = await self.cas.load(shard_cid)
-                decoded_shard = dag_cbor.decode(shard_data_bytes)
-                if not isinstance(decoded_shard, list):
-                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
-                await self._shard_data_cache.put(shard_idx, decoded_shard)
-                # Always set the Event to unblock waiting coroutines
-                if shard_idx in self._pending_shard_loads:
-                    self._pending_shard_loads[shard_idx].set()
-                    del self._pending_shard_loads[shard_idx]
-                return  # Success
-            except (ConnectionError, TimeoutError) as e:
+        for attempt in range(max_retries):
+            try:
+                shard_data_bytes = await self.cas.load(shard_cid)
+                decoded_shard = dag_cbor.decode(shard_data_bytes)
+                if not isinstance(decoded_shard, list):
+                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
+                await self._shard_data_cache.put(shard_idx, decoded_shard)
+                return  # Success
+            except dag_cbor.CBORDecodeError as e:
+                # Non-transient: don't retry CBOR decode failures
+                raise ValueError(f"Failed to decode shard {shard_idx}.") from e
+            except (ConnectionError, TimeoutError) as e:
                 # Handle transient errors (e.g., network issues)
                 if attempt < max_retries - 1:
                     await asyncio.sleep(
                         retry_delay * (2**attempt)
                     )  # Exponential backoff
                     continue
                 else:
-                    raise RuntimeError(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
+                    raise RuntimeError(
+                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts."
+                    ) from e
+            finally:
+                # Always release waiters if present
+                if shard_idx in self._pending_shard_loads:
+                    self._pending_shard_loads[shard_idx].set()
+                    self._pending_shard_loads.pop(shard_idx, None)
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.12.2)
320-320: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 299-337, ensure pending waiters
are always released and decoding errors are handled explicitly: keep shard_cid
as-is (do not force to str), catch dag_cbor.DecodingError (or the library's
specific CBOR decode exception) in addition to TypeError, and treat valid
decoded shard types as list/IPLD list-kind rather than only python list; on any
failure (decode error, non-list kind, or exhausted retries) set and remove the
Event in self._pending_shard_loads[shard_idx] so waiting coroutines are
unblocked, and re-raise a clear RuntimeError after cleaning up. Also use
exponential backoff as already implemented for transient
ConnectionError/TimeoutError.
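The release-waiters-in-`finally` idea can be isolated into a small helper; the `load` callable, the key type, and the pending-events dict below are assumptions for the sketch, not the store's real signatures.

```python
import asyncio

async def fetch_with_backoff(
    load, key, pending_events, max_retries: int = 3, retry_delay: float = 0.01
):
    """Retry transient errors with exponential backoff; always release waiters."""
    try:
        for attempt in range(max_retries):
            try:
                return await load(key)
            except (ConnectionError, TimeoutError) as exc:
                if attempt == max_retries - 1:
                    raise RuntimeError(
                        f"failed to fetch {key!r} after {max_retries} attempts"
                    ) from exc
                await asyncio.sleep(retry_delay * (2**attempt))
    finally:
        # Runs on success *and* failure, so no waiter can hang forever.
        event = pending_events.pop(key, None)
        if event is not None:
            event.set()
```

Because the cleanup lives in `finally` at function scope (not inside the retry loop), waiters are released exactly once, whether the fetch succeeded or exhausted its retries.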
```python
        chunk_cid_str = str(chunk_cid_obj)

        req_offset = None
        req_length = None
        req_suffix = None

        if byte_range:
            if isinstance(byte_range, RangeByteRequest):
                req_offset = byte_range.start
                if byte_range.end is not None:
                    if byte_range.start > byte_range.end:
                        raise ValueError(
                            f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                        )
                    req_length = byte_range.end - byte_range.start
            elif isinstance(byte_range, OffsetByteRequest):
                req_offset = byte_range.offset
            elif isinstance(byte_range, SuffixByteRequest):
                req_suffix = byte_range.suffix
        data = await self.cas.load(
            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
        )
        return prototype.buffer.from_bytes(data)
```
Zero-length Range must return empty buffer; also avoid str() on chunk CID.
Current logic computes length 0, yielding an invalid HTTP Range (end = start - 1). Return empty bytes instead and pass CID directly.
```diff
-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None
         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
-                    req_length = byte_range.end - byte_range.start
+                    if byte_range.end == byte_range.start:
+                        return prototype.buffer.from_bytes(b"")
+                    req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )
         return prototype.buffer.from_bytes(data)
```

📝 Committable suggestion
```python
        req_offset = None
        req_length = None
        req_suffix = None
        if byte_range:
            if isinstance(byte_range, RangeByteRequest):
                req_offset = byte_range.start
                if byte_range.end is not None:
                    if byte_range.start > byte_range.end:
                        raise ValueError(
                            f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                        )
                    if byte_range.end == byte_range.start:
                        return prototype.buffer.from_bytes(b"")
                    req_length = byte_range.end - byte_range.start
            elif isinstance(byte_range, OffsetByteRequest):
                req_offset = byte_range.offset
            elif isinstance(byte_range, SuffixByteRequest):
                req_suffix = byte_range.suffix
        data = await self.cas.load(
            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
        )
        return prototype.buffer.from_bytes(data)
```
🧰 Tools
🪛 Ruff (0.12.2)
604-606: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 593 to 615, the code currently
converts the chunk CID to string and calls cas.load even for zero-length ranges
which produces an invalid HTTP Range; instead, do not call str() on
chunk_cid_obj — pass the CID object directly to cas.load — and detect a
zero-length RangeByteRequest (end == start) returning
prototype.buffer.from_bytes(b"") immediately without calling cas.load; only set
and pass req_length when length > 0 and skip cas.load entirely for the
zero-length case.

Summary by CodeRabbit
New Features
Enhancements
Tests
Documentation
Chores