feat: Sharding #68
Conversation
WalkthroughAdds sharded and flat Zarr store implementations, a HAMT→Sharded converter + CLI, partial-read support end-to-end (CAS→HAMT→Zarr), RPC pin management in KuboCAS, many async tests/benchmarks, small export/example/tooling updates, and new contributor docs ( Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client as Client
  participant Zarr as ZarrHAMTStore
  participant HAMT as HAMT
  participant CAS as ContentAddressedStore / KuboCAS
  participant Gateway as IPFS Gateway
  Client->>Zarr: get(key, prototype, byte_range)
  Zarr->>Zarr: _map_byte_request(byte_range) → (offset, length, suffix)
  Zarr->>HAMT: get(key, offset, length, suffix)
  HAMT->>CAS: load(cid, offset, length, suffix)
  alt partial read
    CAS->>Gateway: HTTP GET (Range header)
    Gateway-->>CAS: 206 Partial Content
  else full read
    CAS->>Gateway: HTTP GET
    Gateway-->>CAS: 200 OK
  end
  CAS-->>HAMT: bytes
  HAMT-->>Zarr: bytes/decoded
  Zarr-->>Client: buffer
```
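The `_map_byte_request` step in the diagram normalizes a Zarr byte request into the `(offset, length, suffix)` triple that flows down through the HAMT to the CAS. A minimal sketch of that mapping, using hypothetical stand-ins for Zarr's `ByteRequest` variants (the real store works with `zarr.abc.store` request objects, so the class names here are illustrative only):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-ins for Zarr's ByteRequest variants.
@dataclass
class RangeRequest:      # bytes [start, end)
    start: int
    end: int

@dataclass
class OffsetRequest:     # everything from `offset` onward
    offset: int

@dataclass
class SuffixRequest:     # the last `suffix` bytes
    suffix: int

def map_byte_request(req) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Normalize a byte request into (offset, length, suffix).

    (None, None, None) means a full read, matching the diagram's else branch.
    """
    if req is None:
        return (None, None, None)
    if isinstance(req, RangeRequest):
        return (req.start, req.end - req.start, None)
    if isinstance(req, OffsetRequest):
        return (req.offset, None, None)
    if isinstance(req, SuffixRequest):
        return (None, None, req.suffix)
    raise TypeError(f"Unsupported byte request: {req!r}")
```

Only one of the three modes is ever populated, which is what lets the CAS layer decide between a plain GET and a ranged GET.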
```mermaid
sequenceDiagram
  autonumber
  participant User as User / CLI
  participant Conv as convert_hamt_to_sharded
  participant CAS as KuboCAS
  participant HStore as HAMT (ro)
  participant SStore as ShardedZarrStore (rw)
  User->>Conv: start(hamt_root_cid, chunks_per_shard)
  Conv->>HStore: open(read_only, hamt_root_cid)
  Conv->>SStore: open(rw, array_shape, chunk_shape, chunks_per_shard)
  loop each key
    Conv->>HStore: list/get pointer
    Conv->>SStore: set_pointer(key, CID)
  end
  Conv->>SStore: flush()
  SStore-->>Conv: sharded_root_cid
  Conv-->>User: sharded_root_cid
```
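The conversion loop above is essentially a pointer copy: every chunk key in the read-only HAMT is re-registered in the sharded store, which groups pointers into fixed-size shards before `flush()` persists them. A toy, synchronous simulation of that grouping (plain dicts stand in for the HAMT and the sharded store; the helper name and shapes are illustrative, not the real async API):

```python
def convert_pointers(hamt: dict, chunks_per_shard: int) -> list:
    """Copy key -> CID pointers out of a HAMT-like mapping and group them
    into shards of at most `chunks_per_shard` (key, CID) pairs each.

    Returns the shard list a flush() would then persist.
    """
    pointers = sorted(hamt.items())  # deterministic order for reproducible shards
    return [
        pointers[i : i + chunks_per_shard]
        for i in range(0, len(pointers), chunks_per_shard)
    ]

# Toy HAMT contents: chunk keys mapped to (fake) CIDs.
hamt = {f"temp/c/{i}": f"bafy...{i}" for i in range(5)}
shards = convert_pointers(hamt, chunks_per_shard=2)
```

With five chunks and two chunks per shard this yields three shards, the last holding the remainder.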
```mermaid
sequenceDiagram
  autonumber
  participant Dev as Dev
  participant Kubo as KuboCAS
  participant RPC as IPFS RPC
  Dev->>Kubo: pin_cid(CID)
  Kubo->>RPC: /api/v0/pin/add?arg=CID
  RPC-->>Kubo: OK
  Dev->>Kubo: pin_ls()
  Kubo->>RPC: /api/v0/pin/ls
  RPC-->>Kubo: {Pins}
  Dev->>Kubo: pin_update(old,new)
  Kubo->>RPC: /api/v0/pin/update?arg=old&arg=new
  RPC-->>Kubo: OK
  Dev->>Kubo: unpin_cid(CID)
  Kubo->>RPC: /api/v0/pin/rm?arg=CID
  RPC-->>Kubo: OK
```
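All four pin operations hit Kubo's HTTP RPC under `/api/v0/pin/…`, with CIDs passed as repeated `arg` query parameters (`pin/update` takes two, as the diagram shows). A small sketch of the URL construction only — the real `KuboCAS` issues these as async HTTP requests, and the helper name here is illustrative:

```python
def pin_rpc_url(rpc_base: str, op: str, *cids: str) -> str:
    """Build a Kubo pin RPC URL; op is one of 'add', 'ls', 'rm', 'update'.

    Each CID becomes a repeated `arg` query parameter, matching the
    endpoints shown in the sequence diagram.
    """
    url = f"{rpc_base}/api/v0/pin/{op}"
    if cids:
        url += "?" + "&".join(f"arg={cid}" for cid in cids)
    return url

base = "http://127.0.0.1:5001"  # default Kubo RPC address
```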
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests

Tip: 👮 Agentic pre-merge checks are now available in preview. Pro plan users can enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment
          variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be
          documented in the "Breaking Change" section of the PR description and in
          CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported
          from package entry points or explicitly marked as internal).
```
Actionable comments posted: 12
🧹 Nitpick comments (9)
tests/test_cpc_compare.py (1)
1-129: Consider removing or properly integrating this benchmark file. This entire file is commented out. If these benchmarks are intended for future use, consider:
- Moving them to a separate benchmarks directory
- Adding proper documentation explaining when/how to use them
- If keeping them, address the security concern with the empty API key on line 28
py_hamt/hamt_to_sharded_converter.py (2)
34-34: Remove the unnecessary `consolidated=True` parameter. The `consolidated=True` parameter in `xr.open_zarr` is typically used when reading consolidated metadata. Since you're opening a HAMT store that doesn't use Zarr's consolidation feature, this parameter has no effect and should be removed for clarity. Apply this diff:

```diff
- source_dataset = xr.open_zarr(store=source_store, consolidated=True)
+ source_dataset = xr.open_zarr(store=source_store)
```
39-44: Simplify array and chunk shape extraction. The intermediate variables `array_shape_tuple` and `chunk_shape_tuple` are redundant. Apply this diff to reduce local variables:

```diff
  ordered_dims = list(source_dataset.dims)
- array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
- array_shape = array_shape_tuple
- chunk_shape = chunk_shape_tuple
+ array_shape = tuple(source_dataset.dims[dim] for dim in ordered_dims)
+ chunk_shape = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
```

tests/test_benchmark_stores.py (1)
1-313: Consider removing or properly integrating this benchmark file. Like `test_cpc_compare.py`, this entire file is commented out. If these benchmarks are needed:
- Uncomment and ensure they work with the current test infrastructure
- Move to a dedicated benchmarks directory if they're not part of regular CI
- Remove if they're no longer needed
tests/test_converter.py (2)
152-152: Remove unnecessary `consolidated=True` parameter. Similar to the converter module, the `consolidated=True` parameter has no effect with HAMT stores. Apply this diff:

```diff
- test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+ test_ds.to_zarr(store=source_hamt_store, mode="w")
```

201-201: Remove unnecessary `consolidated=True` parameter. Apply this diff:

```diff
- test_ds.to_zarr(store=source_hamt_store, mode="w", consolidated=True)
+ test_ds.to_zarr(store=source_hamt_store, mode="w")
```

py_hamt/store.py (1)
85-96: Simplify control flow by removing unnecessary else. The partial read logic is correct, but the control flow can be simplified. Apply this diff to improve readability:

```diff
 if offset is not None:
     start = offset
     if length is not None:
         end = start + length
         return data[start:end]
-    else:
-        return data[start:]
+    return data[start:]
 elif suffix is not None:
     # If only length is given, assume start from 0
     return data[-suffix:]
-else:  # Full load
-    return data
+# Full load
+return data
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
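The simplified control flow above implements three read modes over an in-memory blob. As a standalone, synchronous sketch of those semantics (hedged: the real method is async and lives on the in-memory CAS; the function name here is illustrative):

```python
def slice_load(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Apply (offset, length, suffix) partial-read semantics to full bytes."""
    if offset is not None:
        start = offset
        if length is not None:
            return data[start : start + length]  # bounded range
        return data[start:]                      # open-ended read from offset
    if suffix is not None:
        return data[-suffix:]                    # last `suffix` bytes
    return data                                  # full load
```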
py_hamt/sharded_zarr_store.py (2)
4-4: Remove unused import. The `cast` import from typing is not used.

```diff
-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
```
🪛 Ruff (0.11.9)
4-4: `typing.cast` imported but unused. Remove unused import: `typing.cast`
(F401)
190-191: Consider making excluded array prefixes configurable. The hardcoded exclusion list might limit flexibility for different use cases. Consider making this configurable:

```diff
-excluded_array_prefixes = {"time", "lat", "lon", "latitude", "longitude"}
+# Add as class attribute or constructor parameter
+excluded_array_prefixes = self._excluded_array_prefixes or {"time", "lat", "lon", "latitude", "longitude"}
```

This would allow users to customize which arrays are excluded from sharding.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- CLAUDE.md (1 hunks)
- py_hamt/__init__.py (1 hunks)
- py_hamt/flat_zarr_store.py (1 hunks)
- py_hamt/hamt.py (2 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- py_hamt/store.py (5 hunks)
- py_hamt/zarr_hamt_store.py (3 hunks)
- tests/test_benchmark_stores.py (1 hunks)
- tests/test_converter.py (1 hunks)
- tests/test_cpc_compare.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
- tests/test_zarr_ipfs.py (0 hunks)
- tests/test_zarr_ipfs_encrypted.py (4 hunks)
- tests/test_zarr_ipfs_partial.py (1 hunks)
- tests/testing_utils.py (1 hunks)
💤 Files with no reviewable changes (1)
- tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code Graph Analysis (5)
py_hamt/hamt.py (4)
- py_hamt/zarr_hamt_store.py (1): get (116-149)
- py_hamt/encryption_hamt_store.py (1): get (146-170)
- tests/test_zarr_ipfs_partial.py (1): load (275-285)
- py_hamt/store.py (3): load (33-40), load (58-95), load (286-315)

py_hamt/hamt_to_sharded_converter.py (4)
- py_hamt/hamt.py (3): HAMT (288-705), build (386-395), keys (673-687)
- py_hamt/store.py (1): KuboCAS (98-315)
- py_hamt/sharded_zarr_store.py (4): ShardedZarrStore (13-632), open (62-89), set_pointer (447-489), flush (323-360)
- py_hamt/zarr_hamt_store.py (1): ZarrHAMTStore (12-289)

py_hamt/__init__.py (3)
- py_hamt/store.py (3): ContentAddressedStore (11-40), InMemoryCAS (43-95), KuboCAS (98-315)
- py_hamt/sharded_zarr_store.py (1): ShardedZarrStore (13-632)
- py_hamt/hamt_to_sharded_converter.py (2): convert_hamt_to_sharded (12-83), sharded_converter_cli (86-123)

py_hamt/zarr_hamt_store.py (2)
- py_hamt/encryption_hamt_store.py (1): get (146-170)
- py_hamt/hamt.py (1): get (593-608)

py_hamt/store.py (3)
- tests/test_zarr_ipfs_partial.py (1): load (275-285)
- py_hamt/hamt.py (4): load (138-139), load (159-168), load (276-285), get (593-608)
- py_hamt/zarr_hamt_store.py (1): get (116-149)
🪛 Ruff (0.11.9)
py_hamt/hamt_to_sharded_converter.py
3-3: json imported but unused
Remove unused import: json
(F401)
5-5: typing.Dict imported but unused
Remove unused import
(F401)
5-5: typing.Any imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.Buffer imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.BufferPrototype imported but unused
Remove unused import
(F401)
tests/test_converter.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
6-6: aiohttp imported but unused
Remove unused import: aiohttp
(F401)
171-171: f-string without any placeholders
Remove extraneous f prefix
(F541)
241-241: Local variable kubo_cas is assigned to but never used
Remove assignment to unused variable kubo_cas
(F841)
py_hamt/sharded_zarr_store.py
4-4: typing.cast imported but unused
Remove unused import: typing.cast
(F401)
tests/test_sharded_zarr_store.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
525-525: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
602-602: Redefinition of unused test_sharded_zarr_store_init_invalid_shapes from line 490
(F811)
637-637: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
🪛 Pylint (3.3.7)
py_hamt/hamt_to_sharded_converter.py
[refactor] 12-12: Too many local variables (20/15)
(R0914)
tests/test_converter.py
[refactor] 46-46: Too many local variables (21/15)
(R0914)
[refactor] 135-135: Too many local variables (16/15)
(R0914)
[refactor] 184-184: Too many local variables (17/15)
(R0914)
py_hamt/store.py
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
tests/test_zarr_ipfs_partial.py
[refactor] 62-62: Too many local variables (34/15)
(R0914)
[refactor] 188-188: Redundant comparison - zhs_read == zhs_read
(R0124)
[refactor] 62-62: Too many statements (90/50)
(R0915)
[refactor] 229-229: Too few public methods (0/2)
(R0903)
[refactor] 238-238: Too many local variables (38/15)
(R0914)
[refactor] 238-238: Too many statements (100/50)
(R0915)
py_hamt/sharded_zarr_store.py
[refactor] 13-13: Too many instance attributes (14/7)
(R0902)
[refactor] 62-62: Too many arguments (8/5)
(R0913)
[refactor] 186-186: Too many return statements (8/6)
(R0911)
[refactor] 274-277: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 363-363: Too many local variables (17/15)
(R0914)
[refactor] 363-363: Too many return statements (7/6)
(R0911)
[refactor] 363-363: Too many branches (14/12)
(R0912)
[refactor] 552-557: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
tests/test_sharded_zarr_store.py
[refactor] 162-162: Too many local variables (16/15)
(R0914)
[refactor] 212-212: Too many local variables (17/15)
(R0914)
[refactor] 270-270: Too many local variables (18/15)
(R0914)
[refactor] 313-313: Too many local variables (18/15)
(R0914)
[error] 602-602: function already defined line 490
(E0102)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/hamt_to_sharded_converter.py
[error] 1-12: Ruff lint errors: I001 import block un-sorted; multiple F401 unused imports (json, typing.Dict, typing.Any, zarr.core.buffer.Buffer, BufferPrototype).
tests/test_converter.py
[error] 1-241: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, aiohttp); F541 f-string without placeholders; F841 assigned but unused local variable 'kubo_cas'.
py_hamt/__init__.py
[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.
py_hamt/sharded_zarr_store.py
[error] 1-1: Mypy type errors: multiple argument type incompatibilities and unsupported operand types involving 'None' and 'int' or 'tuple[int, ...] | None'.
[error] 93-678: Mypy type errors: Argument and operator type mismatches, index errors, and union attribute errors detected in multiple lines.
[error] 1-10: Ruff lint errors: I001 import block un-sorted; F401 unused import 'typing.cast'.
tests/test_sharded_zarr_store.py
[error] 550-687: Mypy errors: dict indexing on possibly None type and redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
🔇 Additional comments (14)
tests/testing_utils.py (1)
172-172: ❓ Verification inconclusive
Consider using a pinned version instead of "master-latest" for test stability.
Using `"ipfs/kubo:master-latest"` introduces non-deterministic behavior in tests as the underlying IPFS version can change between test runs. This can lead to:
- Flaky tests due to upstream breaking changes
- Difficulty reproducing test failures
- Inconsistent behavior across different environments
Consider using a specific version tag:

```diff
- image = "ipfs/kubo:master-latest"
+ image = "ipfs/kubo:v0.35.0"  # or latest stable version
```

If you need to use the latest version, please verify that this change doesn't introduce test instability by running the test suite multiple times:
🏁 Script executed:
```shell
#!/bin/bash
# Verify test stability with new IPFS version
for i in {1..5}; do
  echo "Test run $i"
  pytest tests/ --ipfs -v
  if [ $? -ne 0 ]; then
    echo "Test failed on run $i"
    exit 1
  fi
done
echo "All test runs passed"
```

Length of output: 251
Pin the IPFS Docker image for stable tests
Switching from a moving “master-latest” tag to a fixed release prevents non-deterministic failures in your integration suite.
• File: tests/testing_utils.py, line 172
• Current: `image = "ipfs/kubo:master-latest"`
• Suggested change:

```diff
- image = "ipfs/kubo:master-latest"
+ image = "ipfs/kubo:v0.35.0"  # or the latest stable release
```

Please run your full test suite in an environment where pytest is available to confirm there are no regressions or flakiness.
tests/test_zarr_ipfs_encrypted.py (3)
8-8: LGTM! Import change improves type annotations. The addition of the `IPLDKind` import supports better type annotations throughout the test file.

93-93: LGTM! Explicit type annotation improves code clarity. The explicit type annotation makes the variable's type clear and aligns with the import of `IPLDKind`.

109-109: LGTM! Type ignore comments are appropriately used. The `# type: ignore` comments suppress legitimate mypy warnings where the type checker cannot infer the correct types for Zarr store operations. Also applies to: 196-196
py_hamt/__init__.py (2)
3-3: LGTM! Exposing InMemoryCAS in the public API. Adding `InMemoryCAS` to the public exports makes sense for testing scenarios and provides users with an in-memory storage option alongside `KuboCAS`. Also applies to: 14-14

5-6: LGTM! New sharding functionality properly exposed. The addition of `ShardedZarrStore`, `convert_hamt_to_sharded`, and `sharded_converter_cli` to the public API enables users to leverage the new sharding capabilities introduced in this PR. Also applies to: 16-18
🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[error] 1-6: Ruff lint error I001: Import block is un-sorted or un-formatted.
py_hamt/hamt.py (2)
11-11: LGTM! Import addition supports new optional parameters. The `Optional` import is needed for the new type annotations in the enhanced `get` method.

593-604: LGTM! Well-implemented partial read support. The enhancement to the `get` method properly implements partial read functionality:
- Backward compatibility: All new parameters are optional with `None` defaults
- Clear interface: Parameters follow standard byte range conventions (`offset`, `length`, `suffix`)
- Proper forwarding: Parameters are correctly passed to the underlying `cas.load` call
- Type safety: Method signature includes proper type annotations

This aligns well with the corresponding changes in `py_hamt/store.py` where the `ContentAddressedStore.load` method was enhanced with the same parameters.

CLAUDE.md (1)
1-88: Excellent documentation for AI-assisted development. This documentation file provides comprehensive guidance for working with the repository:
- Clear development workflow: Covers setup, testing, linting, and type checking
- Accurate architecture description: Well-documented overview of CAS, HAMT, and Zarr integration layers
- Helpful testing guidance: Explains IPFS integration requirements and test execution options
- Design pattern documentation: Captures key architectural decisions like async operations and content addressing
This will be valuable for AI-assisted development and new contributors understanding the codebase.
py_hamt/zarr_hamt_store.py (1)
151-163: Well-implemented concurrent partial value retrieval. The implementation correctly uses `asyncio.gather` for efficient concurrent fetching of multiple byte ranges. Good addition for supporting partial reads.
33-39: Good API extension for partial data retrieval. The addition of optional parameters for partial loading is well-designed and maintains backward compatibility with default values.

298-311: Well-implemented HTTP Range header support. The Range header construction correctly follows the HTTP standard with proper handling of all three cases: range with start and end, open-ended range, and suffix range.
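The three cases map onto `Range` header values roughly as follows. This is an illustrative sketch, not the exact `KuboCAS` code; note that HTTP byte ranges are end-inclusive per RFC 9110, hence the `- 1`:

```python
from typing import Optional

def range_header(offset: Optional[int] = None,
                 length: Optional[int] = None,
                 suffix: Optional[int] = None) -> Optional[str]:
    """Return a Range header value, or None when a full (200 OK) read is wanted."""
    if offset is not None and length is not None:
        return f"bytes={offset}-{offset + length - 1}"  # closed range, end-inclusive
    if offset is not None:
        return f"bytes={offset}-"                       # open-ended range
    if suffix is not None:
        return f"bytes=-{suffix}"                       # suffix range: last N bytes
    return None                                         # full read
```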
tests/test_zarr_ipfs_partial.py (1)
431-461: Excellent test coverage for partial reads. The test comprehensively covers all partial read scenarios including edge cases and error conditions. Well structured!
py_hamt/sharded_zarr_store.py (1)
323-361: Well-structured flush implementation. The flush method properly handles saving dirty shards first, then the root object, with appropriate error handling and state management.
tests/test_sharded_zarr_store.py

```python
import asyncio
import math

import numpy as np
import pandas as pd
import pytest
import xarray as xr
from zarr.abc.store import RangeByteRequest
import zarr.core.buffer
import dag_cbor

from py_hamt import HAMT, KuboCAS, ShardedZarrStore
from py_hamt.zarr_hamt_store import ZarrHAMTStore


@pytest.fixture(scope="module")
def random_zarr_dataset():
    """Creates a random xarray Dataset for benchmarking."""
    # Using a slightly larger dataset for a more meaningful benchmark
    times = pd.date_range("2024-01-01", periods=100)
    lats = np.linspace(-90, 90, 18)
    lons = np.linspace(-180, 180, 36)

    temp = np.random.randn(len(times), len(lats), len(lons))
    precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))

    ds = xr.Dataset(
        {
            "temp": (["time", "lat", "lon"], temp),
        },
        coords={"time": times, "lat": lats, "lon": lons},
    )

    # Define chunking for the store
    ds = ds.chunk({"time": 20, "lat": 18, "lon": 36})
    yield ds
```
🛠️ Refactor suggestion
Remove unused imports and variables
Several imports and variables are not used in the test file.
Apply these changes to clean up the code:
```diff
-import asyncio
-import math
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore
```

Also remove the unused precip variable:
```diff
 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))

 ds = xr.Dataset(
     {
         "temp": (["time", "lat", "lon"], temp),
     },
     coords={"time": times, "lat": lats, "lon": lons},
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import numpy as np
import pandas as pd
import pytest
import xarray as xr
from zarr.abc.store import RangeByteRequest
import zarr.core.buffer
import dag_cbor

from py_hamt import KuboCAS, ShardedZarrStore


@pytest.fixture(scope="module")
def random_zarr_dataset():
    """Creates a random xarray Dataset for benchmarking."""
    # Using a slightly larger dataset for a more meaningful benchmark
    times = pd.date_range("2024-01-01", periods=100)
    lats = np.linspace(-90, 90, 18)
    lons = np.linspace(-180, 180, 36)

    temp = np.random.randn(len(times), len(lats), len(lons))

    ds = xr.Dataset(
        {
            "temp": (["time", "lat", "lon"], temp),
        },
        coords={"time": times, "lat": lats, "lon": lons},
    )

    # Define chunking for the store
    ds = ds.chunk({"time": 20, "lat": 18, "lon": 36})
    yield ds
```
🧰 Tools
🪛 Ruff (0.11.9)
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[error] 1-637: Ruff lint errors: I001 import block un-sorted; F401 unused imports (asyncio, math, py_hamt.HAMT, py_hamt.zarr_hamt_store.ZarrHAMTStore); F841 assigned but unused local variables 'precip', 'root_cid'; F811 redefinition of test function 'test_sharded_zarr_store_init_invalid_shapes'.
🤖 Prompt for AI Agents
In tests/test_sharded_zarr_store.py lines 1 to 37, remove the unused imports
math, RangeByteRequest, zarr.core.buffer, dag_cbor, HAMT, KuboCAS,
ShardedZarrStore, and ZarrHAMTStore. Also delete the unused variable precip in
the random_zarr_dataset fixture to clean up the code and avoid unnecessary
clutter.
py_hamt/flat_zarr_store.py
Outdated
| # Functional Flat Store for Zarr using IPFS. This is ultimately the "Best" way to do it | ||
| # but only if the metadata is small enough to fit in a single CBOR object. | ||
| # This is commented out because we are not using it. However I did not want to delete it | ||
| # because it is a good example of how to implement a Zarr store using a flat index | ||
|
|
||
| # import asyncio | ||
| # import math | ||
| # from collections.abc import AsyncIterator, Iterable | ||
| # from typing import Optional, cast | ||
|
|
||
| # import dag_cbor | ||
| # import zarr.abc.store | ||
| # import zarr.core.buffer | ||
| # from zarr.core.common import BytesLike | ||
|
|
||
| # from .store import ContentAddressedStore | ||
|
|
||
|
|
||
| # class FlatZarrStore(zarr.abc.store.Store): | ||
| # """ | ||
| # Implements the Zarr Store API using a flat, predictable layout for chunk CIDs. | ||
|
|
||
| # This store bypasses the need for a HAMT, offering direct, calculated | ||
| # access to chunk data based on a mathematical formula. It is designed for | ||
| # dense, multi-dimensional arrays where chunk locations are predictable. | ||
|
|
||
| # The store is structured around a single root CBOR object. This root object contains: | ||
| # 1. A dictionary mapping metadata keys (like 'zarr.json') to their CIDs. | ||
| # 2. A single CID pointing to a large, contiguous block of bytes (the "flat index"). | ||
| # This flat index is a concatenation of the CIDs of all data chunks. | ||
|
|
||
| # Accessing a chunk involves: | ||
| # 1. Loading the root object (if not cached). | ||
| # 2. Calculating the byte offset of the chunk's CID within the flat index. | ||
| # 3. Fetching that specific CID using a byte-range request on the flat index. | ||
| # 4. Fetching the actual chunk data using the retrieved CID. | ||
|
|
||
| # ### Sample Code | ||
| # ```python | ||
| # import xarray as xr | ||
| # import numpy as np | ||
| # from py_hamt import KuboCAS, FlatZarrStore | ||
|
|
||
| # # --- Write --- | ||
| # ds = xr.Dataset( | ||
| # {"data": (("t", "y", "x"), np.arange(24).reshape(2, 3, 4))}, | ||
| # ) | ||
| # cas = KuboCAS() | ||
| # # When creating, must provide array shape and chunk shape | ||
| # store_write = await FlatZarrStore.open( | ||
| # cas, | ||
| # read_only=False, | ||
| # array_shape=ds.data.shape, | ||
| # chunk_shape=ds.data.encoding['chunks'] | ||
| # ) | ||
| # ds.to_zarr(store=store_write, mode="w") | ||
| # root_cid = await store_write.flush() # IMPORTANT: flush to get final root CID | ||
| # print(f"Finished writing. Root CID: {root_cid}") | ||
|
|
||
|
|
||
| # # --- Read --- | ||
| # store_read = await FlatZarrStore.open(cas, read_only=True, root_cid=root_cid) | ||
| # ds_read = xr.open_zarr(store=store_read) | ||
| # print("Read back dataset:") | ||
| # print(ds_read) | ||
| # xr.testing.assert_identical(ds, ds_read) | ||
| # ``` | ||
| # """ | ||
|
|
||
| # def __init__( | ||
| # self, cas: ContentAddressedStore, read_only: bool, root_cid: Optional[str] | ||
| # ): | ||
| # """Use the async `open()` classmethod to instantiate this class.""" | ||
| # super().__init__(read_only=read_only) | ||
| # self.cas = cas | ||
| # self._root_cid = root_cid | ||
| # self._root_obj: Optional[dict] = None | ||
| # self._flat_index_cache: Optional[bytearray] = None | ||
| # self._cid_len: Optional[int] = None | ||
| # self._array_shape: Optional[tuple[int, ...]] = None | ||
| # self._chunk_shape: Optional[tuple[int, ...]] = None | ||
| # self._chunks_per_dim: Optional[tuple[int, ...]] = None | ||
| # self._dirty = False | ||
|
|
||
| # @classmethod | ||
| # async def open( | ||
| # cls, | ||
| # cas: ContentAddressedStore, | ||
| # read_only: bool, | ||
| # root_cid: Optional[str] = None, | ||
| # *, | ||
| # array_shape: Optional[tuple[int, ...]] = None, | ||
| # chunk_shape: Optional[tuple[int, ...]] = None, | ||
| # cid_len: int = 59, # Default for base32 v1 CIDs like bafy... | ||
| # ) -> "FlatZarrStore": | ||
| # """ | ||
| # Asynchronously opens an existing FlatZarrStore or initializes a new one. | ||
|
|
||
| # Args: | ||
| # cas: The Content Addressed Store (e.g., KuboCAS). | ||
| # read_only: If True, the store is in read-only mode. | ||
| # root_cid: The root CID of an existing store to open. Required for read_only. | ||
| # array_shape: The full shape of the Zarr array. Required for a new writeable store. | ||
| # chunk_shape: The shape of a single chunk. Required for a new writeable store. | ||
| # cid_len: The expected byte length of a CID string. | ||
| # """ | ||
| # store = cls(cas, read_only, root_cid) | ||
| # if root_cid: | ||
| # await store._load_root_from_cid() | ||
| # elif not read_only: | ||
| # if not all([array_shape, chunk_shape]): | ||
| # raise ValueError( | ||
| # "array_shape and chunk_shape must be provided for a new store." | ||
| # ) | ||
| # store._initialize_new_root(array_shape, chunk_shape, cid_len) | ||
| # else: | ||
| # raise ValueError("root_cid must be provided for a read-only store.") | ||
| # return store | ||
|
|
||
| # def _initialize_new_root( | ||
| # self, | ||
| # array_shape: tuple[int, ...], | ||
| # chunk_shape: tuple[int, ...], | ||
| # cid_len: int, | ||
| # ): | ||
| # self._array_shape = array_shape | ||
| # self._chunk_shape = chunk_shape | ||
| # self._cid_len = cid_len | ||
| # self._chunks_per_dim = tuple( | ||
| # math.ceil(a / c) for a, c in zip(array_shape, chunk_shape) | ||
| # ) | ||
| # self._root_obj = { | ||
| # "manifest_version": "flat_zarr_v1", | ||
| # "metadata": {}, | ||
| # "chunks": { | ||
| # "cid": None, # Will be filled on first flush | ||
| # "array_shape": list(self._array_shape), | ||
| # "chunk_shape": list(self._chunk_shape), | ||
| # "cid_byte_length": self._cid_len, | ||
| # }, | ||
| # } | ||
| # self._dirty = True | ||
|
|
||
| # async def _load_root_from_cid(self): | ||
| # if not self._root_cid: | ||
| # raise ValueError("Cannot load root without a root_cid.") | ||
| # root_bytes = await self.cas.load(self._root_cid) | ||
| # self._root_obj = dag_cbor.decode(root_bytes) | ||
| # chunk_info = self._root_obj.get("chunks", {}) | ||
| # self._array_shape = tuple(chunk_info["array_shape"]) | ||
| # self._chunk_shape = tuple(chunk_info["chunk_shape"]) | ||
| # self._cid_len = chunk_info["cid_byte_length"] | ||
| # self._chunks_per_dim = tuple( | ||
| # math.ceil(a / c) for a, c in zip(self._array_shape, self._chunk_shape) | ||
| # ) | ||
|
|
||
| # def _parse_chunk_key(self, key: str) -> Optional[tuple[int, ...]]: | ||
| # if not self._array_shape or not key.startswith("c/"): | ||
| # return None | ||
| # parts = key.split("/") | ||
| # if len(parts) != len(self._array_shape) + 1: | ||
| # return None | ||
| # try: | ||
| # return tuple(map(int, parts[1:])) | ||
| # except (ValueError, IndexError): | ||
| # return None | ||
|
|
||
| # async def set_partial_values( | ||
| # self, key_start_values: Iterable[tuple[str, int, BytesLike]] | ||
| # ) -> None: | ||
| # """@private""" | ||
| # raise NotImplementedError("Partial writes are not supported by this store.") | ||
|
|
||
| # async def get_partial_values( | ||
| # self, | ||
| # prototype: zarr.core.buffer.BufferPrototype, | ||
| # key_ranges: Iterable[tuple[str, zarr.abc.store.ByteRequest | None]], | ||
| # ) -> list[zarr.core.buffer.Buffer | None]: | ||
| # """ | ||
| # Retrieves multiple keys or byte ranges concurrently. | ||
| # """ | ||
| # tasks = [self.get(key, prototype, byte_range) for key, byte_range in key_ranges] | ||
| # results = await asyncio.gather(*tasks) | ||
| # return results | ||
|
|
||
| # def __eq__(self, other: object) -> bool: | ||
| # """@private""" | ||
| # if not isinstance(other, FlatZarrStore): | ||
| # return NotImplemented | ||
| # return self._root_cid == other._root_cid | ||
|
|
||
| # def _get_chunk_offset(self, chunk_coords: tuple[int, ...]) -> int: | ||
| # linear_index = 0 | ||
| # multiplier = 1 | ||
| # for i in reversed(range(len(self._chunks_per_dim))): | ||
| # linear_index += chunk_coords[i] * multiplier | ||
| # multiplier *= self._chunks_per_dim[i] | ||
| # return linear_index * self._cid_len | ||
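The offset computation above is a plain row-major (C-order) linearization of the chunk grid. A self-contained sketch, using an illustrative 59-byte CID slot width:

```python
import math

def chunk_offset(coords: tuple[int, ...],
                 chunks_per_dim: tuple[int, ...],
                 cid_len: int) -> int:
    """Byte offset of a chunk's CID slot in the flat index (row-major order)."""
    linear_index = 0
    multiplier = 1
    for i in reversed(range(len(chunks_per_dim))):
        linear_index += coords[i] * multiplier
        multiplier *= chunks_per_dim[i]
    return linear_index * cid_len

# A 4x3 grid of chunks with 59-byte CID slots (59 is illustrative).
grid = (4, 3)
assert chunk_offset((0, 0), grid, 59) == 0
assert chunk_offset((0, 2), grid, 59) == 2 * 59   # last chunk of first row
assert chunk_offset((1, 0), grid, 59) == 3 * 59   # one full row of 3 chunks
assert chunk_offset((3, 2), grid, 59) == 11 * 59  # final slot
assert math.prod(grid) * 59 == 12 * 59            # total flat-index size
```

Because every slot has a fixed width, a single `(offset, length)` range read against the flat index retrieves exactly one chunk's CID.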
|
|
||
| # async def flush(self) -> str: | ||
| # """ | ||
| # Writes all pending changes (metadata and chunk index) to the CAS | ||
| # and returns the new root CID. This MUST be called after all writes are complete. | ||
| # """ | ||
| # if self.read_only or not self._dirty: | ||
| # return self._root_cid | ||
|
|
||
| # if self._flat_index_cache is not None: | ||
| # flat_index_cid_obj = await self.cas.save( | ||
| # bytes(self._flat_index_cache), codec="raw" | ||
| # ) | ||
| # self._root_obj["chunks"]["cid"] = str(flat_index_cid_obj) | ||
|
|
||
| # root_obj_bytes = dag_cbor.encode(self._root_obj) | ||
| # new_root_cid_obj = await self.cas.save(root_obj_bytes, codec="dag-cbor") | ||
| # self._root_cid = str(new_root_cid_obj) | ||
| # self._dirty = False | ||
| # return self._root_cid | ||
|
|
||
| # async def get( | ||
| # self, | ||
| # key: str, | ||
| # prototype: zarr.core.buffer.BufferPrototype, | ||
| # byte_range: zarr.abc.store.ByteRequest | None = None, | ||
| # ) -> zarr.core.buffer.Buffer | None: | ||
| # """@private""" | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
|
|
||
| # chunk_coords = self._parse_chunk_key(key) | ||
| # try: | ||
| # # Metadata request | ||
| # if chunk_coords is None: | ||
| # metadata_cid = self._root_obj["metadata"].get(key) | ||
| # if metadata_cid is None: | ||
| # return None | ||
| # data = await self.cas.load(metadata_cid) | ||
| # return prototype.buffer.from_bytes(data) | ||
|
|
||
| # # Chunk data request | ||
| # flat_index_cid = self._root_obj["chunks"]["cid"] | ||
| # if flat_index_cid is None: | ||
| # return None | ||
|
|
||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # chunk_cid_bytes = await self.cas.load( | ||
| # flat_index_cid, offset=offset, length=self._cid_len | ||
| # ) | ||
|
|
||
| # if all(b == 0 for b in chunk_cid_bytes): | ||
| # return None # Chunk doesn't exist | ||
|
|
||
| # chunk_cid = chunk_cid_bytes.decode("ascii") | ||
| # data = await self.cas.load(chunk_cid) | ||
| # return prototype.buffer.from_bytes(data) | ||
|
|
||
| # except (KeyError, IndexError): | ||
| # return None | ||
|
|
||
| # async def set(self, key: str, value: zarr.core.buffer.Buffer) -> None: | ||
| # """@private""" | ||
| # if self.read_only: | ||
| # raise ValueError("Cannot write to a read-only store.") | ||
| # if self._root_obj is None: | ||
| # raise RuntimeError("Store not initialized for writing.") | ||
|
|
||
| # self._dirty = True | ||
| # raw_bytes = value.to_bytes() | ||
| # value_cid_obj = await self.cas.save(raw_bytes, codec="raw") | ||
| # value_cid = str(value_cid_obj) | ||
|
|
||
| # if len(value_cid) != self._cid_len: | ||
| # raise ValueError( | ||
| # f"Inconsistent CID length. Expected {self._cid_len}, got {len(value_cid)}" | ||
| # ) | ||
|
|
||
| # chunk_coords = self._parse_chunk_key(key) | ||
|
|
||
| # if chunk_coords is None: # Metadata | ||
| # self._root_obj["metadata"][key] = value_cid | ||
| # return | ||
|
|
||
| # # Chunk Data | ||
| # if self._flat_index_cache is None: | ||
| # num_chunks = math.prod(self._chunks_per_dim) | ||
| # self._flat_index_cache = bytearray(num_chunks * self._cid_len) | ||
|
|
||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # self._flat_index_cache[offset : offset + self._cid_len] = value_cid.encode( | ||
| # "ascii" | ||
| # ) | ||
|
|
||
| # # --- Other required zarr.abc.store methods --- | ||
|
|
||
| # async def exists(self, key: str) -> bool: | ||
| # """@private""" | ||
| # # A more efficient version might check for null bytes in the flat index | ||
| # # but this is functionally correct. | ||
|
|
||
| # # TODO: Optimize this check | ||
| # return True | ||
|
|
||
|
|
||
| # # return (await self.get(key, zarr.core.buffer.Buffer.prototype, None)) is not None | ||
|
|
||
| # @property | ||
| # def supports_writes(self) -> bool: | ||
| # """@private""" | ||
| # return not self.read_only | ||
|
|
||
| # @property | ||
| # def supports_partial_writes(self) -> bool: | ||
| # """@private""" | ||
| # return False # Each chunk is an immutable object | ||
|
|
||
| # @property | ||
| # def supports_deletes(self) -> bool: | ||
| # """@private""" | ||
| # return not self.read_only | ||
|
|
||
| # async def delete(self, key: str) -> None: | ||
| # if self.read_only: | ||
| # raise ValueError("Cannot delete from a read-only store.") | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
| # chunk_coords = self._parse_chunk_key(key) | ||
| # if chunk_coords is None: | ||
| # if key in self._root_obj["metadata"]: | ||
| # del self._root_obj["metadata"][key] | ||
| # self._dirty = True | ||
| # return | ||
| # else: | ||
| # raise KeyError(f"Metadata key '{key}' not found.") | ||
| # flat_index_cid = self._root_obj["chunks"]["cid"] | ||
| # if self._flat_index_cache is None: | ||
| # if not flat_index_cid: | ||
| # raise KeyError(f"Chunk key '{key}' not found in non-existent index.") | ||
| # self._flat_index_cache = bytearray(await self.cas.load(flat_index_cid)) | ||
| # offset = self._get_chunk_offset(chunk_coords) | ||
| # if all(b == 0 for b in self._flat_index_cache[offset : offset + self._cid_len]): | ||
| # raise KeyError(f"Chunk key '{key}' not found.") | ||
| # self._flat_index_cache[offset : offset + self._cid_len] = bytearray(self._cid_len) | ||
| # self._dirty = True | ||
|
|
||
| # @property | ||
| # def supports_listing(self) -> bool: | ||
| # """@private""" | ||
| # return True | ||
|
|
||
| # async def list(self) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # if self._root_obj is None: | ||
| # await self._load_root_from_cid() | ||
| # for key in self._root_obj["metadata"]: | ||
| # yield key | ||
| # # Note: Listing all chunk keys without reading the index is non-trivial. | ||
| # # A full implementation might need an efficient way to iterate non-null chunks. | ||
| # # This basic version only lists metadata. | ||
|
|
||
| # async def list_prefix(self, prefix: str) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # async for key in self.list(): | ||
| # if key.startswith(prefix): | ||
| # yield key | ||
|
|
||
| # async def list_dir(self, prefix: str) -> AsyncIterator[str]: | ||
| # """@private""" | ||
| # # This simplified version only works for the root. | ||
| # if prefix == "": | ||
| # seen = set() | ||
| # async for key in self.list(): | ||
| # name = key.split('/')[0] | ||
| # if name not in seen: | ||
| # seen.add(name) | ||
| # yield name No newline at end of file |
🛠️ Refactor suggestion
Consider removing commented-out code
The entire file consists of commented-out code. While the header explains it's kept as a reference implementation, it's better to remove it and rely on version control history.
Consider one of these approaches:
- Remove the file entirely and reference it in documentation or a README if needed
- Move it to an `examples/` or `reference/` directory if it has educational value
- If parts are useful, extract them into a documentation file
Keeping large blocks of commented code in the main codebase can cause confusion and maintenance overhead.
🤖 Prompt for AI Agents
In py_hamt/flat_zarr_store.py lines 1 to 375, the entire file is commented-out
code serving as a reference implementation. To reduce confusion and maintenance
overhead, remove this commented-out code from the main codebase. Instead, either
delete the file and keep it accessible via version control history, move it to
an examples or reference directory if it has educational value, or extract
useful parts into a separate documentation file.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
py_hamt/store.py (1)
58-96: 🛠️ Refactor suggestion
length-only reads and value checking are missing; logic disagrees with comment
- Calling `load(id, length=8)` currently falls through to the full load path instead of returning the first 8 bytes.
- The comment “If only length is given, assume start from 0” is therefore incorrect.
- Negative values or a simultaneous `offset` + `suffix` are silently accepted, which can produce surprising slices.
- There is an unnecessary `elif`/`else` chain after early `return`s (pylint R1705).
Tighten the control flow and handle all combinations explicitly:
@@ - if offset is not None: - start = offset - if length is not None: - end = start + length - return data[start:end] - else: - return data[start:] - elif suffix is not None: # If only length is given, assume start from 0 - return data[-suffix:] - else: # Full load - return data + if (offset is not None) and (suffix is not None): + raise ValueError("Specify either offset/length or suffix, not both") + + if offset is not None: + if offset < 0: + raise ValueError("offset must be ≥ 0") + if length is None: + return data[offset:] + if length < 0: + raise ValueError("length must be ≥ 0") + return data[offset : offset + length] + + if suffix is not None: + if suffix < 0: + raise ValueError("suffix must be ≥ 0") + return data[-suffix:] + + if length is not None: # implicit offset == 0 + if length < 0: + raise ValueError("length must be ≥ 0") + return data[:length] + + return data # full blob🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
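For clarity, the semantics this comment asks for can be captured as a small pure function. This is a sketch of the *proposed* behaviour, not the library's current code; the name `slice_bytes` is illustrative:

```python
def slice_bytes(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Partial-read semantics: offset/length window, last-N suffix, or full blob."""
    if offset is not None and suffix is not None:
        raise ValueError("Specify either offset/length or suffix, not both")
    for name, value in (("offset", offset), ("length", length), ("suffix", suffix)):
        if value is not None and value < 0:
            raise ValueError(f"{name} must be >= 0")
    if offset is not None:
        return data[offset:] if length is None else data[offset : offset + length]
    if suffix is not None:
        return data[-suffix:]
    if length is not None:  # implicit offset == 0, the previously-missing case
        return data[:length]
    return data

blob = bytes(range(10))
assert slice_bytes(blob, offset=2, length=3) == blob[2:5]
assert slice_bytes(blob, length=8) == blob[:8]
assert slice_bytes(blob, suffix=4) == blob[-4:]
assert slice_bytes(blob) == blob
```

Flattening the branches this way also resolves the R1705 warning, since no `elif`/`else` follows an early `return`.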
🧹 Nitpick comments (2)
py_hamt/store.py (2)
33-40: Docstring/method contract not updated for new slice parameters
`ContentAddressedStore.load()` now accepts `offset`, `length`, and `suffix`, but the docstring (“Retrieve data.”) was left untouched. Add parameter descriptions and clarify precedence/valid-value ranges (e.g. that negative numbers are rejected and that `offset` & `suffix` are mutually exclusive).
85-95: Minor: remove superfluous `elif`/`else` branches after early `return`s
Once a branch returns, the following `elif`/`else` is redundant; flattening the logic (see previous diff) resolves the pylint R1705 warning.
🪛 Pylint (3.3.7)
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
py_hamt/store.py (5 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
py_hamt/store.py
[refactor] 87-91: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/store.py
Outdated
| async def load( | ||
| self, | ||
| id: IPLDKind, | ||
| offset: Optional[int] = None, | ||
| length: Optional[int] = None, | ||
| suffix: Optional[int] = None, | ||
| ) -> bytes: | ||
| """@private""" | ||
| cid = cast(CID, id) # CID is definitely in the IPLDKind type | ||
| cid = cast(CID, id) | ||
| url: str = self.gateway_base_url + str(cid) | ||
| headers: dict[str, str] = {} | ||
|
|
||
| # Construct the Range header if required | ||
| if offset is not None: | ||
| start = offset | ||
| if length is not None: | ||
| # Standard HTTP Range: bytes=start-end (inclusive) | ||
| end = start + length - 1 | ||
| headers["Range"] = f"bytes={start}-{end}" | ||
| else: | ||
| # Standard HTTP Range: bytes=start- (from start to end) | ||
| headers["Range"] = f"bytes={start}-" | ||
| elif suffix is not None: | ||
| # Standard HTTP Range: bytes=-N (last N bytes) | ||
| headers["Range"] = f"bytes=-{suffix}" | ||
|
|
||
| async with self._sem: # throttle gateway |
Partial-range header builder omits the length-only case and lacks validation
KuboCAS.load() mirrors the in-memory logic, so the same edge cases apply:
- `length` without `offset` results in a full GET (no `Range` header). Expected header: `Range: bytes=0-{length-1}`.
- Missing checks for negative values and for mutually-exclusive `offset` + `suffix`.
Proposed incremental fix:
@@
- if offset is not None:
+ if (offset is not None) and (suffix is not None):
+ raise ValueError("Specify either offset/length or suffix, not both")
+
+ if offset is not None:
start = offset
if length is not None:
@@
headers["Range"] = f"bytes={start}-"
- elif suffix is not None:
+ elif suffix is not None:
# Standard HTTP Range: bytes=-N (last N bytes)
headers["Range"] = f"bytes=-{suffix}"
+ elif length is not None:
+ # Range from byte 0 to length-1
+ headers["Range"] = f"bytes=0-{length-1}"Consider also asserting non-negative integers for all three parameters.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def load( | |
| self, | |
| id: IPLDKind, | |
| offset: Optional[int] = None, | |
| length: Optional[int] = None, | |
| suffix: Optional[int] = None, | |
| ) -> bytes: | |
| """@private""" | |
| cid = cast(CID, id) # CID is definitely in the IPLDKind type | |
| cid = cast(CID, id) | |
| url: str = self.gateway_base_url + str(cid) | |
| headers: dict[str, str] = {} | |
| # Construct the Range header if required | |
| if offset is not None: | |
| start = offset | |
| if length is not None: | |
| # Standard HTTP Range: bytes=start-end (inclusive) | |
| end = start + length - 1 | |
| headers["Range"] = f"bytes={start}-{end}" | |
| else: | |
| # Standard HTTP Range: bytes=start- (from start to end) | |
| headers["Range"] = f"bytes={start}-" | |
| elif suffix is not None: | |
| # Standard HTTP Range: bytes=-N (last N bytes) | |
| headers["Range"] = f"bytes=-{suffix}" | |
| async with self._sem: # throttle gateway | |
| async def load( | |
| self, | |
| id: IPLDKind, | |
| offset: Optional[int] = None, | |
| length: Optional[int] = None, | |
| suffix: Optional[int] = None, | |
| ) -> bytes: | |
| """@private""" | |
| cid = cast(CID, id) | |
| url: str = self.gateway_base_url + str(cid) | |
| headers: dict[str, str] = {} | |
| # Construct the Range header if required | |
| if (offset is not None) and (suffix is not None): | |
| raise ValueError("Specify either offset/length or suffix, not both") | |
| if offset is not None: | |
| start = offset | |
| if length is not None: | |
| # Standard HTTP Range: bytes=start-end (inclusive) | |
| end = start + length - 1 | |
| headers["Range"] = f"bytes={start}-{end}" | |
| else: | |
| # Standard HTTP Range: bytes=start- (from start to end) | |
| headers["Range"] = f"bytes={start}-" | |
| elif suffix is not None: | |
| # Standard HTTP Range: bytes=-N (last N bytes) | |
| headers["Range"] = f"bytes=-{suffix}" | |
| elif length is not None: | |
| # Range from byte 0 to length-1 | |
| headers["Range"] = f"bytes=0-{length-1}" | |
| async with self._sem: # throttle gateway |
🤖 Prompt for AI Agents
In py_hamt/store.py around lines 307 to 333, the load method's Range header
construction misses handling the case where length is provided without offset,
which should set the Range header to bytes=0-(length-1). Additionally, add
validation to ensure offset, length, and suffix are non-negative integers and
enforce that offset and suffix are not both provided simultaneously. Update the
logic to include these checks and construct the Range header accordingly to
cover all edge cases.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/test_cpc_compare.py (1)
1-128: 🛠️ Refactor suggestion
Dead-code test module — delete or convert to a real test
Every line in this file is commented out, so the test runner will never execute it. Carrying around disabled test code adds noise and sends mixed signals about intended coverage.
• If you still want the benchmark, move it to an active `pytest` test (mark it with `@pytest.mark.benchmark` or similar and gate it behind an opt-in flag).
• Otherwise, delete the file to avoid confusion.
♻️ Duplicate comments (8)
py_hamt/hamt_to_sharded_converter.py (1)
3-10: Remove the unused imports flagged by Ruff (duplicate reminder)
`json`, `Dict`, `Any`, `Buffer`, and `BufferPrototype` are not referenced anywhere in this module. Keeping them causes F401 warnings and breaks the linting gate.
-import json
-from typing import Dict, Any
-from zarr.core.buffer import Buffer, BufferPrototype
🪛 Ruff (0.11.9)
3-3:
jsonimported but unusedRemove unused import:
json(F401)
5-5:
typing.Dictimported but unusedRemove unused import
(F401)
5-5:
typing.Anyimported but unusedRemove unused import
(F401)
10-10:
zarr.core.buffer.Bufferimported but unusedRemove unused import
(F401)
10-10:
zarr.core.buffer.BufferPrototypeimported but unusedRemove unused import
(F401)
tests/test_converter.py (2)
1-7: Drop unused imports and sort the remaining block (duplicate reminder)
`asyncio` and `aiohttp` are unused, triggering F401. Remove them and keep standard-lib imports first.
-import asyncio
...
-from unittest.mock import patch
-import aiohttp
+from unittest.mock import patch
🪛 Ruff (0.11.9)
1-1:
asyncioimported but unusedRemove unused import:
asyncio(F401)
6-6:
aiohttpimported but unusedRemove unused import:
aiohttp(F401)
182-182: Remove the unnecessary f-string prefix
No interpolation happens here.
- assert f"New ShardedZarrStore Root CID" in captured.out
+ assert "New ShardedZarrStore Root CID" in captured.out
🪛 Ruff (0.11.9)
182-182: f-string without any placeholders
Remove extraneous
fprefix(F541)
tests/test_sharded_zarr_store.py (2)
1-2: Remove unused imports and variables
Several imports and the `precip` variable are not used in the test file. Apply this diff to clean up the code:
-import asyncio
-import math
 import numpy as np
 import pandas as pd
 import pytest
 import xarray as xr
 from zarr.abc.store import RangeByteRequest
 import zarr.core.buffer
 import dag_cbor
-from py_hamt import HAMT, KuboCAS, ShardedZarrStore
-from py_hamt.zarr_hamt_store import ZarrHAMTStore
+from py_hamt import KuboCAS, ShardedZarrStore
Also remove the unused `precip` variable:
 temp = np.random.randn(len(times), len(lats), len(lons))
-precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))
 ds = xr.Dataset(
Also applies to: 12-13, 25-25
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
asyncioimported but unusedRemove unused import:
asyncio(F401)
2-2:
mathimported but unusedRemove unused import:
math(F401)
647-732: ⚠️ Potential issue: Remove duplicate function definition
The function `test_sharded_zarr_store_init_invalid_shapes` is defined twice. The commented-out version (lines 509-596) should be removed along with this duplicate definition.
🪛 Ruff (0.11.9)
689-689: Local variable
root_cidis assigned to but never usedRemove assignment to unused variable
root_cid(F841)
py_hamt/store.py (1)
334-346: ⚠️ Potential issue: Partial-range header builder omits the `length`-only case and lacks validation
`KuboCAS.load()` has a bug where `length` without `offset` results in a full GET (no `Range` header). It should set `Range: bytes=0-{length-1}`. Apply this fix:
# Construct the Range header if required - if offset is not None: + if (offset is not None) and (suffix is not None): + raise ValueError("Specify either offset/length or suffix, not both") + + if offset is not None: start = offset if length is not None: # Standard HTTP Range: bytes=start-end (inclusive) end = start + length - 1 headers["Range"] = f"bytes={start}-{end}" else: # Standard HTTP Range: bytes=start- (from start to end) headers["Range"] = f"bytes={start}-" elif suffix is not None: # Standard HTTP Range: bytes=-N (last N bytes) headers["Range"] = f"bytes=-{suffix}" + elif length is not None: + # Range from byte 0 to length-1 + headers["Range"] = f"bytes=0-{length-1}"py_hamt/sharded_zarr_store.py (2)
325-334: Add type assertion for better type safety
After checking that `_cid_len` is not None, add an assertion to help mypy understand the type.
 if self._cid_len is None:  # Should be set
     raise RuntimeError(
         "Store not initialized: _cid_len is None for shard initialization."
     )
+assert self._cid_len is not None  # Type assertion for mypy
 if self._chunks_per_shard is None:
     raise RuntimeError(
         "Store not initialized: _chunks_per_shard is None for shard initialization."
     )
+assert self._chunks_per_shard is not None  # Type assertion for mypy
 # New shard or shard not yet written, initialize with zeros
 shard_size_bytes = self._chunks_per_shard * self._cid_len
488-498: ⚠️ Potential issue: Add explicit ByteRequest type validation in `get()`
The code assumes `byte_range` has `.start` and `.end` attributes, which will fail for `OffsetByteRequest` or `SuffixByteRequest` types. Replace the existing byte range handling:
- req_offset = byte_range.start if byte_range else None
- req_length = None
- if byte_range:
-     if byte_range.end is not None:
-         if (
-             byte_range.start > byte_range.end
-         ):  # Zarr allows start == stop for 0 length
-             raise ValueError(
-                 f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
-             )
-         req_length = byte_range.end - byte_range.start
+ from zarr.abc.store import RangeByteRequest, OffsetByteRequest, SuffixByteRequest
+
+ req_offset: Optional[int] = None
+ req_length: Optional[int] = None
+ if byte_range is not None:
+     if isinstance(byte_range, RangeByteRequest):
+         req_offset = byte_range.start
+         if byte_range.end is not None:
+             if byte_range.start > byte_range.end:
+                 raise ValueError(f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})")
+             req_length = byte_range.end - byte_range.start
+     elif isinstance(byte_range, OffsetByteRequest):
+         req_offset = byte_range.offset
+     elif isinstance(byte_range, SuffixByteRequest):
+         raise TypeError("SuffixByteRequest is not supported by ShardedZarrStore.get()")
+     else:
+         raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
🧹 Nitpick comments (12)
py_hamt/hamt_to_sharded_converter.py (1)
50-58: Missing async context manager / explicit close for `dest_store`
`ShardedZarrStore.open()` likely allocates buffers and holds a CAS reference.
Wrap it in an async context manager (if available) or explicitly call `await dest_store.close()` after `flush()` to avoid descriptor leaks.
py_hamt/manage_pins.py (1)
62-64: Propagate the target RPC explicitly for clarity
The tests call `pin_entire_dataset(target_rpc=rpc_base_url)`, but the CLI path omits the parameter. Passing it makes the behaviour explicit and avoids surprises if `ShardedZarrStore` later defaults to a different endpoint.
- await store.pin_entire_dataset()
+ await store.pin_entire_dataset(target_rpc=args.rpc_url)
Apply the same change for `unpin_entire_dataset`.
tests/test_sharded_zarr_pinning.py (2)
119-134: Fixed sleep can make the test flaky
Relying on `await asyncio.sleep(1)` assumes the Kubo daemon updates pin state within one second, which is optimistic on busy nodes.
Refactor to poll with a timeout:
async def _wait_until(predicate, timeout=30, interval=0.5):
    start = asyncio.get_event_loop().time()
    while not predicate():
        if asyncio.get_event_loop().time() - start > timeout:
            pytest.fail("Pin state did not converge in time")
        await asyncio.sleep(interval)

await _wait_until(lambda: expected_cids <= await get_pinned_cids(rpc_base_url))
12-20: Use GET for `pin/ls` for better compatibility
`/api/v0/pin/ls` accepts GET; POST works on recent Kubo but older nodes reject it. Switching to GET maximises compatibility.
- async with session.post(url, params={"type": "all"}) as resp:
+ async with session.get(url, params={"type": "all"}) as resp:
249-252: `kubo_cas` is created but unused
The context manager is only needed to satisfy type checks; drop the binding to silence F841.
- async with KuboCAS(
-     rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
- ) as kubo_cas:
+ async with KuboCAS(
+     rpc_base_url=rpc_base_url, gateway_base_url=gateway_base_url
+ ):
🪛 Ruff (0.11.9)
251-251: Local variable
kubo_casis assigned to but never usedRemove assignment to unused variable
kubo_cas(F841)
tests/test_sharded_zarr_store.py (1)
689-689: Remove unused variable assignment
The `root_cid` variable is assigned but never used.
- root_cid = await store.flush()
+ await store.flush()
🪛 Ruff (0.11.9)
689-689: Local variable
root_cidis assigned to but never usedRemove assignment to unused variable
root_cid(F841)
py_hamt/store.py (1)
94-104: Simplify control flow by removing unnecessary else
The else block after return can be removed to reduce nesting.
 if offset is not None:
     start = offset
     if length is not None:
         end = start + length
         return data[start:end]
-    else:
-        return data[start:]
+    return data[start:]
-elif suffix is not None:  # If only length is given, assume start from 0
+if suffix is not None:
     return data[-suffix:]
-else:  # Full load
-    return data
+# Full load
+return data
🪛 Pylint (3.3.7)
[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/sharded_zarr_store.py (5)
4-4: Remove unused import
The `cast` import is not used in this file.
-from typing import Optional, cast, Dict, List, Set, Tuple
+from typing import Optional, Dict, List, Set, Tuple
🪛 Ruff (0.11.9)
4-4:
typing.castimported but unusedRemove unused import:
typing.cast(F401)
828-833: Use or remove unused exception variable
The exception is assigned to `e` but never used. Either use it in error handling or remove the assignment.
 try:
     await self.cas.unpin_cid(
         chunk_cid_str, target_rpc=target_rpc
     )
-except Exception as e:
-    # ignore
+except Exception:
+    # Ignore failures - chunk might already be unpinned
     continue
🪛 Ruff (0.11.9)
831-831: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
844-847: Use exception information in warning message
The exception should be included in the warning message for better debugging.
try: await self.cas.unpin_cid(str(shard_cid), target_rpc=target_rpc) except Exception as e: - print(f"Warning: Could not unpin shard {str(shard_cid)}") + print(f"Warning: Could not unpin shard {str(shard_cid)}: {e}")🧰 Tools
🪛 Ruff (0.11.9)
845-845: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
854-858: Include exception details in warning messageAdd the exception information to the warning for better debugging.
try: await self.cas.unpin_cid(cid, target_rpc=target_rpc) print(f"Unpinned metadata CID {cid} from {target_rpc}...") except Exception as e: print( - f"Warning: Could not unpin metadata CID {cid}. Likely already unpinned." + f"Warning: Could not unpin metadata CID {cid}: {e}" )🧰 Tools
🪛 Ruff (0.11.9)
855-855: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
863-868: Include exception details in warning messageAdd the exception information to improve debugging.
try: await self.cas.unpin_cid(self._root_cid, target_rpc=target_rpc) print(f"Unpinned root CID {self._root_cid} from {target_rpc}...") except Exception as e: print( - f"Warning: Could not unpin root CID {self._root_cid}. Likely already unpinned." + f"Warning: Could not unpin root CID {self._root_cid}: {e}" )🧰 Tools
🪛 Ruff (0.11.9)
865-865: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- py_hamt/flat_zarr_store.py (1 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/manage_pins.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- py_hamt/store.py (7 hunks)
- tests/test_benchmark_stores.py (1 hunks)
- tests/test_converter.py (1 hunks)
- tests/test_cpc_compare.py (1 hunks)
- tests/test_sharded_zarr_pinning.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- py_hamt/flat_zarr_store.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_benchmark_stores.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
py_hamt/store.py (4)
tests/test_zarr_ipfs_partial.py (1)
load (275-285)
py_hamt/hamt.py (4)
load (138-139), load (159-168), load (276-285), get (593-608)
py_hamt/sharded_zarr_store.py (1)
get (413-501)
py_hamt/zarr_hamt_store.py (1)
get (116-149)
🪛 Ruff (0.11.9)
tests/test_converter.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
6-6: aiohttp imported but unused
Remove unused import: aiohttp
(F401)
182-182: f-string without any placeholders
Remove extraneous f prefix
(F541)
251-251: Local variable kubo_cas is assigned to but never used
Remove assignment to unused variable kubo_cas
(F841)
tests/test_sharded_zarr_store.py
1-1: asyncio imported but unused
Remove unused import: asyncio
(F401)
2-2: math imported but unused
Remove unused import: math
(F401)
12-12: py_hamt.HAMT imported but unused
Remove unused import: py_hamt.HAMT
(F401)
13-13: py_hamt.zarr_hamt_store.ZarrHAMTStore imported but unused
Remove unused import: py_hamt.zarr_hamt_store.ZarrHAMTStore
(F401)
25-25: Local variable precip is assigned to but never used
Remove assignment to unused variable precip
(F841)
689-689: Local variable root_cid is assigned to but never used
Remove assignment to unused variable root_cid
(F841)
py_hamt/sharded_zarr_store.py
4-4: typing.cast imported but unused
Remove unused import: typing.cast
(F401)
831-831: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
845-845: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
855-855: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
865-865: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
py_hamt/hamt_to_sharded_converter.py
3-3: json imported but unused
Remove unused import: json
(F401)
5-5: typing.Dict imported but unused
Remove unused import
(F401)
5-5: typing.Any imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.Buffer imported but unused
Remove unused import
(F401)
10-10: zarr.core.buffer.BufferPrototype imported but unused
Remove unused import
(F401)
🪛 Pylint (3.3.7)
tests/test_converter.py
[refactor] 52-52: Too many local variables (21/15)
(R0914)
[refactor] 145-145: Too many local variables (16/15)
(R0914)
[refactor] 196-196: Too many local variables (17/15)
(R0914)
tests/test_sharded_zarr_pinning.py
[refactor] 83-83: Too many local variables (16/15)
(R0914)
tests/test_sharded_zarr_store.py
[refactor] 160-160: Too many local variables (16/15)
(R0914)
[refactor] 210-210: Too many local variables (17/15)
(R0914)
[refactor] 268-268: Too many local variables (18/15)
(R0914)
[refactor] 311-311: Too many local variables (18/15)
(R0914)
py_hamt/store.py
[refactor] 96-100: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
py_hamt/sharded_zarr_store.py
[refactor] 13-13: Too many instance attributes (14/7)
(R0902)
[refactor] 68-68: Too many arguments (8/5)
(R0913)
[refactor] 204-204: Too many return statements (8/6)
(R0911)
[refactor] 296-299: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 289-289: Too many branches (13/12)
(R0912)
[refactor] 413-413: Too many local variables (17/15)
(R0914)
[refactor] 413-413: Too many return statements (7/6)
(R0911)
[refactor] 413-413: Too many branches (14/12)
(R0912)
[refactor] 638-643: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 795-795: Too many branches (15/12)
(R0912)
py_hamt/hamt_to_sharded_converter.py
[refactor] 13-13: Too many local variables (20/15)
(R0914)
🔇 Additional comments (1)
py_hamt/hamt_to_sharded_converter.py (1)
64-74: ❓ Verification inconclusive
Verify CID length before writing into shards
`cid_base32_str` is written verbatim, but `cid_len` is hard-coded to 59. A multibase-encoded CIDv1 can be longer than 59 chars (e.g. long multihash algorithms). Writing a longer string will silently truncate or corrupt the shard layout. Would you run a quick search to see whether any CID in real datasets exceeds 59 characters?
🏁 Script executed:
#!/bin/bash
# Scan existing shards (if any) for long CIDs
fd -e zarr | xargs -I{} grep -oE 'b[a-z0-9]{60,}' {} | head

Length of output: 65
Potential CID length mismatch risk
The converter assumes a fixed `cid_len = 59` for base32-encoded CIDv1, but longer multihash algorithms (e.g. sha2-512) will produce CIDs longer than 59 characters that are silently truncated, corrupting the shard layout. Please verify whether your data ever generates CIDs longer than 59 characters, or update the code to handle variable-length CIDs.
• File: `py_hamt/hamt_to_sharded_converter.py` (lines 64–74)
• Suggested fix – add a length guard before writing the pointer:

- await dest_store.set_pointer(key, cid_base32_str)
+ if len(cid_base32_str) > cid_len:
+     raise ValueError(f"CID length {len(cid_base32_str)} exceeds expected {cid_len}")
+ await dest_store.set_pointer(key, cid_base32_str)
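A quick way to see why 59 is not universal: the base32 string length of a CIDv1 follows directly from the multihash digest size. A minimal sketch, assuming single-byte varints for the version, codec, multihash code, and digest length (which holds for common codecs like dag-cbor and hashes like sha2-256/sha2-512):

```python
import math

def cidv1_base32_len(digest_len: int) -> int:
    """Expected string length of a base32 (multibase 'b') CIDv1.

    Byte layout: 1 version byte + 1 codec varint byte
    + 1 multihash code byte + 1 digest-length byte + digest bytes.
    Unpadded base32 encodes 5 bits per character; the 'b' prefix adds one.
    """
    total_bytes = 1 + 1 + 1 + 1 + digest_len
    return 1 + math.ceil(total_bytes * 8 / 5)

print(cidv1_base32_len(32))  # sha2-256 digest -> 59, the hard-coded cid_len
print(cidv1_base32_len(64))  # sha2-512 digest -> 110, would overflow a 59-char slot
```

So the 59-character assumption is really an assumption of sha2-256 with a single-byte codec; any wider digest breaks it.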
# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")

# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple
|
🛠️ Refactor suggestion
Chunk-shape derivation fails for unchunked datasets
source_dataset.chunks is None for unchunked arrays, so the comprehension will raise TypeError.
Consider a safer fallback:
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+ if source_dataset.chunks is None: # unchunked source
+ chunk_shape_tuple = array_shape_tuple
+ else:
+ chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")
# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple

Suggested change:

# 2. Introspect the source array to get its configuration
print("Reading metadata from source store...")
# Read the stores metadata to get array shape and chunk shape
ordered_dims = list(source_dataset.dims)
array_shape_tuple = tuple(source_dataset.dims[dim] for dim in ordered_dims)
- chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+ if source_dataset.chunks is None:  # unchunked source
+     chunk_shape_tuple = array_shape_tuple
+ else:
+     chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
array_shape = array_shape_tuple
chunk_shape = chunk_shape_tuple
🤖 Prompt for AI Agents
In py_hamt/hamt_to_sharded_converter.py around lines 36 to 45, the code assumes
source_dataset.chunks is always defined, but it can be None for unchunked
datasets, causing a TypeError. Modify the chunk_shape_tuple assignment to first
check if source_dataset.chunks is not None; if it is None, provide a fallback
such as using the full dimension size as the chunk size or a default chunk size.
This ensures safe handling of unchunked datasets without errors.
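The fallback can be expressed as a small helper. The `dims` and `chunks` arguments below mimic the shapes xarray exposes (`chunks` is `None` for unchunked datasets, otherwise a mapping of dim name to a tuple of per-dim chunk sizes); the helper name is illustrative, not part of the converter:

```python
def derive_shapes(dims, chunks):
    """Return (array_shape, chunk_shape) tuples in a stable dim order.

    Falls back to the full array shape when the dataset is unchunked
    (xarray reports chunks=None in that case), so each dim becomes
    one chunk spanning its whole extent.
    """
    ordered = list(dims)
    array_shape = tuple(dims[d] for d in ordered)
    if chunks is None:  # unchunked source
        chunk_shape = array_shape
    else:
        chunk_shape = tuple(chunks[d][0] for d in ordered)
    return array_shape, chunk_shape

print(derive_shapes({"time": 10, "lat": 4}, {"time": (5, 5), "lat": (4,)}))
# ((10, 4), (5, 4))
print(derive_shapes({"time": 10, "lat": 4}, None))
# ((10, 4), (10, 4))
```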
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #68 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 5 7 +2
Lines 707 1310 +603
==========================================
+ Hits 707 1310 +603

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)
496-515: Align `pin_ls` to return a list of CID strings

After reviewing both the stub (line 57) and the full implementation (lines 496–515) in `py_hamt/store_httpx.py`, and inspecting all call sites (notably in `tests/test_kubo_pin.py`, which uses `assert str(cid) in listed_pins`), it's clear that:

- The current signature `-> list[Dict[str, Any]]` is incorrect.
- The code actually returns a dict of `{ cid: { "Type": … }, … }`, which happens to support `cid in pins` but fails type expectations.
- Tests only check membership, so returning a flat `list[str]` of CIDs both satisfies callers and keeps the API surface clean.

Please apply the following mandatory refactor:

• At line 57 (stub):

- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+ async def pin_ls(self, target_rpc: str) -> list[str]:

• At lines 496–515 (implementation):

- async def pin_ls(
-     self, target_rpc: str = "http://127.0.0.1:5001"
- ) -> list[Dict[str, Any]]:
+ async def pin_ls(
+     self, target_rpc: str = "http://127.0.0.1:5001"
+ ) -> list[str]:
@@
-     Returns:
-         List[CID]: A list of pinned CIDs.
+     Returns:
+         list[str]: A list of pinned CID strings.
@@
-     pins = response.json().get("Keys", [])
-     return pins
+     keys_map = response.json().get("Keys", {}) or {}
+     return list(keys_map.keys())

• Adjust both docstrings to reflect the new return type (`list[str]`).
♻️ Duplicate comments (5)
py_hamt/zarr_hamt_store.py (1)
175-177: Replace print with structured logging

Use the logging module instead of print:

- except Exception as e:
-     print(f"Error getting key '{key}' with range {byte_range}: {e}")
-     raise
+ except Exception as e:
+     logger.error("Error getting key %r with range %r: %s", key, byte_range, e)
+     raise

Add once at the top of the module:

+import logging
+logger = logging.getLogger(__name__)

py_hamt/store_httpx.py (4)
77-115: Support length-only reads and simplify control flow in InMemoryCAS.load

Length without offset should return the first N bytes. This was flagged earlier and remains unhandled.

  key = cast(bytes, id)
@@
- data: bytes
+ data: bytes
  try:
      data = self.store[key]
  except KeyError as exc:
      raise KeyError("Object not found in in-memory store") from exc
- if offset is not None:
-     start = offset
-     if length is not None:
-         end = start + length
-         return data[start:end]
-     else:
-         return data[start:]
- elif suffix is not None:  # If only length is given, assume start from 0
-     return data[-suffix:]
- else:  # Full load
-     return data
+ if length == 0:
+     return b""
+ if offset is not None:
+     start = offset
+     if length is not None:
+         end = start + length
+         return data[start:end]
+     return data[start:]
+ if length is not None:
+     # length-only → first N bytes
+     return data[:length]
+ if suffix is not None:
+     # trailing N bytes
+     return data[-suffix:]
+ return data  # full blob
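The intended slicing semantics are easy to check against plain bytes; a minimal sketch of the branch order the diff above proposes (function name is illustrative, not the store's API):

```python
from typing import Optional

def partial_read(data: bytes,
                 offset: Optional[int] = None,
                 length: Optional[int] = None,
                 suffix: Optional[int] = None) -> bytes:
    """Slice like the proposed InMemoryCAS.load: zero-length first,
    then offset(+length), then length-only (first N bytes),
    then suffix (last N bytes), else the full blob."""
    if length == 0:
        return b""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]
        return data[offset:]
    if length is not None:
        return data[:length]
    if suffix is not None:
        return data[-suffix:]
    return data

blob = b"0123456789"
print(partial_read(blob, offset=2, length=3))  # b'234'
print(partial_read(blob, length=4))            # b'0123'
print(partial_read(blob, suffix=2))            # b'89'
```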
435-451: Remove nonexistent `name` parameter from pin_cid docstring

Docstring mentions `name`, which is not in the signature and not sent.

  """
  Pins a CID to the local Kubo node via the RPC API.
  This call is recursive by default, pinning all linked objects.
  Args:
      cid (CID): The Content ID to pin.
-     name (Optional[str]): An optional name for the pin.
+     target_rpc (str): The RPC URL of the Kubo node.
  """

Consider adding a `recursive: bool = True` keyword if you need to toggle recursion in the future.
43-60: Make pin hook methods fail-fast instead of silently passing

These are effectively abstract hooks. Leaving them as `pass` hides misuse at runtime.

- async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Pin a CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Pin a CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Unpin a CID in the storage."""
-     pass  # pragma: no cover
+ async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Unpin a CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def pin_update(
-     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
- ) -> None:
-     """Update the pinned CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_update(
+     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
+ ) -> None:
+     """Update the pinned CID in the storage."""
+     raise NotImplementedError  # pragma: no cover
@@
- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-     """List all pinned CIDs in the storage."""
-     return []  # pragma: no cover
+ async def pin_ls(self, target_rpc: str) -> Dict[str, Any]:
+     """List pinned CIDs in the storage (shape matches Kubo's pin/ls Keys map)."""
+     raise NotImplementedError  # pragma: no cover

Note: This also updates the return type of `pin_ls` to match Kubo (see later comment).
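The fail-fast behavior argued for above can be demonstrated in isolation; a minimal sketch of the hook pattern (class names here are illustrative stand-ins, not the library's):

```python
import asyncio

class BaseCAS:
    """Sketch of a CAS base class whose optional pin hooks fail fast."""

    async def pin_cid(self, cid: str, target_rpc: str) -> None:
        # Subclasses that support pinning must override this.
        raise NotImplementedError("pin_cid not implemented in this CAS")

class NoPinCAS(BaseCAS):
    """A CAS without pinning support: calls now fail loudly, not silently."""

async def main() -> None:
    await NoPinCAS().pin_cid("bafyexamplecid", "http://127.0.0.1:5001")

try:
    asyncio.run(main())
except NotImplementedError as exc:
    print(exc)  # pin_cid not implemented in this CAS
```

With a silent `pass`, the same call would "succeed" and the caller would only discover much later that nothing was ever pinned.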
403-434: Handle length-only HTTP Range requests; guard zero-length/suffix=0

`load()` still ignores `length` when `offset` is `None`. HTTP Range semantics expect `"bytes=0-{length-1}"`. Also, `length == 0` and `suffix == 0` should short-circuit to avoid sending invalid ranges.

  async def load(
      self,
      id: IPLDKind,
      offset: Optional[int] = None,
      length: Optional[int] = None,
      suffix: Optional[int] = None,
  ) -> bytes:
      """@private"""
      cid = cast(CID, id)
      url: str = f"{self.gateway_base_url + str(cid)}"
      headers: Dict[str, str] = {}
+     # Zero-length/suffix guard: avoid invalid HTTP Range values
+     if length == 0 or suffix == 0:
+         return b""
      # Construct the Range header if required
      if offset is not None:
          start = offset
          if length is not None:
              # Standard HTTP Range: bytes=start-end (inclusive)
              end = start + length - 1
              headers["Range"] = f"bytes={start}-{end}"
          else:
              # Standard HTTP Range: bytes=start- (from start to end)
              headers["Range"] = f"bytes={start}-"
-     elif suffix is not None:
+     elif suffix is not None:
          # Standard HTTP Range: bytes=-N (last N bytes)
          headers["Range"] = f"bytes=-{suffix}"
+     elif length is not None:
+         # length only → first N bytes
+         end = length - 1
+         headers["Range"] = f"bytes=0-{end}"
      async with self._sem:  # throttle gateway
          client = self._loop_client()
          response = await client.get(url, headers=headers or None)
          response.raise_for_status()
          return response.content
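The full set of Range cases, including the length-only branch, fits in a small helper; this is a sketch of the intended header semantics, not the KuboCAS code itself (the function name is hypothetical):

```python
from typing import Optional

def build_range_header(offset: Optional[int] = None,
                       length: Optional[int] = None,
                       suffix: Optional[int] = None) -> Optional[str]:
    """Build an HTTP Range header value; None means 'fetch the full blob'.

    Zero-length and zero-suffix reads should be short-circuited by the
    caller (returning b'' locally) before a header is ever built.
    """
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"                           # offset to EOF
    if suffix is not None:
        return f"bytes=-{suffix}"                           # last N bytes
    if length is not None:
        return f"bytes=0-{length - 1}"                      # first N bytes
    return None

print(build_range_header(offset=10, length=5))  # bytes=10-14
print(build_range_header(length=5))             # bytes=0-4
print(build_range_header(suffix=8))             # bytes=-8
```

Note the off-by-one the review keeps circling: HTTP Range ends are inclusive, so a byte count of `length` maps to `start + length - 1`.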
🧹 Nitpick comments (4)
py_hamt/zarr_hamt_store.py (1)
184-191: Tidy gather usage and consider throttling very large fan-out

The comment says "Set return_exceptions=True" but the call sets it to False (default). Remove the stale note. If `key_ranges` can be large, consider bounding concurrency with a semaphore to avoid scheduling thousands of tasks at once.

- results = await asyncio.gather(
-     *tasks, return_exceptions=False
- )  # Set return_exceptions=True for debugging
+ results = await asyncio.gather(*tasks)
  return results

Optional: introduce a local semaphore to cap concurrent `get()`s.

py_hamt/store_httpx.py (3)
34-41: Clarify partial-read semantics in the base interface

Document that:
- offset is 0-based; length is a byte count; end is exclusive when derived from offset+length.
- suffix returns the last N bytes.
- negative values are invalid; length==0 should return empty bytes.
- ) -> bytes:
-     """Retrieve data."""
+ ) -> bytes:
+     """Retrieve data.
+
+     Semantics:
+     - offset: 0‑based start position.
+     - length: number of bytes to read; when combined with offset, the "end" is exclusive (HTTP Range end is inclusive and derived as start+length-1).
+     - suffix: last N bytes of the object.
+     - Negative values are invalid; length == 0 returns b"".
+     """
190-193: pinOnAdd: URL building looks good; consider URL-encoding chunker/hash

If `chunker` ever contains characters requiring encoding (future formats), assemble the RPC URL using `params=` on `client.post(...)` rather than manual string concatenation.

Example:

response = await client.post(
    f"{rpc_base_url}/api/v0/add",
    params={"hash": self.hasher, "chunker": self.chunker, "pin": str(pinOnAdd).lower()},
    files=files,
)

Also applies to: 258-260
343-383: Destructor performs async work; keep best‑effort but avoid surprising side effects
`__del__` attempts to schedule/drive async cleanup. This is best‑effort but can surprise callers, especially in interpreter shutdown. Consider limiting `__del__` to clearing references and documenting that `aclose()` (or `async with`) is required for deterministic cleanup.

No code change required now; just a heads‑up. If you want code, I can propose a minimal `__del__` that avoids `asyncio.run()` altogether.
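The deterministic alternative the comment points at is the async context-manager protocol; a minimal sketch with a stand-in class (not KuboCAS itself):

```python
import asyncio

class Client:
    """Stand-in resource with explicit async cleanup."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        # Deterministic cleanup point; no reliance on garbage collection.
        self.closed = True

    async def __aenter__(self) -> "Client":
        return self

    async def __aexit__(self, *exc) -> None:
        await self.aclose()

async def demo() -> bool:
    async with Client() as c:
        assert not c.closed  # still open inside the block
    return c.closed          # aclose() ran on scope exit

print(asyncio.run(demo()))  # True
```

Cleanup happens exactly when the `async with` block exits, rather than whenever (or whether) the garbage collector runs `__del__`.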
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- py_hamt/store_httpx.py (8 hunks)
- py_hamt/zarr_hamt_store.py (3 hunks)
- tests/test_zarr_ipfs.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/test_zarr_ipfs.py
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/zarr_hamt_store.py (2)
py_hamt/encryption_hamt_store.py (1)
get (159-183)

py_hamt/hamt.py (2)

get (593-608)
HAMT (287-696)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (5)
load (138-139)
load (159-168)
load (276-285)
get (593-608)
HAMT (287-696)

py_hamt/zarr_hamt_store.py (1)

get (144-177)

py_hamt/sharded_zarr_store.py (1)

get (319-374)

tests/test_kubo_cas.py (1)

test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None

    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            offset = byte_range.start
            length = byte_range.end - byte_range.start
            if length is not None and length < 0:
                raise ValueError("End must be >= start for RangeByteRequest")
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")

    return offset, length, suffix
|
🛠️ Refactor suggestion
Fix RangeByteRequest with open-ended ranges and validate inputs
_map_byte_request unconditionally computes end - start. If end is None (open-ended ranges are valid in Zarr v3), this raises a TypeError. Also, negative start and zero-length ranges should be handled explicitly to avoid downstream invalid HTTP ranges.
Apply:
def _map_byte_request(
self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
"""Helper to map Zarr ByteRequest to offset, length, suffix."""
offset: Optional[int] = None
length: Optional[int] = None
suffix: Optional[int] = None
if byte_range:
if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
- offset = byte_range.start
- length = byte_range.end - byte_range.start
- if length is not None and length < 0:
- raise ValueError("End must be >= start for RangeByteRequest")
+ if byte_range.start < 0:
+ raise ValueError("start must be >= 0 for RangeByteRequest")
+ offset = byte_range.start
+ if byte_range.end is not None:
+ if byte_range.end < byte_range.start:
+ raise ValueError(
+ "end must be >= start for RangeByteRequest"
+ )
+ length = byte_range.end - byte_range.start # end is exclusive
+ else:
+ # open-ended: request from offset to end of blob
+ length = None
elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
offset = byte_range.offset
elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
suffix = byte_range.suffix
else:
raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
        return offset, length, suffix

Optionally, short-circuit zero-length reads at the call-site (see next comment) to avoid sending a bytes=start-(start-1) HTTP range.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None
    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            offset = byte_range.start
            length = byte_range.end - byte_range.start
            if length is not None and length < 0:
                raise ValueError("End must be >= start for RangeByteRequest")
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
    return offset, length, suffix

Suggested change:

def _map_byte_request(
    self, byte_range: Optional[zarr.abc.store.ByteRequest]
) -> tuple[Optional[int], Optional[int], Optional[int]]:
    """Helper to map Zarr ByteRequest to offset, length, suffix."""
    offset: Optional[int] = None
    length: Optional[int] = None
    suffix: Optional[int] = None
    if byte_range:
        if isinstance(byte_range, zarr.abc.store.RangeByteRequest):
            if byte_range.start < 0:
                raise ValueError("start must be >= 0 for RangeByteRequest")
            offset = byte_range.start
            if byte_range.end is not None:
                if byte_range.end < byte_range.start:
                    raise ValueError(
                        "end must be >= start for RangeByteRequest"
                    )
                length = byte_range.end - byte_range.start  # end is exclusive
            else:
                # open-ended: request from offset to end of blob
                length = None
        elif isinstance(byte_range, zarr.abc.store.OffsetByteRequest):
            offset = byte_range.offset
        elif isinstance(byte_range, zarr.abc.store.SuffixByteRequest):
            suffix = byte_range.suffix
        else:
            raise TypeError(f"Unsupported ByteRequest type: {type(byte_range)}")
    return offset, length, suffix
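Independent of zarr's classes, the mapping the suggestion above implements reduces to three cases; a self-contained sketch using plain tuples in place of the zarr ByteRequest types (the string "kinds" are hypothetical stand-ins, not zarr's API):

```python
def map_byte_request(kind, *args):
    """Map a byte request to (offset, length, suffix).

    kind is one of:
      "range"  -> (start, end) with end exclusive, or end=None for open-ended
      "offset" -> (start,) read from start to the end of the blob
      "suffix" -> (n,) read the last n bytes
    """
    if kind == "range":
        start, end = args
        if start < 0:
            raise ValueError("start must be >= 0")
        if end is None:           # open-ended: offset to end of blob
            return start, None, None
        if end < start:
            raise ValueError("end must be >= start")
        return start, end - start, None
    if kind == "offset":
        return args[0], None, None
    if kind == "suffix":
        return None, None, args[0]
    raise TypeError(f"Unsupported request kind: {kind}")

print(map_byte_request("range", 10, 20))   # (10, 10, None)
print(map_byte_request("range", 10, None)) # (10, None, None)
print(map_byte_request("suffix", 8))       # (None, None, 8)
```

This makes the open-ended case explicit: `end=None` must not reach the `end - start` subtraction, which is exactly the TypeError the review flags.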
  if is_metadata and byte_range is None and key in self.metadata_read_cache:
      val = self.metadata_read_cache[key]
  else:
      offset, length, suffix = self._map_byte_request(byte_range)
      val = cast(
-         bytes, await self.hamt.get(key)
+         bytes,
+         await self.hamt.get(
+             key, offset=offset, length=length, suffix=suffix
+         ),
      )  # We know values received will always be bytes since we only store bytes in the HAMT
-     if is_metadata:
+     if is_metadata and byte_range is None:
          self.metadata_read_cache[key] = val
🛠️ Refactor suggestion
Short-circuit zero-length ranges to avoid invalid HTTP Range headers
If a caller supplies RangeByteRequest(start, end) where start == end, length == 0. Forwarding this to the CAS will build end = start + length - 1, i.e. start-1 (invalid). Return an empty buffer locally for this edge case.
- else:
- offset, length, suffix = self._map_byte_request(byte_range)
+ else:
+ offset, length, suffix = self._map_byte_request(byte_range)
+ # Zero-length range → empty result, avoid invalid HTTP Range
+ if length == 0:
+ return prototype.buffer.from_bytes(b"")
val = cast(
bytes,
await self.hamt.get(
key, offset=offset, length=length, suffix=suffix
),
) # We know values received will always be bytes since we only store bytes in the HAMT
if is_metadata and byte_range is None:
self.metadata_read_cache[key] = val📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if is_metadata and byte_range is None and key in self.metadata_read_cache:
    val = self.metadata_read_cache[key]
else:
    offset, length, suffix = self._map_byte_request(byte_range)
    val = cast(
        bytes,
        await self.hamt.get(
            key, offset=offset, length=length, suffix=suffix
        ),
    )  # We know values received will always be bytes since we only store bytes in the HAMT
    if is_metadata and byte_range is None:
        self.metadata_read_cache[key] = val

Suggested change:

if is_metadata and byte_range is None and key in self.metadata_read_cache:
    val = self.metadata_read_cache[key]
else:
    offset, length, suffix = self._map_byte_request(byte_range)
+   # Zero-length range → empty result, avoid invalid HTTP Range
+   if length == 0:
+       return prototype.buffer.from_bytes(b"")
    val = cast(
        bytes,
        await self.hamt.get(
            key, offset=offset, length=length, suffix=suffix
        ),
    )  # We know values received will always be bytes since we only store bytes in the HAMT
    if is_metadata and byte_range is None:
        self.metadata_read_cache[key] = val
🤖 Prompt for AI Agents
In py_hamt/zarr_hamt_store.py around lines 158 to 169, when a RangeByteRequest
has start==end the computed length will be 0 and building an HTTP Range header
for the CAS becomes invalid; after calling self._map_byte_request(byte_range)
check if length == 0 and if so set val = b"" (return an empty bytes object)
instead of calling await self.hamt.get(...), and still populate
self.metadata_read_cache[key] when is_metadata and byte_range is None or when
appropriate for cached reads; this short-circuits the zero-length case and
avoids creating invalid Range headers.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
py_hamt/store_httpx.py
Outdated
    *,
    headers: dict[str, str] | None = None,
    auth: Tuple[str, str] | None = None,
    pinOnAdd: bool = False,
Wrong casing:

- pinOnAdd: bool = False,
+ pin_on_add: bool = False,
py_hamt/store_httpx.py
Outdated
pinString: str = "true" if pinOnAdd else "false"
self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
Also here, casing to be fixed:

- pinString: str = "true" if pinOnAdd else "false"
- self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pinString}"
+ pin_string: str = "true" if pin_on_add else "false"
+ self.rpc_url: str = f"{rpc_base_url}/api/v0/add?hash={self.hasher}&chunker={self.chunker}&pin={pin_string}"
Just a couple of casing issues; for the rest, it LGTM.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
py_hamt/store_httpx.py (3)
264-271: Bug: calling get_running_loop() in __init__ when a user supplies a client

Constructing KuboCAS from sync code with a prebuilt AsyncClient raises RuntimeError. Don't touch the loop in __init__; stash and return the external client in _loop_client().

- if client is not None:
-     # A client was supplied by the user. We don't own it.
-     self._owns_client = False
-     self._client_per_loop = {asyncio.get_running_loop(): client}
+ if client is not None:
+     # A client was supplied by the user. We don't own it.
+     self._owns_client = False
+     # Do not access the running loop in __init__
+     self._external_client = client
+     self._client_per_loop = {}
  else:
      # No client supplied. We will own any clients we create.
      self._owns_client = True
      self._client_per_loop = {}
281-312: Return user-supplied client regardless of loop; lazily create owned clients per loop

This complements the previous fix. If the user provided a client, just return it. For owned clients, keep the per-loop behavior.

  def _loop_client(self) -> httpx.AsyncClient:
      """Get or create a client for the current event loop.
@@
      if self._closed:
          if not self._owns_client:
              raise RuntimeError("KuboCAS is closed; create a new instance")
          # We previously closed all internally-owned clients. Reset the
          # state so that new clients can be created lazily.
          self._closed = False
          self._client_per_loop = {}
+     # External client → return as-is, independent of loop.
+     if not self._owns_client:
+         if self._external_client is None:
+             raise RuntimeError("External client missing (internal error)")
+         return self._external_client
      loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
      try:
          return self._client_per_loop[loop]
      except KeyError:
          # Create a new client
          client = httpx.AsyncClient(
              timeout=60.0,
              headers=self._default_headers,
              auth=self._default_auth,
              limits=httpx.Limits(max_connections=64, max_keepalive_connections=32),
              # Uncomment when they finally support Robust HTTP/2 GOAWAY responses
              # http2=True,
          )
          self._client_per_loop[loop] = client
          return client
496-515: pin_ls return type/docstring mismatch and incorrect shape

The RPC returns {"Keys": { "<cid>": {...}, ... }}, not a list. The annotation says list[Dict[str, Any]], docstring says List[CID]. Return List[CID] (decoded) for a stable API, or change both to Dict[str, Any]. Below converts to List[CID].

- async def pin_ls(
-     self, target_rpc: str = "http://127.0.0.1:5001"
- ) -> list[Dict[str, Any]]:
+ async def pin_ls(
+     self, target_rpc: str = "http://127.0.0.1:5001"
+ ) -> list[CID]:
@@
-     Returns:
-         List[CID]: A list of pinned CIDs.
+     Returns:
+         list[CID]: The pinned CIDs.
@@
-     pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
-     async with self._sem:  # throttle RPC
-         client = self._loop_client()
-         response = await client.post(pin_ls_url_base)
-         response.raise_for_status()
-         pins = response.json().get("Keys", [])
-         return pins
+     pin_ls_url_base: str = f"{target_rpc}/api/v0/pin/ls"
+     async with self._sem:  # throttle RPC
+         client = self._loop_client()
+         response = await client.post(pin_ls_url_base)
+         response.raise_for_status()
+         payload = response.json()
+         keys_map = payload.get("Keys", {}) or {}
+         return [CID.decode(k) for k in keys_map.keys()]
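The shape difference is easy to see with a stubbed RPC payload; this sketch mirrors the `{"Keys": {...}}` map Kubo's pin/ls endpoint returns and the flat list the refactor asks for (the payload contents are illustrative):

```python
def pins_from_payload(payload: dict) -> list:
    """Flatten a Kubo pin/ls response of the form
    {"Keys": {"<cid>": {"Type": ...}, ...}} into a list of CID strings,
    tolerating a missing or empty Keys map."""
    keys_map = payload.get("Keys", {}) or {}
    return list(keys_map.keys())

payload = {"Keys": {"bafyexample1": {"Type": "recursive"},
                    "bafyexample2": {"Type": "direct"}}}
print(pins_from_payload(payload))  # ['bafyexample1', 'bafyexample2']
print(pins_from_payload({}))       # []
```

The `or {}` guard matters: `.get("Keys", {})` alone still yields `None` when the server explicitly sends `"Keys": null`.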
♻️ Duplicate comments (4)
py_hamt/store_httpx.py (4)
439-475: Pin API docstrings align with behavior (recursive=true); nice cleanup

Docs now match the code and removed the non-existent name parameter. Looks good.
43-60: Base-class pin hooks silently no-op; surface misuse with NotImplementedError or mark abstract

Leaving these as pass/[] hides bugs at runtime when a caller expects pinning to work but the concrete CAS doesn't implement it. Raise NotImplementedError to fail fast (or mark @abstractmethod if all CASes must implement).

- async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Pin a CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Pin a CID in the storage."""
+     raise NotImplementedError("pin_cid not implemented in this CAS")

- async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-     """Unpin a CID in the storage."""
-     pass  # pragma: no cover
+ async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
+     """Unpin a CID in the storage."""
+     raise NotImplementedError("unpin_cid not implemented in this CAS")

- async def pin_update(
-     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
- ) -> None:
-     """Update the pinned CID in the storage."""
-     pass  # pragma: no cover
+ async def pin_update(
+     self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
+ ) -> None:
+     """Update the pinned CID in the storage."""
+     raise NotImplementedError("pin_update not implemented in this CAS")

- async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-     """List all pinned CIDs in the storage."""
-     return []  # pragma: no cover
+ async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
+     """List all pinned CIDs in the storage."""
+     raise NotImplementedError("pin_ls not implemented in this CAS")
104-114: Handle length-only loads and remove redundant else-after-return

When only length is provided (offset is None), callers expect the first N bytes. Also, the comment on Line 111 references length under the suffix branch. Tidy the control flow.

- if offset is not None:
-     start = offset
-     if length is not None:
-         end = start + length
-         return data[start:end]
-     else:
-         return data[start:]
- elif suffix is not None:  # If only length is given, assume start from 0
-     return data[-suffix:]
- else:  # Full load
-     return data
+ if offset is not None:
+     start = offset
+     if length is not None:
+         end = start + length
+         return data[start:end]
+     return data[start:]
+ elif length is not None:
+     # length only → first N bytes
+     return data[:length]
+ elif suffix is not None:
+     # trailing N bytes
+     return data[-suffix:]
+ return data  # full blob
415-428: Support length-only HTTP range requests (first N bytes)

When `length` is provided without `offset`, set Range to `"bytes=0-{length-1}"`. Today that path falls through and fetches the whole blob.
```diff
         # Construct the Range header if required
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
+        elif length is not None:
+            # length only → first N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
         elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
```
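As a sanity check, the proposed mapping can be sketched as a standalone helper (the name `build_range_header` is hypothetical; the real code inlines this logic in `load()`):

```python
def build_range_header(offset=None, length=None, suffix=None):
    """Map (offset, length, suffix) to an HTTP Range header value, or None."""
    if offset is not None:
        if length is not None:
            return f"bytes={offset}-{offset + length - 1}"  # inclusive end
        return f"bytes={offset}-"          # from offset to end of blob
    if length is not None:
        return f"bytes=0-{length - 1}"     # length only → first N bytes
    if suffix is not None:
        return f"bytes=-{suffix}"          # trailing N bytes
    return None                            # full load, no header needed

print(build_range_header(offset=10, length=5))  # bytes=10-14
print(build_range_header(length=100))           # bytes=0-99
print(build_range_header(suffix=16))            # bytes=-16
```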
🧹 Nitpick comments (4)
py_hamt/store_httpx.py (4)
190-192: Nit: prefer snake_case parameter names (`pinOnAdd` → `pin_on_add`)

PEP 8 consistency. This public API change is breaking, so consider an alias/deprecation path if you decide to adjust.
343-383: Avoid complex async orchestration in `__del__`

Scheduling tasks or calling asyncio.run() in a destructor is fragile and can produce noisy warnings. Since you already expose the async context-manager and aclose(), consider simplifying `__del__` to only clear references when we own clients and log a warning for leaked clients.
489-494: Be explicit with repeated query params for pin/update

httpx will encode list values as repeated keys, but being explicit avoids ambiguity and IDE type nags; it also mirrors curl usage clearly.
```diff
-        params = {"arg": [str(old_id), str(new_id)]}
+        params = [("arg", str(old_id)), ("arg", str(new_id))]
         pin_update_url_base: str = f"{target_rpc}/api/v0/pin/update"
         async with self._sem:  # throttle RPC
             client = self._loop_client()
             response = await client.post(pin_update_url_base, params=params)
             response.raise_for_status()
```
33-41: Nit: avoid using `id` as a parameter name

It shadows Python's built-in `id()`. Consider `object_id` or `content_id` in new APIs; not worth churning existing ones.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
py_hamt/store_httpx.py(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (3)
py_hamt/hamt.py (5)
- load (138-139), load (159-168), load (276-285), get (593-608), HAMT (287-696)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)
```python
        self._owns_client: bool = False
        self._closed: bool = True
        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
        self._default_headers = headers
        self._default_auth = auth
```
🛠️ Refactor suggestion
Prepare for external client without touching the loop in `__init__`

Accessing asyncio.get_running_loop() in `__init__` breaks creation from sync contexts. Stash the external client and bind it on first use.
```diff
         self._owns_client: bool = False
         self._closed: bool = True
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        self._owns_client: bool = False
        self._closed: bool = True
        self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
        self._default_headers = headers
        self._default_auth = auth
        self._external_client: httpx.AsyncClient | None = None
```
🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 229 to 234, __init__ is calling
asyncio.get_running_loop() (or otherwise touching the loop) when an external
httpx.AsyncClient may be passed, which breaks sync-side construction; instead,
store the provided client object and do not access the event loop in __init__;
set self._owns_client to False when an external client is provided and defer
creating or binding a per-loop AsyncClient until first use (e.g., in a private
_ensure_client() or _get_client() method) where you can safely call
asyncio.get_running_loop(), create or attach the client to that loop, and set
self._owns_client/ self._closed accordingly.
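The deferred-binding pattern this prompt describes can be sketched independently of httpx (here `DummyClient` stands in for `httpx.AsyncClient`, and the class name `LazyClientCAS` is illustrative, not the project's API):

```python
import asyncio

class DummyClient:
    """Stand-in for httpx.AsyncClient; this sketch avoids real HTTP."""

class LazyClientCAS:
    def __init__(self, client=None):
        # No event-loop access here, so construction is safe from sync code.
        self._external_client = client
        self._owns_client = client is None
        self._client_per_loop = {}

    def _loop_client(self):
        # The loop is only touched once we are actually running inside one.
        loop = asyncio.get_running_loop()
        if loop in self._client_per_loop:
            return self._client_per_loop[loop]
        client = self._external_client if self._external_client is not None else DummyClient()
        self._client_per_loop[loop] = client
        return client

cas = LazyClientCAS()  # constructed with no running loop: no error

async def main():
    return cas._loop_client()

client = asyncio.run(main())
print(type(client).__name__)  # DummyClient
```

The key point is that `__init__` never calls `asyncio.get_running_loop()`; the loop lookup is deferred to `_loop_client`, which is only invoked from async code.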
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
py_hamt/store_httpx.py (1)
496-515: Align `pin_ls()` return type, implementation, and tests

The current implementation of `pin_ls()` (py_hamt/store_httpx.py lines 496–515) calls the IPFS RPC and does:

```python
pins = response.json().get("Keys", [])
return pins
```

– at runtime this returns a dict mapping CIDs to metadata, but the signature is `-> list[Dict[str, Any]]` and the docstring says it returns a list of CIDs. Meanwhile, the tests in tests/test_kubo_pin.py (at lines 26, 34, 48, 59) do:

```python
assert str(cid) in listed_pins
# …
assert str(cid) not in listed_pins_after_unpin
```

which relies on `listed_pins` being an iterable of CID strings (dict membership in Python tests keys, so it happens to work today). We need to pick one consistent API shape and apply the corresponding refactors:
• Approach A – Return a list of `{cid, ...}` dicts

– Implementation (replace lines 512–515):

```diff
-        pins = response.json().get("Keys", [])
-        return pins
+        keys = response.json().get("Keys") or {}
+        # Convert {"<cid>": {...}, …} to [{"cid": "<cid>", **metadata}, …]
+        return [{"cid": k, **(v or {})} for k, v in keys.items()]
```

– Signature & docstring (py_hamt/store_httpx.py lines 496–498 & doc):

```diff
-    async def pin_ls(...) -> list[Dict[str, Any]]:
+    async def pin_ls(...) -> list[Dict[str, Any]]:
         """
         Lists all pinned CIDs on the local Kubo node via the RPC API.

         Returns:
-            List[CID]: A list of pinned CIDs.
+            list[dict]: Each entry has the form {"cid": "<cid>", "Type": ..., "Metadata": ...}.
         """
```
diff - assert str(cid) in listed_pins + assert any(p["cid"] == str(cid) for p in listed_pins), f"CID {cid} was not pinned"
and similarly for the other three assertions (listed_pins_after_unpin,listed_pins_after_update,listed_pins_after_unpin_update).• Approach B – Keep returning the raw dict
– Signature & docstring:

```diff
-    async def pin_ls(...) -> list[Dict[str, Any]]:
+    async def pin_ls(...) -> Dict[str, Dict[str, Any]]:
         """
         Returns:
-            List[CID]: A list of pinned CIDs.
+            Dict[str, dict]: Mapping from each pinned CID string to its metadata.
         """
```

– Implementation: leave as is (it already returns the dict).
– Tests: no change necessary.
– Tests: no change necessary.Please select the intended API contract and update the code (py_hamt/store_httpx.py), docstring, signature, and—if you choose Approach A—the four assertions in tests/test_kubo_pin.py accordingly.
♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)
43-60: Surface misuse early: make base pin methods fail fast.

These default "no-op" implementations hide bugs at runtime. Prefer raising `NotImplementedError` to catch accidental calls on CAS backends that don't implement pinning. This was noted previously.

```diff
     async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid is not supported by this CAS")

     async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid is not supported by this CAS")

     async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update is not supported by this CAS")

     async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls is not supported by this CAS")
```
229-235: Avoid touching the running loop in `__init__` when an external client is provided.

Constructors should be usable from sync contexts; calling `asyncio.get_running_loop()` here forces an active loop. Stash the external client and bind lazily in `_loop_client()` instead. This was suggested earlier.

```diff
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
         self._default_headers = headers
         self._default_auth = auth
+        self._external_client: httpx.AsyncClient | None = None
```

Outside this range (illustrative change):

```python
# __init__ (replace the current external-client block)
if client is not None:
    self._owns_client = False
    self._external_client = client
else:
    self._owns_client = True

# _loop_client (first lines)
loop = asyncio.get_running_loop()
if loop in self._client_per_loop:
    return self._client_per_loop[loop]
if self._external_client is not None:
    self._client_per_loop[loop] = self._external_client
    return self._external_client
# ... create internal client as you do now
```
435-459: Expose a `recursive` flag for pin/unpin for flexibility.

Docstrings say "recursive by default" and you correctly send `"true"`. Exposing a `recursive: bool = True` kwarg adds control without breaking defaults. Same for `unpin_cid`.

```diff
     async def pin_cid(
         self,
         cid: CID,
         target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
     ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```

And for `unpin_cid(...)`:

```diff
-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```
77-84: Support length-only reads and tighten slice validation in `InMemoryCAS.load()`.

When only `length` is supplied, callers expect the first N bytes. Also remove "else after return" and validate non-negative inputs. This was raised earlier and still applies.

```diff
@@
-        if offset is not None:
+        if length is not None and offset is None:
+            # length only → first N bytes
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            return data[:length]
+        if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
-                end = start + length
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length
                 return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
+            return data[start:]
+        if suffix is not None:
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
             return data[-suffix:]
-        else:  # Full load
-            return data
+        return data  # full blob
```

Also applies to: 104-115
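The intended slicing semantics are easy to pin down with a pure-bytes model (the `slice_blob` helper below is hypothetical, mirroring the proposed branch order without the validation):

```python
def slice_blob(data: bytes, offset=None, length=None, suffix=None) -> bytes:
    """Model of the proposed InMemoryCAS.load() slicing semantics."""
    if offset is not None:
        if length is not None:
            return data[offset:offset + length]
        return data[offset:]           # offset only → rest of the blob
    if length is not None:
        return data[:length]           # length only → first N bytes
    if suffix is not None:
        return data[-suffix:]          # trailing N bytes
    return data                        # full blob

blob = b"0123456789"
print(slice_blob(blob, offset=2, length=3))  # b'234'
print(slice_blob(blob, length=4))            # b'0123'
print(slice_blob(blob, suffix=2))            # b'89'
```

Note these byte slices line up one-to-one with the HTTP Range forms the gateway path builds, which keeps the in-memory and gateway backends behaviorally identical.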
403-409: HTTP Range: add length-only requests, validate inputs.

`load()` currently ignores the case where only `length` is provided; the gateway should receive `Range: bytes=0-{length-1}`. Also reject negative values. This was previously flagged.

```diff
@@
         # Construct the Range header if required
         if offset is not None:
             start = offset
+            if start < 0:
+                raise ValueError("offset must be non-negative")
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
-                end = start + length - 1
+                if length < 0:
+                    raise ValueError("length must be non-negative")
+                end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            if length < 0:
+                raise ValueError("length must be non-negative")
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
-            headers["Range"] = f"bytes=-{suffix}"
+            if suffix <= 0:
+                raise ValueError("suffix must be a positive integer")
+            headers["Range"] = f"bytes=-{suffix}"
@@
-        response = await client.get(url, headers=headers or None)
+        response = await client.get(url, headers=headers or None)
```

Also applies to: 411-428, 431-431
🧹 Nitpick comments (1)
py_hamt/store_httpx.py (1)
34-41: Define and document the range-parameter contract (mutual exclusivity and non-negatives).

The extended `load(...)` signature is good. Please explicitly document and enforce:
- `offset >= 0`, `length >= 0`, `suffix > 0`.
- `suffix` must not be used together with `offset`/`length`.

This avoids ambiguous requests and undefined slices across implementations. Would you like me to add a brief docstring here plus input validation in both implementations?
📒 Files selected for processing (1)
py_hamt/store_httpx.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (4)
py_hamt/hamt.py (4)
- load (138-139), load (159-168), load (276-285), get (593-608)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

tests/test_kubo_cas.py (1)
- test_kubo_cas (128-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/store_httpx.py (1)
190-190: Nice: explicit pin-on-add toggle in the add RPC URL.

Good feature gate and correct boolean casing in the query string.
Also applies to: 258-260
0xSwego
left a comment
LGTM
Actionable comments posted: 0
♻️ Duplicate comments (5)
py_hamt/store_httpx.py (5)
94-99: Fix length-only handling and add input validation in `InMemoryCAS.load`

`length` without `offset` is currently ignored; also redundant else after return. Add basic validation and implement length-only semantics.

```diff
         key = cast(bytes, id)
         if not isinstance(key, (bytes, bytearray)):  # defensive guard
             raise TypeError(
                 f"InMemoryCAS only supports byte‐hash keys; got {type(id).__name__}"
             )
-        data: bytes
+        # Validate args
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        data: bytes
@@
-        if offset is not None:
-            start = offset
-            if length is not None:
-                end = start + length
-                return data[start:end]
-            else:
-                return data[start:]
-        elif suffix is not None:  # If only length is given, assume start from 0
-            return data[-suffix:]
-        else:  # Full load
-            return data
+        if offset is not None:
+            start = offset
+            if length is not None:
+                end = start + length
+                return data[start:end]
+            return data[start:]
+        elif length is not None:
+            # length only → first N bytes
+            return data[:length]
+        elif suffix is not None:
+            # trailing N bytes
+            return data[-suffix:]
+        return data
```

Also applies to: 105-116
233-238: Don't touch the event loop in `__init__`; defer binding external clients

Accessing asyncio.get_running_loop() during construction breaks sync contexts. Stash the user client and return it from `_loop_client` without loop lookups; only create per-loop clients when you own them.

```diff
         self._owns_client: bool = False
         self._closed: bool = True
         self._client_per_loop: Dict[asyncio.AbstractEventLoop, httpx.AsyncClient] = {}
+        self._external_client: httpx.AsyncClient | None = None
         self._default_headers = headers
         self._default_auth = auth
@@
-        if client is not None:
-            # A client was supplied by the user. We don't own it.
-            self._owns_client = False
-            self._client_per_loop = {asyncio.get_running_loop(): client}
-        else:
-            # No client supplied. We will own any clients we create.
-            self._owns_client = True
-            self._client_per_loop = {}
+        if client is not None:
+            self._owns_client = False
+            self._external_client = client
+            self._client_per_loop = {}
+        else:
+            self._owns_client = True
+            self._client_per_loop = {}
@@
     def _loop_client(self) -> httpx.AsyncClient:
         """Get or create a client for the current event loop.
@@
         if self._closed:
             if not self._owns_client:
                 raise RuntimeError("KuboCAS is closed; create a new instance")
             # We previously closed all internally-owned clients. Reset the
             # state so that new clients can be created lazily.
             self._closed = False
             self._client_per_loop = {}
+        # If user supplied a client, just reuse it.
+        if self._external_client is not None:
+            return self._external_client
+
         loop: asyncio.AbstractEventLoop = asyncio.get_running_loop()
         try:
             return self._client_per_loop[loop]
         except KeyError:
             # Create a new client
             client = httpx.AsyncClient(
```

Optional: use WeakKeyDictionary for `_client_per_loop` to avoid retaining closed loops.

Also applies to: 268-276, 299-330
510-526: Expose recursive flag in pin/unpin and align params

Docstrings say recursive by default; make it configurable and wire it through.

```diff
-    async def pin_cid(
-        self,
-        cid: CID,
-        target_rpc: str = "http://127.0.0.1:5001",
-    ) -> None:
+    async def pin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
@@
-    async def unpin_cid(
-        self, cid: CID, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> None:
+    async def unpin_cid(
+        self,
+        cid: CID,
+        target_rpc: str = "http://127.0.0.1:5001",
+        *,
+        recursive: bool = True,
+    ) -> None:
@@
-        params = {"arg": str(cid), "recursive": "true"}
+        params = {"arg": str(cid), "recursive": str(recursive).lower()}
```

Also applies to: 532-547
44-61: Make pin_* hooks fail-fast instead of silently no-op

Silent `pass`/`[]` hides misuse. Raise NotImplementedError so unsupported backends surface errors early.

```diff
     async def pin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Pin a CID in the storage."""
-        pass  # pragma: no cover
+        """Pin a CID in the storage."""
+        raise NotImplementedError("pin_cid not implemented for this CAS")

     async def unpin_cid(self, id: IPLDKind, target_rpc: str) -> None:
-        """Unpin a CID in the storage."""
-        pass  # pragma: no cover
+        """Unpin a CID in the storage."""
+        raise NotImplementedError("unpin_cid not implemented for this CAS")

     async def pin_update(
         self, old_id: IPLDKind, new_id: IPLDKind, target_rpc: str
     ) -> None:
-        """Update the pinned CID in the storage."""
-        pass  # pragma: no cover
+        """Update the pinned CID in the storage."""
+        raise NotImplementedError("pin_update not implemented for this CAS")

     async def pin_ls(self, target_rpc: str) -> list[Dict[str, Any]]:
-        """List all pinned CIDs in the storage."""
-        return []  # pragma: no cover
+        """List all pinned CIDs in the storage."""
+        raise NotImplementedError("pin_ls not implemented for this CAS")
```
456-471: Support length-only HTTP Range and validate inputs in `KuboCAS.load`

Handle `length` without `offset` and reject invalid combos/negatives. Return `b""` for zero-length requests.

```diff
-        # Construct the Range header if required
+        # Validate and construct the Range header if required
+        if length is not None and suffix is not None:
+            raise ValueError("length and suffix are mutually exclusive")
+        if offset is not None and offset < 0:
+            raise ValueError("offset must be >= 0")
+        if length is not None and length < 0:
+            raise ValueError("length must be >= 0")
+        if suffix is not None and suffix < 0:
+            raise ValueError("suffix must be >= 0")
+        if length == 0:
+            return b""
         if offset is not None:
             start = offset
             if length is not None:
                 # Standard HTTP Range: bytes=start-end (inclusive)
                 end = start + length - 1
                 headers["Range"] = f"bytes={start}-{end}"
             else:
                 # Standard HTTP Range: bytes=start- (from start to end)
                 headers["Range"] = f"bytes={start}-"
-        elif suffix is not None:
+        elif length is not None:
+            # First N bytes
+            end = length - 1
+            headers["Range"] = f"bytes=0-{end}"
+        elif suffix is not None:
             # Standard HTTP Range: bytes=-N (last N bytes)
             headers["Range"] = f"bytes=-{suffix}"
```
🧹 Nitpick comments (5)
py_hamt/store_httpx.py (5)
35-41: Clarify and validate the range-argument contract in the base interface

Please document precedence and mutual exclusivity of offset/length/suffix (e.g., offset+suffix invalid; length-only means first N bytes) so all backends implement the same behavior. Consider documenting the contract in the abstract method's docstring and enforcing it with lightweight validation in the concrete implementations.
278-279: Remove duplicate assignment of `_default_headers`/`_default_auth`

These are already set at Lines 236–237.

```diff
-        self._default_headers = headers
-        self._default_auth = auth
```
361-401: Be cautious running async cleanup from `__del__`

Scheduling tasks or calling asyncio.run in destructors can fire at interpreter shutdown and be brittle. Consider limiting `__del__` to clearing references and rely on aclose()/context manager for real cleanup.
548-566: Type consistency for `pin_update`

Align with other pin methods by accepting CID for old_id/new_id (runtime behavior unchanged).

```diff
-    async def pin_update(
-        self,
-        old_id: IPLDKind,
-        new_id: IPLDKind,
+    async def pin_update(
+        self,
+        old_id: CID,
+        new_id: CID,
         target_rpc: str = "http://127.0.0.1:5001",
     ) -> None:
```
568-586: `pin_ls` return type/docstring mismatch

Kubo returns `{"Keys": {...}}`; you return that mapping, not a list. Fix typing, docstring, and default.

```diff
-    async def pin_ls(
-        self, target_rpc: str = "http://127.0.0.1:5001"
-    ) -> list[Dict[str, Any]]:
+    async def pin_ls(
+        self, target_rpc: str = "http://127.0.0.1:5001"
+    ) -> dict[str, Any]:
@@
-        Returns:
-            List[CID]: A list of pinned CIDs.
+        Returns:
+            dict[str, Any]: The raw "Keys" mapping from Kubo pin/ls.
@@
-        pins = response.json().get("Keys", [])
+        pins = response.json().get("Keys", {})
         return pins
```
📒 Files selected for processing (1)
py_hamt/store_httpx.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/store_httpx.py (5)
py_hamt/hamt.py (4)
- load (138-139), load (159-168), load (276-285), get (593-608)

py_hamt/sharded_zarr_store.py (1)
- get (361-416)

py_hamt/zarr_hamt_store.py (1)
- get (144-177)

tests/test_public_gateway.py (1)
- test_fix_kubocas_load (270-316)

tests/test_kubo_cas.py (2)
- test_kubo_urls_explicit (52-71), test_kubo_cas (130-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/store_httpx.py (2)
262-264: LGTM: pin_on_add wiring

Correctly threads the pin flag into the add URL.
405-445: LGTM: save() retries and jitter

Solid retry/backoff with jitter and clear error propagation.
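The pattern being praised looks roughly like this in isolation (function names, retry count, and delays below are illustrative, not the actual KuboCAS code):

```python
import asyncio
import random

async def save_with_retry(do_save, retries: int = 3, base_delay: float = 0.01):
    """Retry an async save with exponential backoff plus random jitter (sketch)."""
    for attempt in range(retries):
        try:
            return await do_save()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: propagate the error
            # backoff: base * 2**attempt, scaled by up to 50% random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.5)
            await asyncio.sleep(delay)

# Demo: a save that fails twice, then succeeds on the third attempt
attempts = 0

async def flaky_save():
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient")
    return "cid-ok"

result = asyncio.run(save_with_retry(flaky_save))
print(result, attempts)  # cid-ok 3
```

The jitter spreads out retries from concurrent writers so they do not hammer the RPC endpoint in lockstep.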
py_hamt/__init__.py
Outdated
```python
    "InMemoryCAS",
    "KuboCAS",
    "ZarrHAMTStore",
    "InMemoryCAS",
```
minor nit but duplicate export of InMemoryCAS, should remove
py_hamt/sharded_zarr_store.py
Outdated
```python
        self._resize_complete.set()
        self._shard_locks: DefaultDict[int, asyncio.Lock] = defaultdict(asyncio.Lock)

        self._shard_data_cache: Dict[int, list[Optional[CID]]] = {}
```
ChatGPT had a good point here
- Memory bounds: `_shard_data_cache` has no capacity limit and may grow without bound as clients hit many shards in long-lived processes. Consider an LRU for shard lists (configurable size) to bound memory.
Instead of a plain dict, can we use an LRU that ships with Python itself? Our apps using this are long-lived processes, so without an eviction policy their cache would grow unbounded forever.
I asked Claude Opus for a suggestion (can crosscheck with chatgpt 5)
Basic:

Actionable comments posted: 4
♻️ Duplicate comments (1)
py_hamt/hamt_to_sharded_converter.py (1)
44-44: Chunk shape derivation may fail for unchunked datasets.

The function logic is correct. However, `source_dataset.chunks` can be `None` for unchunked arrays, causing a `TypeError` when accessing `source_dataset.chunks[dim][0]`.

Add a safety check and fallback:

```diff
-    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
+    if source_dataset.chunks is None:  # unchunked source
+        chunk_shape_tuple = array_shape_tuple  # Use full array dims as chunk shape
+    else:
+        chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
```
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store.py (4)
1-1: Remove unused import.

The `asyncio` import is not used in the test file.

```diff
-import asyncio
-
 import dag_cbor
```
183-183: Remove print statement from test code.

Test code should not contain print statements.

```diff
-    print(f"Data written to ShardedZarrStore with root CID: {final_cid}")
     assert final_cid is not None
```
200-200: Remove print statement from test code.

```diff
-    print("\n✅ Append test successful! Data verified.")
```
752-752: Remove debugging print statement.

Multiple print statements should be removed from the test code.

```diff
-    print(f"Testing with chunk key: {chunk_key}")
```
py_hamt/sharded_zarr_store.py (2)
195-195: Add explicit `strict` parameter to `zip()`.

Python 3.10+ recommends an explicit `strict` parameter for `zip()` to catch length mismatches.

```diff
-        for a, c in zip(self._array_shape, self._chunk_shape)
+        for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
```

Also applies to lines 767 and 806.
809-814: Simplify conditional logic for num_shards calculation.

The conditional check for `self._num_shards is not None` is unnecessary since it's initialized to 0 in `__init__`.

```diff
-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards
         self._num_shards = (
             (self._total_chunks + self._chunks_per_shard - 1) // self._chunks_per_shard
             if self._total_chunks > 0
             else 0
         )
```
📒 Files selected for processing (7)
- py_hamt/__init__.py (2 hunks)
- py_hamt/hamt_to_sharded_converter.py (1 hunks)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_store_deleting.py (1 hunks)
- tests/test_sharded_store_grafting.py (1 hunks)
- tests/test_sharded_zarr_store.py (1 hunks)
- tests/testing_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/testing_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
- py_hamt/__init__.py
- tests/test_sharded_store_deleting.py
- tests/test_sharded_store_grafting.py
🧰 Additional context used
🧬 Code graph analysis (3)
py_hamt/sharded_zarr_store.py (1)
py_hamt/store_httpx.py (10)
- ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444), InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store.py (2)
py_hamt/store_httpx.py (4)
- KuboCAS (118-586), save (28-32), save (73-76), save (405-444)

py_hamt/sharded_zarr_store.py (20)
- ShardedZarrStore (137-870), open (205-236), flush (519-562), _load_or_initialize_shard_cache (398-453), get (46-53), get (564-619), exists (682-693), list_prefix (737-740), delete (707-727), with_read_only (471-510), list_dir (859-870), get_partial_values (462-469), _parse_chunk_key (343-382), MemoryBoundedLRUCache (20-134), put (55-96), cache_size (127-129), estimated_memory_usage (122-124), dirty_cache_size (132-134), mark_clean (104-107), clear (109-115)
py_hamt/hamt_to_sharded_converter.py (4)
py_hamt/hamt.py (4)
- HAMT (288-705), keys (673-687), get_pointer (610-624), HAMT (287-696)

py_hamt/sharded_zarr_store.py (4)
- ShardedZarrStore (137-870), open (205-236), set_pointer (660-680), flush (519-562)

py_hamt/store_httpx.py (2)
- KuboCAS (118-586), KuboCAS (76-429)

py_hamt/zarr_hamt_store.py (3)
- ZarrHAMTStore (12-322), read_only (108-111), ZarrHAMTStore (11-285)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
189-189: Avoid specifying long messages outside the exception class
(TRY003)
191-191: Avoid specifying long messages outside the exception class
(TRY003)
195-195: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
226-228: Avoid specifying long messages outside the exception class
(TRY003)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
235-235: Avoid specifying long messages outside the exception class
(TRY003)
269-271: Abstract raise to an inner function
(TRY301)
269-271: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Prefer TypeError exception for invalid type
(TRY004)
273-273: Abstract raise to an inner function
(TRY301)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
279-281: Avoid specifying long messages outside the exception class
(TRY003)
291-293: Avoid specifying long messages outside the exception class
(TRY003)
316-316: Abstract raise to an inner function
(TRY301)
316-316: Avoid specifying long messages outside the exception class
(TRY003)
322-322: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
379-381: Avoid specifying long messages outside the exception class
(TRY003)
428-430: Avoid specifying long messages outside the exception class
(TRY003)
436-436: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
436-436: Avoid specifying long messages outside the exception class
(TRY003)
439-439: Avoid specifying long messages outside the exception class
(TRY003)
452-452: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
508-508: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
527-527: Avoid specifying long messages outside the exception class
(TRY003)
577-579: Avoid specifying long messages outside the exception class
(TRY003)
608-610: Avoid specifying long messages outside the exception class
(TRY003)
623-623: Avoid specifying long messages outside the exception class
(TRY003)
636-636: Avoid specifying long messages outside the exception class
(TRY003)
656-656: Do not catch blind exception: Exception
(BLE001)
657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
657-657: Avoid specifying long messages outside the exception class
(TRY003)
709-709: Avoid specifying long messages outside the exception class
(TRY003)
716-716: Avoid specifying long messages outside the exception class
(TRY003)
744-744: Avoid specifying long messages outside the exception class
(TRY003)
767-767: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
790-790: Avoid specifying long messages outside the exception class
(TRY003)
797-797: Avoid specifying long messages outside the exception class
(TRY003)
799-801: Avoid specifying long messages outside the exception class
(TRY003)
806-806: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
833-833: Avoid specifying long messages outside the exception class
(TRY003)
839-841: Avoid specifying long messages outside the exception class
(TRY003)
py_hamt/hamt_to_sharded_converter.py
125-125: Do not catch blind exception: Exception
(BLE001)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
py_hamt/sharded_zarr_store.py
[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).
🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)
20-135: LGTM! Excellent memory-bounded cache implementation.

The MemoryBoundedLRUCache class effectively addresses the unbounded memory growth concern raised in previous reviews. Key strengths:
- Proper LRU eviction when memory limit is exceeded
- Protection of dirty shards from eviction
- Thread-safe async operations with locks
- Accurate memory tracking using sys.getsizeof()
except Exception as e:
    print(f"\nAn error occurred: {e}")
🛠️ Refactor suggestion
Avoid catching blind exception.
The broad Exception catch could hide important errors and make debugging difficult.
Catch specific exceptions that are expected during conversion:
- except Exception as e:
- print(f"\nAn error occurred: {e}")
+ except (ValueError, KeyError, RuntimeError) as e:
+ print(f"\nAn error occurred: {e}")
+ import sys
+ sys.exit(1)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- except Exception as e:
-     print(f"\nAn error occurred: {e}")
+ except (ValueError, KeyError, RuntimeError) as e:
+     print(f"\nAn error occurred: {e}")
+     import sys
+     sys.exit(1)
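As an aside, the `raise ... from e` form mentioned in the lint output matters because it preserves the original error as `__cause__` instead of swallowing it. A small self-contained illustration (the `decode` helper is invented, not the converter's API):

```python
def decode(raw: bytes) -> dict:
    # Narrow catch plus chaining: the original ValueError survives
    # on the new exception as __cause__, so tracebacks show both.
    try:
        if not raw:
            raise ValueError("empty payload")
        return {"ok": True}
    except ValueError as e:
        raise RuntimeError("decode failed") from e
```

Without `from e`, Python still records the context implicitly, but `from e` makes the causal link explicit and survives re-raising patterns that would otherwise lose it.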
🧰 Tools
🪛 Ruff (0.12.2)
125-125: Do not catch blind exception: Exception
(BLE001)
maybe something to consider here @TheGreatAlgo ?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
py_hamt/sharded_zarr_store.py
Outdated
except Exception as e:
    raise ValueError(f"Failed to decode root object: {e}")
self._root_obj = dag_cbor.decode(root_bytes)
Duplicate decode operation and broad exception handling.
The root object is decoded twice (line 267 and 276), and the broad Exception catch could hide important errors.
  try:
      self._root_obj = dag_cbor.decode(root_bytes)
      if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
          raise ValueError(
              "Root object is not a valid dictionary with 'chunks' key."
          )
      if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
          raise ValueError("shard_cids is not a list.")
- except Exception as e:
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
      raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)

📝 Committable suggestion
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError(
+             "Root object is not a valid dictionary with 'chunks' key."
+         )
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
+     raise ValueError(f"Failed to decode root object: {e}")
🧰 Tools
🪛 Ruff (0.12.2)
274-274: Do not catch blind exception: Exception
(BLE001)
275-275: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 274 to 276, remove the duplicate
dag_cbor.decode call and the overly broad except Exception; instead catch the
specific decode-related exception(s) (e.g., ValueError/CBORDecodeError or the
library's decode error), assign the decoded result to self._root_obj from the
earlier successful decode (or decode once and reuse its variable), and re-raise
a ValueError with the original exception chained/preserved so you don't swallow
important error details.
hmm it appears to be decoded twice is this correct?
self._root_obj = dag_cbor.decode(root_bytes) on here twice?
@Faolain, you're absolutely correct! This is indeed a bug - dag_cbor.decode(root_bytes) is being called twice:
- Line 267: Inside the try block, decodes and assigns to self._root_obj
- Line 276: Outside the try block, decodes again and overwrites self._root_obj
This is problematic because:
- The validation performed inside the try block becomes meaningless since the result gets overwritten
- If the second decode fails, it would raise an unhandled exception
- It's wasteful to decode the same data twice
The correct fix is to simply remove line 276 since the decoding and assignment already happened successfully inside the try block:
async def _load_root_from_cid(self):
    root_bytes = await self.cas.load(self._root_cid)
    try:
        self._root_obj = dag_cbor.decode(root_bytes)
        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
            raise ValueError(
                "Root object is not a valid dictionary with 'chunks' key."
            )
        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
            raise ValueError("shard_cids is not a list.")
    except (ValueError, dag_cbor.CBORDecodeError) as e:
        raise ValueError(f"Failed to decode root object: {e}") from e
    # Remove this duplicate line:
    # self._root_obj = dag_cbor.decode(root_bytes)

Good catch! This would have caused the validation logic to be bypassed.
@TheGreatAlgo this too
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
py_hamt/sharded_zarr_store.py
Outdated
print(
    f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
)
🛠️ Refactor suggestion
Replace print statements with proper logging.
Print statements should use the logging module for better control in production environments.
+import logging
+
+logger = logging.getLogger(__name__)
+
# Log the failure and raise a specific error
- print(
- f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
- )
+ logger.error(
+ f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
+ )

Also applies to line 340 - replace print(f"Error fetching shard {shard_idx}: {e}") with logger.error(...).
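A minimal sketch of the module-level logger pattern suggested above. The `report_shard_failure` helper is hypothetical; note that `exc_info` accepts an exception instance, so the traceback is recorded without relying on an active `except` block:

```python
import logging

# One logger per module, named after the module, so callers can
# route, filter, or silence output via the logging hierarchy.
logger = logging.getLogger(__name__)


def report_shard_failure(shard_idx: int, max_retries: int, err: Exception) -> None:
    # %-style lazy formatting defers string building until a handler
    # actually emits the record; exc_info=err attaches the traceback.
    logger.error(
        "Failed to fetch shard %s after %s attempts: %s",
        shard_idx,
        max_retries,
        err,
        exc_info=err,
    )
```

Unlike `print`, this respects log levels and handlers configured by the application embedding the library.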
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 332-334 (and also the similar
occurrence at line 340), replace the print statements with proper logging calls:
use the module logger (ensure a logger is created/imported at the top, e.g.,
logger = logging.getLogger(__name__)) and change the prints to logger.error with
a clear message including shard_idx and max_retries, and pass exception info
(exc_info=True) where appropriate so stack traces are recorded; keep messages
concise and consistent with other log lines in the file.
except Exception as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}")
🛠️ Refactor suggestion
Improve exception handling specificity.
The broad Exception catch could hide important errors during CAS save operations.
  try:
      data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
      await self.set_pointer(key, str(data_cid_obj))
- except Exception as e:
-     raise RuntimeError(f"Failed to save data for key {key}: {e}")
+ except (ConnectionError, TimeoutError, RuntimeError) as e:
+     raise RuntimeError(f"Failed to save data for key {key}: {e}") from e

📝 Committable suggestion
try:
    data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
    await self.set_pointer(key, str(data_cid_obj))
except (ConnectionError, TimeoutError, RuntimeError) as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
🧰 Tools
🪛 Ruff (0.12.2)
656-656: Do not catch blind exception: Exception
(BLE001)
657-657: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
657-657: Avoid specifying long messages outside the exception class
(TRY003)
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag sharding
[warning] 40-657: Coverage gap: 23 statements not covered. 95% coverage in this file (missing lines: 40, 64-65, 96, 269, 273-275, 325-335, 428-436, 452, 527, 656-657).
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 656-657, the broad `except
Exception as e` hides specific errors during CAS save; replace it by catching
only the expected exceptions from the CAS save operation (e.g., the CAS client’s
specific exception types, OSError/IOError, ValueError, etc.), let unexpected
exceptions propagate, and when re-raising wrap with `raise RuntimeError(...)
from e` to preserve the original traceback; if you don't know exact CAS
exception classes, remove the blanket except and/or import and catch the CAS
library's specific error types and include `from e` when raising.
Actionable comments posted: 1
♻️ Duplicate comments (5)
tests/test_sharded_zarr_store_coverage.py (2)
19-23: Delete dead, commented-out blocks

These commented fragments age poorly and confuse intent. Remove them.
Also applies to: 85-89, 180-186, 232-236, 251-263
529-553: Assert retry failures via logging instead of patching

Production code should log, not print. Switch assertions to caplog.

- # This should retry and eventually succeed (testing retry logic lines 325-329)
- with patch("builtins.print") as mock_print:
-     shard_data = await store._load_or_initialize_shard_cache(0)
-     assert shard_data is not None
-     assert len(shard_data) == 4
+ # This should retry and eventually succeed
+ import logging
+ with caplog.at_level(logging.ERROR):
+     shard_data = await store._load_or_initialize_shard_cache(0)
+     assert shard_data is not None
+     assert len(shard_data) == 4
@@
- # This should hit lines 332-337 (failure after max retries)
- with patch("builtins.print") as mock_print:
-     with pytest.raises(
-         RuntimeError, match="Failed to fetch shard 0 after 3 attempts"
-     ):
-         await store_fail._load_or_initialize_shard_cache(0)
-     # Should print failure message (line 333)
-     mock_print.assert_called()
+ # This should fail after max retries and log an error
+ with caplog.at_level(logging.ERROR):
+     with pytest.raises(RuntimeError, match="Failed to fetch shard 0 after 3 attempts"):
+         await store_fail._load_or_initialize_shard_cache(0)
+     assert any("Failed to fetch shard 0" in rec.message for rec in caplog.records)
317-347: Replace prints with logging; narrow exceptions and chain causes

Use logging; the exponential backoff stays. Chain causes on failure; avoid blind except.

+ import logging
+ logger = logging.getLogger(__name__)
@@
  except (ConnectionError, TimeoutError) as e:
      # Handle transient errors (e.g., network issues)
      if attempt < max_retries - 1:
          await asyncio.sleep(
              retry_delay * (2**attempt)
          )  # Exponential backoff
          continue
      else:
-         # Log the failure and raise a specific error
-         print(
-             f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-         )
-         raise RuntimeError(
-             f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-         )
- except Exception as e:
-     # Handle non-transient errors immediately
-     print(f"Error fetching shard {shard_idx}: {e}")
-     raise
+         logger.error("Failed to fetch shard %s after %s attempts: %s",
+                      shard_idx, max_retries, e)
+         raise RuntimeError(
+             f"Failed to fetch shard {shard_idx} after {max_retries} attempts"
+         ) from e
+ except (ValueError, TypeError, dag_cbor.CBORDecodeError) as e:
+     logger.error("Error fetching or decoding shard %s: %s", shard_idx, e)
+     raise
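For context, the retry-with-exponential-backoff shape preserved in the diff above can be sketched as a standalone helper. `fetch_with_retry` is a hypothetical name, not the store's actual API; it retries only transient error types and chains the final failure:

```python
import asyncio


async def fetch_with_retry(fetch, max_retries: int = 3, retry_delay: float = 0.01):
    """Call an async fetch(), retrying transient failures with
    exponential backoff: delay, 2*delay, 4*delay, ..."""
    for attempt in range(max_retries):
        try:
            return await fetch()
        except (ConnectionError, TimeoutError) as e:
            if attempt < max_retries - 1:
                # Back off before the next attempt.
                await asyncio.sleep(retry_delay * (2**attempt))
                continue
            # Out of attempts: surface a specific error, keeping the
            # original exception chained as __cause__.
            raise RuntimeError(
                f"failed after {max_retries} attempts"
            ) from e
```

Non-transient errors (anything outside the caught tuple) propagate immediately, matching the review's advice to narrow the except clause.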
555-567: Unify IPLD ID handling; avoid coercing to str

Coercing IDs breaks generic CAS backends (e.g., InMemoryCAS expects bytes). Keep IDs as returned by cas.save/load and only decode strings when input truly is a string.

- if self._dirty_root:
+ if self._dirty_root:
      # Ensure all metadata CIDs are CID objects for correct encoding
      self._root_obj["metadata"] = {
          k: (CID.decode(v) if isinstance(v, str) else v)
          for k, v in self._root_obj["metadata"].items()
      }
      root_obj_bytes = dag_cbor.encode(self._root_obj)
      new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-     self._root_cid = str(new_root_cid)
+     self._root_cid = new_root_cid
      self._dirty_root = False
@@
- data = await self.cas.load(str(metadata_cid_obj))
+ data = await self.cas.load(metadata_cid_obj)
  return prototype.buffer.from_bytes(data)
@@
- data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
- await self.set_pointer(key, str(data_cid_obj))
- except Exception as e:
-     raise RuntimeError(f"Failed to save data for key {key}: {e}")
+ data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+ await self.set_pointer(key, data_cid_obj)
+ except (ConnectionError, TimeoutError, RuntimeError, ValueError) as e:
+     raise RuntimeError(f"Failed to save data for key {key}") from e
@@
- async def set_pointer(self, key: str, pointer: str) -> None:
+ async def set_pointer(self, key: str, pointer: CID | bytes | str) -> None:
      chunk_coords = self._parse_chunk_key(key)
-     pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+     pointer_cid_obj = CID.decode(pointer) if isinstance(pointer, str) else pointer  # type: ignore[assignment]

Follow-up: if you want flush() to return a string CID, return str(self._root_cid) at the end.

Also applies to: 585-586, 659-662, 665-673
269-282: Decode once and raise with cause; remove duplicate decode and broad catch

_load_root_from_cid() decodes twice and catches Exception. Decode once, validate, and re-raise with from e.

- root_bytes = await self.cas.load(self._root_cid)
- try:
-     self._root_obj = dag_cbor.decode(root_bytes)
-     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-         raise ValueError(
-             "Root object is not a valid dictionary with 'chunks' key."
-         )
-     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-         raise ValueError("shard_cids is not a list.")
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ root_bytes = await self.cas.load(self._root_cid)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except (ValueError, dag_cbor.CBORDecodeError) as e:
+     raise ValueError("Failed to decode root object") from e
🧹 Nitpick comments (3)
tests/test_sharded_zarr_store_coverage.py (1)
132-136: Remove unused caplog argument

caplog is unused in this test. Drop it to satisfy ARG001.

- async def test_sharded_zarr_store_shard_handling_exceptions(
-     create_ipfs: tuple[str, str], caplog
- ):
+ async def test_sharded_zarr_store_shard_handling_exceptions(
+     create_ipfs: tuple[str, str]
+ ):

py_hamt/sharded_zarr_store.py (2)
771-776: Make zip strict in graft/resize math

Avoid silent truncation when offsets/ranks mismatch.

- global_coords = tuple(
-     c_local + c_offset
-     for c_local, c_offset in zip(local_coords, chunk_offset)
- )
+ global_coords = tuple(
+     c_local + c_offset
+     for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
+ )
@@
- self._chunks_per_dim = tuple(
-     math.ceil(a / c) if c > 0 else 0
-     for a, c in zip(self._array_shape, self._chunk_shape)
- )
+ self._chunks_per_dim = tuple(
+     math.ceil(a / c) if c > 0 else 0
+     for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+ )

Also applies to: 809-812
122-125: Don't implement __contains__ as async

`in` won't use an async __contains__. Rename to contains() or make it sync.

- async def __contains__(self, shard_idx: int) -> bool:
-     async with self._cache_lock:
-         return shard_idx in self._cache
+ def contains(self, shard_idx: int) -> bool:
+     # Best-effort check; not acquiring lock is acceptable for a hint
+     return shard_idx in self._cache
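To see why the async `__contains__` is a trap: the `in` operator calls `__contains__` synchronously and truth-tests whatever comes back, so an async version hands it a coroutine object, which is always truthy. A hypothetical demonstration (these tiny classes are not the store's cache):

```python
import inspect


class AsyncCache:
    """Anti-pattern: `in` never awaits this, so every membership
    test would effectively evaluate to True."""

    def __init__(self):
        self._items: set[int] = set()

    async def __contains__(self, key: int) -> bool:
        return key in self._items


class SyncCache:
    """Fix: keep __contains__ synchronous (or expose a separately
    named async contains() coroutine instead)."""

    def __init__(self):
        self._items = {1}

    def __contains__(self, key: int) -> bool:
        return key in self._items


bad = AsyncCache()
# This is the value `in` would truth-test: a coroutine, not a bool.
result = type(bad).__contains__(bad, 999)
assert inspect.iscoroutine(result)
result.close()  # avoid a "coroutine was never awaited" warning
```

A closed coroutine object is still truthy, which is exactly why `999 in bad` would wrongly report membership.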
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/zarr_hamt_store.py (1)
  ZarrHAMTStore (11-285)
py_hamt/store_httpx.py (3)
  InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (3)
py_hamt/store_httpx.py (7)
  KuboCAS (118-586), save (28-32), save (73-76), save (405-444), load (35-42), load (78-115), load (446-505)
tests/testing_utils.py (1)
  create_ipfs (149-195)
py_hamt/zarr_hamt_store.py (3)
  read_only (108-111), set_partial_values (225-229), exists (193-199)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Abstract raise to an inner function
(TRY301)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
340-342: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
340-342: Avoid specifying long messages outside the exception class
(TRY003)
384-386: Avoid specifying long messages outside the exception class
(TRY003)
433-435: Avoid specifying long messages outside the exception class
(TRY003)
441-441: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
441-441: Avoid specifying long messages outside the exception class
(TRY003)
444-444: Avoid specifying long messages outside the exception class
(TRY003)
457-457: Avoid specifying long messages outside the exception class
(TRY003)
479-479: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
513-513: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
532-532: Avoid specifying long messages outside the exception class
(TRY003)
582-584: Avoid specifying long messages outside the exception class
(TRY003)
613-615: Avoid specifying long messages outside the exception class
(TRY003)
628-628: Avoid specifying long messages outside the exception class
(TRY003)
641-641: Avoid specifying long messages outside the exception class
(TRY003)
661-661: Do not catch blind exception: Exception
(BLE001)
662-662: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
662-662: Avoid specifying long messages outside the exception class
(TRY003)
714-714: Avoid specifying long messages outside the exception class
(TRY003)
721-721: Avoid specifying long messages outside the exception class
(TRY003)
749-749: Avoid specifying long messages outside the exception class
(TRY003)
772-772: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
795-795: Avoid specifying long messages outside the exception class
(TRY003)
802-802: Avoid specifying long messages outside the exception class
(TRY003)
804-806: Avoid specifying long messages outside the exception class
(TRY003)
811-811: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
838-838: Avoid specifying long messages outside the exception class
(TRY003)
844-846: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_sharded_zarr_store_coverage.py
133-133: Unused function argument: caplog
(ARG001)
348-348: Unused method argument: cid
(ARG002)
390-390: Unused method argument: cid
(ARG002)
431-431: Unused method argument: cid
(ARG002)
457-457: Unused method argument: cid
(ARG002)
496-496: Unused method argument: cid
(ARG002)
499-499: Avoid specifying long messages outside the exception class
(TRY003)
505-505: Unused method argument: data
(ARG002)
505-505: Unused method argument: codec
(ARG002)
600-600: Unused method argument: cid
(ARG002)
600-600: Unused method argument: offset
(ARG002)
600-600: Unused method argument: length
(ARG002)
600-600: Unused method argument: suffix
(ARG002)
673-673: Unused method argument: data
(ARG002)
673-673: Unused method argument: codec
(ARG002)
675-675: Avoid specifying long messages outside the exception class
(TRY003)
710-710: Unused method argument: data
(ARG002)
710-710: Unused method argument: codec
(ARG002)
765-765: Unused method argument: cid
(ARG002)
784-784: Unused function argument: shard_idx
(ARG001)
818-818: Unused method argument: cid
(ARG002)
844-844: Unused function argument: timeout
(ARG001)
877-877: Unused method argument: cid
(ARG002)
900-900: Unused function argument: timeout
(ARG001)
908-908: Unused function argument: shard_idx
(ARG001)
🔇 Additional comments (4)
tests/test_sharded_zarr_store_coverage.py (2)
266-328: LRU cache edge-cases well covered

Nice coverage for empty shard sizing, update of existing entries, and eviction when only dirty shards exist.
801-858: Timeout cleanup scenario is precise

Good use of a manual pending event and an asyncio.wait_for patch to validate cleanup of _pending_shard_loads.
29-45: MemoryBoundedLRUCache looks solid

Size accounting, dirty protection, and eviction loop behavior are clean and deterministic.
Also applies to: 55-103
736-787: Grafting path is correct and lock-scoped

Good lock granularity per target shard, and dirty marking only on change.
if not all(cs > 0 for cs in self._chunk_shape):
    raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
    raise ValueError("All array_shape dimensions must be non-negative.")

self._chunks_per_dim = tuple(
    math.ceil(a / c) if c > 0 else 0
    for a, c in zip(self._array_shape, self._chunk_shape)
)
🛠️ Refactor suggestion
Add rank check and use zip(..., strict=True)
Guard mismatched shapes early and make zip intent explicit.
- if not all(cs > 0 for cs in self._chunk_shape):
+ if len(self._array_shape) != len(self._chunk_shape):
+     raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+ if not all(cs > 0 for cs in self._chunk_shape):
      raise ValueError("All chunk_shape dimensions must be positive.")
  if not all(s >= 0 for s in self._array_shape):
      raise ValueError("All array_shape dimensions must be non-negative.")
- self._chunks_per_dim = tuple(
-     math.ceil(a / c) if c > 0 else 0
-     for a, c in zip(self._array_shape, self._chunk_shape)
- )
+ self._chunks_per_dim = tuple(
+     math.ceil(a / c) if c > 0 else 0
+     for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+ )

📝 Committable suggestion
if len(self._array_shape) != len(self._chunk_shape):
    raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
if not all(cs > 0 for cs in self._chunk_shape):
    raise ValueError("All chunk_shape dimensions must be positive.")
if not all(s >= 0 for s in self._array_shape):
    raise ValueError("All array_shape dimensions must be non-negative.")
self._chunks_per_dim = tuple(
    math.ceil(a / c) if c > 0 else 0
    for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
)
🧰 Tools
🪛 Ruff (0.12.2)
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 193 to 201, add an explicit rank
check to ensure len(self._array_shape) == len(self._chunk_shape) and raise a
ValueError with a clear message if they differ, then compute
self._chunks_per_dim using zip(self._array_shape, self._chunk_shape,
strict=True) (keeping the existing math.ceil(a / c) if c > 0 else 0 logic) so
mismatched shapes are caught early and the zip intent is explicit.
Actionable comments posted: 5
♻️ Duplicate comments (3)
py_hamt/sharded_zarr_store.py (3)
269-282: Decode-once and narrow rethrow; remove duplicate decode

Root is decoded twice and exceptions aren't chained.

- try:
-     self._root_obj = dag_cbor.decode(root_bytes)
-     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-         raise ValueError(
-             "Root object is not a valid dictionary with 'chunks' key."
-         )
-     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-         raise ValueError("shard_cids is not a list.")
- except Exception as e:
-     raise ValueError(f"Failed to decode root object: {e}")
- self._root_obj = dag_cbor.decode(root_bytes)
+ try:
+     self._root_obj = dag_cbor.decode(root_bytes)
+     if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+         raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+     if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+         raise ValueError("shard_cids is not a list.")
+ except Exception as e:
+     raise ValueError("Failed to decode root object") from e
549-561: Do not coerce root CID to string; only stringify on return

This prevents incompatibilities with non-string IPLD kinds (e.g., InMemoryCAS). Return a string for API consistency.
```diff
-        new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-        self._root_cid = str(new_root_cid)
+        new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
+        self._root_cid = new_root_cid
         self._dirty_root = False
-
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)  # unified return type
```
190-201: Add rank check and use strict zip in geometry

Catch shape rank mismatches early; make the `zip` intent explicit.

```diff
-        if not all(cs > 0 for cs in self._chunk_shape):
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+        if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
-        self._chunks_per_dim = tuple(
-            math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
-        )
+        self._chunks_per_dim = tuple(
+            math.ceil(a / c) if c > 0 else 0
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+        )
```
🧹 Nitpick comments (6)
tests/test_sharded_zarr_store_coverage.py (3)
132-134: Remove unused caplog parameter
`caplog` is not used. Drop it to satisfy linters and reduce noise.

```diff
-async def test_sharded_zarr_store_shard_handling_exceptions(
-    create_ipfs: tuple[str, str], caplog
-):
+async def test_sharded_zarr_store_shard_handling_exceptions(
+    create_ipfs: tuple[str, str],
+):
```
346-359: Prefix unused mock arguments with underscores

Silence ARG001/ARG002 by prefixing unused parameters (`cid`, `data`, `codec`, `timeout`) with `_`. Repeat the pattern for all mocks in this file.

```diff
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def load(self, cid):
+    async def load(self, _cid):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def load(self, cid, offset=None, length=None, suffix=None):
+    async def load(self, _cid, offset=None, length=None, suffix=None):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def save(self, data, codec=None):
+    async def save(self, _data, _codec=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_wait_for(coro, timeout=None):
+    async def mock_wait_for(coro, _timeout=None):
@@
-    async def mock_cache_get(shard_idx):
+    async def mock_cache_get(_shard_idx):
```

Also applies to: 387-399, 427-429, 452-464, 487-500, 583-586, 654-657, 690-693, 745-751, 797-801, 854-858, 823-827, 877-884, 885-888
329-335: Avoid brittle line-number references in test docstrings

Docstrings that mention source line numbers will rot quickly. Rephrase to describe behavior/intent instead.
Also applies to: 569-575
py_hamt/sharded_zarr_store.py (3)
21-27: Clarify that size is an estimate; consider naming and doc tweak
`sys.getsizeof` undercounts nested objects; it's an estimate. Update the docstring accordingly to avoid misleading claims.

```diff
-    Memory usage is calculated using sys.getsizeof for accurate sizing.
+    Memory usage is estimated using sys.getsizeof on the list and elements.
```

Also applies to: 37-45
46-53: Use OrderedDict.move_to_end for LRU bump

Simpler and faster than pop/assign.

```diff
-        shard_data = self._cache.pop(shard_idx)
-        self._cache[shard_idx] = shard_data
+        shard_data = self._cache[shard_idx]
+        self._cache.move_to_end(shard_idx, last=True)
         return shard_data
```
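A minimal sketch of the suggested pattern: `move_to_end` bumps a key to the most-recently-used position in O(1) without re-hashing the value. Class and field names here are hypothetical, not py_hamt's actual cache:

```python
from collections import OrderedDict

class TinyLRU:
    """Toy LRU cache illustrating OrderedDict.move_to_end for the recency bump."""

    def __init__(self, capacity: int):
        self._cache: OrderedDict[int, bytes] = OrderedDict()
        self._capacity = capacity

    def get(self, key: int):
        if key not in self._cache:
            return None
        # O(1) bump to most-recently-used; no pop/reinsert needed.
        self._cache.move_to_end(key, last=True)
        return self._cache[key]

    def put(self, key: int, value: bytes) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key, last=True)
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least-recently-used
```

Usage: a `get` on an old key protects it from the next eviction, which is exactly the behavior the pop/assign idiom was emulating.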
776-781: Backpressure when all shards are dirty

If many shards are dirty, memory can exceed the bound since dirty shards aren't evicted. Consider optional auto-flush or a high-watermark callback to shed memory.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- py_hamt/sharded_zarr_store.py (1 hunks)
- tests/test_sharded_zarr_store_coverage.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (10)
ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444), InMemoryCAS (39-73), ContentAddressedStore (13-36), KuboCAS (76-429)
tests/test_sharded_zarr_store_coverage.py (12)
load (346-358), load (387-388), load (427-428), load (452-463), load (487-494), load (583-584), load (745-748), load (797-799), load (854-857), save (496-501), save (654-656), save (690-693)
tests/test_sharded_zarr_store_coverage.py (2)
py_hamt/store_httpx.py (7)
KuboCAS (118-586), save (28-32), save (73-76), save (405-444), load (35-42), load (78-115), load (446-505)
py_hamt/sharded_zarr_store.py (13)
ShardedZarrStore (142-869), _load_or_initialize_shard_cache (397-452), get (46-53), get (563-618), resize_store (782-824), MemoryBoundedLRUCache (20-139), _get_shard_size (37-44), put (55-101), dirty_cache_size (137-139), _load_root_from_cid (269-298), with_read_only (470-509), exists (681-692), flush (518-561)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Abstract raise to an inner function
(TRY301)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
336-338: Avoid specifying long messages outside the exception class
(TRY003)
340-340: Use raise without specifying exception name
Remove exception name
(TRY201)
378-380: Avoid specifying long messages outside the exception class
(TRY003)
427-429: Avoid specifying long messages outside the exception class
(TRY003)
435-435: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
435-435: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Avoid specifying long messages outside the exception class
(TRY003)
451-451: Avoid specifying long messages outside the exception class
(TRY003)
473-473: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
507-507: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
526-526: Avoid specifying long messages outside the exception class
(TRY003)
576-578: Avoid specifying long messages outside the exception class
(TRY003)
607-609: Avoid specifying long messages outside the exception class
(TRY003)
622-622: Avoid specifying long messages outside the exception class
(TRY003)
635-635: Avoid specifying long messages outside the exception class
(TRY003)
655-655: Do not catch blind exception: Exception
(BLE001)
656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
708-708: Avoid specifying long messages outside the exception class
(TRY003)
715-715: Avoid specifying long messages outside the exception class
(TRY003)
743-743: Avoid specifying long messages outside the exception class
(TRY003)
766-766: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
789-789: Avoid specifying long messages outside the exception class
(TRY003)
796-796: Avoid specifying long messages outside the exception class
(TRY003)
798-800: Avoid specifying long messages outside the exception class
(TRY003)
805-805: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
832-832: Avoid specifying long messages outside the exception class
(TRY003)
838-840: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_sharded_zarr_store_coverage.py
133-133: Unused function argument: caplog
(ARG001)
346-346: Unused method argument: cid
(ARG002)
387-387: Unused method argument: cid
(ARG002)
427-427: Unused method argument: cid
(ARG002)
452-452: Unused method argument: cid
(ARG002)
487-487: Unused method argument: cid
(ARG002)
490-490: Avoid specifying long messages outside the exception class
(TRY003)
496-496: Unused method argument: data
(ARG002)
496-496: Unused method argument: codec
(ARG002)
583-583: Unused method argument: cid
(ARG002)
583-583: Unused method argument: offset
(ARG002)
583-583: Unused method argument: length
(ARG002)
583-583: Unused method argument: suffix
(ARG002)
654-654: Unused method argument: data
(ARG002)
654-654: Unused method argument: codec
(ARG002)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
690-690: Unused method argument: data
(ARG002)
690-690: Unused method argument: codec
(ARG002)
745-745: Unused method argument: cid
(ARG002)
764-764: Unused function argument: shard_idx
(ARG001)
797-797: Unused method argument: cid
(ARG002)
823-823: Unused function argument: timeout
(ARG001)
854-854: Unused method argument: cid
(ARG002)
877-877: Unused function argument: timeout
(ARG001)
885-885: Unused function argument: shard_idx
(ARG001)
🔇 Additional comments (3)
tests/test_sharded_zarr_store_coverage.py (1)
304-326: LRU eviction behavior tests look solid

Good coverage of the "no clean shards to evict" path; verifies dirty shard protection works as intended.
py_hamt/sharded_zarr_store.py (2)
681-693: exists() looks good

Specific exception handling and on-demand shard loading are clean.
470-509: with_read_only shallow clone is efficient and clear

Sharing locks/cache/pending-loads while flipping the flag is appropriate.
```python
def __init__(
    self,
    cas: ContentAddressedStore,
    read_only: bool,
    root_cid: Optional[str] = None,
    *,
    max_cache_memory_bytes: int = 100 * 1024 * 1024,  # 100MB default
):
```
🛠️ Refactor suggestion
Normalize root CID typing and equality
Stop forcing string conversions internally; accept IPLDKind on input, store as-is, and compare by string form for equality to avoid type mismatches.
```diff
-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
```

```diff
-        root_cid: Optional[str] = None,
+        root_cid: Optional[CID | str | bytes] = None,
```

```diff
-        return self._root_cid == other._root_cid
+        return str(self._root_cid) == str(other._root_cid)
```

Also applies to: 209-220, 511-516
```python
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
    self._pending_shard_loads[shard_idx] = asyncio.Event()
    shard_cid_str = str(shard_cid_obj)
    await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
else:
    empty_shard = [None] * self._chunks_per_shard
```
🛠️ Refactor suggestion
Avoid str() on shard CIDs; pass IPLDKind to loader
Keep types consistent with CAS expectations.
```diff
-        if shard_cid_obj:
+        if shard_cid_obj:
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # type: ignore[arg-type]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
shard_cid_obj = self._root_obj["chunks"]["shard_cids"][shard_idx]
if shard_cid_obj:
    self._pending_shard_loads[shard_idx] = asyncio.Event()
    await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # type: ignore[arg-type]
else:
    empty_shard = [None] * self._chunks_per_shard
```
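The `_pending_shard_loads` dictionary of `asyncio.Event` objects above is a request-deduplication pattern: concurrent readers of the same shard wait on one fetch rather than issuing duplicates. A self-contained sketch of that pattern, with hypothetical names rather than the store's real API:

```python
import asyncio

class ShardLoader:
    """Dedupes concurrent loads of the same shard via asyncio.Event."""

    def __init__(self):
        self._cache: dict[int, bytes] = {}
        self._pending: dict[int, asyncio.Event] = {}
        self.fetch_count = 0

    async def _fetch(self, idx: int) -> bytes:
        self.fetch_count += 1
        await asyncio.sleep(0.01)  # simulated network latency
        return b"shard-%d" % idx

    async def get(self, idx: int) -> bytes:
        if idx in self._cache:
            return self._cache[idx]
        if idx in self._pending:
            # Another task is already fetching; wait for it instead.
            await self._pending[idx].wait()
            return self._cache[idx]
        event = self._pending[idx] = asyncio.Event()
        try:
            self._cache[idx] = await self._fetch(idx)
        finally:
            event.set()
            del self._pending[idx]
        return self._cache[idx]
```

Five concurrent `get(7)` calls would trigger a single `_fetch`; the other four simply await the event and read the populated cache.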
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 440 to 446, avoid calling str() on
shard CID objects — the CAS expects an IPLDKind (CID) not a string. Replace
shard_cid_str = str(shard_cid_obj) with using the original shard_cid_obj (or
explicitly cast to the IPLDKind/ CID type expected by
_fetch_and_cache_full_shard) and pass that value into await
self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj); ensure no string
conversion is performed and that the loader signature accepts the IPLDKind.
```python
data = await self.cas.load(str(metadata_cid_obj))
return prototype.buffer.from_bytes(data)
# Chunk data request
```
🛠️ Refactor suggestion
Pass CID objects to CAS.load; don’t convert to str
Keeps IPLD types intact and avoids assumptions about CAS backends.
```diff
-        data = await self.cas.load(str(metadata_cid_obj))
+        data = await self.cas.load(metadata_cid_obj)
@@
-        chunk_cid_str = str(chunk_cid_obj)
@@
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )
```

Also applies to: 596-617
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 579-581 (and similarly 596-617),
the code converts CID objects to strings before calling self.cas.load; change
those calls to pass the CID objects directly (e.g.,
self.cas.load(metadata_cid_obj) instead of self.cas.load(str(metadata_cid_obj)))
so IPLD/CID types are preserved for backends that expect native CID objects;
update all occurrences in the specified ranges to remove str(...) wrapping and
adjust any type hints or variable names if needed to reflect that a CID object
(not a str) is being passed.
```python
raw_data_bytes = value.to_bytes()
# Save the data to CAS first to get its CID.
# Metadata is often saved as 'raw', chunks as well unless compressed.
try:
    data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
    await self.set_pointer(key, str(data_cid_obj))
except Exception as e:
    raise RuntimeError(f"Failed to save data for key {key}: {e}")
return None  # type: ignore[return-value]
```
🛠️ Refactor suggestion
Avoid stringifying pointers; accept CID|str in set_pointer and chain exceptions
This removes needless encode/decode cycles and fixes incompatibility with non-string IPLD kinds.
```diff
-    try:
-        data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-        await self.set_pointer(key, str(data_cid_obj))
-    except Exception as e:
-        raise RuntimeError(f"Failed to save data for key {key}: {e}")
+    try:
+        data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
+        await self.set_pointer(key, data_cid_obj)  # pass IPLDKind directly
+    except Exception as e:
+        raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
```

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```

Also applies to: 659-667, 669-679
🧰 Tools
🪛 Ruff (0.12.2)
655-655: Do not catch blind exception: Exception
(BLE001)
656-656: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
656-656: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 649-657 (also apply same changes
to blocks at 659-667 and 669-679), the code stringifies CID objects before
calling set_pointer and raises a new RuntimeError without chaining the original
exception; modify the code to pass the CID object (or accept CID|str) directly
to set_pointer instead of calling str(), update set_pointer's
signature/implementation to accept either a CID object or a string (or normalize
input inside set_pointer), and re-raise the RuntimeError using exception
chaining (raise ... from e) so the original exception context is preserved.
Actionable comments posted: 0
♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)
336-338: Add exception chaining for better debugging.

When retries are exhausted, the original exception should be chained.

```diff
         else:
             raise RuntimeError(
                 f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-            )
+            ) from e
```
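Why chaining matters: `raise ... from e` preserves the underlying failure as `__cause__`, so the network error stays visible in tracebacks. A minimal standalone illustration (the retry loop here is a toy, not the store's actual code):

```python
def fetch_with_retries(attempts: int = 3):
    last_exc = None
    for i in range(attempts):
        try:
            raise ConnectionError(f"attempt {i} failed")  # simulated transient failure
        except ConnectionError as e:
            last_exc = e
    # Chaining keeps the original network error attached as __cause__.
    raise RuntimeError(f"Failed after {attempts} attempts") from last_exc

try:
    fetch_with_retries()
except RuntimeError as e:
    print(type(e.__cause__).__name__)  # ConnectionError
```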
441-442: Pass the CID object directly to avoid unnecessary conversion.

The CAS expects IPLDKind (CID), not a string.

```diff
             self._pending_shard_loads[shard_idx] = asyncio.Event()
-            shard_cid_str = str(shard_cid_obj)
-            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+            await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)
```

Also update the method signature:

```diff
     async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:
```
652-654: Improve exception handling and pass the CID directly.

Catch specific exceptions and pass the CID object without stringification.

```diff
         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
-        except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            await self.set_pointer(key, data_cid_obj)
+        except (ConnectionError, TimeoutError, RuntimeError) as e:
+            raise RuntimeError(f"Failed to save data for key {key}: {e}") from e
```

Also update the set_pointer signature:

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```
853-853: Store CID object directly, not as string.

Metadata CIDs should be stored as CID objects for consistency.

```diff
-        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid
+        self._root_obj["metadata"][zarr_metadata_key] = new_zarr_metadata_cid
```

Note: This line already stores the CID object correctly; no change needed.
193-200: Add rank check and use `zip(..., strict=True)`.

Validate that shapes have the same dimensionality and make zip behavior explicit.

```diff
+        if len(self._array_shape) != len(self._chunk_shape):
+            raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
         if not all(cs > 0 for cs in self._chunk_shape):
             raise ValueError("All chunk_shape dimensions must be positive.")
         if not all(s >= 0 for s in self._array_shape):
             raise ValueError("All array_shape dimensions must be non-negative.")
         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )
```
279-281: Remove duplicate decode and improve exception handling.

The root object is decoded twice (lines 272 and 281), and the broad `Exception` catch could hide important errors.

```diff
         try:
             self._root_obj = dag_cbor.decode(root_bytes)
             if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
                 raise ValueError(
                     "Root object is not a valid dictionary with 'chunks' key."
                 )
             if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
                 raise ValueError("shard_cids is not a list.")
-        except Exception as e:
-            raise ValueError(f"Failed to decode root object: {e}")
-        self._root_obj = dag_cbor.decode(root_bytes)
+        except (ValueError, dag_cbor.CBORDecodeError) as e:
+            raise ValueError(f"Failed to decode root object: {e}") from e
```
577-577: Pass CID objects directly to CAS.load.

Converting CIDs to strings breaks compatibility with CAS backends that expect native CID objects.

```diff
-        data = await self.cas.load(str(metadata_cid_obj))
+        data = await self.cas.load(metadata_cid_obj)
```
594-615: Avoid stringifying the chunk CID.

Pass the CID object directly to maintain type consistency.

```diff
-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None
         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
                     req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
         data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
         )
```
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)
762-765: Use strict=True for zip to catch mismatched dimensions.

Make the zip behavior explicit to catch potential bugs.

```diff
         global_coords = tuple(
             c_local + c_offset
-            for c_local, c_offset in zip(local_coords, chunk_offset)
+            for c_local, c_offset in zip(local_coords, chunk_offset, strict=True)
         )
```
801-803: Use strict=True for zip consistency.

Make the zip behavior explicit as recommended in other locations.

```diff
         self._chunks_per_dim = tuple(
             math.ceil(a / c) if c > 0 else 0
-            for a, c in zip(self._array_shape, self._chunk_shape)
+            for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
         )
```
624-629: Improve metadata key detection logic.

The current check `not key == "zarr.json"` with length-based detection is brittle. Consider a more explicit approach.

```diff
         if (
             key.endswith("zarr.json")
             and not key.startswith("time/")
             and not key.startswith(("lat/", "latitude/"))
             and not key.startswith(("lon/", "longitude/"))
-            and not key == "zarr.json"
+            and key != "zarr.json"  # Exclude root zarr.json
         ):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- py_hamt/sharded_zarr_store.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
ContentAddressedStore (13-60), load (35-42), load (78-115), load (446-505), save (28-32), save (73-76), save (405-444)
tests/test_sharded_zarr_store_coverage.py (12)
load (346-358), load (387-388), load (427-428), load (452-463), load (487-494), load (583-584), load (745-748), load (797-799), load (854-857), save (496-501), save (654-656), save (690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
284-286: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
321-321: Avoid specifying long messages outside the exception class
(TRY003)
327-327: Consider moving this statement to an else block
(TRY300)
336-338: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
336-338: Avoid specifying long messages outside the exception class
(TRY003)
376-378: Avoid specifying long messages outside the exception class
(TRY003)
425-427: Avoid specifying long messages outside the exception class
(TRY003)
433-433: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
433-433: Avoid specifying long messages outside the exception class
(TRY003)
436-436: Avoid specifying long messages outside the exception class
(TRY003)
449-449: Avoid specifying long messages outside the exception class
(TRY003)
471-471: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
505-505: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
524-524: Avoid specifying long messages outside the exception class
(TRY003)
574-576: Avoid specifying long messages outside the exception class
(TRY003)
605-607: Avoid specifying long messages outside the exception class
(TRY003)
620-620: Avoid specifying long messages outside the exception class
(TRY003)
633-633: Avoid specifying long messages outside the exception class
(TRY003)
653-653: Do not catch blind exception: Exception
(BLE001)
654-654: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
654-654: Avoid specifying long messages outside the exception class
(TRY003)
706-706: Avoid specifying long messages outside the exception class
(TRY003)
713-713: Avoid specifying long messages outside the exception class
(TRY003)
741-741: Avoid specifying long messages outside the exception class
(TRY003)
764-764: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
787-787: Avoid specifying long messages outside the exception class
(TRY003)
794-794: Avoid specifying long messages outside the exception class
(TRY003)
796-798: Avoid specifying long messages outside the exception class
(TRY003)
803-803: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
830-830: Avoid specifying long messages outside the exception class
(TRY003)
836-838: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (2)
py_hamt/sharded_zarr_store.py (2)
20-140: Excellent memory-bounded LRU cache implementation!

The `MemoryBoundedLRUCache` is well-designed with:
- Accurate memory sizing using `sys.getsizeof`
- Protection of dirty shards from eviction
- Thread-safe operations with asyncio.Lock
- Proper LRU eviction strategy
This addresses the concern raised in PR comments about unbounded cache growth.
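A compact sketch of the design points listed above, so the eviction-protection idea is concrete. All names here are hypothetical, not py_hamt's actual `MemoryBoundedLRUCache`; sizing uses `sys.getsizeof` as a rough estimate only:

```python
import sys
from collections import OrderedDict

class BoundedCache:
    """Toy memory-bounded cache that never evicts dirty (unflushed) entries."""

    def __init__(self, max_bytes: int):
        self._cache: OrderedDict[int, list] = OrderedDict()
        self._dirty: set[int] = set()
        self._max_bytes = max_bytes
        self._size = 0

    def _sizeof(self, value) -> int:
        # Rough estimate: container plus top-level elements only.
        return sys.getsizeof(value) + sum(sys.getsizeof(v) for v in value)

    def put(self, key: int, value: list, dirty: bool = False) -> None:
        if key in self._cache:
            self._size -= self._sizeof(self._cache.pop(key))
        self._cache[key] = value
        self._size += self._sizeof(value)
        if dirty:
            self._dirty.add(key)
        self._evict()

    def mark_clean(self, key: int) -> None:
        self._dirty.discard(key)
        self._evict()

    def _evict(self) -> None:
        # Walk oldest-first; skip dirty entries, whose contents are unflushed.
        for key in list(self._cache):
            if self._size <= self._max_bytes:
                break
            if key in self._dirty:
                continue
            self._size -= self._sizeof(self._cache.pop(key))
```

Note the trade-off flagged elsewhere in this review: when every entry is dirty, `_evict` cannot free anything and the bound is exceeded until a flush marks entries clean.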
548-556: LGTM - Proper CID handling in flush.

The flush method correctly ensures metadata CIDs are CID objects before encoding and properly marks shards as clean after successful writes.
Actionable comments posted: 3
♻️ Duplicate comments (8)
py_hamt/sharded_zarr_store.py (8)
193-201: Add rank check and use zip(strict=True) to catch shape mismatches early.

```diff
-    if not all(cs > 0 for cs in self._chunk_shape):
+    if len(self._array_shape) != len(self._chunk_shape):
+        raise ValueError("array_shape and chunk_shape must have the same number of dimensions.")
+    if not all(cs > 0 for cs in self._chunk_shape):
         raise ValueError("All chunk_shape dimensions must be positive.")
     if not all(s >= 0 for s in self._array_shape):
         raise ValueError("All array_shape dimensions must be non-negative.")
-    self._chunks_per_dim = tuple(
-        math.ceil(a / c) if c > 0 else 0
-        for a, c in zip(self._array_shape, self._chunk_shape)
-    )
+    self._chunks_per_dim = tuple(
+        math.ceil(a / c) if c > 0 else 0
+        for a, c in zip(self._array_shape, self._chunk_shape, strict=True)
+    )
```
269-281: Decode once; narrow exceptions and chain them.

```diff
-    try:
-        self._root_obj = dag_cbor.decode(root_bytes)
-        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
-            raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
-        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
-            raise ValueError("shard_cids is not a list.")
-    except Exception as e:
-        raise ValueError(f"Failed to decode root object: {e}")
+    try:
+        self._root_obj = dag_cbor.decode(root_bytes)
+        if not isinstance(self._root_obj, dict) or "chunks" not in self._root_obj:
+            raise ValueError("Root object is not a valid dictionary with 'chunks' key.")
+        if not isinstance(self._root_obj["chunks"]["shard_cids"], list):
+            raise ValueError("shard_cids is not a list.")
+    except (ValueError, dag_cbor.CBORDecodeError) as e:
+        raise ValueError("Failed to decode root object.") from e
```
437-445: Don't stringify CIDs when loading shards; pass IPLDKind through.

```diff
     if shard_cid_obj:
         self._pending_shard_loads[shard_idx] = asyncio.Event()
-        shard_cid_str = str(shard_cid_obj)
-        await self._fetch_and_cache_full_shard(shard_idx, shard_cid_str)
+        await self._fetch_and_cache_full_shard(shard_idx, shard_cid_obj)  # pass CID/bytes directly
```
560-577: Metadata path: avoid str() on CIDs.

```diff
-    data = await self.cas.load(str(metadata_cid_obj))
+    data = await self.cas.load(metadata_cid_obj)
```
508-513: Root CID equality should normalize type to avoid CID vs bytes vs str mismatches.

```diff
-    return self._root_cid == other._root_cid
+    return str(self._root_cid) == str(other._root_cid)
```
546-559: Avoid forcing the root CID to str; keep IPLDKind internally and return str from `flush()`.

```diff
         root_obj_bytes = dag_cbor.encode(self._root_obj)
         new_root_cid = await self.cas.save(root_obj_bytes, codec="dag-cbor")
-        self._root_cid = str(new_root_cid)
+        self._root_cid = new_root_cid
         self._dirty_root = False
-        # Ignore because root_cid will always exist after initialization or flush.
-        return self._root_cid  # type: ignore[return-value]
+        return str(self._root_cid)
```
646-654: Pass the CID to `set_pointer` and chain the exception; don’t stringify.

```diff
         try:
             data_cid_obj = await self.cas.save(raw_data_bytes, codec="raw")
-            await self.set_pointer(key, str(data_cid_obj))
+            await self.set_pointer(key, data_cid_obj)
         except Exception as e:
-            raise RuntimeError(f"Failed to save data for key {key}: {e}")
+            raise RuntimeError(f"Failed to save data for key {key}") from e
```
656-676: `set_pointer` should accept `CID | str` and normalize; the current str-only contract is brittle.

```diff
-    async def set_pointer(self, key: str, pointer: str) -> None:
+    async def set_pointer(self, key: str, pointer: CID | str) -> None:
         chunk_coords = self._parse_chunk_key(key)
-        pointer_cid_obj = CID.decode(pointer)  # Convert string to CID object
+        pointer_cid_obj = pointer if isinstance(pointer, CID) else CID.decode(pointer)
```
🧹 Nitpick comments (3)
py_hamt/sharded_zarr_store.py (3)
779-810: Simplify `num_shards` math; no need for a None-guard; add `zip(strict=True)` above.

Minor cleanup: `_num_shards` is always an int; remove the `is not None` guard.

```diff
-        old_num_shards = self._num_shards if self._num_shards is not None else 0
+        old_num_shards = self._num_shards
```
20-45: Note: `sys.getsizeof` underestimates composite memory; OK as a heuristic.

If precise caps matter, consider tracking element counts and average sizes, or use a sampling approach.
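A quick illustration of the caveat, using only the standard library (`deep_sizeof` is an illustrative helper, not part of the store): `sys.getsizeof` on a list counts only the pointer array, not the payloads it references.

```python
import sys

payloads = [b"x" * 1024 for _ in range(100)]  # ~100 KiB of chunk data
shallow = sys.getsizeof(payloads)  # counts only the list's pointer array

def deep_sizeof(obj: object) -> int:
    """Naive recursive estimate; double-counts shared objects, ignores cycles."""
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_sizeof(item) for item in obj)
    elif isinstance(obj, dict):
        size += sum(deep_sizeof(k) + deep_sizeof(v) for k, v in obj.items())
    return size

deep = deep_sizeof(payloads)  # also counts the referenced byte payloads
```

The shallow figure misses roughly 100 KiB of payload bytes here, which is why `sys.getsizeof` is only safe as a heuristic.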
76-101: Eviction stalls when all shards are dirty; cap the dirty set or allow a forced evict.

As written, memory can exceed `max_memory_bytes` if everything is dirty. Consider a soft cap on dirty entries or a "force flush before exceed" hook.
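One possible shape for such a soft cap, as a standalone sketch (`DirtyAwareLRU` and its methods are illustrative names, not the store's API): evict the oldest clean entries first and refuse to drop dirty ones, so callers must flush before memory can be reclaimed.

```python
from collections import OrderedDict

class DirtyAwareLRU:
    """Sketch: evict least-recently-used *clean* entries; never drop dirty ones."""

    def __init__(self, max_items: int) -> None:
        self.max_items = max_items
        self._cache: OrderedDict[int, object] = OrderedDict()
        self._dirty: set[int] = set()

    def put(self, key: int, value: object, dirty: bool = False) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key)
        if dirty:
            self._dirty.add(key)
        self._evict()

    def mark_clean(self, key: int) -> None:
        """Typically called after a flush persists the shard."""
        self._dirty.discard(key)
        self._evict()

    def _evict(self) -> None:
        # Walk from oldest to newest, skipping entries that still need a flush.
        for key in list(self._cache):
            if len(self._cache) <= self.max_items:
                break
            if key in self._dirty:
                continue  # cannot evict unflushed data
            del self._cache[key]
```

Note the cap stays soft: if every entry is dirty, the cache still grows past `max_items` until something is marked clean.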
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`py_hamt/sharded_zarr_store.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
py_hamt/sharded_zarr_store.py (2)
py_hamt/store_httpx.py (7)
`ContentAddressedStore` (13-60), `load` (35-42, 78-115, 446-505), `save` (28-32, 73-76, 405-444)

tests/test_sharded_zarr_store_coverage.py (12)

`load` (346-358, 387-388, 427-428, 452-463, 487-494, 583-584, 745-748, 797-799, 854-857), `save` (496-501, 654-656, 690-693)
🪛 Ruff (0.12.2)
py_hamt/sharded_zarr_store.py
194-194: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
200-200: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
231-233: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
274-276: Abstract raise to an inner function
(TRY301)
274-276: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Prefer TypeError exception for invalid type
(TRY004)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
280-280: Avoid specifying long messages outside the exception class
(TRY003)
283-285: Avoid specifying long messages outside the exception class
(TRY003)
295-297: Avoid specifying long messages outside the exception class
(TRY003)
320-320: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
375-377: Avoid specifying long messages outside the exception class
(TRY003)
424-426: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
432-432: Avoid specifying long messages outside the exception class
(TRY003)
435-435: Avoid specifying long messages outside the exception class
(TRY003)
448-448: Avoid specifying long messages outside the exception class
(TRY003)
470-470: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
504-504: Comment contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
523-523: Avoid specifying long messages outside the exception class
(TRY003)
573-575: Avoid specifying long messages outside the exception class
(TRY003)
604-606: Avoid specifying long messages outside the exception class
(TRY003)
619-619: Avoid specifying long messages outside the exception class
(TRY003)
632-632: Avoid specifying long messages outside the exception class
(TRY003)
652-652: Do not catch blind exception: Exception
(BLE001)
653-653: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
653-653: Avoid specifying long messages outside the exception class
(TRY003)
705-705: Avoid specifying long messages outside the exception class
(TRY003)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
740-740: Avoid specifying long messages outside the exception class
(TRY003)
763-763: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
786-786: Avoid specifying long messages outside the exception class
(TRY003)
793-793: Avoid specifying long messages outside the exception class
(TRY003)
795-797: Avoid specifying long messages outside the exception class
(TRY003)
802-802: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
829-829: Avoid specifying long messages outside the exception class
(TRY003)
835-837: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Create project environment, run all checks (3.12)
🔇 Additional comments (1)
py_hamt/sharded_zarr_store.py (1)
839-846: Fix JSON decode bug: `json.loads` expects str, not bytes.

This path will raise `TypeError`. Decode the bytes first.

```diff
-        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
-        zarr_metadata_json = json.loads(old_zarr_metadata_bytes)
+        old_zarr_metadata_bytes = await self.cas.load(old_zarr_metadata_cid)
+        zarr_metadata_json = json.loads(old_zarr_metadata_bytes.decode("utf-8"))
```

Likely an incorrect or invalid review comment.
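For what it's worth, the premise can be checked directly: since Python 3.6, `json.loads` accepts `bytes`/`bytearray` (it auto-detects UTF-8/16/32), so the explicit decode is a style preference rather than a crash fix.

```python
import json

payload = b'{"zarr_format": 3, "node_type": "array"}'

# json.loads has accepted bytes/bytearray since Python 3.6 (it sniffs the
# UTF encoding), so decoding first is optional -- though more explicit.
from_bytes = json.loads(payload)
from_str = json.loads(payload.decode("utf-8"))
```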
```python
    async def __contains__(self, shard_idx: int) -> bool:
        async with self._cache_lock:
            return shard_idx in self._cache
```
🛠️ Refactor suggestion
async `__contains__` is invalid Python; replace it with an explicit async method (deadlock/logic risk).

Magic methods cannot be async; using `in` on the cache would return a coroutine, not a bool. Provide an explicit async API instead.
```diff
-    async def __contains__(self, shard_idx: int) -> bool:
-        async with self._cache_lock:
-            return shard_idx in self._cache
+    async def contains(self, shard_idx: int) -> bool:
+        async with self._cache_lock:
+            return shard_idx in self._cache
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    async def contains(self, shard_idx: int) -> bool:
        async with self._cache_lock:
            return shard_idx in self._cache
```
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 122-125, remove the async magic
method definition (async def __contains__) because dunder methods cannot be
async; instead implement an explicit async API like async def
contains_shard(self, shard_idx: int) -> bool that acquires self._cache_lock and
returns shard_idx in self._cache, and replace or keep a synchronous
__contains__(self, shard_idx: int) -> bool that performs a non-blocking
membership check (return shard_idx in self._cache) if you still want Python's
"in" to work (noting it will be racy); update all call sites to use await
contains_shard(...) where a locked, async check is required.
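A minimal reproduction of the trap (class names here are illustrative): Python's `in` operator calls `__contains__` synchronously and coerces the returned coroutine to `True`, even for missing keys.

```python
import asyncio
import warnings

warnings.simplefilter("ignore", RuntimeWarning)  # silence "never awaited"

class BadCache:
    def __init__(self) -> None:
        self._cache = {1: "shard"}

    # Python invokes __contains__ synchronously, so this returns a
    # coroutine object -- which `in` then coerces to True.
    async def __contains__(self, shard_idx: int) -> bool:
        return shard_idx in self._cache

class GoodCache:
    def __init__(self) -> None:
        self._cache = {1: "shard"}
        self._lock = asyncio.Lock()

    async def contains(self, shard_idx: int) -> bool:
        async with self._lock:
            return shard_idx in self._cache

async def main() -> tuple[bool, bool]:
    wrong = 999 in BadCache()  # True, despite the missing key!
    right = await GoodCache().contains(999)  # False, as expected
    return wrong, right

wrong, right = asyncio.run(main())
```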
```python
    async def _fetch_and_cache_full_shard(
        self,
        shard_idx: int,
        shard_cid: str,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        """
        Fetch a shard from CAS and cache it, with retry logic for transient errors.

        Args:
            shard_idx: The index of the shard to fetch.
            shard_cid: The CID of the shard.
            max_retries: Maximum number of retry attempts for transient errors.
            retry_delay: Delay between retry attempts in seconds.
        """
        for attempt in range(max_retries):
            try:
                shard_data_bytes = await self.cas.load(shard_cid)
                decoded_shard = dag_cbor.decode(shard_data_bytes)
                if not isinstance(decoded_shard, list):
                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
                await self._shard_data_cache.put(shard_idx, decoded_shard)
                # Always set the Event to unblock waiting coroutines
                if shard_idx in self._pending_shard_loads:
                    self._pending_shard_loads[shard_idx].set()
                    del self._pending_shard_loads[shard_idx]
                return  # Success
            except (ConnectionError, TimeoutError) as e:
                # Handle transient errors (e.g., network issues)
                if attempt < max_retries - 1:
                    await asyncio.sleep(
                        retry_delay * (2**attempt)
                    )  # Exponential backoff
                    continue
                else:
                    raise RuntimeError(
                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
                    )
```
🛠️ Refactor suggestion
Ensure pending waiters are always released; accept IPLDKind and handle decode errors.
Without signaling on failure, other tasks can hang for 60s. Also, avoid forcing CIDs to str and catch CBOR decode errors explicitly.
```diff
     async def _fetch_and_cache_full_shard(
         self,
         shard_idx: int,
-        shard_cid: str,
+        shard_cid: CID | bytes | str,
         max_retries: int = 3,
         retry_delay: float = 1.0,
     ) -> None:
@@
-        for attempt in range(max_retries):
-            try:
-                shard_data_bytes = await self.cas.load(shard_cid)
-                decoded_shard = dag_cbor.decode(shard_data_bytes)
-                if not isinstance(decoded_shard, list):
-                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
-                await self._shard_data_cache.put(shard_idx, decoded_shard)
-                # Always set the Event to unblock waiting coroutines
-                if shard_idx in self._pending_shard_loads:
-                    self._pending_shard_loads[shard_idx].set()
-                    del self._pending_shard_loads[shard_idx]
-                return  # Success
-            except (ConnectionError, TimeoutError) as e:
+        for attempt in range(max_retries):
+            try:
+                shard_data_bytes = await self.cas.load(shard_cid)
+                decoded_shard = dag_cbor.decode(shard_data_bytes)
+                if not isinstance(decoded_shard, list):
+                    raise TypeError(f"Shard {shard_idx} did not decode to a list.")
+                await self._shard_data_cache.put(shard_idx, decoded_shard)
+                return  # Success
+            except dag_cbor.CBORDecodeError as e:
+                # Non-transient: don't retry CBOR decode failures
+                raise ValueError(f"Failed to decode shard {shard_idx}.") from e
+            except (ConnectionError, TimeoutError) as e:
                 # Handle transient errors (e.g., network issues)
                 if attempt < max_retries - 1:
                     await asyncio.sleep(
                         retry_delay * (2**attempt)
                     )  # Exponential backoff
                     continue
                 else:
-                    raise RuntimeError(
-                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts: {e}"
-                    )
+                    raise RuntimeError(
+                        f"Failed to fetch shard {shard_idx} after {max_retries} attempts."
+                    ) from e
+            finally:
+                # Always release waiters if present
+                if shard_idx in self._pending_shard_loads:
+                    self._pending_shard_loads[shard_idx].set()
+                    self._pending_shard_loads.pop(shard_idx, None)
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.12.2)
320-320: Avoid specifying long messages outside the exception class
(TRY003)
326-326: Consider moving this statement to an else block
(TRY300)
335-337: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
335-337: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 299-337, ensure pending waiters
are always released and decoding errors are handled explicitly: keep shard_cid
as-is (do not force to str), catch dag_cbor.DecodingError (or the library's
specific CBOR decode exception) in addition to TypeError, and treat valid
decoded shard types as list/IPLD list-kind rather than only python list; on any
failure (decode error, non-list kind, or exhausted retries) set and remove the
Event in self._pending_shard_loads[shard_idx] so waiting coroutines are
unblocked, and re-raise a clear RuntimeError after cleaning up. Also use
exponential backoff as already implemented for transient
ConnectionError/TimeoutError.
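The release-waiters-in-`finally` idea can be isolated into a small helper; the `load` callable, the key type, and the pending-events dict below are assumptions for the sketch, not the store's real signatures.

```python
import asyncio

async def fetch_with_backoff(
    load, key, pending_events, max_retries: int = 3, retry_delay: float = 0.01
):
    """Retry transient errors with exponential backoff; always release waiters."""
    try:
        for attempt in range(max_retries):
            try:
                return await load(key)
            except (ConnectionError, TimeoutError) as exc:
                if attempt == max_retries - 1:
                    raise RuntimeError(
                        f"failed to fetch {key!r} after {max_retries} attempts"
                    ) from exc
                await asyncio.sleep(retry_delay * (2**attempt))
    finally:
        # Runs on success *and* failure, so no waiter can hang forever.
        event = pending_events.pop(key, None)
        if event is not None:
            event.set()
```

Because the cleanup lives in `finally` at function scope (not inside the retry loop), waiters are released exactly once, whether the fetch succeeded or exhausted its retries.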
```python
        chunk_cid_str = str(chunk_cid_obj)

        req_offset = None
        req_length = None
        req_suffix = None

        if byte_range:
            if isinstance(byte_range, RangeByteRequest):
                req_offset = byte_range.start
                if byte_range.end is not None:
                    if byte_range.start > byte_range.end:
                        raise ValueError(
                            f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                        )
                    req_length = byte_range.end - byte_range.start
            elif isinstance(byte_range, OffsetByteRequest):
                req_offset = byte_range.offset
            elif isinstance(byte_range, SuffixByteRequest):
                req_suffix = byte_range.suffix
        data = await self.cas.load(
            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
        )
        return prototype.buffer.from_bytes(data)
```
Zero-length Range must return empty buffer; also avoid str() on chunk CID.
Current logic computes length 0, yielding an invalid HTTP Range (end = start - 1). Return empty bytes instead and pass CID directly.
```diff
-        chunk_cid_str = str(chunk_cid_obj)
-
         req_offset = None
         req_length = None
         req_suffix = None
         if byte_range:
             if isinstance(byte_range, RangeByteRequest):
                 req_offset = byte_range.start
                 if byte_range.end is not None:
                     if byte_range.start > byte_range.end:
                         raise ValueError(
                             f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                         )
-                    req_length = byte_range.end - byte_range.start
+                    if byte_range.end == byte_range.start:
+                        return prototype.buffer.from_bytes(b"")
+                    req_length = byte_range.end - byte_range.start
             elif isinstance(byte_range, OffsetByteRequest):
                 req_offset = byte_range.offset
             elif isinstance(byte_range, SuffixByteRequest):
                 req_suffix = byte_range.suffix
-        data = await self.cas.load(
-            chunk_cid_str, offset=req_offset, length=req_length, suffix=req_suffix
-        )
+        data = await self.cas.load(
+            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
+        )
         return prototype.buffer.from_bytes(data)
```

📝 Committable suggestion
```python
        req_offset = None
        req_length = None
        req_suffix = None
        if byte_range:
            if isinstance(byte_range, RangeByteRequest):
                req_offset = byte_range.start
                if byte_range.end is not None:
                    if byte_range.start > byte_range.end:
                        raise ValueError(
                            f"Byte range start ({byte_range.start}) cannot be greater than end ({byte_range.end})"
                        )
                    if byte_range.end == byte_range.start:
                        return prototype.buffer.from_bytes(b"")
                    req_length = byte_range.end - byte_range.start
            elif isinstance(byte_range, OffsetByteRequest):
                req_offset = byte_range.offset
            elif isinstance(byte_range, SuffixByteRequest):
                req_suffix = byte_range.suffix
        data = await self.cas.load(
            chunk_cid_obj, offset=req_offset, length=req_length, suffix=req_suffix
        )
        return prototype.buffer.from_bytes(data)
```
🧰 Tools
🪛 Ruff (0.12.2)
604-606: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In py_hamt/sharded_zarr_store.py around lines 593 to 615, the code currently
converts the chunk CID to string and calls cas.load even for zero-length ranges
which produces an invalid HTTP Range; instead, do not call str() on
chunk_cid_obj — pass the CID object directly to cas.load — and detect a
zero-length RangeByteRequest (end == start) returning
prototype.buffer.from_bytes(b"") immediately without calling cas.load; only set
and pass req_length when length > 0 and skip cas.load entirely for the
zero-length case.

Summary by CodeRabbit
New Features
Enhancements
Tests
Documentation
Chores