Merged
85 commits
f62e6c1
feat: support partials
TheGreatAlgo May 30, 2025
0016b03
Merge branch 'refactor' into get-partials
TheGreatAlgo May 30, 2025
226967f
fix: re-add accidentally deleted _map_byte_request
TheGreatAlgo May 30, 2025
89e8124
fix: tidying up
TheGreatAlgo Jun 2, 2025
6e56031
fix: full coverage
TheGreatAlgo Jun 2, 2025
c6bdf5c
Merge branch 'refactor' into get-partials
TheGreatAlgo Jun 2, 2025
90ec3d0
Merge branch 'refactor' into get-partials
TheGreatAlgo Jun 3, 2025
46b2f99
fix: re-order
TheGreatAlgo Jun 3, 2025
0fb9be9
fix: update ruff and mypy
TheGreatAlgo Jun 3, 2025
f7f169d
fix: pre-commit
TheGreatAlgo Jun 3, 2025
3b94888
Merge branch 'refactor' into get-partials
TheGreatAlgo Jun 3, 2025
e17003a
deps: update kubo to latest in tests
Faolain Jun 4, 2025
72fdcfa
test: add test for ipfs gateway partials
Faolain Jun 4, 2025
c92f5ec
fix: sharding
TheGreatAlgo Jun 5, 2025
4476272
fix: converter
TheGreatAlgo Jun 6, 2025
c981437
fix: more work on sharding
TheGreatAlgo Jun 9, 2025
8476806
Merge branch 'main' into sharding
TheGreatAlgo Jun 10, 2025
50fd6e6
Merge branch 'main' into sharding
TheGreatAlgo Jun 10, 2025
a87eb19
fix: pinning
TheGreatAlgo Jun 12, 2025
21e15c3
fix: pre-commit
TheGreatAlgo Jun 12, 2025
247bd69
fix: tidying
TheGreatAlgo Jun 12, 2025
7ae0561
fix: target
TheGreatAlgo Jun 12, 2025
3ef4aee
fix: fixing types
TheGreatAlgo Jun 12, 2025
b0268ad
fix: more type cleanups
TheGreatAlgo Jun 12, 2025
092a538
fix: remove unused
TheGreatAlgo Jun 12, 2025
dec826e
fix: change imports
TheGreatAlgo Jun 12, 2025
cac1ad0
Merge branch 'main' into sharding
TheGreatAlgo Jun 16, 2025
3834667
fix: fix ruff
TheGreatAlgo Jun 16, 2025
87e9085
fix: remove aiohttp
TheGreatAlgo Jun 16, 2025
5eeaedf
fix: add tuple
TheGreatAlgo Jun 16, 2025
89481f8
Merge branch 'main' into sharding
TheGreatAlgo Jun 17, 2025
4991475
fix: ruff and version
TheGreatAlgo Jun 17, 2025
573845b
fix: remove aiohttp
TheGreatAlgo Jun 17, 2025
255e10e
fix: remove -s
TheGreatAlgo Jun 17, 2025
888a139
fix: more changes
TheGreatAlgo Jun 19, 2025
1288f55
fix: test era5
TheGreatAlgo Jun 26, 2025
4c92d2c
Update test_sharded_zarr_store.py
TheGreatAlgo Jun 27, 2025
309ea4a
fix: logging
TheGreatAlgo Jun 27, 2025
be19965
fix: async pinning, chunker
TheGreatAlgo Jun 27, 2025
79796a6
fix: dag cbor
TheGreatAlgo Jul 2, 2025
999a85a
fix: remove print
TheGreatAlgo Jul 2, 2025
6f79dbb
fix: sharding print and default
TheGreatAlgo Jul 3, 2025
78d7621
fix: fix race condition
TheGreatAlgo Jul 7, 2025
efa3429
fix: revert metadata logic
TheGreatAlgo Jul 7, 2025
2f74b56
fix: debug key
TheGreatAlgo Jul 7, 2025
6fc813b
fix: more debug
TheGreatAlgo Jul 7, 2025
36e1ad0
fix: more debug
TheGreatAlgo Jul 7, 2025
a57b434
fix: array shape
TheGreatAlgo Jul 7, 2025
a67e661
fix: tests and race condition on caches
TheGreatAlgo Jul 7, 2025
a60039b
fix: more locks
TheGreatAlgo Jul 7, 2025
2d18f4d
fix: more tests
TheGreatAlgo Jul 7, 2025
fb4fc37
fix: remove comment
TheGreatAlgo Jul 7, 2025
f5ab517
Merge branch 'main' into sharding
TheGreatAlgo Jul 7, 2025
eded043
fix: tests
TheGreatAlgo Jul 7, 2025
cfda0c7
fix: update test
TheGreatAlgo Jul 7, 2025
be0048b
fix: add concat test
TheGreatAlgo Jul 8, 2025
d1ebf1f
fix: helpful test for local testing
TheGreatAlgo Jul 8, 2025
a226a10
fix: kubo store httpx coverage
TheGreatAlgo Jul 10, 2025
95982d2
fix: full coverage
TheGreatAlgo Jul 11, 2025
5c2580a
fix: fix mypy
TheGreatAlgo Jul 11, 2025
2fcff2b
ci: update ipfs from 0.35 to 0.36
Faolain Jul 15, 2025
47c04c5
ci: revert back to 0.35 from 0.36 to find out why Error: IPFS API ser…
Faolain Jul 15, 2025
a9a36fb
fix: change integer math
TheGreatAlgo Jul 16, 2025
07bae20
fix: print debug
TheGreatAlgo Jul 16, 2025
fcf0119
fix: remove debug
TheGreatAlgo Jul 16, 2025
1ef78a8
fix: reformat
TheGreatAlgo Jul 16, 2025
63e59a8
fix: print key
TheGreatAlgo Jul 24, 2025
d2fca81
fix: update tests
TheGreatAlgo Jul 25, 2025
7b20b2a
fix: update formatting
TheGreatAlgo Jul 28, 2025
4acd195
fix: update cids
TheGreatAlgo Jul 28, 2025
5b1bad8
fix: more changes
TheGreatAlgo Jul 28, 2025
311fd17
Merge branch 'main' into sharding
TheGreatAlgo Aug 21, 2025
bfb4a41
fix: remove duplicate
TheGreatAlgo Aug 21, 2025
7d826fa
fix: with read only
TheGreatAlgo Aug 21, 2025
559a7ff
Update py_hamt/store_httpx.py
0xSwego Aug 25, 2025
e537389
fix: fix casing
TheGreatAlgo Aug 25, 2025
7a39161
Merge branch 'main' into sharding
TheGreatAlgo Sep 2, 2025
efb3c64
fix: linting
TheGreatAlgo Sep 2, 2025
41143f9
lint(ruff): add stricter rule to __init__
Faolain Sep 10, 2025
e55a3e3
fix: lru cache
TheGreatAlgo Sep 10, 2025
a3f60f2
fix: zarr coverage
TheGreatAlgo Sep 10, 2025
ac17139
fix: final tests
TheGreatAlgo Sep 10, 2025
6cac0f5
fix: small changes
TheGreatAlgo Sep 10, 2025
7529c2f
fix: small updates
TheGreatAlgo Sep 15, 2025
78cf72d
fix: remove duplicate
TheGreatAlgo Sep 15, 2025
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -16,7 +16,7 @@ repos:
       - id: mixed-line-ending
       - id: trailing-whitespace
   - repo: https://github.com/charliermarsh/ruff-pre-commit
-    rev: v0.11.11
+    rev: v0.12.12
     hooks:
       - id: ruff-check
       - id: ruff-format
87 changes: 87 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,87 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Common Development Commands

Setup environment:
```bash
uv sync
source .venv/bin/activate
pre-commit install
```

Run all checks (tests, linting, formatting, type checking):
```bash
bash run-checks.sh
```

Run tests:
```bash
# All tests (requires IPFS daemon or Docker)
pytest --ipfs --cov=py_hamt tests/

# Quick tests without IPFS integration
pytest --cov=py_hamt tests/

# Single test file
pytest tests/test_hamt.py

# Coverage report
uv run coverage report --fail-under=100 --show-missing
```

Linting and formatting:
```bash
# Run all pre-commit hooks
uv run pre-commit run --all-files --show-diff-on-failure

# Fix auto-fixable ruff issues
uv run ruff check --fix
```

Type checking and other tools:
```bash
# Type checking is handled by pre-commit hooks (mypy)
# Documentation preview
uv run pdoc py_hamt
```

## Architecture Overview

py-hamt implements a Hash Array Mapped Trie (HAMT) for IPFS/IPLD content-addressed storage. The core architecture follows this pattern:

1. **ContentAddressedStore (CAS)** - Abstract storage layer (store.py)
   - `KuboCAS` - IPFS/Kubo implementation for production
   - `InMemoryCAS` - In-memory implementation for testing

2. **HAMT** - Core data structure (hamt.py)
   - Uses blake3 hashing by default
   - Implements content-addressed trie for efficient key-value storage
   - Supports async operations for large datasets

3. **ZarrHAMTStore** - Zarr integration (zarr_hamt_store.py)
   - Implements zarr.abc.store.Store interface
   - Enables storing large Zarr arrays on IPFS via HAMT
   - Keys stored verbatim, values as raw bytes

4. **Encryption Layer** - Optional encryption (encryption_hamt_store.py)
   - `SimpleEncryptedZarrHAMTStore` for fully encrypted storage
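
As a rough illustration of the storage layer in item 1, the sketch below shows a toy content-addressed store with an abstract base class and an in-memory implementation. The names `ToyCAS` and `ToyInMemoryCAS` are hypothetical, and SHA-256 hex digests stand in for real CIDs; the actual `ContentAddressedStore` interface may differ.

```python
import asyncio
import hashlib
from abc import ABC, abstractmethod


class ToyCAS(ABC):
    """Hypothetical stand-in for an abstract content-addressed store."""

    @abstractmethod
    async def save(self, data: bytes) -> str:
        """Store data and return its content-derived identifier."""

    @abstractmethod
    async def load(self, ident: str) -> bytes:
        """Return the bytes previously stored under ident."""


class ToyInMemoryCAS(ToyCAS):
    """Dict-backed implementation, useful only for tests and demos."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    async def save(self, data: bytes) -> str:
        # The identifier is derived from the content itself (not a real CID).
        ident = hashlib.sha256(data).hexdigest()
        self._blocks[ident] = data
        return ident

    async def load(self, ident: str) -> bytes:
        return self._blocks[ident]


async def demo() -> tuple[str, str, bytes]:
    cas: ToyCAS = ToyInMemoryCAS()
    first = await cas.save(b"chunk")
    second = await cas.save(b"chunk")  # identical data, identical identifier
    return first, second, await cas.load(first)


first, second, loaded = asyncio.run(demo())
```

Because callers only depend on the abstract `save`/`load` pair, a network-backed store could be swapped in without changing the code that uses it.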

## Key Design Patterns

- All storage operations are async to handle IPFS network calls
- Content addressing means identical data gets same hash/CID
- HAMT provides O(log n) access time for large key sets
- Store abstractions allow swapping storage backends
- Type hints required throughout (mypy enforced)
- 100% test coverage required with hypothesis property-based testing
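
To make the O(log n) point more concrete, here is a minimal sketch of how a HAMT typically picks a child slot at each level by consuming a fixed number of bits from the key's hash. It assumes 8 bits per level (256 slots per node) and uses `hashlib.blake2b` from the standard library in place of blake3; `slot_at_depth` is a hypothetical helper, not py-hamt's API.

```python
import hashlib


def slot_at_depth(key: str, depth: int, bits_per_level: int = 8) -> int:
    """Return the child slot for `key` at trie level `depth` (0 is the root).

    Hypothetical helper: treats the hash as one big integer and reads
    `bits_per_level` bits per level, starting from the most significant bits.
    """
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=32).digest()
    total_bits = len(digest) * 8
    start = depth * bits_per_level
    if start + bits_per_level > total_bits:
        raise IndexError("hash exhausted: key is deeper than the hash allows")
    as_int = int.from_bytes(digest, "big")
    shift = total_bits - start - bits_per_level
    return (as_int >> shift) & ((1 << bits_per_level) - 1)


# Path of slots a lookup would follow for one key, three levels deep.
path = [slot_at_depth("temperature/c/0/0/0", depth) for depth in range(3)]
```

With 256 slots per node, each level narrows the key space by a factor of 256, so the expected depth grows logarithmically with the number of keys.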

## IPFS Integration Requirements

Tests require either:
- Local IPFS daemon running (`ipfs daemon`)
- Docker available for containerized IPFS
- Neither (unit tests only, integration tests skip)

The `--ipfs` pytest flag controls IPFS test execution.
2 changes: 1 addition & 1 deletion fsgs.py
@@ -16,7 +16,7 @@


 async def main():
-    cid = "bafyr4iecw3faqyvj75psutabk2jxpddpjdokdy5b26jdnjjzpkzbgb5xoq"
+    cid = "bafyr4ibiduv7ml3jeyl3gn6cjcrcizfqss7j64rywpbj3whr7tc6xipt3y"

     # Use KuboCAS as an async context manager
     async with KuboCAS() as kubo_cas:  # connects to a local kubo node
2 changes: 1 addition & 1 deletion public_gateway_example.py
@@ -53,7 +53,7 @@ async def fetch_zarr_from_gateway(cid: str, gateway: str = "https://ipfs.io"):

 async def main():
     # Example CID - this points to a weather dataset stored on IPFS
-    cid = "bafyr4iecw3faqyvj75psutabk2jxpddpjdokdy5b26jdnjjzpkzbgb5xoq"
+    cid = "bafyr4ibiduv7ml3jeyl3gn6cjcrcizfqss7j64rywpbj3whr7tc6xipt3y"

     # Try different public gateways
     gateways = [
7 changes: 5 additions & 2 deletions py_hamt/__init__.py
@@ -1,5 +1,7 @@
 from .encryption_hamt_store import SimpleEncryptedZarrHAMTStore
 from .hamt import HAMT, blake3_hashfn
+from .hamt_to_sharded_converter import convert_hamt_to_sharded, sharded_converter_cli
+from .sharded_zarr_store import ShardedZarrStore
 from .store_httpx import ContentAddressedStore, InMemoryCAS, KuboCAS
 from .zarr_hamt_store import ZarrHAMTStore

@@ -11,6 +13,7 @@
     "KuboCAS",
     "ZarrHAMTStore",
     "SimpleEncryptedZarrHAMTStore",
+    "ShardedZarrStore",
+    "convert_hamt_to_sharded",
+    "sharded_converter_cli",
 ]
-
-print("Running py-hamt from source!")
13 changes: 11 additions & 2 deletions py_hamt/hamt.py
@@ -8,6 +8,7 @@
     Callable,
     Dict,
     Iterator,
+    Optional,
     cast,
 )

@@ -589,10 +590,18 @@ async def delete(self, key: str) -> None:
         # If we didn't make a change, then this key must not exist within the HAMT
         raise KeyError

-    async def get(self, key: str) -> IPLDKind:
+    async def get(
+        self,
+        key: str,
+        offset: Optional[int] = None,
+        length: Optional[int] = None,
+        suffix: Optional[int] = None,
+    ) -> IPLDKind:
         """Get a value."""
         pointer: IPLDKind = await self.get_pointer(key)
-        data: bytes = await self.cas.load(pointer)
+        data: bytes = await self.cas.load(
+            pointer, offset=offset, length=length, suffix=suffix
+        )
         if self.values_are_bytes:
             return data
         else:
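
The `offset`, `length`, and `suffix` parameters added to `HAMT.get` are passed straight through to `cas.load`. As a rough sketch of what such a partial read could mean, the helpers below slice an in-memory value and build the equivalent HTTP `Range` header value. Both function names are hypothetical, and the interpretation (a suffix means the last N bytes, an offset without a length means read to the end) is an assumption based on common byte-range conventions, not on the `KuboCAS` code, which is not shown in this hunk.

```python
from typing import Optional


def slice_value(
    data: bytes,
    offset: Optional[int] = None,
    length: Optional[int] = None,
    suffix: Optional[int] = None,
) -> bytes:
    """Apply an assumed offset/length/suffix byte request to in-memory bytes."""
    if suffix is not None:
        # Last `suffix` bytes; a zero-length suffix yields nothing.
        return data[-suffix:] if suffix > 0 else b""
    start = offset or 0
    if length is None:
        return data[start:]
    return data[start : start + length]


def range_header(
    offset: Optional[int] = None,
    length: Optional[int] = None,
    suffix: Optional[int] = None,
) -> Optional[str]:
    """Build an HTTP Range header value, or None for a full read.

    Assumes `length`, when given, is positive.
    """
    if suffix is not None:
        return f"bytes=-{suffix}"
    if offset is None and length is None:
        return None
    start = offset or 0
    if length is None:
        return f"bytes={start}-"
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{start + length - 1}"
```

For example, under these assumptions `offset=2, length=3` selects bytes 2 through 4, which corresponds to the header value `bytes=2-4`.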
130 changes: 130 additions & 0 deletions py_hamt/hamt_to_sharded_converter.py
@@ -0,0 +1,130 @@
import argparse
import asyncio
import time

import xarray as xr
from multiformats import CID

from .hamt import HAMT
from .sharded_zarr_store import ShardedZarrStore
from .store_httpx import KuboCAS
from .zarr_hamt_store import ZarrHAMTStore


async def convert_hamt_to_sharded(
    cas: KuboCAS, hamt_root_cid: str, chunks_per_shard: int
) -> str:
    """
    Converts a Zarr dataset from a HAMT-based store to a ShardedZarrStore.
    Args:
        cas: An initialized ContentAddressedStore instance (KuboCAS).
        hamt_root_cid: The root CID of the source ZarrHAMTStore.
        chunks_per_shard: The number of chunks to group into a single shard in the new store.
    Returns:
        The root CID of the newly created ShardedZarrStore.
    """
    print(f"--- Starting Conversion from HAMT Root {hamt_root_cid} ---")
    start_time = time.perf_counter()
    # 1. Open the source HAMT store for reading
    print("Opening source HAMT store...")
    hamt_ro = await HAMT.build(
        cas=cas, root_node_id=hamt_root_cid, values_are_bytes=True, read_only=True
    )
    source_store = ZarrHAMTStore(hamt_ro, read_only=True)
    source_dataset = xr.open_zarr(store=source_store, consolidated=True)
    # 2. Introspect the source array to get its configuration
    print("Reading metadata from source store...")

    # Read the store's metadata to get array shape and chunk shape
    data_var_name = next(iter(source_dataset.data_vars))
    ordered_dims = list(source_dataset[data_var_name].dims)
    array_shape_tuple = tuple(source_dataset.sizes[dim] for dim in ordered_dims)
    chunk_shape_tuple = tuple(source_dataset.chunks[dim][0] for dim in ordered_dims)
    array_shape = array_shape_tuple
    chunk_shape = chunk_shape_tuple

    # 3. Create the destination ShardedZarrStore for writing
    print(
        f"Initializing new ShardedZarrStore with {chunks_per_shard} chunks per shard..."
    )
    dest_store = await ShardedZarrStore.open(
        cas=cas,
        read_only=False,
        array_shape=array_shape,
        chunk_shape=chunk_shape,
        chunks_per_shard=chunks_per_shard,
    )

    print("Destination store initialized.")

    # 4. Iterate and copy all data from source to destination
    print("Starting data migration...")
    count = 0
    async for key in hamt_ro.keys():
        count += 1
        # Read the raw data (metadata or chunk) from the source
        cid: CID = await hamt_ro.get_pointer(key)
        cid_base32_str = str(cid.encode("base32"))

        # Write the exact same key-value pair to the destination.
        await dest_store.set_pointer(key, cid_base32_str)
        if count % 200 == 0:  # pragma: no cover
            print(f"Migrated {count} keys...")  # pragma: no cover

    print(f"Migration of {count} total keys complete.")

    # 5. Finalize the new store by flushing it to the CAS
    print("Flushing new store to get final root CID...")
    new_root_cid = await dest_store.flush()
    end_time = time.perf_counter()

    print("\n--- Conversion Complete! ---")
    print(f"Total time: {end_time - start_time:.2f} seconds")
    print(f"New ShardedZarrStore Root CID: {new_root_cid}")
    return new_root_cid


async def sharded_converter_cli():
    parser = argparse.ArgumentParser(
        description="Convert a Zarr HAMT store to a Sharded Zarr store."
    )
    parser.add_argument(
        "hamt_cid", type=str, help="The root CID of the source Zarr HAMT store."
    )
    parser.add_argument(
        "--chunks-per-shard",
        type=int,
        default=6250,
        help="Number of chunk CIDs to store per shard in the new store.",
    )
    parser.add_argument(
        "--rpc-url",
        type=str,
        default="http://127.0.0.1:5001",
        help="The URL of the IPFS Kubo RPC API.",
    )
    parser.add_argument(
        "--gateway-url",
        type=str,
        default="http://127.0.0.1:8080",
        help="The URL of the IPFS Gateway.",
    )
    args = parser.parse_args()
    # Initialize the KuboCAS client with the provided RPC and Gateway URLs
    async with KuboCAS(
        rpc_base_url=args.rpc_url, gateway_base_url=args.gateway_url
    ) as cas_client:
        try:
            await convert_hamt_to_sharded(
                cas=cas_client,
                hamt_root_cid=args.hamt_cid,
                chunks_per_shard=args.chunks_per_shard,
            )
        except Exception as e:
            print(f"\nAn error occurred: {e}")
Comment on lines +125 to +126
coderabbitai (bot), Contributor, Sep 10, 2025:

🛠️ Refactor suggestion

Avoid catching blind exception.

The broad Exception catch could hide important errors and make debugging difficult.

Catch specific exceptions that are expected during conversion:

-        except Exception as e:
-            print(f"\nAn error occurred: {e}")
+        except (ValueError, KeyError, RuntimeError) as e:
+            print(f"\nAn error occurred: {e}")
+            import sys
+            sys.exit(1)
🧰 Tools
🪛 Ruff (0.12.2)

125-125: Do not catch blind exception: Exception

(BLE001)

Reply (Contributor), with an attached screenshot ("Screenshot 2025-09-15 at 1 37 31 AM"):

maybe something to consider here @TheGreatAlgo ?
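
One way to act on the suggestion, sketched here under assumptions: catch only the exception types the conversion is expected to raise, report them on stderr, and return a non-zero exit code. `run_conversion` is a hypothetical stand-in for `convert_hamt_to_sharded`, and the exception tuple mirrors the one proposed in the review, not a verified list of what the converter actually raises.

```python
import asyncio
import sys


async def run_conversion(hamt_cid: str) -> str:
    """Hypothetical stand-in for the real conversion coroutine."""
    if not hamt_cid:
        raise ValueError("hamt_cid must not be empty")
    return f"converted-{hamt_cid}"


async def cli(hamt_cid: str) -> int:
    """Return a process exit code instead of swallowing every exception."""
    try:
        new_root = await run_conversion(hamt_cid)
    except (ValueError, KeyError, RuntimeError) as e:
        # Expected failure modes: report and signal failure to the caller.
        print(f"An error occurred: {e}", file=sys.stderr)
        return 1
    print(f"New root: {new_root}")
    return 0


ok_code = asyncio.run(cli("bafy-example"))
fail_code = asyncio.run(cli(""))
```

A script entry point could then call `sys.exit(asyncio.run(cli(...)))`, while unexpected exception types propagate with a full traceback.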



if __name__ == "__main__":
    asyncio.run(sharded_converter_cli())  # pragma: no cover
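
For readers unfamiliar with the `chunks_per_shard` option, the sketch below shows one plausible way a chunk coordinate could be mapped to a shard. It assumes a row-major linearization of the chunk grid derived from `array_shape` and `chunk_shape`; `locate_chunk` is a hypothetical helper, and the real `ShardedZarrStore` layout may differ.

```python
import math


def locate_chunk(
    chunk_coords: tuple[int, ...],
    array_shape: tuple[int, ...],
    chunk_shape: tuple[int, ...],
    chunks_per_shard: int,
) -> tuple[int, int]:
    """Return (shard_index, index_within_shard) for a chunk coordinate."""
    # Number of chunks along each dimension, rounding up for partial edge chunks.
    grid = tuple(math.ceil(a / c) for a, c in zip(array_shape, chunk_shape))
    linear = 0
    for coord, size in zip(chunk_coords, grid):
        if not 0 <= coord < size:
            raise IndexError(f"chunk coordinate {coord} outside grid of size {size}")
        linear = linear * size + coord  # row-major (C-order) linearization
    return divmod(linear, chunks_per_shard)
```

Under this assumed layout, the CLI default of 6250 chunks per shard would place the first 6250 chunks (in row-major order) in shard 0.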