@Faolain Faolain commented Jun 12, 2025

Fix for adding public gateway support to py-hamt fix/codex-async-2 branch #71

⏺ Great! The example works with the public gateway.

  Here's a summary of the changes we made:

  1. Fixed URL construction in the KuboCAS class:
    - Properly handles base URLs with or without trailing slashes
    - Prevents duplicate "ipfs/" in URLs
    - Adds proper format parameter for public gateways
  2. Added IPLD content negotiation:
    - Added Accept headers to request IPLD content in raw form
    - Used ?format=dag-cbor for reliable IPLD content delivery
  3. Created comprehensive tests:
    - Added test_public_gateway.py to test various gateway configurations
    - Created a working example script for users
  4. Additional improvements:
    - Made KuboCAS more robust with URL path handling
    - Enhanced error handling for different gateway responses

  These changes make py-hamt work with both local and public gateways, providing more
  flexibility for users.
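
The URL handling described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the actual KuboCAS code: the helper name `build_gateway_url`, the `public` flag, and the exact `Accept` value are assumptions made for this example.

```python
# Hypothetical helper illustrating the URL rules listed above; the real
# KuboCAS implementation may differ.

# Assumed Accept header asking a gateway for raw IPLD block bytes.
IPLD_ACCEPT_HEADERS = {"Accept": "application/vnd.ipld.raw"}


def build_gateway_url(base_url: str, cid: str, public: bool = False) -> str:
    """Build a block URL from a gateway base URL and a CID string."""
    # Normalise away any trailing slash so both forms behave the same.
    base = base_url.rstrip("/")
    # Only append "ipfs" when the base does not already end with it,
    # which prevents a duplicate "ipfs/ipfs/" segment.
    if not base.endswith("/ipfs"):
        base = f"{base}/ipfs"
    url = f"{base}/{cid}"
    # Public gateways get an explicit format query parameter.
    if public:
        url = f"{url}?format=dag-cbor"
    return url


local_url = build_gateway_url("http://127.0.0.1:8080", "bafyexample")
public_url = build_gateway_url("https://ipfs.io/", "bafyexample", public=True)
```

The CID `bafyexample` is a placeholder, not a valid CID.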

After some edits:

Perfect! We now have 100% test coverage across the entire codebase, including the
  trailing slash path in the URL construction. The linter just made some formatting fixes,
  but all the tests pass successfully.

  To summarize what we accomplished:

  1. Added a specific test case test_trailing_slash_gateway to cover the missing branch
  where a gateway URL has a trailing slash
  2. Verified that this test executes the previously uncovered code path
  3. Achieved 100% test coverage for the entire codebase
  4. Ensured that py-hamt works correctly with both local and public gateways, with or
  without trailing slashes in URLs

  These changes make the library more robust and reliable when working with different
  gateway URL formats.

Proof of use is also available in dClimate/zarr-benchmarking#2, in the /zarr-benchmarking/zarr_v3-hamt_v5-httpx folder.

It could be a good idea to memory-profile this httpx version against the aiohttp version while each makes a large number of requests, tracking memory usage throughout. That would show whether the aiohttp session handling truly has a memory leak compared with this approach.
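
One possible starting point for such a comparison is the standard-library `tracemalloc` module. The sketch below is only an assumption about how the measurement could be structured: `peak_memory_bytes` and `fake_request` are hypothetical names, and `fake_request` stands in for a real block load through either client. Note that `tracemalloc` only sees allocations made through Python's allocator, so native memory would need a different tool.

```python
import asyncio
import tracemalloc
from typing import Awaitable, Callable


async def peak_memory_bytes(
    workload: Callable[[], Awaitable[None]], repeats: int = 100
) -> int:
    """Run an async workload repeatedly and return peak traced memory."""
    tracemalloc.start()
    try:
        for _ in range(repeats):
            await workload()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


async def fake_request() -> None:
    # Stand-in for one network request: allocate a 64 KiB payload,
    # yield to the event loop, then release the payload.
    payload = bytes(64 * 1024)
    await asyncio.sleep(0)
    del payload


peak = asyncio.run(peak_memory_bytes(fake_request, repeats=50))
print(f"peak traced memory: {peak} bytes")
```

Running the same helper once with an aiohttp-based workload and once with an httpx-based one, with a large `repeats` value, would give two comparable peak figures.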

Summary by CodeRabbit

  • New Features

    • Introduced scripts for accessing and analyzing Zarr datasets from IPFS using both local and public gateways.
    • Added support for a new in-memory content-addressed store.
  • Bug Fixes

    • Improved compatibility with public IPFS gateways for loading IPLD content.
  • Refactor

    • Migrated all network operations from aiohttp to httpx for asynchronous HTTP requests.
    • Updated internal and test code to use httpx.AsyncClient instead of aiohttp.ClientSession.
  • Tests

    • Added and updated tests to ensure compatibility with httpx and public gateway behavior.
    • Enhanced tests to verify resource cleanup and error handling.
  • Chores

    • Updated dependencies to require httpx and removed aiohttp.
    • Incremented project version to 3.1.0.

coderabbitai bot commented Jun 12, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update replaces aiohttp with httpx as the asynchronous HTTP client library throughout the codebase, affecting both core functionality and tests. It adapts the KuboCAS class and related logic to use httpx.AsyncClient, updates dependencies, and introduces new scripts and tests for working with Zarr datasets via IPFS gateways, including public gateway support and sample data access.

Changes

| Files / Group | Change Summary |
|---|---|
| py_hamt/store_httpx.py | Migrated from aiohttp to httpx for all HTTP operations in KuboCAS; updated logic, docstrings, and resource cleanup. |
| py_hamt/__init__.py, py_hamt/hamt.py | Updated imports to use .store_httpx; added InMemoryCAS to exports; added debug print. |
| pyproject.toml | Switched dependency from aiohttp to httpx; bumped version to 3.1.0. |
| fsgs.py, public_gateway_example.py | Added new scripts for accessing Zarr datasets via IPFS using KuboCAS and public gateways. |
| tests/conftest.py, tests/test_kubo_cas.py, tests/test_kubocas_auth.py, tests/test_kubocas_session.py, tests/test_async.py | Updated all test fixtures and logic to use httpx.AsyncClient instead of aiohttp.ClientSession; adapted resource checks and cleanup. |
| tests/test_public_gateway.py | Added new async tests for public IPFS gateway access and DAG-CBOR content verification. |
| tests/test_branch_anchors.py, tests/test_hamt.py, tests/test_zarr_ipfs.py, tests/performance_tests.py | Consolidated and simplified imports to use py_hamt directly for InMemoryCAS and related classes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script (fsgs.py/public_gateway_example.py)
    participant KuboCAS (httpx)
    participant IPFS Gateway/Node
    participant HAMT/Zarr Store
    participant xarray

    User->>Script: Run script
    Script->>KuboCAS: Connect (async context)
    KuboCAS->>IPFS Gateway/Node: Fetch HAMT data (httpx.AsyncClient)
    KuboCAS-->>Script: Return HAMT structure
    Script->>HAMT/Zarr Store: Wrap HAMT as Zarr store
    Script->>xarray: Open dataset from Zarr store
    xarray-->>Script: Provide dataset
    Script->>User: Print variables, access data, etc.

Possibly related PRs

  • Zarr v3 #37: Adds a new IPFSZarr3 class for Zarr v3 store abstraction over HAMT, complementing this PR's focus on dataset access via KuboCAS.

Suggested reviewers

  • abidsikder

Poem

🐇
A hop from aiohttp to httpx we take,
Through gateways and nodes, new journeys we make.
With Zarr on IPFS, async flows anew,
Tests and imports tidied, dependencies too.
The rabbit now cheers, with code running fast—
May your datasets be open, and your bugs in the past!


@Faolain Faolain changed the base branch from main to fix/codex-async-2 June 12, 2025 06:09
codecov-commenter commented Jun 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (89b8510) to head (f9e5d3e).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #72   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          619       624    +5     
=========================================
+ Hits           619       624    +5     


@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 4

🔭 Outside diff range comments (2)
py_hamt/hamt.py (1)

39-44: ⚠️ Potential issue

Bug: int.from_bytes called without an explicit byteorder argument

Before Python 3.11, int.from_bytes(hash_bytes) raises TypeError at runtime because byteorder is required; from 3.11 onward it defaults to "big".

-    hash_as_int: int = int.from_bytes(hash_bytes)
+    # big-endian keeps behaviour consistent with multihash byte layout
+    hash_as_int: int = int.from_bytes(hash_bytes, "big")

On interpreters older than 3.11, every HAMT operation that calls extract_bits() will crash without this fix.
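
The effect of the byte order argument can be seen in a small two-byte example. The value of `hash_bytes` here is made up for illustration and is not a real hash.

```python
# Illustrative two-byte input, not a real multihash digest.
hash_bytes = bytes([0x01, 0x02])

# Big-endian: the first byte is the most significant one.
as_big = int.from_bytes(hash_bytes, "big")

# Little-endian: the first byte is the least significant one.
as_little = int.from_bytes(hash_bytes, "little")

print(hex(as_big), hex(as_little))
```

Passing the byte order explicitly keeps the result the same on every supported interpreter version.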

py_hamt/__init__.py (1)

16-17: ⚠️ Potential issue

Avoid printing on import

A library must stay silent when merely imported; the unconditional
print("Running py-hamt from source!") pollutes stdout for every consumer (tests, notebooks, daemons, etc.) and breaks code that captures/logs output strictly.

-print("Running py-hamt from source!")
+# Uncomment locally if you really need a debug hint, but do **not** ship this.
+# print("Running py-hamt from source!")
♻️ Duplicate comments (1)
tests/test_kubocas_auth.py (1)

69-71: Same issue – use the public accessor

As above, switch to client.auth to keep the test resilient to future httpx
changes.

-assert isinstance(client._auth, httpx.BasicAuth)
+assert isinstance(client.auth, httpx.BasicAuth)
🧹 Nitpick comments (13)
pyproject.toml (1)

3-9: Dependency update LGTM, but re-run lockfile

The switch to httpx[http2] >=0.27.0 and version bump to 3.1.0 make sense. Remember to regenerate any lockfiles (poetry.lock, requirements*.txt, etc.) so CI installs the correct set of transitive deps.

tests/conftest.py (1)

10-13: Rename fixture for clarity

The fixture is still called global_client_session, evoking the old aiohttp.ClientSession.
Renaming to something like global_httpx_client would prevent confusion.

-async def global_client_session():
+async def global_httpx_client():

(Be sure to update call-sites.)

tests/test_kubo_cas.py (1)

131-137: Redundant client scopes

Creating an additional httpx.AsyncClient() inside the test is fine, but it duplicates what global_client_session already provides.
If the intention is solely coverage, consider marking that explicitly in a comment to avoid future clean-ups removing it as “dead” code.

fsgs.py (1)

11-17: Top-level prints may clutter CLI tools

The diagnostic print() calls run every time the script is imported (e.g. by a test runner).
Either guard them with if __name__ == "__main__": or move into main().

-print(f"Python version: {sys.version}")
-print(f"Python path: {os.path.dirname(sys.executable)}")
-
-print(f"httpx version: {httpx.__version__}")
+if __name__ == "__main__":
+    print(f"Python version: {sys.version}")
+    print(f"Python path: {os.path.dirname(sys.executable)}")
+    print(f"httpx version: {httpx.__version__}")
tests/test_kubocas_auth.py (1)

48-52: Avoid poking into httpx internals

AsyncClient._auth is a private attribute and not part of httpx's public API.
Relying on it may break the tests when the library bumps versions. Prefer the
public client.auth property or equality checks on headers.

-assert isinstance(client._auth, httpx.BasicAuth)
+assert isinstance(client.auth, httpx.BasicAuth)
tests/test_async.py (2)

29-33: Unnecessary async wrapper adds complexity

_create() contains no await statements – it is effectively synchronous.
Dropping the async keyword simplifies the code and avoids the odd pattern
run_until_complete on a coroutine that performs no awaits.

-async def _create():
-    # Force creation of a client in this loop
-    sess = cas._loop_client()
-    other_session_container.append(sess)
-    other_loop_container.append(loop)
+def _create():
+    # Force creation of a client in this loop
+    sess = cas._loop_client()
+    other_session_container.append(sess)
+    other_loop_container.append(loop)

111-118: Calling __del__ directly bypasses GC semantics

Invoking cas.__del__() manually is unusual; the dunder is intended for the
garbage-collector. In most situations del cas; gc.collect() (as you already
do elsewhere) is sufficient and closer to real-world behaviour. Consider
removing the direct call unless you have a very specific code-coverage need.

public_gateway_example.py (2)

40-44: Synchronous xarray.open_zarr blocks the event loop

xarray.open_zarr is CPU / I/O heavy and synchronous. Running it inside an
async function will block the event loop for potentially seconds. Wrap the call
in a thread-pool to keep the loop responsive:

-        zarr_ds: Dataset = xr.open_zarr(store=zhs)
+        zarr_ds: Dataset = await asyncio.to_thread(xr.open_zarr, store=zhs)

Optionally close the dataset after use (zarr_ds.close()) to release resources.
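
The thread-pool pattern can be demonstrated without xarray. In this sketch, `blocking_open` is a hypothetical stand-in for a synchronous call such as `xr.open_zarr`, and the heartbeat task exists only to show that the event loop keeps running while the blocking call is in progress.

```python
import asyncio
import contextlib
import time


def blocking_open(delay: float) -> str:
    """Stand-in for a slow synchronous call (hypothetical)."""
    time.sleep(delay)
    return "dataset"


async def heartbeat(ticks: list, interval: float) -> None:
    # Records a timestamp on every loop iteration until cancelled.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> tuple:
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01))
    # The blocking call runs in a worker thread, so heartbeat keeps ticking.
    result = await asyncio.to_thread(blocking_open, 0.2)
    beat.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await beat
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Had `blocking_open` been called directly inside `main`, the heartbeat would not have run at all during the sleep.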


81-83: Broad except Exception masks useful diagnostics

Catching every Exception can hide bugs (e.g. MemoryError or subtle library
failures). Catch only the errors you expect failing gateways to raise, or at
least log the traceback to aid debugging.
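
A narrower handler might look like the sketch below. The function name `fetch_first_available` and the `fetch` callable are hypothetical; the built-in `ConnectionError` and `TimeoutError` stand in for whatever network errors the real client raises (with httpx, `httpx.HTTPError` would be a likely candidate).

```python
import logging
from typing import Callable, Iterable, Tuple

logger = logging.getLogger("gateway_example")


def fetch_first_available(
    gateways: Iterable[str], fetch: Callable[[str], bytes]
) -> Tuple[str, bytes]:
    """Try each gateway in order and return the first successful result."""
    for gateway in gateways:
        try:
            return gateway, fetch(gateway)
        except (ConnectionError, TimeoutError):
            # Only expected network failures are swallowed, and the
            # traceback is kept in the log for later debugging.
            logger.warning("gateway %s failed", gateway, exc_info=True)
    raise RuntimeError("no gateway responded")


def fake_fetch(gateway: str) -> bytes:
    # Hypothetical fetcher: one gateway is "down", the other works.
    if gateway == "https://down.example":
        raise ConnectionError("unreachable")
    return b"ok"


winner, body = fetch_first_available(
    ["https://down.example", "https://up.example"], fake_fetch
)
```

Any other exception type, such as a `ValueError` from a programming mistake, propagates to the caller instead of being silently skipped.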

tests/test_kubocas_session.py (1)

56-80: Potential IndexError if the worker thread errors

container[0] is accessed unconditionally. If the worker crashes before
appending, the test will raise an IndexError, masking the original failure.
Defensively check the container or re-raise captured exceptions.

ready.wait()
-if not container:
-    raise RuntimeError("Worker thread failed to produce a client")
-return container[0]
+assert container, "Worker thread failed – see logs above"
+return container[0]
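
Re-raising a captured exception, the other option mentioned above, could be sketched like this. The helper name `run_in_worker` and the `factory` argument are assumptions for the example and do not come from the test file.

```python
import threading
from typing import Any, Callable, List


def run_in_worker(factory: Callable[[], Any]) -> Any:
    """Run factory() in a worker thread and return its result.

    If the worker fails, the original exception is re-raised in the
    calling thread instead of surfacing as an IndexError.
    """
    container: List[Any] = []
    errors: List[BaseException] = []
    ready = threading.Event()

    def worker() -> None:
        try:
            container.append(factory())
        except Exception as exc:
            errors.append(exc)
        finally:
            # Always signal, so the caller never waits forever.
            ready.set()

    thread = threading.Thread(target=worker)
    thread.start()
    ready.wait()
    thread.join()
    if errors:
        raise errors[0]
    return container[0]


value = run_in_worker(lambda: 42)
```

With this shape, a failure inside the worker shows up in the test report with its real type and message.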
tests/test_public_gateway.py (1)

57-76: Public-network tests are prone to flakiness

Hitting real gateways introduces variability (DNS, rate-limits, outages).
Consider marking these tests @pytest.mark.network/@pytest.mark.flaky or
skipping when an environment variable isn’t set, so regular CI runs stay green.

py_hamt/store_httpx.py (2)

309-319: URL construction is brittle – use urllib.parse to avoid double slashes

Manual string concatenation can yield //ipfs/ or duplicate query strings. Prefer:

from urllib.parse import urljoin, urlencode

path = f"/ipfs/{cid}"
url = urljoin(base_url, path)
if is_public_gateway:
    url = f"{url}?{urlencode({'format': 'dag-cbor'})}"

Improves correctness and readability.
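
One detail worth checking with the urljoin approach: a path that starts with "/" replaces any path already present in the base URL, so a gateway mounted under a prefix would lose that prefix. The sketch below shows the behaviour and one possible variant that keeps the prefix. The helper `join_gateway_url`, the host `example.com`, and the CID `bafyexample` are placeholders invented for this example.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin

# An absolute path discards the "/prefix/" part of the base URL.
dropped_prefix = urljoin("https://example.com/prefix/", "/ipfs/bafyexample")


def join_gateway_url(base_url: str, cid: str, fmt: Optional[str] = None) -> str:
    """Join a base URL and CID while keeping any path prefix in the base."""
    # A trailing slash makes urljoin treat the last segment as a directory.
    base = base_url if base_url.endswith("/") else base_url + "/"
    # A relative path (no leading slash) is appended to the base path.
    url = urljoin(base, f"ipfs/{cid}")
    if fmt is not None:
        url = f"{url}?{urlencode({'format': fmt})}"
    return url


kept_prefix = join_gateway_url("https://example.com/prefix", "bafyexample")
with_format = join_gateway_url("https://ipfs.io/", "bafyexample", fmt="dag-cbor")
```

Whether prefixes need to be supported depends on which gateways the library targets, so this is a design choice rather than a required fix.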

🧰 Tools
🪛 GitHub Actions: Triggered on push from Faolain to branch/tag fix/httpx-public-gateway

[error] 316-316: Coverage failure: line 316 is missing coverage. Total coverage 99% is less than required 100%.


321-330: Hard-coded public-gateway list

Embedding the gateway list hampers extensibility. Expose a configurable set (param or class attr) so callers can add e.g. gateway.pinata.cloud without patching the library.
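
A configurable set could take roughly the following shape. The class name `GatewayConfig`, its attribute names, and the two default hosts are illustrative assumptions; the list actually hard-coded in the library is not shown in this review.

```python
from typing import Iterable
from urllib.parse import urlsplit


class GatewayConfig:
    """Hypothetical holder for the set of hosts treated as public gateways."""

    # Illustrative defaults only, not the library's real list.
    DEFAULT_PUBLIC_GATEWAYS = frozenset({"ipfs.io", "dweb.link"})

    def __init__(self, extra_public_gateways: Iterable[str] = ()) -> None:
        # Callers extend the defaults instead of patching the library.
        self.public_gateways = self.DEFAULT_PUBLIC_GATEWAYS | frozenset(
            extra_public_gateways
        )

    def is_public(self, base_url: str) -> bool:
        host = urlsplit(base_url).hostname or ""
        return host in self.public_gateways


default_config = GatewayConfig()
custom_config = GatewayConfig(extra_public_gateways=["gateway.pinata.cloud"])
```

Here `custom_config.is_public("https://gateway.pinata.cloud")` is true while the default configuration still treats that host as non-public.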

🛑 Comments failed to post (4)
tests/test_public_gateway.py (1)

12-25: 🛠️ Refactor suggestion

HTTP requests lack explicit timeout

Network-bound tests without a timeout can hang CI indefinitely if a gateway
stalls. Supply a per-request or client-level timeout (e.g. timeout=15.0) to
fail fast and surface flaky network issues.

-        client = httpx.AsyncClient(follow_redirects=True)
+        client = httpx.AsyncClient(follow_redirects=True, timeout=15.0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def verify_response_content(url: str, client=None):
    """Fetch and verify the response from a given URL"""
    should_close = False
    if client is None:
        client = httpx.AsyncClient(follow_redirects=True, timeout=15.0)
        should_close = True

    try:
        # Print request info
        print(f"Testing URL: {url}")

        # Fetch content
        response = await client.get(url)
        response.raise_for_status()
🤖 Prompt for AI Agents
In tests/test_public_gateway.py around lines 12 to 25, the HTTP requests made
with httpx.AsyncClient lack an explicit timeout, which can cause tests to hang
indefinitely if the network stalls. Add a timeout parameter, such as
timeout=15.0 seconds, either when creating the AsyncClient instance or directly
in the client.get call, to ensure requests fail fast and prevent hanging tests.
py_hamt/store_httpx.py (3)

259-277: ⚠️ Potential issue

__del__ may raise or silently swallow cleanup errors

Running asyncio.run() from __del__ can explode with
RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks / uvicorn, and create_task() schedules work that is never awaited.
A safer pattern is to rely solely on explicit await cas.aclose()/context-manager and make __del__ a no-op except for logging an explicit warning when resources are still open.

🤖 Prompt for AI Agents
In py_hamt/store_httpx.py lines 259 to 277, the __del__ method currently tries
to close resources asynchronously using asyncio.run or create_task, which can
cause RuntimeError or unawaited tasks. Modify __del__ to avoid running async
code; instead, make it a no-op except logging a warning if resources remain
open, and rely on explicit await calls or context managers for cleanup.
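
The suggested pattern can be sketched with a small stand-in class. `ClosableStore` is hypothetical and holds no real HTTP clients; it only shows a `__del__` that runs no async code and emits a `ResourceWarning` when the object was never closed.

```python
import asyncio
import warnings


class ClosableStore:
    """Hypothetical store with explicit async cleanup."""

    def __init__(self) -> None:
        self._closed = False

    async def aclose(self) -> None:
        # Real code would close its HTTP clients here.
        self._closed = True

    async def __aenter__(self) -> "ClosableStore":
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.aclose()

    def __del__(self) -> None:
        # No event-loop work here: only warn if cleanup was skipped.
        if not getattr(self, "_closed", True):
            warnings.warn(
                "ClosableStore was not closed; use 'async with' or "
                "'await store.aclose()'",
                ResourceWarning,
                source=self,
            )


async def use_store() -> bool:
    async with ClosableStore() as store:
        pass
    return store._closed


closed_after_use = asyncio.run(use_store())
```

Python ignores `ResourceWarning` by default, so the message appears only when warnings are enabled, for example under `python -X dev` or in a test run configured to show them.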

194-205: 🛠️ Refactor suggestion

Potential client‐cache leak across short-lived event loops

_client_per_loop never evicts entries, so creating many short-lived loops (e.g. in unit-test workers or threaded tasks) will accumulate open clients in the dict.
Consider an LRU or finaliser that removes entries once the referenced loop is closed to avoid an unbounded memory/FD leak.

🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 194 to 205, the _client_per_loop
dictionary stores clients keyed by event loops but never removes entries,
causing a potential memory and file descriptor leak when many short-lived loops
are created. To fix this, implement a mechanism to detect when an event loop is
closed and remove its associated client from the dictionary, such as using weak
references with finalizers or an LRU cache that evicts old entries, ensuring
clients tied to closed loops are properly cleaned up.
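
A simpler alternative to weak references or an LRU is to purge entries for closed loops on every lookup. The sketch below assumes a hypothetical `LoopClientCache` class and a generic `factory` callable in place of `httpx.AsyncClient`. Evicted clients are returned to the caller because a client bound to a closed loop can no longer be closed by awaiting on that loop, and how to dispose of it is left open here.

```python
import asyncio
from typing import Any, Callable, Dict, List


class LoopClientCache:
    """Hypothetical per-loop client cache that drops closed loops."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._client_per_loop: Dict[asyncio.AbstractEventLoop, Any] = {}

    def __len__(self) -> int:
        return len(self._client_per_loop)

    def evict_closed(self) -> List[Any]:
        """Remove and return clients whose event loop has been closed."""
        closed = [loop for loop in self._client_per_loop if loop.is_closed()]
        return [self._client_per_loop.pop(loop) for loop in closed]

    def get(self, loop: asyncio.AbstractEventLoop) -> Any:
        # Purge stale entries first so the dict cannot grow without bound.
        self.evict_closed()
        if loop not in self._client_per_loop:
            self._client_per_loop[loop] = self._factory()
        return self._client_per_loop[loop]


cache = LoopClientCache(factory=object)
first_loop = asyncio.new_event_loop()
second_loop = asyncio.new_event_loop()
first_client = cache.get(first_loop)
cache.get(second_loop)
first_loop.close()
cache.get(second_loop)  # Triggers eviction of first_loop's entry.
remaining = len(cache)
second_loop.close()
```

The cache still holds a strong reference to a closed loop until the next lookup, which is the trade-off for not depending on garbage-collection timing.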

284-292: 💡 Verification agent

❓ Verification inconclusive

Multipart body lacks filename – some Kubo versions reject it

files = {"file": data} sends the part without a filename; Kubo’s /add endpoint occasionally returns 400 for such payloads.
Safer:

-files = {"file": data}
+files = {"file": ("blob", data)}

so a proper Content-Disposition is emitted.


Ensure multipart upload includes a filename

  • File: py_hamt/store_httpx.py, lines 284–292
- files = {"file": data}
+ files = {"file": ("blob", data)}

Including the filename (“blob”) in the Content-Disposition header prevents Kubo’s /add endpoint from rejecting the request with a 400 error when no filename is provided.

📝 Committable suggestion

            # Create multipart form data
-           files = {"file": data}
+           files = {"file": ("blob", data)}

            # Send the POST request
            client = self._loop_client()
            response = await client.post(self.rpc_url, files=files)
            response.raise_for_status()
            cid_str: str = response.json()["Hash"]
🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 284 to 292, the multipart form data is
sent without specifying a filename, which can cause Kubo's /add endpoint to
reject the request with a 400 error. To fix this, modify the files dictionary to
include a tuple with the filename "blob" and the data, ensuring the
Content-Disposition header includes the filename. For example, set files to
{"file": ("blob", data)} before sending the POST request.

@TheGreatAlgo TheGreatAlgo changed the base branch from fix/codex-async-2 to main June 16, 2025 11:03
@Faolain Faolain merged commit 9b0dc09 into main Jun 16, 2025
3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Aug 21, 2025