-
Notifications
You must be signed in to change notification settings - Fork 0
Fix/httpx public gateway #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis update replaces the asynchronous HTTP client library from Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Script (fsgs.py/public_gateway_example.py)
participant KuboCAS (httpx)
participant IPFS Gateway/Node
participant HAMT/Zarr Store
participant xarray
User->>Script: Run script
Script->>KuboCAS: Connect (async context)
KuboCAS->>IPFS Gateway/Node: Fetch HAMT data (httpx.AsyncClient)
KuboCAS-->>Script: Return HAMT structure
Script->>HAMT/Zarr Store: Wrap HAMT as Zarr store
Script->>xarray: Open dataset from Zarr store
xarray-->>Script: Provide dataset
Script->>User: Print variables, access data, etc.
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches🧪 Generate Unit Tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #72 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 5 5
Lines 619 624 +5
=========================================
+ Hits 619 624 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 4
🔭 Outside diff range comments (2)
py_hamt/hamt.py (1)
39-44:⚠️ Potential issueBug:
int.from_bytesmissing requiredbyteorderargument
int.from_bytes(hash_bytes)raisesTypeErrorat runtime (byteorderis required).- hash_as_int: int = int.from_bytes(hash_bytes) + # big-endian keeps behaviour consistent with multihash byte layout + hash_as_int: int = int.from_bytes(hash_bytes, "big")Without this fix, every HAMT operation that calls
extract_bits()will crash.py_hamt/__init__.py (1)
16-17:⚠️ Potential issueAvoid printing on import
A library must stay silent when merely imported; the unconditional
print("Running py-hamt from source!")pollutes stdout for every consumer (tests, notebooks, daemons, etc.) and breaks code that captures/logs output strictly.-print("Running py-hamt from source!") +# Uncomment locally if you really need a debug hint, but do **not** ship this. +# print("Running py-hamt from source!")
♻️ Duplicate comments (1)
tests/test_kubocas_auth.py (1)
69-71: Same issue – use the public accessorAs above, switch to
client.authto keep the test resilient to future httpx
changes.-assert isinstance(client._auth, httpx.BasicAuth) +assert isinstance(client.auth, httpx.BasicAuth)
🧹 Nitpick comments (13)
pyproject.toml (1)
3-9: Dependency update LGTM, but re-run lockfileThe switch to
httpx[http2] >=0.27.0and version bump to 3.1.0 make sense. Remember to regenerate any lockfiles (poetry.lock,requirements*.txt, etc.) so CI installs the correct set of transitive deps.tests/conftest.py (1)
10-13: Rename fixture for clarityThe fixture is still called
global_client_session, evoking the oldaiohttp.ClientSession.
Renaming to something likeglobal_httpx_clientwould prevent confusion.-async def global_client_session(): +async def global_httpx_client():(Be sure to update call-sites.)
tests/test_kubo_cas.py (1)
131-137: Redundant client scopesCreating an additional
httpx.AsyncClient()inside the test is fine, but it duplicates whatglobal_client_sessionalready provides.
If the intention is solely coverage, consider marking that explicitly in a comment to avoid future clean-ups removing it as “dead” code.fsgs.py (1)
11-17: Top-level prints may clutter CLI toolsThe diagnostic
print()calls run every time the script is imported (e.g. by a test runner).
Either guard them withif __name__ == "__main__":or move intomain().-print(f"Python version: {sys.version}") -print(f"Python path: {os.path.dirname(sys.executable)}") -… -print(f"httpx version: {httpx.__version__}") +if __name__ == "__main__": + print(f"Python version: {sys.version}") + print(f"Python path: {os.path.dirname(sys.executable)}") + print(f"httpx version: {httpx.__version__}")tests/test_kubocas_auth.py (1)
48-52: Avoid poking into httpx internals
AsyncClient._authis a private attribute and not part of httpx’ public API.
Relying on it may break the tests when the library bumps versions. Prefer the
publicclient.authproperty or equality checks on headers.-assert isinstance(client._auth, httpx.BasicAuth) +assert isinstance(client.auth, httpx.BasicAuth)tests/test_async.py (2)
29-33: Unnecessary async wrapper adds complexity
_create()contains noawaitstatements – it is effectively synchronous.
Dropping theasynckeyword simplifies the code and avoids the odd pattern
run_until_completeon a coroutine that performs no awaits.-async def _create(): - # Force creation of a client in this loop - sess = cas._loop_client() - other_session_container.append(sess) - other_loop_container.append(loop) +def _create(): + # Force creation of a client in this loop + sess = cas._loop_client() + other_session_container.append(sess) + other_loop_container.append(loop)
111-118: Calling__del__directly bypasses GC semanticsInvoking
cas.__del__()manually is unusual; the dunder is intended for the
garbage-collector. In most situationsdel cas; gc.collect()(as you already
do elsewhere) is sufficient and closer to real-world behaviour. Consider
removing the direct call unless you have a very specific code-coverage need.public_gateway_example.py (2)
40-44: Synchronousxarray.open_zarrblocks the event loop
xarray.open_zarris CPU / I/O heavy and synchronous. Running it inside an
async function will block the event loop for potentially seconds. Wrap the call
in a thread-pool to keep the loop responsive:- zarr_ds: Dataset = xr.open_zarr(store=zhs) + zarr_ds: Dataset = await asyncio.to_thread(xr.open_zarr, store=zhs)Optionally close the dataset after use (
zarr_ds.close()) to release resources.
81-83: Broad bareexcept Exceptionmasks useful diagnosticsCatching all exceptions can hide bugs (e.g.
KeyboardInterrupt,MemoryError
or subtle library failures). CatchExceptiononly for the gateways you
expect to fail, or at least log the traceback to aid debugging.tests/test_kubocas_session.py (1)
56-80: Potential IndexError if the worker thread errors
container[0]is accessed unconditionally. If the worker crashes before
appending, the test will raise anIndexError, masking the original failure.
Defensively check the container or re-raise captured exceptions.ready.wait() -if not container: - raise RuntimeError("Worker thread failed to produce a client") -return container[0] +assert container, "Worker thread failed – see logs above" +return container[0]tests/test_public_gateway.py (1)
57-76: Public-network tests are prone to flakinessHitting real gateways introduces variability (DNS, rate-limits, outages).
Consider marking these tests@pytest.mark.network/@pytest.mark.flakyor
skipping when an environment variable isn’t set, so regular CI runs stay green.py_hamt/store_httpx.py (2)
309-319: URL construction is brittle – useurllib.parseto avoid double slashesManual string concatenation can yield
//ipfs/or duplicate query strings. Prefer:from urllib.parse import urljoin, urlencode path = f"/ipfs/{cid}" url = urljoin(base_url, path) if is_public_gateway: url = f"{url}?{urlencode({'format': 'dag-cbor'})}"Improves correctness and readability.
🧰 Tools
🪛 GitHub Actions: Triggered on push from Faolain to branch/tag fix/httpx-public-gateway
[error] 316-316: Coverage failure: line 316 is missing coverage. Total coverage 99% is less than required 100%.
321-330: Hard-coded public-gateway listEmbedding the gateway list hampers extensibility. Expose a configurable set (param or class attr) so callers can add e.g.
gateway.pinata.cloudwithout patching the library.
🛑 Comments failed to post (4)
tests/test_public_gateway.py (1)
12-25: 🛠️ Refactor suggestion
HTTP requests lack explicit timeout
Network-bound tests without a timeout can hang CI indefinitely if a gateway
stalls. Supply a per-request or client-level timeout (e.g.timeout=15.0) to
fail fast and surface flaky network issues.- client = httpx.AsyncClient(follow_redirects=True) + client = httpx.AsyncClient(follow_redirects=True, timeout=15.0)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.async def verify_response_content(url: str, client=None): """Fetch and verify the response from a given URL""" should_close = False if client is None: client = httpx.AsyncClient(follow_redirects=True, timeout=15.0) should_close = True try: # Print request info print(f"Testing URL: {url}") # Fetch content response = await client.get(url) response.raise_for_status()🤖 Prompt for AI Agents
In tests/test_public_gateway.py around lines 12 to 25, the HTTP requests made with httpx.AsyncClient lack an explicit timeout, which can cause tests to hang indefinitely if the network stalls. Add a timeout parameter, such as timeout=15.0 seconds, either when creating the AsyncClient instance or directly in the client.get call, to ensure requests fail fast and prevent hanging tests.py_hamt/store_httpx.py (3)
259-277:
⚠️ Potential issue
__del__may raise or silently swallow cleanup errorsRunning
asyncio.run()from__del__can explode with
RuntimeError: asyncio.run() cannot be called from a running event loopin notebooks / uvicorn, andcreate_task()schedules work that is never awaited.
A safer pattern is to rely solely on explicitawait cas.aclose()/context-manager and make__del__a no-op except for logging an explicit warning when resources are still open.🤖 Prompt for AI Agents
In py_hamt/store_httpx.py lines 259 to 277, the __del__ method currently tries to close resources asynchronously using asyncio.run or create_task, which can cause RuntimeError or unawaited tasks. Modify __del__ to avoid running async code; instead, make it a no-op except logging a warning if resources remain open, and rely on explicit await calls or context managers for cleanup.
194-205: 🛠️ Refactor suggestion
Potential client‐cache leak across short-lived event loops
_client_per_loopnever evicts entries, so creating many short-lived loops (e.g. in unit-test workers or threaded tasks) will accumulate open clients in the dict.
Consider an LRU or finaliser that removes entries once the referenced loop is closed to avoid an unbounded memory/FD leak.🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 194 to 205, the _client_per_loop dictionary stores clients keyed by event loops but never removes entries, causing a potential memory and file descriptor leak when many short-lived loops are created. To fix this, implement a mechanism to detect when an event loop is closed and remove its associated client from the dictionary, such as using weak references with finalizers or an LRU cache that evicts old entries, ensuring clients tied to closed loops are properly cleaned up.
284-292: 💡 Verification agent
❓ Verification inconclusive
Multipart body lacks filename – some Kubo versions reject it
files = {"file": data}sends the part without afilename; Kubo’s/addendpoint occasionally returns 400 for such payloads.
Safer:-files = {"file": data} +files = {"file": ("blob", data)}so a proper
Content-Dispositionis emitted.
Ensure multipart upload includes a filename
- File: py_hamt/store_httpx.py, lines 284–292
- files = {"file": data} + files = {"file": ("blob", data)}Including the filename (“blob”) in the
Content-Dispositionheader prevents Kubo’s/addendpoint from rejecting the request with a 400 error when no filename is provided.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.# Create multipart form data - files = {"file": data} + files = {"file": ("blob", data)} # Send the POST request client = self._loop_client() response = await client.post(self.rpc_url, files=files) response.raise_for_status() cid_str: str = response.json()["Hash"]🤖 Prompt for AI Agents
In py_hamt/store_httpx.py around lines 284 to 292, the multipart form data is sent without specifying a filename, which can cause Kubo's /add endpoint to reject the request with a 400 error. To fix this, modify the files dictionary to include a tuple with the filename "blob" and the data, ensuring the Content-Disposition header includes the filename. For example, set files to {"file": ("blob", data)} before sending the POST request.
Fix for adding public gateway support to py-hamt fix/codex-async-2 branch #71
After some edits:
Proof of use also in dClimate/zarr-benchmarking#2
/zarr-benchmarking/zarr_v3-hamt_v5-httpxfolderMaybe some memory profiling with this compared to the aiohttp makes tons of requests and keeping track of memory usage could be a good idea. This way we can see if the aiohttp session thing truly has a memory leak vs this.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Chores