Conversation

@abidsikder abidsikder commented Nov 13, 2024

This PR rewrites the library with a focus on testing infrastructure and documentation, and upgrades to version 2 since there have been breaking changes.

Changelog

  • Upgrade to version 2
    Following semantic versioning, we're bumping the major version to 2 since the new library has an incompatible API.
  • Add automated testing in GitHub workflows
  • Add automated documentation generation with pdoc
  • Verify every piece of function documentation
  • Add CPU profiling with cProfile
  • Add memory profiling with memray (thanks @rbanick !)
  • Change HAMT to use a single node type for the tree
  • Implement MutableMapping in the HAMT (see the usage sketch below)
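
Since the HAMT implements MutableMapping, it behaves like a standard Python dict. A rough usage sketch (the constructor call here is an assumption for illustration, not the confirmed API; see the README for the exact signature):

from py_hamt import HAMT, DictStore

# DictStore is the in-memory store introduced in this PR; the exact
# HAMT constructor signature is assumed here for illustration.
hamt = HAMT(store=DictStore())

hamt["foo"] = b"bar"          # __setitem__
assert hamt["foo"] == b"bar"  # __getitem__
assert len(hamt) == 1         # __len__
assert list(hamt) == ["foo"]  # __iter__
del hamt["foo"]               # __delitem__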

Explanation of new data structure decisions

The new HAMT uses a single object to represent portions of the tree. We no longer use bitmaps, since deserialization from dag_cbor produces Python dicts that we can use for efficient lookups. This also makes transformation into IPLDKind easier, which in turn makes dag_cbor easier to use.

Each Node is essentially a dict with buckets and links. The string indexes into these buckets and links dicts are formed by extracting a byte of the key's hash, with the byte position determined by the depth in the tree. Reading __getitem__ is a good starting point for understanding the tree's traversal and structure; a rough sketch of the idea follows.
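
To make that concrete, here is a minimal sketch of the traversal idea. This is not the library's actual code: the Node fields and blake3_hashfn follow the descriptions in this PR, but the real signatures and details differ.

from blake3 import blake3

def blake3_hashfn(data: bytes) -> bytes:
    return blake3(data).digest()

class Node:
    def __init__(self) -> None:
        # both dicts are keyed by the string form of one hash byte
        self.buckets: dict[str, dict[str, bytes]] = {}
        self.links: dict[str, "Node"] = {}

def get(root: Node, key: str) -> bytes:
    hash_bytes = blake3_hashfn(key.encode())
    node = root
    for depth in range(len(hash_bytes)):
        index = str(hash_bytes[depth])  # the hash byte at this depth picks the slot
        if index in node.buckets:
            bucket = node.buckets[index]
            if key in bucket:
                return bucket[key]
            raise KeyError(key)
        if index in node.links:
            node = node.links[index]  # descend one level; the next byte indexes there
            continue
        raise KeyError(key)
    raise KeyError(key)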

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced automated workflows for deploying static sites to GitHub Pages for both main and preview environments.
    • Added a new Store module with DictStore and IPFSStore implementations for data storage.
    • Updated README.md with clearer installation, usage, and development instructions.
    • Added a new shell script run-checks.sh to automate testing and code quality checks.
  • Bug Fixes

    • Enhanced error handling in the HAMT implementation and related tests.
  • Documentation

    • Significant updates to the README.md to improve clarity and usability.
    • New tests added for extract_bits, HAMT, and Zarr dataset functionalities.
  • Chores

    • Updated project configuration to use hatchling for building and managing dependencies.
    • Added entries to .gitignore to exclude specific files and directories.

@abidsikder abidsikder marked this pull request as draft November 14, 2024 17:44

coderabbitai bot commented Nov 25, 2024

Walkthrough

This pull request introduces multiple new GitHub Actions workflows for automating the deployment of a static site generated by pdoc to GitHub Pages, both for the main branch and for pull request previews. It also adds a new script for running development checks, updates the .gitignore file, specifies a Python version, and restructures the README.md for clarity. Additionally, it includes significant changes to the py_hamt implementation, including the introduction of new classes and methods, while removing obsolete files and functionalities.

Changes

File Change Summary
.github/workflows/pages-main.yaml New workflow for deploying static site to GitHub Pages on main branch.
.github/workflows/pages-preview.yaml New workflow for deploying static site to GitHub Pages for pull request previews.
.github/workflows/run-checks.yaml New workflow for running various checks on the codebase upon push events.
.gitignore Added entries for pyrightconfig.json, profiling files, and cython_debug/.
.python-version Specified Python version as 3.12.
README.md Updated title, structure, and content for clarity; added sections on usage and development guide.
examples/ipfs_store.py Deleted file containing HamtIPFSStore class and related methods.
py_hamt/__init__.py Added imports for HAMT, blake3_hashfn, Store, DictStore, and IPFSStore; updated __all__.
py_hamt/bit_utils.py Deleted file containing utility functions for bit manipulation.
py_hamt/hamt.py Major restructuring; introduced Node class, added methods for bit extraction and hashing.
py_hamt/store.py New module defining Store, DictStore, and IPFSStore classes with their respective methods.
pyproject.toml Restructured with new project metadata, dependencies, and build system configuration.
run-checks.sh New script to automate testing, coverage, and linting checks.
setup.cfg Deleted configuration file with metadata and settings for the package.
test/bit_utils_test.py Deleted file containing unit tests for the bit_utils module.
tests/test_extract_bits.py New test file for extract_bits function with various test cases.
tests/test_hamt.py New test file for HAMT implementation with comprehensive tests using hypothesis.
tests/test_zarr_ipfs.py New test file for validating saving/loading of xarray datasets using Zarr and IPFS.

Poem

In the garden of code, where changes bloom,
New workflows sprout, dispelling the gloom.
With pdoc's magic, our docs take flight,
GitHub Pages gleam, oh what a sight!
From tests and checks to a README so bright,
The rabbit hops on, in pure delight! 🐇✨



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 20

🧹 Outside diff range and nitpick comments (27)
run-checks.sh (3)

3-4: Add error handling for pytest execution

The pytest execution should be checked for success before proceeding to coverage reporting.

 # Run pytest with coverage
-uv run pytest --cov=py_hamt tests/
+if ! uv run pytest --cov=py_hamt tests/; then
+    echo "Tests failed! Fixing tests should be the first priority."
+    exit 1
+fi

6-7: Consider a more gradual approach to coverage requirements

While 100% coverage is an admirable goal, it might be too strict initially and could discourage contributions. Consider:

  1. Starting with a lower threshold (e.g., 80%)
  2. Gradually increasing it as the codebase matures
  3. Using a coverage configuration file instead of CLI arguments
 # Check coverage
-uv run coverage report --fail-under=100 --show-missing
+# Create a .coveragerc file with:
+# [report]
+# fail_under = 80
+# show_missing = True
+uv run coverage report

9-13: Add error handling for linting steps

The linting and formatting steps should be checked for success and provide meaningful error messages.

 # Check linting with ruff
-uv run ruff check
+if ! uv run ruff check; then
+    echo "Linting failed! Please fix the issues before proceeding."
+    exit 1
+fi
 
 # Auto format with ruff
-uv run ruff format
+if ! uv run ruff format; then
+    echo "Formatting failed! This is unexpected - please check ruff installation."
+    exit 1
+fi
+
+echo "All checks passed successfully! ✨"
.github/workflows/run-checks.yaml (1)

15-20: Add dependency caching to improve workflow performance.

Consider adding caching for UV dependencies to speed up the workflow.

       - name: Install uv
         uses: astral-sh/setup-uv@v3

+      - name: Cache UV dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: ${{ runner.os }}-uv-${{ hashFiles('**/requirements.txt') }}
+          restore-keys: |
+            ${{ runner.os }}-uv-

       - name: Create project environment
         run: uv sync
.github/workflows/pages-main.yaml (4)

5-8: Remove redundant permissions configuration.

The permissions are defined both at the workflow level (lines 5-8) and job level (lines 14-17). Since they are identical, you can remove the job-level permissions to reduce redundancy.

 permissions:
   contents: read
   pages: write
   id-token: write
 jobs:
   build:
     environment:
       name: github-pages
       url: ${{ steps.deployment.outputs.page_url }}
-    permissions:
-      contents: read
-      pages: write
-      id-token: write
     runs-on: ubuntu-latest

Also applies to: 14-17


24-28: Add caching for Python dependencies.

Consider adding a caching step for uv dependencies to improve workflow execution time.

 - name: Install uv
   uses: astral-sh/setup-uv@v3
+ - name: Cache dependencies
+   uses: actions/cache@v4
+   with:
+     path: .venv
+     key: ${{ runner.os }}-uv-${{ hashFiles('requirements.txt') }}
 - name: Create project environment
   run: uv sync

28-29: Enhance pdoc documentation generation.

Consider adding additional pdoc options to improve the generated documentation:

  • Add --docformat google for better docstring parsing
  • Add --show-source to include source code in documentation
 - name: Build with pdoc
-  run: uv run pdoc py_hamt -o ./_site
+  run: uv run pdoc py_hamt -o ./_site --docformat google --show-source

28-32: Add documentation verification step.

Consider adding a step to verify the generated documentation before uploading:

 - name: Build with pdoc
   run: uv run pdoc py_hamt -o ./_site
+ - name: Verify documentation
+   run: |
+     if [ ! -f "_site/index.html" ]; then
+       echo "Documentation generation failed - index.html not found"
+       exit 1
+     fi
 - name: Upload artifact
.github/workflows/pages-preview.yaml (2)

3-4: Add path and event type filters to optimize workflow triggers.

Consider adding path filters to run only when documentation-related files change, and specify pull request types to avoid unnecessary runs.

 name: Deploy static site generated by pdoc to GitHub Pages preview environment
-on: pull_request
+on:
+  pull_request:
+    types: [opened, synchronize, reopened]
+    paths:
+      - 'py_hamt/**/*.py'
+      - 'docs/**'
+      - '.github/workflows/pages-preview.yaml'

5-8: Remove duplicate permissions definition.

The permissions are defined both at the workflow and job level. Since this workflow has only one job, the workflow-level permissions are sufficient.

 permissions:
   contents: read
   pages: write
   id-token: write
 jobs:
   build:
     environment:
       name: "Pages Preview"
       url: ${{ steps.deployment.outputs.page_url }}
-    permissions:
-      contents: read
-      pages: write
-      id-token: write
     runs-on: ubuntu-latest

Also applies to: 14-17

tests/test_extract_bits.py (2)

5-32: Consider parameterizing test cases and adding edge cases.

The test cases provide good coverage of various scenarios. However, consider the following improvements:

  1. Parameterize the test cases using @pytest.mark.parametrize for better maintainability
  2. Add descriptive comments explaining the purpose of each test case
  3. Include additional edge cases (e.g., empty bytes, zero-length extraction)

Here's an example of how to refactor using parameterization:

@pytest.mark.parametrize(
    "input_bytes, start, length, expected, description",
    [
        (bytes([0b11111111]), 0, 5, 0b11111, "Extract first 5 bits from all-ones byte"),
        (bytes([0b10101010]), 0, 5, 0b10101, "Extract first 5 bits from alternating bits"),
        # ... add more cases with descriptions
    ]
)
def test_extract_bits(input_bytes, start, length, expected, description):
    assert extract_bits(input_bytes, start, length) == expected

34-37: Consider adding more error cases.

Good error handling test for out-of-bounds extraction. Consider adding tests for additional error cases:

  • Negative start position
  • Negative length
  • Zero length

Example:

@pytest.mark.parametrize(
    "input_bytes, start, length, error_type, error_msg",
    [
        (bytes([0b1]), -1, 5, ValueError, "Start position cannot be negative"),
        (bytes([0b1]), 0, -1, ValueError, "Length cannot be negative"),
        (bytes([0b1]), 0, 0, ValueError, "Length must be positive"),
    ]
)
def test_extract_bits_errors(input_bytes, start, length, error_type, error_msg):
    with pytest.raises(error_type, match=error_msg):
        extract_bits(input_bytes, start, length)
tests/test_zarr_ipfs.py (1)

15-65: LGTM! Well-structured fixture with proper cleanup.

The fixture creates realistic test data with appropriate metadata and ensures proper cleanup.

Consider parameterizing the fixture for different dataset sizes.

This would help test performance characteristics with varying data volumes.

@pytest.fixture(params=[
    (10, 6, 12),    # Small dataset
    (100, 18, 36),  # Medium dataset (current)
    (365, 36, 72),  # Large dataset
])
def random_zarr_dataset(request):
    periods, n_lats, n_lons = request.param
    times = pd.date_range("2024-01-01", periods=periods)
    lats = np.linspace(-90, 90, n_lats)
    lons = np.linspace(-180, 180, n_lons)
    # ... rest of the fixture code ...
README.md (4)

11-11: Fix subject-verb agreement

The sentence has a subject-verb agreement error.

-Some key differences is that the py-hamt data structure is mutable
+Some key differences are that the py-hamt data structure is mutable
🧰 Tools
🪛 LanguageTool

[grammar] ~11-~11: The verb form ‘is’ does not seem to match the subject ‘differences’.
Context: ...ddressed systems. Some key differences is that the py-hamt data structure is muta...

(SUBJECT_VERB_AGREEMENT_PLURAL)


37-37: Fix grammar in testing description

The sentence structure using "both" and "as well as" is incorrect.

-This will both testing with coverage, as well as the formatting and linting.
+This runs testing with coverage, as well as formatting and linting.
🧰 Tools
🪛 LanguageTool

[grammar] ~37-~37: Probable usage error. Use “and” after ‘both’.
Context: ...` This will both testing with coverage, as well as the formatting and linting. See the com...

(BOTH_AS_WELL_AS)


32-37: Enhance IPFS daemon requirements

The IPFS daemon requirement needs more detailed setup instructions.

 ## Run tests, formatting, linting
-First, make sure you have the ipfs kubo daemon installed so that you can run the tests that utilize IPFS as a backing store. e.g. `ipfs daemon`. If needed, configure the test with your custom HTTP gateway and RPC API endpoints. Then run the script
+First, ensure you have the IPFS Kubo daemon installed and running:
+1. Install IPFS Kubo from https://docs.ipfs.tech/install/command-line/
+2. Start the daemon: `ipfs daemon`
+3. Configure test environment (if needed):
+   ```sh
+   export IPFS_GATEWAY_URL=http://localhost:8080
+   export IPFS_RPC_URL=http://localhost:5001
+   ```
+
+Then run the checks:
🧰 Tools
🪛 LanguageTool

[grammar] ~37-~37: Probable usage error. Use “and” after ‘both’.
Context: ...` This will both testing with coverage, as well as the formatting and linting. See the com...

(BOTH_AS_WELL_AS)


59-59: Fix punctuation

Add missing commas in the following lines:

-On pushes to main the docs folder will be deployed
+On pushes to main, the docs folder will be deployed

-For more information please see
+For more information, please see

Also applies to: 67-67

🧰 Tools
🪛 LanguageTool

[uncategorized] ~59-~59: Possible missing comma found.
Context: ...dev/) for its ease of use. On pushes to main the docs folder will be deployed to the...

(AI_HYDRA_LEO_MISSING_COMMA)

py_hamt/store.py (2)

45-47: Avoid shadowing built-in function by renaming hash variable

The variable hash in the save method shadows Python's built-in hash() function. To prevent potential confusion or unexpected behavior, consider renaming it to digest or key.

Apply this diff to rename the variable:

 def save(self, data: bytes) -> bytes:
-    hash = self.hash_alg.digest(data, size=32)
-    self.store[hash] = data
-    return hash
+    digest = self.hash_alg.digest(data, size=32)
+    self.store[digest] = data
+    return digest

58-58: Adjust type annotations to eliminate # type: ignore

The # type: ignore comment can be avoided by ensuring that the type annotations are accurate. Since bytes is a valid IPLDKind, you can remove the type: ignore comment.

Apply this diff to adjust the type annotation:

-    def load(self, id: bytes):  # type: ignore
+    def load(self, id: bytes) -> bytes:
tests/test_hamt.py (6)

71-140: Consider splitting test_fuzz into smaller, focused test functions

The test_fuzz function is extensive and covers multiple test scenarios, including insertion, deletion, read-only state tests, and deep copy verification. Splitting it into smaller, focused test functions can improve readability and maintainability. This approach allows for more precise identification of issues when tests fail and makes the test suite easier to understand and extend.


121-121: Typographical error in comment

There is a typo in the comment on line 121:

-# Make sure all ids actually exist in the store, this should not raies any exceptions
+# Make sure all ids actually exist in the store, this should not raise any exceptions

Correcting this enhances code readability and maintains professional documentation standards.


181-181: Avoid Yoda conditions in assertions

In lines 181 and 186, the assertions are written in Yoda condition style. It's more idiomatic in Python to place the variable being tested on the left side of the comparison. Consider rewriting the assertions for clarity:

Line 181:

-assert b"bar" == hamt["foo"]
+assert hamt["foo"] == b"bar"

Line 186:

-assert b"something else" == hamt["foo"]
+assert hamt["foo"] == b"something else"

This makes the code more readable and aligns with common Python coding conventions.

Also applies to: 186-186

🧰 Tools
🪛 Ruff (0.7.0)

181-181: Yoda condition detected

Rewrite as hamt["foo"] == b"bar"

(SIM300)


56-69: Consider removing or refactoring large blocks of commented-out code

There are significant sections of commented-out code in lines 56-69 and 211-233. Retaining such code can clutter the file and make maintenance more challenging. If this code is no longer needed, consider removing it. If it's intended for future use or reference, consider moving it to separate files or documenting the reason for retention.

Also applies to: 211-233


200-209: Hard-coded test data in test_link_following could be parameterized

In the test_link_following function, the kvs list is hard-coded with specific key-value pairs:

kvs = [("\x0e", b""), ("\x9aÛôå", b""), ("\U000e1d41\U000fef3e\x89", b"")]

To enhance test coverage and maintainability, consider parameterizing this test with dynamically generated data or using strategies from the hypothesis library. This approach can help uncover edge cases that may not be covered by the current hard-coded values.


192-196: Add assertions to verify the behavior in test_cache_clear

The test_cache_clear function currently populates the HAMT but does not contain assertions to verify expected behavior. Adding assertions can help ensure that the cache clearing mechanism works as intended.

For example:

assert len(hamt) == 50
# Additional checks to verify cache behavior

This enhances the test's effectiveness by validating the outcomes of the operations performed.

py_hamt/hamt.py (2)

364-364: Use a more specific exception type for clarity

Instead of raising a generic Exception, consider using a more specific exception like PermissionError or defining a custom exception. This improves error handling and helps users to catch specific exceptions more effectively.

Apply this diff:

- raise Exception("Cannot call set on a read only HAMT")
+ raise PermissionError("Cannot modify a read-only HAMT")

262-262: Avoid unnecessary deep copy in read_node method

The deepcopy of result in read_node may not be necessary and could impact performance. Since the node data is immutable upon reading, and modifications create new nodes, you might avoid the deep copy to improve efficiency.

Apply this diff:

- return deepcopy(result)
+ return result
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 09ef0df and 5cc4b6e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • .github/workflows/pages-main.yaml (1 hunks)
  • .github/workflows/pages-preview.yaml (1 hunks)
  • .github/workflows/run-checks.yaml (1 hunks)
  • .gitignore (2 hunks)
  • .python-version (1 hunks)
  • README.md (1 hunks)
  • examples/ipfs_store.py (0 hunks)
  • py_hamt/__init__.py (1 hunks)
  • py_hamt/bit_utils.py (0 hunks)
  • py_hamt/hamt.py (1 hunks)
  • py_hamt/store.py (1 hunks)
  • pyproject.toml (1 hunks)
  • run-checks.sh (1 hunks)
  • setup.cfg (0 hunks)
  • test/bit_utils_test.py (0 hunks)
  • tests/test_extract_bits.py (1 hunks)
  • tests/test_hamt.py (1 hunks)
  • tests/test_zarr_ipfs.py (1 hunks)
💤 Files with no reviewable changes (4)
  • examples/ipfs_store.py
  • py_hamt/bit_utils.py
  • setup.cfg
  • test/bit_utils_test.py
✅ Files skipped from review due to trivial changes (3)
  • .gitignore
  • .python-version
  • py_hamt/__init__.py
🧰 Additional context used
🪛 LanguageTool
README.md

[grammar] ~11-~11: The verb form ‘is’ does not seem to match the subject ‘differences’.
Context: ...ddressed systems. Some key differences is that the py-hamt data structure is muta...

(SUBJECT_VERB_AGREEMENT_PLURAL)


[grammar] ~37-~37: Probable usage error. Use “and” after ‘both’.
Context: ...` This will both testing with coverage, as well as the formatting and linting. See the com...

(BOTH_AS_WELL_AS)


[grammar] ~39-~39: Probable usage error. Use “and” after ‘both’.
Context: ...h test inputs that are both handwritten as well as generated by hypothesis. This allows ...

(BOTH_AS_WELL_AS)


[uncategorized] ~59-~59: Possible missing comma found.
Context: ...dev/) for its ease of use. On pushes to main the docs folder will be deployed to the...

(AI_HYDRA_LEO_MISSING_COMMA)


[typographical] ~67-~67: Consider adding a comma here.
Context: ... add --dev pytest`. For more information please see the [uv documentation](https://docs...

(PLEASE_COMMA)

🪛 Ruff (0.7.0)
tests/test_hamt.py

181-181: Yoda condition detected

Rewrite as hamt["foo"] == b"bar"

(SIM300)


186-186: Yoda condition detected

Rewrite as hamt["foo"] == b"something else"

(SIM300)

🔇 Additional comments (15)
run-checks.sh (1)

1-13: Verify uv installation

The script assumes uv is available but doesn't verify its presence.

Add a check for uv at the beginning of the script:

 #!/bin/bash
 set -euo pipefail
+
+# Check if uv is installed
+if ! command -v uv >/dev/null 2>&1; then
+    echo "Error: uv is not installed. Please install it first."
+    echo "Visit: https://github.com/astral-sh/uv"
+    exit 1
+fi
pyproject.toml (1)

14-17: LGTM! Modern build system configuration

The use of hatchling as the build backend is a good choice, following modern Python packaging practices.

.github/workflows/run-checks.yaml (2)

6-10: LGTM!

The job configuration is well-structured and uses appropriate settings.


1-5: Consider refining the workflow trigger configuration.

The workflow currently runs on all push events. Consider:

  1. Limiting to specific branches (e.g., main and feature branches)
  2. Adding path filters to run only when relevant files change
-on: push
+on:
+  push:
+    branches:
+      - main
+      - 'feature/**'
+    paths:
+      - '**.py'
+      - 'pyproject.toml'
+      - '.github/workflows/**'

Also, let's verify the comment about the shell script.

✅ Verification successful

The workflow steps correctly mirror the shell script, but could benefit from branch filtering

The comment "Should be the same as py-hamt/run-checks.sh" is accurate as the GitHub Actions workflow implements all the steps from the shell script:

  • Running pytest with coverage
  • Checking coverage report
  • Running ruff for linting
  • Running ruff for formatting

The original suggestion to refine the workflow trigger remains valid to optimize CI resources:

on:
  push:
    branches:
      - main
      - 'feature/**'
    paths:
      - '**.py'
      - 'pyproject.toml'
      - '.github/workflows/**'
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and content of the run-checks.sh script
fd -t f "run-checks.sh" -x cat {}

Length of output: 272


Script:

#!/bin/bash
# Let's verify if the workflow steps match the shell script content
cat .github/workflows/run-checks.yaml

Length of output: 1007

.github/workflows/pages-main.yaml (3)

1-4: LGTM! Clear workflow name and appropriate trigger configuration.

The workflow name accurately describes its purpose, and the trigger is correctly configured for the main branch.


10-18: LGTM! Proper environment configuration for GitHub Pages.

The job configuration correctly sets up the GitHub Pages environment and deployment URL reference.


20-22: LGTM! Excellent security practice with pinned SHA commits.

All external actions are properly pinned to specific SHA commits, which is a security best practice.

Also applies to: 23-24, 31-35

.github/workflows/pages-preview.yaml (2)

20-27: LGTM! Secure implementation of checkout and setup.

Great job on:

  • Using pinned action versions for security
  • Properly configuring PR branch checkout
  • Setting up Pages environment correctly

1-41: Verify security implications of PR preview deployments.

Since this workflow can be triggered by external contributors, ensure that:

  1. Sensitive environment variables are not exposed
  2. Resource limits are in place
  3. The preview environment is properly isolated
✅ Verification successful

PR preview deployment workflow is properly secured

The workflow configuration follows security best practices:

  1. Minimal required permissions are set (contents: read, pages: write, id-token: write)
  2. Uses pinned action versions with SHA hashes
  3. Only builds and deploys static documentation using pdoc
  4. No sensitive environment variables or secrets are exposed
  5. Preview environment is properly isolated using GitHub's built-in Pages preview feature
  6. Only runs basic Python package management and documentation generation commands

The workflow is safe for external contributors as it:

  • Cannot access sensitive data
  • Uses GitHub's native preview environment isolation
  • Performs limited, documentation-focused operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for potential security issues in the workflow

# Check for environment secrets usage
rg -l "secrets\." .github/workflows/

# Check for other workflow files that might interact with this one
fd -e yml -e yaml . .github/workflows/

# Look for environment variables that might leak sensitive data
rg -i "env\." .github/workflows/

Length of output: 219


Script:

#!/bin/bash
# Let's examine the content of all workflow files to verify security measures

# Check the content of all workflow files
for file in $(fd -e yml -e yaml . .github/workflows/); do
    echo "=== Content of $file ==="
    cat "$file"
    echo -e "\n\n"
done

# Check for any write permissions in workflows
echo "=== Checking for write permissions ==="
rg -i "permissions.*write" .github/workflows/

# Check for any environment configurations
echo "=== Checking for environment configurations ==="
rg -i "environment:" .github/workflows/

Length of output: 4719

tests/test_extract_bits.py (2)

1-4: LGTM! Clean and minimal imports.

The imports are appropriate and well-organized with proper spacing.


1-37: Verify test uniqueness across the codebase.

Let's ensure there are no duplicate or overlapping test cases in other test files.

✅ Verification successful

Test cases for extract_bits are properly isolated

The verification confirms that:

  • There is only one test file (test_extract_bits.py) containing extract_bits tests
  • All test cases are contained within this single file with no duplicates elsewhere
  • The test patterns are comprehensive, covering various bit patterns and edge cases
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other test files that might contain similar extract_bits tests
rg -l "test.*extract.*bits" tests/
# Search for similar test patterns
rg -A 2 "extract_bits\(bytes\(\[" tests/

Length of output: 1958

tests/test_zarr_ipfs.py (2)

1-13: LGTM! Well-organized imports.

The imports are properly organized and all are actively used in the code.


82-82: Address the type ignore comment.

The type ignore comment suggests a potential type system issue with root_node_id. Consider properly typing the HAMT class to avoid this.

tests/test_hamt.py (2)

17-30: Well-implemented cid_strategy function

The cid_strategy function effectively generates random CIDs for testing purposes, ensuring variability and robustness in the test cases.


32-43: Efficient generation of diverse IPLD types in ipld_strategy

The ipld_strategy function successfully creates a strategy that produces diverse IPLD kinds, which is crucial for comprehensive testing of the HAMT implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
py_hamt/store.py (4)

15-16: Improve documentation clarity regarding IPLDKind type

The comment about IPLDKind generating a "strange type" in documentation is vague. Consider documenting the actual expected types that comprise the IPLDKind union for better clarity.
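
For reference, the IPLD data model union can be spelled out along these lines (a sketch only; the actual alias dag_cbor exposes may be defined differently):

from typing import Union
from multiformats import CID

# The IPLD data model kinds; dag_cbor's IPLDKind alias is equivalent
# in spirit, though its exact definition may differ.
IPLDKind = Union[
    None,   # null
    bool,   # boolean
    int,    # integer
    float,  # float
    str,    # string
    bytes,  # bytes
    list,   # list
    dict,   # map
    CID,    # link
]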


49-55: Add proper docstrings for save_raw and save_dag_cbor methods

Both methods currently have empty docstrings. Consider adding proper documentation explaining their specific use cases and any differences in how they handle the data.


58-63: Enhance error message in load method

The current error message could be more informative. Consider including the ID that wasn't found in the error message.

-            raise Exception("ID not found in store")
+            raise KeyError(f"ID {id.hex()} not found in store")

140-140: Add comment explaining the example CID

The hardcoded CID in the example should have a comment explaining what it represents or what kind of data it contains, making the example more meaningful.

-        cid = CID.decode("bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze")
+        # Example CID pointing to a small test file containing "Hello, World!"
+        cid = CID.decode("bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc4b6e and 9a6abc47110aca537c82ea3a45e26008399dae5b.

📒 Files selected for processing (1)
  • py_hamt/store.py (1 hunks)
🔇 Additional comments (3)
py_hamt/store.py (3)

31-43: LGTM! Well-structured in-memory store implementation

The DictStore implementation is clean, well-documented, and properly attributes its inspiration source. The use of blake3 for hashing is a good choice for content addressing.


66-95: LGTM! Well-documented IPFS store implementation

The IPFSStore implementation is thorough with comprehensive documentation and sensible defaults. The flexibility in configuration is a plus.


116-120: Verify CID codec handling logic

The comment explains the rationale for CID codec handling, but let's verify this assumption about HAMT node sizes and the special handling of dag-pb codec.

✅ Verification successful

Let me gather more information about HAMT node structure and size calculations.


Let me check one final aspect about the HAMT node structure and serialization.


Based on the gathered evidence from previous scripts, I can now provide a final response:

CID codec handling logic is correctly implemented

The code's assumption about HAMT node sizes is well-justified based on the evidence:

  • The HAMT implementation uses a configurable bucket size (default=4, much lower than the 10 mentioned in comments)
  • Tests confirm proper handling of various bucket sizes (1-10 range)
  • The serialization is done using dag-cbor by default, with special handling for dag-pb (0x70) codec
  • The implementation includes cache size limits (10MB default) which provides an additional safety net

The conservative calculation in the comment (256 * 10 * 100 bytes = 0.256MB) aligns with the actual implementation's constraints, making it impossible for HAMT nodes to approach the 1MB threshold mentioned.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for HAMT node size calculations or tests
# This will help verify the assumption about node sizes being under 1MB

# Look for size-related tests or calculations
rg -A 5 "size|length|bytes" tests/
# Look for dag-pb codec handling
ast-grep --pattern 'if $_.code != 0x70'

Length of output: 6816


Script:

#!/bin/bash
# Look for HAMT node implementation and structure
ast-grep --pattern 'class HAMT'

# Search for bucket size related code and comments
rg -A 5 "bucket_size|max_bucket" py_hamt/

# Look for actual HAMT node serialization code
rg -A 10 "save|serialize" py_hamt/store.py

Length of output: 5368


Script:

#!/bin/bash
# Look for HAMT node structure and serialization
ast-grep --pattern 'def _make_node'

# Check for actual HAMT node content structure
rg -A 10 "node.*=.*{" py_hamt/

Length of output: 154

Otherwise, on initializing a new HAMT we'd have to scan the entire tree to make sure the internal key count state is valid. This would be performance-prohibitive for very large HAMTs.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (9)
py_hamt/store.py (5)

10-27: Enhance documentation for the Store abstract class.

Consider adding:

  • Examples of typical usage (see the sketch below)
  • Explanation of the difference between save_raw and save_dag_cbor
  • Expected behavior for error cases
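
For example, a docstring usage example along these lines would address the first two points (a sketch; the method names follow this PR's diff, but the exact behavior should be confirmed against the final code):

from py_hamt import DictStore

store = DictStore()

# save_raw stores opaque bytes and returns a content-derived ID;
# save_dag_cbor is intended for dag-cbor-encoded node data (the
# distinction matters for IPFSStore, which must choose the right codec).
object_id = store.save_raw(b"some raw bytes")
assert store.load(object_id) == b"some raw bytes"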

47-54: Add docstrings and return type hints to methods.

The save_raw and save_dag_cbor methods have empty docstrings. Consider adding proper documentation and return type hints.


55-61: Make exception message more specific.

The generic "ID not found in store" message could be more helpful by including the ID that wasn't found.

-            raise Exception("ID not found in store")
+            raise KeyError(f"ID {id.hex()} not found in store")

71-77: Consider adding validation for timeout and URI parameters.

The initialization parameters could benefit from validation to ensure they are reasonable values.

     def __init__(
         self,
         timeout_seconds=30,
         gateway_uri_stem="http://127.0.0.1:8080",
         rpc_uri_stem="http://127.0.0.1:5001",
         hasher="blake3",
     ):
+        if timeout_seconds <= 0:
+            raise ValueError("timeout_seconds must be positive")
+        if not gateway_uri_stem or not rpc_uri_stem:
+            raise ValueError("URI stems cannot be empty")
         self.timeout_seconds = timeout_seconds

141-141: Add explanation for the example CID.

The hardcoded CID in the documentation example should include an explanation of what it represents.

tests/test_hamt.py (3)

22-24: Extract magic numbers as named constants.

The hash function identifier and length could be more readable as named constants.

+    SHA2_256_ID = b"\x12"
+    HASH_LENGTH_32 = b"\x20"
     digests = st.binary(min_size=32, max_size=32).map(
-        lambda d: bytes.fromhex("1220") + d  # 0x12 = sha2-256, 0x20 = 32 bytes
+        lambda d: SHA2_256_ID + HASH_LENGTH_32 + d
     )

Line range hint 577-578: Rename unused loop variable.

The loop variable key is not used in the loop body.

-        for key in self:
+        for _ in self:
             key_count += 1
🧰 Tools
🪛 Ruff (0.8.0)

210-210: Yoda condition detected

Rewrite as hamt["foo"] == b"bar"

(SIM300)


215-215: Yoda condition detected

Rewrite as hamt["foo"] == b"something else"

(SIM300)


Line range hint 294-294: Extract magic number as a named constant.

The max_bucket_size value of 4 could be a named constant with documentation explaining its choice.

+        # Optimal bucket size based on performance testing
+        DEFAULT_MAX_BUCKET_SIZE = 4
-        self.max_bucket_size = 4
+        self.max_bucket_size = DEFAULT_MAX_BUCKET_SIZE
🧰 Tools
🪛 Ruff (0.8.0)

210-210: Yoda condition detected

Rewrite as hamt["foo"] == b"bar"

(SIM300)


215-215: Yoda condition detected

Rewrite as hamt["foo"] == b"something else"

(SIM300)

py_hamt/hamt.py (1)

239-241: Use monotonic time for cache timestamps.

Using time.time() is vulnerable to system clock changes. Consider using time.monotonic() instead.

-        current_time = time.time()
+        current_time = time.monotonic()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c51b729 and f15534e.

📒 Files selected for processing (3)
  • py_hamt/hamt.py (1 hunks)
  • py_hamt/store.py (1 hunks)
  • tests/test_hamt.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
py_hamt/hamt.py (2)
Learnt from: abidsikder
PR: dClimate/py-hamt#7
File: py_hamt/hamt.py:427-428
Timestamp: 2024-12-02T20:10:25.014Z
Learning: In `py_hamt/hamt.py`, during re-insertion of key-value pairs due to bucket overflow, the recursive algorithm for adding elements requires decrementing `self.key_count`.
Learnt from: abidsikder
PR: dClimate/py-hamt#7
File: py_hamt/hamt.py:366-367
Timestamp: 2024-12-02T20:09:32.152Z
Learning: In the `py_hamt` HAMT implementation, always obtaining a lock prevents multithreaded reads in read-only mode. To allow multithreaded reads in read-only mode, locks should not be acquired during read operations.
🪛 Ruff (0.8.0)
py_hamt/hamt.py

96-98: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


577-577: Loop control variable key not used within loop body

Rename unused key to _key

(B007)

tests/test_hamt.py

210-210: Yoda condition detected

Rewrite as hamt["foo"] == b"bar"

(SIM300)


215-215: Yoda condition detected

Rewrite as hamt["foo"] == b"something else"

(SIM300)

🔇 Additional comments (3)
tests/test_hamt.py (1)

128-131: 🛠️ Refactor suggestion

Use specific exception types instead of generic Exception.

Consider defining custom exceptions for HAMT-specific error conditions.

+class HAMTReadOnlyError(Exception):
+    """Raised when attempting to modify a read-only HAMT."""
+    pass

-        raise Exception("Cannot call set on a read only HAMT")
+        raise HAMTReadOnlyError("Cannot call set on a read only HAMT")

Likely invalid or redundant comment.

py_hamt/hamt.py (2)

39-39: ⚠️ Potential issue

Specify byteorder in from_bytes call.

The from_bytes method requires a byteorder argument.

-    hash_as_int = int.from_bytes(hash_bytes)
+    hash_as_int = int.from_bytes(hash_bytes, byteorder='big')

Likely invalid or redundant comment.


37-37: ⚠️ Potential issue

Fix incorrect usage of bit_length method.

The bit_length method should be called on the integer object, not the class.

-    n_chop_off_at_end = int.bit_length(mask) - nbits
+    n_chop_off_at_end = mask.bit_length() - nbits

Likely invalid or redundant comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f15534e and d2cb4aa.

📒 Files selected for processing (1)
  • py_hamt/hamt.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
py_hamt/hamt.py (1)
Learnt from: abidsikder
PR: dClimate/py-hamt#7
File: py_hamt/hamt.py:427-428
Timestamp: 2024-12-02T20:10:25.014Z
Learning: In `py_hamt/hamt.py`, during re-insertion of key-value pairs due to bucket overflow, the recursive algorithm for adding elements requires decrementing `self.key_count`.
🪛 Ruff (0.8.0)
py_hamt/hamt.py

96-98: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (4)
py_hamt/hamt.py (4)

219-275: LGTM! Well-implemented caching mechanism.

The cache implementation is robust with:

  • Proper LRU eviction strategy
  • Size-based cache management
  • Deep copying to prevent cache inconsistency

37-37: ⚠️ Potential issue

Fix incorrect usage of bit_length method.

The bit_length method should be called on the integer object mask, not on the int class.

-n_chop_off_at_end = int.bit_length(mask) - nbits
+n_chop_off_at_end = mask.bit_length() - nbits

Likely invalid or redundant comment.


39-39: ⚠️ Potential issue

Specify byteorder in int.from_bytes.

The int.from_bytes() method requires a byteorder parameter. Without it, the code may break in future Python versions.

-hash_as_int = int.from_bytes(hash_bytes)
+hash_as_int = int.from_bytes(hash_bytes, byteorder='big')

Likely invalid or redundant comment.


531-533: ⚠️ Potential issue

Remove lock acquisition for read operations.

According to the learnings, acquiring locks during read operations prevents multithreaded reads in read-only mode.

-if not self.read_only:
-    self.lock.acquire(blocking=True)
+# Only acquire lock if we're in write mode
+write_lock = None
+if not self.read_only:
+    write_lock = self.lock.acquire(blocking=True)
⛔ Skipped due to learnings
Learnt from: abidsikder
PR: dClimate/py-hamt#7
File: py_hamt/hamt.py:366-367
Timestamp: 2024-12-02T20:09:32.152Z
Learning: In the `py_hamt` HAMT implementation, always obtaining a lock prevents multithreaded reads in read-only mode. To allow multithreaded reads in read-only mode, locks should not be acquired during read operations.

@Faolain Faolain merged commit aff9aaf into main Dec 5, 2024
3 checks passed
@abidsikder abidsikder deleted the version2 branch December 9, 2024 16:11