Version2 review #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also add msgspec and requests dependencies for the IPFS storage layer.
The cache helps when the HAMT's store has higher latency, such as with IPFS.
Walkthrough

This pull request introduces several new GitHub Actions workflows to automate deployment and testing processes.
Actionable comments posted: 16
🧹 Outside diff range and nitpick comments (23)
run-checks.sh (1)
3-4: Enhance test execution configuration

Consider the following improvements for more reliable test execution:
```diff
 # Run pytest with coverage
+# Clean previous coverage data
+uv run coverage erase
+
 # Run tests with coverage and generate HTML report
-uv run pytest --cov=py_hamt tests/
+uv run pytest --cov=py_hamt --cov-report=html --cov-report=term-missing tests/
```
This will:
- Clean previous coverage data before running tests
- Generate an HTML coverage report for better visualization
- Show missing lines in terminal output
.github/workflows/run-checks.yaml (1)
11-20: Consider enhancing uv sync configuration.

While the basic setup works, you might want to add error handling and caching:
```diff
   - name: Create project environment
-    run: uv sync
+    run: |
+      uv sync || exit 1
+  - name: Cache dependencies
+    uses: actions/cache@v3
+    with:
+      path: .venv
+      key: ${{ runner.os }}-pip-${{ hashFiles('**/pyproject.toml') }}
+      restore-keys: |
+        ${{ runner.os }}-pip-
```
.github/workflows/pages-main.yaml (1)
5-8: Remove redundant permissions configuration.

The permissions are duplicated at both the workflow and job level. Since the job-level permissions override workflow-level permissions, you can remove one of these blocks.
```diff
 permissions:
   contents: read
   pages: write
   id-token: write
 jobs:
   build:
     environment:
       name: github-pages
       url: ${{ steps.deployment.outputs.page_url }}
-    permissions:
-      contents: read
-      pages: write
-      id-token: write
     runs-on: ubuntu-latest
```
Also applies to: 14-17
.github/workflows/pages-preview.yaml (3)
3-4: Consider specifying relevant paths for the workflow trigger

The workflow could be optimized to run only when documentation-related files are changed.
```diff
 name: Deploy static site generated by pdoc to GitHub Pages preview environment
-on: pull_request
+on:
+  pull_request:
+    paths:
+      - 'py_hamt/**/*.py'
+      - 'py_hamt/**/*.md'
+      - 'docs/**'
+      - 'README.md'
```
5-8: Remove redundant permissions configuration

The permissions are defined both at the workflow and job level. This is redundant and could lead to maintenance issues.
Keep only the job-level permissions and remove the workflow-level ones:
```diff
-permissions:
-  contents: read
-  pages: write
-  id-token: write
 jobs:
   build:
     environment:
```
Also applies to: 14-17
28-31: Add caching for Python dependencies

Consider adding caching to speed up the workflow and reduce GitHub Actions minutes usage.
Add this step before the uv installation:
```yaml
- uses: actions/cache@v4
  with:
    path: |
      .venv
      ~/.cache/uv
    key: ${{ runner.os }}-uv-${{ hashFiles('**/pyproject.toml') }}
    restore-keys: |
      ${{ runner.os }}-uv-
```
tests/test_extract_bits.py (2)
6-32: Consider restructuring tests for better organization and clarity.

While the test cases are comprehensive, consider these improvements:
- Split into separate test functions for different scenarios
- Add docstrings explaining test cases
- Consider additional edge cases
Here's a suggested refactor:
```diff
-def test_extract_bits():
+def test_extract_bits_single_byte():
+    """Test bit extraction from single byte inputs."""
     assert extract_bits(bytes([0b11111111]), 0, 5) == 0b11111
     assert extract_bits(bytes([0b10101010]), 0, 5) == 0b10101
     assert extract_bits(bytes([0b10000000]), 0, 5) == 0b10000
     assert extract_bits(bytes([0b00010000]), 0, 5) == 0b00010
+
+def test_extract_bits_multiple_bytes():
+    """Test bit extraction spanning multiple bytes."""
     assert extract_bits(bytes([0b10000100, 0b10010000]), 0, 9) == 0b100001001
     assert extract_bits(bytes([0b10101010, 0b10101010]), 0, 9) == 0b101010101
+
+def test_extract_bits_with_offset():
+    """Test bit extraction with different starting positions."""
     assert extract_bits(bytes([0b10000100, 0b10010000]), 1, 5) == 0b10010
     # ... remaining tests ...
```
Consider adding these edge cases:
- Extracting all bits (length equals input size * 8)
- Extracting zero bits
- Extracting bits at the byte boundary
34-37: Enhance error handling test coverage.

Consider adding more error handling test cases:
- Start position at the end of input
- Zero-length input
- Negative start position or length
Here's a suggested addition:
```python
def test_extract_bits_error_cases():
    """Test various error conditions for bit extraction."""
    # Current test
    with pytest.raises(
        IndexError, match="Arguments extract more bits than remain in the hash bits"
    ):
        extract_bits(bytes([0b1]), 20, 20)

    # Additional cases
    with pytest.raises(IndexError):
        extract_bits(bytes([0b1]), 8, 1)  # Start at end of input
    with pytest.raises(IndexError):
        extract_bits(bytes(), 0, 1)  # Empty input
    with pytest.raises(ValueError):
        extract_bits(bytes([0b1]), -1, 1)  # Negative start
```
tests/test_zarr_ipfs.py (1)
1-13: Consider adding type hints and version pinning.

For better IDE support and reproducibility, consider:
- Adding type hints for the CID import:
```python
from multiformats import CID as CIDType
```
- Pinning the multiformats version in your project dependencies.
README.md (5)
11-11: Fix grammar in differences statement.

There's a subject-verb agreement error in this sentence.
```diff
-Some key differences is that the py-hamt data structure is mutable and not asynchronous.
+Some key differences are that the py-hamt data structure is mutable and not asynchronous.
```
🧰 Tools
🪛 LanguageTool
[grammar] ~11-~11: The verb form ‘is’ does not seem to match the subject ‘differences’.
Context: ...ddressed systems. Some key differences is that the py-hamt data structure is muta...(SUBJECT_VERB_AGREEMENT_PLURAL)
16-21: Consider adding basic usage example.

While you've provided installation instructions and links to documentation/tests, a quick example snippet would help users get started immediately.
Consider adding a simple usage example like:
```python
from py_hamt import HAMT, DictStore

# Basic usage example
hamt = HAMT(store=DictStore())
hamt["key"] = b"value"
value = hamt["key"]
```
37-37: Fix grammar in conjunction usage.

There are two instances of incorrect usage of "both ... as well as".
```diff
-This will both testing with coverage, as well as the formatting and linting.
+This will run both testing with coverage and formatting and linting.
-We use pytest with 100% code coverage, and with test inputs that are both handwritten as well as generated by hypothesis.
+We use pytest with 100% code coverage, and with test inputs that are both handwritten and generated by hypothesis.
```
Also applies to: 39-39
🧰 Tools
🪛 LanguageTool
[grammar] ~37-~37: Probable usage error. Use “and” after ‘both’.
Context: ...` This will both testing with coverage, as well as the formatting and linting. See the com...(BOTH_AS_WELL_AS)
32-34: Add IPFS daemon prerequisites details.

While you mention the IPFS daemon requirement, it would be helpful to include:
- Minimum required version
- Configuration steps for custom endpoints
- Link to IPFS installation guide
Consider adding these details in a Prerequisites subsection.
67-67: Add comma for better readability.

For better readability in the final sentence:
```diff
-For more information please see the [uv documentation]
+For more information, please see the [uv documentation]
```
🧰 Tools
🪛 LanguageTool
[typographical] ~67-~67: Consider adding a comma here.
Context: ... add --dev pytest`. For more information please see the [uv documentation](https://docs...(PLEASE_COMMA)
py_hamt/store.py (3)
63-63: Raise a more specific exception when the ID is not found

When an ID is not found in the `DictStore`, a generic `Exception` is raised. Consider raising a more specific exception like `KeyError` to provide clearer information about the nature of the error.
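As a rough illustration (not one of the review's committable suggestions), the `load` method shown in the diff hunks later in this review could raise `KeyError` instead; the method body and the `self.store` attribute are taken from that hunk, the rest is assumed:

```python
# Sketch only: mirrors the DictStore.load shown elsewhere in this review,
# with KeyError substituted for the generic Exception.
def load(self, id: bytes) -> bytes:
    """Return the bytes stored under `id`, or raise KeyError if absent."""
    if id in self.store:
        return self.store[id]
    # KeyError signals a missing key more precisely than a bare Exception
    # and matches what dict-like containers conventionally raise.
    raise KeyError(f"ID {id!r} not found in store")
```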
35-38: Prefix private variables with an underscore

The variables `store` and `hash_alg` in `DictStore` are intended to be private, as indicated by the comments. Following Python naming conventions, consider prefixing these variable names with an underscore (e.g., `_store`, `_hash_alg`) to indicate that they are for internal use.
50-55: Add meaningful docstrings to methods

The methods `save_raw` and `save_dag_cbor` in `DictStore` have empty docstrings. Providing detailed docstrings will improve code readability and maintainability by explaining the purpose, parameters, and return values of these methods.

tests/test_hamt.py (5)
56-69: Remove commented-out code to improve maintainability

Lines 56-69 contain a commented-out function `find_cache_incosistency`. It is best practice to remove unused code to enhance readability and maintainability. If needed in the future, it can be retrieved from version control.

Apply this diff to remove the commented-out code:
```diff
-# Sometimes useful to lace throughout the test lines when debugging cache problems
-# def find_cache_incosistency(hamt: HAMT):
-#     for id in hamt.cache:
-#         cache_node = hamt.cache[id]
-#         store_node = Node.deserialize(hamt.store.load(id))
-#         if cache_node.data != store_node.data:
-#             print("*** Inconsistency found")
-#             print(f"Cache Node: {cache_node.data}")
-#             print(f"Store Node: {store_node.data}")
-#             print(f"HAMT cache: {hamt.cache}")
-#             print(f"HAMT id_to_time: {hamt.id_to_time}")
-#             print(f"HAMT last accessed time: {hamt.last_accessed_times}")
-#             raise Exception("Cache inconsistency")
```
121-121: Fix typo in comment

There's a typo in the comment on line 121: "this should not raies any exceptions". It should be "this should not raise any exceptions".
Apply this diff to correct the typo:
```diff
-# Make sure all ids actually exist in the store, this should not raies any exceptions
+# Make sure all ids actually exist in the store, this should not raise any exceptions
```
181-181: Avoid Yoda conditions in assertions

On line 181, the assertion is written as `assert b"bar" == hamt["foo"]`. It's more conventional and readable to place the variable first: `assert hamt["foo"] == b"bar"`.

Apply this diff to adjust the assertion:
-assert b"bar" == hamt["foo"] +assert hamt["foo"] == b"bar"🧰 Tools
🪛 Ruff (0.7.0)
181-181: Yoda condition detected
Rewrite as `hamt["foo"] == b"bar"` (SIM300)
186-186: Avoid Yoda conditions in assertions

On line 186, the assertion `assert b"something else" == hamt["foo"]` uses a Yoda condition. Consider rewriting it as `assert hamt["foo"] == b"something else"` for consistency and improved readability.

Apply this diff to adjust the assertion:
-assert b"something else" == hamt["foo"] +assert hamt["foo"] == b"something else"🧰 Tools
🪛 Ruff (0.7.0)
186-186: Yoda condition detected
Rewrite as `hamt["foo"] == b"something else"` (SIM300)
211-233: Remove commented-out code to clean up the test file

Lines 211-233 contain a commented-out function `test_and_print_perf`. Removing inactive code helps maintain a clean codebase. If performance testing is needed, consider implementing it in a separate benchmarking module.

Apply this diff to remove the commented-out code:
```diff
-# Run this with varying cache sizes to see the impact on performance of the cache when using IPFSStore()
-# def test_and_print_perf():
-#     import time
-#
-#     num_ops = 50
-#
-#     # usual cache size
-#     hamt = HAMT(store=IPFSStore())
-#     start_time = time.time()
-#     for key_int in range(num_ops):
-#         hamt[str(key_int)] = key_int
-#     end_time = time.time()
-#     op_avg_cache = (end_time - start_time) / 100
-#
-#     # no cache
-#     hamt = HAMT(store=IPFSStore(), max_cache_size_bytes=0)
-#     start_time = time.time()
-#     for key_int in range(num_ops):
-#         hamt[str(key_int)] = key_int
-#     end_time = time.time()
-#     op_avg_no_cache = (end_time - start_time) / 100
-#
-#     print(f"Improvement of {(1 - op_avg_cache / op_avg_no_cache) * 100:.2f}%")
```
py_hamt/hamt.py (1)
364-364: Raise specific exceptions instead of generic `Exception`

Using the generic `Exception` class makes it difficult to handle specific error conditions. Replace `Exception` with more specific exception classes like `RuntimeError`, `ValueError`, or define custom exception classes for clearer error handling.

For example:
- raise Exception("Cannot call set on a read only HAMT") + raise RuntimeError("Cannot call set on a read-only HAMT")This provides clearer context when the exception is raised and allows for more granular exception handling.
Also applies to: 461-461, 482-482
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (18)
- `.github/workflows/pages-main.yaml` (1 hunks)
- `.github/workflows/pages-preview.yaml` (1 hunks)
- `.github/workflows/run-checks.yaml` (1 hunks)
- `.gitignore` (2 hunks)
- `.python-version` (1 hunks)
- `README.md` (1 hunks)
- `examples/ipfs_store.py` (0 hunks)
- `py_hamt/__init__.py` (1 hunks)
- `py_hamt/bit_utils.py` (0 hunks)
- `py_hamt/hamt.py` (1 hunks)
- `py_hamt/store.py` (1 hunks)
- `pyproject.toml` (1 hunks)
- `run-checks.sh` (1 hunks)
- `setup.cfg` (0 hunks)
- `test/bit_utils_test.py` (0 hunks)
- `tests/test_extract_bits.py` (1 hunks)
- `tests/test_hamt.py` (1 hunks)
- `tests/test_zarr_ipfs.py` (1 hunks)
💤 Files with no reviewable changes (4)
- examples/ipfs_store.py
- py_hamt/bit_utils.py
- setup.cfg
- test/bit_utils_test.py
✅ Files skipped from review due to trivial changes (3)
- .gitignore
- .python-version
- `py_hamt/__init__.py`
🧰 Additional context used
🪛 LanguageTool
README.md
[grammar] ~11-~11: The verb form ‘is’ does not seem to match the subject ‘differences’.
Context: ...ddressed systems. Some key differences is that the py-hamt data structure is muta...
(SUBJECT_VERB_AGREEMENT_PLURAL)
[grammar] ~37-~37: Probable usage error. Use “and” after ‘both’.
Context: ...` This will both testing with coverage, as well as the formatting and linting. See the com...
(BOTH_AS_WELL_AS)
[grammar] ~39-~39: Probable usage error. Use “and” after ‘both’.
Context: ...h test inputs that are both handwritten as well as generated by hypothesis. This allows ...
(BOTH_AS_WELL_AS)
[typographical] ~67-~67: Consider adding a comma here.
Context: ... add --dev pytest`. For more information please see the [uv documentation](https://docs...
(PLEASE_COMMA)
🪛 Ruff (0.7.0)
tests/test_hamt.py
181-181: Yoda condition detected
Rewrite as hamt["foo"] == b"bar"
(SIM300)
186-186: Yoda condition detected
Rewrite as hamt["foo"] == b"something else"
(SIM300)
🔇 Additional comments (17)
run-checks.sh (2)
1-13: Verify CI integration behavior
Since this script is used in GitHub Actions, we should verify its integration with the workflow.
Let's check the GitHub Actions workflow configuration:
Make sure the workflow:
- Sets the correct permissions
- Uses appropriate Python/uv setup steps
- Handles script exit codes correctly
✅ Verification successful
Let me gather more information about the workflow configuration.
CI integration is properly configured
The workflow configuration is well-structured and handles the script execution correctly:
- Proper permissions are set for the workflow
- Uses the official `astral-sh/setup-uv@v3` action for uv setup
- Each command from `run-checks.sh` is executed as a separate step, ensuring proper exit code handling
- All dependencies (uv, IPFS) are properly installed before running the checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check workflow file that uses this script
fd -e yml -e yaml . .github/workflows/ -x cat {} \; | rg -A 10 'run-checks.sh'
Length of output: 303
Script:
#!/bin/bash
# Get the complete workflow file content
fd -e yml -e yaml . .github/workflows/ -x cat {}
# Also check if there are any other workflow files that might be relevant
fd . .github/workflows/
Length of output: 3774
9-13: Specify paths and verify ruff configuration
Add explicit paths and configuration verification for more controlled linting and formatting.
Let's verify the ruff configuration and suggest improvements:
# Check linting with ruff
-uv run ruff check
+uv run ruff check src tests
# Auto format with ruff
-uv run ruff format
+uv run ruff format src tests
✅ Verification successful
Add explicit paths to ruff commands
The repository has a clear structure with Python files in py_hamt/ and tests/ directories. These should be explicitly specified in the ruff commands for better control and clarity.
# Check linting with ruff
-uv run ruff check
+uv run ruff check py_hamt tests
# Auto format with ruff
-uv run ruff format
+uv run ruff format py_hamt tests
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if ruff configuration exists
if ! find . -maxdepth 1 -name "ruff.toml" -o -name "pyproject.toml" | grep -q .; then
echo "Warning: No ruff configuration found"
fi
# Check pyproject.toml for ruff settings if it exists
if [ -f "pyproject.toml" ]; then
echo "Ruff settings in pyproject.toml:"
rg -A 5 '^\[tool\.ruff\]' pyproject.toml
fi
Length of output: 218
Script:
#!/bin/bash
# Let's check the repository structure to understand what paths should be linted
fd -t d -d 1
# Check for Python files in the repository
fd -e py
# Check if there's any configuration in setup.cfg or setup.py
if [ -f "setup.cfg" ]; then
echo "=== setup.cfg content ==="
cat setup.cfg
fi
if [ -f "setup.py" ]; then
echo "=== setup.py content ==="
cat setup.py
fi
Length of output: 209
pyproject.toml (4)
15-17: LGTM! Build system configuration looks good
The use of hatchling as the build backend is a modern choice and follows best practices.
8-12: Verify dependency version constraints
Some version constraints seem unnecessarily strict. Let's verify the security and compatibility of these versions:
✅ Verification successful
Dependencies are using latest versions with no active security vulnerabilities
Based on the PyPI version checks and security advisory results:
- All packages are using their latest available versions:
- dag-cbor: 0.3.3
- msgspec: 0.18.6
- multiformats: 0.3.1.post4
- requests: 2.32.3
- sortedcontainers: 2.4.0
- No active security vulnerabilities exist for the specified versions
- The only security advisories found were for older versions of requests, all of which are patched in the current 2.32.3 version
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check PyPI for latest versions and potential security advisories
for pkg in "dag-cbor" "msgspec" "multiformats" "requests" "sortedcontainers"; do
echo "Checking $pkg..."
curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version'
# Check for security advisories
gh api graphql -f query="{
securityVulnerabilities(first: 5, ecosystem: PIP, package: \"$pkg\") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}"
done
Length of output: 4094
19-31: Consider reorganizing development dependencies
The current structure mixes different types of development tools with data science libraries. Consider splitting these into separate groups for better organization and selective installation:
- test: pytest, hypothesis
- profile: memray, snakeviz
- docs: pdoc
- lint: ruff
- data: pandas, numpy, xarray, zarr
Also, let's verify the usage of these dependencies:
6-6: Consider relaxing Python version requirement
The requirement >=3.12 might be too restrictive as Python 3.12 is very recent (released October 2023). Unless you specifically need Python 3.12 features, consider supporting Python 3.11 as well to increase adoption.
.github/workflows/run-checks.yaml (1)
6-10: LGTM!
The job configuration is clean and uses the latest Ubuntu runner, ensuring access to recent security patches and features.
.github/workflows/pages-main.yaml (3)
1-4: LGTM! Clear workflow name and appropriate trigger configuration.
The workflow is correctly configured to deploy documentation only on pushes to the main branch.
10-13: LGTM! Proper environment configuration for GitHub Pages.
The job correctly uses the github-pages environment and dynamically references the deployment URL.
20-35: 🛠️ Refactor suggestion
Consider enhancing the documentation build process.
While the basic setup works, here are some suggested improvements:
- Add caching for uv dependencies to speed up workflow runs
- Add error checking for the documentation build
- Consider additional pdoc flags for better documentation
- name: Install uv
uses: astral-sh/setup-uv@v3
+ - name: Cache dependencies
+ uses: actions/cache@v4
+ with:
+ path: .venv
+ key: ${{ runner.os }}-uv-${{ hashFiles('**/pyproject.toml') }}
- name: Create project environment
run: uv sync
- name: Build with pdoc
- run: uv run pdoc py_hamt -o ./_site
+ run: |
+ uv run pdoc py_hamt \
+ --docformat google \
+ --output-dir ./_site \
+ --show-source \
+ || exit 1
Let's verify the project's documentation requirements:
.github/workflows/pages-preview.yaml (3)
10-18: LGTM! Job configuration is well structured
The environment setup and runner configuration are appropriate for the task.
21-21: Excellent use of SHA-pinned action versions!
Using SHA-pinned versions for all actions is a security best practice that prevents supply chain attacks.
Also applies to: 27-27, 36-36, 39-39
32-33: Verify pdoc configuration and output
The pdoc command might benefit from additional configuration options for better documentation output.
Let's check if there's any pdoc configuration in the project:
Also, consider adding these flags to the pdoc command for better documentation:
- run: uv run pdoc py_hamt -o ./_site
+ run: uv run pdoc py_hamt -o ./_site --docformat google --show-sourcetests/test_extract_bits.py (2)
1-4: LGTM! Imports are clean and well-organized.
5-5: Verify completeness of test adaptation.
The comment indicates these tests were adapted from rvagg's IAMap. Let's verify we haven't missed any test cases.
✅ Verification successful
Let me search for more specific information about the original tests.
Let me check the actual test content and try to find the original test source.
Let me try one final search to locate the original tests in the referenced repository.
Let me fetch the content of the bit-utils-test.js file since it appears to be the relevant test file from the original repository.
Test adaptation is complete and accurate
After comparing the Python implementation's test cases with the original JavaScript implementation's bit-utils-test.js, I can confirm that our test_extract_bits.py has successfully adapted all the relevant test cases from the original mask function (which is the equivalent function in the JavaScript code). The test cases cover the same bit patterns, bit positions, and expected results, with an additional error case for invalid input.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for the original JavaScript test file to compare test coverage
rg -l "IAMap" .
Length of output: 56
Script:
#!/bin/bash
# Check README.md for reference to the original tests
cat README.md
# Also search for any mentions of rvagg or the original test implementation
rg -i "rvagg" .
Length of output: 3949
Script:
#!/bin/bash
# Get the content of our test file
cat tests/test_extract_bits.py
# Try to fetch the original test file URL from GitHub
gh api repos/rvagg/iamap/contents/test/bits-test.js || true
Length of output: 1753
Script:
#!/bin/bash
# Search for test files in the original repository using GitHub API
gh api repos/rvagg/iamap/contents/test || true
# Also check if there's any direct reference to the original test file in our codebase
rg -l "bits-test" .
Length of output: 5423
Script:
#!/bin/bash
# Fetch the content of bit-utils-test.js
curl -s https://raw.githubusercontent.com/rvagg/iamap/master/test/bit-utils-test.js
# Also check our implementation's bit utility functions
rg -A 5 "extract_bits" .
Length of output: 10892
tests/test_zarr_ipfs.py (1)
74-78: Verify IPFS integration configuration and error handling patterns.
Let's check how other tests handle IPFS integration and error cases.
✅ Verification successful
Let me gather more information about error handling and configuration patterns specifically in the IPFS-related test files we found.
IPFS integration is properly configured and tested
The verification shows that the IPFS integration is well-structured with appropriate test patterns:
- The test file includes a proper fixture (`random_zarr_dataset`) that handles setup and cleanup
- Error handling is implemented in the HAMT implementation (as seen in test_hamt.py)
- The test performs both write and read operations with proper assertions
- The test includes performance measurements and data integrity checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for IPFS-related test patterns and error handling
echo "Searching for IPFS test patterns..."
rg -l "IPFSStore" "tests/"
echo "Checking for error handling patterns in tests..."
ast-grep --pattern 'with pytest.raises($_) { $$$ }'
echo "Checking for IPFS configuration patterns..."
rg "ipfs_store = IPFSStore\(" -A 3
Length of output: 1140
Script:
#!/bin/bash
# Check content of IPFS-related test files
echo "Content of test_zarr_ipfs.py:"
cat tests/test_zarr_ipfs.py
echo -e "\nContent of test_hamt.py:"
cat tests/test_hamt.py
# Look for setup/teardown patterns
echo -e "\nChecking for test fixtures and setup patterns:"
rg "fixture|setup|teardown" tests/test_zarr_ipfs.py tests/test_hamt.py
Length of output: 10637
README.md (1)
1-5: LGTM! Well-structured project branding.
The centered dClimate logo with proper HTML attributes provides a professional appearance to the README.
```bash
#!/bin/bash
```
🛠️ Refactor suggestion
Add shell options for better error handling
Add shell options to make the script more robust by failing fast on errors.
```diff
 #!/bin/bash
+set -euo pipefail
```
This change will:
- `-e`: Exit immediately if a command exits with non-zero status
- `-u`: Error on undefined variables
- `-o pipefail`: Return value of a pipeline is the value of the last (rightmost) command to exit with non-zero status
```yaml
- name: Run pytest with coverage
  run: uv run pytest --cov=py_hamt tests/

- name: Check coverage
  run: uv run coverage report --fail-under=100 --show-missing

- name: Check linting with ruff
  run: uv run ruff check

- name: Check formatting with ruff
  run: uv run ruff format --check
```
🛠️ Refactor suggestion
Enhance test reporting with artifacts.
Consider adding test result and coverage report artifacts for better visibility:
- name: Run pytest with coverage
- run: uv run pytest --cov=py_hamt tests/
+ run: |
+ uv run pytest --cov=py_hamt tests/ --junitxml=pytest.xml || exit 1
+ uv run coverage xml
+
+ - name: Upload test results
+ uses: actions/upload-artifact@v3
+ if: always()
+ with:
+ name: test-results
+ path: |
+ pytest.xml
+ coverage.xml
tbh we can include these test artifacts and get the test coverage in codecov which we pay for anyway on some of our other repos
@Faolain, that's a great idea! Integrating Codecov for coverage reporting makes sense. Here's an updated workflow that uploads coverage reports to Codecov:
- name: Run pytest with coverage
run: uv run pytest --cov=py_hamt tests/
- name: Generate coverage XML
run: uv run coverage xml
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
token: ${{ secrets.CODECOV_TOKEN }} # Ensure this secret is set in your repository settings
files: coverage.xml
fail_ci_if_error: true
- name: Check coverage
run: uv run coverage report --fail-under=100 --show-missing
- name: Check linting with ruff
run: uv run ruff check
- name: Check formatting with ruff
run: uv run ruff format --check
Make sure to add the `CODECOV_TOKEN` secret to your repository settings for the upload to work correctly.
```yaml
name: Run checks
run-name: Triggered on push from ${{ github.actor }} to branch/tag ${{ github.ref_name }}
on: push
# Should be the same as py-hamt/run-checks.sh
```
💡 Codebase verification
Branch filtering suggestion needs reconsideration, but script synchronization needs attention
The workflow steps mostly align with run-checks.sh, but there are two notable differences:
- The workflow has additional environment setup steps (checkout, uv, IPFS)
- The workflow's ruff format command includes the `--check` flag while the script runs formatting directly
Since this is a Python package that needs IPFS for testing, running on all pushes is actually appropriate as we want to ensure all changes maintain quality. However, the comment about script synchronization is valid and needs addressing to prevent divergence.
Suggested updates:
- Either remove the comment about script synchronization
- Or update `run-checks.sh` to match the workflow's `ruff format --check` command
🔗 Analysis chain
Consider adding branch filters and verifying script synchronization.
The workflow runs on all push events, which might be inefficient. Consider limiting it to specific branches:
-on: push
+on:
+ push:
+ branches:
+ - main
+ - 'feature/**'
+ paths-ignore:
+ - '**.md'
+ - 'docs/**'Also, let's verify the synchronization with the referenced script.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify if run-checks.sh exists and compare its contents with the workflow
# Check if the script exists and show its contents
fd -t f "run-checks.sh" -x cat {}
Length of output: 272
Script:
#!/bin/bash
# Let's verify if the workflow steps match the script commands
# Get the workflow file content to compare with the script
cat .github/workflows/run-checks.yaml
Length of output: 1007
```yaml
- name: Install IPFS
  uses: ibnesayeed/setup-ipfs@master
  with:
    ipfs_version: "0.32.1"
    run_daemon: true
  id: ipfs_setup
```
💡 Codebase verification
Security: Action pinning and IPFS version concerns require attention
- The action should be pinned to a specific commit hash instead of master:
- uses: ibnesayeed/setup-ipfs@master
+ uses: ibnesayeed/setup-ipfs@c779340c110024feeb46579fef211c89332caf85
- Version 0.32.1 appears to be invalid as:
- It's not available in IPFS distributions
- The action's README shows much lower version examples (0.4.x, 0.5.x, 0.6.x)
- The version number (0.32.1) seems unusually high
Please verify and update to a valid IPFS version by consulting the official IPFS documentation or repository.
🔗 Analysis chain
Security: Avoid using @master tag for third-party actions.
Using the @master tag for actions can be risky as it may introduce breaking changes or security vulnerabilities. Consider:
- Pin to a specific commit hash:
- uses: ibnesayeed/setup-ipfs@master
+ uses: ibnesayeed/setup-ipfs@5c55235aae4d6bb5caee4d1e52e0db11cd766b5e
- Verify if IPFS v0.32.1 is the latest stable version.
Let me try another approach to get the IPFS version information.
Let me try one final verification to check IPFS versions using a different approach.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check latest IPFS version and action's last commit
# Check latest IPFS release
curl -s https://api.github.com/repos/ipfs/kubo/releases/latest | jq -r .tag_name
# Check action's last commit
gh api repos/ibnesayeed/setup-ipfs/commits/master --jq '.sha'
Length of output: 190
Script:
#!/bin/bash
# Description: Check IPFS releases using different API endpoints and methods
# Try listing recent releases to find latest version
curl -s https://api.github.com/repos/ipfs/kubo/releases?per_page=5 | jq -r '.[].tag_name'
# Check tags as alternative
curl -s https://api.github.com/repos/ipfs/kubo/tags?per_page=5 | jq -r '.[].name'
# Also verify the action's README for recommended usage
curl -s https://raw.githubusercontent.com/ibnesayeed/setup-ipfs/master/README.md
Length of output: 3089
Script:
#!/bin/bash
# Check IPFS versions using dist.ipfs.io
curl -s https://dist.ipfs.io/go-ipfs/versions | sort -V | tail -n 5
# Verify if version 0.32.1 exists in their distribution
curl -s -I "https://dist.ipfs.io/go-ipfs/v0.32.1/go-ipfs_v0.32.1_linux-amd64.tar.gz" | head -n 1
Length of output: 288
```yaml
- name: Setup Pages
  uses: actions/configure-pages@983d7736d9b0ae728b81ab479565c72886d7745b # v5
- name: Install uv
  uses: astral-sh/setup-uv@v3
```
💡 Codebase verification
Pin setup-uv action to a specific commit SHA for enhanced security
The action should be pinned to the v3 tag's SHA: 8d55fbecc275b1c35dbe060458839f8d30439ccf. The action appears to be well-maintained, with recent commits and proper security practices:
- Uses Node.js 20 runtime
- Has minimal required permissions
- Properly handles GitHub token
- Has active maintenance (last commit 25 Nov 2024)
Update the workflow line to:
uses: astral-sh/setup-uv@8d55fbecc275b1c35dbe060458839f8d30439ccf
🔗 Analysis chain
Verify the security of the setup-uv action.
While the action is from a reputable source (Astral), it's recommended to pin it to a specific SHA for better security.
Let me gather more information about the specific commit SHA for v3 tag.
Let me try a simpler approach to check the action's content.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the latest release and security history of setup-uv action
echo "Checking setup-uv action releases..."
gh api repos/astral-sh/setup-uv/releases --jq '.[0].tag_name'
echo "Checking for security advisories..."
gh api repos/astral-sh/setup-uv/security-advisories --jq '.[] | {published: .published_at, severity: .severity, summary: .summary}'
Length of output: 368
Script:
#!/bin/bash
# Get the commit SHA for v3 tag and check its commit history
echo "Checking v3 tag details..."
gh api repos/astral-sh/setup-uv/git/refs/tags/v3 --jq '.object.sha'
echo "Checking commit message..."
gh api repos/astral-sh/setup-uv/git/commits/$(gh api repos/astral-sh/setup-uv/git/refs/tags/v3 --jq -r '.object.sha') --jq '.message'
# Check the action.yml file content at v3
echo "Checking action.yml content..."
gh api repos/astral-sh/setup-uv/contents/action.yml?ref=v3 --jq '.content' | base64 -d
Length of output: 768
Script:
#!/bin/bash
# Get the raw content of action.yml at v3
echo "Checking action.yml content at v3..."
curl -sSL "https://raw.githubusercontent.com/astral-sh/setup-uv/v3/action.yml"
echo -e "\nChecking latest commits..."
gh api repos/astral-sh/setup-uv/commits --jq '.[0:3] | .[] | {sha: .sha, date: .commit.author.date, message: .commit.message}'
Length of output: 2396
```python
def clear_cache_lru(self):
    """@private"""
    # The last condition is to check if anything is even in the cache. This is important for times when the cache size is so small that the size of the empty objects is larger than the cache limit
    while (
        getsizeof(self.cache)
        + getsizeof(self.id_to_time)
        + getsizeof(self.last_accessed_times)
    ) > self.max_cache_size_bytes and len(self.last_accessed_times) > 0:
        least_recently_used: tuple[float, IPLDKind] = self.last_accessed_times.pop(
            0
        )
        least_recently_used_id = least_recently_used[1]
        del self.cache[least_recently_used_id]
        del self.id_to_time[least_recently_used_id]
```
🛠️ Refactor suggestion
Reevaluate the use of getsizeof() for cache size estimation
In the clear_cache_lru method, getsizeof() is used to estimate the memory usage of the cache and related structures. However, getsizeof() may not accurately reflect the actual memory footprint of container objects, potentially causing the cache to exceed its intended size limit.
Consider alternative approaches:
- Track the size manually: Maintain a running total of the sizes of individual nodes as they are added and removed from the cache.
- Limit by entry count: Limit the cache based on the number of entries rather than estimated memory size.
- Use specialized libraries: Utilize libraries designed for cache size management that can more accurately track memory usage.
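As a loose sketch of the second alternative above (bounding the cache by entry count instead of `getsizeof` estimates), written independently of py_hamt's actual cache attributes:

```python
from collections import OrderedDict


class EntryCountLRUCache:
    """Minimal LRU cache bounded by number of entries rather than estimated bytes."""

    def __init__(self, max_entries: int = 1024):
        self.max_entries = max_entries
        self._entries: OrderedDict[bytes, object] = OrderedDict()

    def get(self, node_id: bytes) -> object:
        # Pop and re-insert so the accessed entry becomes most recently used.
        value = self._entries.pop(node_id)  # raises KeyError on a miss
        self._entries[node_id] = value
        return value

    def put(self, node_id: bytes, node: object) -> None:
        # Insert or refresh the entry, then evict least recently used entries.
        self._entries.pop(node_id, None)
        self._entries[node_id] = node
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the LRU entry
```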
(diff hunk from py_hamt/hamt.py: the old node-saving helper's docstring replaced by the new `__len__` lock handling and `key_count` read)
🛠️ Refactor suggestion
Utilize context manager for lock in __len__ method
In the __len__ method, the lock is manually acquired and released. To ensure the lock is always released, even if an exception occurs, use a context manager.
Suggested change:
def __len__(self) -> int:
"""Total number of keys"""
if not self.read_only:
- self.lock.acquire(blocking=True)
- key_count = self.key_count
- if not self.read_only:
- self.lock.release()
+ with self.lock:
+ key_count = self.key_count
+ else:
+ key_count = self.key_count
return key_count
This modification enhances thread safety by ensuring the lock is released properly.
(diff hunk from py_hamt/hamt.py: the new `__getitem__` hash-walk lookup and lock-release code under review)
Refactor lock handling in __getitem__ to prevent potential deadlocks
The __getitem__ method acquires the lock but may not release it if an exception occurs before self.lock.release(). To prevent potential deadlocks, use a context manager or try-finally block to ensure the lock is always released.
Apply this change:
def __getitem__(self, key: str) -> IPLDKind:
if not self.read_only:
- self.lock.acquire(blocking=True)
+ with self.lock:
+ # Existing code indented under the context manager
+ raw_hash = self.hash_fn(key.encode())
+ node_id_stack: list[Link] = []
+ node_id_stack.append(self.root_node_id)
+ # ... rest of the method code ...
- # At the end
- if not self.read_only:
- self.lock.release()
This ensures that the lock is properly released even if exceptions occur.
Committable suggestion skipped: line range outside the PR's diff.
(diff hunk from py_hamt/hamt.py: the new `__delitem__` bucket deletion, reserialization, and lock-release code under review)
Use context managers to guarantee lock release in __delitem__
In the __delitem__ method, the lock is acquired but may not be released if an exception occurs before self.lock.release(). This can cause deadlocks in concurrent environments. Refactor the lock handling to ensure the lock is always released.
Suggested change:
def __delitem__(self, key: str):
if self.read_only:
raise Exception("Cannot call delete on a read only HAMT")
- if not self.read_only:
- self.lock.acquire(blocking=True)
+ if not self.read_only:
+ with self.lock:
+ # Existing code indented under the context manager
+ raw_hash = self.hash_fn(key.encode())
+ node_stack: list[tuple[Link, Node]] = []
+ root_node = self.read_node(self.root_node_id)
+ node_stack.append((self.root_node_id, root_node))
+ # ... rest of the method code ...
- # At the end
- if not self.read_only:
- self.lock.release()
This modification ensures thread safety by guaranteeing the lock is released even if an exception is raised.
Committable suggestion skipped: line range outside the PR's diff.
(diff hunk from py_hamt/hamt.py: the new `__setitem__` insertion and bucket-overflow logic, plus removal of the old invariant-checking and serialization helpers)
Ensure locks are properly released using context managers or try-finally blocks
In the __setitem__ method, the lock is acquired with self.lock.acquire(blocking=True), but if an exception occurs before self.lock.release(), the lock may not be released, potentially leading to deadlocks. To guarantee the lock is always released, use a context manager (with self.lock:) or a try-finally block.
Apply this diff to refactor the lock handling:
def __setitem__(self, key_to_insert: str, val_to_insert: IPLDKind):
if self.read_only:
raise Exception("Cannot call set on a read only HAMT")
- if not self.read_only:
- self.lock.acquire(blocking=True)
+ if not self.read_only:
+ with self.lock:
+ # Existing code indented under the context manager
+ val_ptr = self.store.save_raw(dag_cbor.encode(val_to_insert))
+ node_stack: list[tuple[Link, Node]] = []
+ root_node = self.read_node(self.root_node_id)
+ node_stack.append((self.root_node_id, root_node))
+ # ... rest of the method code ...
- # At the end
- if not self.read_only:
- self.lock.release()
This ensures that the lock is properly released even if an exception occurs within the block.
Committable suggestion skipped: line range outside the PR's diff.
My review is in progress 📖 - I will have feedback for you in a few minutes!
Review Summary by Korbit AI
Code Execution Comments
- Improve script robustness with error handling, tool checks, and input validation in DictStore and IPFSStore classes.
- Address potential deadlocks by refining the locking mechanism in the HAMT class.
Code Health Comments
- Introduce variables or command-line arguments for easier script maintenance and use specific or custom exceptions for better error handling.
- Add a `__version__` attribute in `__init__.py` for version identification.
Files scanned
| File Path | Reviewed |
|---|---|
| `py_hamt/__init__.py` | ✅ |
| run-checks.sh | ✅ |
| py_hamt/store.py | ✅ |
| py_hamt/hamt.py | ✅ |
```python
def __deepcopy__(self, memo) -> "HAMT":
    if not self.read_only:
        self.lock.acquire(blocking=True)
```
Potential deadlock in HAMT locking mechanism.
Tell me more
The current locking mechanism in the HAMT class may lead to deadlocks. For example, in the __deepcopy__ method, if the HAMT is not read-only, it acquires the lock but doesn't release it if an exception occurs. Consider using a RLock or implementing a try-finally block to ensure the lock is always released.
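For example, the acquire/release in the snippet above could be wrapped so the lock is released even when copying fails. This is a sketch written against that hunk, and `_copy_impl` is a hypothetical placeholder for the actual copy logic, not a real py_hamt method:

```python
def __deepcopy__(self, memo) -> "HAMT":
    # Sketch: take the lock only for writable HAMTs, and guarantee release
    # with try/finally even if the copy below raises.
    if not self.read_only:
        self.lock.acquire(blocking=True)
    try:
        return self._copy_impl(memo)  # hypothetical helper holding the real copy logic
    finally:
        if not self.read_only:
            self.lock.release()
```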
```bash
#!/bin/bash

# Run pytest with coverage
uv run pytest --cov=py_hamt tests/

# Check coverage
uv run coverage report --fail-under=100 --show-missing

# Check linting with ruff
uv run ruff check

# Auto format with ruff
uv run ruff format
```
Lack of error handling and tool checks in the script.
Tell me more
The script currently doesn't include any error handling or checks for required tools. This could lead to silent failures or unclear error messages. Consider adding checks for the required tools (uv, pytest, coverage, ruff) at the beginning of the script. Also, implement error handling for each command, exiting with a non-zero status code and providing an informative error message if any command fails. This will make the script more robust and easier to debug if issues occur.
```bash
# Check coverage
uv run coverage report --fail-under=100 --show-missing
```
Use variables and command-line arguments for script configuration.
Tell me more
The script currently uses hardcoded values for the coverage threshold (100%) and the directories to check (tests/ for pytest, py_hamt for coverage). To make the script more flexible and easier to maintain, consider introducing variables for these values. For example:
COVERAGE_THRESHOLD=100
TEST_DIR=tests/
SRC_DIR=py_hamt
# Run pytest with coverage
uv run pytest --cov=$SRC_DIR $TEST_DIR
# Check coverage
uv run coverage report --fail-under=$COVERAGE_THRESHOLD --show-missing
Even better, you could accept these as command-line arguments, allowing users to easily adjust these values when running the script.
```python
if id in self.store:
    return self.store[id]
else:
    raise Exception("ID not found in store")
```
Use more specific exception types.
Tell me more
Consider using a more specific exception type instead of the generic Exception. This will make it easier for users of your code to catch and handle specific error conditions. For example, you could create a custom exception like IDNotFoundError or use a built-in exception like KeyError that better represents the nature of the error.
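A minimal sketch of the custom-exception option mentioned above; the class name `IDNotFoundError` comes from the comment, everything else is illustrative:

```python
class IDNotFoundError(KeyError):
    """Raised when a content identifier is not present in a Store."""

    def __init__(self, id: bytes):
        super().__init__(f"ID {id!r} not found in store")
        self.id = id


# Callers could then handle the missing-node case explicitly:
# try:
#     data = store.load(node_id)
# except IDNotFoundError:
#     ...  # e.g. refetch, rebuild, or surface a clearer error
```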
```python
from .hamt import HAMT, blake3_hashfn
from .store import Store, DictStore, IPFSStore

__all__ = [
    "HAMT",
    "blake3_hashfn",
    "Store",
    "DictStore",
    "IPFSStore",
]
```
Add a version attribute to the `__init__.py` file.
Tell me more
Consider adding a __version__ attribute to the __init__.py file. This is a common practice in Python libraries and helps users and developers quickly identify the version of the library they are working with. You could add something like __version__ = '0.1.0' at the beginning of the file.
```python
def save_raw(self, data: bytes) -> bytes:
    """"""
    return self.save(data)

def save_dag_cbor(self, data: bytes) -> bytes:
    """"""
    return self.save(data)

# Ignore the type error since bytes is in the IPLDKind type
def load(self, id: bytes) -> bytes:  # type: ignore
    """"""
    if id in self.store:
        return self.store[id]
    else:
        raise Exception("ID not found in store")
```
Add input validation in save and load methods.
Tell me more
Consider adding input validation in the save and load methods of both DictStore and IPFSStore classes. For instance, in the save methods, check if the input data is not empty, and in the load methods, verify that the id is of the correct type (bytes for DictStore and CID for IPFSStore). This will help prevent unexpected behavior and improve the robustness of the code.
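Roughly, the kind of checks being described might look like this for `DictStore` (a sketch assuming the `self.store` mapping visible in the hunk above; the exact messages and the elided hashing step are placeholders, not the library's code):

```python
def save(self, data: bytes) -> bytes:
    """Store `data` and return its ID, rejecting empty or non-bytes input."""
    if not isinstance(data, bytes) or len(data) == 0:
        raise ValueError("save expects non-empty bytes")
    ...  # hash and insert into self.store as in the existing implementation


def load(self, id: bytes) -> bytes:
    """Load the bytes stored under `id`, validating the ID type first."""
    if not isinstance(id, bytes):
        raise TypeError(f"DictStore IDs are bytes, got {type(id).__name__}")
    if id in self.store:
        return self.store[id]
    raise KeyError(f"ID {id!r} not found in store")
```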
@CodiumAI-Agent /review
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
@CodiumAI-Agent /improve
PR Code Suggestions ✨

No code suggestions found for the PR.
This PR was created for testing and aggregating AI code reviewer suggestions, but I've finished reviewing them for PR #7, so we can close this PR now.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
- `.gitignore` to ensure proper exclusion of specified files and directories.

Documentation
Tests
Description by Korbit AI
What change is being made?
Add version 2.0 of the `py-hamt` library, including its implementation, configuration files for GitHub workflows, comprehensive documentation, and extensive testing.

Why are these changes being made?
We are releasing a new major version of the `py-hamt` library with significant enhancements and improvements. The new implementation offers a flexible, mutable hash array mapped trie (HAMT) compatible with IPLDKind data types, making it robust for abstract storage backends such as IPFS. Automating deployment workflows and comprehensive CI testing is intended to facilitate ongoing development and integration processes. This update aligns with our goal to streamline data management over distributed systems, and the extensive test coverage ensures system reliability and robustness.