Skip to content

Conversation

@TheGreatAlgo
Copy link
Collaborator

@TheGreatAlgo TheGreatAlgo commented Feb 6, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced core storage functionality to support customizable value transformations during data write and retrieval.
    • Integrated secure encryption/decryption support for dataset handling.

  • Tests
    • Added comprehensive tests to validate secure handling, upload, and retrieval flows, ensuring proper error management when encryption prerequisites aren’t met.

  • Chores
    • Updated the development configuration with an additional cryptographic dependency.
    • Expanded the public API to include new encryption transformer functionality.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 6, 2025

Walkthrough

This update enhances the HAMT class by introducing two new callable attributes, transformer_encode and transformer_decode, which allow for custom transformations of values during insertion and retrieval. The __init__ method is updated to accept these parameters, and the relevant methods are modified accordingly. Additionally, a new dependency for cryptography, pycryptodome, is added to the project configuration. A test file is created to validate encryption and decryption of ZARR datasets using the ChaCha20-Poly1305 cipher, ensuring data integrity and proper error handling.

Changes

File(s) Change Summary
py_hamt/hamt.py Added transformer_encode and transformer_decode attributes; updated __init__, __setitem__, and __getitem__ methods to support value transformation during data storage and retrieval.
pyproject.toml Added a new dependency "pycryptodome" in the dependencies list under [project]; updated version specifications for "xarray" and "zarr" to exact versions.
tests/test_zarr_encryption_ipfs.py Created a new test file for ZARR encryption/decryption using ChaCha20-Poly1305; defined a random dataset generator fixture and expanded tests for uploading and retrieving encrypted datasets with proper error handling.
py_hamt/init.py Added create_zarr_encryption_transformers function to the import statements and __all__ list to expand the public API.
py_hamt/zarr_encryption_transformer.py Introduced a new file implementing encryption and decryption functionality for ZARR data; added TransformerFN type and create_zarr_encryption_transformers function for generating encoding and decoding transformers.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client Code
    participant H as HAMT.__new__
    participant TH as TransformedHamt
    participant N as HAMT
    C->>H: Instantiate HAMT (with transformer?)
    alt Transformer provided
        H-->>TH: Return TransformedHamt instance
    else
        H-->>N: Return HAMT instance
    end
Loading
sequenceDiagram
    participant T as Test Case
    participant S as HAMT Store
    participant E as Encryption Transformer
    T->>S: __setitem__(key, value)
    S->>E: Apply transformer_encode on value
    E-->>S: Return encoded bytes
    T->>S: __getitem__(key)
    S->>E: Apply transformer_decode on retrieved bytes
    E-->>S: Return original value
    T-->>S: Validate retrieved data
Loading

Possibly related PRs

  • Bug fixes #13: The changes in the main PR enhance the HAMT class with new transformation functions, while the retrieved PR fixes the HAMT.ids() function to return value pointers correctly; both modifications are directly related to the HAMT class in py_hamt/hamt.py.

Poem

I'm a rabbit, happy as can be,
Transforming data with agility and glee.
New methods and tests hop into view,
Encryption and decoding, all fresh and new.
With carrots of code, I celebrate this spree!
🥕🐇 Let our changes bounce with reliability!

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/test_zarr_encryption_ipfs.py (2)

155-159: Consider storing “crypto” attribute in the Zarr metadata.
Currently, you’re adding the encryption attribute directly to the xarray objects in memory, but the TODO note suggests that you aim to have this stored in Zarr metadata. You may explore xarray’s “encoding” dictionary when calling .to_zarr() or upon creating variables, which can persist custom metadata to Zarr.


234-235: Use more specific exception handling and valid attribute access.
Static analysis highlights two concerns:
• Calling pytest.raises(Exception) is considered bad practice. Prefer a more specific exception type.
• Accessing loaded_failure["temp"].values purely to trigger an exception can be seen as “useless” by linter rules. It may help to capture that value or clarify its usage.

Below is a suggested fix, specifying a more targeted exception and assigning the attribute to a dummy variable:

-    with pytest.raises(Exception):
-        loaded_failure["temp"].values
+    with pytest.raises(ValueError):  # or any relevant specific exception
+        _ = loaded_failure["temp"].values
🧰 Tools
🪛 Ruff (0.8.2)

234-234: pytest.raises(Exception) should be considered evil

(B017)


235-235: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

py_hamt/hamt.py (2)

126-135: Remove or clarify unused code.
This block of code has been commented out. If it’s no longer needed, consider removing it entirely to keep the codebase clear and maintainable. If you plan to revive it, please add a TODO comment stating the next steps.


562-565: Consider a separate decoding utility for clarity.
The inline decryption logic is functional, but factoring the decode operation into a helper method may improve testability and readability, especially if additional decoding steps become necessary in the future.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 424c005 and c903225.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • py_hamt/hamt.py (7 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

234-234: pytest.raises(Exception) should be considered evil

(B017)


235-235: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto-mod
tests/test_zarr_encryption_ipfs.py

[error] 174-174: TypeError: Unsupported type for store_like: 'HAMT'


[warning] 197-197: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.

🔇 Additional comments (4)
py_hamt/hamt.py (3)

217-220: Ensure transformers meet concurrency expectations.
The new fields (transformer_encode and transformer_decode) enable custom transformations on inserted/retrieved values. If these transformers hold any state, be cautious of potential threading issues since the HAMT can be accessed concurrently.


268-269: Validate default transformer parameters.
Providing None by default for transformers is clear, but ensure the upstream usage checks for and gracefully handles the case when no transformation is required.


363-367: Confirm the usage of raw storing.
You’re encoding values with dag-cbor, transforming them, then calling store.save_raw(). This is valid if your store supports raw bytes, but verify that you truly want to skip the DAG-CBOR layer for retrieval.

pyproject.toml (1)

33-37: Dev-dependency on pycryptodome looks good.
This addition aligns well with your new encryption functionality. No issues found.

transformer_encode=encode,
transformer_decode=decode,
)
test_ds.to_zarr(store=hamt1, mode="w")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Address “Unsupported type for store_like” pipeline error.
According to the pipeline failure logs, xarray raises a “TypeError: Unsupported type for store_like: 'HAMT'” on this line. You may need to provide a store implementation that xarray recognizes, or adapt the HAMT object to fulfill xarray’s requirements for a Zarr store-like.

🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto-mod

[error] 174-174: TypeError: Unsupported type for store_like: 'HAMT'

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
py_hamt/zarr_encryption_transformer.py (2)

28-42: Docstring mismatch with function signature.
The docstring references parameters like “key_hash” that don’t exist in the actual function signature. This can cause confusion for future maintainers.

You might consider updating the docstring to reflect the correct parameters and usage for “create_zarr_encryption_transformers”.


58-69: Consider large-object processing.
You are loading the entire byte array into memory (via io.BytesIO), which may severely impact memory usage when encrypting large datasets. Chunk-based processing or streaming encryption could be more efficient.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c903225 and 2c058d4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • py_hamt/__init__.py (1 hunks)
  • py_hamt/hamt.py (6 hunks)
  • py_hamt/zarr_encryption_transformer.py (1 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • py_hamt/init.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

153-153: pytest.raises(Exception) should be considered evil

(B017)


154-154: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto-mod
tests/test_zarr_encryption_ipfs.py

[error] 93-93: TypeError: Unsupported type for store_like: 'HAMT'

🔇 Additional comments (5)
py_hamt/hamt.py (2)

205-208: LGTM! Well-documented transformer attributes.

The transformer function attributes are well-typed and clearly documented.


205-553: Verify transformer function signatures in the constructor.

It's important to validate that the provided transformer functions match the expected signature to catch potential issues early.

Apply this diff to add signature validation:

     def __init__(
         self,
         store: Store,
         hash_fn: Callable[[bytes], bytes] = blake3_hashfn,
         read_only: bool = False,
         root_node_id: IPLDKind = None,
         max_cache_size_bytes=10_000_000,  # default to 10 megabytes
         transformer_encode: None | Callable[[str, bytes], bytes] = None,
         transformer_decode: None | Callable[[str, bytes], bytes] = None,
     ):
+        def validate_transformer(func: Callable | None, name: str) -> None:
+            if func is not None:
+                import inspect
+                sig = inspect.signature(func)
+                if list(sig.parameters.keys()) != ['key', 'value']:
+                    raise ValueError(
+                        f"{name} must have exactly two parameters: 'key' and 'value'"
+                    )
+                if sig.return_annotation != bytes and sig.return_annotation != inspect.Signature.empty:
+                    raise ValueError(
+                        f"{name} must return bytes"
+                    )
+
+        validate_transformer(transformer_encode, 'transformer_encode')
+        validate_transformer(transformer_decode, 'transformer_decode')
         self.store = store
         self.hash_fn = hash_fn
py_hamt/zarr_encryption_transformer.py (2)

48-53: Validate the key path logic.
The helper function _should_transform_key only checks the first path segment against encrypted_vars. This may lead to edge cases if the data is organized in deeper paths or multiple grouping levels.

Would you like to run a repository-wide check to see whether keys can have multiple nested segments? If needed, we can refine the logic to match the relevant portion of the path. Here is a script that searches for Zarr path usage:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Recursively search for usage of Zarr paths and print relevant context.
rg -A 5 '[.]to_zarr|open_zarr'

Length of output: 4765


Key path logic verified against repository usage.
The repository-wide search (using the rg command) did not reveal any cases where Zarr keys contain multiple nested segments. The tests (in files like tests/test_zarr_ipfs.py and tests/test_zarr_encryption_ipfs.py) consistently use top-level keys (e.g. "temp", "precip"), which aligns with the current _should_transform_key implementation.

  • The function correctly excludes paths based on the file name and index pattern before checking the first segment.
  • No evidence was found in the repository to suggest that keys with deeper nesting are used or required.

If future requirements arise that involve multi-level nested keys, the logic may need to be revisited.


64-78: Verify 24-byte nonce support.
ChaCha20-Poly1305 typically uses a 96-bit (12-byte) nonce. You appear to be doing XChaCha20, which can use a 192-bit (24-byte) nonce. Some libraries or versions might not fully support XChaCha20.

✅ Verification successful

🌐 Web query:

Is PyCryptodome’s ChaCha20_Poly1305 implementation documented to support 24-byte nonces (XChaCha20)?

💡 Result:

Yes, PyCryptodome's `ChaCha20_Poly1305` implementation explicitly supports 24-byte nonces for XChaCha20-Poly1305 according to its documentation[1][7]. The library distinguishes between three variants based on nonce length:

| Nonce Length | Variant               | Standardization Status    |
|--------------|-----------------------|---------------------------|
| 8 bytes      | Original ChaCha20     | Bernstein's specification |
| 12 bytes     | RFC7539 TLS version   | Standardized              |
| **24 bytes** | **XChaCha20-Poly1305**| Draft specification[1][7] |

The 24-byte nonce implementation follows the extended nonce construction described in security literature[3][5], despite the IETF standardization draft having expired[3][8]. When using this variant:
- Requires passing a **24-byte nonce** to `new()`[1][4]
- Uses HChaCha20 internally to derive subkeys from extended nonces[3]
- Maintains compatibility with other implementations like libsodium[3][8]

The documentation explicitly states this support in both API references[1][7] and code examples[4], though notes it's still technically a draft specification[1][7]. Security analysis suggests this construction reduces nonce reuse risks compared to shorter variants[3][8].

Citations:

- 1: https://pycryptodome.readthedocs.io/en/latest/src/cipher/chacha20_poly1305.html
- 2: https://docs.rs/chacha20poly1305/latest/chacha20poly1305/
- 3: https://www.scottbrady.io/c-sharp/xchacha20-poly1305-dotnet
- 4: https://binref.github.io/units/crypto/cipher/chacha.html
- 5: https://en.wikipedia.org/wiki/ChaCha20-Poly1305
- 6: https://github.com/Legrandin/pycryptodome/issues/811
- 7: https://github.com/Legrandin/pycryptodome/blob/master/Doc/src/cipher/chacha20_poly1305.rst
- 8: https://github.com/openbao/openbao/issues/235
- 9: https://binref.github.io
- 10: https://groups.google.com/g/pycryptodome/c/4OSIWNzjq_w

Verification: 24-byte nonce support confirmed.
The PyCryptodome documentation confirms that its ChaCha20_Poly1305 implementation explicitly supports a 24-byte nonce (XChaCha20-Poly1305). The snippet is correct and uses the 24-byte nonce appropriately.

tests/test_zarr_encryption_ipfs.py (1)

75-78: Retain consistent indexing of encryption attributes.
You manually set "crypto" attributes on "temp" and "precip" variables, which might need consistent handling across all relevant variables. If future expansions add more variables, you may risk missing encryption.

Would you like to check if any other variables or coordinates require encryption attributes? This script can help identify all variable names and see if we missed any:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
rg -A 5 'xr.Dataset|xr.DataArray'

Length of output: 1667


I'm going to dig deeper into the repository to check if any other variables or coordinates receive encryption attributes—or if there’s any inconsistency in setting them apart from "precip" and "temp".


🏁 Script executed:

#!/bin/bash
# Search for all occurrences where the "crypto" attribute is set in the codebase.
rg '\.attrs\["crypto"\]' -n .

Length of output: 233


Encryption attribute handling is consistent.
After checking the codebase, only "precip" and "temp" in tests/test_zarr_encryption_ipfs.py have been assigned crypto attributes. No other variables or coordinates set these attributes, confirming that the manual handling is currently applied consistently.

Comment on lines +256 to 258
transformer_encode: None | Callable[[str, bytes], bytes] = None,
transformer_decode: None | Callable[[str, bytes], bytes] = None,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Update __deepcopy__ to include transformer functions.

The __deepcopy__ method (lines 281-295) needs to be updated to include the transformer functions when creating a copy of the HAMT instance.

Apply this diff to fix the issue:

 def __deepcopy__(self, memo) -> "HAMT":
     if not self.read_only:
         self.lock.acquire(blocking=True)

     copy_hamt = HAMT(
         store=self.store,
         hash_fn=self.hash_fn,
         read_only=self.read_only,
         root_node_id=self.root_node_id,
+        transformer_encode=self.transformer_encode,
+        transformer_decode=self.transformer_decode,
     )

     if not self.read_only:
         self.lock.release()

     return copy_hamt

Also applies to: 269-271

Comment on lines 55 to 57
def __init__(self, encryption_key: str):
self.encryption_key = encryption_key

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove unused constructor.
The init method is unused and likely dead code. This could create confusion, as the transcription suggests it might be a leftover.

Apply this diff to remove the unused constructor:

- def __init__(self, encryption_key: str):
-     self.encryption_key = encryption_key
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __init__(self, encryption_key: str):
self.encryption_key = encryption_key

Comment on lines +153 to +154
with pytest.raises(Exception):
loaded_failure["temp"].values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use a more specific exception type.
Using pytest.raises(Exception) is overly broad, and static analysis flags the attribute access as “useless.” A better practice is to specify the exact exception type expected when accessing encrypted data without the transformer.

Example fix:

- with pytest.raises(Exception):
-     loaded_failure["temp"].values
+ with pytest.raises(ValueError):
+     _ = loaded_failure["temp"].values
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with pytest.raises(Exception):
loaded_failure["temp"].values
with pytest.raises(ValueError):
_ = loaded_failure["temp"].values
🧰 Tools
🪛 Ruff (0.8.2)

153-153: pytest.raises(Exception) should be considered evil

(B017)


154-154: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

@codecov-commenter
Copy link

codecov-commenter commented Feb 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (424c005) to head (46eb32c).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #26   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         4    +1     
  Lines          391       436   +45     
=========================================
+ Hits           391       436   +45     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
py_hamt/zarr_encryption_transformer.py (1)

54-56: ⚠️ Potential issue

Remove unused constructor.

The __init__ method is unused and likely dead code. This could create confusion, as it suggests it might be a leftover.

Apply this diff to remove the unused constructor:

- def __init__(self, encryption_key: str):
-     self.encryption_key = encryption_key
tests/test_zarr_encryption_ipfs.py (1)

160-161: 🛠️ Refactor suggestion

Use a more specific exception type.

Using pytest.raises(Exception) is overly broad, and the attribute access is flagged as useless.

- with pytest.raises(Exception):
-     loaded_failure["temp"].values
+ with pytest.raises(ValueError):
+     _ = loaded_failure["temp"].values
🧰 Tools
🪛 Ruff (0.8.2)

160-160: pytest.raises(Exception) should be considered evil

(B017)


161-161: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🧹 Nitpick comments (2)
py_hamt/zarr_encryption_transformer.py (1)

78-82: Remove commented out code.

The commented out code block is unused and should be removed to maintain code cleanliness.

-        # if out is not None:
-        #     outbuf = io.BytesIO(plaintext)
-        #     outbuf.readinto(out)
-        #     return out
tests/test_zarr_encryption_ipfs.py (1)

74-77: Document the workaround for zarray configuration.

The TODO comment indicates a limitation with xarray. Consider documenting this workaround in the code comments or README to help future maintainers.

-    # TODO: THIS SHOULD BE DONE IN THE ZARRAY but it doesn't appear like xarray allows this
+    # Note: Ideally, this configuration should be in the zarray metadata.
+    # However, xarray currently doesn't provide an API to modify zarray metadata directly.
+    # This is a workaround by setting the crypto attribute on the variables themselves.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c058d4 and 2c2d7b3.

📒 Files selected for processing (4)
  • py_hamt/hamt.py (5 hunks)
  • py_hamt/zarr_encryption_transformer.py (1 hunks)
  • pyproject.toml (2 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pyproject.toml
  • py_hamt/hamt.py
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

160-160: pytest.raises(Exception) should be considered evil

(B017)


161-161: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto-mod
py_hamt/zarr_encryption_transformer.py

[error] 51-55: Coverage failure: total of 99 is less than fail-under=100. Lines 51 and 55 are missing coverage.

🔇 Additional comments (4)
py_hamt/zarr_encryption_transformer.py (3)

1-22: LGTM!

The imports are appropriate, and the exclusion list is well-documented with clear explanations for each excluded file type.


24-30: LGTM!

The type definition and function signature are well-structured with appropriate type hints.


57-68: LGTM!

The encryption implementation is secure, using appropriate nonce size and authenticated encryption with ChaCha20-Poly1305.

tests/test_zarr_encryption_ipfs.py (1)

18-68: LGTM!

The fixture is well-structured with appropriate test data generation and cleanup. The docstring clearly explains the return values.

Comment on lines 47 to 52
def _should_transform_key(key: str) -> bool:
if Path(key).name in encryption_exclude_files:
return False
if INDEX_KEY_PATTERN.fullmatch(key):
return False
return key.split("/")[0] in encrypted_vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Add test coverage for _should_transform_key function.

The pipeline indicates missing coverage for the _should_transform_key function. Add test cases to cover both the index key pattern match and variable name check.


🏁 Script executed:

#!/bin/bash
# Description: Verify test coverage for _should_transform_key function

# Test: Search for existing test cases
rg -A 5 "_should_transform_key"

Length of output: 1487


Action: Add dedicated test cases for _should_transform_key function

The search confirms that _should_transform_key is only referenced within the transformer file and isn’t directly covered by any unit tests. To ensure proper behavior, please add tests that validate:

  • The function returns False when Path(key).name is in encryption_exclude_files.
  • The function returns False when INDEX_KEY_PATTERN.fullmatch(key) is true.
  • The function returns the expected boolean from key.split("/")[0] in encrypted_vars for all other cases.

Comment on lines +31 to +41
"""An encryption filter for ZARR data.
This class is serialized and stored along with the Zarr it is used with, so instead
of storing the encryption key, we store the hash of the encryption key, so it can be
looked up in the key registry at run time as needed.
Parameters
----------
key_hash: str
The hex digest of an encryption key. A key may be generated using
:func:`generate_encryption_key`. The hex digest is obtained by registering the
key using :func:`register_encryption_key`.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Update docstring to match function parameters.

The docstring refers to a key_hash parameter, but the function accepts encryption_key. Update the docstring to reflect the actual parameters.

-    Parameters
-    ----------
-    key_hash: str
-        The hex digest of an encryption key. A key may be generated using
-        :func:`generate_encryption_key`. The hex digest is obtained by registering the
-        key using :func:`register_encryption_key`.
+    Parameters
+    ----------
+    encryption_key: bytes
+        The encryption key used for ChaCha20-Poly1305 cipher.
+    encrypted_vars: [str]
+        List of variable names to be encrypted.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""An encryption filter for ZARR data.
This class is serialized and stored along with the Zarr it is used with, so instead
of storing the encryption key, we store the hash of the encryption key, so it can be
looked up in the key registry at run time as needed.
Parameters
----------
key_hash: str
The hex digest of an encryption key. A key may be generated using
:func:`generate_encryption_key`. The hex digest is obtained by registering the
key using :func:`register_encryption_key`.
"""
"""An encryption filter for ZARR data.
This class is serialized and stored along with the Zarr it is used with, so instead
of storing the encryption key, we store the hash of the encryption key, so it can be
looked up in the key registry at run time as needed.
Parameters
----------
encryption_key: bytes
The encryption key used for ChaCha20-Poly1305 cipher.
encrypted_vars: [str]
List of variable names to be encrypted.
"""

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c2d7b3 and a771dad.

📒 Files selected for processing (1)
  • py_hamt/zarr_encryption_transformer.py (1 hunks)
🔇 Additional comments (4)
py_hamt/zarr_encryption_transformer.py (4)

1-21: LGTM! Well-structured imports and constants.

The imports are properly organized, and the encryption exclusion list is well-documented with clear explanations for each excluded file.


23-23: LGTM! Clear type definition.

The TransformerFN type alias clearly defines the expected signature for transformer functions.


30-40: Update docstring to match function parameters.

The docstring still refers to incorrect parameters and needs to be updated.


45-48: Add test coverage for _should_transform_key function.

The function still lacks test coverage as previously noted.

Comment on lines +62 to +70
def decode(key: str, val: bytes):
if not _should_transform_key(key):
return val

nonce, tag, ciphertext = val[:24], val[24:40], val[40:]
cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
cipher.update(header)
plaintext = cipher.decrypt_and_verify(ciphertext, tag)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for malformed input in decode function.

The decode function should handle potential errors when processing malformed input:

  1. Input shorter than expected (< 40 bytes)
  2. Invalid tag during verification
 def decode(key: str, val: bytes):
     if not _should_transform_key(key):
         return val
 
+    if len(val) < 40:
+        raise ValueError("Input too short: encrypted data must be at least 40 bytes")
+
     nonce, tag, ciphertext = val[:24], val[24:40], val[40:]
     cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
     cipher.update(header)
-    plaintext = cipher.decrypt_and_verify(ciphertext, tag)
+    try:
+        plaintext = cipher.decrypt_and_verify(ciphertext, tag)
+    except ValueError as e:
+        raise ValueError(f"Decryption failed: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def decode(key: str, val: bytes):
if not _should_transform_key(key):
return val
nonce, tag, ciphertext = val[:24], val[24:40], val[40:]
cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
cipher.update(header)
plaintext = cipher.decrypt_and_verify(ciphertext, tag)
def decode(key: str, val: bytes):
if not _should_transform_key(key):
return val
if len(val) < 40:
raise ValueError("Input too short: encrypted data must be at least 40 bytes")
nonce, tag, ciphertext = val[:24], val[24:40], val[40:]
cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
cipher.update(header)
try:
plaintext = cipher.decrypt_and_verify(ciphertext, tag)
except ValueError as e:
raise ValueError(f"Decryption failed: {str(e)}")

Comment on lines +71 to +75
# if out is not None:
# outbuf = io.BytesIO(plaintext)
# outbuf.readinto(out)
# return out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove commented out code.

The commented out code should be removed as it's not being used and could create confusion.

-        # if out is not None:
-        #     outbuf = io.BytesIO(plaintext)
-        #     outbuf.readinto(out)
-        #     return out
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# if out is not None:
# outbuf = io.BytesIO(plaintext)
# outbuf.readinto(out)
# return out

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/test_zarr_encryption_ipfs.py (1)

160-161: ⚠️ Potential issue

Use a more specific exception type.

The static analysis tool correctly flags the use of a broad Exception type. This was previously noted in past review comments and should be addressed.

Apply this fix:

-with pytest.raises(Exception):
-    loaded_failure["temp"].values
+with pytest.raises(ValueError, match="Encryption transformer required"):
+    _ = loaded_failure["temp"].values
🧰 Tools
🪛 Ruff (0.8.2)

160-160: pytest.raises(Exception) should be considered evil

(B017)


161-161: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🧹 Nitpick comments (3)
tests/test_zarr_encryption_ipfs.py (3)

18-68: Add type hints and data validation.

The fixture is well-structured, but consider these improvements:

  1. Add type hints to the function parameters and variables.
  2. Add validation for realistic data ranges (e.g., temperature and precipitation values).

Example improvements:

 @pytest.fixture
-def random_zarr_dataset():
+def random_zarr_dataset() -> tuple[str, xr.Dataset]:
     """Creates a random xarray Dataset and saves it to a temporary zarr store.
 
     Returns:
         tuple: (dataset_path, expected_data)
             - dataset_path: Path to the zarr store
             - expected_data: The original xarray Dataset for comparison
     """
     # Create temporary directory for zarr store
     temp_dir = tempfile.mkdtemp()
     zarr_path = os.path.join(temp_dir, "test.zarr")
 
     # Coordinates of the random data
     times = pd.date_range("2024-01-01", periods=100)
     lats = np.linspace(-90, 90, 18)
     lons = np.linspace(-180, 180, 36)
 
     # Create random variables with different shapes
-    temp = np.random.randn(len(times), len(lats), len(lons))
+    # Generate temperatures between -50°C and 50°C
+    temp = np.random.uniform(-50, 50, size=(len(times), len(lats), len(lons)))
-    precip = np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons)))
+    # Generate non-negative precipitation values with realistic distribution
+    precip = np.clip(np.random.gamma(2, 0.5, size=(len(times), len(lats), len(lons))), 0, None)

70-70: Add docstring to test function.

Add a descriptive docstring explaining the test's purpose, parameters, and expected behavior.

Example:

 def test_upload_then_read(random_zarr_dataset: tuple[str, xr.Dataset]):
+    """Test ZARR dataset upload and read with and without encryption.
+
+    Tests the following scenarios:
+    1. Upload and read with encryption
+    2. Upload and read without encryption
+    3. Attempt to read encrypted data without transformer
+
+    Args:
+        random_zarr_dataset (tuple[str, xr.Dataset]): Fixture providing zarr path and dataset
+    """

81-106: Improve performance measurement structure.

The performance measurement code could be more structured and reusable.

Consider extracting the timing logic into a helper function:

+def measure_operation(operation_name: str, func: callable, *args, **kwargs) -> float:
+    """Measure the execution time of an operation.
+
+    Args:
+        operation_name (str): Name of the operation being measured
+        func (callable): Function to measure
+        *args, **kwargs: Arguments to pass to the function
+
+    Returns:
+        float: Execution time in seconds
+    """
+    start_time = time.time()
+    result = func(*args, **kwargs)
+    total_time = time.time() - start_time
+    print(f"{operation_name} took {total_time:.2f} seconds")
+    return result
+
 start_time = time.time()
 encryption_key = get_random_bytes(32)
 
 encode, decode = create_zarr_encryption_transformers(
     encryption_key, ["temp", "precip"]
 )
-hamt1 = HAMT(
-    store=IPFSStore(pin_on_add=False),
-    transformer_encode=encode,
-    transformer_decode=decode,
-)
-test_ds.to_zarr(store=hamt1, mode="w")
-
-end_time = time.time()
-total_time = end_time - start_time
-print(f"Adding with encryption took {total_time:.2f} seconds")
+hamt1 = measure_operation(
+    "Adding with encryption",
+    lambda: HAMT(
+        store=IPFSStore(pin_on_add=False),
+        transformer_encode=encode,
+        transformer_decode=decode,
+    )
+)
+test_ds.to_zarr(store=hamt1, mode="w")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a771dad and 46eb32c.

📒 Files selected for processing (1)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

160-160: pytest.raises(Exception) should be considered evil

(B017)


161-161: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

🔇 Additional comments (2)
tests/test_zarr_encryption_ipfs.py (2)

1-16: LGTM!

The imports are well-organized and appropriate for testing ZARR data encryption/decryption functionality.


74-77: Address TODO comment about crypto attributes.

The TODO comment indicates a limitation in xarray's functionality. Consider:

  1. Opening an issue to track this limitation.
  2. Adding a comment explaining why this workaround is necessary.

Would you like me to help create an issue to track this limitation or suggest alternative approaches?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants