
Conversation

@TheGreatAlgo (Collaborator) commented Feb 5, 2025

Things to do:

It currently encrypts everything, including the metadata. It works for me because I know what decryption to expect, but for anyone else the store is unreadable. I need to add filtering so that only the data keys are encrypted and the metadata is left alone.

I also need to force-add metadata describing the encryption used, so that zarr.js on the frontend can interpret it.
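A rough sketch of the filtering idea (the helper name and the metadata key list are illustrative, not the final implementation):

# Sketch only: decide which keys get encrypted. Zarr metadata must stay
# plaintext so readers such as zarr.js can still parse the store layout.
ZARR_METADATA_KEYS = (".zarray", ".zgroup", ".zattrs", ".zmetadata")

def should_encrypt(key: str) -> bool:
    """Encrypt chunk data keys only, never Zarr metadata documents."""
    return not any(key.endswith(meta) for meta in ZARR_METADATA_KEYS)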

Summary by CodeRabbit

  • New Features

    • Enhanced data handling with optional transformation capabilities for flexible item storage and retrieval.
    • Strengthened secure data processes with improved encryption and decryption mechanisms.
  • Chores

    • Updated dependency configurations to support robust cryptographic functionality.
  • Tests

    • Added comprehensive tests to verify secure data processing and proper error handling during encryption workflows.

@coderabbitai bot (Contributor) commented Feb 5, 2025

Walkthrough

The pull request refactors the HAMT implementation by introducing a factory method in the HAMT class within py_hamt/hamt.py that returns either a HAMTOriginal or a TransformedHamt instance based on a transformer argument. In addition, encryption testing for ZARR data using the ChaCha20-Poly1305 cipher is added in tests/test_zarr_encryption_ipfs.py, and a new tool configuration is included in pyproject.toml to add "pycryptodome" as a development dependency.

Changes

  • py_hamt/hamt.py: Refactored the HAMT class with a factory pattern; introduced HAMTOriginal and TransformedHamt with overrides for __setitem__ and __getitem__ (usage sketch after this list).
  • pyproject.toml: Added a [tool.uv] section with dev-dependencies = ["pycryptodome"].
  • tests/test_zarr_encryption_ipfs.py: Added a comprehensive test framework for ZARR encryption using ChaCha20-Poly1305; implemented EncryptionFilter, a dataset fixture, and a test for IPFS upload/read.
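Assuming the factory behaves as described in the walkthrough, usage would look roughly like the sketch below; the import path and the stub transformer are illustrative, not code from the PR.

from py_hamt import HAMT, IPFSStore  # import path assumed from the package layout

class IdentityTransformer:
    """Stub transformer: passes bytes through unchanged (illustration only)."""
    codec_id = "identity"

    def encode(self, buf: bytes) -> bytes:
        return buf

    def decode(self, buf: bytes, out=None) -> bytes:
        return buf

plain_hamt = HAMT(store=IPFSStore())                                       # -> HAMTOriginal
wrapped_hamt = HAMT(store=IPFSStore(), transformer=IdentityTransformer())  # -> TransformedHamt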

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HAMT
    participant HAMTOriginal
    participant TransformedHamt
    Client->>HAMT: Instantiate HAMT(store, transformer?)
    alt Transformer Provided
        HAMT-->>TransformedHamt: Return instance with transformation logic
    else
        HAMT-->>HAMTOriginal: Return standard instance
    end
sequenceDiagram
    participant Test as Testing Framework
    participant EncryptionFilter
    participant IPFS
    Test->>EncryptionFilter: Upload dataset (with encryption)
    EncryptionFilter->>IPFS: Send encrypted data
    Test->>EncryptionFilter: Download dataset (expect decryption)
    EncryptionFilter->>IPFS: Request encrypted data
    alt Correct Key
        EncryptionFilter-->>Test: Return decrypted dataset
    else Incorrect Key
        EncryptionFilter-->>Test: Raise decryption exception
    end

Possibly related PRs

  • Version 2 #7: The changes in the main PR, which involve substantial modifications to the HAMT class and its related classes in py_hamt/hamt.py, are related to the changes in the retrieved PR, which also includes significant updates to the HAMT implementation and introduces a new node structure in the same file.
  • Reimplement lru with python dictionary insertion order #16: The changes in the main PR and the retrieved PR are related as both modify the HAMT class in the py_hamt/hamt.py file, specifically altering its methods and internal logic.
  • Bug fixes #13: The changes in the main PR, particularly the modifications to the HAMT class and its methods, are directly related to the bug fix in the retrieved PR that addresses the HAMT.ids() function, which was also part of the HAMT class in the same file.

Suggested labels

bug

Suggested reviewers

  • Faolain

Poem

I’m a rabbit in the code field, hopping near and far,
With factories and transformations—oh what a star!
Encryption tests and IPFS rides,
Bringing magic where data hides.
A nibble of bugs, a joyful hop,
Celebrating changes that never stop!
🐇✨

Tip

🌐 Web search-backed reviews and chat
  • We have enabled web search-based reviews and chat for all users. This feature allows CodeRabbit to access the latest documentation and information on the web.
  • You can disable this feature by setting web_search: false in the knowledge_base settings.
  • Please share any feedback in the Discord discussion.
✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
py_hamt/hamt.py (1)

615-638: Add type hints and improve error handling in TransformedHamt.

The implementation needs improvements:

  1. Add type hints for the transformer parameter.
  2. Add validation for the transformer object.

Apply this diff to improve the implementation:

-    def __init__(self, *args, transformer=None, **kwargs):
+    def __init__(self, *args, transformer: object = None, **kwargs):
+        if transformer is not None and not (hasattr(transformer, 'encode') and hasattr(transformer, 'decode')):
+            raise ValueError("transformer must implement encode and decode methods")
         super().__init__(*args, **kwargs)
         self.transformer = transformer
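If stricter typing is desired, the transformer interface could also be expressed as a structural type; this is a hedged sketch, not code from the PR:

from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class Transformer(Protocol):
    """Structural type for objects usable as a HAMT value transformer."""
    codec_id: str

    def encode(self, buf: bytes) -> bytes: ...
    def decode(self, buf: bytes, out=None) -> bytes: ...

def validate_transformer(transformer: Optional["Transformer"]) -> None:
    """Reject objects that do not satisfy the transformer interface."""
    if transformer is not None and not isinstance(transformer, Transformer):
        raise ValueError("transformer must provide encode/decode methods and a codec_id")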
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 424c005 and 6a0e1cf.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • py_hamt/__init__.py (1 hunks)
  • py_hamt/hamt.py (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

178-178: pytest.raises(Exception) should be considered evil

(B017)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto
tests/test_zarr_encryption_ipfs.py

[error] 128-128: TypeError: Unsupported type for store_like: 'TransformedHamt'


[warning] 197-197: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.

🔇 Additional comments (4)
py_hamt/__init__.py (1)

1-11: LGTM!

The changes correctly expose the new HamtFactory class through the module's public interface.

tests/test_zarr_encryption_ipfs.py (2)

63-113: LGTM!

The fixture correctly generates random test data and handles cleanup.


115-192: Fix TypeError with TransformedHamt and improve test coverage.

The test has the following issues:

  1. Pipeline failure indicates TypeError with TransformedHamt.
  2. The test uses a bare Exception in pytest.raises.
  3. Missing test cases for concurrent access and large datasets.

Apply this diff to fix the issues:

-    test_ds.to_zarr(store=hamt1, mode="w")
+    test_ds.to_zarr(store=dict(hamt1), mode="w")

-    with pytest.raises(Exception):
+    with pytest.raises((ValueError, TypeError), match="(Decryption failed|Invalid key)"):
         xr.open_zarr(store=hamt1_read_bad)
✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check if the error is reproducible with the latest version

# Test: Search for similar issues or related code
rg -A 5 'TypeError.*TransformedHamt'
rg -A 5 'store_like.*TransformedHamt'

Length of output: 78


I'll generate a new shell script to search the repository for any usage or definition of TransformedHamt to gather more context on the TypeError being reported. This should help verify if the proposed changes are addressing an existing issue.


🏁 Script executed:

#!/bin/bash
# Search for the occurrence of 'TransformedHamt' in the repository
rg -n "TransformedHamt" .

Length of output: 306


Fix TypeError with TransformedHamt usage and update exception handling in tests.

The evidence confirms that the codebase instantiates and uses a TransformedHamt (see py_hamt/hamt.py), and the integration test still calls test_ds.to_zarr(store=hamt1, mode="w"). This likely causes a TypeError when the transformed HAMT isn’t correctly converted to a dict-like object for the Zarr write operation. Additionally, the bare exception in the test does not pinpoint the decryption or key error.

  • Call test_ds.to_zarr with store=dict(hamt1) to prevent the TypeError.
  • Replace pytest.raises(Exception) with a more specific tuple—e.g., using (ValueError, TypeError) and an appropriate match string like "(Decryption failed|Invalid key)".
  • Consider adding tests for concurrent access and handling very large datasets.

These changes address both the pipeline failure and improve test precision.

🧰 Tools
🪛 Ruff (0.8.2)

178-178: pytest.raises(Exception) should be considered evil

(B017)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[error] 128-128: TypeError: Unsupported type for store_like: 'TransformedHamt'

py_hamt/hamt.py (1)

126-135: LGTM!

The factory pattern is well-implemented and provides a clean interface for creating HAMT instances.

Comment on lines +20 to +60
class EncryptionFilter:
    """An encryption filter for ZARR data.
    This class is serialized and stored along with the Zarr it is used with, so instead
    of storing the encryption key, we store the hash of the encryption key, so it can be
    looked up in the key registry at run time as needed.
    Parameters
    ----------
    key_hash: str
        The hex digest of an encryption key. A key may be generated using
        :func:`generate_encryption_key`. The hex digest is obtained by registering the
        key using :func:`register_encryption_key`.
    """

    codec_id = "xchacha20poly1305"
    header = b"dClimate-Zarr"

    def __init__(self, encryption_key: str):
        self.encryption_key = encryption_key

    def encode(self, buf):
        raw = io.BytesIO()
        raw.write(buf)
        nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())

        return nonce + tag + ciphertext

    def decode(self, buf, out=None):
        nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        plaintext = cipher.decrypt_and_verify(ciphertext, tag)

        if out is not None:
            outbuf = io.BytesIO(plaintext)
            outbuf.readinto(out)
            return out

        return plaintext

🛠️ Refactor suggestion

Enhance security by adding key validation and error handling.

The EncryptionFilter class needs improvements:

  1. Add validation for the encryption key length (should be 32 bytes for ChaCha20-Poly1305).
  2. Add error handling for encryption/decryption failures.

Apply this diff to improve the implementation:

 def __init__(self, encryption_key: str):
+    if not isinstance(encryption_key, bytes) or len(encryption_key) != 32:
+        raise ValueError("encryption_key must be 32 bytes")
     self.encryption_key = encryption_key

 def encode(self, buf):
+    try:
         raw = io.BytesIO()
         raw.write(buf)
         nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
         cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
         cipher.update(self.header)
         ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())
         return nonce + tag + ciphertext
+    except Exception as e:
+        raise ValueError(f"Encryption failed: {str(e)}")

 def decode(self, buf, out=None):
+    try:
         nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
         cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
         cipher.update(self.header)
         plaintext = cipher.decrypt_and_verify(ciphertext, tag)
+    except ValueError:
+        raise ValueError("Decryption failed: Invalid key or corrupted data")
+    except Exception as e:
+        raise ValueError(f"Decryption failed: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Original:

class EncryptionFilter:
    """An encryption filter for ZARR data.
    This class is serialized and stored along with the Zarr it is used with, so instead
    of storing the encryption key, we store the hash of the encryption key, so it can be
    looked up in the key registry at run time as needed.
    Parameters
    ----------
    key_hash: str
        The hex digest of an encryption key. A key may be generated using
        :func:`generate_encryption_key`. The hex digest is obtained by registering the
        key using :func:`register_encryption_key`.
    """

    codec_id = "xchacha20poly1305"
    header = b"dClimate-Zarr"

    def __init__(self, encryption_key: str):
        self.encryption_key = encryption_key

    def encode(self, buf):
        raw = io.BytesIO()
        raw.write(buf)
        nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())
        return nonce + tag + ciphertext

    def decode(self, buf, out=None):
        nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        plaintext = cipher.decrypt_and_verify(ciphertext, tag)
        if out is not None:
            outbuf = io.BytesIO(plaintext)
            outbuf.readinto(out)
            return out
        return plaintext

Suggested:

class EncryptionFilter:
    """An encryption filter for ZARR data.
    This class is serialized and stored along with the Zarr it is used with, so instead
    of storing the encryption key, we store the hash of the encryption key, so it can be
    looked up in the key registry at run time as needed.
    Parameters
    ----------
    key_hash: str
        The hex digest of an encryption key. A key may be generated using
        :func:`generate_encryption_key`. The hex digest is obtained by registering the
        key using :func:`register_encryption_key`.
    """

    codec_id = "xchacha20poly1305"
    header = b"dClimate-Zarr"

    def __init__(self, encryption_key: str):
        if not isinstance(encryption_key, bytes) or len(encryption_key) != 32:
            raise ValueError("encryption_key must be 32 bytes")
        self.encryption_key = encryption_key

    def encode(self, buf):
        try:
            raw = io.BytesIO()
            raw.write(buf)
            nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
            cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
            cipher.update(self.header)
            ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())
            return nonce + tag + ciphertext
        except Exception as e:
            raise ValueError(f"Encryption failed: {str(e)}")

    def decode(self, buf, out=None):
        try:
            nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
            cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
            cipher.update(self.header)
            plaintext = cipher.decrypt_and_verify(ciphertext, tag)
        except ValueError:
            raise ValueError("Decryption failed: Invalid key or corrupted data")
        except Exception as e:
            raise ValueError(f"Decryption failed: {str(e)}")
        if out is not None:
            outbuf = io.BytesIO(plaintext)
            outbuf.readinto(out)
            return out
        return plaintext
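As a sanity check, the nonce (24 bytes) + tag (16 bytes) + ciphertext layout can be exercised directly with PyCryptodome; this round-trip sketch is illustrative and not part of the PR's test suite:

from Crypto.Cipher import ChaCha20_Poly1305
from Crypto.Random import get_random_bytes

key = get_random_bytes(32)        # XChaCha20-Poly1305 requires a 32-byte key
header = b"dClimate-Zarr"
plaintext = b"chunk bytes"

# Encode: a 24-byte nonce selects the XChaCha variant; the tag is 16 bytes.
nonce = get_random_bytes(24)
cipher = ChaCha20_Poly1305.new(key=key, nonce=nonce)
cipher.update(header)
ciphertext, tag = cipher.encrypt_and_digest(plaintext)
blob = nonce + tag + ciphertext

# Decode: split the blob apart and verify the tag before trusting the plaintext.
nonce, tag, ciphertext = blob[:24], blob[24:40], blob[40:]
cipher = ChaCha20_Poly1305.new(key=key, nonce=nonce)
cipher.update(header)
assert cipher.decrypt_and_verify(ciphertext, tag) == plaintext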

Comment on lines +33 to +36
[tool.uv]
dev-dependencies = [
"pycryptodome",
]

⚠️ Potential issue

Move pycryptodome to runtime dependencies.

Since the encryption functionality is part of the core feature set and not just for development/testing, pycryptodome should be moved to the runtime dependencies section.

Apply this diff to fix the dependency:

-[tool.uv]
-dev-dependencies = [
-    "pycryptodome",
-]
+[project]
+dependencies = [
     "dag-cbor>=0.3.3",
     "msgspec>=0.18.6",
     "multiformats[full]>=0.3.1.post4",
     "requests>=2.32.3",
+    "pycryptodome>=3.20.0",
 ]

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
py_hamt/hamt.py (1)

615-638: Document special keys and transformer interface requirements.

The implementation would benefit from:

  1. Documentation explaining why .zarray and .attributes keys are special.
  2. Type hints or interface definition for the transformer object.
 class TransformedHamt(HAMTOriginal):
     """
     A wrapper around the HAMT class that applies a transformation function
     when setting or getting items, except for specific keys like ".attributes".
+
+    Special keys:
+    - ".zarray": Skipped during encoding to preserve Zarr metadata
+    - ".attributes": Skipped during decoding to preserve Zarr attributes
+
+    The transformer object must implement:
+    - encode(value: bytes) -> bytes
+    - decode(value: bytes) -> bytes
     """

     def __init__(self, *args, transformer=None, **kwargs):
+        """
+        Args:
+            transformer: An object implementing encode/decode methods for value transformation
+        """
         super().__init__(*args, **kwargs)
         self.transformer = transformer
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a0e1cf and 1bf4095.

📒 Files selected for processing (2)
  • py_hamt/hamt.py (2 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/test_zarr_encryption_ipfs.py

178-178: pytest.raises(Exception) should be considered evil

(B017)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto
tests/test_zarr_encryption_ipfs.py

[error] 128-128: TypeError: Unsupported type for store_like: 'TransformedHamt'


[warning] UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.

🔇 Additional comments (4)
tests/test_zarr_encryption_ipfs.py (3)

20-60: Enhance security by adding key validation and error handling.

The EncryptionFilter class needs improvements:

  1. Add validation for the encryption key length (should be 32 bytes for ChaCha20-Poly1305).
  2. Add error handling for encryption/decryption failures.
🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[warning] UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.


63-113: LGTM! Well-structured test fixture.

The fixture correctly generates test data and handles cleanup using the yield pattern.

🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[warning] UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.


128-129: ⚠️ Potential issue

Fix TypeError with TransformedHamt store.

The pipeline failure indicates that xarray doesn't recognize TransformedHamt as a valid store type. This needs to be addressed to ensure compatibility.

Let's check xarray's documentation for supported store types:

🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[error] 128-128: TypeError: Unsupported type for store_like: 'TransformedHamt'


[warning] UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.

py_hamt/hamt.py (1)

126-135: LGTM! Clean factory pattern implementation.

The factory method clearly handles the instantiation logic for both HAMT variants.

Comment on lines 178 to 179
with pytest.raises(Exception):
    xr.open_zarr(store=hamt1_read_bad)

🛠️ Refactor suggestion

Use specific exception types instead of broad Exception catch.

Replace the broad Exception catch with the specific exception type that should be raised when using an incorrect encryption key.

-    with pytest.raises(Exception):
+    with pytest.raises(ValueError, match="Decryption failed:"):
         xr.open_zarr(store=hamt1_read_bad)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with pytest.raises(Exception):
    xr.open_zarr(store=hamt1_read_bad)
with pytest.raises(ValueError, match="Decryption failed:"):
    xr.open_zarr(store=hamt1_read_bad)
🧰 Tools
🪛 Ruff (0.8.2)

178-178: pytest.raises(Exception) should be considered evil

(B017)

🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[warning] UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/test_zarr_encryption_ipfs.py (2)

20-60: ⚠️ Potential issue

Enhance security by adding key validation and error handling.

The EncryptionFilter class needs improvements:

  1. Add validation for the encryption key length (should be 32 bytes for ChaCha20-Poly1305).
  2. Add error handling for encryption/decryption failures.

Additionally, consider:

  1. Making the header configurable or using a unique value per instance to prevent potential replay attacks.
  2. Adding a version field to support future algorithm changes.

Apply this diff to improve the implementation:

+    def __init__(self, encryption_key: bytes, header: bytes = b"dClimate-Zarr"):
+        if not isinstance(encryption_key, bytes) or len(encryption_key) != 32:
+            raise ValueError("encryption_key must be 32 bytes")
+        self.encryption_key = encryption_key
+        self.header = header
+        self.version = 1  # For future algorithm changes

     def encode(self, buf):
+        try:
             raw = io.BytesIO()
             raw.write(buf)
             nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
             cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
             cipher.update(self.header)
             ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())
             return nonce + tag + ciphertext
+        except Exception as e:
+            raise ValueError(f"Encryption failed: {str(e)}")

     def decode(self, buf, out=None):
+        try:
             nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
             cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
             cipher.update(self.header)
             plaintext = cipher.decrypt_and_verify(ciphertext, tag)
+        except ValueError:
+            raise ValueError("Decryption failed: Invalid key or corrupted data")
+        except Exception as e:
+            raise ValueError(f"Decryption failed: {str(e)}")

173-178: ⚠️ Potential issue

Use specific exception types instead of broad Exception catch.

Replace the broad Exception catch with the specific exception type that should be raised when using an incorrect encryption key.

Apply this diff to improve error handling:

-    with pytest.raises(Exception, match="Codec ID not found in transformer"):
+    with pytest.raises(ValueError, match="Codec ID not found in transformer"):
         HAMT(
             store=IPFSStore(),
             root_node_id=hamt1_root,
             read_only=True,
         )
🧹 Nitpick comments (3)
py_hamt/hamt.py (3)

285-309: Improve error handling and performance in check_crypto.

The check_crypto method has several areas for improvement:

  1. Cache the codec_id check to avoid repeated attribute lookups.
  2. Use early returns to simplify the logic.
  3. Improve error messages.

Apply this diff to improve the implementation:

     def check_crypto(self):
-        has_codec_id = hasattr(self, "codec_id")
-        found_filter = False
+        codec_id = getattr(self, "codec_id", None)
+        found_metadata = False
         for key in self:
             if key.endswith("/.zarray"):
+                found_metadata = True
                 metadata_bytes = self[key]
                 metadata = json.loads(metadata_bytes.decode("utf-8"))
                 crypto = metadata.get("crypto", None)
                 if crypto:
-                    found_filter = True
-                    if not has_codec_id:
-                        raise Exception("Codec ID not found in transformer")
-                    if crypto[0]["id"] != self.codec_id:
-                        raise Exception("Codec ID does not match transformer")
-        if not found_filter and has_codec_id:
-            raise Exception("Codec ID found in transformer but no crypto found in metadata")
+                    if not codec_id:
+                        raise ValueError("Encrypted data found but no transformer provided")
+                    if crypto[0]["id"] != codec_id:
+                        raise ValueError(f"Expected codec ID '{codec_id}' but found '{crypto[0]['id']}'")
+                    return
+        if codec_id and not found_metadata:
+            raise ValueError("Transformer provided but no encrypted data found")

667-674: Optimize dictionary key check.

Using dict.keys() is unnecessary when checking for key existence.

Apply this diff to improve performance:

-        if "crypto" not in metadata.keys():
+        if "crypto" not in metadata:
             metadata['crypto'] = []
         # Append codec filter if not already present
         if codec_filter not in metadata['crypto']:
             metadata['crypto'].append(codec_filter)
         # Encode back to JSON bytes
         return json.dumps(metadata).encode('utf-8')
🧰 Tools
🪛 Ruff (0.8.2)

667-667: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)


680-691: Simplify conditional logic.

The conditional logic can be simplified by returning the negated condition directly.

Apply this diff to improve readability:

     def _should_transform_key(self, key):
-            if any(meta_key in key for meta_key in self.ZARR_METADATA_KEYS):
-                return False  # Don't transform metadata keys
-            
-            if self.INDEX_KEY_PATTERN.fullmatch(key):
-                return False  # Don't transform single index keys like lat/0
-            
-            return True  # Transform everything else (e.g., chunked data like precip/1.0.0)
+            # Don't transform metadata keys or single index keys
+            return not (
+                any(meta_key in key for meta_key in self.ZARR_METADATA_KEYS)
+                or self.INDEX_KEY_PATTERN.fullmatch(key)
+            )
🧰 Tools
🪛 Ruff (0.8.2)

685-690: Return the negated condition directly

(SIM103)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bf4095 and b812af8.

📒 Files selected for processing (2)
  • py_hamt/hamt.py (4 hunks)
  • tests/test_zarr_encryption_ipfs.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto
tests/test_zarr_encryption_ipfs.py

[error] 129-129: TypeError: Unsupported type for store_like: 'TransformedHamt'

🪛 Ruff (0.8.2)
py_hamt/hamt.py

652-652: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


667-667: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)


685-690: Return the negated condition directly

(SIM103)

🔇 Additional comments (2)
tests/test_zarr_encryption_ipfs.py (1)

63-112: LGTM! Well-structured test fixture.

The fixture is well-implemented with:

  • Proper cleanup using yield and shutil.rmtree.
  • Realistic test data with appropriate shapes and coordinates.
  • Clear documentation of return values.
py_hamt/hamt.py (1)

128-137: LGTM! Well-implemented factory pattern.

The factory pattern is correctly implemented to create appropriate HAMT instances based on the presence of a transformer.

    transformer=EncryptionFilter(encryption_key=encryption_key),
    encrypt_vars=["temp", "precip"],
)
test_ds.to_zarr(store=hamt1, mode="w")

💡 Verification agent

⚠️ Potential issue

🧩 Analysis chain

Fix TypeError: Unsupported type for store_like.

The pipeline failure indicates that test_ds.to_zarr(store=hamt1, mode="w") is failing because TransformedHamt is not a supported store type.

Run the following script to check if TransformedHamt implements all required methods for a zarr store:


🏁 Script executed:

#!/bin/bash
# Description: Check if TransformedHamt implements all required methods for a zarr store.

# Test: Search for required zarr store methods in TransformedHamt.
ast-grep --pattern $'class TransformedHamt {
  $$$
  __getitem__($$$) {
    $$$
  }
  __setitem__($$$) {
    $$$
  }
  __delitem__($$$) {
    $$$
  }
  __iter__($$$) {
    $$$
  }
  __len__($$$) {
    $$$
  }
  $$$
}'

Length of output: 218


I'll generate a new shell script to locate the definition of TransformedHamt and inspect its implementation to verify if it includes the required methods for a Zarr store.

Please run this script and share the output so we can verify whether TransformedHamt implements the required methods for a Zarr store.


🏁 Script executed:

#!/bin/bash
# 1. Find the file containing the definition of TransformedHamt.
FILE=$(rg -l "class TransformedHamt" . | head -n 1)
if [ -z "$FILE" ]; then
  echo "No file found containing 'class TransformedHamt'"
  exit 1
fi
echo "TransformedHamt class found in file: $FILE"

# 2. Print context of the class definition.
echo "Extracting class definition (20 lines context):"
rg -n "class TransformedHamt" -A 20 "$FILE"

# 3. Check for the required methods in that file.
echo "Searching for required methods in $FILE:"
echo "---- __getitem__ ----"
rg -n "def __getitem__" "$FILE"
echo "---- __setitem__ ----"
rg -n "def __setitem__" "$FILE"
echo "---- __delitem__ ----"
rg -n "def __delitem__" "$FILE"
echo "---- __iter__ ----"
rg -n "def __iter__" "$FILE"
echo "---- __len__ ----"
rg -n "def __len__" "$FILE"

Length of output: 2274

TransformedHamt Implements the Required Methods but Lacks Proper Interface Inheritance

Our investigation confirms that the TransformedHamt class (in py_hamt/hamt.py) contains implementations for the standard mapping methods (__getitem__, __setitem__, __delitem__, __iter__, and __len__). However, despite this, the pipeline still fails with a TypeError ("Unsupported type for store_like") when passing an instance of this class to to_zarr.

• Though the methods exist, Zarr likely verifies the store by checking for a valid mutable mapping type—typically done via an inheritance from (or registration with) collections.abc.MutableMapping.

• It appears that TransformedHamt (or its base class HAMTOriginal) is not declared as a subclass of MutableMapping, so even with the required methods, it isn’t recognized as a valid Zarr store.

To resolve this issue, please review and modify the class inheritance hierarchy so that TransformedHamt (and/or its parent) explicitly extends collections.abc.MutableMapping. This should meet the interface expectations required by Zarr.

🧰 Tools
🪛 GitHub Actions: Triggered on push from TheGreatAlgo to branch/tag crypto

[error] 129-129: TypeError: Unsupported type for store_like: 'TransformedHamt'
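For context, a minimal illustration (generic Python, not the py_hamt classes) of the interface check that inheriting from collections.abc.MutableMapping satisfies:

from collections.abc import MutableMapping

class DictBackedStore(MutableMapping):
    """Toy store: subclassing MutableMapping is what makes isinstance checks pass."""
    def __init__(self):
        self._data = {}
    def __getitem__(self, key):
        return self._data[key]
    def __setitem__(self, key, value):
        self._data[key] = value
    def __delitem__(self, key):
        del self._data[key]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

assert isinstance(DictBackedStore(), MutableMapping)  # a duck-typed class without the base fails this check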

Comment on lines +652 to +657
def __init__(self, *args, transformer=None, encrypt_vars=[], **kwargs):
    self.transformer = transformer
    self.codec_id = transformer.codec_id
    self.encrypt_vars = encrypt_vars
    super().__init__(*args, **kwargs)


⚠️ Potential issue

Fix mutable default argument.

Using mutable default arguments can lead to unexpected behavior.

Apply this diff to fix the issue:

-    def __init__(self, *args, transformer=None, encrypt_vars=[], **kwargs):
+    def __init__(self, *args, transformer=None, encrypt_vars=None, **kwargs):
         self.transformer = transformer
         self.codec_id = transformer.codec_id
-        self.encrypt_vars = encrypt_vars
+        self.encrypt_vars = encrypt_vars if encrypt_vars is not None else []
         super().__init__(*args, **kwargs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __init__(self, *args, transformer=None, encrypt_vars=[], **kwargs):
    self.transformer = transformer
    self.codec_id = transformer.codec_id
    self.encrypt_vars = encrypt_vars
    super().__init__(*args, **kwargs)

def __init__(self, *args, transformer=None, encrypt_vars=None, **kwargs):
    self.transformer = transformer
    self.codec_id = transformer.codec_id
    self.encrypt_vars = encrypt_vars if encrypt_vars is not None else []
    super().__init__(*args, **kwargs)
🧰 Tools
🪛 Ruff (0.8.2)

652-652: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
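For context, a quick demonstration of the pitfall (generic Python, unrelated to this PR's code):

def add_item(item, items=[]):          # the [] is created once, at function definition time
    items.append(item)
    return items

print(add_item("a"))   # ['a']
print(add_item("b"))   # ['a', 'b']  <- state leaks between calls

def add_item_safe(item, items=None):   # the recommended pattern
    items = items if items is not None else []
    items.append(item)
    return items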

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
py_hamt/hamt.py (1)

652-657: ⚠️ Potential issue

Fix mutable default argument.

Using mutable default arguments can lead to unexpected behavior when the same list is shared across instances.

Apply this diff to fix the issue:

-    def __init__(self, *args, transformer=None, encrypt_vars=[], **kwargs):
+    def __init__(self, *args, transformer=None, encrypt_vars=None, **kwargs):
         self.transformer = transformer
         self.codec_id = transformer.codec_id
-        self.encrypt_vars = encrypt_vars
+        self.encrypt_vars = encrypt_vars if encrypt_vars is not None else []
         super().__init__(*args, **kwargs)
🧰 Tools
🪛 Ruff (0.8.2)

652-652: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🧹 Nitpick comments (3)
py_hamt/hamt.py (3)

285-309: Enhance error messages and optimize metadata check.

The crypto validation is thorough, but consider these improvements:

  1. Error messages could be more descriptive by including the actual vs expected codec IDs
  2. The metadata check could be optimized by using key not in metadata instead of key not in metadata.keys()

Apply this diff to improve the implementation:

-        if "crypto" not in metadata.keys():
+        if "crypto" not in metadata:
             metadata['crypto'] = []
         if crypto:
             found_filter = True
             if not has_codec_id:
-                raise Exception("Codec ID not found in transformer")
+                raise Exception(f"Transformer required: crypto filter found in {key}")
             if crypto[0]["id"] != self.codec_id:
-                raise Exception("Codec ID does not match transformer")
+                raise Exception(f"Codec ID mismatch in {key}: expected {self.codec_id}, got {crypto[0]['id']}")

680-699: Simplify transformation logic and improve return statement.

The transformation logic can be simplified and the return statement can be more direct.

Apply this diff to improve the implementation:

     def _should_transform_key(self, key):
         """Determines whether the key should be transformed."""
         if any(meta_key in key for meta_key in self.ZARR_METADATA_KEYS):
             return False  # Don't transform metadata keys
         
         if self.INDEX_KEY_PATTERN.fullmatch(key):
             return False  # Don't transform single index keys
         
         parts = key.split('/')
         metadata = self[parts[0] + "/.zarray"]
         if not metadata:
             return False
         metadata = json.loads(metadata.decode('utf-8'))
         crypto = metadata.get("crypto", [])
-        # If no crypto then don't transform data
-        if not crypto:
-            return False
-        return True  # Transform everything else
+        return bool(crypto)  # Transform only if crypto filters exist
🧰 Tools
🪛 Ruff (0.8.2)

697-699: Return the negated condition directly

(SIM103)


658-674: Enhance metadata update robustness.

The metadata update logic could be more robust by validating the crypto filter structure.

Apply this diff to improve the implementation:

     def _update_metadata_crypto(self, value):
         """Update the metadata to include the codec in the crypto list."""
         try:
             metadata = json.loads(value.decode('utf-8'))
+            if not isinstance(metadata, dict):
+                return value
         except (json.JSONDecodeError, UnicodeDecodeError, AttributeError):
             return value  # Return original if not decodable
         
         codec_filter = {'id': self.codec_id}
-        if "crypto" not in metadata.keys():
+        if "crypto" not in metadata:
             metadata['crypto'] = []
+        elif not isinstance(metadata['crypto'], list):
+            metadata['crypto'] = []
         
         if codec_filter not in metadata['crypto']:
             metadata['crypto'].append(codec_filter)
         return json.dumps(metadata).encode('utf-8')
🧰 Tools
🪛 Ruff (0.8.2)

667-667: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)
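For reference, a hedged sketch of what the .zarray update amounts to; the helper name here is illustrative and the exact structure consumed by zarr.js is assumed from this PR:

import json

def add_crypto_entry(zarray_bytes: bytes, codec_id: str) -> bytes:
    """Append a crypto filter entry to .zarray metadata, mirroring _update_metadata_crypto."""
    metadata = json.loads(zarray_bytes.decode("utf-8"))
    metadata.setdefault("crypto", [])
    entry = {"id": codec_id}
    if entry not in metadata["crypto"]:
        metadata["crypto"].append(entry)
    return json.dumps(metadata).encode("utf-8")

before = json.dumps({"zarr_format": 2, "chunks": [10, 10]}).encode("utf-8")
after = add_crypto_entry(before, "xchacha20poly1305")
# after == b'{"zarr_format": 2, "chunks": [10, 10], "crypto": [{"id": "xchacha20poly1305"}]}'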

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b812af8 and 63bc61e.

📒 Files selected for processing (1)
  • py_hamt/hamt.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
py_hamt/hamt.py

652-652: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


667-667: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)


697-699: Return the negated condition directly

(SIM103)

🔇 Additional comments (1)
py_hamt/hamt.py (1)

128-137: LGTM! Well-structured factory pattern implementation.

The factory pattern implementation cleanly separates the HAMT variants and provides a good abstraction for client code.

Comment on lines +128 to +139
class HAMT:
    def __new__(cls, store, transformer=None, **kwargs):
        """
        Instantiating HAMT directly returns an instance of HAMT or TransformedHamt,
        depending on whether a transformer is provided.
        """
        if transformer:
            return TransformedHamt(store=store, transformer=transformer, **kwargs)
        return HAMTOriginal(store=store, **kwargs)


class HAMTOriginal(MutableMapping):

Rather than having to differentiate and add this layer of indirection, we can take the transformer functions in the HAMT class initialization and then apply them right before values are either sent to or retrieved from the store.

Comment on lines +33 to +36
[tool.uv]
dev-dependencies = [
"pycryptodome",
]

CodeRabbit also pointed this out, but yes, this should be in the project dependencies rather than the dev dependencies.

Comment on lines +20 to +60
class EncryptionFilter:
    """An encryption filter for ZARR data.
    This class is serialized and stored along with the Zarr it is used with, so instead
    of storing the encryption key, we store the hash of the encryption key, so it can be
    looked up in the key registry at run time as needed.
    Parameters
    ----------
    key_hash: str
        The hex digest of an encryption key. A key may be generated using
        :func:`generate_encryption_key`. The hex digest is obtained by registering the
        key using :func:`register_encryption_key`.
    """

    codec_id = "xchacha20poly1305"
    header = b"dClimate-Zarr"

    def __init__(self, encryption_key: str):
        self.encryption_key = encryption_key

    def encode(self, buf):
        raw = io.BytesIO()
        raw.write(buf)
        nonce = get_random_bytes(24)  # XChaCha uses 24 byte (192 bit) nonce
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        ciphertext, tag = cipher.encrypt_and_digest(raw.getbuffer())

        return nonce + tag + ciphertext

    def decode(self, buf, out=None):
        nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
        cipher = ChaCha20_Poly1305.new(key=self.encryption_key, nonce=nonce)
        cipher.update(self.header)
        plaintext = cipher.decrypt_and_verify(ciphertext, tag)

        if out is not None:
            outbuf = io.BytesIO(plaintext)
            outbuf.readinto(out)
            return out

        return plaintext

Rather than use a class, we should use a factory function that returns closures. This makes the transformer input on the HAMT class more flexible, since it can accept a wider range of generic functions rather than being limited to this class type or forcing clients to inherit from it.
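A minimal sketch of that idea, assuming the transformer interface stays a pair of bytes-to-bytes callables (the HAMT keyword names in the usage comment are hypothetical, not defined by this PR):

from Crypto.Cipher import ChaCha20_Poly1305
from Crypto.Random import get_random_bytes

def make_encryption_transformers(encryption_key: bytes, header: bytes = b"dClimate-Zarr"):
    """Return (encode, decode) closures instead of an EncryptionFilter class."""
    def encode(buf: bytes) -> bytes:
        nonce = get_random_bytes(24)
        cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
        cipher.update(header)
        ciphertext, tag = cipher.encrypt_and_digest(bytes(buf))
        return nonce + tag + ciphertext

    def decode(buf: bytes) -> bytes:
        nonce, tag, ciphertext = buf[:24], buf[24:40], buf[40:]
        cipher = ChaCha20_Poly1305.new(key=encryption_key, nonce=nonce)
        cipher.update(header)
        return cipher.decrypt_and_verify(ciphertext, tag)

    return encode, decode

# Hypothetical usage; the keyword arguments HAMT would accept are not defined by this PR:
# encode_fn, decode_fn = make_encryption_transformers(get_random_bytes(32))
# hamt = HAMT(store=IPFSStore(), transformer_encode=encode_fn, transformer_decode=decode_fn)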

@abidsikder abidsikder closed this Feb 7, 2025
@abidsikder abidsikder deleted the crypto branch February 18, 2025 18:43
