I/O: Adapter for Delta Lake#664

Merged
amotl merged 2 commits into main from deltalake on Mar 4, 2026
Conversation

@amotl
Member

@amotl amotl commented Feb 20, 2026

About

Import from and export to Delta Lake tables, for interoperability purposes.

Documentation

https://cratedb-toolkit--664.org.readthedocs.build/io/deltalake/

References

Review

Please don't hesitate to review retrospectively, where applicable.
Any feedback is much appreciated.

@coderabbitai

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a Delta Lake I/O adapter and wiring: new Delta Lake module, StandaloneCluster routing for deltalake:// URLs, docs, tests, and packaging extras to enable bidirectional Delta Lake ↔ CrateDB transfers using Polars.

Changes

Cohort / File(s) / Summary

  • Delta Lake I/O module — cratedb_toolkit/io/deltalake.py
    New adapter implementing DeltaLakeAddress, URL parsing and storage options, a Polars-based load_table, and public functions from_deltalake() and to_deltalake() with chunked transfer and error handling.
  • Cluster integration — cratedb_toolkit/cluster/core.py
    StandaloneCluster.load_table/save_table extended to detect the deltalake scheme and route to the new handlers; the deltalake branch is placed before the existing Iceberg handling.
  • Tests — tests/io/test_deltalake.py
    New CLI tests for load/save flows, version handling, and error cases; tests are skipped if optional dependencies are missing.
  • Documentation — doc/io/deltalake/index.md, doc/io/index.md
    Added Delta Lake I/O documentation and registered it in the I/O index.
  • Changelog — CHANGES.md
    Added a changelog entry noting the Delta Lake I/O adapter (text-only).
  • Dependencies / Packaging — pyproject.toml
    Added optional-dependencies.deltalake (includes polars[deltalake]<1.39) and updated the io-opentable extra to include deltalake.
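The walkthrough above mentions URL parsing and storage-option collection. A minimal, hypothetical sketch of splitting an option-bearing Delta Lake URL into a table location and a storage-options map, using only the standard library — function and variable names here are illustrative, not the toolkit's actual DeltaLakeAddress API:

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def parse_delta_url(url: str) -> tuple[str, dict]:
    """Split a Delta Lake URL into (location, storage_options).

    Hypothetical helper mirroring what an adapter's address class
    might do: the query string becomes storage options, and the
    remainder is the table location handed to the Delta reader.
    """
    parts = urlsplit(url)
    options = dict(parse_qsl(parts.query))
    location = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return location, options


loc, opts = parse_delta_url(
    "s3+deltalake://bucket1/demo/taxi-tiny?aws_region=us-east-1"
)
# loc  → "s3+deltalake://bucket1/demo/taxi-tiny"
# opts → {"aws_region": "us-east-1"}
```

The real adapter presumably also normalizes option names for the underlying Delta library; this sketch only shows the structural split.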

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Cluster as StandaloneCluster
    participant DLAdapter as from_deltalake()
    participant DLAddr as DeltaLakeAddress
    participant Polars as Polars
    participant CrateDB as CrateDB

    User->>Cluster: load_table(source_url, target_url)
    Cluster->>DLAdapter: from_deltalake(source_url, target_url)
    DLAdapter->>DLAddr: DeltaLakeAddress.from_url(source_url)
    DLAddr->>DLAddr: parse URL & collect storage options
    DLAddr->>Polars: scan_delta(location, version, storage_options)
    Polars-->>DLAdapter: LazyFrame / DataFrame
    DLAdapter->>CrateDB: write chunks to CrateDB
    CrateDB-->>DLAdapter: ack
    DLAdapter-->>Cluster: result True/False
    Cluster-->>User: operation result
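The load path above writes data to CrateDB in chunks. A generic sketch of that chunked-transfer loop in plain Python with an injected writer — the chunk size and names are illustrative, and the actual adapter reportedly streams Polars frames rather than Python lists:

```python
def transfer_in_chunks(rows, write_chunk, chunk_size=1000):
    """Buffer rows from a source and flush fixed-size batches to a sink.

    Illustrates the pattern described in the walkthrough, not the
    adapter's actual implementation.
    """
    buffer = []
    written = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= chunk_size:
            write_chunk(buffer)
            written += len(buffer)
            buffer = []
    if buffer:  # Flush the final partial chunk.
        write_chunk(buffer)
        written += len(buffer)
    return written
```

With seven rows and chunk_size=3, for example, this flushes batches of 3, 3, and 1.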
sequenceDiagram
    participant User as User/CLI
    participant Cluster as StandaloneCluster
    participant DLAdapter as to_deltalake()
    participant DLAddr as DeltaLakeAddress
    participant CrateDB as CrateDB
    participant Polars as Polars
    participant DeltaStorage as Delta Storage

    User->>Cluster: save_table(source_url, target_url)
    Cluster->>DLAdapter: to_deltalake(source_url, target_url)
    DLAdapter->>DLAddr: DeltaLakeAddress.from_url(target_url)
    DLAddr->>DLAddr: parse URL & collect storage options
    DLAdapter->>CrateDB: read_cratedb(streaming, chunk_size)
    CrateDB-->>DLAdapter: DataFrame chunks
    loop per chunk
      DLAdapter->>Polars: write_delta(chunk, mode, storage_options)
      Polars->>DeltaStorage: write (overwrite/append)
      DeltaStorage-->>Polars: ack
    end
    DLAdapter-->>Cluster: result True/False
    Cluster-->>User: operation result
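One detail implied by the per-chunk loop in the save path: the write mode typically differs between the first and subsequent chunks, so the target table is replaced once and then extended. A tiny sketch of that sequencing, assumed from the diagram's "overwrite/append" annotation rather than read from the merged code:

```python
def write_mode(chunk_index: int) -> str:
    """First chunk replaces the target table; later chunks append."""
    return "overwrite" if chunk_index == 0 else "append"


modes = [write_mode(i) for i in range(3)]
# → ["overwrite", "append", "append"]
```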

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • I/O: Adapter for Apache Iceberg #444 — Adds new IO adapters and modifies cluster load/save routing to handle an additional table format; closely related to Delta Lake adapter insertion.

Suggested reviewers

  • seut
  • matriv

Poem

🐰 I hopped through bytes beside a lake,
Polars shimmered with each chunk I take,
Tables crossed the gleaming tide,
CrateDB and Delta side by side,
A rabbit's patchwork codey wake.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed. The title clearly and concisely describes the main change: adding a Delta Lake adapter to the I/O subsystem, which aligns with the changeset's core purpose.
  • Description check — ✅ Passed. The description is directly related to the changeset, explaining the addition of Delta Lake import/export functionality and providing relevant documentation and references.
  • Docstring coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@amotl amotl marked this pull request as ready for review February 20, 2026 02:31

@amotl amotl force-pushed the deltalake branch 2 times, most recently from 2612f2b to 7600ec0 Compare February 20, 2026 03:22
@amotl amotl requested review from matriv and seut February 20, 2026 03:28

Base automatically changed from iceberg to main March 4, 2026 12:01

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
cratedb_toolkit/io/deltalake.py (1)

74-87: ⚠️ Potential issue | 🟠 Major

collect_properties drops documented LakeFS options.

On Line 74–Line 87, only aws_/azure_/google_/delta_-prefixed keys are retained. The documented LakeFS example (endpoint, access_key_id, secret_access_key) won’t pass through storage_options, which can break LakeFS access.

🛠️ Proposed fix
-        prefixes = ["aws_", "azure_", "google_", "delta_"]
+        prefixes = ["aws_", "azure_", "google_", "delta_", "endpoint", "access_key_id", "secret_access_key"]

@@
-                if name.lower().startswith(prefix) and value is not None:
-                    opts[name.upper()] = value
+                if name.lower().startswith(prefix) and value is not None:
+                    key = name.upper() if name.lower().startswith(("aws_", "azure_", "google_", "delta_")) else name
+                    opts[key] = value
                     break
#!/bin/bash
# Verify mismatch between documented LakeFS params and current filter logic.
set -euo pipefail

echo "== collect_properties implementation =="
rg -n -C2 'prefixes = \[|opts\[name\.upper\(\)\]' cratedb_toolkit/io/deltalake.py

echo
echo "== LakeFS docs example params =="
rg -n 'lakefs\+deltalake://|endpoint=|access_key_id=|secret_access_key=' doc/io/deltalake/index.md
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cratedb_toolkit/io/deltalake.py` around lines 74 - 87, collect_properties
currently only returns query params that start with prefixes = ["aws_",
"azure_", "google_", "delta_"], which drops documented LakeFS options like
endpoint, access_key_id, and secret_access_key; update collect_properties (and
its prefixes check) to also accept those LakeFS option names (case-insensitive)
by adding a small whitelist (e.g. {"endpoint", "access_key_id",
"secret_access_key"}) and when iterating query_params include a param if it
matches any prefix OR if its lowercased name is in that whitelist, preserving
the existing behavior of uppercasing keys in opts and skipping None values
(refer to collect_properties, prefixes and url.query_params to locate the code).
🧹 Nitpick comments (1)
doc/io/deltalake/index.md (1)

61-63: Prefer env vars over inline secrets in URL examples.

The examples on Line 61 and Line 118 place secret-bearing parameters in the command URL. Even with placeholders, this pattern encourages leaking credentials via shell history/process lists. Consider adding an alternative example using environment variables for auth values.

Also applies to: 106-107, 118-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@doc/io/deltalake/index.md` around lines 61 - 63, Replace the inline
secret-bearing query parameters in the example URL strings (e.g. the
"s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=..."
and similar examples) with a short alternative that demonstrates using
environment variables (e.g., export or --env usage) for AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT and AWS_REGION, and update the doc text to
show both options so the primary example does not encourage embedding
credentials in the URL.
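The environment-variable alternative suggested above might look like this — the variable names are assumptions based on common AWS conventions, so verify them against the adapter's documentation before relying on them:

```shell
# Hypothetical: supply credentials via the environment so the
# command-line URL carries no secrets (shell history, process lists).
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_REGION="us-east-1"
# The transfer command then omits credential query parameters, e.g.:
# ctk load table "s3+deltalake://bucket1/demo/taxi-tiny"
```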
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/io/test_deltalake.py`:
- Line 95: The assertion in tests/io/test_deltalake.py currently uses
exc_info.match() with the full error text which is brittle; change it to assert
on a stable substring from the exception (e.g. assert "not the same as the
specified end version" in str(exc_info.value) or assert "LogSegment" in
str(exc_info.value)) instead of matching the entire message, replacing the
exc_info.match("LogSegment end version 0 not the same as the specified end
version 99") call with a substring containment assertion to make the test
resilient to upstream message changes.

---

Duplicate comments:
In `@cratedb_toolkit/io/deltalake.py`:
- Around line 74-87: collect_properties currently only returns query params that
start with prefixes = ["aws_", "azure_", "google_", "delta_"], which drops
documented LakeFS options like endpoint, access_key_id, and secret_access_key;
update collect_properties (and its prefixes check) to also accept those LakeFS
option names (case-insensitive) by adding a small whitelist (e.g. {"endpoint",
"access_key_id", "secret_access_key"}) and when iterating query_params include a
param if it matches any prefix OR if its lowercased name is in that whitelist,
preserving the existing behavior of uppercasing keys in opts and skipping None
values (refer to collect_properties, prefixes and url.query_params to locate the
code).

---

Nitpick comments:
In `@doc/io/deltalake/index.md`:
- Around line 61-63: Replace the inline secret-bearing query parameters in the
example URL strings (e.g. the
"s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=..."
and similar examples) with a short alternative that demonstrates using
environment variables (e.g., export or --env usage) for AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT and AWS_REGION, and update the doc text to
show both options so the primary example does not encourage embedding
credentials in the URL.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 742831ff-cb6c-497c-bf8e-89eff8f80cb0

📥 Commits

Reviewing files that changed from the base of the PR and between e08cb3c and 043b53c.

📒 Files selected for processing (7)
  • CHANGES.md
  • cratedb_toolkit/cluster/core.py
  • cratedb_toolkit/io/deltalake.py
  • doc/io/deltalake/index.md
  • doc/io/index.md
  • pyproject.toml
  • tests/io/test_deltalake.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGES.md
  • doc/io/index.md

tests/io/test_deltalake.py (review anchor, excerpt):

args=["load", "table", source_url],
catch_exceptions=False,
)
assert exc_info.match("LogSegment end version 0 not the same as the specified end version 99")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Check deltalake version constraints in pyproject.toml and assertion method in test file
rg -n 'polars\[deltalake\]|deltalake' pyproject.toml tests/io/test_deltalake.py || true
echo "---"
# Get context around line 95 in the test file
sed -n '90,100p' tests/io/test_deltalake.py || true

Repository: crate/cratedb-toolkit

Length of output: 1698


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Get full context around line 95 including pytest exception handling
sed -n '75,100p' tests/io/test_deltalake.py

Repository: crate/cratedb-toolkit

Length of output: 984


🌐 Web query:

deltalake python library error message "LogSegment end version" changelog releases

💡 Result:

The error text:

LogSegment end version X not the same as the specified end version Y

comes from delta-kernel-rs’s LogSegment::try_new check, which asserts that the “effective” last commit/checkpoint version found during _delta_log listing equals the end_version you requested. If log listing returns a different last version (often due to missing/hidden commit JSONs, non-contiguous versions, eventual consistency, or a stale/partial listing), it raises this error. [1]

Where to look in release notes / changelogs

This message is not typically called out verbatim in deltalake (Python) release notes; it’s emitted by the Rust kernel component. So the relevant “changelog trail” is usually: deltalake (PyPI) version → delta-rs release → delta-kernel version bumps.

  • deltalake (Python) release history (dates/versions to correlate with your environment). [2]
  • delta-rs consolidated CHANGELOG.md includes multiple “update kernel” entries (meaning delta-kernel behavior and errors like this may shift across these releases), e.g.:
    • rust-v0.27.0 includes “upgrade to delta_kernel 0.12.x”. [3]
    • rust-v0.28.1 includes “update kernel to 0.15.1”. [4]
    • rust-v0.30.0 includes “update delta-kernel to 0.19”. [5]

If you tell me your exact deltalake version and the operation you were calling (often CDF / table-changes / time travel), I can map it to the corresponding delta-rs/kernel dependency set and point you at the most relevant release block.

Sources: [1] [2] [3] [4] [5]


Use a stable substring for exception message assertion instead of full pattern.

The assertion currently matches the entire error message using exc_info.match(), which can break when the deltalake/delta-kernel stack updates since the underlying error text is generated by delta-kernel-rs and may change across versions. The dependency polars[deltalake]<1.39 is not pinned, leaving room for version drift.

Prefer asserting on a stable substring:

🛠️ Suggested fix
-    assert exc_info.match("LogSegment end version 0 not the same as the specified end version 99")
+    assert "specified end version 99" in str(exc_info.value)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert exc_info.match("LogSegment end version 0 not the same as the specified end version 99")
assert "specified end version 99" in str(exc_info.value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/io/test_deltalake.py` at line 95, The assertion in
tests/io/test_deltalake.py currently uses exc_info.match() with the full error
text which is brittle; change it to assert on a stable substring from the
exception (e.g. assert "not the same as the specified end version" in
str(exc_info.value) or assert "LogSegment" in str(exc_info.value)) instead of
matching the entire message, replacing the exc_info.match("LogSegment end
version 0 not the same as the specified end version 99") call with a substring
containment assertion to make the test resilient to upstream message changes.
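The substring-containment pattern recommended above can be demonstrated without any Delta Lake dependency. The stand-in function below merely raises an exception carrying the kernel's message; only the assertion style is the point:

```python
def request_missing_version():
    """Stand-in for the failing CLI call. The real error text is
    generated by delta-kernel-rs and may change across releases."""
    raise ValueError(
        "LogSegment end version 0 not the same as the specified end version 99"
    )


try:
    request_missing_version()
except ValueError as exc:
    # Assert on a stable substring rather than the full message,
    # so upstream wording changes don't break the test.
    assert "specified end version 99" in str(exc)
```

In a pytest test, the same assertion would read `assert "specified end version 99" in str(exc_info.value)` inside a `pytest.raises` block.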

@amotl amotl merged commit dd3e881 into main Mar 4, 2026
25 checks passed
@amotl amotl deleted the deltalake branch March 4, 2026 12:13