[SYSTEMDS-3949] Add native Delta Lake frame read/write via Delta Kernel #2515
Open
Baunsgaard wants to merge 2 commits into
Codecov Report
❌ Patch coverage is

@@             Coverage Diff              @@
##               main    #2515      +/-   ##
============================================
+ Coverage     71.56%   71.60%   +0.03%
- Complexity    49125    49310     +185
============================================
  Files          1575     1582       +7
  Lines        189784   190583     +799
  Branches      37232    37395     +163
============================================
+ Hits         135823   136461     +638
- Misses        43470    43565      +95
- Partials      10491    10557      +66
Introduce a DELTA file format that reads and writes Delta Lake tables natively through the Spark-free Delta Kernel library, for matrices on the single-node CP path. DML read/write with format="delta" now operates directly on Delta tables without a Spark DataFrame round-trip.

- Add FileFormat.DELTA and exclude it from the text formats
- Accept format="delta" with unknown dimensions in the parser and set blocksize -1 for the columnar format
- Wire DELTA into the matrix reader and writer factories
- Add DeltaKernelUtils plus serial and parallel native Delta readers and WriterDelta with column-at-a-time, boxing-free data transfer
- Expose Delta reader batch size and writer target file size via DMLConfig
- Refresh cached matrix metadata after a Delta read (discovered dimensions)
- Add a parquet.version property and pin delta-kernel 3.3.2
- Run Delta component IO tests in CI and broaden matrix coverage

Append/overwrite table semantics, distributed execution, frames, and time travel are out of scope.
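To illustrate what "column-at-a-time, boxing-free data transfer" can look like, here is a minimal sketch. It is not the code from this PR: `ColumnCopySketch`, `copyColumn`, and `toDense` are hypothetical names, and the sketch assumes each column has already been decoded into a primitive `double[]` and that the target is a row-major dense array.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the DeltaKernelUtils implementation. */
final class ColumnCopySketch {

    /**
     * Copies one primitive column into a row-major dense array.
     * The value at (r, c) is stored at index r * cols + c.
     */
    static void copyColumn(double[] column, double[] dense, int cols, int c) {
        for (int r = 0; r < column.length; r++) {
            // primitive-to-primitive assignment, so no Double objects are allocated
            dense[r * cols + c] = column[r];
        }
    }

    /** Assembles a dense row-major array by transferring one column at a time. */
    static double[] toDense(double[][] columns, int rows) {
        int cols = columns.length;
        double[] dense = new double[rows * cols];
        for (int c = 0; c < cols; c++) {
            if (columns[c].length != rows) {
                throw new IllegalArgumentException("column " + c + " has wrong length");
            }
            copyColumn(columns[c], dense, cols, c);
        }
        return dense;
    }

    public static void main(String[] args) {
        double[][] columns = { {1, 2, 3}, {10, 20, 30} };
        // prints [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
        System.out.println(Arrays.toString(toDense(columns, 3)));
    }
}
```

Working on whole columns matches the columnar layout of the underlying Parquet files, and staying on primitive arrays avoids one object allocation per cell. The actual readers and writer in the PR may structure this differently.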
Extend the native Delta Lake support from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.

- Add FrameReaderDelta, FrameReaderDeltaParallel and FrameWriterDelta
- Wire DELTA into the frame reader and writer factories
- Refresh cached frame metadata and schema after a Delta read
- Broaden Delta frame component IO coverage

Stacked on the matrix Delta support; append/overwrite semantics, distributed execution, and time travel remain out of scope.
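As a rough picture of what discovering a frame schema from the table involves, the sketch below maps Delta primitive type names to frame value types. It is an assumption-laden illustration, not the mapping used by FrameReaderDelta: `DeltaSchemaSketch`, `toValueType`, and `toSchema` are hypothetical, the local `ValueType` enum only mirrors a few SystemDS value type names, and the fallback to `STRING` for unlisted types is a choice made for this example.

```java
import java.util.Locale;

/** Hypothetical illustration only; the PR's actual type mapping may differ. */
final class DeltaSchemaSketch {

    /** Local stand-in for a subset of SystemDS value types. */
    enum ValueType { INT32, INT64, FP32, FP64, BOOLEAN, STRING }

    /** Maps a Delta primitive type name to a frame column value type. */
    static ValueType toValueType(String deltaTypeName) {
        switch (deltaTypeName.toLowerCase(Locale.ROOT)) {
            case "byte":
            case "short":
            case "integer":
                return ValueType.INT32;
            case "long":
                return ValueType.INT64;
            case "float":
                return ValueType.FP32;
            case "double":
                return ValueType.FP64;
            case "boolean":
                return ValueType.BOOLEAN;
            default:
                // assumption for this sketch: anything else is read as a string
                return ValueType.STRING;
        }
    }

    /** Builds a per-column schema from the table's column type names. */
    static ValueType[] toSchema(String[] deltaTypeNames) {
        ValueType[] schema = new ValueType[deltaTypeNames.length];
        for (int i = 0; i < deltaTypeNames.length; i++) {
            schema[i] = toValueType(deltaTypeNames[i]);
        }
        return schema;
    }
}
```

For example, `toSchema(new String[] {"long", "double", "string"})` yields `INT64, FP64, STRING`. The number of columns then follows from the schema length, while the number of rows only becomes known once the data files have been scanned, which is consistent with the metadata refresh after a Delta read described above.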
27fe2ff to d269ee7
Extend the native Delta Lake support (#2511) from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.
Stacked on #2511 and should merge after it. Append/overwrite semantics, distributed execution, and time travel remain out of scope.