Refactor writer implementations #851

wjones127 · 2022-09-27T03:09:39Z

Description

For further progress on the writer, we'll need to implement features that require query engines. This includes:

Higher writer protocols:
- V2: column invariants (Write enforce_invariant() function #592)
- V3: CHECK constraints
- V4: generated columns
Write types that require rewriting files:
- DELETE
- UPDATE
- MERGE (Implement merge command #850)

We can provide a default one with DataFusion, but we will also have users that wish to plug in their own query engine. In addition, we may have users that wish to use their own Parquet writer (for distributed engines, for example). So we will likely want to refactor into three distinct layers:

- A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);
- A parametrized writer layer for those who want to use their own query engine but will use the built-in data writer (you verify data; we write data and transaction);
- A DataFusion-based writer that handles everything (verification, writing, transaction).
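
To make the split concrete, here is a rough sketch of how the three layers could stack. Every name in it (`Batch`, `AddAction`, `TransactionLog`, `Verifier`, `ParametrizedWriter`, `NonNegative`) is hypothetical and not taken from the existing crate, the batch type is a plain `Vec<i64>` rather than Arrow data, and no Parquet file is actually written.

```rust
// Hypothetical sketch: none of these names come from the delta-rs API.

/// Stand-in for a batch of rows; a real writer would more likely take Arrow
/// record batches.
pub type Batch = Vec<i64>;

/// Describes a data file that has already been written.
#[derive(Debug, Clone, PartialEq)]
pub struct AddAction {
    pub path: String,
    pub num_records: usize,
}

/// Layer 1: transaction only. Callers write data files with their own
/// Parquet writer and pass in the resulting actions to be committed.
#[derive(Default)]
pub struct TransactionLog {
    commits: Vec<Vec<AddAction>>,
}

impl TransactionLog {
    /// Appends one commit and returns its version number (starting at 0).
    pub fn commit(&mut self, actions: Vec<AddAction>) -> usize {
        self.commits.push(actions);
        self.commits.len() - 1
    }
}

/// Verification hook that a caller's query engine would implement.
pub trait Verifier {
    fn verify(&self, batch: &Batch) -> Result<(), String>;
}

/// Layer 2: the caller verifies; this writer handles data and transaction.
pub struct ParametrizedWriter<V: Verifier> {
    verifier: V,
    log: TransactionLog,
}

impl<V: Verifier> ParametrizedWriter<V> {
    pub fn new(verifier: V) -> Self {
        Self { verifier, log: TransactionLog::default() }
    }

    /// Verifies the batch, "writes" it, and commits. Returns the version.
    pub fn write(&mut self, batch: &Batch) -> Result<usize, String> {
        self.verifier.verify(batch)?;
        // Placeholder for the built-in Parquet writer: no file is created.
        let action = AddAction {
            path: format!("part-{:05}.parquet", self.log.commits.len()),
            num_records: batch.len(),
        };
        Ok(self.log.commit(vec![action]))
    }
}

/// Layer 3: a default verifier, standing in for a DataFusion-backed one.
pub struct NonNegative;

impl Verifier for NonNegative {
    fn verify(&self, batch: &Batch) -> Result<(), String> {
        if batch.iter().all(|v| *v >= 0) {
            Ok(())
        } else {
            Err("batch contains a negative value".to_string())
        }
    }
}

/// The "handles everything" writer is then just layer 2 with the default verifier.
pub type DefaultWriter = ParametrizedWriter<NonNegative>;
```

Users with their own Parquet writer would call `TransactionLog::commit` directly, users with their own engine would implement `Verifier`, and everyone else would use `DefaultWriter`.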

I'm not sure how viable this is yet, and would welcome feedback from others.

Use Case

Related Issue(s)

houqp · 2022-09-28T05:56:39Z

> A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);

If the trait only abstracts out the Parquet I/O logic, I don't think we should call it the transaction layer, because a transaction in Delta contains more than just the data and checkpoint Parquet files.

I do think we should make the Parquet implementation swappable through traits though, so that we can serve both arrow-rs and parquet2/arrow2 users.

The query engine trait abstraction makes total sense to me 👍

> A DataFusion-based writer that handles everything (verification, writing, transaction).

Should the writer be implemented using the query engine trait? Then we only need to implement the query engine trait for DataFusion.
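
As a minimal sketch of that idea, the writer below is generic over a query engine trait and checks invariants only through it, so a DataFusion backend would need nothing beyond a trait implementation. The names (`QueryEngine`, `count_violations`, `Writer`, `ToyEngine`) are made up for illustration, and the toy engine only understands predicates of the form `value > N`. It stands in for a real engine and is not DataFusion.

```rust
// Hypothetical sketch: the trait and types below are illustrative only.

/// Stand-in for a batch of rows (a single integer column named `value`).
pub type Batch = Vec<i64>;

/// The only thing the writer needs from a query engine in this sketch.
pub trait QueryEngine {
    /// Returns how many rows in `batch` fail `predicate`.
    fn count_violations(&self, batch: &Batch, predicate: &str) -> Result<usize, String>;
}

/// A writer that depends on the trait, not on any particular engine.
pub struct Writer<E: QueryEngine> {
    engine: E,
    invariants: Vec<String>,
    rows_written: usize,
}

impl<E: QueryEngine> Writer<E> {
    pub fn new(engine: E, invariants: Vec<String>) -> Self {
        Self { engine, invariants, rows_written: 0 }
    }

    /// Checks every invariant, then records the batch as written.
    /// Returns the total number of rows written so far.
    pub fn write(&mut self, batch: &Batch) -> Result<usize, String> {
        for invariant in &self.invariants {
            let violations = self.engine.count_violations(batch, invariant)?;
            if violations > 0 {
                return Err(format!(
                    "{} row(s) violate invariant `{}`",
                    violations, invariant
                ));
            }
        }
        self.rows_written += batch.len();
        Ok(self.rows_written)
    }
}

/// Toy engine that only supports predicates written as `value > N`.
pub struct ToyEngine;

impl QueryEngine for ToyEngine {
    fn count_violations(&self, batch: &Batch, predicate: &str) -> Result<usize, String> {
        let bound = predicate
            .strip_prefix("value > ")
            .ok_or_else(|| format!("unsupported predicate: {}", predicate))?
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("bad bound: {}", e))?;
        // A row violates `value > N` when it is less than or equal to N.
        Ok(batch.iter().filter(|v| **v <= bound).count())
    }
}
```

With this shape, swapping engines means swapping the type parameter, and the invariant-checking logic in `Writer::write` stays in one place.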

Overall, I think this looks like a good attack plan. It looks like we are on a good path towards a full-fledged native DeltaLake implementation!

wjones127 added the enhancement (New feature or request), help wanted (Extra attention is needed), and binding/rust (Issues for the Rust crate) labels Sep 27, 2022

MrPowers mentioned this issue Oct 4, 2022

Roadmap 2022 H2 (discussion) delta-io/delta#1307 (Open)