Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor writer implementations #851

Open
wjones127 opened this issue Sep 27, 2022 · 1 comment
Open

Refactor writer implementations #851

wjones127 opened this issue Sep 27, 2022 · 1 comment
Labels
binding/rust Issues for the Rust crate enhancement New feature or request help wanted Extra attention is needed

Comments

@wjones127
Copy link
Collaborator

Description

For further progress on the writer, we'll need to implement features that require query engines. This includes:

We can provide a default one with DataFusion, but we will also have users that wish to plug in their own query engine. In addition, it's possible we may have users that wish to user their own Parquet writer (for distributed engines, for example). So we will likely want to refactor into three distinct layers:

  1. A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);
  2. A parametrized writer layer, who want to use their own query engine but will use the built-in data writer (you verify data; we write data and transaction);
  3. A DataFusion-based writer that handles everything (verification, writing, transaction).

I'm not sure how viable this is yet, and would welcome feedback from others.

Use Case

Related Issue(s)

@wjones127 wjones127 added enhancement New feature or request help wanted Extra attention is needed binding/rust Issues for the Rust crate labels Sep 27, 2022
@houqp
Copy link
Member

houqp commented Sep 28, 2022

A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);

If the trait only abstracts out the parquet io logic, I don't think we should call it the transaction layer? Because a transaction in delta contains more than just the data and checkpoint parquets.

I do think we should make the parquet implementation swapable through traits though, so that we can serve both arrow-rs and parquet2/arrow2 users.

The query engine trait abstraction makes total sense to me 👍

A DataFusion-based writer that handles everything (verification, writing, transaction).

Should the writer be implemented using the query engine trait? Then we only need to implement the query engine trait for Datafusion.

Overall, I think this looks like a good attack plan. It looks like we are on a good path towards a full-fledged native DeltaLake implementation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants