[FEATURE] Introduce external storage API for Delta Log based on FoundationDB #867

Closed
renardeinside opened this issue Dec 15, 2021 · 3 comments
Labels
acknowledged This issue has been read and acknowledged by Delta admins enhancement New feature or request

Comments

@renardeinside

renardeinside commented Dec 15, 2021

A very good concept for storing the Delta Log in DynamoDB was introduced in this PR.

However, DynamoDB might not be the tool of choice for users who run on other clouds or have an on-premise setup. Distributed KV stores with transaction support, such as FoundationDB, could be a nice extension for Delta Log storage.

Another benefit is that such an external system makes it possible to introduce multi-table transactions with an explicit API. For example (just a concept; the final API would need to be revised):

val transaction = DeltaTransactionManager.beginTransaction()

df1.write.format("delta").withTransaction(transaction).saveAsTable("some_db.some_table_1")
df2.write.format("delta").withTransaction(transaction).saveAsTable("some_db.some_table_2")

transaction.commit()
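
To make the idea above a bit more concrete, here is a minimal in-memory sketch of what "all tables commit or none do" could mean on top of a transactional KV store. Every name in it (`KvStore`, `MultiTableTransaction`, `stage`, `commitAll`) is hypothetical and exists only for illustration; it is neither Delta Lake nor FoundationDB API, and a plain `Map` guarded by `synchronized` stands in for the store's real transaction machinery.

```scala
// Hypothetical stand-in for a transactional KV store (not a real client API).
final class KvStore {
  private var data = Map.empty[String, String]

  // Applies every write or none: the whole batch is rejected
  // if any of its keys has already been written.
  def commitAll(writes: Map[String, String]): Boolean = synchronized {
    if (writes.keys.exists(data.contains)) false
    else { data = data ++ writes; true }
  }

  def get(key: String): Option[String] = synchronized(data.get(key))
}

// Hypothetical transaction handle that stages one log entry per table.
final class MultiTableTransaction(store: KvStore) {
  private var staged = Map.empty[String, String]

  // Stage the actions for a table at the version this writer wants to claim.
  def stage(table: String, version: Long, actions: String): Unit =
    staged = staged.updated(s"$table/$version", actions)

  // Either every staged table version becomes visible, or none of them does.
  def commit(): Boolean = store.commitAll(staged)
}
```

In this sketch, if two transactions both try to claim the same version of one table, the loser's writes to its other tables are discarded as well, which is the behaviour a multi-table commit would need.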
@scottsand-db
Collaborator

Just to clarify, the linked PR doesn't really "store the Delta Log in DynamoDB". DynamoDB is only used to provide mutual exclusion, a feature that S3 lacks because it has no "put-if-absent" API. Mutual exclusion is one of the three properties on which Delta Lake's ACID guarantees are predicated.
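
As a rough illustration of what put-if-absent buys, here is a minimal sketch in which a `ConcurrentHashMap` stands in for the external store. The object and method names (`PutIfAbsentSketch`, `tryCommit`, `winner`) are assumptions made for this example, not the API of the linked PR; the only Delta-specific detail is the zero-padded, 20-digit version file name under `_delta_log`.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for an external store with put-if-absent.
object PutIfAbsentSketch {
  private val entries = new ConcurrentHashMap[String, String]()

  // Key mirrors the log file name for a given table version.
  private def key(tablePath: String, version: Long): String =
    f"$tablePath/_delta_log/$version%020d.json"

  // Returns true only for the first writer that claims this version.
  def tryCommit(tablePath: String, version: Long, writerId: String): Boolean =
    // putIfAbsent returns null when the key had no previous mapping.
    entries.putIfAbsent(key(tablePath, version), writerId) == null

  // Which writer, if any, owns the given version.
  def winner(tablePath: String, version: Long): Option[String] =
    Option(entries.get(key(tablePath, version)))
}
```

Two writers racing for the same version cannot both succeed here; the loser would have to re-read the log and retry at the next version.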

Are you interested in storing the entire DeltaLog in DynamoDB / FoundationDB? If so, are there specific use cases as to why you prefer to do so?

Thanks for making this issue and prompting this discussion!

@scottsand-db scottsand-db added enhancement New feature or request acknowledged This issue has been read and acknowledged by Delta admins labels Dec 15, 2021
@renardeinside
Author

renardeinside commented Dec 16, 2021

Hi @scottsand-db ,

> Are you interested in storing the entire DeltaLog in DynamoDB / FoundationDB? If so, are there specific use cases as to why you prefer to do so?

Yes, I think this is at least an interesting technical idea worth trying out. Here are some potential use cases and benefits of this approach.

Potential use cases:

  • Low-latency streaming (for instance, setting triggerTime to 1 second or to continuous mode blows up the ListFiles API costs).
  • HDFS-based use cases. In HDFS the listing operation always hits the HDFS master node, so users with multi-million-file tables might suffer from heavy load when listing file names. I understand that it's 2021 and HDFS is heavily outdated, but some people still work this way.

More opportunities:

Using a fast, transactional, distributed KV-like store with decent key-listing performance as a metadata layer opens the door to further optimizations, such as:

  • File metadata (for example, counts and indexes) can be stored in such a structure alongside the file location in S3.
  • A potential unlock for OLTP use cases, where part of the data is additionally written to and read from an SSD-based cache, and the metadata layer keeps pointers to both the SSD cache and the S3 file.
  • Potential support for re-partitioning on the fly, by keeping a map between each partition and its set of files inside the metadata store instead of building hierarchical structures in S3.
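
The listing and file-metadata points above can be sketched roughly as follows. An immutable `TreeMap` stands in for an ordered, distributed KV store; `FileMeta`, `MetadataLayer`, and the key layout `table/partition/location` are assumptions made up for this example, not an existing Delta Lake or FoundationDB schema.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical metadata record; the field names are illustrative only.
final case class FileMeta(location: String, rowCount: Long)

// An ordered in-memory map stands in for an ordered, distributed KV store.
final case class MetadataLayer(
    entries: TreeMap[String, FileMeta] = TreeMap.empty[String, FileMeta]) {

  // Register a data file under its table and partition.
  def addFile(table: String, partition: String, meta: FileMeta): MetadataLayer =
    copy(entries = entries.updated(s"$table/$partition/${meta.location}", meta))

  // Listing is a key-range scan over a prefix, not an object-store LIST call.
  def listFiles(table: String, partition: String): Vector[FileMeta] = {
    val prefix = s"$table/$partition/"
    entries.iteratorFrom(prefix)
      .takeWhile { case (key, _) => key.startsWith(prefix) }
      .map { case (_, meta) => meta }
      .toVector
  }

  // Row count answered from stored statistics, without opening any data file.
  def rowCount(table: String, partition: String): Long =
    listFiles(table, partition).map(_.rowCount).sum
}
```

Because the partition is only part of the key, moving a file to a different partition in this sketch means rewriting one key rather than moving the object in S3.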

Some of these designs are already applied in other systems, for example, Firebolt and Iceberg.

@scottsand-db
Collaborator

Following up on this. A LogStore that writes to FoundationDB won't work, because our LogStore APIs don't encapsulate all file system (log store) interactions, despite the name. For example, checkpoints don't go through the LogStore.


2 participants