Delta LogStore Refactor - Project Plan #951

Closed
scottsand-db opened this issue Feb 23, 2022 · 0 comments
Labels
good first issue Good for newcomers
scottsand-db commented Feb 23, 2022

Overview and Requirements

Hi everyone - help is wanted!

This is the official project plan tracking the work to refactor Delta's LogStore classes into a new artifact, delta-storage, and to rewrite them in Java (instead of Scala). The Delta LogStore is a general interface for all critical file system operations required to read and write the Delta log.
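
To make the idea of a Java LogStore concrete, here is a minimal sketch. It is not the actual delta-storage API: the names SimpleLogStore and LocalSimpleLogStore are hypothetical, and it uses java.nio.file on a local directory instead of Hadoop's file system classes so that it runs without extra dependencies. It shows the three kinds of operations such an interface needs to cover: reading a log file, writing one (failing if the file already exists unless overwrite is requested), and listing log files from a given name onward.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical, simplified interface for illustration only.
// It is NOT the real io.delta.storage API.
interface SimpleLogStore {
    /** Reads the given log file as a list of lines. */
    List<String> read(Path path) throws IOException;

    /** Writes lines to path; if overwrite is false and the file exists, this must fail. */
    void write(Path path, List<String> lines, boolean overwrite) throws IOException;

    /** Lists files in path's parent directory whose names are >= path's name, sorted. */
    List<Path> listFrom(Path path) throws IOException;
}

// Assumption: a local file system is good enough for a sketch. This class makes no
// claim about atomic visibility of partially written files.
final class LocalSimpleLogStore implements SimpleLogStore {
    @Override
    public List<String> read(Path path) throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    @Override
    public void write(Path path, List<String> lines, boolean overwrite) throws IOException {
        if (overwrite) {
            // With no options, Files.write creates the file or truncates an existing one.
            Files.write(path, lines, StandardCharsets.UTF_8);
        } else {
            // CREATE_NEW throws FileAlreadyExistsException if the file is already there.
            Files.write(path, lines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        }
    }

    @Override
    public List<Path> listFrom(Path path) throws IOException {
        Path dir = path.getParent(); // assumes path has a parent directory
        String start = path.getFileName().toString();
        List<Path> result = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().compareTo(start) >= 0)
                 .forEach(result::add);
        }
        Collections.sort(result);
        return result;
    }
}

// Small demo; the file name follows the zero-padded version naming of Delta log files.
final class SimpleLogStoreDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("delta-log-sketch");
        SimpleLogStore store = new LocalSimpleLogStore();
        Path v0 = dir.resolve("00000000000000000000.json");
        store.write(v0, Arrays.asList("{\"commitInfo\":{}}"), false);
        System.out.println(store.read(v0));
        System.out.println(store.listFrom(v0));
    }
}
```

A second write to the same path with overwrite set to false fails with FileAlreadyExistsException in this sketch, so two writers cannot both create the same log entry. A real implementation would live in storage/src/main/java/io/delta/storage and would have to satisfy whatever guarantees the existing Scala LogStore documents.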

There are a variety of reasons for this initiative.

  1. Reduce code duplication. Currently, both the Delta Lake OSS and Delta Standalone libraries require access to this interface. However, without a separate LogStore artifact to depend on, any implementation needs to be duplicated across both of these repos. We'd like to avoid that.
  2. Remove the Apache Spark™ dependency. Currently, the LogStore interface that the delta-core and delta-contribs artifacts use is contained within delta-core. This means any downstream project inherently has to depend on Spark. As Delta Standalone is distinctly Spark-less, the current dependency hierarchy won't work.
  3. No redundant Scala cross publishing. These LogStore implementations don't use any fancy Scala language features, and by rewriting the relatively lightweight implementations in Java we can avoid the various headaches and overhead that supporting a cross-published Scala artifact can bring.
  4. This will enable us to support new lightweight and specific LogStore artifacts in the future. For example, for our goal to support S3 multi-cluster writes, we aim to have the DynamoDBLogStore (with its unique AWS SDK dependency) as its own artifact. This ensures that the specific AWS dependency isn't brought into other artifacts (e.g. delta-contribs).

How to Contribute

  • For any of the LogStores below, please comment on the issue letting us know you'd like to work on it.
  • Leave the Scala file alone for now, and create the corresponding Java file inside of storage/src/main/java/io/delta/storage. Refactor the LogStore here.
  • Add a new test suite to core/src/test/scala/org/apache/spark/sql/delta/LogStoreSuite.scala, much like PublicHDFSLogStoreSuite.
  • Submit your PR for review.
  • See this PR as an example.

Project Status

LogStore                                    Issue   PR      Status
Initial setup.                              N/A     #925    DONE
HadoopFileSystemLogStore and HDFSLogStore   N/A     #933    DONE
S3SingleDriverLogStore                      #952    #995    DONE
AzureLogStore                               #953    #1003   DONE
DelegatingLogStore                          #954    #1041   DONE
LocalLogStore                               #955    #1002   DONE
GCSLogStore                                 #956    #1024   DONE
S3DynamoDBLogStore                          #339    #1023   DONE