Support "Delta Format Sharing" #341

Closed
linzhou-db opened this issue Jul 6, 2023 · 0 comments

linzhou-db commented Jul 6, 2023

This is a proposal to support "Delta Format Sharing" in Delta Sharing Protocol.

Context: Advanced Delta Features

With advanced Delta features such as DeletionVectors and ColumnMapping, Delta is no longer a Parquet-only protocol. To keep up with these features, we propose upgrading the Delta Sharing protocol to support "delta format sharing": the server returns the shared table in Delta format, and the recipient uses the existing Delta Spark library to read the data. The benefit is that delta-sharing-spark no longer has to duplicate the code needed to support each newly created advanced Delta feature.

Delta Format Sharing

The idea is to transfer the Delta log from the provider to the recipient via Delta Sharing HTTP requests, construct a local Delta log on the recipient side, and use the Delta Spark library to read the data from that log.

Protocol Changes

The Delta Sharing protocol will introduce a new HTTP request header, delta-sharing-capabilities. Its value is a comma-separated list of capabilities, each of the form capability_key=capability_value. Example: delta-sharing-capabilities:responseFormat=delta,readerfeatures=deletionVectors,columnMapping.
An upgraded Delta Sharing server that can handle the new header will parse it and prepare the response accordingly, ignoring any capability that it cannot handle or whose value it does not recognize. However, it will return an error if the shared table requires a capability that is not specified in the header, because listing a capability in the header is how the client indicates that it supports it.
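
To make the header handling concrete, here is a minimal sketch of how a server might parse delta-sharing-capabilities and apply the rules above. It is not the actual server code. The names `parse_capabilities`, `choose_response_format`, `UnsupportedTableFeatureError`, and `SUPPORTED_RESPONSE_FORMATS` are hypothetical. It also assumes that a comma-separated token without `=` continues the value list of the preceding capability (as in `readerfeatures=deletionVectors,columnMapping`), that keys and values are compared case-insensitively, and that the server falls back to `parquet` when no usable responseFormat is given.

```python
from typing import Dict, Iterable, List

# Hypothetical server-side setting: the response formats this server can produce.
SUPPORTED_RESPONSE_FORMATS = {"parquet", "delta"}


class UnsupportedTableFeatureError(Exception):
    """Raised when the table needs a reader feature the client did not list."""


def parse_capabilities(header_value: str) -> Dict[str, List[str]]:
    """Parse a delta-sharing-capabilities header value into {key: [values]}."""
    capabilities: Dict[str, List[str]] = {}
    current_key = None
    for token in header_value.split(","):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            current_key = key.strip().lower()
            capabilities.setdefault(current_key, []).append(value.strip().lower())
        elif current_key is not None:
            # Assumption: a bare token extends the previous capability's values.
            capabilities[current_key].append(token.lower())
    return capabilities


def choose_response_format(header_value: str,
                           table_reader_features: Iterable[str]) -> str:
    """Return the response format to use, or raise if the client lacks a feature."""
    capabilities = parse_capabilities(header_value)
    client_features = set(capabilities.get("readerfeatures", []))
    missing = {f.lower() for f in table_reader_features} - client_features
    if missing:
        raise UnsupportedTableFeatureError(
            "client did not declare support for: " + ", ".join(sorted(missing))
        )
    # Unknown capability keys are never looked up, so they are ignored;
    # unrecognized responseFormat values are skipped here.
    for requested in capabilities.get("responseformat", []):
        if requested in SUPPORTED_RESPONSE_FORMATS:
            return requested
    return "parquet"  # assumed default when nothing usable was requested
```

With the example header from the proposal, `choose_response_format(header, ["deletionVectors"])` selects `delta`, while a table that needs deletionVectors and a header that omits readerfeatures raises the error.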

If the request header contains responseFormat=delta and the Delta Sharing server can handle it, the server adds a similar header to the response to indicate that the request was honored: delta-sharing-capabilities:responseFormat=delta. Each line in the response is then a JSON object that can be parsed as a Delta action, so the client can assemble the lines into a Delta log. The only difference from a regular Delta log is that each path is a pre-signed URL, so the client needs to read the data through that URL.
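
As a rough illustration of the client side, the sketch below writes response lines into a local Delta log directory and lists the pre-signed URLs carried by the add actions. It assumes each response line is a bare Delta action such as `{"add": {...}}`, as described above, and that all lines go into a single commit file. The function names `write_local_delta_log` and `presigned_paths` are hypothetical; the `_delta_log` directory and the 20-digit zero-padded commit file name follow the Delta transaction log layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_local_delta_log(response_lines: Iterable[str],
                          table_dir: str,
                          version: int = 0) -> Path:
    """Write response lines as one commit file under <table_dir>/_delta_log."""
    log_dir = Path(table_dir) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Delta commit files are named with a 20-digit zero-padded version.
    commit_file = log_dir / f"{version:020d}.json"
    with commit_file.open("w", encoding="utf-8") as out:
        for line in response_lines:
            line = line.strip()
            if not line:
                continue
            action = json.loads(line)  # fail early on malformed JSON
            out.write(json.dumps(action) + "\n")
    return commit_file


def presigned_paths(response_lines: Iterable[str]) -> List[str]:
    """Collect the path of every add action; here each path is a pre-signed URL."""
    paths = []
    for line in response_lines:
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            paths.append(action["add"]["path"])
    return paths


if __name__ == "__main__":
    # Hypothetical response body; the URL is a placeholder.
    lines = [
        '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}',
        '{"add": {"path": "https://example.com/part-0.parquet?sig=abc", "size": 10}}',
    ]
    with tempfile.TemporaryDirectory() as table_dir:
        print(write_local_delta_log(lines, table_dir))
        print(presigned_paths(lines))
```

Reading the data files still requires a file system layer that understands pre-signed URLs, which is what the library changes below are for.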

Library Changes

To support this, we need to restructure the Delta Sharing libraries. We'll launch a delta-sharing-client library containing two core pieces of functionality: (1) the Delta Sharing client and related utils that handle HTTP requests to, and responses from, the Delta Sharing server, and (2) the Delta Sharing file system and related utils that handle reading data from pre-signed URLs and refreshing those URLs. With responseFormat=delta, the Delta Sharing client won't parse the JSON lines itself; it will let the Delta Spark library parse and handle them.
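
The pre-signed URL refreshing mentioned above could look roughly like the following sketch. It is only an illustration of the idea, not the delta-sharing-client design: the `PresignedUrl` and `PresignedUrlCache` names, the `refresher` callback (which would ask the Delta Sharing server for a fresh URL), the per-file identifier, and the 60-second safety margin are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PresignedUrl:
    url: str
    expires_at: float  # expiry as epoch seconds (assumed to be known to the client)


class PresignedUrlCache:
    """Hand out pre-signed URLs, refreshing each one shortly before it expires."""

    def __init__(self,
                 refresher: Callable[[str], PresignedUrl],
                 refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._refresher = refresher      # hypothetical: fetches a new URL from the server
        self._margin = refresh_margin_s  # refresh when less than this much time remains
        self._clock = clock              # injectable so the cache can be tested
        self._entries: Dict[str, PresignedUrl] = {}

    def get(self, file_id: str) -> str:
        entry = self._entries.get(file_id)
        if entry is None or entry.expires_at - self._clock() <= self._margin:
            entry = self._refresher(file_id)
            self._entries[file_id] = entry
        return entry.url
```

A file system implementation would call `get(file_id)` right before each read, so that long-running queries do not fail on an expired URL.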

We'll continue to release the delta-sharing-spark library with the rest of the functionality, including the data source, the streaming source, options, etc. However, all of its code will move from delta-io/delta-sharing to delta-io/delta so that it can use the Delta classes and libraries to construct a Delta log, read data, and finally serve the DataFrame to the query.

linzhou-db self-assigned this on Jul 6, 2023
linzhou-db changed the title from 'Support responseFormat=delta in delta sharing' to 'Support "Delta Format Sharing" in Delta Sharing Protocol' on Aug 2, 2023
linzhou-db changed the title from 'Support "Delta Format Sharing" in Delta Sharing Protocol' to 'Support "Delta Format Sharing"' on Aug 29, 2023
MrPowers added this to the 1.0.0 milestone on Aug 29, 2023