Time Travel (querying the historical versions of data) ability for Hudi Table #14718

@hudi-bot

Description

Hi, all:
We plan to use Hudi to sync MySQL binlog data. A Flink ETL job will consume binlog records from Kafka and write them to Hudi once an hour. The binlog records are also grouped by hour, and all records for a given hour will be saved in a single commit. The data pipeline looks like this: binlog -> Kafka -> Flink -> Parquet.

After the data is synced to Hudi, we want to query the historical hourly versions of the Hudi table in Hive SQL.
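
To illustrate what "querying an hourly version" means here, below is a minimal sketch of resolving a point in time to a commit. It assumes commit instants are fixed-width `yyyyMMddHHmmss` strings, so that string order matches time order. The function name `resolve_instant` and the sample instants are hypothetical and are not part of any Hudi API.

```python
from bisect import bisect_right
from datetime import datetime

# Assumption: commit instants are fixed-width "yyyyMMddHHmmss" strings,
# so lexicographic order equals chronological order.
INSTANT_FORMAT = "%Y%m%d%H%M%S"


def resolve_instant(commit_instants, as_of):
    """Return the latest commit instant at or before `as_of`, or None.

    Hypothetical helper for illustration only; not a Hudi API.
    """
    ordered = sorted(commit_instants)
    target = as_of.strftime(INSTANT_FORMAT)
    # bisect_right gives the count of instants <= target.
    idx = bisect_right(ordered, target)
    return ordered[idx - 1] if idx else None


# One commit per hour, as in the pipeline described above (sample values).
commits = ["20201214090000", "20201214100000", "20201214110000"]

# A query "as of 10:30" should read the table as of the 10:00 commit.
print(resolve_instant(commits, datetime(2020, 12, 14, 10, 30)))
```

A time travel query would then read only the data visible at the resolved commit. A point in time earlier than the first commit resolves to `None`, meaning no version of the table exists yet.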

Here is a more detailed description of our issue, along with a simple design of Time Travel for Hudi. The design is still under development and testing:

[https://docs.google.com/document/d/1r0iwUsklw9aKSDMzZaiq43dy57cSJSAqT9KCvgjbtUo/edit?usp=sharing]

We need to support Time Travel soon for our business needs. We have also seen [RFC 07|https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table].
We would be glad to receive any suggestions or discussion.

Comments

14/Dec/20 16:07;xleesf;[~qian heng] Sorry, I could not access the Google doc you provided. It would be better if you sent a discussion email to the dev ML. ;;;


14/Dec/20 18:50;nishith29;[~qian heng] As [~xleesf] pointed out, I was also unable to access the Google doc. Could you please start a discussion thread on the dev mailing list? This will help you get feedback from other members as well. Based on that, we can see whether this needs a separate RFC or whether we can make changes to RFC-07.;;;


15/Dec/20 00:02;vinoth;+1. If we can keep discussions on the mailing list and then move them to the cWiki, that would be great.

Happy to provide any access/permissions as needed. ;;;


15/Dec/20 06:20;qian heng;The doc is accessible now, sorry for the mistake.;;;


12/Mar/22 14:12;xushiyan;[~x1q1j1] can you please go through the description and design doc to see if any further work is needed?;;;


13/Mar/22 05:39;x1q1j1;Hi [~qian heng]
1. SparkSQL already supports time travel queries on Hudi tables (HUDI-3221).
2. Hive SQL requires adding syntax support to the Hive source code. (This will be implemented after Presto.)
3. Presto/Trino SQL support for time travel queries on Hudi tables will be next.;;;
