[HUDI-2243] Support Time Travel Query For Hoodie Table #3360

pengzhiwei2018 · 2021-07-28T07:23:42Z

What is the purpose of the pull request

Support time travel query for hoodie table for both COW and MOR table.
spark.read.format("hudi").option("as.of.instant", "20210728141108").load(basePath).show()
spark.read.format("hudi").option("as.of.instant", "2021-07-28 14: 11: 08").load(basePath).show()
spark.read.format("hudi").option("as.of.instant", "2021-07-28").load(basePath).show()

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

hudi-bot · 2021-07-28T07:26:30Z

CI report:

8286f42424b6dfab84ef0968be9d361622eac8e0 UNKNOWN
7a178e1783036a3c4a66709bc63e36aecb66ac0c UNKNOWN
e5c39aea06dd4c12439afa19059baf3288449973 UNKNOWN
62e2ef8b00f59b833fe9ba16f65897cd6a9389ba UNKNOWN
c7666c2c9e4286ee9e53504af8a4c522e451a439 UNKNOWN
096eea6 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run travis re-run the last Travis build
@hudi-bot run azure re-run the last Azure build

codope · 2021-07-30T04:10:23Z

...rk-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestTimeTravelQuery.scala

+    val df2 = Seq((1, "a1", 12, 1001)).toDF("id", "name", "value", "version")
+    df2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)


Shouldn't we set this (and other instances below) to tableType just like on line 70?

Good catch!

codope · 2021-07-30T04:23:45Z

hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java

@@ -58,6 +58,11 @@
     */
    Stream<HoodieBaseFile> getLatestBaseFiles();

+    /**


nit:

/** * Stream all the latest version data files across partitions with precondition that commitTime(file) before * maxCommitTime. */

More in line with the existing doc. What do you think?

make sense!

codope · 2021-07-30T04:26:59Z

...rk-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestTimeTravelQuery.scala

+    metaClient.reloadActiveTimeline()
+    val secondCommit = metaClient.getActiveTimeline.filterCompletedInstants().lastInstant().get().getTimestamp
+
+    // Third write


Wondering what happens when clean commits are interleaved in between, say as.of.instant is 1002 and there are couple of clean commits before that. I believe the behavior would be same as we have today when latest instant is passed?

Yes, I think so. It should be the same behavior with query the latest instant.

codope

@pengzhiwei2018 Thanks for this. Maybe we could have couple of follow on tasks.

Allow user to specify as.of.instant in YYYY-MM-DD to YYYY-MM-DD hh:mm:ss formts.
Support this with SQL dml as well, e.g. select a,b,c from hudi_table AS OF 20210728141108 . This would really help useers to rollback using CTAS directly. What do you think?

pengzhiwei2018 · 2021-07-30T06:07:39Z

@pengzhiwei2018 Thanks for this. Maybe we could have couple of follow on tasks.

Allow user to specify as.of.instant in YYYY-MM-DD to YYYY-MM-DD hh:mm:ss formts.

Support this with SQL dml as well, e.g. select a,b,c from hudi_table AS OF 20210728141108 . This would really help useers to rollback using CTAS directly. What do you think?

Both Agreed! I will submit a PR to support time travel query for spark sql after #3277 has merged as we need do some sql extension for hudi-spark module based on that PR which has did a lot of basic work.
For question 1, This PR has supported now!

codope · 2021-08-04T03:07:27Z

@pengzhiwei2018 Thanks for this. Maybe we could have couple of follow on tasks.

Allow user to specify as.of.instant in YYYY-MM-DD to YYYY-MM-DD hh:mm:ss formts.

Support this with SQL dml as well, e.g. select a,b,c from hudi_table AS OF 20210728141108 . This would really help useers to rollback using CTAS directly. What do you think?

Both Agreed! I will submit a PR to support time travel query for spark sql after #3277 has merged as we need do some sql extension for hudi-spark module based on that PR which has did a lot of basic work.
For question 1, This PR has supported now!

@pengzhiwei2018 Thanks for quickly adding the first suggestion. This diff looks good to me. Can you resolve the conflicts and then we can land it?

pengzhiwei2018 · 2021-08-04T11:06:01Z

@pengzhiwei2018 Thanks for this. Maybe we could have couple of follow on tasks.

Allow user to specify as.of.instant in YYYY-MM-DD to YYYY-MM-DD hh:mm:ss formts.

Support this with SQL dml as well, e.g. select a,b,c from hudi_table AS OF 20210728141108 . This would really help useers to rollback using CTAS directly. What do you think?

Both Agreed! I will submit a PR to support time travel query for spark sql after #3277 has merged as we need do some sql extension for hudi-spark module based on that PR which has did a lot of basic work.
For question 1, This PR has supported now!

@pengzhiwei2018 Thanks for quickly adding the first suggestion. This diff looks good to me. Can you resolve the conflicts and then we can land it?

The PR has updated to solve the conflicts. Please take a review again~

nsivabalan

added feedback for source code

nsivabalan · 2021-08-05T23:34:06Z

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java

+  public final Stream<HoodieBaseFile> getLatestBaseFilesBeforeOrOn(String maxCommitTime) {
+    try {
+      readLock.lock();
+      return fetchAllStoredFileGroups()


I see an opportunity for code re-use between this and getLatestBaseFilesBeforeOrOn(String partitionStr, String maxCommitTime)(lines 470 to 486).

Infact we could change the signature of existing method to

getLatestBaseFilesBeforeOrOn(Option<String> partitionStr, String maxCommitTime)

and not introduce a new method.

nsivabalan · 2021-08-05T23:40:12Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

@@ -162,28 +163,19 @@
  }

  // Return parquet file with a list of log files in the same file group.
-  public static List<Pair<Option<HoodieBaseFile>, List<String>>> groupLogsByBaseFile(Configuration conf, List<Path> partitionPaths) {
+  public static List<Pair<Option<HoodieBaseFile>, List<String>>> groupLogsByBaseFile(HoodieTableMetaClient metaClient,


are the changes in this method an optimization or is there anything required for this patch as such?
I am not aware of why this was designed this way. But there should a reason for it. Lets take this up once we have the release. so that we can consult w/ vinoth on the improvisations.
Can we please revert those changes not really required for this patch.
I meant the perpartitionMetaclient related changes.

Yes, I think we can revert these changes.

nsivabalan · 2021-08-05T23:45:12Z

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -238,7 +249,12 @@ case class HoodieFileIndex(
      case (_, _) =>
        // Fetch and store latest base files and its sizes
        cachedAllInputFileSlices = partitionFiles.map(p => {
-          (p._1, fileSystemView.getLatestFileSlices(p._1.partitionPath).iterator().asScala.toSeq)
+          val fileSlices = (if (queryInstant.isDefined) {
+            fileSystemView.getLatestFileSlicesBeforeOrOn(p._1.partitionPath, queryInstant.get, true)


Is it possible to do Option.map().OrElse() to make it nicer.

I will try it.

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlUtils.scala

nsivabalan · 2021-08-05T23:47:52Z

...imeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/BaseFileHandler.java

-    return viewManager.getFileSystemView(basePath).getLatestBaseFiles().map(BaseFileDTO::fromHoodieBaseFile)
-        .collect(Collectors.toList());
+  public List<BaseFileDTO> getLatestDataFiles(String basePath, Option<String> maxCommitTime) {
+    if (maxCommitTime.isPresent()) {


if possible, Option.map().OrElse()

nsivabalan · 2021-08-06T14:26:15Z

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala


    (tableType, queryType) match {
      case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) =>
        // Fetch and store latest base and log files, and their sizes
        cachedAllInputFileSlices = partitionFiles.map(p => {
          val latestSlices = if (activeInstants.lastInstant().isPresent) {
-           fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath,
-             activeInstants.lastInstant().get().getTimestamp).iterator().asScala.toSeq
+            fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath, queryInstant.get)


shouldn't line no 234 be like
if (queryInstant.isPresent)

Good catch!

nsivabalan · 2021-08-06T14:29:07Z

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -238,7 +249,11 @@ case class HoodieFileIndex(
      case (_, _) =>
        // Fetch and store latest base files and its sizes
        cachedAllInputFileSlices = partitionFiles.map(p => {
-          (p._1, fileSystemView.getLatestFileSlices(p._1.partitionPath).iterator().asScala.toSeq)
+          val fileSlices = queryInstant


Just so I understand correctly. Here, we are changing existing behavior. Even if not for time travel query, previsouly we were calling fileSystemView.getLatestFileSlices. But now, since we assign queryInstant upfront (either to specified query or latest instant), we will call fileSystemView.getLatestFileSlicesBeforeOrOn.

Is that intentional ?

I will change the code to keep the origin behavior for non-time travel query.

nsivabalan · 2021-08-06T23:34:49Z

@hudi-bot run azure

nsivabalan · 2021-08-07T00:32:15Z

@hudi-bot run azure

pengzhiwei2018 · 2021-08-07T05:46:03Z

@hudi-bot run azure

pengzhiwei2018 · 2021-08-07T12:45:08Z

@hudi-bot run azure

tooptoop4 · 2021-08-08T01:56:08Z

trinodb/trino#8773 after that maybe hudi time travel could be added to trino

pengzhiwei2018 force-pushed the dev_time_travel_df branch from 07ef77f to 6b944e2 Compare July 28, 2021 11:10

vinothchandar added the priority:blocker label Jul 29, 2021

codope reviewed Jul 30, 2021

View reviewed changes

pengzhiwei2018 force-pushed the dev_time_travel_df branch 2 times, most recently from 7fe97f7 to f421cd8 Compare July 30, 2021 07:19

pengzhiwei2018 force-pushed the dev_time_travel_df branch from f421cd8 to e84a04f Compare August 4, 2021 11:01

pengzhiwei2018 force-pushed the dev_time_travel_df branch 2 times, most recently from 1bd1655 to b9e30d2 Compare August 5, 2021 11:32

vinothchandar assigned nsivabalan Aug 5, 2021

nsivabalan requested changes Aug 5, 2021

View reviewed changes

pengzhiwei2018 force-pushed the dev_time_travel_df branch 5 times, most recently from d70057b to f13bafc Compare August 6, 2021 12:28

nsivabalan reviewed Aug 6, 2021

View reviewed changes

pengzhiwei2018 force-pushed the dev_time_travel_df branch from d9d348f to c6d8da9 Compare August 6, 2021 15:39

nsivabalan approved these changes Aug 6, 2021

View reviewed changes

pengzhiwei2018 force-pushed the dev_time_travel_df branch 2 times, most recently from 77bf630 to a58afb7 Compare August 7, 2021 04:30

[HUDI-2243] Support Time Travel Query For Hoodie Table

096eea6

pengzhiwei2018 force-pushed the dev_time_travel_df branch from a58afb7 to 096eea6 Compare August 7, 2021 12:44

nsivabalan merged commit 32a50d8 into apache:master Aug 7, 2021

liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request Aug 12, 2021

[HUDI-2243] Support Time Travel Query For Hoodie Table (apache#3360)

b51f3d5

This was referenced Aug 25, 2021

[SUPPORT] Point query at hudi tables #3054

Closed

[SUPPORT]How to query history snapshot by given one history partition? #3005

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HUDI-2243] Support Time Travel Query For Hoodie Table #3360

[HUDI-2243] Support Time Travel Query For Hoodie Table #3360

pengzhiwei2018 commented Jul 28, 2021 •

edited

Loading

hudi-bot commented Jul 28, 2021 •

edited

Loading

codope Jul 30, 2021 •

edited

Loading

pengzhiwei2018 Jul 30, 2021

codope Jul 30, 2021

pengzhiwei2018 Jul 30, 2021

codope Jul 30, 2021

pengzhiwei2018 Jul 30, 2021

codope left a comment

pengzhiwei2018 commented Jul 30, 2021 •

edited

Loading

codope commented Aug 4, 2021

pengzhiwei2018 commented Aug 4, 2021

nsivabalan left a comment

nsivabalan Aug 5, 2021

nsivabalan Aug 5, 2021

nsivabalan Aug 5, 2021

pengzhiwei2018 Aug 6, 2021

nsivabalan Aug 5, 2021

pengzhiwei2018 Aug 6, 2021

nsivabalan Aug 5, 2021

nsivabalan Aug 6, 2021

pengzhiwei2018 Aug 6, 2021

nsivabalan Aug 6, 2021

pengzhiwei2018 Aug 6, 2021

nsivabalan commented Aug 6, 2021

nsivabalan commented Aug 7, 2021

pengzhiwei2018 commented Aug 7, 2021

pengzhiwei2018 commented Aug 7, 2021

tooptoop4 commented Aug 8, 2021

[HUDI-2243] Support Time Travel Query For Hoodie Table #3360

[HUDI-2243] Support Time Travel Query For Hoodie Table #3360

Conversation

pengzhiwei2018 commented Jul 28, 2021 • edited Loading

What is the purpose of the pull request

Brief change log

Verify this pull request

Committer checklist

hudi-bot commented Jul 28, 2021 • edited Loading

CI report:

codope Jul 30, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codope left a comment

Choose a reason for hiding this comment

pengzhiwei2018 commented Jul 30, 2021 • edited Loading

codope commented Aug 4, 2021

pengzhiwei2018 commented Aug 4, 2021

nsivabalan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nsivabalan commented Aug 6, 2021

nsivabalan commented Aug 7, 2021

pengzhiwei2018 commented Aug 7, 2021

pengzhiwei2018 commented Aug 7, 2021

tooptoop4 commented Aug 8, 2021

pengzhiwei2018 commented Jul 28, 2021 •

edited

Loading

hudi-bot commented Jul 28, 2021 •

edited

Loading

codope Jul 30, 2021 •

edited

Loading

pengzhiwei2018 commented Jul 30, 2021 •

edited

Loading