[HUDI-1371] [HUDI-1893] Support metadata based listing for Spark DataSource and Spark SQL #2893
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2893 +/- ##
============================================
+ Coverage 45.79% 45.84% +0.04%
- Complexity 5270 5274 +4
============================================
Files 909 908 -1
Lines 39390 39400 +10
Branches 4244 4253 +9
============================================
+ Hits 18039 18063 +24
+ Misses 19508 19480 -28
- Partials 1843 1857 +14
@pengzhiwei2018 could you also please review this PR?
That is all right! I will spend some time on this PR tonight.
  }
} catch (Exception e) {
  if (metadataConfig.enableFallback()) {
    LOG.error("Failed to retrieve files in partitions from metadata", e);
If the fallback is enabled here, an empty partitionsFilesMap will be returned if an exception happens, is that right?
Good catch. It wasn't falling back to the file system to do the listing. Will fix it.
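The fix being discussed can be sketched as follows. This is a minimal, self-contained illustration of the intended fallback behavior, not Hudi's actual API: the names `listPartitions`, `metadataLister`, and `fsLister` are hypothetical stand-ins, with plain maps in place of Hudi's partition-to-FileStatus structures.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class FallbackListing {
    // Try metadata-based listing first; if it throws and fallback is enabled,
    // run the filesystem-based lister instead of returning an empty result.
    static Map<String, List<String>> listPartitions(
            Supplier<Map<String, List<String>>> metadataLister,
            Supplier<Map<String, List<String>>> fsLister,
            boolean fallbackEnabled) {
        try {
            return metadataLister.get();
        } catch (RuntimeException e) {
            if (!fallbackEnabled) {
                throw e;
            }
            // Mirrors the LOG.error(...) in the snippet above, then falls back.
            System.err.println("Failed to retrieve files in partitions from metadata: " + e.getMessage());
            return fsLister.get();
        }
    }
}
```

The key point raised in the review: on failure the method must return the filesystem lister's result, not the partially built (empty) map.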
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
val properties = new Properties()
// To support metadata listing via Spark SQL we allow users to pass the config via Hadoop Conf. Spark SQL does not
Should we get these configurations from spark.sessionState.conf for Spark?
Makes sense. It will be a better experience than having to pass it via the Hadoop conf. Customers would be able to enable it in the Spark SQL session using SET commands.
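The precedence being agreed on here can be illustrated with a small sketch. Plain maps stand in for the Hadoop configuration and for `spark.sessionState.conf` (this is not Spark's or Hudi's actual API); the point is simply that a session-level `SET` value should win over a value passed through the Hadoop conf.

```java
import java.util.Map;
import java.util.Properties;

class ConfigPrecedence {
    // Hadoop-conf values act as defaults; session-level SET values
    // (a stand-in for spark.sessionState.conf) take precedence.
    static Properties effectiveConfig(Map<String, String> hadoopConf,
                                      Map<String, String> sessionConf) {
        Properties props = new Properties();
        props.putAll(hadoopConf);   // defaults from the Hadoop configuration
        props.putAll(sessionConf);  // session overrides win
        return props;
    }
}
```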
Force-pushed 1ce0f37 to c50f97d
@pengzhiwei2018 @vinothchandar I have further re-factored HoodieFileIndex. This ensures that the filesystem is listed just once if filesystem listing is used. In the case of metadata, it ensures the table will be read just once and no additional listing or reading is done to fetch log files.
That sounds good to me!
Force-pushed c50f97d to aa544d7
@vinothchandar @pengzhiwei2018 do take a look again.
OK, will start reviewing tomorrow. But can you fix the CI?
Force-pushed aa544d7 to 50e2f1b
Left some comments. At a high level it looks OK to me. If you both are happy, feel free to land this @umehrot2
@@ -107,34 +113,61 @@ case class HoodieFileIndex(
  }

  @transient @volatile private var fileSystemView: HoodieTableFileSystemView = _
  @transient @volatile private var cachedAllInputFiles: Array[HoodieBaseFile] = _
  @transient @volatile private var cachedAllInputFiles: Map[PartitionRowPath, Map[HoodieBaseFile, Seq[HoodieLogFile]]] = _
Could we just reuse a FileSlice object here, instead of the map of base to log files?
Made the change.
prunedPartitions.map { partition =>
  val fileStatues = fileSystemView.getLatestBaseFiles(partition.partitionPath).iterator()
So previously, we were actually performing the listing for each listFiles() call, without actually using the cached values?
No! We will cache the files in the fileSystemView. So each call of listFiles will reuse the cached values.
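The caching pattern being described (populate once, reuse on every subsequent `listFiles` call) can be sketched like this. The class and field names are hypothetical; a loaded-counter is included only to make the "listed just once" property observable.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class CachedFileView {
    private final Supplier<List<String>> loader;   // the expensive listing
    private volatile List<String> cached;          // populated on first access
    final AtomicInteger loadCount = new AtomicInteger();

    CachedFileView(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    // Every listFiles() call after the first reuses the cached listing,
    // so the underlying storage is listed exactly once.
    List<String> listFiles() {
        if (cached == null) {
            synchronized (this) {
                if (cached == null) {
                    loadCount.incrementAndGet();
                    cached = loader.get();
                }
            }
        }
        return cached;
    }
}
```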
override def inputFiles: Array[String] = {
  cachedAllInputFiles.map(_.getFileStatus.getPath.toString)
  cachedAllInputFiles.values.flatten.flatMap(baseLogFilesMapping => {
    Iterator(baseLogFilesMapping._1.getPath) ++ baseLogFilesMapping._2.map(_.getFileStatus.getPath.toString)
Any chance we can reuse code across this method and allFiles below?
Done.
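The de-duplication asked for above amounts to a single helper that both `inputFiles` and `allFiles` delegate to. A sketch, with plain strings standing in for `HoodieBaseFile`/`HoodieLogFile` and a hypothetical helper name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FileFlattener {
    // Shared helper: flatten partition -> (baseFile -> logFiles) into one
    // list of paths, so inputFiles and allFiles need not duplicate the loop.
    static List<String> allFilePaths(Map<String, Map<String, List<String>>> partitionFiles) {
        List<String> paths = new ArrayList<>();
        for (Map<String, List<String>> baseToLogs : partitionFiles.values()) {
            for (Map.Entry<String, List<String>> e : baseToLogs.entrySet()) {
                paths.add(e.getKey());      // base file path
                paths.addAll(e.getValue()); // its log file paths
            }
        }
        return paths;
    }
}
```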
case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) =>
  // Fetch and store latest base and log files, and their sizes
  cachedAllInputFiles = partitionFiles.map(p => {
    val latestSlices = fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath, activeInstants.lastInstant().get().getTimestamp)
Any error handling needed for the case where the timeline is empty?
If there is no successful commit yet, activeInstants.lastInstant().get() may crash the query. So we'd better return an empty file list.
Added check.
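The guard being added is essentially an empty-Optional check before dereferencing the last instant: short-circuit to an empty listing rather than calling `get()` on an empty `Optional`. A minimal sketch with hypothetical names (`latestFiles`, `slicesBeforeOrOn`), using a string timestamp in place of a Hudi instant:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class TimelineGuard {
    // If no commit has completed, lastInstant() is empty and calling get()
    // would throw NoSuchElementException; return an empty listing instead.
    static List<String> latestFiles(Optional<String> lastInstant,
                                    Function<String, List<String>> slicesBeforeOrOn) {
        if (!lastInstant.isPresent()) {
            return Collections.emptyList();
        }
        return slicesBeforeOrOn.apply(lastInstant.get());
    }
}
```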
cachedAllInputFiles = partitionFiles.map(p => {
  val latestSlices = fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath, activeInstants.lastInstant().get().getTimestamp)
  val baseAndLogFilesMapping = latestSlices.iterator().asScala.map(slice => {
    (slice.getBaseFile.get(), slice.getLogFiles.sorted(HoodieLogFile.getLogFileComparator).iterator().asScala.toSeq)
Worth double-checking that the comparator is actually sorting in the desired order. Just a random word of caution.
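That kind of ordering check is cheap to make explicit. The sketch below does not use Hudi's actual `HoodieLogFile.getLogFileComparator`; it sorts stand-in log file names of the hypothetical form `.fileid_version.log` by their version number, ascending (oldest first), which is the kind of invariant the caution above suggests verifying with a small test.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class LogFileOrderCheck {
    // Extract the version from a stand-in name like ".f1_3.log" -> 3.
    static int version(String name) {
        String[] parts = name.split("_");
        return Integer.parseInt(parts[1].split("\\.")[0]);
    }

    // Sort ascending by version, so the oldest log file comes first.
    static List<String> sortByVersion(List<String> logFiles) {
        return logFiles.stream()
                .sorted(Comparator.comparingInt(LogFileOrderCheck::version))
                .collect(Collectors.toList());
    }
}
```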
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
(partitionRowPath, filesInPartition)
}.collect().map(f => f._1 -> f._2).toMap

var fetchedPartition2Files: Map[PartitionRowPath, Array[FileStatus]] = Map()
Rename to fetchedPartitionToFiles?
Done.
List<Pair<String, FileStatus[]>> partitionToFiles = engineContext.map(partitionPaths, partitionPathStr -> {
  Path partitionPath = new Path(partitionPathStr);
  FileSystem fs = partitionPath.getFileSystem(hadoopConf.get());
  return Pair.of(partitionPathStr, FSUtils.getAllDataFilesInPartition(fs, partitionPath));
FileStatus is not a Serializable class with the default Spark serializer. If the user has not set the serializer to Kryo when querying a Hudi table, a NotSerializableException will be thrown. The same problem exists for FSUtils.getAllPartitionPaths. But I think we can fix this in another PR.
I see, yeah: FileStatus is not serializable in Hadoop 2, but has been made Serializable in Hadoop 3. We should fix this in a separate PR for all methods by introducing a SerializableFileStatus similar to Spark's https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala#L347.
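The proposed wrapper can be sketched as below. The field selection (path, length, directory flag, modification time) is an assumption modeled on Spark's `SerializableFileStatus`, not a definitive list; the `roundTrip` helper exists only to demonstrate that the flat wrapper survives plain Java serialization, which Hadoop 2's `FileStatus` does not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializableFileStatus implements Serializable {
    // Carry only plain serializable fields of a Hadoop FileStatus, so the
    // default Java serializer (not just Kryo) can ship it across executors.
    final String path;
    final long length;
    final boolean isDir;
    final long modificationTime;

    SerializableFileStatus(String path, long length, boolean isDir, long modificationTime) {
        this.path = path;
        this.length = length;
        this.isDir = isDir;
        this.modificationTime = modificationTime;
    }

    // Round-trip through plain Java serialization to prove serializability.
    static SerializableFileStatus roundTrip(SerializableFileStatus s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(s);
            }
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (SerializableFileStatus) ois.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```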
Hi @umehrot2, overall LGTM! I have left some comments on the PR. After these comments and the conflicts are resolved, we can land it.
Force-pushed 70b70e9 to 8dddccd
@vinothchandar @pengzhiwei2018 addressed the latest comments. If it looks good to you guys, I can land it.
@hudi-bot run azure
LGTM! Thanks for this great contribution @umehrot2.
Force-pushed 8dddccd to 88b2110
I rebased this off master and started fixing the compile errors. It still needs work to get into a mergeable state; more compile errors remain to be resolved due to the config PR landing.
@umehrot2 I tried to rebase this off Wenning's PR while you were gone. It's midway now. Could you rebase and re-push? We can then work on landing this.
Force-pushed 0d885a5 to 25302f8
@hudi-bot run azure
Force-pushed 25302f8 to 68bf6fc
LGTM overall
What is the purpose of the pull request
This PR adds support for metadata-based listing for Hudi Spark DataSource and Spark SQL based queries. The detailed design for Spark integration (the V2 implementation specifically) can be found at https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements#RFC15:HUDIFileListingImprovements-Spark. Two parts of the V2 design have already been implemented:
In this PR we build on top of the FileIndex implementation to get file listings using Hudi's metadata table if the feature is enabled, and otherwise fall back to distributed listing using the Spark context. The metadata table will be read just once, and this reduces O(N) list calls to O(1) get calls for N partitions. We also refactor the Hudi metadata table contract to add a new API which can fetch listings for multiple partitions (opening the reader just once).
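The O(N)-to-O(1) claim above can be illustrated with a toy comparison. Plain maps stand in for the storage layer and the metadata table (none of these names are Hudi's actual API); a counter records how many "storage calls" each strategy makes for N partitions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class BatchedListing {
    // Naive approach: one list call per partition -> O(N) storage calls.
    static Map<String, List<String>> listPerPartition(
            List<String> partitions, Map<String, List<String>> store, AtomicInteger calls) {
        Map<String, List<String>> out = new HashMap<>();
        for (String p : partitions) {
            calls.incrementAndGet();  // one storage call per partition
            out.put(p, store.getOrDefault(p, List.of()));
        }
        return out;
    }

    // Batched approach: open the metadata reader once and fetch all
    // partitions in a single pass -> O(1) reader opens.
    static Map<String, List<String>> listBatched(
            List<String> partitions, Map<String, List<String>> store, AtomicInteger calls) {
        calls.incrementAndGet();      // single reader open
        Map<String, List<String>> out = new HashMap<>();
        for (String p : partitions) {
            out.put(p, store.getOrDefault(p, List.of()));
        }
        return out;
    }
}
```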
I have further re-factored HoodieFileIndex for more efficient integration in the case of MOR real-time queries. Earlier we were just listing base files using the file index, and later it would again perform listing for log files in MergeOnReadSnapshotRelation using groupLogsByBaseFile. Now, I will be storing and fetching both base and log files in the case of real-time queries. This ensures that the filesystem is listed just once if filesystem listing is used. In the case of metadata, it ensures the table will be read just once and no additional listing or reading is done to fetch log files.

Fixes #2935
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.