
[HUDI-1371] [HUDI-1893] Support metadata based listing for Spark DataSource and Spark SQL #2893

Merged
merged 1 commit into apache:master from metadata_listing_spark on Aug 3, 2021

Conversation

umehrot2
Contributor

@umehrot2 umehrot2 commented Apr 28, 2021


What is the purpose of the pull request

This PR adds support for metadata-based listing for Hudi Spark DataSource and Spark SQL queries. The detailed design for the Spark integration (the V2 implementation specifically) can be found at https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements#RFC15:HUDIFileListingImprovements-Spark. Two parts of the V2 design have already been implemented.

In this PR we build on top of the FileIndex implementation to get the file listing using Hudi's metadata table if the feature is enabled, and otherwise fall back to distributed listing using the Spark context. The metadata table is read just once, which reduces O(N) list calls to O(1) get calls for N partitions. We also refactor the Hudi metadata table contract to add a new API that can fetch file lists for multiple partitions at once (opening the reader just once).

I have further refactored HoodieFileIndex for more efficient integration in the case of MOR real-time queries. Earlier we were listing only base files using the file index, and MergeOnReadSnapshotRelation would later perform another listing for log files using groupLogsByBaseFile. Now both base and log files are stored and fetched for real-time queries. This ensures that the file system is listed just once when file system listing is used; when the metadata table is used, it ensures that it is read just once and no additional listing or reading is done to fetch log files.
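To make the intended flow concrete, here is a minimal sketch of the listing decision described above, in Scala. The method names, parameters, and the shape of the multi-partition call are illustrative assumptions, not necessarily the exact APIs introduced in this PR.

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.SparkContext
import org.apache.hudi.metadata.HoodieTableMetadata

object ListingSketch {
  // Illustrative sketch only: list N partitions either through the metadata
  // table (one read for all partitions) or by distributing plain file system
  // listing over the Spark executors.
  def listAllPartitions(sc: SparkContext,
                        metadata: HoodieTableMetadata,
                        useMetadataTable: Boolean,
                        partitionPaths: Seq[String]): Map[String, Seq[FileStatus]] = {
    if (useMetadataTable) {
      // One metadata-table read serves all N partitions: O(1) reads instead of
      // O(N) file system list calls (the multi-partition API added in this PR).
      metadata.getAllFilesInPartitions(partitionPaths.asJava)
        .asScala.map { case (p, files) => p -> files.toSeq }.toMap
    } else {
      // Fall back to distributed listing via the Spark context. (FileStatus
      // serializability on Hadoop 2 is discussed further down in this review.)
      sc.parallelize(partitionPaths, math.max(partitionPaths.size, 1)).map { p =>
        val path = new Path(p)
        // A default Configuration is used here for brevity; the real code ships
        // the table's Hadoop configuration to the executors.
        val fs = path.getFileSystem(new Configuration())
        p -> fs.listStatus(path).toSeq
      }.collect().toMap
    }
  }
}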

Fixes #2935

Brief change log

Verify this pull request

  • Existing unit tests updated
  • Ran several performance tests internally on AWS EMR via Spark DataSource and Spark SQL to observe improvements in query planning times

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

codecov-commenter commented Apr 28, 2021

Codecov Report

Merging #2893 (8dddccd) into master (b9e28e5) will increase coverage by 0.04%.
The diff coverage is 62.50%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2893      +/-   ##
============================================
+ Coverage     45.79%   45.84%   +0.04%     
- Complexity     5270     5274       +4     
============================================
  Files           909      908       -1     
  Lines         39390    39400      +10     
  Branches       4244     4253       +9     
============================================
+ Hits          18039    18063      +24     
+ Misses        19508    19480      -28     
- Partials       1843     1857      +14     
Flag Coverage Δ
hudicli 39.95% <ø> (ø)
hudiclient 30.44% <ø> (+0.05%) ⬆️
hudicommon 47.57% <25.00%> (-0.01%) ⬇️
hudiflink 61.33% <ø> (+0.48%) ⬆️
hudihadoopmr 51.29% <ø> (ø)
hudisparkdatasource 66.44% <71.87%> (-0.09%) ⬇️
hudisync 51.73% <ø> (ø)
huditimelineservice 64.36% <ø> (ø)
hudiutilities 56.24% <ø> (-0.40%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...c/main/java/org/apache/hudi/common/fs/FSUtils.java 47.08% <0.00%> (ø)
...va/org/apache/hudi/metadata/BaseTableMetadata.java 0.00% <0.00%> (ø)
.../org/apache/hudi/metadata/HoodieTableMetadata.java 0.00% <ø> (ø)
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala 76.16% <67.79%> (-4.97%) ⬇️
...e/hudi/metadata/FileSystemBackedTableMetadata.java 89.13% <75.00%> (-2.98%) ⬇️
.../org/apache/hudi/MergeOnReadSnapshotRelation.scala 86.40% <78.37%> (-0.99%) ⬇️
...a/org/apache/hudi/utilities/sources/SqlSource.java 0.00% <0.00%> (-64.71%) ⬇️
.../org/apache/hudi/sink/compact/CompactFunction.java 86.66% <0.00%> (-13.34%) ⬇️
...c/main/java/org/apache/hudi/util/StreamerUtil.java 55.46% <0.00%> (-0.95%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9e28e5...8dddccd. Read the comment docs.

@vinothchandar vinothchandar self-assigned this May 6, 2021
@vinothchandar
Member

@pengzhiwei2018 could you also please review this PR?

@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board May 6, 2021
@pengzhiwei2018
Contributor

@pengzhiwei2018 could you also please review this PR?

That is all right! I will spend some time on this PR tonight.

  }
} catch (Exception e) {
  if (metadataConfig.enableFallback()) {
    LOG.error("Failed to retrieve files in partitions from metadata", e);
Contributor

If the fallback is enabled here, an empty partitionsFilesMap will be returned if an exception happens, right?

Contributor Author

Good catch. It wasn't falling back to the file system to do the listing. Will fix it.
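For context, a minimal sketch of the fallback behavior being discussed, assuming illustrative helper names (listUsingMetadata, listUsingFileSystem) rather than the actual methods in the PR:

import scala.util.{Failure, Success, Try}
import org.apache.hadoop.fs.FileStatus

object FallbackSketch {
  // Illustrative only: if metadata-based listing fails and fallback is enabled,
  // list through the file system instead of returning an empty result.
  def listWithFallback(enableFallback: Boolean,
                       listUsingMetadata: () => Map[String, Seq[FileStatus]],
                       listUsingFileSystem: () => Map[String, Seq[FileStatus]]): Map[String, Seq[FileStatus]] = {
    Try(listUsingMetadata()) match {
      case Success(files) => files
      case Failure(e) if enableFallback =>
        // Log the error and fall back, rather than silently returning an empty map.
        println(s"Failed to retrieve files in partitions from metadata, falling back to file system listing: $e")
        listUsingFileSystem()
      case Failure(e) => throw e
    }
  }
}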

val properties = new Properties()
// To support metadata listing via Spark SQL we allow users to pass the config via Hadoop Conf. Spark SQL does not
Contributor

Should we get these configurations from the spark.sessionState.conf for spark?

Contributor Author

Makes sense. It will be a better experience than having to pass it via the Hadoop conf. Customers would be able to enable it in the Spark SQL session using SET commands.
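A rough sketch of what picking the option up from the Spark session could look like; the config key hoodie.metadata.enable, the helper name, and the hoodie.* prefix filter are assumptions for illustration:

import java.util.Properties
import org.apache.spark.sql.SparkSession

object SessionConfSketch {
  // Illustrative only: pick up hoodie.* settings from the Spark SQL session so
  // that `SET hoodie.metadata.enable=true` takes effect, while letting explicit
  // DataSource options override the session values.
  def buildHoodieProperties(spark: SparkSession, options: Map[String, String]): Properties = {
    val properties = new Properties()
    spark.sessionState.conf.getAllConfs
      .filter { case (key, _) => key.startsWith("hoodie.") }
      .foreach { case (key, value) => properties.setProperty(key, value) }
    options.foreach { case (key, value) => properties.setProperty(key, value) }
    properties
  }
}

With something like this in place, a user could run SET hoodie.metadata.enable=true in a Spark SQL session before querying the table, instead of passing the setting through the Hadoop configuration.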

@nsivabalan nsivabalan added the priority:major (degraded perf; unable to move forward; potential bugs) label May 11, 2021
@apache apache deleted a comment May 11, 2021
@umehrot2
Contributor Author

@pengzhiwei2018 @vinothchandar I have further refactored HoodieFileIndex for more efficient integration in the case of MOR real-time queries. Earlier we were listing only base files using the file index, and MergeOnReadSnapshotRelation would later perform another listing for log files using groupLogsByBaseFile. Now both base and log files are stored and fetched for real-time queries.

This ensures that the file system is listed just once when file system listing is used. When the metadata table is used, it ensures that it is read just once and no additional listing or reading is done to fetch log files.

@pengzhiwei2018
Contributor

pengzhiwei2018 commented May 13, 2021

@pengzhiwei2018 @vinothchandar I have further refactored HoodieFileIndex for more efficient integration in the case of MOR real-time queries. Earlier we were listing only base files using the file index, and MergeOnReadSnapshotRelation would later perform another listing for log files using groupLogsByBaseFile. Now both base and log files are stored and fetched for real-time queries.

This ensures that the file system is listed just once when file system listing is used. When the metadata table is used, it ensures that it is read just once and no additional listing or reading is done to fetch log files.

That sounds good to me!

@umehrot2 umehrot2 changed the title [HUDI-1371] Support metadata based listing for Spark DataSource and Spark SQL [HUDI-1371] [HUDI-1893] Support metadata based listing for Spark DataSource and Spark SQL May 13, 2021
@umehrot2
Contributor Author

@vinothchandar @pengzhiwei2018 do take a look again.

@pengzhiwei2018
Contributor

@vinothchandar @pengzhiwei2018 do take a look again.

OK, will start the review tomorrow. But can you fix the CI?

@vinothchandar vinothchandar moved this from Ready for Review to Nearing Landing in PR Tracker Board May 25, 2021
Member

@vinothchandar vinothchandar left a comment

Left some comments. High level looks OK to me. If you are both happy, feel free to land this, @umehrot2.

@@ -107,34 +113,61 @@ case class HoodieFileIndex(
}

  @transient @volatile private var fileSystemView: HoodieTableFileSystemView = _
- @transient @volatile private var cachedAllInputFiles: Array[HoodieBaseFile] = _
+ @transient @volatile private var cachedAllInputFiles: Map[PartitionRowPath, Map[HoodieBaseFile, Seq[HoodieLogFile]]] = _
Member

Could we just reuse a FileSlice object here, instead of the map of base files to log files?

Contributor Author

Made the change.
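For reference, a sketch of the FileSlice-based cache; the PartitionRowPath stand-in and the field name are simplified for illustration and may not match the final code:

import org.apache.hudi.common.model.FileSlice

object FileSliceCacheSketch {
  // Simplified stand-in for the partition key type used by HoodieFileIndex.
  case class PartitionRowPath(partitionPath: String)

  class Cache {
    // A FileSlice already pairs a base file with its log files, so a separate
    // Map[HoodieBaseFile, Seq[HoodieLogFile]] per partition is not needed.
    @transient @volatile var cachedAllInputFileSlices: Map[PartitionRowPath, Seq[FileSlice]] = Map.empty
  }
}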

prunedPartitions.map { partition =>
  val fileStatues = fileSystemView.getLatestBaseFiles(partition.partitionPath).iterator()
Member

So previously, we were actually performing the listing for each listFiles() call, without actually using the cached values?

Contributor

No! We cache the files in the fileSystemView, so each call to listFiles will reuse the cached values.

  override def inputFiles: Array[String] = {
-   cachedAllInputFiles.map(_.getFileStatus.getPath.toString)
+   cachedAllInputFiles.values.flatten.flatMap(baseLogFilesMapping => {
+     Iterator(baseLogFilesMapping._1.getPath) ++ baseLogFilesMapping._2.map(_.getFileStatus.getPath.toString)
Member

Any chance we can reuse code across this method and allFiles below?

Contributor Author

Done.
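A sketch of the kind of reuse being asked for: a single helper that flattens the cached file slices into file statuses, which both inputFiles and allFiles can delegate to (names are illustrative):

import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileStatus
import org.apache.hudi.common.model.FileSlice

object SharedListingSketch {
  // Collect base- and log-file statuses once; both inputFiles and allFiles can
  // then be derived from this single helper.
  def allBaseAndLogFileStatuses(slices: Seq[FileSlice]): Seq[FileStatus] =
    slices.flatMap { slice =>
      val base =
        if (slice.getBaseFile.isPresent) Seq(slice.getBaseFile.get().getFileStatus) else Seq.empty
      val logs = slice.getLogFiles.iterator().asScala.map(_.getFileStatus).toSeq
      base ++ logs
    }

  // inputFiles would then be:
  //   allBaseAndLogFileStatuses(cachedSlices).map(_.getPath.toString).toArray
  // and allFiles simply:
  //   allBaseAndLogFileStatuses(cachedSlices)
}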

case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) =>
  // Fetch and store latest base and log files, and their sizes
  cachedAllInputFiles = partitionFiles.map(p => {
    val latestSlices = fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath, activeInstants.lastInstant().get().getTimestamp)
Member

Any error handling needed for the case where the timeline is empty?

Contributor

If no commit has succeeded yet, activeInstants.lastInstant().get() may crash the query, so we'd better return an empty file list.

Contributor Author

Added check.
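A sketch of the guard being discussed; the helper below is illustrative, not the exact method added in the PR:

import scala.collection.JavaConverters._
import org.apache.hudi.common.model.FileSlice
import org.apache.hudi.common.table.timeline.HoodieTimeline
import org.apache.hudi.common.table.view.HoodieTableFileSystemView

object EmptyTimelineGuardSketch {
  // Only dereference lastInstant() when the timeline has at least one completed
  // instant; otherwise return an empty listing instead of crashing the query.
  def latestMergedSlices(timeline: HoodieTimeline,
                         fsView: HoodieTableFileSystemView,
                         partitionPath: String): Seq[FileSlice] = {
    if (timeline.lastInstant().isPresent) {
      fsView.getLatestMergedFileSlicesBeforeOrOn(partitionPath, timeline.lastInstant().get().getTimestamp)
        .iterator().asScala.toSeq
    } else {
      Seq.empty
    }
  }
}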

cachedAllInputFiles = partitionFiles.map(p => {
  val latestSlices = fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p._1.partitionPath, activeInstants.lastInstant().get().getTimestamp)
  val baseAndLogFilesMapping = latestSlices.iterator().asScala.map(slice => {
    (slice.getBaseFile.get(), slice.getLogFiles.sorted(HoodieLogFile.getLogFileComparator).iterator().asScala.toSeq)
Member

Worth double-checking that the comparator is actually sorting in the desired order. Just a random word of caution.

  (partitionRowPath, filesInPartition)
}.collect().map(f => f._1 -> f._2).toMap

var fetchedPartition2Files: Map[PartitionRowPath, Array[FileStatus]] = Map()
Member

Rename to fetchedPartitionToFiles?

Contributor Author

Done.

List<Pair<String, FileStatus[]>> partitionToFiles = engineContext.map(partitionPaths, partitionPathStr -> {
  Path partitionPath = new Path(partitionPathStr);
  FileSystem fs = partitionPath.getFileSystem(hadoopConf.get());
  return Pair.of(partitionPathStr, FSUtils.getAllDataFilesInPartition(fs, partitionPath));
Contributor

FileStatus is not serializable with the default Spark serializer. If the user has not set the serializer to Kryo when querying a Hudi table, a NotSerializableException will be thrown. The same problem exists for FSUtils.getAllPartitionPaths, but I think we can fix that in another PR.

Contributor Author

I see, yeah: FileStatus is not serializable in Hadoop 2, but it has been made Serializable in Hadoop 3. We should fix this in a separate PR for all methods by introducing a SerializableFileStatus similar to Spark's: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala#L347.
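For illustration, a SerializableFileStatus along the lines of Spark's HadoopFSUtils could look roughly like the following; this is a sketch of the approach, not the class eventually added to Hudi:

import org.apache.hadoop.fs.{FileStatus, Path}

// Carries the fields of a FileStatus in a serializable form so it can be
// shipped from executors to the driver regardless of the Hadoop version.
case class SerializableFileStatus(path: String,
                                  length: Long,
                                  isDirectory: Boolean,
                                  blockReplication: Int,
                                  blockSize: Long,
                                  modificationTime: Long) {

  // Rebuild a Hadoop FileStatus once back on the driver.
  def toFileStatus: FileStatus =
    new FileStatus(length, isDirectory, blockReplication, blockSize, modificationTime, new Path(path))
}

object SerializableFileStatus {
  def fromFileStatus(status: FileStatus): SerializableFileStatus =
    SerializableFileStatus(
      status.getPath.toString,
      status.getLen,
      status.isDirectory,
      status.getReplication.toInt,
      status.getBlockSize,
      status.getModificationTime)
}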

@hudi-bot

hudi-bot commented Jun 17, 2021

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@pengzhiwei2018
Contributor

Hi @umehrot2, overall LGTM! I have left some comments on the PR. Once those comments and the conflicts are resolved, we can land it.

@umehrot2 umehrot2 force-pushed the metadata_listing_spark branch 2 times, most recently from 70b70e9 to 8dddccd on June 19, 2021 01:12
@umehrot2
Contributor Author

@vinothchandar @pengzhiwei2018 addressed the latest comments. If it looks good to you guys, I can land it.

@vinothchandar
Member

@hudi-bot run azure

@pengzhiwei2018
Contributor

@vinothchandar @pengzhiwei2018 addressed the latest comments. If it looks good to you guys, I can land it.

LGTM! Thanks for this great contribution @umehrot2 .

@vinothchandar
Member

I rebased this off master and started fixing the compile errors. It still needs work to get it into a mergeable state; more compile errors need to be resolved due to the config PR landing.

@vinothchandar
Member

@umehrot2 I tried to rebase this off wenning's PR while you were gone; it's midway now. Could you rebase and re-push? We can then work on landing this.

@umehrot2 umehrot2 force-pushed the metadata_listing_spark branch 2 times, most recently from 0d885a5 to 25302f8 on July 28, 2021 23:52
@umehrot2
Contributor Author

@hudi-bot run azure

Member

@vinothchandar vinothchandar left a comment

LGTM overall

@vinothchandar vinothchandar merged commit 1ff2d34 into apache:master Aug 3, 2021
PR Tracker Board automation moved this from Nearing Landing to Done Aug 3, 2021
liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request Aug 12, 2021
Labels
priority:blocker, priority:major (degraded perf; unable to move forward; potential bugs)
Projects
PR Tracker Board
Development

Successfully merging this pull request may close these issues.

[Support] HoodieFileIndex get a error when there is no partition path in table storage
6 participants