[HUDI-5276] Fix getting partition paths under relative paths#7744

Merged: codope merged 3 commits into apache:master from yihua:HUDI-5276-fix-partition-prefix-analysis on Jan 25, 2023

@yihua commented Jan 24, 2023

Change Logs

When the metadata table is enabled and used to get the partition paths under certain directories (listing by partition path prefix in SparkHoodieTableFileIndex and getting query partition paths in BaseHoodieTableFileIndex), the following logic in HoodieBackedTableMetadata is invoked:

  public List<String> getPartitionPathsWithPrefixes(List<String> prefixes) throws IOException {
    return getAllPartitionPaths().stream()
        .filter(p -> prefixes.stream().anyMatch(p::startsWith))
        .collect(Collectors.toList());
  }

If the table's partition paths are 1, 10, and 100, listing with the prefix 1 returns all three, which is incorrect. Each prefix should be treated as an exact relative path (a parent directory), not as a string prefix, so the method name itself is misleading.
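
To make the failure mode concrete, here is a minimal standalone sketch that applies the same filter shape to the example above. The class name PrefixOverMatchDemo and the helper filterByStringPrefix are hypothetical and exist only for illustration; they are not part of Hudi.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of Hudi.
class PrefixOverMatchDemo {
  // Same filter shape as the snippet above: plain string-prefix matching.
  static List<String> filterByStringPrefix(List<String> allPartitionPaths, List<String> prefixes) {
    return allPartitionPaths.stream()
        .filter(p -> prefixes.stream().anyMatch(p::startsWith))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> all = Arrays.asList("1", "10", "100");
    // "10" and "100" also start with the string "1", so they are returned too.
    System.out.println(filterByStringPrefix(all, Arrays.asList("1"))); // prints [1, 10, 100]
  }
}
```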

This PR makes the following changes to correct the issue:

  • Renames getPartitionPathsWithPrefixes to getPartitionPathWithPathPrefixes in HoodieTableMetadata and updates the relevant variable naming and docs
  • Fixes the logic in HoodieBackedTableMetadata#getPartitionPathWithPathPrefixes to do exact parent directory matching
  • Adds new tests in TestHoodieFileIndex to guard the logic. These tests fail before this PR.
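
One way to picture "exact parent directory matching" is the sketch below. It is not the code from this PR: the class PathPrefixMatchSketch and the helpers isUnderPathPrefix and filterByPathPrefix are hypothetical, and treating an empty prefix as the table root is an assumption made for the example. The idea is that a partition path matches only if it equals the relative path or continues it at a "/" boundary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not Hudi's actual implementation.
class PathPrefixMatchSketch {
  // A partition path is under a relative path prefix only if it is that exact
  // path or a descendant of it, i.e. the prefix is followed by a "/" separator.
  static boolean isUnderPathPrefix(String partitionPath, String pathPrefix) {
    if (pathPrefix.isEmpty()) {
      return true; // assumption: an empty prefix denotes the table root
    }
    return partitionPath.equals(pathPrefix) || partitionPath.startsWith(pathPrefix + "/");
  }

  static List<String> filterByPathPrefix(List<String> allPartitionPaths, List<String> pathPrefixes) {
    return allPartitionPaths.stream()
        .filter(p -> pathPrefixes.stream().anyMatch(prefix -> isUnderPathPrefix(p, prefix)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> all = Arrays.asList("1", "10", "100", "1/a", "10/a");
    // Only "1" itself and paths nested under directory "1" are kept.
    System.out.println(filterByPathPrefix(all, Arrays.asList("1"))); // prints [1, 1/a]
  }
}
```

Comparing against the prefix plus a trailing separator keeps the check a cheap string operation while ruling out sibling directories such as 10 and 100.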

This PR fixes #7298.

Impact

Fixes a bug where irrelevant partition paths were returned for a given relative path prefix.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua requested a review from alexeykudinkin January 24, 2023 19:10
@yihua added the priority:blocker (Production down; release blocker), engine:spark (Spark integration), and reader-core labels Jan 24, 2023

  // In that case, we simply could not apply partition pruning and will have to regress to scanning
  // the whole table
- if (haveProperPartitionValues(partitionPaths)) {
+ if (haveProperPartitionValues(partitionPaths) && partitionSchema.nonEmpty) {

Comment from the pull request author:

This fix is needed for the case where a partition column value contains slashes (in which case partition column values are not supposed to be filled in) and the query predicates cover all the partition columns. Exact matching of a partition path then happens, and partition values are filled in inside partitionPaths (see the code snippet below from SparkHoodieTableFileIndex#tryListByPartitionPathPrefix).

  if (staticPartitionColumnNameValuePairs.length == partitionColumnNames.length) {
    // In case composed partition path is complete, we can return it directly avoiding extra listing operation
    Seq(new PartitionPath(relativePartitionPathPrefix, staticPartitionColumnNameValuePairs.map(_._2.asInstanceOf[AnyRef]).toArray))
  }

Without this fix, test case 3 (with predicate "dt = '2023/01/01' and region_code = '1'") in testFileListingWithPartitionPrefixPruning would fail.
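
As a rough illustration of why slashes in a partition value are awkward, the sketch below shows that a relative path can no longer be split positionally into one value per partition column. It is a simplified, hypothetical check rather than Hudi's parsing logic, and the path layout 2023/01/01/1 (non-Hive-style, dt followed by region_code) is an assumption chosen to mirror the predicate in test case 3.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Hudi's actual partition value parsing.
class SlashInPartitionValueSketch {
  // Positional parsing only works when the number of "/"-separated path
  // segments equals the number of partition columns.
  static boolean canParsePositionally(String relativePartitionPath, List<String> partitionColumns) {
    return relativePartitionPath.split("/").length == partitionColumns.size();
  }

  public static void main(String[] args) {
    List<String> columns = Arrays.asList("dt", "region_code");
    // dt = '2023-01-01', region_code = '1': two segments for two columns.
    System.out.println(canParsePositionally("2023-01-01/1", columns)); // prints true
    // dt = '2023/01/01', region_code = '1': four segments for two columns.
    System.out.println(canParsePositionally("2023/01/01/1", columns)); // prints false
  }
}
```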

@codope codope merged commit 2f22b07 into apache:master Jan 25, 2023