
[HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() #4810

Merged

Conversation


@zhangyue19921010 zhangyue19921010 commented Feb 14, 2022

https://issues.apache.org/jira/browse/HUDI-3421

What is the purpose of the pull request

If there is an inflight clustering instant at the start of the active timeline, AbstractTableFileSystemView#getxxBaseFile() will be broken by the uncommitted data files created by this clustering job.

Steps to reproduce this problem (set archive min = 2, max = 3 and disable cleaning):
1. Do an ingestion with commit 1.
2. Trigger a clustering job with replacecommit 2.
3. Keep ingesting until commit 1 is archived ---> now replacecommit 2 is the earliest instant on the active timeline. (Alternatively, we could delete commit 1 directly.)
4. Compare the count(*) query result with and without replacecommit 2.

We find that the result without replacecommit 2 is larger than the result with it.

This patch fixes the issue: we must not treat an uncommitted data file created by an inflight clustering as a HoodieBaseFile.

See the added UT for more details. Without this patch, the added UT fails because of the additional uncommitted clustering data file.


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@zhangyue19921010
Contributor Author

Here is the test table in my local environment:

.
├── .hoodie
│   ├── .aux
│   │   └── .bootstrap
│   │       ├── .fileids
│   │       └── .partitions
│   ├── .hoodie.properties.crc
│   ├── .temp
│   ├── 20220214113603048.replacecommit.inflight
│   ├── 20220214113603048.replacecommit.requested
│   ├── 20220214113742525.commit
│   ├── 20220214113742525.commit.requested
│   ├── 20220214113742525.inflight
│   ├── 20220214113823507.commit
│   ├── 20220214113823507.commit.requested
│   ├── 20220214113823507.inflight
│   ├── 20220214113906724.commit
│   ├── 20220214113906724.commit.requested
│   ├── 20220214113906724.inflight
│   ├── 20220214113951921.commit
│   ├── 20220214113951921.commit.requested
│   ├── 20220214113951921.inflight
│   ├── archived
│   │   └── .commits_.archive.1_1-0-1
│   └── hoodie.properties
└── 20210623
    ├── .hoodie_partition_metadata
    ├── 08097b95-8096-42bf-81ee-9b9719af72e5-0_1-12-0_20220214113742525.parquet
    ├── 22c58520-9743-42f7-9bca-cfd1c250af7d-0_1-12-0_20220214113503975.parquet
    ├── 2fef7543-30bd-495d-ad06-ad9e17d00220-0_0-11-0_20220214113823507.parquet
    ├── 31ebf933-f1db-4d8e-841c-276faf50e6d9-0_0-11-0_20220214113906724.parquet
    ├── 383fea19-f097-4682-91c5-7dd43b4116d3-0_1-12-0_20220214113951921.parquet
    ├── 500c0898-2367-4de1-9d42-d40a41e50dbc-0_0-3-4_20220214113603048.parquet
    ├── c6e497de-4c2a-4d89-9f7f-fa07bcab3d6c-0_0-11-0_20220214113742525.parquet
    ├── c9e25ea5-6571-47a1-befe-463a3e383f26-0_0-11-0_20220214113503975.parquet
    ├── e454fcfd-8176-4d94-bf43-8a8d7c9a15ba-0_0-11-0_20220214113951921.parquet
    ├── e881db1c-4aba-4627-8adf-b893b1ffc70c-0_1-12-0_20220214113906724.parquet
    └── f044cb05-e191-4db6-b094-7c86b1ad3df8-0_1-12-0_20220214113823507.parquet

Query

select count(*) from hudi_test

result with 20220214113603048.replacecommit ==> 5179250
result without 20220214113603048.replacecommit ==> 6215100

As we can see, the result without the pending replacecommit contains duplicated data.
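(The gap between the two results, 6215100 - 5179250 = 1035850 rows, presumably corresponds to the records in the uncommitted clustering output file 500c0898-2367-4de1-9d42-d40a41e50dbc-0_0-3-4_20220214113603048.parquet being counted on top of the original data files they rewrite.)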

@zhangyue19921010 zhangyue19921010 changed the title [HUDI-3421]Pending clustering may break AbstractTableFileSystemView [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() Feb 14, 2022
@@ -492,7 +505,7 @@ protected HoodieBaseFile addBootstrapBaseFileIfPresent(HoodieFileGroupId fileGro
       .map(fileGroup -> Option.fromJavaOptional(fileGroup.getAllBaseFiles()
           .filter(baseFile -> HoodieTimeline.compareTimestamps(baseFile.getCommitTime(), HoodieTimeline.LESSER_THAN_OR_EQUALS, maxCommitTime))
-          .filter(df -> !isBaseFileDueToPendingCompaction(df)).findFirst()))
+          .filter(df -> !isBaseFileDueToPendingCompaction(df) && !isBaseFileDueToPendingClustering(df)).findFirst()))
Contributor

Shouldn't the caller pass in the right maxCommitTime to filter out the pending base files ?

Contributor Author

This bug can happen when the inflight clustering is the earliest instant on the active timeline (based on the following code), so no matter what maxCommitTime is, we can't filter it out.
Currently we treat containsOrBeforeTimelineStarts as committed, which may include unfinished clustering data when the inflight clustering instant is at the start of the active timeline.

private boolean isFileSliceCommitted(FileSlice slice) {
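  // Simplified sketch of the remainder of this method, not the exact Hudi source;
  // "committedTimeline" stands in for the timeline of instants the view treats as committed.
  // containsOrBeforeTimelineStarts(ts) is true when ts is a completed instant on the active
  // timeline OR when ts is older than the first active instant (i.e. assumed already archived).
  // Once the inflight replacecommit becomes the earliest active instant, the base files it
  // wrote fall into the "before timeline starts" branch and are wrongly treated as committed,
  // no matter what maxCommitTime the caller passes in.
  return committedTimeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
}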


@zhangyue19921010
Contributor Author

Hi @codope and @satishkotha, sorry to bother you. Would you mind taking a look at this patch?
Thanks a lot :)


@nsivabalan nsivabalan left a comment


Good catch! LGTM. But let's wait for a review from Satish or Sagar, who authored most pieces of clustering.

@nsivabalan nsivabalan added this to Under Discussion PRs in PR Tracker Board via automation Feb 15, 2022
@nsivabalan nsivabalan moved this from Under Discussion PRs to Nearing Landing in PR Tracker Board Feb 15, 2022
@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Feb 15, 2022

@codope codope left a comment


@zhangyue19921010 Thanks for catching the bug! Left a few comments

*/
protected boolean isBaseFileDueToPendingClustering(HoodieBaseFile baseFile) {
List<String> pendingReplaceInstants =
metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList());
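  // The quoted snippet is cut off here; a sketch of the remaining lines, reconstructed from
  // the diff and the discussion below (the merged code may differ slightly): a base file
  // written by a pending clustering carries that replacecommit's instant time as its commit
  // time, so it must not be served as a valid base file.
  return !pendingReplaceInstants.isEmpty() && pendingReplaceInstants.contains(baseFile.getCommitTime());
}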
Member

Can we not reuse isPendingClusteringScheduledForFileId() or getPendingClusteringInstant()? We already maintain a map fgIdToPendingClustering which supports various methods. If we can reuse one of them, then we don't need to query the active timeline.

Contributor Author

Hmm, maybe we can't use fgIdToPendingClustering for the filtering here.
The files recorded in fgIdToPendingClustering are committed files and need to remain visible.
What we need to filter out here are the in-flight, uncommitted data files produced by the clustering job.

So we need the instant time of xxxx.replacecommit.requested or xxxx.replacecommit.inflight and use it to filter out the uncommitted data files created by clustering, rather than the files that are scheduled to be clustered.

* 2. getBaseFileOn
* 3. getLatestBaseFilesInRange
* 4. getAllBaseFiles
* 5. getLatestBaseFiles
Member

What about other base file related APIs like fetchLatestBaseFiles and fetchAllBaseFiles? Are they all covered by this change?
PS: I think we should take up a follow-up task to make the FSView APIs more uniform.

Contributor Author

Yep, there are quite a few getxxxxLatestxxx() methods, haha.
The root change here is adding isBaseFileDueToPendingClustering, analogous to isBaseFileDueToPendingCompaction, and applying this check wherever isBaseFileDueToPendingCompaction is applied.

The APIs listed in the UT are all of the affected APIs.

@nsivabalan

@codope: can we close the loop here please? If you want me to take it up, let me know. Happy to do so.


@codope codope left a comment


Looks good. All comments addressed.

@codope codope merged commit 7428100 into apache:master Feb 25, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Feb 25, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022