
[HUDI-5289] Fix writeStatus RDD recalculated in cluster #7373

Closed
wants to merge 2 commits

Conversation

Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented Dec 3, 2022

Change Logs

Fixes https://issues.apache.org/jira/projects/HUDI/issues/HUDI-5289

There is an earlier patch for a similar problem, but even with that patch applied, the problem still exists.
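
For context, a minimal plain-Spark sketch of the underlying behavior (illustrative names only, not Hudi code): an RDD that is consumed by more than one action re-runs its whole upstream DAG each time unless it is persisted before the first action.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RecomputeDemo {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "recompute-demo");

    // Stand-in for the expensive clustering write that produces the writeStatus RDD.
    JavaRDD<String> writeStatusRDD = jsc.parallelize(Arrays.asList(1, 2, 3))
        .map(i -> {
          // Printed once per record per evaluation of the DAG.
          System.out.println("expensive write for record " + i);
          return "status-" + i;
        });

    // Without this persist, the two actions below each re-run the map,
    // i.e. the "expensive write" happens twice per record.
    writeStatusRDD.persist(StorageLevel.MEMORY_AND_DISK());

    writeStatusRDD.count();    // first action: evaluates the DAG and caches the partitions
    writeStatusRDD.collect();  // second action: served from the persisted partitions

    jsc.close();
  }
}
```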

Impact

Could improve the clustering performance.

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Contributor

@nsivabalan nsivabalan left a comment


When I put up the fix, I did verify that clustering was not triggered multiple times. Can you confirm that you are seeing clustering being triggered twice?
If you keep inspecting the data directory, you will find the additional data files only during the validation.
Another way to test this: marker directory reconciliation will delete some additional files if the DAG was triggered twice. If it wasn't, you should not see any marker file deletions.
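
A third, generic way to check for a doubled DAG (just a hedged sketch with a plain Spark listener, not something from this PR): register a listener while clustering runs and count the jobs; if the write DAG is evaluated twice, the same stages show up twice.

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.sql.SparkSession;

public class JobCounter {
  public static final AtomicInteger JOBS = new AtomicInteger();

  // Install before triggering clustering; compare the job count (and the stage names
  // in the Spark UI) between runs to see whether the write DAG ran twice.
  public static void install(SparkSession spark) {
    spark.sparkContext().addSparkListener(new SparkListener() {
      @Override
      public void onJobStart(SparkListenerJobStart jobStart) {
        System.out.println("job " + jobStart.jobId() + " started, total=" + JOBS.incrementAndGet());
      }
    });
  }
}
```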

@nsivabalan nsivabalan added priority:blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 5, 2022
@Zouxxyy
Contributor Author

Zouxxyy commented Dec 6, 2022

@nsivabalan Yes, I can confirm, and @boneanxs found it too, so I think we should persist it.

@hudi-bot

hudi-bot commented Dec 6, 2022

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@boneanxs
Contributor

boneanxs commented Dec 6, 2022

> @nsivabalan Yes, I can confirm, and @boneanxs found it too, so I think we should persist it.

I found this issue when I first implemented clustering as row, but didn't investigate it much at that time.

But I'm wondering whether we can do collectAsList and then parallelize that list to build HoodieData[WriteStatus] in performClusteringWithRecordsRDD, just like performClusteringWithRecordsAsRow already does (it dereferences the RDD to a list), to fix this permanently?
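
A rough sketch of that idea (the helper below is hypothetical; HoodieJavaRDD.of and the exact package names are assumptions based on the surrounding code, not an actual patch): collect the write statuses once on the driver, then re-parallelize them so later actions never walk back up the clustering DAG.

```java
import java.util.List;

import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.data.HoodieData;
import org.apache.hudi.data.HoodieJavaRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class DereferenceWriteStatuses {
  // Hypothetical helper: materialize the write statuses on the driver, then rebuild a
  // HoodieData from the collected list so downstream consumers (commit, metadata updates)
  // do not re-trigger the clustering write DAG.
  static HoodieData<WriteStatus> dereference(JavaSparkContext jsc,
                                             JavaRDD<WriteStatus> writeStatusRDD) {
    List<WriteStatus> collected = writeStatusRDD.collect();
    return HoodieJavaRDD.of(jsc.parallelize(collected));
  }
}
```

The trade-off is a driver-side collect of all WriteStatus objects, which is the same pattern the row-writer path already relies on.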

@@ -122,6 +123,8 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final Hood
         .stream();
     JavaRDD<WriteStatus>[] writeStatuses = convertStreamToArray(writeStatusesStream.map(HoodieJavaRDD::getJavaRDD));
     JavaRDD<WriteStatus> writeStatusRDD = engineContext.union(writeStatuses);
+    // Persist writeStatus, since it may be reused
+    writeStatusRDD.persist(StorageLevel.MEMORY_AND_DISK());
Contributor


We can't rely on caching, since the cache could be invalidated, and we have had experiences where the actual files written did not match what's in writeStatus. Hence, if something fails, we let the DAG get retriggered here. We added guard rails in BaseCommitActionExecutor by means of cloning.
CC @alexeykudinkin

@nsivabalan
Contributor

Are you saying that even in a successful path the DAG is getting triggered twice, or does it happen only in the exception path?

@Zouxxyy
Contributor Author

Zouxxyy commented Dec 12, 2022

@nsivabalan
Yes, even in a successful path the DAG is getting triggered twice. But you must turn off hoodie.datasource.write.row.writer.enable, because when it is turned on, recalculation is avoided thanks to the collect in HoodieDatasetBulkInsertHelper. Since it seems we do not recommend using persist at present, I will close these two PRs.
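
For anyone reproducing this, a hedged sketch of the relevant switch (table name and paths are illustrative, and the full clustering setup is omitted): disabling the row writer forces the RDD-based clustering path, which is where the duplicate DAG evaluation shows up.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ReproWriterConfig {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-5289-repro")
        .master("local[2]")
        .getOrCreate();

    Dataset<Row> df = spark.read().parquet("/tmp/source_data"); // illustrative input

    df.write()
        .format("hudi")
        .option("hoodie.table.name", "repro_table")
        // Turn off the row writer so clustering goes through performClusteringWithRecordsRDD.
        .option("hoodie.datasource.write.row.writer.enable", "false")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/repro_table"); // illustrative base path
  }
}
```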

@Zouxxyy Zouxxyy closed this Dec 12, 2022
Labels
priority:blocker release-0.12.2 Patches targetted for 0.12.2