Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7606] Unpersist RDDs after table services, mainly compaction and clustering #11000

Merged
merged 8 commits into from
Apr 14, 2024

Conversation

rmahindra123
Copy link
Contributor

@rmahindra123 rmahindra123 commented Apr 11, 2024

Change Logs

Unpersist RDDs after table services. Currently, the releaseResources is only called for commit or deltacommits. Tests show that the RDDs persisted by compaction and clustering are not explicitly unpersisted.

Impact

Cleans up persisted rdds w/o leaving any residues.

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Apr 11, 2024
@rmahindra123 rmahindra123 changed the title [] Unpersist RDDs after table services, mainly compaction [HUDI-7606] Unpersist RDDs after table services, mainly compaction Apr 11, 2024
@@ -268,6 +268,7 @@ public boolean commitStats(String instantTime, HoodieData<WriteStatus> writeStat
commitCallback.call(new HoodieWriteCommitCallbackMessage(
instantTime, config.getTableName(), config.getBasePath(), stats, Option.of(commitActionType), extraMetadata));
}
releaseResources(instantTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code path covers all writers correct? spark ds writer, and deltastreamer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, since SparkRDDWriteClient implements the method and is used across Spark DS and deltastreamer, we should be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we are missing async compaction flows.

writeClient.commitCompaction(instant.getTimestamp(), compactionMetadata.getCommitMetadata().get(), Option.empty());

->
public void commitCompaction(String compactionInstantTime, HoodieCommitMetadata metadata,

->
protected void completeCompaction(HoodieCommitMetadata metadata, HoodieTable table, String compactionCommitTime) {

I don't think we are unpersisting rdds in this flow.

Likely same for async clustering too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it for all table services, for both inline and async flows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch btw

@rmahindra123 rmahindra123 changed the title [HUDI-7606] Unpersist RDDs after table services, mainly compaction [HUDI-7606] Unpersist RDDs after table services, mainly compaction and clustering Apr 13, 2024
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:XS PR with lines of changes in <= 10 labels Apr 14, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 48952ae into apache:master Apr 14, 2024
40 of 41 checks passed
yihua pushed a commit that referenced this pull request May 14, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
yihua pushed a commit that referenced this pull request May 14, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
yihua pushed a commit that referenced this pull request May 15, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
yihua pushed a commit that referenced this pull request May 15, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
yihua pushed a commit that referenced this pull request May 15, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
yihua pushed a commit that referenced this pull request May 15, 2024
…d clustering (#11000)

---------

Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-0.15.0 size:M PR with lines of changes in (100, 300]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants