Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-405] Replace Metric with LongAccumulator to reduce memory overhead for spatial join #1041

Merged
merged 5 commits into from
Sep 29, 2023

Conversation

jiayuasu
Copy link
Member

@jiayuasu jiayuasu commented Sep 29, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Replace the old Metric class with the Spark built-in LongAccumulator to reduce the memory overhead of spatial join

This should significantly reduce the Sedona driver program memory overhead for both RDD-based join and SQL-based join

How was this patch tested?

Passed existing tests.

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation update.

@umartin
Copy link
Contributor

umartin commented Sep 29, 2023

Just my two cents. I would prefer removing the Sedona Metric class and use the Spark build in LongAccumulator instead. The LongAccumulator has no overhead. There would be no need for a debug option. Less code, less knobs and metrics for everyone. Keeping the Sedona Metric class means there is no way to have metrics for larger jobs, where you would really want them for tuning.

@jiayuasu
Copy link
Member Author

@umartin The Metric is designed to capture the <paritionId, buildSideCount> and <paritionId, streamSideCount> for each partition in a task. If we use the LongAccumulator without the partitionId, then there is no partition-wise information and there is no point of even having this LongAccumulator.

Or maybe I misunderstood your suggestion. Please advise.

@umartin
Copy link
Contributor

umartin commented Sep 29, 2023

This is from the Spark UI where I replaced the Sedona Metric with a LongAccumulator. Spark already tracks the accumulators by task in "Tasks" as well as presenting the summary in "Accumulators"
image

@jiayuasu
Copy link
Member Author

@umartin Oh, now it is clear to me. Thanks for the suggestion. Will fix it according to your suggestion

@umartin
Copy link
Contributor

umartin commented Sep 29, 2023

Thank you! I'm really looking forward to the 1.5 release. There are tons of great changes. You've been really busy!

@jiayuasu jiayuasu changed the title [SEDONA-405] Introduce debug mode to reduce memory overhead for spatial join [SEDONA-405] Replace Metric with LongAccumulator to reduce memory overhead for spatial join Sep 29, 2023
@jiayuasu jiayuasu linked an issue Sep 29, 2023 that may be closed by this pull request
@jiayuasu jiayuasu merged commit db07953 into master Sep 29, 2023
40 checks passed
@jiayuasu jiayuasu deleted the OOM-issue branch September 29, 2023 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in Sedona 1.4.1 leading to OutOfMemoryException
2 participants