BEAM-5933: avoid memory allocation in hashCode call #6909
iemejia
left a comment
LGTM, thanks!
|
This PR causes BEAM-6407... |
|
Out of curiosity, did you measure the allocation and its cost? Asking because we use this sort of practice a lot, since building your own hashCode is a mess and we don't use AutoValue enough. If it is expensive, we should inline a few versions of the call for low numbers of varargs. |
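To illustrate the "inline a few versions of the call for low numbers of varargs" idea: `java.util.Objects.hash(Object...)` takes a varargs array, so each call may allocate one unless the JIT manages to eliminate it. A rough sketch of fixed-arity overloads follows. The class `HashAllocationSketch` and its `hash` methods are hypothetical, not part of Beam or the JDK; they are written to return the same values as `Objects.hash` for the same arguments.

```java
import java.util.Objects;

// Hypothetical helper, not part of Beam or the JDK.
final class HashAllocationSketch {

    // Same value as Objects.hash(a), without building an Object[] for varargs.
    static int hash(Object a) {
        return 31 + Objects.hashCode(a);
    }

    // Same value as Objects.hash(a, b).
    static int hash(Object a, Object b) {
        return 31 * (31 + Objects.hashCode(a)) + Objects.hashCode(b);
    }

    // Same value as Objects.hash(a, b, c).
    static int hash(Object a, Object b, Object c) {
        return 31 * (31 * (31 + Objects.hashCode(a)) + Objects.hashCode(b))
                + Objects.hashCode(c);
    }

    public static void main(String[] args) {
        // The fixed-arity overloads agree with the varargs version.
        System.out.println(hash("tag") == Objects.hash("tag"));
        System.out.println(hash("tag", 42) == Objects.hash("tag", 42));
        System.out.println(hash("tag", null, 7L) == Objects.hash("tag", null, 7L));
    }
}
```

Because the results match `Objects.hash`, swapping such overloads in would not change any existing hash values; whether the saved allocation matters would still need to be measured.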
|
I encountered this as part of a larger investigation into the performance of our pipelines. The invocations were frequent enough to stand out in the memory profiler. I found it interesting that a hash calculation needs to allocate memory, and considering the high number of occurrences I thought this would be worth "fixing". Unfortunately I was not aware of the "hidden contract" that assumes a particular hash implementation. Btw, the overall improvement we were able to achieve in our pipelines was about 15% (7 minutes down to 6 minutes), but this can be predominantly (if not completely) attributed to removing the Beam metrics collection in the DirectRunner. The load that Beam metrics (in the DirectRunner) put on the heap is massive (that extra minute basically went to GC activity). However, knowing that anything performance-related is implicitly considered a non-issue, I didn't report it. This high resource consumption is probably why the FlinkRunner eventually added the option to disable metrics collection. |
|
@janotav 'anything performance related' is an issue. Even if the Direct Runner is a test runner, improvements in its performance benefit us all, so don't hesitate to report them (or contribute fixes). Also, your fix is in 'sdks/java/core', so it benefits everyone. It is really worth it! |
|
Do you have more details on the metrics performance issue you mention? Would you mind creating a JIRA, please? |
|
@janotav you are quite right that this hidden contract is very suspicious. I have looked into the type hierarchy to investigate. The issue is that there are two desires in conflict: (1) a runner can deserialize a protobuf PCollectionView using just the tag, into whatever its runner-specific representation is, and (2) you can use a PCollectionView as a key to retrieve values. Together, these force any subclasses of PCollectionView to be equal (and to have equal hash codes) if their tags are equal, since runners create proxy views or whatever. IMO this contract is broken, since the same tag but different But if you want to gain the performance back, I bet you can roll forward and also just change here to match: https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/RunnerPCollectionView.java#L108 Even better would be to port things to use the tag as the key into any implementation map. There is not even an equals and hashCode on these subclasses in the Dataflow worker, so I think that implies the tag is used directly: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowPortabilityPCollectionView.java and https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/FetchAndFilterStreamingSideInputsOperation.java#L99 |
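A minimal sketch of how such a hidden contract can break, assuming the change swapped `Objects.hash(tag)` for a direct `tag.hashCode()` in one class only. The classes below (`ViewBase`, `SdkView`, `RunnerView`, `AlignedRunnerView`) are hypothetical stand-ins, not the real Beam types. `Objects.hash(x)` evaluates to `31 + x.hashCode()`, so two views that are equal by tag end up with different hash codes and a `HashMap` lookup misses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins for the view classes; not the real Beam types.
final class TagHashContractSketch {

    // Views are equal across subclasses whenever their tags are equal.
    abstract static class ViewBase {
        final String tag;

        ViewBase(String tag) {
            this.tag = tag;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof ViewBase && tag.equals(((ViewBase) other).tag);
        }

        @Override
        public abstract int hashCode();
    }

    // Assumed "after the change" SDK view: no varargs array, hashes the tag directly.
    static final class SdkView extends ViewBase {
        SdkView(String tag) { super(tag); }
        @Override public int hashCode() { return tag.hashCode(); }
    }

    // Runner-side proxy still using Objects.hash(tag), i.e. 31 + tag.hashCode().
    static final class RunnerView extends ViewBase {
        RunnerView(String tag) { super(tag); }
        @Override public int hashCode() { return Objects.hash(tag); }
    }

    // Runner-side proxy changed to match the SDK view.
    static final class AlignedRunnerView extends ViewBase {
        AlignedRunnerView(String tag) { super(tag); }
        @Override public int hashCode() { return tag.hashCode(); }
    }

    public static void main(String[] args) {
        Map<ViewBase, String> sideInputs = new HashMap<>();
        sideInputs.put(new SdkView("side-input-1"), "value");

        // Equal by tag, but the hash codes differ, so the lookup misses.
        System.out.println(sideInputs.get(new RunnerView("side-input-1")));
        // Equal by tag with a matching hash code, so the lookup hits.
        System.out.println(sideInputs.get(new AlignedRunnerView("side-input-1")));
    }
}
```

Keying the map by the tag itself, as suggested above, would sidestep the problem because no cross-class `hashCode` agreement is needed.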
|
You may mean that anything performance-related in the DirectRunner is a non-issue. Sometimes it seems that way, and it is true that it is focused on just being a fake for testing. But it is so bad that we really do need to improve it. Please keep reporting issues! |
Avoid unwanted memory allocation currently done during invocation of
org.apache.beam.sdk.values.PCollectionViews$SimplePCollectionView.hashCode()