Skip to content

Conversation

@janotav
Copy link
Contributor

@janotav janotav commented Nov 1, 2018

Avoid unwanted memory allocation currently done during invocation of

org.apache.beam.sdk.values.PCollectionViews$SimplePCollectionView.hashCode()


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@iemejia iemejia merged commit 1ca7605 into apache:master Nov 1, 2018
@nielm
Copy link
Contributor

nielm commented Jan 16, 2019

This PR causes BEAM-6407...

@kennknowles
Copy link
Member

kennknowles commented Jan 16, 2019

Out of curiosity, did you measure the allocation and its cost? Asking because we use this sort of practice a lot, since building your own hashCode is a mess and we don't use AutoValue enough. If it is expensive, we should inline a few versions of the call for low numbers of varargs.

@janotav
Copy link
Contributor Author

janotav commented Jan 17, 2019

I have encountered this as a part of larger investigation into the performance aspects of our pipelines. The invocations were frequent enough to stand out in the memory profiler. I found it interesting that hash calculations needs to allocate memory and considering the high number of occurrences I thought this would be worth "fixing". Unfortunately I was not aware of this "hidden contract" that assumes particular hash implementation is necessary.

Btw the overall improvement we were able to achieve in our pipelines was about 15% (7mins down to 6mins), but this can be predominantly (if not completely) attributed to removing the beam metrics collection in the DirectRunner. The load that beam metrics (in DirectRunner) put on the heap is massive (that extra minute basically went to GC activity), however, knowing that anything performance related is implicitly considered a non-issue I didn't report it. This high resource consumption is probably why FlinkRunner eventually added the option to disable metrics collection.

@iemejia
Copy link
Member

iemejia commented Jan 17, 2019

@janotav 'anything performance related' is an issue, even if Direct Runner is a test runner, improvements in its performance benefit us all so don't hesitate to report them (or contribute fixes). Also your fix is in 'sdks/java/core' so it benefits Everyone. It is really worth!

@iemejia
Copy link
Member

iemejia commented Jan 17, 2019

Do you have more details on the metrics performance issue you mention, mind to create a JIRA please.

@kennknowles
Copy link
Member

@janotav you are quite right that this hidden contract is very suspicious. I have looked into the type hierarchy to investigate.

The issue is that there are two desires in conflict: (1) a runner can deserialize a protobuf PCollectionView using just the tag, into whatever its runner-specific representation and (2) you can use PCollectionView as a key to retrieve values. Together, these force any subclass of PCollectionView should be equal (and equal hashcode) if their tags are equal, since runner's create proxy views or whatever. IMO this contract is broken, since the same tag but different ViewFn should not ever be equal.

But if you want to gain the performance back, I bet you can roll forward and also just change here to match: https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/RunnerPCollectionView.java#L108

Even better would be to port things to use the tag as the key into any implementation map.

There is not even equals and hashcode on these subclasses in the Dataflow worker so I think that implies the tag is used directly: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowPortabilityPCollectionView.java and https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/FetchAndFilterStreamingSideInputsOperation.java#L99

@kennknowles
Copy link
Member

You may mean that anything performance-related in the DirectRunner is a non-issue. Sometimes it seems that way, and it is true that it is focused on just being a fake for testing. But it is so bad that we really do need to improve it. Please keep reporting issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants