Skip to content

[BEAM-6350] Reuse PCollectionView when created in translators#7399

Merged
dmvk merged 3 commits intoapache:masterfrom
seznam:simunek/reuusePviews
Jan 14, 2019
Merged

[BEAM-6350] Reuse PCollectionView when created in translators#7399
dmvk merged 3 commits intoapache:masterfrom
seznam:simunek/reuusePviews

Conversation

@mareksimunek
Copy link
Contributor

If for LeftJoin is used BroadcastHashJoinTranslator then from right side is created PCollectionView (as sideInput).

If we use right side in multiple joins then PCollectionView is created multiple times which is not optimal behavior.


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

@mareksimunek mareksimunek changed the title [BEAM-6350] Reuuse PCollectionView when created in translators [BEAM-6350] Reuse PCollectionView when created in translators Jan 3, 2019
@VaclavPlajt VaclavPlajt force-pushed the simunek/reuusePviews branch 2 times, most recently from 157ef28 to 6db25b4 Compare January 4, 2019 06:21
@dmvk dmvk self-requested a review January 4, 2019 13:22
@dmvk
Copy link
Member

dmvk commented Jan 4, 2019

Run JavaPortabilityApi PreCommit

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea 👍

* @return the current (already existing or computed) value associated with the specified key
*/
@SuppressWarnings("unchecked")
public <K, V> PCollectionView<Map<K, Iterable<V>>> computeViewAsMultimapIfAbsent(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more readable name? eg.: asMultimap

PViewsStore getPCollectionViewsStore();

@Internal
void setPCollectionViewsStore(PViewsStore pCollectionViewsStore);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is setter necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon we don't need a separate field in pipeline options. It should be enough to keep instance property in BHJ Translator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. But Gradle complains during build that there is a missing setter. Is there a way of circumventing it?
The PViewsStore will be skipped completely. As a conclusion of mutual discussion.

final PCollectionView<Map<KeyT, Iterable<RightT>>> broadcastRight =
right.apply(View.asMultimap());
return left.apply(
pViews.computeViewAsMultimapIfAbsent(right, rightKeyed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think comparing by pcollection is enough as we can have a different key extractor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But that brigs the problem of comparing lambdas for equality (again). Which we do not have good enough solution for. So I will use == equality and try to highlight it in the docs.

@VaclavPlajt VaclavPlajt force-pushed the simunek/reuusePviews branch from 6db25b4 to ee805b4 Compare January 9, 2019 09:18
mareksimunek and others added 2 commits January 9, 2019 10:18
…muliplication. PCollectionViews are now stored in BroadcastHashJoinTranslator. Key extractor is taken into consideration when looking for the same views.
@VaclavPlajt VaclavPlajt force-pushed the simunek/reuusePviews branch from ee805b4 to 2b90576 Compare January 9, 2019 09:18
* Used to prevent multiple views to the same input PCollection. And therefore multiple broadcasts
* of the same data.
*/
private Table<PCollection<?>, UnaryFunction<?, KeyT>, PCollectionView<?>> pViews =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add final to this field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@VaclavPlajt VaclavPlajt force-pushed the simunek/reuusePviews branch from 694d6db to 1ba3efb Compare January 9, 2019 14:13
@VaclavPlajt
Copy link
Contributor

Run JavaPortabilityApi PreCommit

@dmvk dmvk merged commit 9780341 into apache:master Jan 14, 2019
@dmvk dmvk deleted the simunek/reuusePviews branch January 14, 2019 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants