Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-5867] Operator translation based on operator name #6846

Closed

Conversation

mareksimunek
Copy link
Contributor

Smarter why to choose operator translation based on operators name by introducing NameBaseTranslationProviderTest. Useful e.g. when we want Join operator which fits in memory translate with BroadcastHashJoinTranslator


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

@mareksimunek
Copy link
Contributor Author

@VaclavPlajt @dmvk

@@ -34,9 +34,10 @@ dependencies {
testCompile library.java.slf4j_api
testCompile library.java.hamcrest_core
testCompile library.java.hamcrest_library
testCompile library.java.mockito_core
testCompile library.java.powermock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we stick with mockito?

@mareksimunek mareksimunek force-pushed the simunek/NameTranslationProvider branch from 02b6b8d to 5cc2cb6 Compare October 26, 2018 13:13
@mareksimunek mareksimunek force-pushed the simunek/NameTranslationProvider branch from 5cc2cb6 to dcd8c12 Compare October 29, 2018 11:22
@mareksimunek mareksimunek force-pushed the simunek/NameTranslationProvider branch from 15507d0 to cfedd93 Compare October 29, 2018 16:43
@dmvk
Copy link
Member

dmvk commented Oct 29, 2018

I think this heads the right direction. We need a convenient and flexible way for user to choose which translator should be used for an operator.

I'm thinking about having some kind of composable translator provider, that could compose of multiple translators (in first matching provider wins order.? It may be far more flexible, than setting a default translator in NamedTranslatorProvider (this implies, that there can be only one other provider alongside the named one).

@mareksimunek @VaclavPlajt any thoughts about this?

@mareksimunek
Copy link
Contributor Author

@dmvk I think it should be solved in another ticket. You could also pass composable translator provider to NamedTranslatorProvider where the name matching will be first. Also you could later allow to not have set default translator and put this one in the composable one, which will comply with your idea.

Copy link
Contributor

@VaclavPlajt VaclavPlajt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at comments and update documentation.

* @param shortName on which should operator name start
* @return builder
*/
public NameBasedTranslatorProvider.Builder addShortNameTranslation(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that names of parameters are mismatched. Class<? extends Operator> operator should be operatorClass and could be just OperatorTranslator<?, ?, ?> translator since it is an instance.

I would also thing more about the name of the method itself. Something like addNameBasedTranslation may better describe what the mehod does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

* {@link Builder#addShortNameTranslation(Class, OperatorTranslator, String)} translation). If no
* match, then its used {@link Builder#setDefaultTranslationProvider(TranslatorProvider)}.
*/
public class NameBasedTranslatorProvider implements TranslatorProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rephrase?:
`An implementation of {@link TranslatorProvider} which allows for custom translations of hand picked operators based on operator names.

It either selects specific translation or defaults to wrapped {@link TranslatorProvider}. Selection of specific translation is based on operator's name and {@link Class}. Specific translation could be added through {@link Builder#addShortNameTranslation(Class, OperatorTranslator, String)} method during build step. Names are considered match when an operator's name starts with given prefix.`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased and class link little edited

@@ -343,7 +343,7 @@ Represents left join of two (left and right) datasets on given key producing sin
// KV(3, "3+rat"), KV(0, "0+null"), KV(4, "4+duck"), KV(3, "3+cat"),
// KV(3, "3+rat"), KV(1, "1+X")]
```
Euphoria support performance optimization called 'BroadcastHashJoin' for the `LeftJoin`. User can indicate through previous operator's output hint `.output(SizeHint.FITS_IN_MEMORY)` that output `Dataset` of that operator fits in executors memory. And when the `Dataset` is used as right input, Euphoria will automatically translated `LeftJoin` as 'BroadcastHashJoin'. Broadcast join can be very efficient when joining between skewed datasets.
Euphoria support performance optimization called 'BroadcastHashJoin' for the `LeftJoin`. User can indicate by operators name that right side of left join fits in executors memory (more [NameBasedTranslation](#namebasedtranslatorprovider)). Broadcast join can be very efficient when joining between skewed datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we should not write that this type of optimization is based on operators name. It is based on option to translate it differently. Some other implementation of TranslatorProvider may do that choice based on different information than operators name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would link TranslationProviders section here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edited to more generic

@@ -362,7 +362,7 @@ Dataset<KV<Integer, String>> joined =
// KV(4, "4+duck"), KV(3, "3+cat"), KV(3, "3+rat"), KV(1, "1+X"),
// KV(8, "null+elephant"), KV(5, "null+mouse")]
```
Euphoria support performance optimization called 'Broadcast Hash Join' for the `RightJoin`. User can indicate through previous operator's output hint `.output(SizeHint.FITS_IN_MEMORY)` that output `Dataset` of that operator fits in executors memory. And when the `Dataset` is used as left input, Euphoria will automatically translated `RightJoin` as 'Broadcast Hash Join'. Broadcast join can be very efficient when joining between skewed datasets.
Euphoria support performance optimization called 'Broadcast Hash Join' for the `RightJoin`. User can indicate by operators name that left side of right join fits in executors memory (more [NameBasedTranslation](#namebasedtranslatorprovider)). Broadcast join can be very efficient when joining between skewed datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as in case of LeftJoin.

@@ -567,6 +567,19 @@ Dataset<SomeEventObject> timeStampedEvents =
//Euphoria will now know event time for each event
```

## TranslationProviders
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## TranslationProviders
## TranslationProviders
The fact that Euphoria API is translated to Beam Java SDK give us option to fine tune the translation itself. Euphoria uses `TranslationProvider` to decide which `OperatorTranslator` should be used. User of Euphoria API can supply it ts own `TranslationProvider` by extending `EuphoriaOptions`. Euphoria already contains some implementations.
#SimpleTranslatorProvider
Default implementation of `TranslationProvider`. It is able to give default translators for all Euphoria operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

included

@VaclavPlajt
Copy link
Contributor

@dmvk I like the idea of chaining (anyhow) TranslationProviders so one does not need to define translations of all the operators every time new TranslationProvider is implemented. Having composite provider is aligned with that.

But it would require some more attention than originally anticipated in @mareksimunek scope of current work. So would also prefer to do it as different issue.

@mareksimunek mareksimunek force-pushed the simunek/NameTranslationProvider branch from ee94c44 to 4dedbe0 Compare November 5, 2018 12:55
@mareksimunek mareksimunek force-pushed the simunek/NameTranslationProvider branch from 4dedbe0 to f046acb Compare November 5, 2018 13:18
@robertwb
Copy link
Contributor

Looks like this needs a rebase.

@VaclavPlajt
Copy link
Contributor

This is now obsolete by already merged refactoring. @mareksimunek Can we close it?

@dmvk
Copy link
Member

dmvk commented Dec 10, 2018

@VaclavPlajt 👍

@dmvk dmvk closed this Dec 10, 2018
@dmvk dmvk deleted the simunek/NameTranslationProvider branch December 10, 2018 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants