Liberates ReduceFnRunner from WindowingInternals, and lets WindowingInternals do windowed side outputs #1353

Merged
merged 6 commits into from Nov 18, 2016

Contributor

@jkff jkff commented Nov 12, 2016

  • Introduces WindowingInternals.sideOutputWindowedValue (will be necessary for Splittable DoFn)
  • Implements it properly in all runners (required some minor refactoring in Spark and Flink ProcessContext implementations)
  • Introduces the "OutputWindowedValue" and "SideInputAccess" interfaces, and uses them in ReduceFnRunner instead of directly using WindowingInternals.
  • Introduces adapters from WindowingInternals to these two interfaces, for use in OldDoFn contexts
  • Moves some StateContexts functions into ReduceFnContextFactory, because they make more sense in runners-core than in the SDK (they are only invoked by runners). The only remaining StateContexts function is nullContext; I couldn't figure out an easy way to move it into runners-core and gave up (in fact, I'm not sure its current usages are correct at all...)

R: @kennknowles (for bulk of the code and as committer)
CC: @aljoscha @amitsela (for the minor refactorings in respective runners)
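
A minimal sketch of the adapter idea described in the list above, for readers unfamiliar with the pattern. It does not use the Beam classes: `SketchOutputWindowedValue`, `SketchWindowingInternals`, `SketchAdapters`, and the `String`/`long` stand-ins for windows and timestamps are all hypothetical names invented for illustration. The point is only that a consumer can depend on a narrow output interface while an adapter forwards to a wider one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical narrow interface: the only capability the consumer needs.
interface SketchOutputWindowedValue<T> {
  void outputWindowedValue(T value, long timestampMillis, String window);
}

// Hypothetical wide interface standing in for a WindowingInternals-like type.
interface SketchWindowingInternals<T> {
  void outputWindowedValue(T value, long timestampMillis, String window);

  Object sideInput(String viewName);
}

final class SketchAdapters {
  private SketchAdapters() {}

  // Adapts the wide interface to the narrow one by forwarding the single call.
  static <T> SketchOutputWindowedValue<T> outputVia(SketchWindowingInternals<T> internals) {
    return (value, timestampMillis, window) ->
        internals.outputWindowedValue(value, timestampMillis, window);
  }
}

// Test double that records what was emitted through the wide interface.
final class SketchRecordingInternals implements SketchWindowingInternals<String> {
  final List<String> emitted = new ArrayList<>();

  @Override
  public void outputWindowedValue(String value, long timestampMillis, String window) {
    emitted.add(value + "@" + timestampMillis + " in " + window);
  }

  @Override
  public Object sideInput(String viewName) {
    throw new UnsupportedOperationException(
        "This sketch has no side inputs; requested view: " + viewName);
  }
}

final class SketchAdapterDemo {
  public static void main(String[] args) {
    SketchRecordingInternals internals = new SketchRecordingInternals();
    SketchOutputWindowedValue<String> out = SketchAdapters.outputVia(internals);
    out.outputWindowedValue("a", 5L, "w1");
    System.out.println(internals.emitted);
  }
}
```

With this shape, code written against the narrow interface can be tested with a small fake instead of a full internals implementation.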

/**
* Allows accessing the side inputs for a particular main input window.
*/
public interface SideInputAccess {
Contributor

Isn't there already SideInputReader?

Member

+1 seems to be the same thing.

Reminds me that SideInputReader really should be in runners-core but it is still in the SDK due to misuse in CombineFnRunners and thereabouts.

Contributor Author

Done. It took some fiddling with WindowingInternals - note also that now SideInputReader is the only user of WindowingInternals.sideInput. I'll be able to move SideInputReader into runners-core after another PR that @peihe is working on.

@aljoscha
Contributor

The changes on the Flink side look good 👍

@@ -32,7 +32,7 @@
import org.apache.beam.sdk.values.TupleTag;
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.function.FlatMapFunction;

import org.joda.time.Instant;
Member

@amitsela amitsela Nov 12, 2016

unused.
Occurs in MultiDoFnFunction as well.

Contributor Author

Done.

}

@Override
public <T> void sideOutputWithTimestamp(TupleTag<T> tupleTag, T t, Instant instant) {
Member

If you're opting for enabling side output by default (leaving DoFnFunction to explicitly override with "Unsupported"), that's fine by me, but I guess DoFnFunction should override sideOutputWithTimestamp with an UnsupportedOperationException as well.

Contributor Author

The base class delegates normal output() and sideOutput() on the ProcessContext to the internal abstract methods I introduced - outputWindowedValue and sideOutputWindowedValue (I renamed them and made them protected in response to your comments). sideOutputWindowedValue in DoFnFunction is overridden to throw an UnsupportedOperationException.

"sideOutput is an unsupported operation for doFunctions, use a "
+ "MultiDoFunction instead.");
}

Member

This is a complementing comment to say that sideOutputWithTimestamp should throw UnsupportedOperationException as well.

Contributor Author

Hopefully the comment above addresses this too. sideOutputWithTimestamp is defined in the base class and delegates to sideOutputWindowedValue, which here is overridden to throw the exception.
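
The delegation described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the Spark runner code: `SketchProcessContext`, `SketchSingleOutputContext`, the `String` tag, and the `long` timestamp are hypothetical stand-ins. Only the hook names `outputWindowedValue` and `sideOutputWindowedValue` and the error message text are taken from the conversation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base context: public entry points delegate to two protected hooks.
abstract class SketchProcessContext<OutputT> {
  private final long elementTimestampMillis;

  SketchProcessContext(long elementTimestampMillis) {
    this.elementTimestampMillis = elementTimestampMillis;
  }

  public final void output(OutputT value) {
    outputWindowedValue(value, elementTimestampMillis);
  }

  public final <T> void sideOutput(String tag, T value) {
    sideOutputWindowedValue(tag, value, elementTimestampMillis);
  }

  public final <T> void sideOutputWithTimestamp(String tag, T value, long timestampMillis) {
    sideOutputWindowedValue(tag, value, timestampMillis);
  }

  protected abstract void outputWindowedValue(OutputT value, long timestampMillis);

  protected abstract <T> void sideOutputWindowedValue(String tag, T value, long timestampMillis);
}

// Single-output variant: overriding one hook disables every side-output entry point.
final class SketchSingleOutputContext extends SketchProcessContext<String> {
  final List<String> outputs = new ArrayList<>();

  SketchSingleOutputContext(long elementTimestampMillis) {
    super(elementTimestampMillis);
  }

  @Override
  protected void outputWindowedValue(String value, long timestampMillis) {
    outputs.add(value + "@" + timestampMillis);
  }

  @Override
  protected <T> void sideOutputWindowedValue(String tag, T value, long timestampMillis) {
    throw new UnsupportedOperationException(
        "sideOutput is an unsupported operation for doFunctions, use a MultiDoFunction instead.");
  }
}
```

Because both `sideOutput` and `sideOutputWithTimestamp` funnel into the same hook, a subclass needs only one override to reject them both.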

}

public abstract void output(WindowedValue<OutputT> output);
public abstract <T> void sideOutput(TupleTag<T> tag, WindowedValue<T> output);
Member

white space between the abstracts ?

Contributor Author

Added.

@amitsela
Member

It seems that neither the Spark nor the Flink runner was actually built by Jenkins, because the Direct runner failed on "unused imports" in checkstyle.

I've added some comments on the changes in the Spark runner. I've also built the PR branch locally (ignoring checkstyle) and executed the integration tests for local-runnable-on-service-tests for both Spark and Flink (to avoid surprises in post-commit), and I'm happy to say they both passed.

@@ -496,4 +501,51 @@ public Timers timers() {
return timers;
}
}

private static <W extends BoundedWindow> StateContext<W> stateContextFromComponents(
Member

Why move these?

Contributor Author

They are only used by ReduceFnContextFactory (that's why I made them private to this class), and in any case they belong in runners-core more than in the SDK, especially now that this method takes SideInputReader, which I would like to move into runners-core.

@jkff
Contributor Author

jkff commented Nov 14, 2016

Thanks all, PTAL!

@amitsela
Member

LGTM pending Jenkins/Travis.

Member

@kennknowles kennknowles left a comment

This is quite good. My comments are just slightly more than nits.


@Override
public <T> T sideInput(PCollectionView<T> view) {
return sideInputReader.get(view, windowFn.getSideInputWindow(mainInputWindow));
Member

We had some discussions, and yes this is right.
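
To illustrate the lookup confirmed here, the sketch below maps a main input window to a side input window before reading the side input. Everything in it is assumed for illustration: `SketchWindow`, `SketchSideInputLookup`, and the hourly mapping rule are hypothetical and are not the Beam `WindowFn` or `SideInputReader` APIs. The only idea taken from the snippet above is that the reader is keyed by the side input window, which is derived from the main input window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical half-open time interval [startMillis, endMillis).
final class SketchWindow {
  final long startMillis;
  final long endMillis;

  SketchWindow(long startMillis, long endMillis) {
    this.startMillis = startMillis;
    this.endMillis = endMillis;
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof SketchWindow)) {
      return false;
    }
    SketchWindow that = (SketchWindow) other;
    return startMillis == that.startMillis && endMillis == that.endMillis;
  }

  @Override
  public int hashCode() {
    return Objects.hash(startMillis, endMillis);
  }
}

final class SketchSideInputLookup {
  static final long HOUR_MILLIS = 3_600_000L;

  // Side input contents are keyed by side input window, not by main input window.
  private final Map<SketchWindow, String> valuesBySideInputWindow = new HashMap<>();

  // Assumed mapping rule: the hourly window containing the start of the main window.
  static SketchWindow getSideInputWindow(SketchWindow mainInputWindow) {
    long start =
        mainInputWindow.startMillis - Math.floorMod(mainInputWindow.startMillis, HOUR_MILLIS);
    return new SketchWindow(start, start + HOUR_MILLIS);
  }

  void put(SketchWindow sideInputWindow, String value) {
    valuesBySideInputWindow.put(sideInputWindow, value);
  }

  // Mirrors the shape of the reviewed line: map the window first, then read.
  String sideInput(SketchWindow mainInputWindow) {
    return valuesBySideInputWindow.get(getSideInputWindow(mainInputWindow));
  }
}
```

In this sketch a one-minute main window starting at 01:01 reads the value stored for the 01:00 to 02:00 side input window.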

SideOutputT output,
Instant timestamp,
Collection<? extends BoundedWindow> windows,
PaneInfo pane) {}
Member

Seems like this (and the one above, which is a duplicate) should both throw. A DoFn is not permitted to output to a different window. Silently dropping data seems a cruel way to enforce this.


@Override
public void outputWindowedValue(KV<String, OutputT> output, Instant timestamp,
Collection<? extends BoundedWindow> windows, PaneInfo pane) {
Member

Don't align. Use google-java-format.

Instant timestamp,
Collection<? extends BoundedWindow> windows,
PaneInfo pane) {
throw new UnsupportedOperationException();
Member

Actionable error message, please. This might be a checkState if it is unreachable.
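
As a rough illustration of the two review points above (throw instead of silently dropping, and make the message actionable), here is a small sketch. The class `SketchExplicitWindowOutput`, its method names, and the message wording are hypothetical and not taken from the PR; windows are plain strings for brevity.

```java
import java.util.Arrays;
import java.util.Collection;

final class SketchExplicitWindowOutput {
  private SketchExplicitWindowOutput() {}

  // Silent variant: the empty body means the value simply disappears.
  static void dropSilently(String value, Collection<String> windows) {}

  // Loud variant: says what was attempted and what the caller can do instead.
  static void rejectLoudly(String value, Collection<String> windows) {
    throw new UnsupportedOperationException(
        "Cannot output value '"
            + value
            + "' to explicit windows "
            + windows
            + ": this context only supports output to the window of the current element."
            + " Call output() without specifying windows instead.");
  }

  public static void main(String[] args) {
    dropSilently("a", Arrays.asList("w2"));
    try {
      rejectLoudly("a", Arrays.asList("w2"));
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A bare `new UnsupportedOperationException()` tells the caller only where the failure happened; including the offending arguments and a suggested alternative tells them why and what to change.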


@Override
public <T> boolean contains(PCollectionView<T> view) {
throw new UnsupportedOperationException();
Member

ditto


@Override
public boolean isEmpty() {
throw new UnsupportedOperationException();
Member

ditto


@Override
public <T> boolean contains(PCollectionView<T> view) {
throw new UnsupportedOperationException();
Member

ditto

Member

Can you add an informative message to this exception?


@Override
public boolean isEmpty() {
throw new UnsupportedOperationException();
Member

ditto

Member

Can you add an informative message to this exception?

Instant timestamp,
Collection<? extends BoundedWindow> windows,
PaneInfo pane) {
throw new UnsupportedOperationException("Can't output to side outputs from a ReduceFn");
Member

Let's call it GroupAlsoByWindow, since that is actually the criterion. If ReduceFn ever becomes a thing, it might be allowed to side output. In fact, stateful DoFn is basically ReduceFn.

@kennknowles
Member

kennknowles commented Nov 15, 2016

Just to direct your attention: the Jenkins failure is checkstyle in various Flink files. It runs before tests, to fail fast. The changes here will almost certainly fail spectacularly if they are incorrect, so I am not trying to replace testing by my manual inspection.

@jkff jkff force-pushed the reducefn-windowing-internals branch 3 times, most recently from 4db5bb4 to 1325673 Compare November 16, 2016 00:03
@jkff
Contributor Author

jkff commented Nov 16, 2016

Dang. Looks like I can't just move StateContexts.createFromComponents away from the SDK.

The Jenkins failure is:

(546b2b14867c0a52): Exception: java.lang.NoSuchMethodError: org.apache.beam.sdk.util.state.StateContexts.createFromComponents(Lorg/apache/beam/sdk/options/PipelineOptions;Lorg/apache/beam/sdk/util/WindowingInternals;Lorg/apache/beam/sdk/transforms/windowing/BoundedWindow;)Lorg/apache/beam/sdk/util/state/StateContext;

Let's see what I can do about this...
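
For context, a `NoSuchMethodError` like the one above is raised at run time when already-compiled code calls a method whose name and parameter types no longer exist in the class on the classpath. One common remedy, sketched below, is to keep the old entry point in place and let the new location call it. This is a generic illustration and not necessarily what the PR ended up doing; `SketchStateContexts`, `SketchReduceFnContextFactory`, and the `String` parameters are hypothetical stand-ins.

```java
// Stands in for the SDK-side class that previously compiled callers link against.
final class SketchStateContexts {
  private SketchStateContexts() {}

  /**
   * @deprecated Retained with its original name and parameter types so that callers
   *     compiled against the old location still link.
   */
  @Deprecated
  static String createFromComponents(String options, String window) {
    return "StateContext[" + options + ", " + window + "]";
  }
}

// Stands in for the runner-side class that new code is meant to call.
final class SketchReduceFnContextFactory {
  private SketchReduceFnContextFactory() {}

  // Delegating in this direction keeps the dependency pointing from the
  // runner-side class to the SDK-side class, never the other way around.
  static String stateContextFromComponents(String options, String window) {
    return SketchStateContexts.createFromComponents(options, window);
  }
}

final class SketchCompatibilityDemo {
  public static void main(String[] args) {
    System.out.println(SketchReduceFnContextFactory.stateContextFromComponents("opts", "w1"));
  }
}
```

Once every precompiled caller has been rebuilt against the new location, the deprecated method can be removed.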

@jkff jkff force-pushed the reducefn-windowing-internals branch from 765cc77 to e230c9d Compare November 16, 2016 22:51
@jkff
Contributor Author

jkff commented Nov 17, 2016

Green, PTAL.

Makes WindowingInternals.sideInput take the side input window
instead of main input window.
It must be temporarily restored for compatibility with
current Dataflow worker in order for integration tests
to pass.
Member

@kennknowles kennknowles left a comment

LGTM with the exception of the missing error messages.


@Override
public <T> boolean contains(PCollectionView<T> view) {
throw new UnsupportedOperationException();
Member

Can you add an informative message to this exception?


@Override
public boolean isEmpty() {
throw new UnsupportedOperationException();
Member

Can you add an informative message to this exception?

@jkff jkff force-pushed the reducefn-windowing-internals branch from e230c9d to c0623c1 Compare November 17, 2016 23:36
@jkff
Contributor Author

jkff commented Nov 17, 2016

Done.

@asfgit asfgit merged commit c0623c1 into apache:master Nov 18, 2016
asfgit pushed a commit that referenced this pull request Nov 18, 2016
@jkff jkff deleted the reducefn-windowing-internals branch November 18, 2016 16:16
5 participants