
[BEAM-2140] Execute Splittable DoFn directly in Flink Runner#3480

Closed
aljoscha wants to merge 1 commit into apache:master from aljoscha:fix-flink-splittable-dofn-squashed

Conversation

@aljoscha
Contributor

Before, we were using ProcessFn. This was causing problems with the
Flink Runner for two reasons:

  1. StatefulDoFnRunner is in the processing path, which means
    processing-time timers are being dropped when the watermark reaches +Inf

  2. When a pipeline shuts down (for example, when bounded sources shut
    down) Flink will drop any outstanding processing-time timers, meaning
    that any remaining Restrictions will not be processed.

The fix for 1. is to execute the splittable DoFn directly, thereby
bypassing the late data/timer dropping logic.

The fix for 2. builds on the fix for 1. and also introduces a "last
resort" event-time timer that fires at +Inf and makes sure that any
remaining restrictions are exhausted.

R: @jkff Not sure if we want to fix it like this or maybe adapt ProcessFn and remove StatefulDoFnRunner from the processing path.
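The "last resort" event-time timer idea can be illustrated with a tiny self-contained model. This is a hypothetical sketch (the class and method names are invented, not Flink's or Beam's real APIs): a processing-time timer would be dropped when the operator shuts down, but an event-time timer registered at the maximum timestamp is guaranteed to fire when the final +Inf watermark arrives, which gives the operator one last chance to drain any outstanding restrictions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the "last resort" event-time timer, not Flink's
// real timer service. Event-time timers fire when the watermark passes
// their timestamp, so a timer at Long.MAX_VALUE fires exactly when the
// final +Inf watermark arrives, just before shutdown.
class LastResortTimerSketch {
    final Queue<String> pendingRestrictions = new ArrayDeque<>();
    long eventTimeTimer = -1;  // -1 means "no timer set"
    int processedCount = 0;

    void addRestriction(String restriction) {
        pendingRestrictions.add(restriction);
        // Register the last-resort timer. Unlike a processing-time timer,
        // which Flink drops at shutdown, this fires on the final watermark.
        eventTimeTimer = Long.MAX_VALUE;
    }

    void onWatermark(long watermark) {
        if (eventTimeTimer >= 0 && watermark >= eventTimeTimer) {
            // Exhaust everything still outstanding before the operator
            // is allowed to shut down.
            while (!pendingRestrictions.isEmpty()) {
                pendingRestrictions.poll();
                processedCount++;
            }
            eventTimeTimer = -1;
        }
    }
}
```

As the review discussion below notes, this only helps for restrictions that eventually finish; an SDF producing infinite output per element would block here forever.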

@aljoscha
Contributor Author

Run Flink ValidatesRunner

@jkff
Contributor

jkff commented Jun 30, 2017

Thanks! I'm gonna be out for the whole week though, with no connectivity, so maybe @kennknowles or @tgroh can take a look meanwhile?

@kennknowles
Member

I'll take a peek

@davorbonaci
Member

davorbonaci commented Jul 17, 2017

@jkff / @kennknowles, any chance you can take a look at this shortly? This would be a great addition to the Flink runner!

@jkff
Contributor

jkff commented Jul 20, 2017

Sorry, forgot about this PR while on vacation. Looking.

@jkff
Contributor

jkff commented Jul 20, 2017

Run Flink ValidatesRunner


@jkff jkff left a comment


Thanks!

restrictionExhausted =
    processRestriction(elementState, restrictionState, holdState, elementAndRestriction);
} while (!restrictionExhausted);
Contributor


What if it is never exhausted, e.g. this is an SDF reading from pubsub forever?

Contributor Author


Then we block here forever, which is not good. However, this is the last chance we have of executing the splittable DoFn. If we get the +Inf watermark, this means that the upstream operators have shut down and that our operator will also be shut down by Flink once all incoming elements have been processed (Flink will shut down an operator if it has no live upstream operators left; it will not block shutdown on outstanding timers).

This is also the main reason why we cannot simply reuse ProcessFn as is.

Contributor


Hm, this sounds like a serious issue, because SDF is specifically intended to support infinite outputs per element. Do you think it's feasible to fix Flink to support this (e.g. optionally block shutdown on outstanding timers for an operator)?


// Initialize state (element and restriction)

WindowedValue<KV<InputT, RestrictionT>> windowedValue =
Contributor


Hm, I'm not super happy about copy-pasting code of ProcessFn here. Why can't we just use the processFn, like https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/SplittableProcessElementsEvaluatorFactory.java does?

Contributor Author


Yes, I'm also not happy about this. The main reason for copying is so that I can set that last-resort event-time timer that allows me to block shutdown when our upstream operators have shut down.

I'll think about this; maybe I can set the timer outside of the ProcessFn. I probably just have to make sure to set the right state namespace on the timer.
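The "right state namespace" point can be sketched with a small hypothetical model (invented names, not Beam's real TimerInternals API): if the last-resort timer is set from outside ProcessFn, it has to be keyed by the same window namespace that ProcessFn's own state and timers use, so that when the timer fires the runner can find the matching restriction state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of registering the last-resort timer outside
// ProcessFn, keyed by window namespace. Invented API, for illustration
// only; Beam's real timer internals differ.
class NamespacedTimers {
    // namespace -> timer timestamp (one last-resort timer per namespace)
    final Map<String, Long> timers = new HashMap<>();

    void setLastResortTimer(String windowNamespace) {
        // Event-time timer at the maximum timestamp: fires only on the
        // final +Inf watermark, right before the pipeline shuts down.
        timers.put(windowNamespace, Long.MAX_VALUE);
    }

    boolean firesAt(String windowNamespace, long watermark) {
        Long t = timers.get(windowNamespace);
        return t != null && watermark >= t;
    }
}
```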

@jkff
Contributor

jkff commented Jul 20, 2017 via email

@aljoscha
Contributor Author

Closing this for now since the implementation does not work. I have another one that is simpler but also doesn't work, because there is currently no way of getting around the fact that Flink will shut down the timer service before calling close() on an operator. This means there is no way of blocking on "in-flight" timers. I'll try to get such a feature into Flink; only then can we fix this issue.

@aljoscha aljoscha closed this Sep 18, 2017
@aljoscha aljoscha deleted the fix-flink-splittable-dofn-squashed branch January 5, 2018 13:13