[FLINK-4460] Allow ProcessFunction on non-keyed streams #3438

aljoscha · 2017-03-01T13:16:37Z

This is in preparation for side outputs, which will only work on ProcessFunction. We still want side outputs on non-keyed streams so we have to make ProcessFunction available there.

See this ML thread for reference: https://lists.apache.org/thread.html/f3fe7d68986877994ad6b66173f40e72fc454420720a74ea5a834cc2@%3Cdev.flink.apache.org%3E

aljoscha · 2017-03-01T13:16:56Z

R: @uce or @rmetzger for review, please

rmetzger · 2017-03-03T20:47:07Z

I looked over the changes and didn't find anything critical. The only thing that gave me pause was the boxed Long type for timestamp(). I assume you chose this approach to signal timestamp unavailability using null. The Java documentation recommends against relying on autoboxing in performance-critical code: http://docs.oracle.com/javase/1.5.0/docs/guide/language/autoboxing.html

Tests and the Scala API are covered. I assume that we don't need to explicitly mention support for the process function on non-keyed streams.

aljoscha · 2017-03-04T07:14:47Z

@rmetzger Yes, it's unfortunate that in our model not all elements always have a timestamp. The other alternative is throwing an exception when trying to access a non-existing timestamp.

rmetzger · 2017-03-04T15:52:42Z

In addition to throwing an exception, we should also expose element.hasTimestamp() to offer our users a clean way of checking for timestamps.
Let's see what @uce or other reviewers think about this.
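
For readers following the timestamp discussion, the two options can be put side by side in a minimal sketch. It uses plain Java and no Flink classes: `SketchElement`, `timestampOrNull()` and `TimestampSketch` are hypothetical names made up for illustration, and only `timestamp()` and `hasTimestamp()` echo names used in the thread. It is not the actual Flink Context interface.

```java
import java.util.NoSuchElementException;

// Hypothetical stand-in for a stream element; this is not a Flink class.
final class SketchElement {
    private final String value;
    private final boolean hasTimestamp;
    private final long timestamp;

    private SketchElement(String value, boolean hasTimestamp, long timestamp) {
        this.value = value;
        this.hasTimestamp = hasTimestamp;
        this.timestamp = timestamp;
    }

    static SketchElement withTimestamp(String value, long timestamp) {
        return new SketchElement(value, true, timestamp);
    }

    static SketchElement withoutTimestamp(String value) {
        return new SketchElement(value, false, 0L);
    }

    String value() {
        return value;
    }

    // Option A: a boxed Long, where null means "no timestamp".
    Long timestampOrNull() {
        return hasTimestamp ? Long.valueOf(timestamp) : null;
    }

    // Option B: an explicit check plus a primitive accessor that fails loudly.
    boolean hasTimestamp() {
        return hasTimestamp;
    }

    long timestamp() {
        if (!hasTimestamp) {
            throw new NoSuchElementException("element has no timestamp");
        }
        return timestamp;
    }
}

final class TimestampSketch {
    // Callers of option B check first, so no boxing and no null handling is needed.
    static String describe(SketchElement e) {
        return e.hasTimestamp() ? e.value() + "@" + e.timestamp() : e.value() + "@none";
    }

    public static void main(String[] args) {
        System.out.println(describe(SketchElement.withTimestamp("a", 42L)));
        System.out.println(describe(SketchElement.withoutTimestamp("b")));
    }
}
```

Option A keeps the interface to a single method but pushes a null check onto every caller. Option B avoids boxing at the cost of a second method and an exception path.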

aljoscha · 2017-03-04T16:03:39Z

I think the discussion of timestamps and additional interfaces is orthogonal to this PR: KeyedProcessOperator is a renaming of the pre-existing ProcessOperator, and the new ProcessOperator is a simplification that does away with timers. The interface for timestamps exists in the current code base; if we want to change that, we should open other Jira issues.

wenlong88 · 2017-03-06T02:15:51Z

...k-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/ProcessFunction.java

 *
 * @param <I> Type of the input elements.
 * @param <O> Type of the output elements.
 */
 @PublicEvolving
-public interface ProcessFunction<I, O> extends Function {
+public abstract class ProcessFunction<I, O> extends AbstractRichFunction {


Hi, changing from interface to class is incompatible on the user side. Can't ProcessFunction just extend RichFunction?

I think the problem is that we need a default implementation for onTimer(long, OnTimerContext, Collector) (see below).

Hi @wenlong88, in the ML discussion (https://lists.apache.org/thread.html/f3fe7d68986877994ad6b66173f40e72fc454420720a74ea5a834cc2@%3Cdev.flink.apache.org%3E) we decided to make ProcessFunction available on non-keyed streams as well, to allow using side outputs there. This requires onTimer() to have a default implementation, which in turn means turning ProcessFunction into an abstract class; otherwise every user would always have to implement it. We marked ProcessFunction as @PublicEvolving for just such cases; it's still a very young API and we didn't know exactly what was going to be needed in the end.
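
The trade-off discussed in this thread can be illustrated with a small sketch in plain Java. `SketchProcessFunction`, `Exclaim`, `TimerAware` and `ProcessFunctionSketch` are hypothetical names, and a `List` stands in for Flink's Collector, so this is a simplified illustration rather than the real ProcessFunction signature. The point is only that an abstract base class can ship a no-op onTimer(), so functions that never use timers implement a single method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for ProcessFunction; not the Flink class.
abstract class SketchProcessFunction<I, O> {
    // Every implementation has to say what happens per element.
    abstract void processElement(I value, List<O> out);

    // Default no-op: functions that never set timers do not override this.
    void onTimer(long timestamp, List<O> out) {
    }
}

// A function without timers implements one method only.
class Exclaim extends SketchProcessFunction<String, String> {
    @Override
    void processElement(String value, List<String> out) {
        out.add(value + "!");
    }
}

// A function that does use timers overrides the default.
class TimerAware extends SketchProcessFunction<String, String> {
    @Override
    void processElement(String value, List<String> out) {
        out.add(value);
    }

    @Override
    void onTimer(long timestamp, List<String> out) {
        out.add("timer@" + timestamp);
    }
}

final class ProcessFunctionSketch {
    // Feeds one element and fires one timer, collecting everything emitted.
    static List<String> run(SketchProcessFunction<String, String> fn, String input, long timer) {
        List<String> out = new ArrayList<>();
        fn.processElement(input, out);
        fn.onTimer(timer, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new Exclaim(), "a", 5L));
        System.out.println(run(new TimerAware(), "a", 5L));
    }
}
```

The cost is the one raised above: existing code written as `implements` must switch to `extends`, and a user class can no longer extend a different base class.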

uce

Cool change! I'm OK with the change from interface to abstract class. Do we need to update the documentation for any of the changes? If yes, I would make this part of this PR.

I had some inline comments that you can have a look at before merging. Other than that, +1 to merge this.

uce · 2017-03-06T09:54:09Z

flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java

+	 * @return The transformed {@link DataStream}.
+	 */
+	@Internal
+	public <R> SingleOutputStreamOperator<R> process(


Is this internal method only exposed as public for the Scala API? If yes, I'm wondering if it makes sense to call transform manually in the Scala DataStream API.

Yes, it's exposed for that. The pattern so far is for the Java API to also expose a public method that takes a TypeInformation, because in the Scala API we get the TypeInformation from the context bound.

Calling transform() manually is an option but if we do that we would basically not base the Scala API on the Java API anymore and we would have code that instantiates the Stream Operators in both the Java and Scala API. For example, right now we have the code for instantiating a flat map operator in (Java)DataStream while (Scala)DataStream.flatMap() calls that method.

What do you think?

Makes sense to keep it like that. The benefits of basing the Scala API on top of the Java API instead of duplicating it are very persuasive, too. 😄 So +1 to keep it as is. 👍 I was just wondering whether users would be confused by this.

uce · 2017-03-06T09:55:14Z

...k-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/ProcessFunction.java

+ *
+ * <p><b>NOTE:</b> A {@code ProcessFunction} is always a
+ * {@link org.apache.flink.api.common.functions.RichFunction}. Therefore, access to the
+ * {@link org.apache.flink.api.common.functions.RuntimeContext} as always available and setup and


typo: as => is?

uce · 2017-03-06T09:57:51Z

...k-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/ProcessFunction.java

 *
 * @param <I> Type of the input elements.
 * @param <O> Type of the output elements.
 */
 @PublicEvolving
-public interface ProcessFunction<I, O> extends Function {
+public abstract class ProcessFunction<I, O> extends AbstractRichFunction {


I think the problem is that we need a default implementation for onTimer(long, OnTimerContext, Collector) (see below).

uce · 2017-03-06T10:00:42Z

...k-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/ProcessFunction.java

 *
 * @param <I> Type of the input elements.
 * @param <O> Type of the output elements.
 */
 @PublicEvolving
-public interface ProcessFunction<I, O> extends Function {
+public abstract class ProcessFunction<I, O> extends AbstractRichFunction {


Missing serialVersionUID
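
As background for this remark: once ProcessFunction is an abstract class in a Serializable hierarchy, each class in that hierarchy would normally declare its own serialVersionUID. The sketch below uses plain Java with hypothetical class names (`SketchRichFunction`, `SketchFunction`, `Scaler`, `SerialSketch`), and the value `1L` is a placeholder, not a value taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable rich-function base class.
abstract class SketchRichFunction implements Serializable {
    private static final long serialVersionUID = 1L;
}

// Abstract classes in a Serializable hierarchy declare their own UID as well.
abstract class SketchFunction<I, O> extends SketchRichFunction {
    private static final long serialVersionUID = 1L;

    abstract O apply(I input);
}

// A concrete user function with some state that must survive serialization.
class Scaler extends SketchFunction<Integer, Integer> {
    private static final long serialVersionUID = 1L;

    private final int factor;

    Scaler(int factor) {
        this.factor = factor;
    }

    @Override
    Integer apply(Integer input) {
        return input * factor;
    }
}

final class SerialSketch {
    // Serializes and deserializes an object in memory.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new Scaler(3)).apply(7));
    }
}
```

Without an explicit field the JVM derives the UID from the class structure, so an unrelated change to the class can make previously serialized instances unreadable.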

uce · 2017-03-06T10:15:48Z

@rmetzger @aljoscha I would agree with Aljoscha that your point is independent of this PR. Is there an issue for 2.0 to track this?

aljoscha · 2017-03-06T10:34:19Z

@uce There is no issue for 2.0 to track this because I don't think there is consensus about always having timestamps.

aljoscha · 2017-03-06T10:54:16Z

@uce There is some documentation that says that ProcessFunction is only available on keyed streams. I'll change that.

aljoscha · 2017-03-06T15:55:32Z

Merged

wenlong88 · 2017-03-06T16:07:49Z

Thanks for the explanation. I had this concern because we have just suggested that our users use ProcessFunction to implement their jobs, and they will need to change their code too when we sync the commit. After all, it is really nice to have timers in more scenarios.

aljoscha added 5 commits March 6, 2017 12:26

[FLINK-4460] Make ProcessFunction abstract, add default onTime() method

82eddca

This is in preparation of allowing ProcessFunction on DataStream because we will use it to allow side outputs from the ProcessFunction Context.

[FLINK-4660] Allow ProcessFunction on DataStream

0228676

Introduce new ProcessOperator for this. Rename the pre-existing ProcessOperator to KeyedProcessOperator.

[FLINK-4460] Make CoProcessFunction abstract, add default onTime() method

e12f320

This is in preparation of allowing CoProcessFunction on a non-keyed connected stream. We will use it to allow side outputs from the ProcessFunction Context.

[FLINK-4660] Allow CoProcessFunction on non-keyed ConnectedStreams

06740fb

Introduce new CoProcessOperator for this. Rename the pre-existing CoProcessOperator to KeyedCoProcessOperator.

[FLINK-4460] Update doc: ProcessFunction now possible on DataStream

746c1ef

aljoscha force-pushed the jira-4460-process-for-everyone branch from a26accf to 746c1ef Compare March 6, 2017 13:04

asfgit merged commit 746c1ef into apache:master Mar 6, 2017

aljoscha deleted the jira-4460-process-for-everyone branch March 6, 2017 15:53

rmetzger added the component=API/DataStream label Mar 14, 2019
