-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-4460] Side Outputs in Flink #3484
Conversation
1b70ee5
to
8c5ac2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* <pre>{@code | ||
* static final OutputTag<X> sideOutputTag = new OutputTag<X>("side-output") {}; | ||
* | ||
* public void flatMap(X value, Collector<String> out) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments seems out of date, I think we already decided to get ride of CollectorWrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing
@@ -85,6 +86,7 @@ | |||
private Set<Integer> sources; | |||
private Set<Integer> sinks; | |||
private Map<Integer, Tuple2<Integer, List<String>>> virtualSelectNodes; | |||
private Map<Integer, Tuple2<Integer, OutputTag>> virtualOutputNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might consider use addVirtualSideOutputNode
and virtualSideOutputNodes
. Unless we want to refactor move away from current assumption <IN>operator<OUT>
to <<tag1,IN1>...<tagX,INX> operator <<taga,OUTa>...<tagx,OUTX>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is already called addVirtualSideOutputNode()
. I'm adjusting the name of the field. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
int virtualId = upStreamVertexID; | ||
upStreamVertexID = virtualOutputNodes.get(virtualId).f0; | ||
if (outputTag == null) { | ||
// selections that happen downstream override earlier selections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may consider call out this behavior in getSideOutput
comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is a leftover from copying this code from split/select. For side outputs it can't happen that you have multiple "selects" after one another. Will remove the comment. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me!
@@ -60,6 +60,7 @@ | |||
import java.util.List; | |||
import java.util.Map; | |||
import java.util.Map.Entry; | |||
import com.google.common.collect.Iterables; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think introduce this dependency is good idea or bad idea? Up to you :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I'm changing this to simply have two loops.
I think you introduced this in the first place, though. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right, good catch!
Thanks for fixing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -60,6 +60,7 @@ | |||
import java.util.List; | |||
import java.util.Map; | |||
import java.util.Map.Entry; | |||
import com.google.common.collect.Iterables; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right, good catch!
Thanks for fixing!
// element not handled by any window | ||
// late arriving tag has been set | ||
// windowAssigner is event time and current timestamp + allowed lateness no less than element timestamp | ||
if(isSkippedElement && lateDataOutputTag != null && isLate(element)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this again. I think it doesn't hurt to have it because it catches the case when a WindowAssigner
doesn't assign any windows. In that case an element is also "skipped" but it is not necessarily considered late. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added a test for the behaviour with a "weird" WindowAssigner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some comments, most on code style and duplicate code. There is one that raises some correctness concerns. It is in the OutputTag
class, in the equals method, the second point.
This review was just the initial one. I may have some additional comments.
this.typeInfo = TypeExtractor.createTypeInfo(instance, baseClass, instance.getClass(), genericParameterPos); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove one of the 2 empty lines.
* @param id The id of the created {@code OutputTag}. | ||
*/ | ||
public OutputTag(String id) { | ||
Preconditions.checkNotNull(id, "OutputTag id cannot be null."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need both lines with the checks. We can just have:
this.id = Preconditions.checkNotNull(id, "OutputTag id cannot be null.");
Preconditions.checkNotNull(id, "OutputTag id cannot be null."); | ||
this.id = requireNonNull(id); | ||
|
||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for line breaking:
TypeHint<T> typeHint = new TypeHint<T>(OutputTag.class, this, 0) {};
*/ | ||
public OutputTag(String id, TypeInformation<T> typeInfo) { | ||
this.id = Preconditions.checkNotNull(id, "OutputTag id cannot be null."); | ||
this.typeInfo = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for line breaking.
|
||
@Override | ||
public boolean equals(Object obj) { | ||
return obj instanceof OutputTag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two points:
- we cannot have
this.id == null
or(OutputTag) obj).id == null
because we check at the constructor, so this method can be simplified. - we never check for uniqueness of the
outputTag.id
. We should do it at the translation. This is also a correctness issue as this may result in undesired sideoutput "collisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have liked to include the TypeInformation
into the check but we can't do that because it's transient. I'll try and figure something out for checking that side outputs are unique, not as easy as it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The problem is that if this does not work, then we can have important side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still the first comment applies: the equals
can be simplified given that id != null
.
@@ -1528,14 +1572,16 @@ public void testDropDueToLatenessSessionZeroLatenessPurgingTrigger() throws Exce | |||
stateDesc, | |||
new InternalSingleValueWindowFunction<>(new ReducedSessionWindowFunction()), | |||
PurgingTrigger.of(EventTimeTrigger.create()), | |||
LATENESS); | |||
LATENESS, | |||
lateOutputTag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong alignment
@@ -1618,14 +1664,16 @@ public void testDropDueToLatenessSessionZeroLateness() throws Exception { | |||
stateDesc, | |||
new InternalSingleValueWindowFunction<>(new ReducedSessionWindowFunction()), | |||
EventTimeTrigger.create(), | |||
LATENESS); | |||
LATENESS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong alignment
@@ -1702,15 +1754,16 @@ public void testDropDueToLatenessSessionWithLatenessPurgingTrigger() throws Exce | |||
stateDesc, | |||
new InternalSingleValueWindowFunction<>(new ReducedSessionWindowFunction()), | |||
PurgingTrigger.of(EventTimeTrigger.create()), | |||
LATENESS); | |||
LATENESS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong alignment
@@ -53,5 +54,11 @@ public void collect(StreamRecord<T> record) { | |||
} | |||
|
|||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature can go on the same line
@@ -40,6 +41,12 @@ public void collect(StreamRecord<T> record) { | |||
} | |||
|
|||
@Override | |||
public <X> void collect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature fits in one line
Thanks @kl0u for the (already) quite thorough review! I'll push a commit with fixes. |
3c521b2
to
efefb83
Compare
Thanks @aljoscha I will have a look on Monday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aljoscha . I did a first pass which did not include the translation part of the StreamGraph
. I will continue with this part now.
So far I had some minor comments and one more important at the CopyingChainingOutput.pushToOperator()
.
|
||
@Override | ||
public boolean equals(Object obj) { | ||
return obj instanceof OutputTag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still the first comment applies: the equals
can be simplified given that id != null
.
* into the side output with the given {@link OutputTag}. | ||
* | ||
* @see org.apache.flink.streaming.api.functions.ProcessFunction.Context#output(OutputTag, Object) | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space between the ) and the {
* @see org.apache.flink.streaming.api.functions.ProcessFunction.Context#output(OutputTag, Object) | ||
*/ | ||
public <X> DataStream<X> getSideOutput(OutputTag<X> sideOutputTag){ | ||
sideOutputTag = clean(sideOutputTag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to not reuse the argument variable but create a new one.
} | ||
|
||
requestedSideOutputs.put(sideOutputTag, sideOutputTag.getTypeInfo()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requireNotNull
should be in the beginning of the method.
* connected to downstream operations. | ||
* | ||
* @param <T> The type of the elements that result from this {@code SideOutputTransformation} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we do not check if the input
is null
(we do it in the caller method only) but we try get the parallelism. We could have the parallelism as a separate argument, and then, after the super()
check if the input is null.
This makes the code of the class self-contained as you do not have to check other classes to see if the input
can be null
or not. What do you think?
* @param element The element to check | ||
* @return The element for which should be considered when sideoutputs | ||
*/ | ||
protected boolean isLate(StreamRecord<IN> element){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used any more, right? So it can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have removed the check by accident. I think we agreed to rename this to something more meaningful and keep it, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what I remember as well.
outSerializer = upStreamConfig.getTypeSerializerSideOut( | ||
edge.getOutputTag(), taskEnvironment.getUserClassLoader()); | ||
} else { | ||
// main output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can become one line.
|
||
pushToOperator(record); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can become private
as the copying alternative has its own implementation, right?
pushToOperator(record); | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can become private
, as before.
operator.processElement(copy); | ||
} | ||
catch (Exception e) { | ||
operator.processElement(castRecord); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be copy
, not castRecord
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aljoscha I finished my review. I had some comments.
* We need to create an {@link OutputTag} so that we can reference it when emitting | ||
* data to a side output and also to retrieve the side output stream from an operation. | ||
*/ | ||
static final OutputTag<String> rejectedWordsTag = new OutputTag<String>("rejected") {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we add a side output but we do nothing to show that it works. Probably we can add a prefix "rejected-" to the record and print it, so that the user can see what the side output does.
this.sourceVertex = sourceVertex; | ||
this.targetVertex = targetVertex; | ||
this.typeNumber = typeNumber; | ||
this.selectedNames = selectedNames; | ||
this.outputPartitioner = outputPartitioner; | ||
this.outputTag = outputTag; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to add the outputTag also in the edgeId
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the edge id exactly does and who uses it so I prefer to not touch it, for now.
@@ -333,32 +373,39 @@ public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int ty | |||
downStreamVertexID, | |||
typeNumber, | |||
null, | |||
new ArrayList<String>()); | |||
new ArrayList<String>(), null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null
should go to the next line for uniformity.
@@ -63,6 +66,7 @@ | |||
private TypeSerializer<?> typeSerializerIn1; | |||
private TypeSerializer<?> typeSerializerIn2; | |||
private TypeSerializer<?> typeSerializerOut; | |||
private Map<OutputTag<?>, TypeSerializer<?>> typeSerializerMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used anywhere in the code. Can it be removed, along with the getTypeSerializerOut()
and setTypeSerializerOut()
?
efefb83
to
62cc5ee
Compare
62cc5ee
to
20d8d67
Compare
This does not yet allow users to emit to side outputs in user functions. Only operators (StreamOperator) can emit to side outputs. A side output can be retrieved on a SingleOutputStreamOperator.
We use side outputs to emit dropped late data.
The Scala side output API uses context bounds to get a TypeInformation for an OutputTag. This also adds a SideOutputITCase for the Scala API.
This also adds tests.
20d8d67
to
d0eef93
Compare
This is a refinement of #2982 by @chenqin.
I changed the API a bit, added support for side outputs to
ProcessFunction
, enabled side outputs to work with chaining, added proper Scala API and a Scala API test and added documentation.R: @uce @kl0u and @chenqin for review, please