FLINK-3179 Combiner is not injected if Reduce or GroupReduce input is explicitly partitioned (Ram) #1553
Also ensured that the related test cases pass and that the WordCount program output remains the same with and without the partition.
Thanks for the PR! |
@@ -102,36 +107,72 @@ public SingleInputPlanNode instantiate(Channel in, SingleInputNode node) {
DriverStrategy.SORTED_GROUP_REDUCE, this.keyList);
} else {
// non forward case. all local properties are killed anyways, so we can safely plug in a combiner
The else branch will not be entered if the GroupReduce's predecessor is a Partition operator. You need to add an else if branch to the condition.
You identified the right classes and methods for the fix, but the place within the method is not correct. Let me explain the issue. In the common case as for example in a WordCount program, the operator order looks like this:
in this case, a combiner will be appended to the Map to reduce the data before it is partitioned over the network. This looks like:
In some cases, Flink knows that the data is already appropriately partitioned (e.g. after a join):
in this case, the data is already local and no combiner needs to be injected. The check is based on the shipping strategy of the input channel (this is the In case of an explicit partition operator, the operators look as follows:
hence, the code enters the
Hence, we should adapt the condition to inject a combiner if the input strategy of Reduce is We should add appropriate tests for this feature. I suggest:
|
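The effect of the combiner described in the comment above can be illustrated without Flink. The following is a minimal plain-Java sketch, not Flink code: the class `CombinerSketch` and its methods `tokenize`, `combine`, and `reduce` are hypothetical stand-ins for the Map, Combine, and Reduce steps, and the "network" is just the list handed to `reduce`. It only shows that pre-aggregating before the partition step ships fewer records while the final result stays the same.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    /** Stand-in for the Map step: every word becomes a (word, 1) pair. */
    static List<Map.Entry<String, Integer>> tokenize(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Stand-in for the combiner: sums counts per word on the sending side. */
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    /** Stand-in for the final Reduce: sums whatever records were shipped to it. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> shipped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : shipped) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = tokenize("to be or not to be");
        List<Map.Entry<String, Integer>> combined = combine(mapped);
        System.out.println("records shipped without combiner: " + mapped.size());
        System.out.println("records shipped with combiner:    " + combined.size());
        System.out.println("same final counts: " + reduce(mapped).equals(reduce(combined)));
    }
}
```

For the sample line, six pairs are shipped without the combiner and four with it, and both paths reduce to the same counts. A real job would see the saving per parallel sender, which this single-list sketch does not model.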
Thank you very much for the feedback. Let me try to understand this better and update the PR soon. I will reach out here in case I have any questions or doubts. Thanks a lot.
I went through the code. In both cases of WordCount program with and without explicit partition
|
The Hence the logic has to into the |
Might be a stupid question, but what if the partitioner depends on the number of elements. E.g. if you use |
If a |
@fhueske |
@ramkrish86, no worries :-) Thanks for working on this.. |
A new push request has been submitted. JFYI @fhueske.
@@ -66,7 +66,7 @@ public static void main(String[] args) throws Exception {

DataSet<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
(text.flatMap(new Tokenizer()))
Unnecessary change
Hi @ramkrish86, thanks for the update. Also we must take care of the case where the result of the partition operator goes into more than one function. Consider the following case:
which should be translated to:
Both translation tests need to be extended to cover this case. Thanks, Fabian |
@fhueske |
I do not have a concrete use case in mind, but it is certainly possible to implement such a job in the DataSet API. Hence, it should be correctly translated.
|
@fhueske |
Sorry, I forgot a
|
New PR submitted, @fhueske. Thanks for helping me through this code review. I am more of a beginner here and there is a lot to learn on my side.
return new SingleInputPlanNode(node, "Reduce ("+node.getOperator().getName()+")",
toReducer, DriverStrategy.SORTED_GROUP_REDUCE, this.keyList);
}
}

private SingleInputPlanNode injectCombinerBeforPartitioner(Channel in, SingleInputNode node) {
Typo in method name injectCombinerBeforePartitioner
Please use meaningful parameter name: in -> toReducer, node -> reduceNode
Hi @ramkrish86, I would suggest we put this PR on ice for a few days while I check whether it is possible to continue or whether we have to find another approach.
@fhueske |
Hi @ramkrish86, I totally understand that you are disappointed. I'm very sorry to raise these concerns this late after you put a lot of effort into this PR. I should have noticed this issue much earlier :-( Touching the optimizer is always a little bit like open heart surgery and must be done very carefully with the whole picture in mind. I have not completely investigated the possible side effects yet, but will definitely let you know once I have. Would you like to work on a different issue in the meantime? |
@fhueske |
Hi @ramkrish86, I thought about this PR and came to the conclusion that we should not continue. The optimizer's design does not allow modifying operators in, or injecting operators into, enumerated subplans. This might cause invalid execution plans and, in the worst case, wrong results without anybody noticing. I would simply log a WARN message that a combiner was not added if the optimizer identifies a Partition operator in front of a Reduce or combinable GroupReduce operator, and give a hint that an explicit CombinerFunction can be added with groupCombine in front of the partition operator. Sorry again, @ramkrish86, that I led you into a dead end with this PR.
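The warning-only approach proposed above can be sketched as a plain check on the plan shape. This is a toy model under stated assumptions, not the optimizer's code: `CombinerHintSketch`, `Kind`, `Node`, and `missingCombinerWarning` are hypothetical names, the plan is reduced to a single-input chain, and the message text is made up for illustration.

```java
import java.util.Optional;

class CombinerHintSketch {

    /** Hypothetical operator kinds; these are not Flink's optimizer classes. */
    enum Kind { SOURCE, MAP, PARTITION, REDUCE, GROUP_REDUCE }

    /** Hypothetical plan node: an operator plus its single input (null for a source). */
    static final class Node {
        final Kind kind;
        final boolean combinable;
        final Node input;

        Node(Kind kind, boolean combinable, Node input) {
            this.kind = kind;
            this.combinable = combinable;
            this.input = input;
        }
    }

    /**
     * Returns a warning text when a Reduce or combinable GroupReduce directly
     * follows a Partition operator, and an empty Optional otherwise.
     */
    static Optional<String> missingCombinerWarning(Node node) {
        boolean reduceLike = node.kind == Kind.REDUCE
                || (node.kind == Kind.GROUP_REDUCE && node.combinable);
        boolean afterPartition = node.input != null && node.input.kind == Kind.PARTITION;
        if (reduceLike && afterPartition) {
            return Optional.of("No combiner was added for " + node.kind
                    + " because its input is explicitly partitioned."
                    + " Consider adding an explicit combiner (groupCombine)"
                    + " in front of the partition operator.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node source = new Node(Kind.SOURCE, false, null);
        Node map = new Node(Kind.MAP, false, source);
        Node partition = new Node(Kind.PARTITION, false, map);
        Node reduce = new Node(Kind.REDUCE, true, partition);

        // A real implementation would hand this to a logger at WARN level.
        missingCombinerWarning(reduce).ifPresent(msg -> System.err.println("WARN " + msg));
    }
}
```

The check only reads the plan and never rewrites it, which is the point of the proposal: the enumerated subplans stay untouched and the user decides whether to add the combiner explicitly.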
…ot added in front of PartitionOperator This closes apache#1822 This closes apache#1553
Closed this PR in favor of PR #1822 |
Followed the guidance given in the description in order to fix this. Is the approach correct here? Also using this to learn the code.
Once we see that a partition node is the input of a reduce node or group reduce node - we try to inject the combiner to the source node (the data source node) and the reducer node will take the actual partition node as the input.
So now the structure would be DataSource->Combine->Partition->Reduce.
Suggestions and feedback are welcome, as I am not sure whether I have covered all the cases here.