[FLINK-2138] Added custom partitioning to DataStream #872
Conversation
Wouldn't it make sense to implement custom partitioning in a way that allows returning an array of indexes, like in the ChannelSelector interface? Returning only one index limits the partitioning very much. Maybe the users could implement a ChannelSelector and we would wrap that.
I guess it is easier for users to understand, and partitioning to multiple channels at a time is rarely needed. Is there a use case where it is needed? In my opinion it should be consistent with the batch API. Let's start a discussion about this if we would like to change the custom partitioning.
I think I could find several use cases if I wanted to :) For example, I would often like to broadcast some model information to many downstream operators at once (not exactly broadcast, maybe only to a couple of them). Also, even this does not give full flexibility. Imagine a scenario where I have a self loop and I want to send something to all others (except myself); to do this I would also need to know my own channel index...
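To make the trade-off concrete, here is a minimal Java sketch of the two variants. The interface names (`SimplePartitioner`, `MultiChannelSelector`) are simplified stand-ins for illustration, not Flink's actual classes; the "all but one channel" selector shows the kind of routing a single returned index cannot express.

```java
import java.util.Arrays;

// Stand-in sketches of the two interfaces under discussion; the names are
// simplified and are NOT Flink's real Partitioner/ChannelSelector types.
public class PartitionerSketch {
    /** Single-target variant: each record goes to exactly one channel. */
    interface SimplePartitioner<K> {
        int partition(K key, int numPartitions);
    }

    /** Multi-target variant, ChannelSelector-style: may pick several channels. */
    interface MultiChannelSelector<T> {
        int[] selectChannels(T record, int numChannels);
    }

    static final SimplePartitioner<Integer> MOD = (key, n) -> key % n;

    // "Send to every channel except channel 0" -- a pattern that a single
    // returned index cannot express.
    static final MultiChannelSelector<String> ALL_BUT_FIRST = (record, n) -> {
        int[] targets = new int[n - 1];
        for (int i = 1; i < n; i++) {
            targets[i - 1] = i;
        }
        return targets;
    };

    public static void main(String[] args) {
        System.out.println(MOD.partition(7, 4));
        System.out.println(Arrays.toString(ALL_BUT_FIRST.selectChannels("x", 4)));
    }
}
```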
I actually like this approach. We had the same discussion for the batch API and resolved to this, because:
By the way, in the Scala DataSet the user should specify the Java
In the batch API, equality of the partitioners is used to determine compatibility of the partitioning. This may at some point become interesting for the streaming API as well. In any case, let's pick one of the two variants (function or partitioner implementation). Overloading the methods too much with equally powerful variants inevitably confuses users.
I'd prefer the function implementation. Is it okay if I change it to the function implementation in both (Scala batch, Scala streaming) APIs? If not, then let's just stick with the partitioner implementation in the APIs.
I am confused now, what is it going to be?
I am not a big fan of (1), as these redundant options are confusing blow-ups of the APIs.
Sorry for not making myself clear. I would actually go for (4). I don't understand how changing from the partitioner implementation to the function implementation in the batch API would mess up determining the compatibility of the partitioning. By compatibility I mean that the type of the key must be the same as the input of the partitioner. I suppose there was another reason (that I do not understand) for choosing the partitioner implementation for the Scala batch API, so if (4) is not an option, I would go for (2) (only partitioner, in sync with the batch API).
The partitioner function in Scala was simply added as a mirror of the Java API. The batch API is stable, which means at most we can add a Scala function and deprecate the partitioner.
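The deprecation path described above essentially comes down to wrapping a plain function into a partitioner object, so only one code path has to exist internally. A hedged sketch, again using a stand-in `SimplePartitioner` interface rather than Flink's real API:

```java
import java.util.function.BiFunction;

// Sketch of the "add a function variant, keep the partitioner internally"
// approach. SimplePartitioner is a stand-in, not Flink's real interface.
public class WrapperSketch {
    interface SimplePartitioner<K> {
        int partition(K key, int numPartitions);
    }

    /** Adapt a plain (key, numPartitions) -> index function to the interface. */
    static <K> SimplePartitioner<K> fromFunction(BiFunction<K, Integer, Integer> f) {
        return (key, n) -> f.apply(key, n);
    }

    public static void main(String[] args) {
        // Both user-facing variants end up as the same internal type.
        SimplePartitioner<String> byLength = fromFunction((s, n) -> s.length() % n);
        System.out.println(byLength.partition("flink", 4));
    }
}
```

With this adapter, the function overload can be offered (and the partitioner one deprecated) without touching how partitioners are compared internally.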
Okay, then I will
Is it okay?
How about we leave the batch API as it is for now and address that as a separate issue? There are quite a few subtleties in how the optimizer assesses equality of partitioning (based on partitioners) that would have to be changed (and should retain backwards compatibility).
Okay then. These are effects of the change that I did not know about. Let's stick to (2) and later on we might reconsider this.
-1 http://flink.apache.org/coding-guidelines.html:
Could you also add an IT case that ensures that the data is actually partitioned properly? The test you've added only ensures that the partitioning properties are set correctly.
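The point of such an IT case is to check where records actually land, not just that graph properties are set. A minimal sketch of that idea, with a hypothetical `partitionAll` helper standing in for running records through the stream (not Flink test-harness code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an end-to-end partitioning check: route records into buckets
// with the partitioner, then verify each bucket's contents. partitionAll
// is a hypothetical helper, not part of Flink's test utilities.
public class PartitionItSketch {
    interface SimplePartitioner<K> {
        int partition(K key, int numPartitions);
    }

    static List<List<Integer>> partitionAll(
            List<Integer> data, SimplePartitioner<Integer> p, int n) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int v : data) {
            buckets.get(p.partition(v, n)).add(v);
        }
        return buckets;
    }

    public static void main(String[] args) {
        SimplePartitioner<Integer> mod = (k, n) -> k % n;
        List<List<Integer>> buckets =
                partitionAll(List.of(0, 1, 2, 3, 4, 5, 6, 7), mod, 4);

        // Every element in bucket i must actually satisfy the partitioner,
        // which is what an IT case should verify, beyond graph properties.
        boolean ok = true;
        for (int i = 0; i < 4; i++) {
            for (int v : buckets.get(i)) {
                ok &= (v % 4 == i);
            }
        }
        System.out.println(ok ? "partitioning verified" : "FAILED");
    }
}
```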
Sorry.
Looks good to merge. If no objections, I will merge it tomorrow. 👍
Custom partitioning added to DataStream in order to be more consistent with the batch API.