
[FLINK-2138] Added custom partitioning to DataStream #872

Closed
wants to merge 3 commits

Conversation

gaborhermann
Contributor

Custom partitioning has been added to DataStream to make it more consistent with the batch API.
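
A minimal usage sketch of what this adds (hedged: names mirror the batch API's partitionCustom, which this PR ports to DataStream; the exact streaming signature is debated below):

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala._

object PartitionCustomSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words = env.fromElements("flink", "stream", "batch")

    // User-defined routing: every key maps to exactly one channel index.
    val byFirstChar = new Partitioner[Char] {
      override def partition(key: Char, numPartitions: Int): Int =
        key.toInt % numPartitions
    }

    // Route each record by its first character.
    words.partitionCustom(byFirstChar, _.charAt(0)).print()

    env.execute("partition-custom-sketch")
  }
}
```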

@gyfora
Contributor

gyfora commented Jun 26, 2015

Wouldn't it make sense to implement custom partitioning in a way that allows returning an array of indexes, like in the ChannelSelector interface? Returning only one index limits the partitioning considerably.

Maybe users could implement a ChannelSelector and we would wrap that.
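
To make the two shapes being compared concrete, a hedged sketch (ModPartitioner is illustrative; the MultiSelector trait is hypothetical and only mirrors the ChannelSelector signature):

```scala
import org.apache.flink.api.common.functions.Partitioner

// Single-index contract proposed in this PR: one target channel per record.
class ModPartitioner extends Partitioner[Int] {
  override def partition(key: Int, numPartitions: Int): Int =
    key % numPartitions
}

// Hypothetical trait mirroring the runtime ChannelSelector's shape:
// a record may be routed to several channels at once.
trait MultiSelector[T] {
  def selectChannels(record: T, numChannels: Int): Array[Int]
}
```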

@gaborhermann
Contributor Author

I guess it is easier for users to understand, and partitioning to multiple channels at a time is rarely needed. Is there a use case where it is needed?

In my opinion it should be consistent with the batch API. Let's start a discussion about this if we would like to change custom partitioning.

@gyfora
Contributor

gyfora commented Jun 27, 2015

I think I could find several use cases if I wanted to :) For example, I would often like to broadcast some model information to many downstream operators at once (not exactly broadcast, maybe only to a couple of them).

Also, even this does not give full flexibility. Imagine a scenario where I have a self-loop and I want to send something to all the others (except myself); to do this I would also need to know my own channel index...

@StephanEwen
Contributor

I actually like this approach. We had the same discussion for the batch API and settled on it, because:

  • You can always chain a FlatMapFunction with a partitionCustom() request to solve all of the situations above (see the sketch after this list).
  • This interface allows an easy Java 8 lambda implementation and works well with the type extraction.
  • It seems to cover the majority of cases more elegantly, as there is no need for array wrapping in the user code.
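
A hedged sketch of the first bullet's trick (all names are illustrative; assumes the Scala partitionCustom variant discussed in this thread): replicate each record in a flatMap, then route every copy by an explicit target index.

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala._

object MulticastSketch {
  // Carry the intended channel on the record itself.
  case class Tagged(target: Int, payload: String)

  val byTarget = new Partitioner[Int] {
    override def partition(key: Int, numPartitions: Int): Int =
      key % numPartitions
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream = env.fromElements("model-update-1", "model-update-2")
    val targets = Seq(0, 1, 2) // the "couple of them" from the use case above

    stream
      .flatMap(r => targets.map(t => Tagged(t, r))) // one copy per target
      .partitionCustom(byTarget, _.target)          // route each copy
      .print()

    env.execute("multicast-sketch")
  }
}
```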

@gaborhermann
Contributor Author

By the way, in the Scala DataSet API the user has to specify the Java Partitioner[K] class. Wouldn't it be more convenient to wrap a function like (K, Int) => Int into a Partitioner[K], similarly to the KeySelector?
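
What such a wrapper might look like, as a sketch (asPartitioner is a hypothetical helper, not existing API):

```scala
import org.apache.flink.api.common.functions.Partitioner

object PartitionerOps {
  // Hypothetical helper: lift a plain (K, Int) => Int function
  // into the Java Partitioner[K] interface.
  def asPartitioner[K](f: (K, Int) => Int): Partitioner[K] =
    new Partitioner[K] {
      override def partition(key: K, numPartitions: Int): Int =
        f(key, numPartitions)
    }
}

// Usage, e.g.: dataSet.partitionCustom(PartitionerOps.asPartitioner((k: Int, n: Int) => k % n), 0)
```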

@StephanEwen
Contributor

In the batch API, equality of the partitioners is used to determine compatibility of the partitioning. This may at some point become interesting for the streaming API as well.

In any case, let's pick one of the two variants (function or partitioner implementation). Overloading the methods too much with equally powerful variants inevitably confuses users.

@gaborhermann
Contributor Author

I'd prefer the function implementation (like (K, Int) => Int), but it should stay consistent with the batch API. I don't see why the wrapping would affect the compatibility checking of the partitioning.

Is it okay if I change it to the function implementation in both the Scala batch and Scala streaming APIs? If not, let's just stick with the partitioner implementation in both.

@StephanEwen
Contributor

I am confused now, what is it going to be?

  1. Overloading, so that there is both a Scala function variant and a Partitioner variant, at the cost of redundant APIs.
  2. Only partitioner (sync with batch API)
  3. Only Scala function (break with batch API)

I am not a big fan of (1), as such redundant options are a confusing blow-up of the APIs.

@gaborhermann
Contributor Author

Sorry for not making myself clear.

I would actually go for
4. Only the Scala function (both in the streaming and batch API)

I don't understand how changing from the partitioner implementation to the function implementation in the batch API would mess up determining the compatibility of the partitioning. By compatibility I mean that the key type must be the same as the input type of the partitioner.

I suppose there was another reason (that I do not understand) for choosing the partitioner implementation for the Scala batch API, so if (4) is not an option, I would go for (2) (only partitioner, sync with batch API).

@StephanEwen
Contributor

The partitioner function in Scala was simply added as a mirror of the Java API.

The batch API is stable; that means at most we can add a Scala function and deprecate the partitioner.

@gaborhermann
Contributor Author

Okay, then I will

  • deprecate the partitioner implementation in the batch API
  • add the function implementation to the batch API
  • add the function implementation to the streaming API and remove the partitioner implementation (so streaming will only have the function implementation). As this PR is not merged yet, we do not break the streaming API.

Is it okay?
I guess it's worth it. This way Scala users will be able to write more concise code, and they will not get confused by overloaded methods, because the ones taking a partitioner will be deprecated.
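
As a sketch, the API surface this plan implies (hypothetical signatures, not merged code; note the plan is deferred in the next comment):

```scala
import org.apache.flink.api.common.functions.Partitioner

// Hypothetical Scala API surface for the plan above.
trait DataSetLike[T] {
  @deprecated("use the function variant", "FLINK-2138 follow-up")
  def partitionCustom[K](partitioner: Partitioner[K], field: Int): DataSetLike[T]

  // New, more concise variant: a plain Scala function.
  def partitionCustom[K](fun: (K, Int) => Int, field: Int): DataSetLike[T]
}
```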

@StephanEwen
Contributor

How about we leave the batch API as it is for now and address that as a separate issue? There are quite a few subtleties in how the optimizer assesses equality of partitioning (based on partitioners) that would have to be changed (and should retain backwards compatibility).

@gaborhermann
Contributor Author

Okay then. Those are effects of the change that I did not know about. Let's stick with (2); we might reconsider this later on.

@rmetzger
Contributor

rmetzger commented Jul 1, 2015

-1
Documentation is missing.

http://flink.apache.org/coding-guidelines.html:

Documentation Updates. Many changes in the system will also affect the documentation (both JavaDocs and the user documentation in the docs/ directory.). Pull requests and patches are required to update the documentation accordingly, otherwise the change can not be accepted to the source code.

Could you also add an IT case that ensures that the data is actually partitioned properly? The test you've added only ensures that the partitioning properties are set correctly on the DataStream.
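
For concreteness, a rough sketch of what such an IT case could check (assumed names and structure, not the test that was eventually added): after partitionCustom, each record should arrive at the subtask index the partitioner computed for its key.

```scala
import org.apache.flink.api.common.functions.{Partitioner, RichMapFunction}
import org.apache.flink.streaming.api.scala._

object PartitioningITSketch {
  class ModPartitioner extends Partitioner[Int] {
    override def partition(key: Int, numPartitions: Int): Int =
      key % numPartitions
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)

    env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
      .partitionCustom(new ModPartitioner, (x: Int) => x)
      .map(new RichMapFunction[Int, (Int, Int)] {
        override def map(x: Int): (Int, Int) =
          // pair each element with the subtask it actually arrived at
          (x, getRuntimeContext.getIndexOfThisSubtask)
      })
      // a real IT case would collect these pairs and assert
      // pair._1 % 4 == pair._2 for every record
      .print()

    env.execute("partitioning-it-sketch")
  }
}
```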

@gaborhermann
Contributor Author

Sorry.

  • updated the docs (custom partitioning was also missing from the Scala batch API docs)
  • added IT cases (also for the other stream partitioning methods, as those were missing too)

@gyfora
Contributor

gyfora commented Jul 10, 2015

Looks good to merge. If there are no objections, I will merge it tomorrow. 👍

@asfgit asfgit closed this in 3f3aeb7 Jul 13, 2015
mxm pushed a commit to mxm/flink that referenced this pull request Jul 14, 2015
shghatge pushed a commit to shghatge/flink that referenced this pull request Aug 8, 2015
nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015
nltran pushed a commit to nltran/flink that referenced this pull request Jan 8, 2016