KIP-221 / Add KStream#repartition operation #7170
…mber of partitions based on InternalTopicProperties
…, KeyValueMapper)
…erations method with InteralTopicConfig and StreamPartitioner parameters
…cNamesWithProperties; Moved InternalTopicProperties class to dedicated file
… after repartition operation is performed
@lkokhreidze Thanks for the PR. There are some checkstyle errors. Can you please fix them before we review your PR? |
@mjsax done |
Thanks for the PR.
Made an initial pass. I still need to wrap my head around the optimization layer and how we merge repartition nodes. IMHO, we need to add more tests to RepartitiontTopicNameTest and/or StreamsGraphTest to verify that the new repartition() operator works as intended.
Also, it seems you forgot to update groupBy() and groupByKey().
Finally, thinking about the KIP once more: as we extend groupBy to configure the internal repartition topic, I am wondering if we should extend the KIP and also allow this for join(), which may also create repartition topics? \cc @guozhangwang @bbejeck @vvcephei @ableegoldman @cadonna @abbccdda
@mjsax this is the first PR (written in PR description) |
Agreed, I'll do that. I wanted to tag @bbejeck as seems like he's the main author behind optimization logic. I'll add tests for optimization logic to make sure nothing breaks. |
Ah. I guess, I skipped the PR description... Sorry for that. I discussed the proposal with @vvcephei in person, and thinking about the semantics once more, I am actually wondering if it is wise to change We have basically two dimensions which 2 cases each to consider for
Case (1), (2), and (4) are straight forward. However, case (2) is somewhat awkward because we actually want to treat Therefore, only case (4) is left in which passing in Therefore, I don't see a good use case for which it make sense to pass in Would be great if you could share your thoughts about it? A second point I discussed with @vvcephei is about the optimization. We both have the impression that |
Not sure I agree @mjsax -- maybe you just want to control the parallelism in case a repartition is required? You could enforce users to step through their whole topology, figure out when/where repartitioning is needed, and use |
I don't see this as a use case in practice. Why would one want to change the parallelism? Because the aggregation operation is over- or under-provisioned and thus one wants to decrease or increase the parallelism. If I am ok with the "default" parallelism in case there is no repartitioning, why would I not be ok with it if data is repartitioned?
This is less an issue IMHO, because if I want to scale up for example, it's sufficient to insert |
|
My question is, why do you need Similar to your second comment, if you want to "scale up" again later, you call |
Ok, well I am fine with this framing it as a "set parallelism" operation...I don't want to stall this KIP/PR further, but what if this was split into a new set of |
Hello @mjsax @ableegoldman @vvcephei
While, for me, as a user, 2nd option looks much more appealing, similarly how key selector for Again, your arguments are totally valid, and all can be achieved just by having |
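The parallelism argument in this thread rests on how records map to partitions: all records with the same key land in one partition, and the partition count bounds how many tasks can process a topic in parallel. The following is a minimal model of that idea, not Kafka code. It assumes a simplified hash-modulo partitioner (Kafka's default partitioner hashes the serialized key bytes instead), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {
    // Simplified stand-in (assumption) for a default partitioner:
    // hash of the key modulo the partition count.
    static int partitionFor(final String key, final int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups keys by the partition they would be written to.
    static Map<Integer, List<String>> assign(final List<String> keys, final int numPartitions) {
        final Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (final String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(final String[] args) {
        final List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f");
        // Writing the same keys to a topic with more partitions spreads them
        // over more groups, which is what allows more parallel tasks downstream.
        System.out.println("2 partitions: " + assign(keys, 2));
        System.out.println("4 partitions: " + assign(keys, 4));
    }
}
```

In this model, "scaling up" means writing the same keys to a topic with a higher partition count; whether that should be expressed through groupBy or through a dedicated operator is the API question debated above.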
Thanks @vvcephei for the update and no worries :) |
Haha, well. I did start the review, and made a fair amount of progress before getting sidetracked by a global catastrophe...
It's still in my "actively working on this" bucket, and I'll commit to not starting new work until I finish my review. For now, I'll go ahead and ask this one question, which came up early in my review. I skimmed over the KIP and discussion thread, but didn't see a specific discussion of the overload in question.
test this please |
Hey, @lkokhreidze , I finally finished my review, and it looks good to me. I'm not sure if @mjsax wants to make another pass.
Sorry for the delay in reviewing!! And thanks to @vvcephei for helping to push this through.
Overall LGTM. Thanks for writing extensive tests!!!
}

@Test
public void shouldCreateOnlyOneRepartitionTopicWhenRepartitionIsFollowedByGroupByKey() throws ExecutionException, InterruptedException {
Similar to above: we should be able to test this via unit tests using Topology#describe()
Thought about that, but somehow it felt "safer" with integration tests. Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
I had a similar thought, that it looks like good fodder for unit testing, but I did like the safety blanket of verifying the actual partition counts. I guess I'm fine either way, with a preference for whatever is already in the PR ;)
Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
I guess that is fair. (I just try to keep test runtime short if we can -- let's keep the integration test.)
}

@Test
public void shouldGenerateRepartitionTopicWhenNameIsNotSpecified() throws ExecutionException, InterruptedException {
Seems to be unit-testable via Topology#describe()?
Thought about that, but somehow it felt "safer" with integration tests. Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
}

@Test
public void shouldGoThroughRebalancingCorrectly() throws ExecutionException, InterruptedException {
Not sure what this test is about, i.e., how does it relate to the repartition() feature?
It's related to this comment #7170 (comment)
Thanks for clarifying!
}

@Test
public void shouldInvokePartitionerWhenSet() {
Not sure what this test actually verifies?
This was the "easiest" way I could figure out to verify that custom partitioner is invoked when it's set
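One way to make the intent of such a check explicit is to wrap the partitioner and count its invocations. The sketch below is a standalone illustration, not the test from this PR: the Partitioner interface is a hypothetical stand-in that only mirrors the shape of StreamPartitioner#partition, and the two direct calls stand in for the code under test producing records.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionerInvocationSketch {
    // Hypothetical stand-in with the same shape as StreamPartitioner#partition.
    interface Partitioner<K, V> {
        Integer partition(String topic, K key, V value, int numPartitions);
    }

    // Wraps a partitioner and counts how often it is called.
    static <K, V> Partitioner<K, V> counting(final Partitioner<K, V> delegate,
                                             final AtomicInteger calls) {
        return (topic, key, value, numPartitions) -> {
            calls.incrementAndGet();
            return delegate.partition(topic, key, value, numPartitions);
        };
    }

    public static void main(final String[] args) {
        final AtomicInteger calls = new AtomicInteger();
        final Partitioner<String, String> byHash =
            (topic, key, value, n) -> Math.floorMod(key.hashCode(), n);
        final Partitioner<String, String> partitioner = counting(byHash, calls);

        // Stand-in for the code under test writing two records.
        partitioner.partition("topic", "k1", "v1", 4);
        partitioner.partition("topic", "k2", "v2", 4);

        System.out.println("partitioner invoked " + calls.get() + " times");
    }
}
```

A test built this way asserts on the counter after the records have been processed, which states directly that the custom partitioner was used.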
Co-Authored-By: John Roesler <vvcephei@users.noreply.github.com>
Hi @mjsax, I've addressed your comments, would appreciate another review. |
Small update: f2bcdfe In this commit I've added the topology optimization option as a test parameter. This PR touches topology optimization (indirectly). In order to make sure that everything works as expected, I thought it would be beneficial in the integration tests verifying both, Regards, |
Wow, that's great. Thanks, @lkokhreidze ! |
Arrays.asList(StreamsConfig.OPTIMIZE, StreamsConfig.NO_OPTIMIZATION)
    .forEach(x -> values.add(new Object[]{x}));

return values;
Seems unnecessarily complex? A simple
return Arrays.asList(new String[][] {
{StreamsConfig.OPTIMIZE},
{StreamsConfig.NO_OPTIMIZATION}
});
would do, too :)
(Feel free to ignore the comment.)
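To compare the two ways of building the parameter list, here is a minimal standalone sketch. It assumes the values "all" and "none" as stand-ins for StreamsConfig.OPTIMIZE and StreamsConfig.NO_OPTIMIZATION so that it runs without Kafka on the classpath, and it uses Object[][] rather than String[][] to match a Collection<Object[]> return type. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class OptimizationParameters {
    // Assumed stand-ins for StreamsConfig.OPTIMIZE and StreamsConfig.NO_OPTIMIZATION.
    static final String OPTIMIZE = "all";
    static final String NO_OPTIMIZATION = "none";

    // Shape used in the PR: build the list imperatively.
    static Collection<Object[]> verbose() {
        final List<Object[]> values = new ArrayList<>();
        Arrays.asList(OPTIMIZE, NO_OPTIMIZATION)
            .forEach(x -> values.add(new Object[]{x}));
        return values;
    }

    // Shape suggested in review: a single nested array literal.
    static Collection<Object[]> compact() {
        return Arrays.asList(new Object[][] {
            {OPTIMIZE},
            {NO_OPTIMIZATION}
        });
    }

    public static void main(final String[] args) {
        // Both variants hold one single-element row per optimization setting.
        System.out.println(Arrays.deepToString(verbose().toArray()));
        System.out.println(Arrays.deepToString(compact().toArray()));
    }
}
```

Both methods yield the same rows, so the choice between them is purely one of readability.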
return values;
}

public KStreamRepartitionIntegrationTest(final String topologyOptimization) {
A simple
@Parameter
public String topologyOptimization;
would be sufficient instead of adding a constructor, and those lines could go into before().
(As above, feel free to ignore this comment.)
Merged to |
Yes, thank you @lkokhreidze for seeing this through! |
KIP-221: Enhance DSL with Connecting Topic Creation and Repartition Hint
Tickets: KAFKA-6037 KAFKA-8611
Description
This is the PR for KIP-221. Its goal is to introduce the new KStream#repartition operator and the underlying machinery that can be used for repartition configuration on a KStream instance.
Notable Changes
- org.apache.kafka.streams.kstream.internals.graph.UnoptimizableRepartitionNode. This node is NOT subject to the optimization algorithm; therefore, each repartition operation is excluded from the optimization algorithm.
- org.apache.kafka.streams.processor.internals.InternalTopicProperties class that can be used for capturing repartition topic configurations passed via DSL operations.
- org.apache.kafka.streams.processor.internals.InternalTopologyBuilder#internalTopicNamesWithProperties map for storing the mapping between internal topics and their corresponding configuration. If a configuration is present, RepartitionTopicConfig is enriched with the configurations passed via DSL operations (in this case via the org.apache.kafka.streams.kstream.Repartitioned class).
- KStreamRepartitionIntegrationTest for testing different scenarios of KStream#repartition.
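The idea of keeping a map from internal topic names to DSL-supplied properties and consulting it when the topic configuration is built can be sketched as follows. This is a simplified model and not the actual Kafka implementation: every class, field, and method name below is hypothetical, and only the partition count is modeled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class InternalTopicPropertiesSketch {
    // Hypothetical, simplified holder for settings passed through the DSL.
    static final class TopicProperties {
        private final Integer numberOfPartitions; // null means "not specified"

        TopicProperties(final Integer numberOfPartitions) {
            this.numberOfPartitions = numberOfPartitions;
        }

        Optional<Integer> numberOfPartitions() {
            return Optional.ofNullable(numberOfPartitions);
        }
    }

    // Maps internal topic names to the properties captured for them.
    private final Map<String, TopicProperties> topicNamesWithProperties = new HashMap<>();

    void addInternalTopic(final String name, final TopicProperties properties) {
        topicNamesWithProperties.put(name, properties);
    }

    // Uses the explicit partition count when one was supplied, otherwise the fallback.
    int resolvePartitions(final String name, final int fallback) {
        return Optional.ofNullable(topicNamesWithProperties.get(name))
            .flatMap(TopicProperties::numberOfPartitions)
            .orElse(fallback);
    }

    public static void main(final String[] args) {
        final InternalTopicPropertiesSketch builder = new InternalTopicPropertiesSketch();
        builder.addInternalTopic("explicit-repartition", new TopicProperties(4));
        builder.addInternalTopic("default-repartition", new TopicProperties(null));

        System.out.println(builder.resolvePartitions("explicit-repartition", 1));
        System.out.println(builder.resolvePartitions("default-repartition", 1));
    }
}
```

The point of the design is that an absent property leaves the existing default behavior untouched, while a present one overrides it for that topic only.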