
MINOR: Adjust logic of conditions to set number of partitions in step zero of assignment. #7419

Conversation

bbejeck (Contributor) commented Sep 30, 2019

A minor change in logic to account for repartition topics whose number of partitions might not yet be available in the metadata.

Ran all existing tests plus all streams system tests

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Before this change:

    // It is possible the sourceTopic is another internal topic, i.e.,
    // map().join().join(map())
    if (repartitionTopicMetadata.containsKey(sourceTopicName)
        && repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().isPresent()) {
        numPartitionsCandidate = repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().get();
bbejeck (Contributor Author) commented:

Here, by checking whether the source topic is a repartition topic and whether its number of partitions is present, we acknowledge that the partition count might not be available yet.

However, when that is the case we drop down to the else block, which results in throwing an exception. IMHO that is not the intent of the logic, since we would then throw an exception whenever any source topic (internal or otherwise) did not have a partition count available.

After this change:

    if (repartitionTopicMetadata.containsKey(sourceTopicName)) {
        if (repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().isPresent()) {
            numPartitionsCandidate = repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().get();
        }
bbejeck (Contributor Author) commented:

Here's the change: if the source topic is a repartition topic, we drop into a new block to check whether the repartition topic has a partition count available. If it doesn't, we don't throw an exception; we only throw when a non-internal topic reports no partition count available.
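
To make the before/after behavior concrete, here is a minimal, self-contained sketch of the branch in question (not the actual assignor code; the maps, the helper method, and the exception type are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A minimal, self-contained model of the branch discussed above. The Optional-valued
// map is a stand-in for repartitionTopicMetadata, and the exception type and helper
// method are assumptions for illustration only.
public class PartitionCountResolutionSketch {

    static Integer resolveCandidate(final Map<String, Optional<Integer>> repartitionTopicMetadata,
                                    final Map<String, Integer> externalTopicPartitions,
                                    final String sourceTopicName) {
        if (repartitionTopicMetadata.containsKey(sourceTopicName)) {
            // The source is itself a repartition topic: its count may not be known yet
            // on this pass, so return "unknown" instead of throwing, and let a later
            // pass of the enclosing while-loop fill it in.
            return repartitionTopicMetadata.get(sourceTopicName).orElse(null);
        } else {
            // External source topics must already have a known partition count.
            final Integer count = externalTopicPartitions.get(sourceTopicName);
            if (count == null) {
                throw new IllegalStateException("No partition count for topic " + sourceTopicName);
            }
            return count;
        }
    }

    public static void main(final String[] args) {
        final Map<String, Optional<Integer>> repartition = new HashMap<>();
        repartition.put("app-foo-repartition", Optional.empty()); // count not resolved yet
        final Map<String, Integer> external = new HashMap<>();
        external.put("input-topic", 6);

        // The old combined containsKey(...) && isPresent() check sent the first case
        // to the else branch and threw; the nested check simply yields "unknown".
        System.out.println(resolveCandidate(repartition, external, "app-foo-repartition")); // null
        System.out.println(resolveCandidate(repartition, external, "input-topic"));         // 6
    }
}
```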

guozhangwang (Contributor) commented:

@bbejeck @vvcephei I checked the code: right now we use a LinkedHashMap for the node-groups / topic-groups construction, so insertion order is preserved. I think that means, assuming the topology is a DAG with no cycles, one pass starting from sub-topology 1 is arguably sufficient. However, one case that we do not handle today, which is also why we are still doing a while-loop here, is e.g. (numbers are sub-topology indices):

1 -> 2,
1 -> 3,
3 -> 2

If we loop in the order 1, 2, 3, then when we process 2, since 3 has not been set yet, we do not have the num.partitions for the repartition topic between 3 -> 2.

Looking at InternalTopologyBuilder#makeNodeGroups, I think it is possible to ensure it is ordered as

1 -> 3,
1 -> 2,
3 -> 2

so that we can make one pass without the while loop, and can also assume that the parent sub-topologies' sink/repartition topic num.partitions are already set when processing a given sub-topology. WDYT?
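
To make the one-pass idea concrete, here is a self-contained sketch that propagates partition counts in topological order over the 1 -> 2, 1 -> 3, 3 -> 2 example (the seed count and the max-of-parents rule are illustrative assumptions, not the actual Streams implementation):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the one-pass idea: visit sub-topologies in topological order so that
// every parent's partition count is already known when a child is processed.
// The graph (1 -> 2, 1 -> 3, 3 -> 2), the seed count, and the "max of parents"
// rule are illustrative assumptions, not the actual Streams code.
public class OnePassPartitionPropagation {

    public static void main(final String[] args) {
        // edges: parent sub-topology -> child sub-topology (via a repartition topic)
        final Map<Integer, List<Integer>> children = new HashMap<>();
        children.put(1, Arrays.asList(2, 3));
        children.put(3, Collections.singletonList(2));
        children.put(2, Collections.emptyList());

        // Sub-topology 1 reads an external topic with a known partition count.
        final Map<Integer, Integer> numPartitions = new HashMap<>();
        numPartitions.put(1, 4);

        // Kahn's algorithm: compute in-degrees, then repeatedly emit zero-in-degree nodes.
        final Map<Integer, Integer> inDegree = new HashMap<>();
        children.keySet().forEach(n -> inDegree.putIfAbsent(n, 0));
        children.values().forEach(cs -> cs.forEach(c -> inDegree.merge(c, 1, Integer::sum)));

        final Deque<Integer> ready = new ArrayDeque<>();
        inDegree.forEach((node, degree) -> { if (degree == 0) ready.add(node); });

        while (!ready.isEmpty()) {
            final int parent = ready.poll();
            for (final int child : children.getOrDefault(parent, Collections.emptyList())) {
                // The parent's count is guaranteed to be set by now; take the max across parents.
                numPartitions.merge(child, numPartitions.get(parent), Math::max);
                if (inDegree.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }

        System.out.println(numPartitions); // {1=4, 2=4, 3=4} in a single pass, no retry loop
    }
}
```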

guozhangwang (Contributor) commented:

Basically, when ordering the non-source node groups we would not rely on Utils.sorted(nodeFactories.keySet()) but on some specific logic such that non-source sub-topologies whose parents are all source sub-topologies get indexed first.

vvcephei (Contributor) commented:

Thanks for the idea, @guozhangwang. If I understand right, it sounds like you're suggesting to propagate the partition count from (external) sources all the way through the topology, in topological order. If the partition count is purely determined by the external source topics, then it should indeed work to do this in topological order in one pass.

What I'm wondering now is whether there's any situation where some of the repartition topics might already exist with a specific number of partitions. An easy strawman is, "what if the operator has pre-created some of the internal topics?", which may or may not be allowed. Another is "what if the topology has changed slightly to add a new repartition topic early in the topology?" Maybe there are some other similar scenarios. I'm not sure if any of these are real possibilities, or if they'd affect the outcome, or if we want to disallow them anyway to make our lives easier.

WDYT?

guozhangwang (Contributor) commented:

Those are good points. Making a one-pass num.partitions decision is not critical in our framework; this is more or less brainstorming with you guys to see if it is possible :) To me, as long as we cannot get stuck infinitely in the while loop, it should be fine.

If users pre-create a topic with the exact xx-repartition name, then yes, I think that could make things trickier. Also, with the KIP-221 repartition hint, I'm not sure how that would affect this either.

bbejeck (Contributor Author) commented Sep 30, 2019

ping @guozhangwang and @vvcephei for review

vvcephei (Contributor) left a comment

Ah, good call, @bbejeck. Thanks for the fix!

guozhangwang (Contributor) left a comment

The change LGTM. I left some minor comments and also a meta one for improvement in a future PR.

    if (repartitionTopicMetadata.containsKey(sourceTopicName)) {
        if (repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().isPresent()) {
            numPartitionsCandidate = repartitionTopicMetadata.get(sourceTopicName).numberOfPartitions().get();
        }
guozhangwang (Contributor) commented:

This dates from before this PR, but while reviewing it I realized that line 898 in prepareTopic:

    topic.setNumberOfPartitions(numPartitions.get());

is not necessary, since numPartitions is read from the topic itself.
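
In other words, the write-back assigns a value the object already holds. A simplified stand-in that only mirrors the method names in the snippet above (not the real InternalTopicConfig):

```java
import java.util.Optional;

// Simplified stand-in showing why writing back a value that was just read from the
// same object is a no-op. Field and method names mirror the snippet above but the
// class itself is an assumption for illustration.
public class TopicConfigSketch {
    private Optional<Integer> numberOfPartitions = Optional.of(3);

    Optional<Integer> numberOfPartitions() {
        return numberOfPartitions;
    }

    void setNumberOfPartitions(final int numPartitions) {
        this.numberOfPartitions = Optional.of(numPartitions);
    }

    public static void main(final String[] args) {
        final TopicConfigSketch topic = new TopicConfigSketch();
        final Optional<Integer> numPartitions = topic.numberOfPartitions();
        // Redundant: numPartitions was read from `topic`, so this assigns back
        // the value the object already holds.
        topic.setNumberOfPartitions(numPartitions.get());
        System.out.println(topic.numberOfPartitions()); // Optional[3], unchanged
    }
}
```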



guozhangwang merged commit d53eab1 into apache:trunk Sep 30, 2019
guozhangwang pushed a commit that referenced this pull request Jan 7, 2020
…ics counts (#7904)

This PR fixes the regression introduced in 2.4 from 2 refactoring PRs:
#7249
#7419

The bug was introduced by a logical path that allows numPartitionsCandidate to be 0, which is then assigned to numPartitions and later checked by setNumPartitions; that subsequent check throws an illegal-argument exception if numPartitions is 0.

This bug impacts both new 2.4 applications and upgrades to 2.4 for certain types of topology. The example in the original JIRA was imported as a new integration test to guard against such regressions. We also verified, by running this integration test, that without the bug fix the application still fails.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
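
For context on the failure path described in this commit message, a minimal sketch (class and method names are simplified stand-ins, not the actual Streams classes):

```java
import java.util.Optional;

// Illustrates the regression path: a candidate partition count of 0 slips through,
// gets assigned, and is then rejected by a validity check. Class and method names
// are simplified stand-ins, not the actual Streams code.
public class ZeroPartitionRegressionSketch {

    private Optional<Integer> numberOfPartitions = Optional.empty();

    void setNumberOfPartitions(final int numPartitions) {
        if (numPartitions < 1) {
            // This is the check that turns the bad candidate into a hard failure.
            throw new IllegalArgumentException("Number of partitions must be at least 1");
        }
        this.numberOfPartitions = Optional.of(numPartitions);
    }

    public static void main(final String[] args) {
        final int numPartitionsCandidate = 0; // produced by the buggy logical path
        final ZeroPartitionRegressionSketch topic = new ZeroPartitionRegressionSketch();
        topic.setNumberOfPartitions(numPartitionsCandidate); // throws IllegalArgumentException
    }
}
```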
guozhangwang pushed a commit to confluentinc/kafka that referenced this pull request Jan 7, 2020 (same fix, apache#7904).
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020 (same fix, apache#7904).