
Set internal replication factors to match default and min.insync.replicas #140

Merged · 9 commits merged into master on Feb 3, 2018

Conversation

@solsson (Contributor) commented Feb 1, 2018

Based on @shrinandj's find in #139.

As explained in #116 (comment) we'd like to keep min.insync.replicas=2.

What's odd is that the default, according to the docs, is 3. Also, I guess this change won't affect a running Kafka cluster.

This property isn't mentioned in https://kafka.apache.org/documentation/#prodconfig.
There was a change in 0.11: "The offsets.topic.replication.factor broker config is now enforced upon auto topic creation."

This PR quite possibly needs:

  • A repro case.
  • A #95-style job (Add Jobs and tests for common maintenance operations) to fix existing clusters.
  • A similar fix to transaction.state.log.replication.factor (see the property sketch after this list).
  • ~~A similar fix to config.storage.replication.factor.~~
    • Defaulted to 3 already, used for Kafka Connect only?
  • ~~A similar fix to status.storage.replication.factor.~~
    • Defaulted to 3 already, used for Kafka Connect only?
  • A similar fix to replication.factor.
    • Seems to be a Kafka Streams client property only.
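
For reference, here is roughly how the replication-related broker settings could end up grouped in the broker config. This is a sketch of the intent discussed in this PR, not the exact committed file:

# Keep the replication-related settings together so none of them is missed (sketch only).
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3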

@@ -57,6 +57,7 @@ data:
num.partitions=1

default.replication.factor=3
offsets.topic.replication.factor=3


This property is set to 1 below at line 117.

@solsson (Contributor, Author) replied:

Ah, that explains why we didn't get the documented default.

@solsson (Contributor, Author) commented Feb 1, 2018

@shrinandj Given that the documented default is 3, and that this error hasn't been reported before, could it be that in your case a consumer was running while Kafka was starting for the first time, so the topic got created with 1 replica because only one broker was up?

I guess kubectl apply -R -f kafka/ would cause this behavior, because it sets up the tests immediately.

Maybe then the config change in e78f1c5 has no effect? Or will Kafka refuse to create the topic if there is an explicit value 3? I must find time to test this.

May be relevant to #116.

Update: I didn't see your review comment above; it explains why we get 1. I've pushed 321189a. Maybe the gotcha can be alleviated by grouping all of these properties.

@solsson (Contributor, Author) commented Feb 1, 2018

I tried to add a repro case in https://github.com/Yolean/kubernetes-kafka/compare/fix-offset-topic-replication...test-consumergroup?expand=1 but I think it'll be too complex for end-to-end testing this way, while it should be rather trivial to verify that all replicas are considered members of the group.
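
For example, a quick manual check (a sketch; zookeeper.kafka is the service name used later in this thread) is to describe the offsets topic and compare ReplicationFactor and Isr:

./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 --describe --topic __consumer_offsets
# Expect ReplicationFactor:3 and an Isr listing three brokers for each partition;
# a topic auto-created while only one broker was up shows ReplicationFactor:1 instead.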

@solsson (Contributor, Author) commented Feb 1, 2018

#108 is why I had three replicas for __consumer_offsets in the QA cluster I tested on now. It was created with 1 replica there too. The error message there looks a bit different from https://stackoverflow.com/questions/48536347/kafka-consumer-get-marking-the-coordinator-dead-error-when-using-group-ids, but the cause could be the same.

@solsson (Contributor, Author) commented Feb 1, 2018

I also noticed just now that our Kafka Streams meta topics have 1 replica. I wonder which of the replication.factor properties applies there?

@shrinandj left a comment

Looks good.

which match the default.replication.factor and min.insync.replicas that we've changed to now.
@solsson (Contributor, Author) commented Feb 2, 2018

Here's what we had in our new cluster:

$ kubectl -n kafka logs kafka-0 -c broker | head -n 200 | grep replication.factor
	default.replication.factor = 3
	offsets.topic.replication.factor = 1
	transaction.state.log.replication.factor = 1

The properties config.storage.replication.factor and status.storage.replication.factor are not listed there. Maybe they're only for Kafka Connect, as both the official docs and Confluent mention them in that context, though they are listed as broker configs as well.

Anyway, their defaults are 3, so I conclude we don't need to add them to the properties file.

I think I've also found the reason why our Kafka Streams topics have 1 replica: it's the replication.factor property. It isn't listed as a broker config, so it can probably only be set on clients.
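
If that's right, a Streams client has to override it explicitly. A sketch of the client-side config (the application.id and bootstrap.servers values are placeholders; replication.factor is the Kafka Streams property in question):

# Kafka Streams client configuration (not a broker property) -- sketch only.
application.id=my-streams-app
bootstrap.servers=bootstrap.kafka:9092
# Replication factor for the internal topics Streams creates (changelogs, repartition topics).
# It defaults to 1, which would match the single-replica Streams topics observed above.
replication.factor=3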

@solsson (Contributor, Author) commented Feb 2, 2018

I had been quite focused on UnderReplicatedPartitions in #107 + #128, ignoring the fact that if a broker default, a client property, a client property default, or a topic creation job sets replicas to 1, neither Kafka nor the metrics will alert you that losing the wrong broker will disrupt operations.

I can't find a metric for the configured topic replication factor, either from #125 or from #128. The JMX MBean kafka.controller:type=KafkaController,name=OfflinePartitionsCount sounds interesting, but I fail to find such values in our Prometheus. Maybe they're omitted by the config in #49 / #128.
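
OfflinePartitionsCount doesn't report the configured replication factor, but once exported it at least catches the moment under-replication turns into unavailability. A sketch of the Prometheus query one would alert on (metric name taken from the commit message further down in this thread):

# Should normally return no series; any partition without an active leader shows up here.
kafka_controller_kafkacontroller_value{name="OfflinePartitionsCount"} > 0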

@solsson (Contributor, Author) commented Feb 2, 2018

It could be argued that applications must check for this as part of QA, but for Kafka Streams that is non-trivial: you typically have lower replication on test clusters, and in production you must provision the client with a setting that overrides the default. Hence I added a readiness "test" in e784bca that could spot the problem before application downtime monitoring does.
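
The gist of such a check, as a rough sketch (not the actual script in e784bca; zookeeper.kafka and the expected factor of 3 are taken from this thread):

#!/bin/bash
# Fail readiness if any topic reports a replication factor other than 3.
if ./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 --describe \
    | grep -o 'ReplicationFactor:[0-9]*' \
    | grep -qvx 'ReplicationFactor:3'; then
  echo "Found topic(s) with replication factor != 3" >&2
  exit 1
fi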

solsson added this to the v3.1 milestone Feb 2, 2018

@solsson (Contributor, Author) commented Feb 2, 2018

I wanted to try to fix existing topics using Kafka Manager, but it turns out it can add partitions but not increase replication factor: yahoo/CMAK#224.

I'll use our job instead, as in #108 (comment). This won't affect this PR, as #95 was designed for manual topic name changes.

solsson added a commit that referenced this pull request Feb 2, 2018
solsson added a commit that referenced this pull request Feb 3, 2018
“…may be impacting the producer clients, losing messages or causing back-pressure in the application. This is most often a ‘site down’ type of problem and will need to be addressed immediately.”

Excerpt from: Neha Narkhede, Gwen Shapira, and Todd Palino, “Kafka: The Definitive Guide”.

We now export kafka_controller_kafkacontroller_value{name="OfflinePartitionsCount",} and friends.
See #140 for why.
solsson merged commit 85ef561 into master Feb 3, 2018
solsson added a commit that referenced this pull request Feb 3, 2018
reverting #140 to avoid "does not meet the required replication factor '3' for the offsets topic"
solsson added a commit that referenced this pull request Feb 3, 2018
@cemo commented Feb 19, 2018

I am new to Kafka. I upgraded my cluster from v3.0 to v3.1 and hit this issue.

I had run the jobs under the maintenance folder, but:

{"version":1,"partitions":[{"topic":"__consumer_offsets","partition":19,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":30,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":47,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":29,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":41,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":39,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":17,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":10,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":14,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":40,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":18,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":0,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":26,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":24,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":33,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":20,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":3,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":21,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":5,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":22,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":12,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":8,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":23,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":15,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":11,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":48,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":13,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":49,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":6,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":28,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":4,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":37,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":31,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":44,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":42,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":34,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":46,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":25,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":45,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":27,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":32,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":43,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":36,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":35,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":7,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":38,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":9,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":1,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":16,"replicas":[0,1,2]},{"topic":"__consumer_offsets","partition":2,"replicas":[0,1,2]}]}
# Triggering kafka-reassign-partitions.sh ./bin/kafka-reassign-partitions.sh --zookeeper=zookeeper.kafka:2181 --execute --reassignment-json-file=/tmp/proposed-reassignment.json
# Reassignment exited. Upon success you may want to run preferred-replica-election.

I had also run preferred-replica-election, but I'm still having issues:

[2018-02-19 19:29:46,110] ERROR [ReplicaManager broker=1] Error processing append operation on partition __consumer_offsets-38 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-38 is [1], below required minimum [2]
[2018-02-19 19:29:46,110] INFO [GroupCoordinator 1]: Preparing to rebalance group test_search_v2 with old generation 9727 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2018-02-19 19:29:46,211] INFO [GroupCoordinator 1]: Stabilized group test_search_v2 generation 9728 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)
[2018-02-19 19:29:46,212] INFO [GroupCoordinator 1]: Assignment received from leader for group test_search_v2 for generation 9728 (kafka.coordinator.group.GroupCoordinator)
[2018-02-19 19:29:46,212] ERROR [ReplicaManager broker=1] Error processing append operation on partition __consumer_offsets-38 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-38 is [1], below required minimum [2]
[2018-02-19 19:29:46,212] INFO [GroupCoordinator 1]: Preparing to rebalance group test_search_v2 with old generation 9728 (__consumer_offsets-38) (kafka.coordinator.group.GroupCoordinator)

In Kafka Manager I observed that all my cluster topics are replicated to 3, but __consumer_offsets is not.

I could not find any clue, and I had to delete the cluster.

@solsson (Contributor, Author) commented Feb 20, 2018

@cemo See https://github.com/Yolean/kubernetes-kafka/tree/master/maintenance#increase-a-topics-replication-factor

Kafka maintenance terminology isn't intuitive, IMO. Let us know if you find better tooling. There are many ways to use kafka-reassign-partitions.sh, owing to the interesting fact that you basically need to craft the reassignment JSON yourself. Look at the seds and the command's docs if you're interested.
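
For the record, the reassignment JSON has the same shape as the one pasted above. A minimal hand-crafted example for a single partition (broker ids 0, 1, 2 as in that paste; the file path is just an example):

{"version":1,"partitions":[{"topic":"__consumer_offsets","partition":0,"replicas":[0,1,2]}]}

./bin/kafka-reassign-partitions.sh --zookeeper=zookeeper.kafka:2181 --execute --reassignment-json-file=/tmp/increase-replication.json
# Re-run with --verify instead of --execute to check whether the reassignment has completed.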

solsson added a commit that referenced this pull request Apr 11, 2018
reverting #140 to avoid "does not meet the required replication factor '3' for the offsets topic"
solsson added a commit that referenced this pull request May 19, 2018
reverting #140 to avoid "does not meet the required replication factor '3' for the offsets topic"
solsson added a commit that referenced this pull request Jul 21, 2018
reverting #140 to avoid "does not meet the required replication factor '3' for the offsets topic"
pavel-agarkov added a commit to Midnight-Lizard/kubernetes-kafka that referenced this pull request Jul 22, 2018
* Scales to 2 brokers + 3 zookeeper instances

* Same as default.replication.factor/min.insync.replicas

* Minimizes the cluster for use with for example Minikube

* Configures internal topics for single broker,

reverting Yolean#140 to avoid "does not meet the required replication factor '3' for the offsets topic"

* Ksql rc (#1)

* Burrow's master now handles api v3

* Container fails to start, I see no logs

* This log config looks better, but makes no difference regarding start
solsson added a commit that referenced this pull request Aug 14, 2018
reverting #140 to avoid "does not meet the required replication factor '3' for the offsets topic"