
Failures to resolve addresses are logged continuously instead of once, like connection failures #186

Closed
thelamb opened this issue Jan 27, 2015 · 16 comments


@thelamb

thelamb commented Jan 27, 2015

The text below describes the original question: librdkafka continuously tries to reconnect to brokers it cannot connect to. The question included a note that address resolution errors are logged continuously.
The continuous reconnecting turned out to be by design and is no longer part of the 'issue', but I left the original text for reference.


It appears to me that when librdkafka fails to connect to (or fails to resolve) a broker, it keeps retrying indefinitely. Am I overlooking a configuration parameter, or is this expected behavior?

I'm in a scenario where my users can send Kafka messages to a broker (through a Lua wrapper around librdkafka). The user can configure the broker's address. If the broker is down, or the user entered an invalid broker address, I'd rather not have constant retries in the background.

Internally I work around this now by removing inactive Producers after a certain timeout.
This works alright for connection failures (e.g. 'Connection refused'), but it produces a lot of spam when the broker's address fails to resolve: in that case the library prints the error message to the terminal continuously.

To reproduce:

 $ ./rdkafka_example -P -t test -b localhost:9092
% Type stuff and hit enter to send
1422363572.561 RDKAFKA-3-FAIL: rdkafka#producer-0: localhost:9092/bootstrap: Failed to connect to broker at localhost:9092: Connection refused
1422363572.561 RDKAFKA-3-ERROR: rdkafka#producer-0: localhost:9092/bootstrap: Failed to connect to broker at localhost:9092: Connection refused
1422363572.561 RDKAFKA-3-ERROR: rdkafka#producer-0: 1/1 brokers are down

This 'Connection refused' is only printed once because of the duplicate-error-check in rd_kafka_broker_connect.

$ ./rdkafka_example -P -t test -b foo:9092
% Type stuff and hit enter to send
1422363764.196 RDKAFKA-3-ERROR: rdkafka#producer-0: Failed to resolve 'foo:9092': Name or service not known
1422363764.196 RDKAFKA-3-GETADDR: rdkafka#producer-0: foo:9092/bootstrap: Failed to resolve 'foo:9092': Name or service not known
1422363765.197 RDKAFKA-3-ERROR: rdkafka#producer-0: Failed to resolve 'foo:9092': Name or service not known
1422363765.197 RDKAFKA-3-GETADDR: rdkafka#producer-0: foo:9092/bootstrap: Failed to resolve 'foo:9092': Name or service not known

The resolve errors are printed continuously.

So my questions are:

  1. Is there a reason for the indefinite retries?
    1.1. If not, would you be interested in a pull request for this if I find the time to do it?
  2. I suppose I could work around the log spam by implementing the log_cb (see the sketch at the end of this comment), but I feel it should be handled the same way as a connection failure.

I must admit that I'm in no way an expert when it comes to Kafka, so there may very well be a good reason why librdkafka must continuously try to reconnect.

Thanks for your time/efforts
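
For reference, here is a minimal sketch of the log_cb workaround I mean in question 2, assuming the current 0.8.x C API (the duplicate check is purely illustrative and not how librdkafka itself suppresses errors):

#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

/* Route librdkafka logs through the application and drop consecutive
 * duplicates. Note: log_cb may be called from librdkafka's internal
 * threads, so the static buffer is for illustration only (not thread-safe). */
static void my_log_cb (const rd_kafka_t *rk, int level,
                       const char *fac, const char *buf) {
        static char last[512];
        if (!strncmp(buf, last, sizeof(last)-1))
                return;                          /* suppress repeated identical line */
        strncpy(last, buf, sizeof(last)-1);
        last[sizeof(last)-1] = '\0';
        fprintf(stderr, "KAFKA-%i-%s: %s: %s\n",
                level, fac, rk ? rd_kafka_name(rk) : "", buf);
}

int main (void) {
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t *rk;
        char errstr[512];
        int i;

        rd_kafka_conf_set_log_cb(conf, my_log_cb);  /* install before rd_kafka_new() */

        rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_brokers_add(rk, "foo:9092");       /* unresolvable on purpose */

        for (i = 0; i < 30; i++)
                rd_kafka_poll(rk, 1000);            /* run ~30s so the resolve retries would log */

        rd_kafka_destroy(rk);
        return 0;
}

If I read the API right, rd_kafka_set_log_level() could also be used to simply lower the verbosity instead of filtering in the callback.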

@edenhill
Contributor

The core philosophy of librdkafka is to hide all the gory Kafka communication details from the application. Since the app probably has a primary purpose other than Kafka (the transport), it should not have to include logic to handle various non-permanent errors of the Kafka cluster.

One broker being down, or all of them, might be a permanent error, but it may also be a temporary failure (incomplete routing, firewall rules, cluster not actually started yet, etc.). Instead of bailing out, librdkafka will quietly sit on whatever messages the application has tried to produce until the cluster comes back to life (at least a leader broker for the desired topic+partition), or until the messages eventually time out due to message.timeout.ms.

Having said that, it is sometimes useful for an application to find out that none of the brokers are up. For this reason the RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN error exists, which an application can listen for by registering an error_cb and calling rd_kafka_poll() every now and then.
But the application developer should think twice before relying on this signal, since it may very well be a temporary error. It might be better to limit message.timeout.ms instead.
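
A minimal sketch of that pattern with the C API (an error_cb set on the configuration, the error served through rd_kafka_poll()):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

static int all_down = 0;

/* error_cb is served from rd_kafka_poll() in the application's thread. */
static void my_error_cb (rd_kafka_t *rk, int err,
                         const char *reason, void *opaque) {
        (void)rk; (void)opaque;                  /* unused here */
        fprintf(stderr, "%% ERROR: %s: %s\n", rd_kafka_err2str(err), reason);
        if (err == RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN)
                all_down = 1;                    /* may be a temporary condition */
}

int main (void) {
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t *rk;
        char errstr[512];

        rd_kafka_conf_set_error_cb(conf, my_error_cb);

        rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_brokers_add(rk, "localhost:9092");

        while (!all_down)
                rd_kafka_poll(rk, 1000);         /* serves error_cb (and dr_cb) callbacks */

        fprintf(stderr, "%% All brokers are down (possibly temporarily)\n");
        rd_kafka_destroy(rk);
        return 0;
}

(message.timeout.ms is a topic-level property, set with rd_kafka_topic_conf_set() on the topic configuration.)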

@edenhill
Contributor

The repeating resolve errors should be silenced indeed.

@edenhill
Contributor

Lua bindings sound great! Is this something you will open source?

@thelamb
Author

thelamb commented Jan 28, 2015

  1. Thanks for your detailed reply, edenhill. It makes sense to me to keep trying to reconnect since the failure can be temporary. I'll think about whether to handle RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN (I overlooked this error, my apologies), but for now I think it makes sense not to.
  2. I'll rename this issue and leave it open until this is fixed. I'll probably find some time this week for the pull request.
  3. I cannot open source it as-is since I developed it as part of my work.
    I enjoyed writing the bindings though, so I might decide to write another version in my own time for the fun of it. Our current version is also not complete, since we only need support for producing messages.

@thelamb changed the title from "Configurable max reconnect tries" to "Failures to resolve addresses are logged continuously instead of once, like connection failures" on Jan 28, 2015
@edenhill
Contributor

That should fix it.
Please verify it on your side as well.
Thanks.

@mmanoj

mmanoj commented Jun 8, 2016

@edenhill

I got the error below once I killed one of the Kafka processes (I have 3 brokers, with replication factor 3):

1465420107.985 RDKAFKA-3-ERROR: rdkafka#producer-5109: localhost:9094/bootstrap: Connect to ipv4#127.0.0.1:9094 failed: Connection refused

Original Kafka State:

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic PCRF
Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 0,2,1

After killing one server process:

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic PCRF
Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1

@edenhill
Contributor

edenhill commented Jun 8, 2016

@mmanoj do you get the message repeatedly? If so, can you provide multiple consecutive logs?

@mmanoj

mmanoj commented Jun 11, 2016

I got this while testing broker failover. The topic was created with the Kafka script as follows:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Please specify which logs are required and I will arrange them. I don't get what you mean by "multiple consecutive logs".

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill
I'm using producer mode for the above scenario, and I found the issue below while searching for a possible solution. Hope it's helpful for the investigation:

SOHU-Co/kafka-node#175

@mmanoj

mmanoj commented Jun 11, 2016

[2016-06-11 15:57:29,998] WARN [ReplicaFetcherThread-0-2], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7b6ea4bf (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to infini:9094 (id: 2 rack: null) failed
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingReady$extension$2.apply(NetworkClientBlockingOps.scala:63)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingReady$extension$2.apply(NetworkClientBlockingOps.scala:59)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollUntil$extension(NetworkClientBlockingOps.scala:120)
at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:59)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:239)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:229)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:107)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:98)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-06-11 15:57:30,185] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions PCRF,0
[2016-06-11 15:57:30,197] INFO [ReplicaFetcherThread-0-2], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:57:30,198] INFO [ReplicaFetcherThread-0-2], Stopped (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:57:30,198] INFO [ReplicaFetcherThread-0-2], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:58:50,942] INFO Partition [PCRF,0] on broker 0: Expanding ISR for partition [PCRF,0] from 0 to 0,2 (kafka.cluster.Partition)

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill

I got this over multiple rounds of testing. Please advise.

@edenhill
Contributor

Those broker logs indicate connectivity problems between your brokers, so not something related to librdkafka.

@edenhill
Contributor

I'm not sure what your initial problem is: you mentioned a single log line, but this issue covers repeated, unsuppressed, identical log lines.

@mmanoj

mmanoj commented Jun 11, 2016

My issue is the same as the one described in SOHU-Co/kafka-node#175.

I use librdkafka with a list of broker addresses, shut the brokers down manually in round-robin fashion, and check that messages are still published to Kafka. While testing this scenario I saw the publisher become unable to produce: it got stuck instead of failing over to the other brokers.

char *brokers = "localhost:9092,localhost:9093,localhost:9094";
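
This is roughly how the list is passed in my modified rdkafka_example.c (a simplified sketch, not the exact code):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main (void) {
        const char *brokers = "localhost:9092,localhost:9093,localhost:9094";
        char errstr[512];
        rd_kafka_t *rk;

        /* NULL conf means the default configuration is used. */
        rk = rd_kafka_new(RD_KAFKA_PRODUCER, NULL, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }

        /* All three brokers are listed up front; librdkafka bootstraps from
         * any of them and learns the current leaders from cluster metadata. */
        if (rd_kafka_brokers_add(rk, brokers) == 0) {
                fprintf(stderr, "%% No valid brokers specified\n");
                return 1;
        }

        /* ... create the topic handle, rd_kafka_produce(), rd_kafka_poll() ... */
        rd_kafka_destroy(rk);
        return 0;
}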

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill

I observed this while killing the leader broker, and my message-publishing process, which uses librdkafka, hangs. Any idea or possible solution for this scenario? I'm using rdkafka_example.c with small modifications.

@mmanoj

mmanoj commented Jun 11, 2016

Before leader fail:

Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 0,1,2

After leader fail:

Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1
