
Failures to resolve addresses are logged continuously instead of once, like connection failures #186

Closed
thelamb opened this issue Jan 27, 2015 · 16 comments


@thelamb

thelamb commented Jan 27, 2015

The text below describes the original question: librdkafka continuously tries to reconnect to brokers it cannot connect to. The question included a note that address resolution errors are logged continuously.
The continuous reconnecting turned out to be by design and is no longer part of the 'issue', but I left the original text for reference.


It appears to me that when librdkafka fails to connect to (or fails to resolve) a broker, it keeps retrying indefinitely. Am I overlooking a configuration parameter, or is this expected behavior?

I'm in a scenario where my users can send Kafka messages to a broker (through a Lua wrapper around librdkafka). The user can configure the broker's address. If the broker is down, or the user entered an invalid broker address, I'd rather not have constant retries in the background.

Internally I work around this now by removing inactive Producers after a certain timeout.
This works alright for connection failures (e.g. 'Connection refused'), but it produces a lot of spam when the broker's address fails to resolve: in that case the library prints the error message to the terminal continuously.

To reproduce:

 $ ./rdkafka_example -P -t test -b localhost:9092
% Type stuff and hit enter to send
1422363572.561 RDKAFKA-3-FAIL: rdkafka#producer-0: localhost:9092/bootstrap: Failed to connect to broker at localhost:9092: Connection refused
1422363572.561 RDKAFKA-3-ERROR: rdkafka#producer-0: localhost:9092/bootstrap: Failed to connect to broker at localhost:9092: Connection refused
1422363572.561 RDKAFKA-3-ERROR: rdkafka#producer-0: 1/1 brokers are down

This 'Connection refused' is only printed once because of the duplicate-error-check in rd_kafka_broker_connect.

$ ./rdkafka_example -P -t test -b foo:9092
% Type stuff and hit enter to send
1422363764.196 RDKAFKA-3-ERROR: rdkafka#producer-0: Failed to resolve 'foo:9092': Name or service not known
1422363764.196 RDKAFKA-3-GETADDR: rdkafka#producer-0: foo:9092/bootstrap: Failed to resolve 'foo:9092': Name or service not known
1422363765.197 RDKAFKA-3-ERROR: rdkafka#producer-0: Failed to resolve 'foo:9092': Name or service not known
1422363765.197 RDKAFKA-3-GETADDR: rdkafka#producer-0: foo:9092/bootstrap: Failed to resolve 'foo:9092': Name or service not known

The resolve errors are printed continuously.

So my questions are:

  1. Is there a reason for the indefinite retries?
    1.1. If not, would you be interested in a pull request for this if I find the time to do it?
  2. I suppose I could work around the log spam by implementing the log_cb (see the sketch at the end of this comment), but I feel it should be handled the same way as a connection failure.

I must admit that I'm in no way an expert when it comes to Kafka, so there may very well be a good reason why librdkafka must continuously try to reconnect.

Thanks for your time/efforts
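
For reference, here is a minimal sketch of the log_cb workaround I mean in question 2, assuming the current 0.8.x C API (the duplicate check is purely illustrative and not how librdkafka itself suppresses errors):

#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

/* Route librdkafka logs through the application and drop consecutive
 * duplicates. Note: log_cb may be called from librdkafka's internal
 * threads, so the static buffer is for illustration only (not thread-safe). */
static void my_log_cb (const rd_kafka_t *rk, int level,
                       const char *fac, const char *buf) {
        static char last[512];
        if (!strncmp(buf, last, sizeof(last)-1))
                return;                          /* suppress repeated identical line */
        strncpy(last, buf, sizeof(last)-1);
        last[sizeof(last)-1] = '\0';
        fprintf(stderr, "KAFKA-%i-%s: %s: %s\n",
                level, fac, rk ? rd_kafka_name(rk) : "", buf);
}

int main (void) {
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t *rk;
        char errstr[512];
        int i;

        rd_kafka_conf_set_log_cb(conf, my_log_cb);  /* install before rd_kafka_new() */

        rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_brokers_add(rk, "foo:9092");       /* unresolvable on purpose */

        for (i = 0; i < 30; i++)
                rd_kafka_poll(rk, 1000);            /* run ~30s so the resolve retries would log */

        rd_kafka_destroy(rk);
        return 0;
}

If I read the API right, rd_kafka_set_log_level() could also be used to simply lower the verbosity instead of filtering in the callback.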

@edenhill
Contributor

The core philosophy of librdkafka is to hide all the gory Kafka communication details from the application. Since the app probably has a primary purpose other than Kafka (the transport), it should not have to include logic to handle various non-permanent errors of the Kafka cluster.

One broker being down, or all of them, might be a permanent error, but it may also be a temporary failure (incomplete routing, firewall rules, cluster not actually started yet, etc.). Instead of bailing out, librdkafka will quietly sit on whatever messages the application has tried to produce until the cluster comes back to life (at least a leader broker for the desired topic+partition), or until the messages eventually time out due to message.timeout.ms.

Having said that, it is sometimes useful for an application to find out that none of the brokers are up. For this reason the RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN error exists, which an application can listen for by registering an error_cb and calling rd_kafka_poll() every now and then.
But the application developer should think twice before relying on this signal, since it may very well be a temporary error. It might be better to limit message.timeout.ms instead.
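
A minimal sketch of that pattern with the C API (an error_cb set on the configuration, the error served through rd_kafka_poll()):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

static int all_down = 0;

/* error_cb is served from rd_kafka_poll() in the application's thread. */
static void my_error_cb (rd_kafka_t *rk, int err,
                         const char *reason, void *opaque) {
        (void)rk; (void)opaque;                  /* unused here */
        fprintf(stderr, "%% ERROR: %s: %s\n", rd_kafka_err2str(err), reason);
        if (err == RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN)
                all_down = 1;                    /* may be a temporary condition */
}

int main (void) {
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_t *rk;
        char errstr[512];

        rd_kafka_conf_set_error_cb(conf, my_error_cb);

        rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_brokers_add(rk, "localhost:9092");

        while (!all_down)
                rd_kafka_poll(rk, 1000);         /* serves error_cb (and dr_cb) callbacks */

        fprintf(stderr, "%% All brokers are down (possibly temporarily)\n");
        rd_kafka_destroy(rk);
        return 0;
}

(message.timeout.ms is a topic-level property, set with rd_kafka_topic_conf_set() on the topic configuration.)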

@edenhill
Contributor

The repeating resolve errors should be silenced indeed.

@edenhill
Contributor

Lua bindings sound great! Is this something you will open source?

@thelamb
Author

thelamb commented Jan 28, 2015

  1. Thanks for your detailed reply, edenhill. It makes sense to me to keep trying to reconnect since the failure can be temporary. I'll think about whether to handle RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN (I overlooked this error, my apologies), but for now I think it makes sense not to.
  2. I'll rename this issue and leave it open until this is fixed. I'll probably find some time this week for the pull request.
  3. I cannot open source it as-is since I developed it as part of my work.
    I enjoyed writing the bindings though, so I might decide to write another version in my own time for the fun of it. Our current version is also not complete, since we only need support for producing messages.

@thelamb changed the title from "Configurable max reconnect tries" to "Failures to resolve addresses are logged continuously instead of once, like connection failures" on Jan 28, 2015
@edenhill
Contributor

That should fix it.
Please verify it on your side as well.
Thanks.

@mmanoj

mmanoj commented Jun 8, 2016

@edenhill

I got the error below once I killed one of the Kafka processes (I have 3 brokers, with replication factor 3):

1465420107.985 RDKAFKA-3-ERROR: rdkafka#producer-5109: localhost:9094/bootstrap: Connect to ipv4#127.0.0.1:9094 failed: Connection refused

Original Kafka State:

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic PCRF
Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 0,2,1

After killing one server process:

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic PCRF
Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1

@edenhill
Contributor

edenhill commented Jun 8, 2016

@mmanoj do you get the message repeatedly? If so, can you provide multiple consecutive logs?

@mmanoj

mmanoj commented Jun 11, 2016

I got this while testing broker failover. The topic was created with the Kafka script as follows:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Please specify which logs are required and I will arrange them. I don't get what you mean by "multiple consecutive logs".

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill
I'm using producer mode for the above scenario, and I found the issue below while searching for a possible solution. Hope it's helpful for the investigation:

SOHU-Co/kafka-node#175

@mmanoj

mmanoj commented Jun 11, 2016

[2016-06-11 15:57:29,998] WARN [ReplicaFetcherThread-0-2], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7b6ea4bf (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to infini:9094 (id: 2 rack: null) failed
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingReady$extension$2.apply(NetworkClientBlockingOps.scala:63)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingReady$extension$2.apply(NetworkClientBlockingOps.scala:59)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollUntil$extension(NetworkClientBlockingOps.scala:120)
at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:59)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:239)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:229)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:107)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:98)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-06-11 15:57:30,185] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions PCRF,0
[2016-06-11 15:57:30,197] INFO [ReplicaFetcherThread-0-2], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:57:30,198] INFO [ReplicaFetcherThread-0-2], Stopped (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:57:30,198] INFO [ReplicaFetcherThread-0-2], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-06-11 15:58:50,942] INFO Partition [PCRF,0] on broker 0: Expanding ISR for partition [PCRF,0] from 0 to 0,2 (kafka.cluster.Partition)

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill

I got this over multiple rounds of testing. Please advise.

@edenhill
Contributor

Those broker logs indicate connectivity problems between your brokers, so not something related to librdkafka.

@edenhill
Contributor

I'm not sure what your initial problem is: you mentioned a single log line, but this issue covers repeated, unsuppressed, identical log lines.

@mmanoj

mmanoj commented Jun 11, 2016

My issue is the same as the one described in SOHU-Co/kafka-node#175.

I use librdkafka with a list of broker addresses, shut the brokers down manually in round-robin fashion, and check that messages are still published to Kafka. While testing this scenario I saw the publisher become unable to produce: it got stuck instead of failing over to the other brokers.

char *brokers = "localhost:9092,localhost:9093,localhost:9094";
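
This is roughly how the list is passed in my modified rdkafka_example.c (a simplified sketch, not the exact code):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main (void) {
        const char *brokers = "localhost:9092,localhost:9093,localhost:9094";
        char errstr[512];
        rd_kafka_t *rk;

        /* NULL conf means the default configuration is used. */
        rk = rd_kafka_new(RD_KAFKA_PRODUCER, NULL, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }

        /* All three brokers are listed up front; librdkafka bootstraps from
         * any of them and learns the current leaders from cluster metadata. */
        if (rd_kafka_brokers_add(rk, brokers) == 0) {
                fprintf(stderr, "%% No valid brokers specified\n");
                return 1;
        }

        /* ... create the topic handle, rd_kafka_produce(), rd_kafka_poll() ... */
        rd_kafka_destroy(rk);
        return 0;
}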

@mmanoj

mmanoj commented Jun 11, 2016

@edenhill

I observed this while killing the leader broker, and my message-publishing process, which uses librdkafka, hangs. Any idea or possible solution for this scenario? I'm using rdkafka_example.c with small modifications.

@mmanoj

mmanoj commented Jun 11, 2016

Before leader fail:

Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 0,1,2

After leader fail:

Topic:PCRF PartitionCount:1 ReplicationFactor:3 Configs:
Topic: PCRF Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1
