Spurious SASL error messages #1218

Closed
6 of 9 tasks
mimaison opened this issue May 18, 2017 · 14 comments

@mimaison
Contributor

mimaison commented May 18, 2017

Description

When running librdkafka (0.9.5) with node-rdkafka (0.10.2) using SASL PLAIN against a Kafka cluster running 0.10.2.1 (also seen on 0.10.0.1), we often get error messages emitted back to the client. These happen intermittently and come from both the Consumer and the Producer.
As far as we can tell these errors are harmless; the clients keep functioning as expected whenever they appear.

[thrd:sasl_ssl://kafka01-XXX]: sasl_ssl://kafka01-XXX:9093/0: Failed to initialize SASL authentication: SASL Handshake not supported by broker (required by mechanism PLAIN)
[thrd:sasl_ssl://kafka06-XXX]: sasl_ssl://kafka06-XXX:9093/5: Receive failed: No error
Error: Local: Authentication failure\n    at Error (native)
Error: Local: All broker connections are down\n    at Error (native)

How to reproduce

Not sure yet. We've not been able to identify what triggers them. Our clients can run without a single error for days and then emit several errors every second for a while. We've also seen cases where a rebalance made them stop.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag):
  • Apache Kafka version:
  • librdkafka client configuration:
    All default + SASL/SSL
        'metadata.broker.list': opts.brokers,
        'security.protocol': 'sasl_ssl',
        'ssl.ca.location': opts.calocation,
        'sasl.mechanisms': 'PLAIN',
        'sasl.username': opts.username,
        'sasl.password': opts.password,
        'api.version.request': true
  • Operating system:
Linux and, much less frequently, macOS
  • Using the legacy Consumer
    No
  • Using the high-level KafkaConsumer
    Yes
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

What I think is happening here is that the SaslHandshakeRequest times out, causing the client to revert to the broker.version.fallback protocol features, which default to v0.9 and lack support for SaslHandshake (PLAIN, SCRAM, etc.).

Why is it timing out? Because there was a bug that capped this timeout at... wait for it... 10 ms!
This has been fixed on master: 164f065

The workaround at this point is to set socket.timeout.ms to a value <= 10000 (10 s).
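In node-rdkafka terms that is a single pass-through librdkafka property, so a minimal sketch of the workaround (keeping the rest of the SASL/SSL options from the checklist above unchanged) would be:

const opts = {
    // ...existing SASL/SSL settings...
    "socket.timeout.ms": 10000  // suggested workaround value (<= 10000)
};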

@mimaison
Contributor Author

Thanks @edenhill!
We'll try that and let you know in the next few days.

@seansabour

Hi @edenhill,
I've tried the suggested workaround and I'm still seeing the errors below:

[Wed May 24 2017 09:37:00 GMT-0700 (PDT)] ERROR Error from kafka consumer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}

[Wed May 24 2017 09:37:01 GMT-0700 (PDT)] ERROR Error from kafka producer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}

[Wed May 24 2017 09:37:01 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<kafka_broker>.]: sasl_ssl://<kafka_broker>:9093/8: Receive failed: No error' }

Here are the options that we are currently using:

const kafkaConsumerOpts = {
    //'debug': 'all',
    "metadata.broker.list": kafkaOpts.brokers,
    "security.protocol": "sasl_ssl",
    "ssl.ca.location": kafkaOpts.calocation,
    "sasl.mechanisms": "PLAIN",
    "sasl.username": kafkaOpts.username,
    "sasl.password": kafkaOpts.password,
    "client.id": "kafka-consumer-gobot",
    "group.id": "kafka-gobot",
    "api.version.request": true,
    "socket.timeout.ms": 10000
};

const kafkaProducerOpts = {
    //'debug': 'all',
    "metadata.broker.list": kafkaOpts.brokers,
    "security.protocol": "sasl_ssl",
    "ssl.ca.location": kafkaOpts.calocation,
    "sasl.mechanisms": "PLAIN",
    "sasl.username": kafkaOpts.username,
    "sasl.password": kafkaOpts.password,
    "client.id": "kafka-producer-gobot",
    "dr_msg_cb": true, // Enable delivery reports with message payload
    "api.version.request": true,
    "socket.timeout.ms": 10000
};

@edenhill
Contributor

@seansabour
What librdkafka version are you on? (And which high-level language binding? Curious.)

Can you reproduce this with debug set to broker,protocol,security and provide the logs?
Thanks
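For node-rdkafka users, debug is another pass-through librdkafka property (the configs above already show it commented out), so a sketch of enabling the requested contexts would be:

const opts = {
    // ...existing client options...
    "debug": "broker,protocol,security"  // librdkafka debug contexts
};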

@seansabour

@edenhill We are using the node-rdkafka high-level language binding. I believe this library currently uses librdkafka version 0.9.4 (see https://github.com/Blizzard/node-rdkafka).

I was able to reproduce with debug set to broker,protocol,security:

[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'SEND',
  message: '[thrd:sasl_ssl:/<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/5: Sent GroupCoordinatorRequest (v0, 47 bytes @ 0, CorrId 5)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl:/<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/0: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>]: sasl_ssl:/<Kafka_broker>/0: Receive failed: No error' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/0: Broker changed state UP -> DOWN' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] ERROR Error from kafka producer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: Receive failed: No error' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: Broker changed state UP -> DOWN' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'CONNECT',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/7: broker in state DOWN connecting' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'CONNECT',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl:/<Kafka_broker>/7: Connecting to ipv4#*.*.*.*:9093 (sasl_ssl) with socket 44' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/7: Broker changed state DOWN -> CONNECT' }


@edenhill
Contributor

Thanks. Could you provide more logs? I would also like to see what happens before broker 0 goes down.

@seansabour

Sure, I hope this helps a bit more:

kafka.txt

@edenhill
Contributor

I'm looking at broker kafka05 in the logs and everything seems to go fine: it connects and authenticates, but is then suddenly disconnected by the remote peer (broker or network):

[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: Receive failed: No error' }
[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: Broker changed state UP -> DOWN' }

Can you check the broker logs for that same time and see if there is an exception?

There's some odd log-line duplication in the log file; I'm not sure whether you are running multiple client instances writing to the same log or something else is going on.

@mimaison
Contributor Author

The only relevant messages I can find in the logs are:

[2017-05-25 02:17:40,978] WARN Attempting to send response via channel for which there is no open connection, connection id 39 (kafka.network.Processor)

There are quite a few of those.

The broker logic that emits it is:

// channel can be null if the selector closed the connection because it was idle for too long
if (channel == null) {
  warn(s"Attempting to send response via channel for which there is no open connection, connection id $id")
  response.request.updateRequestMetrics()
}
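(If I'm reading that comment right, the "selector closed the connection" path is the broker-side idle-connection reaper, governed by the broker's connections.max.idle.ms setting, 10 minutes by default.)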

@seansabour

@edenhill Any idea what could be causing this? It is causing a lot of noise on our end, and we would prefer not to disable logging in case a real issue occurs.

@edenhill
Contributor

Since these log messages indicate that the remote peer closed the connection, you will need to check the broker or the network (firewall, NAT, session timeouts, etc.) for hints.

@mimaison
Contributor Author

mimaison commented Aug 2, 2017

After further investigation, it looks like these messages could come from brokers that are not actively being used by the client. Since librdkafka connects to all the brokers in the cluster, the brokers that don't host any of the client's partition leaders will time out the idle connections.

I'll try to confirm that with the teams that are seeing this.

@edenhill
Contributor

edenhill commented Aug 4, 2017

@mimaison
Contributor Author

@seansabour Can you confirm #1218 (comment)? From node-rdkafka, you can find the leaders for your partitions by calling getMetadata(), as in the sketch below.
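A sketch of that check on a connected client (the topic name is a placeholder; the metadata shape follows node-rdkafka's getMetadata() callback):

consumer.getMetadata({ topic: 'my-topic', timeout: 10000 }, (err, metadata) => {
    if (err) {
        console.error('Metadata request failed:', err);
        return;
    }
    metadata.topics.forEach((topic) => {
        topic.partitions.forEach((p) => {
            // Brokers that never show up as a leader here only hold the
            // client's idle bootstrap connections.
            console.log(topic.name + '[' + p.id + '] leader: broker ' + p.leader);
        });
    });
});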

@edenhill edenhill closed this as completed Jan 4, 2018