Spurious SASL error messages #1218

Closed
6 of 9 tasks
mimaison opened this issue May 18, 2017 · 14 comments

@mimaison
Contributor

mimaison commented May 18, 2017

Description

When running librdkafka (0.9.5) with node-rdkafka (0.10.2) using SASL PLAIN against a Kafka cluster running 0.10.2.1 (also seen on 0.10.0.1), we often get error messages emitted back to the client. These happen intermittently and come from both the Consumer and the Producer.
As far as we can tell these errors are harmless; the clients keep functioning as expected whenever they appear.

[thrd:sasl_ssl://kafka01-XXX]: sasl_ssl://kafka01-XXX:9093/0: Failed to initialize SASL authentication: SASL Handshake not supported by broker (required by mechanism PLAIN)
[thrd:sasl_ssl://kafka06-XXX]: sasl_ssl://kafka06-XXX:9093/5: Receive failed: No error
Error: Local: Authentication failure\n    at Error (native)
Error: Local: All broker connections are down\n    at Error (native)

How to reproduce

Not sure yet. We've not been able to identify what triggers them. Our clients can run without a single error for days and then emit several errors every second for a while. We've also seen cases where a rebalance made them stop.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag):
  • Apache Kafka version:
  • librdkafka client configuration:
    All default + SASL/SSL
        'metadata.broker.list': opts.brokers,
        'security.protocol': 'sasl_ssl',
        'ssl.ca.location': opts.calocation,
        'sasl.mechanisms': 'PLAIN',
        'sasl.username': opts.username,
        'sasl.password': opts.password,
        'api.version.request': true
  • Operating system:
Linux and, much less frequently, macOS
  • Using the legacy Consumer
    No
  • Using the high-level KafkaConsumer
    Yes
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

What I think is happening here is that the SaslHandshakeRequest times out, causing the client to revert to the broker.version.fallback protocol features, which default to v0.9 and lack support for SaslHandshake (PLAIN, SCRAM, etc.).

Why is it timing out? Because there was a bug that capped this timeout at... wait for it... 10 ms!
This has been fixed on master: 164f065

The workaround at this point is to set socket.timeout.ms to a value <= 10000 (10 s).
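In node-rdkafka terms that is a single pass-through librdkafka property, so a minimal sketch of the workaround (keeping the rest of the SASL/SSL options from the checklist above unchanged) would be:

const opts = {
    // ...existing SASL/SSL settings...
    "socket.timeout.ms": 10000  // suggested workaround value (<= 10000)
};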

@mimaison
Contributor Author

Thanks @edenhill!
We'll try that and let you know in the next few days.

@seansabour

Hi @edenhill,
I've tried the suggested workaround and I'm still seeing the errors below:

[Wed May 24 2017 09:37:00 GMT-0700 (PDT)] ERROR Error from kafka consumer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}

[Wed May 24 2017 09:37:01 GMT-0700 (PDT)] ERROR Error from kafka producer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}

[Wed May 24 2017 09:37:01 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<kafka_broker>.]: sasl_ssl://<kafka_broker>:9093/8: Receive failed: No error' }

Here are the options that we are currently using:

const kafkaConsumerOpts = {
    //'debug': 'all',
    "metadata.broker.list": kafkaOpts.brokers,
    "security.protocol": "sasl_ssl",
    "ssl.ca.location": kafkaOpts.calocation,
    "sasl.mechanisms": "PLAIN",
    "sasl.username": kafkaOpts.username,
    "sasl.password": kafkaOpts.password,
    "client.id": "kafka-consumer-gobot",
    "group.id": "kafka-gobot",
    "api.version.request": true,
    "socket.timeout.ms": 10000
};

const kafkaProducerOpts = {
    //'debug': 'all',
    "metadata.broker.list": kafkaOpts.brokers,
    "security.protocol": "sasl_ssl",
    "ssl.ca.location": kafkaOpts.calocation,
    "sasl.mechanisms": "PLAIN",
    "sasl.username": kafkaOpts.username,
    "sasl.password": kafkaOpts.password,
    "client.id": "kafka-producer-gobot",
    "dr_msg_cb": true, // Enable delivery reports with message payload
    "api.version.request": true,
    "socket.timeout.ms": 10000
};

@edenhill
Contributor

@seansabour
What librdkafka version are you on? (And which high-level language binding? Curious.)

Can you reproduce this with debug set to broker,protocol,security and provide the logs?
Thanks
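For node-rdkafka users, debug is another pass-through librdkafka property (the configs above already show it commented out), so a sketch of enabling the requested contexts would be:

const opts = {
    // ...existing client options...
    "debug": "broker,protocol,security"  // librdkafka debug contexts
};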

@seansabour

@edenhill We are using the node-rdkafka high-level language binding. I believe this library currently uses librdkafka version 0.9.4 (see https://github.com/Blizzard/node-rdkafka).

I was able to reproduce with debug set to broker,protocol,security:

[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'SEND',
  message: '[thrd:sasl_ssl:/<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/5: Sent GroupCoordinatorRequest (v0, 47 bytes @ 0, CorrId 5)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl:/<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/0: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>]: sasl_ssl:/<Kafka_broker>/0: Receive failed: No error' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/0: Broker changed state UP -> DOWN' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] ERROR Error from kafka producer:{"message":"broker transport failure","stack":"Error: Local: Broker transport failure\n\n","origin":"local","code":-1,"errno":-1}
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: Receive failed: No error' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/4: Broker changed state UP -> DOWN' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'CONNECT',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/7: broker in state DOWN connecting' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'CONNECT',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl:/<Kafka_broker>/7: Connecting to ipv4#*.*.*.*:9093 (sasl_ssl) with socket 44' }
[Wed May 24 2017 11:12:37 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://<Kafka_broker>.]: sasl_ssl://<Kafka_broker>/7: Broker changed state DOWN -> CONNECT' }


@edenhill
Contributor

Thanks. Could you provide more logs? I would also like to see what happens before broker 0 goes down.

@seansabour

Sure, I hope this helps a bit more:

kafka.txt

@edenhill
Contributor

I'm looking at broker kafka05 in the logs and everything seems to go fine: it connects and authenticates, but is then suddenly disconnected by the remote peer (broker or network):

[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'BROKERFAIL',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: failed: err: Local: Broker transport failure: (errno: Undefined error: 0)' }
[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 3,
  fac: 'FAIL',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: Receive failed: No error' }
[Wed May 24 2017 19:17:53 GMT-0700 (PDT)] INFO { severity: 7,
  fac: 'STATE',
  message: '[thrd:sasl_ssl://kafka05-prod01.*]: sasl_ssl://kafka05-prod01.*:9093/4: Broker changed state UP -> DOWN' }

Can you check the broker logs for that same time and see if there is an exception?

There's some odd log-line duplication in the log file; I'm not sure whether you are running multiple client instances writing to the same log or something else is going on.

@mimaison
Contributor Author

The only relevant messages I can find in the logs are:

[2017-05-25 02:17:40,978] WARN Attempting to send response via channel for which there is no open connection, connection id 39 (kafka.network.Processor)

There are quite a few of those.

The broker logic that emits it is:

// channel can be null if the selector closed the connection because it was idle for too long
if (channel == null) {
  warn(s"Attempting to send response via channel for which there is no open connection, connection id $id")
  response.request.updateRequestMetrics()
}
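(If I'm reading that comment right, the "selector closed the connection" path is the broker-side idle-connection reaper, governed by the broker's connections.max.idle.ms setting, 10 minutes by default.)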

@seansabour

@edenhill Any idea what could be causing this? It is causing a lot of noise on our end, and we would prefer not to disable logging in case a real issue occurs.

@edenhill
Contributor

Since these log messages indicate that the remote peer closed the connection, you will need to check the broker or the network (firewall, NAT, session timeouts, etc.) for hints.

@mimaison
Contributor Author

mimaison commented Aug 2, 2017

After further investigation, it looks like these messages could come from brokers that are not actively being used by the client. Since librdkafka connects to all the brokers in the cluster, the brokers that don't host any of the client's partition leaders will time out the idle connections.

I'll try to confirm that with the teams that are seeing this.

@edenhill
Contributor

edenhill commented Aug 4, 2017

@mimaison
Contributor Author

@seansabour Can you confirm #1218 (comment)? From node-rdkafka, you can find the leaders for your partitions by calling getMetadata(), as in the sketch below.
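A sketch of that check on a connected client (the topic name is a placeholder; the metadata shape follows node-rdkafka's getMetadata() callback):

consumer.getMetadata({ topic: 'my-topic', timeout: 10000 }, (err, metadata) => {
    if (err) {
        console.error('Metadata request failed:', err);
        return;
    }
    metadata.topics.forEach((topic) => {
        topic.partitions.forEach((p) => {
            // Brokers that never show up as a leader here only hold the
            // client's idle bootstrap connections.
            console.log(topic.name + '[' + p.id + '] leader: broker ' + p.leader);
        });
    });
});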

@edenhill edenhill closed this as completed Jan 4, 2018