
ERROR Uncaught exception in herder work thread in Distributed mode #189

Closed

aldobd opened this issue Feb 1, 2017 · 11 comments

aldobd commented Feb 1, 2017

Hi,

I am trying to start Kafka Connect (Confluent 3.1.1) in distributed mode, but I am running into a problem:

[2017-02-01 11:08:51,893] INFO Discovered coordinator inaki-P552LA:9092 (id: 2147483647 rack: null) for group test. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:555)
[2017-02-01 11:08:52,803] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:181)
java.lang.NullPointerException
at org.apache.kafka.connect.storage.KafkaConfigBackingStore$ConsumeCallback.onCompletion(KafkaConfigBackingStore.java:444)
at org.apache.kafka.connect.storage.KafkaConfigBackingStore$ConsumeCallback.onCompletion(KafkaConfigBackingStore.java:424)
at org.apache.kafka.connect.util.KafkaBasedLog.poll(KafkaBasedLog.java:253)
at org.apache.kafka.connect.util.KafkaBasedLog.readToLogEnd(KafkaBasedLog.java:293)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:143)
at org.apache.kafka.connect.storage.KafkaConfigBackingStore.start(KafkaConfigBackingStore.java:259)
at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:114)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:169)
at java.lang.Thread.run(Thread.java:745)
[2017-02-01 11:08:52,804] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2017-02-01 11:08:52,804] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2017-02-01 11:08:52,807] INFO Stopped ServerConnector@e3cee7b{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2017-02-01 11:08:52,813] INFO Stopped o.e.j.s.ServletContextHandler@120f38e6{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)

This looks similar to confluentinc/kafka-connect-hdfs#85.

I followed some of the steps mentioned there, and if I start the workers before Connect it runs without the NullPointerException, but when I send messages I get timeout errors.

@aldobd aldobd closed this as completed Feb 2, 2017
ewencp (Contributor) commented Feb 3, 2017

What version are you on, and did you resolve this? We used to see this occasionally, and I think we added some logging to give more information if it happened again, because it was difficult to tell from the code how things had gotten out of sync.

aldobd (Author) commented Feb 23, 2017

I couldn't resolve it. I just reset everything to the defaults by following the documentation and tested again, and the error disappeared.

txbm commented Nov 2, 2018

Reproducing this consistently with Kafka Connect 1.1.0 against a 1.0.0 cluster. Repro steps:

  1. Shut down Kafka Connect
  2. Delete the Kafka Connect internal topics (offsets, status, configs)
  3. Start Kafka Connect back up (assuming topic auto-creation is enabled in the cluster)
  4. Send a GET request to https://my-kafka-connect/connectors
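For reference, steps 2–4 above look roughly like this on the command line (a sketch only: Kafka 1.x used the `--zookeeper` flag for `kafka-topics.sh`, and the hostnames and default topic names here are assumptions about the environment):

```shell
# Step 2: delete the Kafka Connect internal topics (default names assumed)
for t in connect-offsets connect-status connect-configs; do
  kafka-topics.sh --zookeeper localhost:2181 --delete --topic "$t"
done

# Step 3: restart the Kafka Connect worker with topic auto-creation enabled, then...

# Step 4: hit the Connect REST API; this is the request that times out
curl http://localhost:8083/connectors
```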

Symptoms: you'll get a timeout on the GET request and an exception like this:

Uncaught exception in herder work thread, exiting:
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

Interestingly, it succeeds in creating the offsets topic, but not the config or status topics.

Resetting the cluster repeatedly (by deleting the Kafka Connect topics) does not resolve the issue.

This happened immediately after upgrading Kafka Connect from 1.0 to 1.1, with no other changes to the cluster or deployment environment. @ewencp

txbm commented Nov 2, 2018

I may have just figured this out... it looks like the status topic has been eternally marked for deletion (confirmed from within ZK). It's no longer listed in the broker metadata as an existing topic, but it appears ZK never completed the delete. I think this causes a strange state: when Kafka Connect then tries to recreate the topic, Kafka can't service the request because ZK disagrees. Not sure if this is the same root cause as the originally reported issue, but it's something.

txbm commented Nov 2, 2018

We ended up forcing the controllers to reset their "deleting" state for the Kafka Connect topics by:

  1. Shutting down Kafka Connect
  2. Running deletes from kafka-topics.sh
  3. Deleting the ZK entries
  4. Creating the topics (this will appear to fail if the controllers are still in the "deleting" state, but that's OK)
  5. Running the deletes again from kafka-topics.sh (this resets the controllers' "deleting" state)
  6. Restarting Kafka Connect so it auto-creates the topics for real
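A sketch of that sequence, assuming default topic names, a local ZooKeeper at localhost:2181, and an older ZooKeeper CLI that supports `rmr` (everything here is environment-specific and should be adapted, not run verbatim):

```shell
# 2. Run the topic deletes (may leave the controller stuck in "deleting")
for t in connect-offsets connect-status connect-configs; do
  kafka-topics.sh --zookeeper localhost:2181 --delete --topic "$t"
done

# 3. Remove the leftover ZK entries, including the stuck deletion markers
for t in connect-offsets connect-status connect-configs; do
  zookeeper-shell.sh localhost:2181 rmr "/brokers/topics/$t"
  zookeeper-shell.sh localhost:2181 rmr "/admin/delete_topics/$t"
done

# 4. Recreate the topics (may appear to fail -- that's OK)
# 5. Run the deletes above again to reset the controllers' "deleting" state
# 6. Restart Kafka Connect so it auto-creates the topics for real
```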

@gmanolache

Had the same bug: Kafka Connect was creating the three topics connect-configs, connect-offsets, and connect-status with the default number of partitions (in my case 64, as defined in the Kafka server properties), which was causing this issue.

As in the solution above, I ended up removing the topics, recreating them with 1 partition each, and restarting Kafka Connect.
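One way to apply that fix (the broker address and replication factor are assumptions, and the `--bootstrap-server` flag requires a newer Kafka CLI; note that the Connect internal topics must use log compaction):

```shell
# Delete and recreate each Connect internal topic with a single partition
for t in connect-configs connect-offsets connect-status; do
  kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic "$t"
  kafka-topics.sh --bootstrap-server localhost:9092 --create --topic "$t" \
    --partitions 1 --replication-factor 1 --config cleanup.policy=compact
done
```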

liranshaked commented Dec 25, 2019

I am unable to start Kafka Connect. I see the following in the Kafka Connect logs:

Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)

org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

Full logs:

[2019-12-25 11:36:31,409] INFO ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [localhost:9092]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = connect-cluster
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
(org.apache.kafka.clients.consumer.ConsumerConfig:347)
[2019-12-25 11:36:31,448] WARN The configuration 'config.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,448] WARN The configuration 'status.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,448] WARN The configuration 'plugin.path' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'internal.key.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'config.storage.replication.factor' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'value.converter.schema.registry.url' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'internal.key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'status.storage.replication.factor' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'internal.value.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'internal.value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,449] WARN The configuration 'offset.storage.replication.factor' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,450] WARN The configuration 'offset.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,450] WARN The configuration 'value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,450] WARN The configuration 'key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,450] WARN The configuration 'key.converter.schema.registry.url' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:355)
[2019-12-25 11:36:31,450] INFO Kafka version: 5.3.0-ccs (org.apache.kafka.common.utils.AppInfoParser:117)
[2019-12-25 11:36:31,450] INFO Kafka commitId: 6481debc2be778ee (org.apache.kafka.common.utils.AppInfoParser:118)
[2019-12-25 11:36:31,451] INFO Kafka startTimeMs: 1577273791450 (org.apache.kafka.common.utils.AppInfoParser:119)
[2019-12-25 11:36:31,533] INFO [Producer clientId=producer-1] Cluster ID: 3gSpLKOtT8y-axt0zWl0oA (org.apache.kafka.clients.Metadata:261)
[2019-12-25 11:36:31,634] INFO [Consumer clientId=consumer-1, groupId=connect-cluster] Cluster ID: 3gSpLKOtT8y-axt0zWl0oA (org.apache.kafka.clients.Metadata:261)
Dec 25, 2019 11:36:32 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.RootResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.RootResource will be ignored.
Dec 25, 2019 11:36:32 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will be ignored.
Dec 25, 2019 11:36:32 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be ignored.
Dec 25, 2019 11:36:32 AM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.

[2019-12-25 11:36:32,532] INFO Started o.e.j.s.ServletContextHandler@43045f9f{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:855)
[2019-12-25 11:36:32,532] INFO REST resources initialized; server is started and ready to handle requests (org.apache.kafka.connect.runtime.rest.RestServer:233)
[2019-12-25 11:36:32,532] INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect:56)
[2019-12-25 11:37:31,457] ERROR [Worker clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
[2019-12-25 11:37:31,504] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:66)
[2019-12-25 11:37:31,519] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:241)
[2019-12-25 11:37:31,552] INFO Stopped http_8083@24361cfc{HTTP/1.1,[http/1.1]}{0.0.0.0:8083} (org.eclipse.jetty.server.AbstractConnector:341)
[2019-12-25 11:37:31,571] INFO node0 Stopped scavenging (org.eclipse.jetty.server.session:167)
[2019-12-25 11:37:31,706] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:258)
[2019-12-25 11:37:31,706] INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder:533)
[2019-12-25 11:37:36,749] INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder:553)
[2019-12-25 11:37:36,749] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:71)

@ishanivij

I had the same issue. Resolved by deleting the existing topics connect-configs, connect-status, and connect-offsets and recreating them with 1 partition each (as suggested by @gmanolache).

txbm commented Sep 17, 2020

> I had the same issue. Resolved by deleting the existing topics connect-configs, connect-status, connect-offsets and recreating with number of partitions =1 for all. (As suggested by @gmanolache )

Apologies if this is obvious, but for larger or production-grade workloads you will need more than 1 partition on some of those topics. This deletion-state issue notwithstanding, I just thought I should mention that.
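To make this concrete: the config topic must always have exactly one partition, while the offset and status topics can (and in production usually should) have more; the Kafka Connect documentation suggests 25 and 5 as reasonable defaults. A sketch with production-leaning settings (the broker address and exact counts are illustrative, not prescriptive):

```shell
# Config topic: MUST be a single partition, compacted
kafka-topics.sh --bootstrap-server broker:9092 --create --topic connect-configs \
  --partitions 1 --replication-factor 3 --config cleanup.policy=compact

# Offset topic: many partitions for parallelism, compacted
kafka-topics.sh --bootstrap-server broker:9092 --create --topic connect-offsets \
  --partitions 25 --replication-factor 3 --config cleanup.policy=compact

# Status topic: a few partitions, compacted
kafka-topics.sh --bootstrap-server broker:9092 --create --topic connect-status \
  --partitions 5 --replication-factor 3 --config cleanup.policy=compact
```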

shivusajjan commented Oct 8, 2021

I deployed Kafka, ZooKeeper, and Kafka Connect using Strimzi in Kubernetes. I updated the Kafka Connect image in Kubernetes to add the RabbitMQSourceConnector plugin, and it worked initially, but after restarting Kafka Connect we get the following error. Does anyone know why this error occurs?

2021-10-08 06:07:13,122 ERROR [Worker clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [DistributedHerder-connect-1]
java.lang.NullPointerException
at org.apache.kafka.connect.storage.KafkaConfigBackingStore$ConsumeCallback.onCompletion(KafkaConfigBackingStore.java:685)
at org.apache.kafka.connect.storage.KafkaConfigBackingStore$ConsumeCallback.onCompletion(KafkaConfigBackingStore.java:481)
at org.apache.kafka.connect.util.KafkaBasedLog.poll(KafkaBasedLog.java:264)
at org.apache.kafka.connect.util.KafkaBasedLog.readToLogEnd(KafkaBasedLog.java:287)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:154)
at org.apache.kafka.connect.storage.KafkaConfigBackingStore.start(KafkaConfigBackingStore.java:265)
at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:123)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:277)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-10-08 06:07:13,123 INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect) [Thread-10]
2021-10-08 06:07:13,125 INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer) [Thread-10]
2021-10-08 06:07:13,130 INFO Stopped http_8083@39c11e6c{HTTP/1.1,[http/1.1]}{0.0.0.0:8083} (org.eclipse.jetty.server.AbstractConnector) [Thread-10]
2021-10-08 06:07:13,130 INFO node0 Stopped scavenging (org.eclipse.jetty.server.session) [Thread-10]
2021-10-08 06:07:13,131 INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer) [Thread-10]
2021-10-08 06:07:13,131 INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [Thread-10]
2021-10-08 06:07:18,132 INFO [Worker clientId=connect-1, groupId=connect-cluster] Herder stopped (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [Thread-10]
2021-10-08 06:07:18,132 INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect) [Thread-10]

@AllanNyasha

> (quoting @txbm's Nov 2, 2018 repro steps above)

this worked for me
