
rdkafka_broker.c:2755:rd_kafka_fetch_reply_handle: assert failed #1948

Closed
7 tasks done
Kshitij29 opened this issue Aug 14, 2018 · 5 comments
Kshitij29 commented Aug 14, 2018

Description

I get a lot of partition count change messages as follows:

```
%4|1534226113.072|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [14] is unknown (partition_cnt 6)
%4|1534226113.072|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [13] is unknown (partition_cnt 6)
%4|1534226113.072|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [12] is unknown (partition_cnt 6)
%5|1534226114.589|PARTCNT|rdkafka#consumer-9| [thrd:main]: Topic abc.com partition count changed from 6 to 15
%5|1534226121.090|PARTCNT|rdkafka#consumer-9| [thrd:main]: Topic abc.com partition count changed from 15 to 6
%4|1534226121.090|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [14] is unknown (partition_cnt 6)
%4|1534226121.090|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [13] is unknown (partition_cnt 6)
%4|1534226121.090|LEADER|rdkafka#consumer-9| [thrd:main]: abc.com [12] is unknown (partition_cnt 6)
%5|1534226121.177|PARTCNT|rdkafka#consumer-9| [thrd:main]: Topic abc.com partition count changed from 6 to 15
```
However, kafka-topics.sh reports a constant partition count of 15.

After these partition count change messages, my daemon crashes with the following error:
```
*** rdkafka_broker.c:2755:rd_kafka_fetch_reply_handle: assert: tver && rd_kafka_toppar_s2i(tver->s_rktp) == rktp ***
Abort trap: 6
```

How to reproduce

Created 10 consumers in 10 separate threads using github.com/confluentinc/confluent-kafka-go.
All consumer threads are spawned by a main thread and subscribe to the same topic (here "abc.com"). The issue appears as soon as the main thread starts the consumers; a minimal sketch of the setup is below.
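
For reference, a minimal sketch of such a setup. This is illustrative, not the reporter's actual code: `bootstrap.servers` is assumed (the report does not give a broker address), and the rest of the configuration mirrors the checklist below.

```go
package main

import (
	"fmt"
	"sync"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			c, err := kafka.NewConsumer(&kafka.ConfigMap{
				"bootstrap.servers":          "localhost:9092", // assumed; not given in the report
				"group.id":                   "abc.com",
				"auto.offset.reset":          "earliest",
				"auto.commit.interval.ms":    5000,
				"api.version.request":        "false",
				"broker.version.fallback":    "0.10.1.0",
				"broker.address.family":      "v4",
				"queued.max.messages.kbytes": 100,
			})
			if err != nil {
				fmt.Printf("consumer %d: create failed: %v\n", id, err)
				return
			}
			defer c.Close()

			// All ten consumers join the same group and subscribe to the same topic.
			if err := c.SubscribeTopics([]string{"abc.com"}, nil); err != nil {
				fmt.Printf("consumer %d: subscribe failed: %v\n", id, err)
				return
			}
			for {
				// Poll drives the consumer; the assert fired inside
				// librdkafka's fetch handling shortly after startup.
				if ev := c.Poll(100); ev != nil {
					fmt.Printf("consumer %d: %v\n", id, ev)
				}
			}
		}(i)
	}
	wg.Wait()
}
```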

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v0.11.4
  • Apache Kafka version: v0.10.1.0
  • librdkafka client configuration: "group.id": "abc.com", "auto.offset.reset": "earliest", "auto.commit.interval.ms": 5000, "api.version.request": "false", "broker.version.fallback": "0.10.1.0", "broker.address.family": "v4", "queued.max.messages.kbytes": 100
  • Operating system: macOS Sierra 10.12.6
  • Provide logs (with debug=.. as necessary) from librdkafka: In Description
  • Provide broker log excerpts: NA
  • Critical issue: yes
@edenhill
Contributor

It seems like your cluster is desynchronized, with different brokers returning different partition counts for the topic.

librdkafka queries the brokers for topic metadata directly, while kafka-topics.sh queries ZooKeeper. A client should only query the brokers, not ZooKeeper.

As for the crash: there are some known problems when the partition count decreases (which should never happen in a healthy cluster), and this crash is most likely related to that. We'll look into fixing the crash, but please do try to fix your cluster.
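
For instance, a client can ask the brokers directly for the partition count it actually sees. A minimal confluent-kafka-go sketch; the broker address and group id are illustrative assumptions:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",  // assumed broker address
		"group.id":          "metadata-check",  // illustrative group id
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	topic := "abc.com"
	// GetMetadata sends a MetadataRequest to a broker; it never talks to
	// ZooKeeper, so this is the partition count the consumer will act on.
	md, err := c.GetMetadata(&topic, false, 5000)
	if err != nil {
		panic(err)
	}
	if t, ok := md.Topics[topic]; ok {
		fmt.Printf("broker-reported partition count for %q: %d\n", topic, len(t.Partitions))
	}
}
```

Pointing `bootstrap.servers` at one broker at a time and comparing the output would show whether the brokers disagree with each other (and with kafka-topics.sh).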

edenhill added the bug label Aug 14, 2018
@Kshitij29
Author

Thanks for the prompt reply @edenhill. I'll check whether the cluster is properly synchronised. However, if this desynchronisation persists in our cluster for some reason, I'll wait for your fix.

edenhill added a commit that referenced this issue Aug 15, 2018
…1948)

This may happen when the cluster is desynchronized and different
brokers report different partition counts for a topic, resulting in
the at-request-time rktp being removed and a new rktp created before
the fetch response is returned.
@edenhill
Contributor

Fixed on master

@Kshitij29
Author

Thanks for the solution @edenhill. The application no longer crashes, even with out-of-sync brokers.

@edenhill
Contributor

Thank you!
