
assignor_state is NULL initially and it's only set after assignment but its value isn't checked before calling #4312

Closed
mensfeld opened this issue Jun 9, 2023 · 8 comments · Fixed by #4381

Comments

@mensfeld

mensfeld commented Jun 9, 2023

Hey,

This is an expansion of the report made in #4308. I created a separate issue because I don't have edit rights to the original one, and I find that this information relates not only to the shutdown of the Ruby process but also to any close attempt on the consumer during a cooperative-sticky rebalance.

If you consider it part of #4308 please merge them 🙏

How to reproduce

  1. Initialize a few consumers in the same process (multiple threads) or many processes.
  2. Let them rebalance, and during the cooperative-sticky rebalance, try to close them
  3. Profit (crash)

Here is the simplified code I used to reproduce this. It reproduces the crash in 100% of cases:

loop do
  # Spin up several consumers in parallel threads so that close races
  # with an ongoing cooperative-sticky rebalance.
  [1, 2, 3, 4, 5].map do
    Thread.new do
      # setup_rdkafka_consumer is a local test helper that builds an
      # rdkafka consumer with the given config overrides.
      consumer = setup_rdkafka_consumer(
        'partition.assignment.strategy': 'cooperative-sticky'
      )
      consumer.subscribe('topic')
      consumer.poll(500)
      consumer.close
    end
  end.each(&:join)
end

The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it.
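
For reference, the same flow expressed against the librdkafka C API. A minimal sketch, assuming rk is the already-subscribed consumer handle:

/* Minimal sketch of the close flow described above, assuming `rk` is the
 * already-subscribed consumer handle. The assertion aborts the process
 * somewhere in this close/destroy sequence when a cooperative-sticky
 * rebalance is still in flight. */
rd_kafka_consumer_close(rk);   /* leave the group, serve pending rebalance/close events */
rd_kafka_destroy(rk);          /* free the handle */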

Every single execution ends up with this:

ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
Aborted (core dumped)

I confirmed this behavior exists in the following librdkafka versions:

  • 2.1.1

  • 2.0.2

  • 1.9.2

  • librdkafka version (release number or git tag): 2.1.1 and all mentioned above

  • Apache Kafka version: 2.8.1 and 3.4.0 from bitnami

  • librdkafka client configuration: presented in the above code snippet

  • Operating system: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Provide logs (with debug=.. as necessary) from librdkafka: see the "Logs" section below

  • Provide broker log excerpts: included at the end of the "Logs" section

  • Critical issue: I would consider it critical, because the race condition is not mentioned in the termination docs (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#termination); on the contrary, they state that "There is no need to unsubscribe"

Now let's dive deeper:

  1. As a temporary mitigation, it is possible to call rd_kafka_unsubscribe and wait for the rebalance to finish. This works, however it is not reliable and drastically increases the time needed to shut down the consumer, as there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not in a rebalance exactly at the moment rd_kafka_destroy runs. The wait is needed even though rd_kafka_assignment returns no assignments: after revocation but before re-assignment the TPL is also empty, which gives the false impression that nothing is (or will be) assigned. See the sketch after this list.
  2. Guarding the rebalance callback with a mutex seems to help; however, my integration suite under load still showed crashes (though far fewer).
  3. Waiting after rd_kafka_consumer_close also partially mitigates this, because rd_kafka_consumer_close unsubscribes automatically. This may help for long-lived consumers (ref: Segfaults during consumer shutdown karafka/rdkafka-ruby#254), but it does not solve the problem for short-lived consumers (I don't know why) that are in the middle of getting their first assignment.
  4. This issue does not exist with the standard assignment strategy (only with cooperative-sticky).
  5. Using a custom rebalance callback with a mutex also partially mitigates this, especially combined with process-wide state indicating that a rebalance is in progress. However, there is a window where the rebalance callback has not yet been triggered but the rdkafka_sticky_assignor state has already been created; if we attempt to close and destroy the consumer at that point, the crash happens as well. This is less likely because the time between the assignor initialization and the hand-off to the rebalance callback is short, but the issue persists.
  6. More complex consumer frameworks like karafka are affected less, because they do a lot of work around the consumer before running the close, so the probability of being mid-rebalance is lower (though it can still happen).
  7. I did not find similar assertion checks in rdkafka_assignor.c or rdkafka_roundrobin_assignor.c.
  8. Skipping rd_kafka_destroy in a process that is going to exit anyway (long-running processes under shutdown) also partially mitigates this (9/10 times).
  9. A really long initial poll (more than 2 seconds) during a consumer instance's first rebalance also seems to mitigate this, at least partially. This may be because the poll outlasts the whole rebalance, but I cannot give a definitive answer here.
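
Point (1) above, expressed as a hedged C sketch against the public librdkafka API (the 5-second grace period is arbitrary, and, as noted, this is only a partial and unreliable mitigation):

#include <librdkafka/rdkafka.h>

/* Sketch of mitigation (1): unsubscribe, give the resulting rebalance time
 * to settle, then close and destroy. `rk` is an already-subscribed consumer. */
static void close_with_unsubscribe_grace(rd_kafka_t *rk) {
        rd_kafka_unsubscribe(rk);

        /* Serve the consumer for an arbitrary grace period (~5 s here) so the
         * revocation triggered by the unsubscribe can complete. */
        for (int waited_ms = 0; waited_ms < 5000; waited_ms += 100) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Note: an empty rd_kafka_assignment() is NOT a reliable "safe to
         * destroy" signal; between revocation and re-assignment it is also
         * empty, hence the fixed wait above. */
        rd_kafka_topic_partition_list_t *assignment = NULL;
        if (!rd_kafka_assignment(rk, &assignment) && assignment)
                rd_kafka_topic_partition_list_destroy(assignment);

        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}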

Suggested fix

The consumer should probably wait for the rebalance to finish before fully closing itself, though this may introduce closing lag on a long-running rebalance. The other option would be to drop out of the consumer group and just let the rebalance proceed, but I have no idea what effect this would have on the consumer group.
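
On the application side, a hedged sketch of what "wait for the rebalance to finish before closing" could look like with a custom rebalance callback. The flag names and wait logic are illustrative only, and only the cooperative-sticky path is shown:

#include <librdkafka/rdkafka.h>
#include <stdatomic.h>

/* Illustrative application-side flags, not a fix inside librdkafka. */
static atomic_int rebalance_in_progress;
static atomic_int first_assignment_seen;

static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
        rd_kafka_error_t *error = NULL;

        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
                /* cooperative-sticky => incremental assign */
                error = rd_kafka_incremental_assign(rk, partitions);
                atomic_store(&first_assignment_seen, 1);
                atomic_store(&rebalance_in_progress, 0);
        } else if (err == RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS) {
                atomic_store(&rebalance_in_progress, 1);
                error = rd_kafka_incremental_unassign(rk, partitions);
        }

        if (error)
                rd_kafka_error_destroy(error);
}

/* Close only once the first assignment has been delivered and no rebalance
 * is currently flagged. As point (5) in the list notes, this narrows but
 * does not fully close the race window. No timeout handling, for brevity. */
static void close_when_settled(rd_kafka_t *rk) {
        while (!atomic_load(&first_assignment_seen) ||
               atomic_load(&rebalance_in_progress)) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}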

Logs

Here is the tail of the debug=all output (if you need more, just ping me; I can generate it on the spot):

D, [2023-06-09T09:35:10.457239 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457246 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: JoinGroup response clean-up
D, [2023-06-09T09:35:10.457254 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-join -> wait-metadata (state up)
D, [2023-06-09T09:35:10.457262 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Request metadata for 1 topic(s): partition assignor
D, [2023-06-09T09:35:10.457271 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent MetadataRequest (v9, 41 bytes @ 0, CorrId 5)
D, [2023-06-09T09:35:10.457278 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received MetadataResponse (v9, 341 bytes, CorrId 5, rtt 0.51ms)
D, [2023-06-09T09:35:10.457286 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ===== Received metadata (for 1 requested topics): partition assignor =====
D, [2023-06-09T09:35:10.457294 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ClusterId: o9n0mScnRY-4mvOHIvPnZQ, ControllerId: 1001
D, [2023-06-09T09:35:10.457302 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1 brokers, 1 topics
D, [2023-06-09T09:35:10.457309 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Topic 6bad105f430f with 10 partitions
D, [2023-06-09T09:35:10.457317 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Broker #0/1: localhost:9092 NodeId 1001
D, [2023-06-09T09:35:10.457325 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1/1 requested topic(s) seen in metadata
D, [2023-06-09T09:35:10.457332 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" running cooperative-sticky assignor for 1 member(s) and 1 eligible subscribed topic(s):
D, [2023-06-09T09:35:10.457340 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) with 0 owned partition(s) and 1 subscribed topic(s):
D, [2023-06-09T09:35:10.457349 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [-1]
D, [2023-06-09T09:35:10.457356 #2162967] DEBUG -- : rdkafka: [thrd:main]: Sort 10 partitions in existing assignment
D, [2023-06-09T09:35:10.457364 #2162967] DEBUG -- : rdkafka: [thrd:main]: Prepared balanced reassignment for 1 consumers, 0 available partition(s) where of 0 are unassigned (initializing=true, revocationRequired=false, 1 fixed assignments)
D, [2023-06-09T09:35:10.457372 #2162967] DEBUG -- : rdkafka: [thrd:main]: Reassignment not performed after 1 iteration(s) of 0 reassignable partition(s)
D, [2023-06-09T09:35:10.457380 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" cooperative-sticky assignment for 1 member(s) finished in 0.037ms:
D, [2023-06-09T09:35:10.457387 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) assigned 10 partition(s):
D, [2023-06-09T09:35:10.457395 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [0]
D, [2023-06-09T09:35:10.457403 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [1]
D, [2023-06-09T09:35:10.457410 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [2]
D, [2023-06-09T09:35:10.457418 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [3]
D, [2023-06-09T09:35:10.457425 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [4]
D, [2023-06-09T09:35:10.457432 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [5]
D, [2023-06-09T09:35:10.457440 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [6]
D, [2023-06-09T09:35:10.457460 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [7]
D, [2023-06-09T09:35:10.457470 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [8]
D, [2023-06-09T09:35:10.457477 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [9]
D, [2023-06-09T09:35:10.457485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": "cooperative-sticky" assignor run for 1 member(s)
D, [2023-06-09T09:35:10.457493 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Partitions owned by members: 0, partitions assigned by assignor: 10
D, [2023-06-09T09:35:10.457501 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": COOPERATIVE protocol collection sizes: maybe revoking: 0, ready to migrate: 10, unknown but owned: 0
D, [2023-06-09T09:35:10.457508 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": 10 partitions assigned to consumers
D, [2023-06-09T09:35:10.457518 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-metadata -> wait-sync (state up)
D, [2023-06-09T09:35:10.457526 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent SyncGroupRequest (v3, 221 bytes @ 0, CorrId 6)
D, [2023-06-09T09:35:10.457533 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457541 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received SyncGroupResponse (v3, 10 bytes, CorrId 6, rtt 0.21ms)
D, [2023-06-09T09:35:10.457549 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: SyncGroupRequest failed: Broker: Group rebalance in progress: actions Permanent
D, [2023-06-09T09:35:10.457556 #2162967] DEBUG -- : rdkafka: [thrd:main]: SyncGroup response: Broker: Group rebalance in progress (0 bytes of MemberState data)
D, [2023-06-09T09:35:10.457564 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": synchronization failed: Broker: Group rebalance in progress: rejoining
D, [2023-06-09T09:35:10.457572 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Rejoining group without an assignment: SyncGroup error: Broker: Group rebalance in progress
D, [2023-06-09T09:35:10.457580 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-sync -> init (state up)
I, [2023-06-09T09:35:10.458469 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.185397148132324 ms
D, [2023-06-09T09:35:10.458501 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>8, :topic=>"6bad105f430f", :payload=>"9"}
I, [2023-06-09T09:35:10.468784 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.214381694793701 ms
D, [2023-06-09T09:35:10.468830 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>9, :topic=>"6bad105f430f", :payload=>"10"}
D, [2023-06-09T09:35:10.941137 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941271 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941342 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state init)
D, [2023-06-09T09:35:10.941424 #2162967] DEBUG -- : rdkafka: [thrd:main]: Terminating group "node.rb-e4ef35230e0d_app" in state up with 0 partition(s)
D, [2023-06-09T09:35:10.941485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": unsubscribe from current subscription of size 1 (leave group=true, has joined=true, rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69, join-state init)
D, [2023-06-09T09:35:10.941568 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": clearing subscribed topics list (1)
D, [2023-06-09T09:35:10.941634 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941700 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941753 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": effective subscription list changed from 1 to 0 topic(s):
D, [2023-06-09T09:35:10.941841 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: unsubscribe
D, [2023-06-09T09:35:10.941904 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" initiating rebalance (COOPERATIVE) in state up (join-state init) with 0 assigned partition(s): unsubscribe
D, [2023-06-09T09:35:10.941950 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state wait-join)
ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.

And the matching Kafka broker log for this time window:

karafka_21_kafka | [2023-06-09 07:35:07,160] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 1 (__consumer_offsets-31) (reason: Adding new member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-390196ef-cf8f-4d1a-a029-b64793413735] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 2 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e023de05-5c28-4577-af61-bc3ac4217652] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 2 (__consumer_offsets-31) (reason: removing member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 3 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 3 (__consumer_offsets-31) (reason: Adding new member rdkafka-275f4343-0436-4314-899a-14bd98f1c20b with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 4 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,377] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 4. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,378] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 4 (__consumer_offsets-31) (reason: Adding new member rdkafka-83086da6-e6b6-4744-9330-635edd095984 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,887] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-275f4343-0436-4314-899a-14bd98f1c20b] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 5 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-83086da6-e6b6-4744-9330-635edd095984] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 5 (__consumer_offsets-31) (reason: removing member rdkafka-83086da6-e6b6-4744-9330-635edd095984 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 6 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,399] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 6 (__consumer_offsets-31) (reason: Adding new member rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,400] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 7 (__consumer_offsets-31) with 2 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,404] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 7. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,935] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 7 (__consumer_offsets-31) (reason: removing member rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 8 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,445] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 8 (__consumer_offsets-31) (reason: Adding new member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 9 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 9 (__consumer_offsets-31) (reason: Adding new member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Member rdkafka-d7596725-b61d-4386-a4e3-6b35e6994d75 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Stabilized group node.rb-c0211bcb57d6_app generation 4 (__consumer_offsets-25) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,014] INFO [GroupCoordinator 1001]: Member rdkafka-d518effd-ace9-4f38-a3b2-7736948e2109 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,015] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e293c87d05db_app generation 4 (__consumer_offsets-27) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,221] INFO [GroupCoordinator 1001]: Member rdkafka-bbfb59a4-11b8-4f61-a08e-fe34f54e25aa in group node.rb-1969f87a8fdf_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Stabilized group node.rb-1969f87a8fdf_app generation 2 (__consumer_offsets-48) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc] in group node.rb-1969f87a8fdf_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-1969f87a8fdf_app in state PreparingRebalance with old generation 2 (__consumer_offsets-48) (reason: removing member rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Group node.rb-1969f87a8fdf_app with generation 3 is now empty (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 in group node.rb-e4ef35230e0d_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 10 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,447] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 10 (__consumer_offsets-31) (reason: removing member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 11 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-c0211bcb57d6_app in state PreparingRebalance with old generation 4 (__consumer_offsets-25) (reason: removing member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Group node.rb-c0211bcb57d6_app with generation 5 is now empty (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e293c87d05db_app in state PreparingRebalance with old generation 4 (__consumer_offsets-27) (reason: removing member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Group node.rb-e293c87d05db_app with generation 5 is now empty (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashing the process Jun 9, 2023
@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashing the process Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashes the process Jun 12, 2023
@mensfeld
Author

It seems to be a duplicate of: #4252

Once 2.2.0 is released I will close this one and the rdkafka-ruby related one.

@mensfeld
Author

@emasab I can still reproduce it with 2.2.0:

rdkafka_sticky_assignor.c:2157: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.

@emasab
Collaborator

emasab commented Jul 18, 2023

Hello @mensfeld. This seems like a different thing: the assignor_state is NULL initially and it's only set after assignment, but its value isn't checked by the code before calling the function. I'm marking it as a bug, to reproduce with a test and fix.
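
A minimal sketch of the kind of guard this implies (the function and parameter names below are hypothetical, not librdkafka's actual internals):

/* Hypothetical illustration: only invoke the state-destroy callback when the
 * assignor state was actually created, i.e. after the first assignment. */
static void assignor_state_destroy_if_set(void *assignor_state,
                                          void (*destroy_state_cb)(void *)) {
        if (assignor_state != NULL)
                destroy_state_cb(assignor_state);
}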

@emasab emasab added the bug label Jul 18, 2023
@mensfeld
Author

@emasab thanks. So theoretically I should be able to mitigate it for now by always waiting for the first rebalance to finish on our side, right?

@emasab
Collaborator

emasab commented Jul 18, 2023

That should mitigate it. Thanks for the report!

@mensfeld
Author

Awesome. I will implement this strategy on our side then as a temporary measure.

@mensfeld
Author

@emasab

is NULL initially and it's only set after assignment.

Is it also set after an empty assignment? Can I assume that the moment the rebalance callback (rb_cb) kicks in, I'm good to go with the shutdown?

@mensfeld
Author

@emasab FYI I was able to repro + mitigate on my side. Thanks

@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashes the process assignor_state is NULL initially and it's only set after assignment but its value isn't checked before calling Jul 31, 2023
emasab added a commit that referenced this issue Aug 2, 2023
e.g. a destroy happens before the first assignment.
Only affects the cooperative-sticky assignor.

fixes #4312
emasab added a commit that referenced this issue Aug 11, 2023
e.g. a destroy happens before the first assignment.
Only affects the cooperative-sticky assignor.

fixes #4312