
assignor_state is NULL initially and it's only set after assignment but its value isn't checked before calling #4312

Closed
mensfeld opened this issue Jun 9, 2023 · 8 comments · Fixed by #4381

Comments

@mensfeld

mensfeld commented Jun 9, 2023

Hey,

This is an expansion of the report made in #4308. I created a separate issue because I don't have edit rights to the original one, and I find that this information relates not only to the shutdown of the Ruby process but also to any close attempt on the consumer during a cooperative-sticky rebalance.

If you consider it part of #4308 please merge them 🙏

How to reproduce

  1. Initialize a few consumers in the same process (multiple threads) or many processes.
  2. Let them rebalance, and during the cooperative-sticky rebalance, try to close them
  3. Profit (crash)

Here is the simplified code I used to reproduce this. It reproduces the crash in 100% of cases:

loop do
  # Spin up several consumers in parallel threads so that close races
  # with an ongoing cooperative-sticky rebalance.
  [1, 2, 3, 4, 5].map do
    Thread.new do
      # setup_rdkafka_consumer is a local test helper that builds an
      # rdkafka consumer with the given config overrides.
      consumer = setup_rdkafka_consumer(
        'partition.assignment.strategy': 'cooperative-sticky'
      )
      consumer.subscribe('topic')
      consumer.poll(500)
      consumer.close
    end
  end.each(&:join)
end

The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it.
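
For reference, the same flow expressed against the librdkafka C API. A minimal sketch, assuming rk is the already-subscribed consumer handle:

/* Minimal sketch of the close flow described above, assuming `rk` is the
 * already-subscribed consumer handle. The assertion aborts the process
 * somewhere in this close/destroy sequence when a cooperative-sticky
 * rebalance is still in flight. */
rd_kafka_consumer_close(rk);   /* leave the group, serve pending rebalance/close events */
rd_kafka_destroy(rk);          /* free the handle */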

Every single execution ends up with this:

ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
Aborted (core dumped)

I confirmed this behavior exists in the following librdkafka versions:

  • 2.1.1

  • 2.0.2

  • 1.9.2

  • librdkafka version (release number or git tag): 2.1.1 and all mentioned above

  • Apache Kafka version: 2.8.1 and 3.4.0 from bitnami

  • librdkafka client configuration: presented in the above code snippet

  • Operating system: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Provide logs (with debug=.. as necessary) from librdkafka: see the "Logs" section below

  • Provide broker log excerpts: included at the end of the "Logs" section

  • Critical issue: I would consider it critical, because the race condition is not mentioned in the termination docs (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#termination); on the contrary, they state that "There is no need to unsubscribe"

Now let's dive deeper:

  1. As a temporary mitigation, it is possible to call rd_kafka_unsubscribe and wait for the rebalance to finish. This works, however it is not reliable and drastically increases the time needed to shut down the consumer, as there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not in a rebalance exactly at the moment rd_kafka_destroy runs. The wait is needed even though rd_kafka_assignment returns no assignments: after revocation but before re-assignment the TPL is also empty, which gives the false impression that nothing is (or will be) assigned. See the sketch after this list.
  2. Guarding the rebalance callback with a mutex seems to help; however, my integration suite under load still showed crashes (though far fewer).
  3. Waiting after rd_kafka_consumer_close also partially mitigates this, because rd_kafka_consumer_close unsubscribes automatically. This may help for long-lived consumers (ref: Segfaults during consumer shutdown karafka/rdkafka-ruby#254), but it does not solve the problem for short-lived consumers (I don't know why) that are in the middle of getting their first assignment.
  4. This issue does not exist with the standard assignment strategy (only with cooperative-sticky).
  5. Using a custom rebalance callback with a mutex also partially mitigates this, especially combined with process-wide state indicating that a rebalance is in progress. However, there is a window where the rebalance callback has not yet been triggered but the rdkafka_sticky_assignor state has already been created; if we attempt to close and destroy the consumer at that point, the crash happens as well. This is less likely because the time between the assignor initialization and the hand-off to the rebalance callback is short, but the issue persists.
  6. More complex consumer frameworks like karafka are affected less, because they do a lot of work around the consumer before running the close, so the probability of being mid-rebalance is lower (though it can still happen).
  7. I did not find similar assertion checks in rdkafka_assignor.c or rdkafka_roundrobin_assignor.c.
  8. Skipping rd_kafka_destroy in a process that is going to exit anyway (long-running processes under shutdown) also partially mitigates this (9/10 times).
  9. A really long initial poll (more than 2 seconds) during a consumer instance's first rebalance also seems to mitigate this, at least partially. This may be because the poll outlasts the whole rebalance, but I cannot give a definitive answer here.
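
Point (1) above, expressed as a hedged C sketch against the public librdkafka API (the 5-second grace period is arbitrary, and, as noted, this is only a partial and unreliable mitigation):

#include <librdkafka/rdkafka.h>

/* Sketch of mitigation (1): unsubscribe, give the resulting rebalance time
 * to settle, then close and destroy. `rk` is an already-subscribed consumer. */
static void close_with_unsubscribe_grace(rd_kafka_t *rk) {
        rd_kafka_unsubscribe(rk);

        /* Serve the consumer for an arbitrary grace period (~5 s here) so the
         * revocation triggered by the unsubscribe can complete. */
        for (int waited_ms = 0; waited_ms < 5000; waited_ms += 100) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Note: an empty rd_kafka_assignment() is NOT a reliable "safe to
         * destroy" signal; between revocation and re-assignment it is also
         * empty, hence the fixed wait above. */
        rd_kafka_topic_partition_list_t *assignment = NULL;
        if (!rd_kafka_assignment(rk, &assignment) && assignment)
                rd_kafka_topic_partition_list_destroy(assignment);

        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}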

Suggested fix

The consumer should probably wait for the rebalance to finish before fully closing itself, though this may introduce closing lag on a long-running rebalance. The other option would be to drop out of the consumer group and just let the rebalance proceed, but I have no idea what effect this would have on the consumer group.
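
On the application side, a hedged sketch of what "wait for the rebalance to finish before closing" could look like with a custom rebalance callback. The flag names and wait logic are illustrative only, and only the cooperative-sticky path is shown:

#include <librdkafka/rdkafka.h>
#include <stdatomic.h>

/* Illustrative application-side flags, not a fix inside librdkafka. */
static atomic_int rebalance_in_progress;
static atomic_int first_assignment_seen;

static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
        rd_kafka_error_t *error = NULL;

        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
                /* cooperative-sticky => incremental assign */
                error = rd_kafka_incremental_assign(rk, partitions);
                atomic_store(&first_assignment_seen, 1);
                atomic_store(&rebalance_in_progress, 0);
        } else if (err == RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS) {
                atomic_store(&rebalance_in_progress, 1);
                error = rd_kafka_incremental_unassign(rk, partitions);
        }

        if (error)
                rd_kafka_error_destroy(error);
}

/* Close only once the first assignment has been delivered and no rebalance
 * is currently flagged. As point (5) in the list notes, this narrows but
 * does not fully close the race window. No timeout handling, for brevity. */
static void close_when_settled(rd_kafka_t *rk) {
        while (!atomic_load(&first_assignment_seen) ||
               atomic_load(&rebalance_in_progress)) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}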

Logs

Here is the tail of the debug=all output (if you need more, just ping me; I can generate it on the spot):

D, [2023-06-09T09:35:10.457239 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457246 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: JoinGroup response clean-up
D, [2023-06-09T09:35:10.457254 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-join -> wait-metadata (state up)
D, [2023-06-09T09:35:10.457262 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Request metadata for 1 topic(s): partition assignor
D, [2023-06-09T09:35:10.457271 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent MetadataRequest (v9, 41 bytes @ 0, CorrId 5)
D, [2023-06-09T09:35:10.457278 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received MetadataResponse (v9, 341 bytes, CorrId 5, rtt 0.51ms)
D, [2023-06-09T09:35:10.457286 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ===== Received metadata (for 1 requested topics): partition assignor =====
D, [2023-06-09T09:35:10.457294 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ClusterId: o9n0mScnRY-4mvOHIvPnZQ, ControllerId: 1001
D, [2023-06-09T09:35:10.457302 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1 brokers, 1 topics
D, [2023-06-09T09:35:10.457309 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Topic 6bad105f430f with 10 partitions
D, [2023-06-09T09:35:10.457317 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Broker #0/1: localhost:9092 NodeId 1001
D, [2023-06-09T09:35:10.457325 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1/1 requested topic(s) seen in metadata
D, [2023-06-09T09:35:10.457332 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" running cooperative-sticky assignor for 1 member(s) and 1 eligible subscribed topic(s):
D, [2023-06-09T09:35:10.457340 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) with 0 owned partition(s) and 1 subscribed topic(s):
D, [2023-06-09T09:35:10.457349 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [-1]
D, [2023-06-09T09:35:10.457356 #2162967] DEBUG -- : rdkafka: [thrd:main]: Sort 10 partitions in existing assignment
D, [2023-06-09T09:35:10.457364 #2162967] DEBUG -- : rdkafka: [thrd:main]: Prepared balanced reassignment for 1 consumers, 0 available partition(s) where of 0 are unassigned (initializing=true, revocationRequired=false, 1 fixed assignments)
D, [2023-06-09T09:35:10.457372 #2162967] DEBUG -- : rdkafka: [thrd:main]: Reassignment not performed after 1 iteration(s) of 0 reassignable partition(s)
D, [2023-06-09T09:35:10.457380 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" cooperative-sticky assignment for 1 member(s) finished in 0.037ms:
D, [2023-06-09T09:35:10.457387 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) assigned 10 partition(s):
D, [2023-06-09T09:35:10.457395 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [0]
D, [2023-06-09T09:35:10.457403 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [1]
D, [2023-06-09T09:35:10.457410 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [2]
D, [2023-06-09T09:35:10.457418 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [3]
D, [2023-06-09T09:35:10.457425 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [4]
D, [2023-06-09T09:35:10.457432 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [5]
D, [2023-06-09T09:35:10.457440 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [6]
D, [2023-06-09T09:35:10.457460 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [7]
D, [2023-06-09T09:35:10.457470 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [8]
D, [2023-06-09T09:35:10.457477 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [9]
D, [2023-06-09T09:35:10.457485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": "cooperative-sticky" assignor run for 1 member(s)
D, [2023-06-09T09:35:10.457493 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Partitions owned by members: 0, partitions assigned by assignor: 10
D, [2023-06-09T09:35:10.457501 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": COOPERATIVE protocol collection sizes: maybe revoking: 0, ready to migrate: 10, unknown but owned: 0
D, [2023-06-09T09:35:10.457508 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": 10 partitions assigned to consumers
D, [2023-06-09T09:35:10.457518 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-metadata -> wait-sync (state up)
D, [2023-06-09T09:35:10.457526 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent SyncGroupRequest (v3, 221 bytes @ 0, CorrId 6)
D, [2023-06-09T09:35:10.457533 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457541 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received SyncGroupResponse (v3, 10 bytes, CorrId 6, rtt 0.21ms)
D, [2023-06-09T09:35:10.457549 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: SyncGroupRequest failed: Broker: Group rebalance in progress: actions Permanent
D, [2023-06-09T09:35:10.457556 #2162967] DEBUG -- : rdkafka: [thrd:main]: SyncGroup response: Broker: Group rebalance in progress (0 bytes of MemberState data)
D, [2023-06-09T09:35:10.457564 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": synchronization failed: Broker: Group rebalance in progress: rejoining
D, [2023-06-09T09:35:10.457572 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Rejoining group without an assignment: SyncGroup error: Broker: Group rebalance in progress
D, [2023-06-09T09:35:10.457580 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-sync -> init (state up)
I, [2023-06-09T09:35:10.458469 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.185397148132324 ms
D, [2023-06-09T09:35:10.458501 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>8, :topic=>"6bad105f430f", :payload=>"9"}
I, [2023-06-09T09:35:10.468784 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.214381694793701 ms
D, [2023-06-09T09:35:10.468830 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>9, :topic=>"6bad105f430f", :payload=>"10"}
D, [2023-06-09T09:35:10.941137 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941271 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941342 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state init)
D, [2023-06-09T09:35:10.941424 #2162967] DEBUG -- : rdkafka: [thrd:main]: Terminating group "node.rb-e4ef35230e0d_app" in state up with 0 partition(s)
D, [2023-06-09T09:35:10.941485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": unsubscribe from current subscription of size 1 (leave group=true, has joined=true, rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69, join-state init)
D, [2023-06-09T09:35:10.941568 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": clearing subscribed topics list (1)
D, [2023-06-09T09:35:10.941634 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941700 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941753 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": effective subscription list changed from 1 to 0 topic(s):
D, [2023-06-09T09:35:10.941841 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: unsubscribe
D, [2023-06-09T09:35:10.941904 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" initiating rebalance (COOPERATIVE) in state up (join-state init) with 0 assigned partition(s): unsubscribe
D, [2023-06-09T09:35:10.941950 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state wait-join)
ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.

And the matching Kafka broker log for this time window:

karafka_21_kafka | [2023-06-09 07:35:07,160] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 1 (__consumer_offsets-31) (reason: Adding new member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-390196ef-cf8f-4d1a-a029-b64793413735] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 2 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e023de05-5c28-4577-af61-bc3ac4217652] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 2 (__consumer_offsets-31) (reason: removing member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 3 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 3 (__consumer_offsets-31) (reason: Adding new member rdkafka-275f4343-0436-4314-899a-14bd98f1c20b with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 4 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,377] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 4. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,378] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 4 (__consumer_offsets-31) (reason: Adding new member rdkafka-83086da6-e6b6-4744-9330-635edd095984 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,887] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-275f4343-0436-4314-899a-14bd98f1c20b] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 5 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-83086da6-e6b6-4744-9330-635edd095984] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 5 (__consumer_offsets-31) (reason: removing member rdkafka-83086da6-e6b6-4744-9330-635edd095984 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 6 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,399] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 6 (__consumer_offsets-31) (reason: Adding new member rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,400] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 7 (__consumer_offsets-31) with 2 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,404] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 7. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,935] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 7 (__consumer_offsets-31) (reason: removing member rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 8 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,445] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 8 (__consumer_offsets-31) (reason: Adding new member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 9 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 9 (__consumer_offsets-31) (reason: Adding new member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Member rdkafka-d7596725-b61d-4386-a4e3-6b35e6994d75 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Stabilized group node.rb-c0211bcb57d6_app generation 4 (__consumer_offsets-25) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,014] INFO [GroupCoordinator 1001]: Member rdkafka-d518effd-ace9-4f38-a3b2-7736948e2109 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,015] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e293c87d05db_app generation 4 (__consumer_offsets-27) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,221] INFO [GroupCoordinator 1001]: Member rdkafka-bbfb59a4-11b8-4f61-a08e-fe34f54e25aa in group node.rb-1969f87a8fdf_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Stabilized group node.rb-1969f87a8fdf_app generation 2 (__consumer_offsets-48) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc] in group node.rb-1969f87a8fdf_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-1969f87a8fdf_app in state PreparingRebalance with old generation 2 (__consumer_offsets-48) (reason: removing member rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Group node.rb-1969f87a8fdf_app with generation 3 is now empty (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 in group node.rb-e4ef35230e0d_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 10 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,447] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 10 (__consumer_offsets-31) (reason: removing member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 11 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-c0211bcb57d6_app in state PreparingRebalance with old generation 4 (__consumer_offsets-25) (reason: removing member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Group node.rb-c0211bcb57d6_app with generation 5 is now empty (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e293c87d05db_app in state PreparingRebalance with old generation 4 (__consumer_offsets-27) (reason: removing member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Group node.rb-e293c87d05db_app with generation 5 is now empty (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashing the process Jun 9, 2023
@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashing the process Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashes the process Jun 12, 2023
@mensfeld
Author

It seems to be a duplicate of: #4252

Once 2.2.0 is released I will close this one and the rdkafka-ruby related one.

@mensfeld
Author

@emasab I can still reproduce it with 2.2.0:

rdkafka_sticky_assignor.c:2157: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.

@emasab
Collaborator

emasab commented Jul 18, 2023

Hello @mensfeld. This seems like a different thing: the assignor_state is NULL initially and it's only set after assignment, but its value isn't checked by the code before calling the function. I'm marking it as a bug, to reproduce with a test and fix.
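
A minimal sketch of the kind of guard this implies (the function and parameter names below are hypothetical, not librdkafka's actual internals):

/* Hypothetical illustration: only invoke the state-destroy callback when the
 * assignor state was actually created, i.e. after the first assignment. */
static void assignor_state_destroy_if_set(void *assignor_state,
                                          void (*destroy_state_cb)(void *)) {
        if (assignor_state != NULL)
                destroy_state_cb(assignor_state);
}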

@emasab emasab added the bug label Jul 18, 2023
@mensfeld
Author

@emasab thanks. So theoretically I should be able to mitigate it for now by always waiting for the first rebalance to finish on our side, right?

@emasab
Collaborator

emasab commented Jul 18, 2023

That should mitigate it. Thanks for the report!

@mensfeld
Author

Awesome. I will implement this strategy on our side then as a temporary measure.

@mensfeld
Author

@emasab

is NULL initially and it's only set after assignment.

Is it also set after an empty assignment? Can I assume that the moment the rebalance callback (rb_cb) kicks in, I'm good to go with the shutdown?

@mensfeld
Author

@emasab FYI I was able to repro + mitigate on my side. Thanks

@mensfeld mensfeld changed the title Cooperative-sticky ongoing rebalance causes race condition with rd_kafka_destroy crashes the process assignor_state is NULL initially and it's only set after assignment but its value isn't checked before calling Jul 31, 2023
emasab added a commit that referenced this issue Aug 2, 2023
e.g. a destroy happens before the first assignment.
Only affects the cooperative-sticky assignor.

fixes #4312
emasab added a commit that referenced this issue Aug 11, 2023
e.g. a destroy happens before the first assignment.
Only affects the cooperative-sticky assignor.

fixes #4312