
Conversation

@john-z-yang
Contributor

@john-z-yang john-z-yang commented Nov 17, 2024

Overview

Now that the consumer event handling code runs in a dedicated green thread, it turns out it can deadlock with the green thread that polls our rdkafka client. This change moves the rdkafka client polling onto a dedicated blocking thread pool so it never contends with our green threads.
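A minimal sketch of the idea (not this PR's exact code; `poll_once` is a placeholder for the rdkafka client poll, which may block inside librdkafka callbacks):

```rust
use std::time::Duration;

fn poll_once() {
    // Placeholder for the rdkafka consumer poll; sleeps just to keep the sketch runnable.
    std::thread::sleep(Duration::from_millis(100));
}

async fn run_poll_loop() {
    // spawn_blocking moves the loop onto tokio's blocking thread pool, so it
    // can never contend with green threads on the async worker threads.
    tokio::task::spawn_blocking(|| loop {
        poll_once();
    })
    .await
    .expect("poll loop panicked");
}
```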

Details

Async Rust does not rely on preemption to switch which future to execute. It uses a cooperative model, so an underlying POSIX thread can only switch to another green thread when the currently executing green thread hits an .await point in its code. Therefore, any code that runs between .await points is guaranteed to hog that POSIX thread.
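A contrived example of what this means in practice (not code from this change):

```rust
use std::time::Duration;

async fn hogs_the_worker() {
    // Everything up to the next .await runs to completion; a blocking call
    // here stalls every green thread scheduled on this POSIX thread.
    std::thread::sleep(Duration::from_secs(5));
    // Only at an .await point like this can the runtime schedule a different green thread.
    tokio::task::yield_now().await;
}
```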

The rdkafka Rust binding library provides an async client; however, the rebalance callback is not async. To know when the event handler has processed the rebalance event the callback sent, it has to use a sync_channel rather than an async primitive like a oneshot channel to signal completion. This also means that while the rebalance callback is waiting for rendezvous, it does not yield execution.
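A minimal sketch of that hand-off, using hypothetical names (`Event`, `event_tx`, `pre_rebalance`) rather than this crate's actual types:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use tokio::sync::mpsc::UnboundedSender;

// Hypothetical event type for illustration only.
enum Event {
    Revoke,
}

// The rebalance callback is synchronous, so it signals completion through a
// rendezvous (capacity-0) sync_channel instead of awaiting a oneshot.
fn pre_rebalance(event_tx: &UnboundedSender<(Event, SyncSender<()>)>) {
    let (done_tx, done_rx) = sync_channel::<()>(0);
    let _ = event_tx.send((Event::Revoke, done_tx));
    // Blocking wait with no .await point: this POSIX thread is held until
    // handle_events writes to the rendezvous sender.
    let _ = done_rx.recv();
}
```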

sequenceDiagram
    Broker->>+pre_rebalance: Partition revocation
    pre_rebalance->>+handle_events: Write event & rendezvous sender to event sender
    handle_events->>+ActorHandles: .shutdown()
    ActorHandles->>-handle_events: .shutdown() completes
    handle_events->>pre_rebalance: Write to rendezvous sender
    deactivate handle_events
    pre_rebalance->>-Broker: Ack

Meanwhile, the handle_events green thread could be scheduled on the same POSIX thread as the consumer client green thread. Since the rebalance callback is waiting for rendezvous and does not yield execution, the handle_events green thread never gets to run. This deadlocks the consumer.

How do we prevent future problems like this?

This problem has existed since the consumer was merged in here. We just got lucky because the consumer polling happened to run on a different POSIX thread than the event handling thread. We can reproduce the problem on earlier commits by running them with a single-threaded runtime:

#[tokio::main(flavor = "current_thread")]

Since this is the only non-async part of our codebase, moving it onto a dedicated thread pool means this should never happen again.

But to give us more confidence, once we have our integration tests, we should also exercise the program with tokio's current_thread setting.
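A sketch of what such a test could look like (the test name and body are placeholders); it builds the single-threaded runtime explicitly, mirroring the current_thread flavor above:

```rust
#[test]
fn rebalance_does_not_deadlock_on_current_thread_runtime() {
    // Single-threaded runtime, equivalent to flavor = "current_thread".
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build runtime");
    rt.block_on(async {
        // Drive the consumer through a partition revocation here and assert
        // that handle_events still completes instead of deadlocking.
    });
}
```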

  sentry_dsn: None,
  sentry_env: None,
- log_level: LogLevel::Info,
+ log_level: LogLevel::Debug,
Contributor Author

@john-z-yang john-z-yang Nov 17, 2024


Set the default log level to debug here because it seems like this default config is the development config.

@john-z-yang john-z-yang force-pushed the john/fix-rdkafka-deadlock branch from fae0e73 to 552a338 November 17, 2024 01:39
@john-z-yang john-z-yang marked this pull request as ready for review November 17, 2024 01:41
@john-z-yang john-z-yang requested a review from a team November 17, 2024 01:41
@john-z-yang john-z-yang force-pushed the john/fix-rdkafka-deadlock branch from 552a338 to 2b7e2b3 November 17, 2024 04:25
@john-z-yang john-z-yang force-pushed the john/fix-rdkafka-deadlock branch from 2b7e2b3 to cdb4328 November 17, 2024 04:34
@john-z-yang john-z-yang merged commit a8c172f into main Nov 18, 2024
3 checks passed
@john-z-yang john-z-yang deleted the john/fix-rdkafka-deadlock branch November 18, 2024 18:03