Use-after-free when rd_kafka_new returns RD_KAFKA_RESP_ERR__CRIT_SYS_RESOURCE #4100
The function pointer that is the target of the use-after-free was set with
Quuxplusone added a commit to Quuxplusone/librdkafka that referenced this issue on Dec 16, 2022:

> …depath
>
> As the new comment says: We shouldn't return NULL unless failure has actually occurred and our caller can proceed with cleaning up their resources. If librdkafka is still actively running background threads that need to touch those resources, we MUST NOT communicate otherwise to our caller.
>
> Since this codepath "should never happen," it theoretically doesn't matter much what we do here. But `return NULL` in practice leads to use-after-free segfaults on overloaded VMs, so we shouldn't do that. Instead, just loop (and log) until the background threads have run, and then proceed. Slow success is still success.
>
> Fixes confluentinc#4100.
We're hitting this codepath in `rd_kafka_new`. This is in a pathologically large integration test where (1) all Kafka brokers are unreachable by design, and (2) we're running lots of VMs (7x 16-core VMs all simulated on the same single 34-core physical machine), such that "the OS is not scheduling the background threads" is actually a highly likely scenario for us.

Looking at this code, I see that it calls `rd_kafka_log(rk, ...)` and then returns NULL without destroying the `rk` at all. This seems like a memory leak, right? That's bad, but not fatal.

What's fatal is that this codepath leads to a use-after-free for us, because we assume that when `rd_kafka_new` fails, we should clean up the `rk_conf` that we passed to it. In other words, we do exactly what's documented in INTRODUCTION.md. But along this specific codepath, `rd_kafka_new` fails and returns NULL, and yet there are still background threads running, and those background threads are going to try to access the function-pointer callbacks registered in that `conf` object. If we `rd_kafka_conf_destroy(conf)` on failure, then we have a use-after-free on those function pointers, and the symptom is generally a segfault.

#2820 might be related; it mentions this same codepath.
My ideal outcome here would be a simple rule like "When `rd_kafka_new` fails by returning NULL, it always relinquishes ownership of the `rk_conf`, so you can feel free to destroy it at your leisure," or "Whenever `rd_kafka_new` is called, it always takes ownership of the `rk_conf`, so you should never destroy the `rk_conf` after that point, because it now belongs to the library." (The former is what I always thought the rule was. The latter would be awesome, but probably can't be achieved in practice because it would cause double-free bugs in all existing code.)

The second-best outcome would be a simple distinguishing rule for when we should destroy the `rk_conf` and when we shouldn't. For example, "When `rd_kafka_new` fails by returning NULL, it relinquishes ownership of the `rk_conf` if and only if `rd_kafka_last_error() != RD_KAFKA_RESP_ERR__CRIT_SYS_RESOURCE`. Client code should check for that situation and avoid destroying the `rk_conf` if `rd_kafka_last_error() == RD_KAFKA_RESP_ERR__CRIT_SYS_RESOURCE`." (However, notice that that rule is also wrong, because of this codepath.)

Checklist
Please provide the following information:

- `v1.8.2`, `v1.9.2`
- `2.6.3`, I think