Futex crashes under heavy load #951

Open
nyh opened this issue Mar 7, 2018 · 5 comments


nyh commented Mar 7, 2018

This issue was discovered by @benoit-canet back in February 2017, and was recently brought up again by @wkozaczuk.
Benoit reported that when running a Golang HTTP server under load, he saw crashes in the futex code.

It is worth noting why we discovered this problem only in Go code. As I explained in issue #853, the futex code is only lightly used by typical Linux applications, but can be used much more heavily by Go programs - in Benoit's benchmark it was called as much as 10,000 times per second. It's possible that with such heavy use, more bugs get exposed.

I believe the bug is as follows: The FUTEX_WAIT code (see futex() in linux.cc) does this:

waitqueue &q = queues[uaddr];
sched::thread::wait_for(queues_mutex, tmr, q);

So, while we wait on this queue, the queues_mutex is released, and a FUTEX_WAKE can run in parallel (in fact, that's the whole point...).
When FUTEX_WAKE happens, it will find this queue, wake_one() on it, and if the queue becomes empty, it deletes the queue object.

The problem is that waking the waiting thread doesn't guarantee that it runs first (obviously). If it doesn't, the wait_for() in FUTEX_WAIT may resume with the object q already destroyed. That is a bug, because when wait_for() wakes up (see sched.hh) it calls the disarm() method on the waitqueue, and that will crash if it happens after the waitqueue object was already destroyed.
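
To make the suspected interleaving concrete, here is a deliberately simplified, self-contained model of the pattern (not OSv's code; all names here are made up). The point is only that notifying a sleeper does not mean it runs before the waker finishes, so the waker may destroy the queue that the sleeper still has to touch:

    #include <condition_variable>
    #include <map>
    #include <mutex>

    // Toy model of the suspected use-after-free (hypothetical names, not OSv code).
    struct toy_queue {
        std::condition_variable cv;
        int waiters = 0;
    };

    std::mutex toy_mutex;
    std::map<int*, toy_queue*> toy_queues;

    void toy_futex_wait(int* uaddr) {
        std::unique_lock<std::mutex> lk(toy_mutex);
        toy_queue*& q = toy_queues[uaddr];
        if (!q) {
            q = new toy_queue;
        }
        ++q->waiters;
        q->cv.wait(lk);  // the mutex is released while sleeping, so a waker can run
        --q->waiters;    // BUG: toy_futex_wake() may already have deleted q by now,
                         // just like disarm() running on an already-deleted waitqueue
    }

    void toy_futex_wake(int* uaddr) {
        std::lock_guard<std::mutex> lk(toy_mutex);
        auto it = toy_queues.find(uaddr);
        if (it == toy_queues.end()) {
            return;
        }
        it->second->cv.notify_one();     // wake one waiter...
        if (it->second->waiters == 1) {  // ...and if it was the last one,
            delete it->second;           // destroy the queue before the woken
            toy_queues.erase(it);        // waiter has had a chance to run again
        }
    }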

@benoit-canet already sent a patch for this bug, in which he used a shared_ptr instead of the waitqueue object directly: FUTEX_WAKE destroys its shared_ptr, but the actual waitqueue will not be destroyed until the FUTEX_WAIT code, which holds a copy of it, also drops its reference after returning from the wait.
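
Roughly, the idea looks like this (a sketch of the approach, not Benoit's actual patch; only the FUTEX_WAIT path is shown, and the value check and timeout handling are omitted):

    // The map owns the queue through a shared_ptr, and the waiter keeps its own
    // reference alive across the sleep, so FUTEX_WAKE erasing the map entry can
    // no longer destroy the waitqueue while the waiter still needs it.
    std::unordered_map<int*, std::shared_ptr<waitqueue>> queues;

    // In the FUTEX_WAIT path, with queues_mutex held:
    auto& slot = queues[uaddr];
    if (!slot) {
        slot = std::make_shared<waitqueue>();
    }
    auto q = slot;                                  // local copy keeps the queue alive
    sched::thread::wait_for(queues_mutex, tmr, *q);
    // Even if FUTEX_WAKE erased queues[uaddr] and dropped its shared_ptr in the
    // meantime, *q is still valid here; it is only destroyed when this local
    // copy goes out of scope, after the wait (and disarm()) are done.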

I think there may be a more efficient solution to this bug (how about having the waiter, not the waker, remove the empty queue?), but anyway the performance of this futex implementation sucks (see #853), so adding a shared_ptr will not make a noticeable difference, and we can consider Benoit's patch.


nyh commented Mar 7, 2018

Unfortunately, I am not able to reproduce this bug.
I wonder if this trivial patch, which moves the erasure of the unused queue from the waker to the waiter, solves this bug in a simpler way than Benoit's patch with its extra reference counting?

diff --git a/linux.cc b/linux.cc
index 25d28ee0..bf2d99ab 100644
--- a/linux.cc
+++ b/linux.cc
@@ -77,6 +77,9 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     tmr.set(std::chrono::seconds(timeout->tv_sec) +
                             std::chrono::nanoseconds(timeout->tv_nsec));
                     sched::thread::wait_for(queues_mutex, tmr, q);
+                    if (q.empty()) {
+                        queues.erase(uaddr);
+                    }
                     // FIXME: testing if tmr was expired isn't quite right -
                     // we could have had both a wakeup and timer expiration
                     // racing. It would be more correct to check if we were
@@ -87,6 +90,9 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     }
                 } else {
                     q.wait(queues_mutex);
+                    if (q.empty()) {
+                        queues.erase(uaddr);
+                    }
                 }
                 return 0;
             } else {
@@ -108,9 +114,6 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     i->second.wake_one(queues_mutex);
                     waken++;
                 }
-                if(i->second.empty()) {
-                    queues.erase(i);
-                }
                 return waken;
             }
         }


wkozaczuk commented Mar 8, 2018 via email


nyh commented Mar 8, 2018

The more I look into the code, the less I understand why what I thought was a bug (the waker, instead of the waiter, deleting the wait queue) is really a bug...

Contrary to what I said above, it seems that wait_for() on a wait_queue alone does not actually need the queue to still be alive when waking up. The waiter supplies a "wait record" to the queue and the wait is on that record, and the waker is the one who removes this wait record from the wait queue - so the waiter wakes up and only needs to check its own wait record, not the wait queue, which isn't touched at all.
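
In other words, the protocol seems to be roughly the following (a generic model with invented names, not OSv's actual types):

    #include <atomic>
    #include <deque>

    // Model of a wait-record-based queue: the waiter sleeps on its own record,
    // and the waker unlinks that record from the queue before marking it woken,
    // so a waiter that was woken normally never has to touch the queue again.
    struct wait_record_model {
        std::atomic<bool> woken{false};
    };

    struct wait_queue_model {
        std::deque<wait_record_model*> waiters_fifo;

        void wake_one() {
            if (waiters_fifo.empty()) {
                return;
            }
            wait_record_model* wr = waiters_fifo.front();
            waiters_fifo.pop_front();  // the record leaves the queue first...
            wr->woken = true;          // ...and only then is the waiter released
        }
    };
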
It seems the waiter only needs to touch the queue after being woken up if the wakeup was caused by something else (e.g., a timeout), in which case the waiter (in the disarm() code) needs to remove its own wait record from the queue. So the next guess would be that the bug happens during a race between timeout and wakeup, but even then I don't understand how it can happen: in the existing code, wait_queue::wake() only deletes the queue after having woken the wait_record, and in wait_object<waitqueue>::disarm(), if the wait_record is already woken, it returns immediately and does not touch the queue at all... There is one suspicious part in waitqueue.cc, where disarm() does:

    auto& fifo = _wq._waiters_fifo;
    if (_wr.woken()) {
        return;
    }
    ...

This seems to be the "wrong" order - it should have checked woken() first - but even this I don't think is a problem, because that first line does not really read _wq, it only takes a reference (a copy of its address), as far as I can tell?
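
For reference, the defensive reordering suggested above would just swap the two statements (a sketch, and as said, probably not the actual cause of the crash):

    // Check the wait record before taking any reference into _wq, so an
    // already-woken waiter does not touch the (possibly destroyed) queue at all.
    if (_wr.woken()) {
        return;
    }
    auto& fifo = _wq._waiters_fifo;
    ...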

Benoit said he reproduced the bug with "wrk" run against an HTTP server in Go. @benoit-canet, can you perhaps try to recall more information on how you reproduced it? Thanks.

wkozaczuk commented:

@nyh your last comment suggests you no longer think your original theory is correct. Given that we cannot reproduce this issue (I still cannot) and the issue does not describe an exact recipe to reproduce it, shall we close it?


nyh commented Jul 12, 2021

Good question. The bug is probably real, and it will be useful for someone who encounters it in the future to be able to search for past guesses about it, but I agree that it's bad form that this issue mixes the issue description with a lot of wild and/or incorrect guesses :-(
If this issue really bothers you, you can close it, but if not, I suggest we keep it. Maybe we should create a new tag for issues that we can't reproduce, to make it clearer that this is probably a real issue but we have no idea how to proceed with it.
