Race when resizing channel across separate clients #20
Comments
Friendly ping @dallison, any opinion on what might be causing this?
@a7g4 I dug into this a bit, but I'm not able to reproduce the crashes in any config I try (gcc/clang, libstdc++/libc++, opt/dbg). But looking at the code and at Dave's Monday fix to the 'reload' function (see 8a4bdeb#diff-0456d178d4819d469423f58832be3dd34231c1325bb7ea678813e48cca5f1b11L99), I suspect this bug is fixed: that reload bug would definitely cause the sort of crash you reported, i.e. the buffer_index being out of sync with the `buffers_` vector, since the reload is how a subscriber checks whether `buffers_` needs updating before it looks at the message slots. But as I said, I can't verify that, because I can't reproduce your error.
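(For readers following along: here is a minimal, hypothetical sketch of the reload pattern described above. All names here are assumptions based on this discussion, not the actual subspace code.)

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the 'reload' check; identifiers are assumptions,
// not copied from subspace. The subscriber compares its local view of the
// buffer list against a shared counter and remaps before dereferencing any
// message slot, which is what keeps a slot's buffer_index valid.
struct SharedChannelState {
  std::uint32_t num_buffers;  // Bumped by the server when the channel grows.
};

bool ReloadIfNecessary(const SharedChannelState& shared,
                       std::vector<void*>& mapped_buffers) {
  if (shared.num_buffers == mapped_buffers.size()) {
    return false;  // Local view is current; slots can be read safely.
  }
  // In a real client the newly added shared-memory segments would be
  // mmap'ed here; afterwards a slot's buffer_index resolves correctly again.
  mapped_buffers.resize(shared.num_buffers, nullptr);
  return true;
}
```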
I've found the issue: an argument was being accessed after it had been std::moved. I have a fix.
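(For context, this is the general shape of that bug class; a hypothetical illustration, not the actual subspace code. With `std::string` a moved-from read is merely valid-but-unspecified, but for types holding raw handles or pointers it can crash exactly like the reports above.)

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// BUGGY: reads the argument after it has been moved from.
void AddName(std::vector<std::string>& names, std::string name) {
  names.push_back(std::move(name));         // 'name' is now in a moved-from state.
  std::printf("added %s\n", name.c_str());  // BUG: use-after-move.
}

// FIXED: take what you need from the argument before moving it.
void AddNameFixed(std::vector<std::string>& names, std::string name) {
  std::printf("adding %s\n", name.c_str());
  names.push_back(std::move(name));
}
```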
Sorry for the delay; for some reason I'm not getting notified of some issues on GitHub.
No worries! Let us know how we can help 🙂
Is that this commit: 8a4bdeb? If so, it doesn't seem to resolve the issue 😢 We did notice that this issue seems pretty common on
I did a bit of debugging on our machine. Here is what I've been able to gather so far from running that test in gdb. Error:
Stack trace on all relevant threads:
Some stack variables printed out from gdb:
Observations:
Currently, I'm suspicious of
I found the bug. It's a race condition between the publisher incrementing the ref count on the new (resized) slot (at https://github.com/dallison/subspace/blob/main/client/client.cc#L264, via the `SetSlotToBiggestBuffer` call after the server replies to the resize request) and the subscriber unmapping unused buffers (via the `UnmapUnusedBuffers` function, see https://github.com/dallison/subspace/blob/main/common/channel.cc#L432C15-L432C33). Basically, the subscriber can come around to clean up unused buffers in the window after the server has resized the channel but before the publisher has claimed the new slot. Kaboom. I can fix it pretty easily by never cleaning up the last buffer (i.e., always assume the largest buffer is in use, even if its ref count is zero). I'll make a PR soon.
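(A minimal sketch of what that fix could look like. The `Buffer` struct and field names are assumptions based on the discussion, not the actual subspace code; the key point is that the cleanup loop deliberately skips the final, largest buffer even when its ref count is zero.)

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a mapped channel buffer; names are assumed.
struct Buffer {
  int refs = 0;  // Number of slots currently referencing this buffer.
};

// Unmap every unused buffer EXCEPT the last (largest) one. A publisher that
// just asked the server to resize the channel may not have incremented the
// ref count on the newest buffer yet, so a zero ref count on the tail buffer
// does not prove it is unused.
void UnmapUnusedBuffers(std::vector<std::unique_ptr<Buffer>>& buffers) {
  for (std::size_t i = 0; i + 1 < buffers.size(); ++i) {
    if (buffers[i] != nullptr && buffers[i]->refs == 0) {
      buffers[i].reset();  // Stand-in for munmap'ing the region.
    }
  }
}
```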
In theory there is still a race, though: if multiple publishers resize at the same time, there could be multiple buffers at the tail end of the channel whose ref counts haven't been incremented yet. Maybe the proper solution is for the server to set the ref count to 1 when extending the buffers, instead of having the publishers increment it after the reply.
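(That alternative could look something like this sketch; again, hypothetical names rather than the real subspace server code. By pre-claiming the reference before replying, the server closes the window in which a subscriber could ever observe a zero ref count on a buffer a publisher is about to use.)

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mapped channel buffer; names are assumed.
struct Buffer {
  int refs = 0;  // Slot reference count, shared with clients in practice.
};

// The server sets the ref count *before* replying to the resize request.
// The publisher then adopts this pre-claimed reference instead of
// incrementing the count itself after the reply arrives.
Buffer* ExtendChannel(std::vector<std::unique_ptr<Buffer>>& buffers) {
  auto buf = std::make_unique<Buffer>();
  buf->refs = 1;  // Pre-claimed on the publisher's behalf.
  buffers.push_back(std::move(buf));
  return buffers.back().get();
}
```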
I've been chasing down a weird crash and I think it's a race when a channel gets resized. The symptoms of the crash are an error message like this (sometimes other values) followed by termination:
I've since managed to isolate it to channel resizing (I think). This is a test that reproduces the issue about 1 in 10 runs (it varies: on one of my machines it's 1 in 2, on another it's closer to 1 in 20):
This is the output and the stack trace (`macos-aarch64`, compiled with clang, running under `lldb`) on one machine (this machine crashes about 1 in every 20 runs):

On a different machine (`linux-aarch64`, compiled with `gcc`, running under `gdb`; this machine triggers the crash every other run):