New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cMojangAPI segmentation faults and asserts #2830

Closed
tigerw opened this Issue Dec 27, 2015 · 28 comments

Comments

Projects
None yet
8 participants
@tigerw
Member

tigerw commented Dec 27, 2015

We are currently experiencing technical difficulties with regards to connections to Mojang; references are as follows:

Potential analysis:

Problem 1

Mojang made their servers close the connection quicker. This means that cBlockingSslClientSocket::OnDisconnected is often called whilst cMojangAPI is still reading data from the SSL context (whatever that is :P). In Debug builds, this causes further calls to cBlockingSslClientSocket::Receive to assert that m_IsConnected is unexpectedly false. This is known.

Problem 2

OnDisconnected runs on a separate thread from the thread cMojangAPI calls Receive from. OnDisconnect also will reset the m_Server pointer of the above class. Thus, in Release builds, I suspect, but am not sure, that a concurrency issue occurs where after m_Server has been reset but before m_IsConnected has been set to false by OnDisconnect, cBlockingSslClientSocket::Disconnect is called automatically by the destruction of the cBlockingSslClientSocket object (because cMojangAPI has completed its work super quickly, yeah) which attempts to dereference the m_Server to call cTCPLink::Close, but crashes since m_Server was already reset.

@sphinxc0re

This comment has been minimized.

Show comment
Hide comment
@sphinxc0re

sphinxc0re Dec 27, 2015

Contributor

@Dinnerbone could you give us a hint on this? 😄

Contributor

sphinxc0re commented Dec 27, 2015

@Dinnerbone could you give us a hint on this? 😄

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 28, 2015

Member

That wasn't nice, there's a problem in our code, no need to bother Mojang people about it.

We clearly have a race condition in the networking code.

Member

madmaxoft commented Dec 28, 2015

That wasn't nice, there's a problem in our code, no need to bother Mojang people about it.

We clearly have a race condition in the networking code.

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 28, 2015

Contributor

Looked with clang thread sanitizer. There is a data race on variable m_IsConnected BlockingSslClientSocket.cpp:341 (write) and BlockingSslClientSocket.cpp:255 (read). Not sure if that is the problem.

Contributor

Seadragon91 commented Dec 28, 2015

Looked with clang thread sanitizer. There is a data race on variable m_IsConnected BlockingSslClientSocket.cpp:341 (write) and BlockingSslClientSocket.cpp:255 (read). Not sure if that is the problem.

@madmaxoft madmaxoft self-assigned this Dec 28, 2015

@tigerw

This comment has been minimized.

Show comment
Hide comment
@tigerw

tigerw Dec 28, 2015

Member

Xoft, please try not to add too many queues and locks, I don't like queues and locks... :P

On 28 Dec 2015, at 08:27, "Lukas Pioch" notifications@github.com wrote:

Looked with clang thread sanitizer. There is a data race on variable m_IsConnected BlockingSslClientSocket.cpp:341 (write) and BlockingSslClientSocket.cpp:255 (read). Not sure if that is the problem.


Reply to this email directly or view it on GitHub.

Member

tigerw commented Dec 28, 2015

Xoft, please try not to add too many queues and locks, I don't like queues and locks... :P

On 28 Dec 2015, at 08:27, "Lukas Pioch" notifications@github.com wrote:

Looked with clang thread sanitizer. There is a data race on variable m_IsConnected BlockingSslClientSocket.cpp:341 (write) and BlockingSslClientSocket.cpp:255 (read). Not sure if that is the problem.


Reply to this email directly or view it on GitHub.

@sphinxc0re

This comment has been minimized.

Show comment
Hide comment
@sphinxc0re

sphinxc0re Dec 28, 2015

Contributor

I made a joke... just a joke

Contributor

sphinxc0re commented Dec 28, 2015

I made a joke... just a joke

@worktycho

This comment has been minimized.

Show comment
Hide comment
@worktycho

worktycho Dec 28, 2015

Member

Just make it an atomic<bool>?

Member

worktycho commented Dec 28, 2015

Just make it an atomic<bool>?

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 28, 2015

Contributor

Yes used atomic and fixed then another data race, but not the cause of this problem.

Contributor

Seadragon91 commented Dec 28, 2015

Yes used atomic and fixed then another data race, but not the cause of this problem.

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 28, 2015

Member

I can't really pinpoint this, because I'm not getting a crash at all, no matter what I do. Has anyone run into this under Windows, or is it Linux- / Mac- specific?

I've identified one possible cause of trouble, m_IsConnected being set to true two times. And I'll make it an atomic bool, that can't hurt.

Member

madmaxoft commented Dec 28, 2015

I can't really pinpoint this, because I'm not getting a crash at all, no matter what I do. Has anyone run into this under Windows, or is it Linux- / Mac- specific?

I've identified one possible cause of trouble, m_IsConnected being set to true two times. And I'll make it an atomic bool, that can't hurt.

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 28, 2015

Member

I've added a uuid command to the Debuggers plugin that queries the cMojangAPI for players' UUIDs. It contacts the Mojang servers each time an unknown name is queried, so using it on some non-existent username provides a simple way of testing the cBlockingSslClientSocket implementation. Can anyone verify that it indeed invokes the crash on their side?

Member

madmaxoft commented Dec 28, 2015

I've added a uuid command to the Debuggers plugin that queries the cMojangAPI for players' UUIDs. It contacts the Mojang servers each time an unknown name is queried, so using it on some non-existent username provides a simple way of testing the cBlockingSslClientSocket implementation. Can anyone verify that it indeed invokes the crash on their side?

@tigerw

This comment has been minimized.

Show comment
Hide comment
@tigerw

tigerw Dec 28, 2015

Member

You can still get a thread switch after resetting m_Server but before setting m_IsConnected though.

PS: I could only reproduce the crash with some manual thread delaying using cEvent and this_thread::sleep

Member

tigerw commented Dec 28, 2015

You can still get a thread switch after resetting m_Server but before setting m_IsConnected though.

PS: I could only reproduce the crash with some manual thread delaying using cEvent and this_thread::sleep

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 28, 2015

Member

That's why I swapped them in #2831. It should be okay if a thread sees m_IsConnected as false but m_Server still assigned.

Can you retry those tests that did break on the #2831 code?

I'm currently compiling on my Linux machine to see if I get any failures from the original code, then I'll test the new one, too.

Member

madmaxoft commented Dec 28, 2015

That's why I swapped them in #2831. It should be okay if a thread sees m_IsConnected as false but m_Server still assigned.

Can you retry those tests that did break on the #2831 code?

I'm currently compiling on my Linux machine to see if I get any failures from the original code, then I'll test the new one, too.

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 28, 2015

Contributor

I changed m_TicksSinceLastPacket in ClientHandle.h to use atomic. Currently I get the assertion error or a null access to m_Socket in BlockingSslClientSocket.cpp:242.

With the command uuid, I have no crashes / problems.

Contributor

Seadragon91 commented Dec 28, 2015

I changed m_TicksSinceLastPacket in ClientHandle.h to use atomic. Currently I get the assertion error or a null access to m_Socket in BlockingSslClientSocket.cpp:242.

With the command uuid, I have no crashes / problems.

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 28, 2015

Member

@Seadragon91 are you using the uuid command with a non-existent playername?

Member

madmaxoft commented Dec 28, 2015

@Seadragon91 are you using the uuid command with a non-existent playername?

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 28, 2015

Contributor

Tested the uuid command with existent and non-existent player names. No crashes under windows and linux. But under windows I get messages from this issue #2418. I don't get them under linux.

Contributor

Seadragon91 commented Dec 28, 2015

Tested the uuid command with existent and non-existent player names. No crashes under windows and linux. But under windows I get messages from this issue #2418. I don't get them under linux.

@psilohanf

This comment has been minimized.

Show comment
Hide comment
@psilohanf

psilohanf Dec 29, 2015

I am experiencing this every single time even with the newest revision of cuberite. My system is: Linux vs1191233.vserver.de 2.6.32-042stab111.12 #1 SMP Thu Sep 17 11:38:20 MSK 2015 x86_64 GNU/Linux (Debian GNU/Linux 8)
As soon as a user logs in, a segmentation fault happens...

psilohanf commented Dec 29, 2015

I am experiencing this every single time even with the newest revision of cuberite. My system is: Linux vs1191233.vserver.de 2.6.32-042stab111.12 #1 SMP Thu Sep 17 11:38:20 MSK 2015 x86_64 GNU/Linux (Debian GNU/Linux 8)
As soon as a user logs in, a segmentation fault happens...

@lstep

This comment has been minimized.

Show comment
Hide comment
@lstep

lstep Dec 29, 2015

On Ubuntu, just cloned the latest revision, and compiled with debug activated. On my first run, just launching it (no client connected), I got the error:

[d68e6fdd5b532615|14:48:03] Cuberite local build build id: Unknown
[d68e6fdd5b532615|14:48:03] from commit id: approx: 7a778e291139624ad0733103d57b6e816c5a9ccc built at: approx: 2015-12-28 11:08:46 +0100
[d68e6fdd5b532615|14:48:03] Creating new server instance...
[...]
[8a0bca1c3ee8bb7a|14:48:04] Assertion failed: m_IsConnected, file /tmp/Cuberite2/cuberite/src/PolarSSL++/BlockingSslClientSocket.cpp, line 219
[4c7b68d3907bc990|14:48:04] BioGenCache: 8 hits, 2 misses, saved 80.00 %
[4c7b68d3907bc990|14:48:04] BioGenCache: Avg cache chain length: 0.12
./Cuberite_debug(_Z15PrintStackTracev+0x23)[0xadb417]
./Cuberite_debug(_ZN24cBlockingSslClientSocket7ReceiveEPvm+0x46)[0xdcef7a]
./Cuberite_debug(_ZN10cMojangAPI13SecureRequestERKSsS1_RSs+0x168)[0xc150de]
./Cuberite_debug(_ZN10cMojangAPI18QueryUUIDToProfileERKSs+0x215)[0xc17935]
./Cuberite_debug(_ZN10cMojangAPI6UpdateEv+0x2e8)[0xc18594]
./Cuberite_debug(_ZN10cMojangAPI13cUpdateThread7ExecuteEv+0x1f)[0xc18bf5]
./Cuberite_debug(_ZN9cIsThread9DoExecuteEv+0x2f)[0xad10f1]
./Cuberite_debug(_ZNKSt7_Mem_fnIM9cIsThreadFvvEEclIIEvEEvPS0_DpOT_+0x65)[0xad265d]
./Cuberite_debug(_ZNSt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS1_EE9_M_invokeIILm0EEEEvSt12_Index_tupleIIXspT_EEE+0x43)[0xad25ad]
./Cuberite_debug(_ZNSt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS1_EEclEv+0x1b)[0xad24b5]
./Cuberite_debug(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS3_EEE6_M_runEv+0x1c)[0xad244e]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d390)[0x7f0073666390]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f00738ee182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0072df147d]
Cuberite_debug: /tmp/Cuberite2/cuberite/src/PolarSSL++/BlockingSslClientSocket.cpp:219: int cBlockingSslClientSocket::Receive(void*, size_t): Assertion `0' failed.

But now I can't generate it anymore.

lstep commented Dec 29, 2015

On Ubuntu, just cloned the latest revision, and compiled with debug activated. On my first run, just launching it (no client connected), I got the error:

[d68e6fdd5b532615|14:48:03] Cuberite local build build id: Unknown
[d68e6fdd5b532615|14:48:03] from commit id: approx: 7a778e291139624ad0733103d57b6e816c5a9ccc built at: approx: 2015-12-28 11:08:46 +0100
[d68e6fdd5b532615|14:48:03] Creating new server instance...
[...]
[8a0bca1c3ee8bb7a|14:48:04] Assertion failed: m_IsConnected, file /tmp/Cuberite2/cuberite/src/PolarSSL++/BlockingSslClientSocket.cpp, line 219
[4c7b68d3907bc990|14:48:04] BioGenCache: 8 hits, 2 misses, saved 80.00 %
[4c7b68d3907bc990|14:48:04] BioGenCache: Avg cache chain length: 0.12
./Cuberite_debug(_Z15PrintStackTracev+0x23)[0xadb417]
./Cuberite_debug(_ZN24cBlockingSslClientSocket7ReceiveEPvm+0x46)[0xdcef7a]
./Cuberite_debug(_ZN10cMojangAPI13SecureRequestERKSsS1_RSs+0x168)[0xc150de]
./Cuberite_debug(_ZN10cMojangAPI18QueryUUIDToProfileERKSs+0x215)[0xc17935]
./Cuberite_debug(_ZN10cMojangAPI6UpdateEv+0x2e8)[0xc18594]
./Cuberite_debug(_ZN10cMojangAPI13cUpdateThread7ExecuteEv+0x1f)[0xc18bf5]
./Cuberite_debug(_ZN9cIsThread9DoExecuteEv+0x2f)[0xad10f1]
./Cuberite_debug(_ZNKSt7_Mem_fnIM9cIsThreadFvvEEclIIEvEEvPS0_DpOT_+0x65)[0xad265d]
./Cuberite_debug(_ZNSt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS1_EE9_M_invokeIILm0EEEEvSt12_Index_tupleIIXspT_EEE+0x43)[0xad25ad]
./Cuberite_debug(_ZNSt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS1_EEclEv+0x1b)[0xad24b5]
./Cuberite_debug(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt7_Mem_fnIM9cIsThreadFvvEEPS3_EEE6_M_runEv+0x1c)[0xad244e]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d390)[0x7f0073666390]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f00738ee182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0072df147d]
Cuberite_debug: /tmp/Cuberite2/cuberite/src/PolarSSL++/BlockingSslClientSocket.cpp:219: int cBlockingSslClientSocket::Receive(void*, size_t): Assertion `0' failed.

But now I can't generate it anymore.

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 29, 2015

Member

@lstep how about the uuid command with a non-existent player name, does that produce the assert for you? (note: you need to load the Debuggers plugin to have that command)

Member

madmaxoft commented Dec 29, 2015

@lstep how about the uuid command with a non-existent player name, does that produce the assert for you? (note: you need to load the Debuggers plugin to have that command)

@lstep

This comment has been minimized.

Show comment
Hide comment
@lstep

lstep Dec 29, 2015

Nope, no assert generated (first one, lstep, is me, second one, nosuchuser1434234, doesn't exist)

uuid lstep
[770a5bdf600590d7|15:54:41] Executing console command: "uuid lstep"
[770a5bdf600590d7|15:54:41] Player lstep:
[770a5bdf600590d7|15:54:41]   - in the cache: ""
[770a5bdf600590d7|15:54:41]   - online: "6375d80ec44c43028c598ab0e6d7fb8f"
uuid nosuchuser1434234
[770a5bdf600590d7|15:54:51] Executing console command: "uuid nosuchuser1434234"
[770a5bdf600590d7|15:54:51] Player nosuchuser1434234:
[770a5bdf600590d7|15:54:51]   - in the cache: ""
[770a5bdf600590d7|15:54:52]   - online: ""

lstep commented Dec 29, 2015

Nope, no assert generated (first one, lstep, is me, second one, nosuchuser1434234, doesn't exist)

uuid lstep
[770a5bdf600590d7|15:54:41] Executing console command: "uuid lstep"
[770a5bdf600590d7|15:54:41] Player lstep:
[770a5bdf600590d7|15:54:41]   - in the cache: ""
[770a5bdf600590d7|15:54:41]   - online: "6375d80ec44c43028c598ab0e6d7fb8f"
uuid nosuchuser1434234
[770a5bdf600590d7|15:54:51] Executing console command: "uuid nosuchuser1434234"
[770a5bdf600590d7|15:54:51] Player nosuchuser1434234:
[770a5bdf600590d7|15:54:51]   - in the cache: ""
[770a5bdf600590d7|15:54:52]   - online: ""
@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 29, 2015

Member

I'm baffled. It uses the same underlying function, cMojangAPI::SecureRequest, yet it fails in one case and doesn't in the other.

Member

madmaxoft commented Dec 29, 2015

I'm baffled. It uses the same underlying function, cMojangAPI::SecureRequest, yet it fails in one case and doesn't in the other.

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 29, 2015

Contributor

Not sure if that is something.
https://gist.github.com/Seadragon91/c6f3c9aa28e0d9fb7e73

Ops: I have much more data races (cMojangAPI.cpp). Will create a gist, with the whole log file.

Contributor

Seadragon91 commented Dec 29, 2015

Not sure if that is something.
https://gist.github.com/Seadragon91/c6f3c9aa28e0d9fb7e73

Ops: I have much more data races (cMojangAPI.cpp). Will create a gist, with the whole log file.

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 29, 2015

Contributor

Here is a whole file created with the clang threadsanitizer.
https://gist.github.com/Seadragon91/df3edf98b9aabc0a8ba5

Contributor

Seadragon91 commented Dec 29, 2015

Here is a whole file created with the clang threadsanitizer.
https://gist.github.com/Seadragon91/df3edf98b9aabc0a8ba5

@worktycho

This comment has been minimized.

Show comment
Hide comment
@worktycho

worktycho Dec 29, 2015

Member

@Seadragon91 What c++ standard lib are you using, because that looks like a race condition on shared_ptr ref counts?

Member

worktycho commented Dec 29, 2015

@Seadragon91 What c++ standard lib are you using, because that looks like a race condition on shared_ptr ref counts?

@worktycho

This comment has been minimized.

Show comment
Hide comment
@worktycho

worktycho Dec 29, 2015

Member

Is there any possiblity those race conditions are related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57507?

Member

worktycho commented Dec 29, 2015

Is there any possiblity those race conditions are related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57507?

@Seadragon91

This comment has been minimized.

Show comment
Hide comment
@Seadragon91

Seadragon91 Dec 29, 2015

Contributor

Yes, I think this is a false positive. Will look to use a newer version of clang. Have a few problems with linking. Currently using clang 3.6.

Contributor

Seadragon91 commented Dec 29, 2015

Yes, I think this is a false positive. Will look to use a newer version of clang. Have a few problems with linking. Currently using clang 3.6.

@tigerw

This comment has been minimized.

Show comment
Hide comment
@tigerw

tigerw Dec 29, 2015

Member

The assertion in debug builds is due to our apparently mistaken assumption that, if the server has closed the connection, no more data is forthcoming; LibEvent obviously says otherwise.

The segmentation fault in release builds is due to Mojang closing the connection sooner, exploiting, ahem, a threading problem.

Member

tigerw commented Dec 29, 2015

The assertion in debug builds is due to our apparently mistaken assumption that, if the server has closed the connection, no more data is forthcoming; LibEvent obviously says otherwise.

The segmentation fault in release builds is due to Mojang closing the connection sooner, exploiting, ahem, a threading problem.

@madmaxoft madmaxoft changed the title from Mojang sabotaged us :P (cMojangAPI segmentation faults) to cMojangAPI segmentation faults and asserts Dec 29, 2015

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 29, 2015

Member

@lstep, are you compiling your own executable? Could you try getting branch BeforeLibEventUpdate (don't forget to git submodule update), build and test it? I'm trying to find the cause of the sudden rise of this issue - whether we could attribute it to a change in LibEvent.

Member

madmaxoft commented Dec 29, 2015

@lstep, are you compiling your own executable? Could you try getting branch BeforeLibEventUpdate (don't forget to git submodule update), build and test it? I'm trying to find the cause of the sudden rise of this issue - whether we could attribute it to a change in LibEvent.

@madmaxoft

This comment has been minimized.

Show comment
Hide comment
@madmaxoft

madmaxoft Dec 30, 2015

Member

#2831 seems to have fixed this.

Member

madmaxoft commented Dec 30, 2015

#2831 seems to have fixed this.

@madmaxoft madmaxoft closed this Dec 30, 2015

@SafwatHalaby

This comment has been minimized.

Show comment
Hide comment
@SafwatHalaby

SafwatHalaby Dec 31, 2015

Member

Can we safely close all the related issues?

Member

SafwatHalaby commented Dec 31, 2015

Can we safely close all the related issues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment