New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in parallel DOT runs (Origin: bugzilla #756241) #5883

Closed
doxygen opened this Issue Jul 2, 2018 · 0 comments

Comments

Projects
None yet
1 participant
@doxygen
Owner

doxygen commented Jul 2, 2018

status RESOLVED severity normal in component general for ---
Reported in version 1.8.9.1 on platform Other
Assigned to: Dimitri van Heesch

Original attachment names and IDs:

On 2015-10-08 12:33:12 +0000, Bernard B wrote:

Created attachment 312910
Hacky fix for the crash

I'm running doxygen on a 24-core machine, and doxygen spawns that many threads to run dot. There is a concurrency bug in the reference counting of the QCString class which is creating a sporadic use-after-free error which exhibits in different ways.

The problem goes away when setting DOT_NUM_THREADS = 1

In my dot config file, I have DOT_PATH = /tools/bin

Sometimes, doxygen will segfault, with a backtrace like:

0 0x000000397b489d4b in memcpy () from /lib64/libc.so.6

1 0x0000000000435113 in QCString::LSData::resize (d=0x26a16b0, size=55170228) at ../qtools/qcstring.h:417

2 0x00000000004357f2 in QCString::StringRep::resize (this=0x7f348722cd60, newlen=55170228) at ../qtools/qcstring.h:673

3 0x0000000000434d96 in QCString::fill (this=0x7f348722cd60, c=-76 '\264', len=3) at ../qtools/qcstring.h:245

4 0x0000000000434ee2 in QCString::operator+= (this=0x7f348722cd60, str=0xb666da "dot") at ../qtools/qcstring.h:312

5 0x0000000000435a9a in operator+ (s1=..., s2=0xb666da "dot") at ../qtools/qcstring.h:780

6 0x0000000000734c4d in DotRunner::run (this=0x34c9d70) at dot.cpp:813

7 0x0000000000736a19 in DotWorkerThread::run (this=0x34a3ed0) at dot.cpp:1193

8 0x0000000000842c9f in QThreadPrivate::start (arg=0x34a3ed0) at qthread_unix.cpp:87

9 0x000000397bc079d1 in start_thread () from /lib64/libpthread.so.0

10 0x000000397b4e8b5d in clone () from /lib64/libc.so.6

Sometimes, all instances of dot execution will fail with:

sh: /tools/bin/: is a directory
Problems running dot: exit code=126, command='/tools/bin/', arguments='...'

(notice the lack of "dot" being appended to the command above).

Sometimes glibc throws a malloc error.

Sometimes it all runs just fine.

Here is what I think the root cause is: The LSHeader::refCount variable in qcstring.h is not modified atomically. This affects the reference counting on the string returned by Config_getString("DOT_PATH") in DotRunner::run(). Because the threads all start at roughly the same time on different CPUs and the reference count is not incremented atomically, the value of refCount ends up being less than it should be. This leads to the string being prematurely freed by LSData::dispose(), which results in all of the various crashes above. It's a small race window but when you've got 24 cores contending for the same cache line the chances of hitting it increase significantly.

If I replace the increments and decrements of refCount with appropriate calls to the gcc builtin __sync_add_and_fetch() to make these accesses atomic (see terribly hacky attached patch), the bug seems to be resolved. Previously doxygen would crash 2/10 times, and with the attached patch it crashed 0/30 times.

Sorry I can't share a test case to reproduce it, but hopefully this information is sufficient to resolve the issue. The bug is quite sensitive to the input data to doxygen and happens much more reliably with certain sets of input sources files, but then disappears after minor changes.

I could readily reproduce the bug with doxygen 1.8.9.1, and this is the version I tested the fix on. However, on 1.8.10 and also on git master ffff695, I couldn't reproduce it. But I believe that's more likely because other changes have affected the timing around dot generation, and the bug is in fact still there.

On 2015-10-17 15:27:58 +0000, Dimitri van Heesch wrote:

Hi Bernard,

Thanks for the clear description and analysis. The QCString class has recently been reimplemented using reference counting, and this doesn't work well with multi-threading as you observed.

I've pushed the following fix:
SHA: f196c9f

This fix introduces a simple constant string class that is used during DotRunner::run().

Please check if this solves the problem.

On 2015-12-30 10:19:47 +0000, Dimitri van Heesch wrote:

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.11. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information
that you think can be relevant (preferably in the form of a self-contained example).

@doxygen doxygen closed this Jul 2, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment