MAJOR: segfault or memory corruption by LMDB on close/shutdown #48

erthink · 2015-09-06T15:41:30Z

The root of the problem is incorrect use of pthread_key_create() and pthread_key_delete() inside LMDB.

A reader slot is assigned to every thread that starts a read transaction.
To release the slot when the thread terminates, LMDB uses mdb_env_reader_dest(), which is registered as a destructor via pthread_key_create() and later bound to the thread by pthread_setspecific().
Internally, mdb_env_reader_dest() dereferences a pointer into the mmapped 'data.lock' file, which holds the readers table. When the database is closed, 'data.lock' is unmapped.

Therefore any thread that has previously read from this database and has not yet terminated will segfault inside mdb_env_reader_dest(), or corrupt memory, when it exits.
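
The mechanism described above can be sketched in isolation. This is a minimal illustration and not LMDB's code: `reader_slot`, `slots`, `release_slot`, `reader_thread` and `run_reader_slot_demo` are hypothetical names, and a static array stands in for the mmapped readers table. It only shows that the destructor registered with pthread_key_create() runs at thread exit and writes through the pointer stored by pthread_setspecific().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { NTHREADS = 4 };

/* Hypothetical stand-in for one entry of the readers table. */
struct reader_slot { int in_use; };

static struct reader_slot slots[NTHREADS]; /* stand-in for the mapped table */
static pthread_key_t slot_key;

/* Destructor: runs at thread exit for every thread whose value for
 * slot_key is non-NULL. It writes through ptr, so the memory behind
 * ptr must still be valid at that moment. */
static void release_slot(void *ptr)
{
    struct reader_slot *slot = ptr;
    slot->in_use = 0;
}

static void *reader_thread(void *arg)
{
    struct reader_slot *slot = arg;
    slot->in_use = 1;                      /* claim the slot */
    int rc = pthread_setspecific(slot_key, slot);
    assert(rc == 0);
    (void)rc;
    return NULL;                           /* destructor runs after this */
}

/* Returns the number of slots still marked in use after all threads exit. */
int run_reader_slot_demo(void)
{
    pthread_t tid[NTHREADS];
    int busy = 0;

    int rc = pthread_key_create(&slot_key, release_slot);
    assert(rc == 0);
    for (int i = 0; i < NTHREADS; i++) {
        rc = pthread_create(&tid[i], NULL, reader_thread, &slots[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++) {
        rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
    }
    (void)rc;
    for (int i = 0; i < NTHREADS; i++)
        busy += slots[i].in_use;

    pthread_key_delete(slot_key);
    return busy;
}
```

Built with `-pthread` and called from a `main()`, `run_reader_slot_demo()` should return 0, because every slot is released by the destructor before pthread_join() returns. If the memory behind the slots were unmapped before a thread exited, the same write would land in invalid or reused memory.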

hyc · 2015-09-06T17:31:05Z

Note that mdb_env_close() already calls pthread_key_delete(), which destroys any remaining usage of the thread-specific key. Do you have a test case that shows this crash? There is no such problem in recent versions of glibc.

erthink · 2015-09-06T17:41:46Z

pthread_key_delete() does not perform any cleanup in registered threads; for instance, see http://linux.die.net/man/3/pthread_key_delete

I have a coredump from Ubuntu 14.04 LTS (glibc 2.19):

Thread 1 (Thread 0x2b00f9001700 (LWP 4096)):
#0  mdb_env_reader_dest.146852 (ptr=0x2b00ead4d0c0) at ./../../../libraries/liblmdb/mdb.c:4407
#1  0x00002b00eaf4df82 in __nptl_deallocate_tsd () at pthread_create.c:158
#2  0x00002b00eaf4e195 in start_thread (arg=0x2b00f9001700) at pthread_create.c:325
#3  0x00002b00eb25e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

Thread 2 (Thread 0x2b00ead6a100 (LWP 4091)):
#0  0x00002b00eb25f55b in __libc_send (fd=3, buf=0xfec970, n=60, flags=-1, flags@entry=16384) at ../sysdeps/unix/sysv/linux/x86_64/send.c:31
#1  0x00002b00eb2583c1 in __GI___vsyslog_chk (pri=<optimized out>, flag=1, fmt=0x2b00ebebd71e "%s", ap=ap@entry=0x7fffcbf11058) at ../misc/syslog.c:279
#2  0x00002b00eb258682 in __syslog_chk (pri=<optimized out>, flag=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:129
#3  0x00002b00ebeb07b3 in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#4  0x00002b00ebeb22be in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#5  0x00002b00edc9e61a in ?? () from /usr/lib/x86_64-linux-gnu/sasl2/libdigestmd5.so
#6  0x00002b00ebeb70f4 in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#7  0x00002b00ebeb1cd7 in sasl_done () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#8  0x000000000048aff8 in slap_sasl_destroy () at sasl.c:1200
#9  slap_destroy () at init.c:253
#10 0x000000000040c0fd in main (argc=<optimized out>, argv=<optimized out>) at main.c:1053

No more threads.

hyc · 2015-09-06T18:05:37Z

You have some other bug in your environment then. Notice that pthread_key_delete() (http://www.eglibc.org/cgi-bin/viewvc.cgi/branches/eglibc-2_19/libc/nptl/pthread_key_delete.c?revision=25243&view=markup) invalidates the TSD by changing the sequence number on the key.

nptl_deallocate_tsd only calls the destructor if the sequence number is identical to its original value, which it cannot be after pthread_key_delete() was called.

http://www.eglibc.org/cgi-bin/viewvc.cgi/branches/eglibc-2_19/libc/nptl/pthread_create.c?revision=25243&view=markup
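
Both points made so far (pthread_key_delete() runs no destructors itself, and a deleted key's destructor is no longer called at thread exit) can be checked with a small test. This is a sketch under assumptions: all names (`demo_key`, `count_destructor`, `worker`, `destructor_calls_after_key_delete`) are hypothetical, and a mutex and condition variable force the key to be deleted while the worker thread is still alive and still holds a non-NULL value.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t demo_key;
static int destructor_calls;            /* incremented only by the destructor */

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int stage;                       /* 0: start, 1: value set, 2: key deleted */

static void count_destructor(void *ptr)
{
    (void)ptr;
    destructor_calls++;
}

static void set_stage(int s)
{
    pthread_mutex_lock(&mu);
    stage = s;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
}

static void wait_for_stage(int wanted)
{
    pthread_mutex_lock(&mu);
    while (stage < wanted)
        pthread_cond_wait(&cv, &mu);
    pthread_mutex_unlock(&mu);
}

static void *worker(void *arg)
{
    pthread_setspecific(demo_key, arg); /* non-NULL value for this thread */
    set_stage(1);
    wait_for_stage(2);                  /* exit only after the key is deleted */
    return NULL;
}

/* Returns how many times the destructor ran when the key was deleted
 * before the thread that used it exited. */
int destructor_calls_after_key_delete(void)
{
    static int dummy;                   /* any non-NULL pointer will do */
    pthread_t tid;

    destructor_calls = 0;
    stage = 0;

    int rc = pthread_key_create(&demo_key, count_destructor);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, worker, &dummy);
    assert(rc == 0);

    wait_for_stage(1);
    rc = pthread_key_delete(demo_key);  /* calls no destructor itself */
    assert(rc == 0);
    set_stage(2);

    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return destructor_calls;
}
```

With the key deleted strictly before the thread exits, the function should return 0. The test deliberately serializes the two events, so it says nothing about a thread that is already inside its destructor when the key is deleted.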

erthink · 2015-09-06T18:28:28Z

Yes, you are right.

This coredump is from slapd (branch 2.4), produced while running test058-syncrepl-asymmetric.

erthink · 2015-09-06T19:47:06Z

Hm, I got another coredump, this time from the 2.5 branch.

It is the same segfault on slapd shutdown at the end of test058-syncrepl-asymmetric.
Both cores come from CI buzz-testing (massive, parallel, ramfs, high loadavg).

erthink · 2015-09-06T19:54:39Z

I found the bug, and it is in the LMDB core after all. It is simply a race condition between:

munmap() from mdb_env_close()
mdb_env_reader_dest() from the thread cleanup code

hyc · 2015-09-06T20:20:11Z

This still implies some other problem, since slapd_daemon() waits for all threads to exit before returning, and backends aren't shut down until after that. I.e., there should not be any other live threads by the time mdb_env_close() is called.
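
The shutdown ordering described above (every thread has exited before the environment is closed) can be sketched as follows. This is an illustration under assumptions, not slapd or LMDB code: `table_key`, `clear_slot`, `table_reader` and `close_after_join` are hypothetical names, and an anonymous mapping stands in for the lock file's readers table.

```c
#define _DEFAULT_SOURCE                 /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

enum { NREADERS = 4 };

static pthread_key_t table_key;

/* Destructor: clears the slot. It writes into the mapping, so the
 * mapping must outlive every thread that can still run this. */
static void clear_slot(void *ptr)
{
    *(int *)ptr = 0;
}

static void *table_reader(void *slot)
{
    *(int *)slot = 1;                   /* claim the slot */
    pthread_setspecific(table_key, slot);
    return NULL;
}

/* Returns the number of slots still claimed just before unmapping,
 * or -1 if the mapping could not be created. */
int close_after_join(void)
{
    size_t len = NREADERS * sizeof(int);
    int *table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (table == MAP_FAILED)
        return -1;

    pthread_t tid[NREADERS];
    int busy = 0;

    int rc = pthread_key_create(&table_key, clear_slot);
    assert(rc == 0);
    for (int i = 0; i < NREADERS; i++) {
        rc = pthread_create(&tid[i], NULL, table_reader, &table[i]);
        assert(rc == 0);
    }

    /* Step 1: wait for every reader; their destructors have finished
     * by the time pthread_join() returns. */
    for (int i = 0; i < NREADERS; i++) {
        rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NREADERS; i++)
        busy += table[i];

    /* Step 2: no thread can use the key any more, so delete it. */
    rc = pthread_key_delete(table_key);
    assert(rc == 0);
    (void)rc;

    /* Step 3: only now is it safe to drop the mapping. */
    munmap(table, len);
    return busy;
}
```

In this ordering the function should return 0 and nothing touches the mapping after munmap(). The crash reported in this issue corresponds to step 3 running while a thread from step 1 is still inside its destructor.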

erthink · 2015-09-06T20:37:25Z

Hmm, as I understand it, that was the intent, but it fails in practice... To prevent crashes I have had to add "crutches", for example #45 and #15.

On the other hand, this is exactly a race bug in LMDB and it should be fixed.
