This repository has been archived by the owner on Sep 22, 2022. It is now read-only.

MAJOR: segfault or memory corruption by LMDB on close/shutdown #48

Closed
erthink opened this issue Sep 6, 2015 · 8 comments

@erthink
Owner

erthink commented Sep 6, 2015

The root of the problem is the incorrect use of pthread_key_create() and pthread_key_delete() inside LMDB.

  • A reader slot is assigned to every thread that starts a read transaction.
  • To release the slot when a thread terminates, LMDB uses mdb_env_reader_dest(), which is registered as the key's destructor via pthread_key_create() and later bound to the thread by pthread_setspecific().
  • Internally, mdb_env_reader_dest() uses a pointer into the mmapped 'data.lock' file, which holds the readers table. When the database is closed, 'data.lock' is unmapped.

Therefore, any thread that has previously read from this database and has not yet terminated will either segfault inside mdb_env_reader_dest() or corrupt memory when it exits.
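
For readers unfamiliar with the mechanism described above, here is a minimal sketch in C of a thread-specific-data destructor releasing a per-thread slot at thread exit. It is not LMDB code: `reader_slot`, `release_slot()`, `reader_thread()` and `demo_slot_release()` are hypothetical names, and a static array stands in for the mmapped readers table.

```c
#include <pthread.h>

/* Hypothetical stand-in for one entry of the readers table. */
struct reader_slot { int in_use; };

static struct reader_slot table[4];  /* stands in for the mmapped lock file */
static pthread_key_t slot_key;       /* plays the role of the per-environment key */
static int release_calls;            /* how many times the destructor ran */

/* Destructor: runs in the exiting thread with that thread's non-NULL value. */
static void release_slot(void *ptr)
{
    struct reader_slot *slot = ptr;
    slot->in_use = 0;   /* writes through the pointer: unsafe if the table is gone */
    release_calls++;
}

static void *reader_thread(void *arg)
{
    struct reader_slot *slot = arg;
    slot->in_use = 1;                     /* "start a read transaction" */
    pthread_setspecific(slot_key, slot);  /* bind the slot to this thread */
    return NULL;                          /* thread exit triggers release_slot() */
}

/* Returns the number of destructor calls, or -1 on setup failure. */
int demo_slot_release(void)
{
    pthread_t tid;

    release_calls = 0;
    if (pthread_key_create(&slot_key, release_slot) != 0)
        return -1;
    if (pthread_create(&tid, NULL, reader_thread, &table[0]) != 0) {
        pthread_key_delete(slot_key);
        return -1;
    }
    pthread_join(tid, NULL);              /* the destructor has finished by now */
    pthread_key_delete(slot_key);
    return release_calls;
}
```

Here the table outlives the thread, so the write in `release_slot()` is harmless. The concern in this issue is the case where the memory behind `ptr` has been unmapped before the destructor runs.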

@hyc
Contributor

hyc commented Sep 6, 2015

Note that mdb_env_close() already calls pthread_key_delete() which destroys any remaining usage of the thread-specific key. Do you have a test case which shows this crash? There is no such problem in recent versions of glibc.

@erthink
Owner Author

erthink commented Sep 6, 2015

pthread_key_delete() does not perform any cleanup in the registered threads; for instance, see http://linux.die.net/man/3/pthread_key_delete

I have a coredump from Ubuntu 14.04 LTS (glibc 2.19):

Thread 1 (Thread 0x2b00f9001700 (LWP 4096)):
#0  mdb_env_reader_dest.146852 (ptr=0x2b00ead4d0c0) at ./../../../libraries/liblmdb/mdb.c:4407
#1  0x00002b00eaf4df82 in __nptl_deallocate_tsd () at pthread_create.c:158
#2  0x00002b00eaf4e195 in start_thread (arg=0x2b00f9001700) at pthread_create.c:325
#3  0x00002b00eb25e47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

Thread 2 (Thread 0x2b00ead6a100 (LWP 4091)):
#0  0x00002b00eb25f55b in __libc_send (fd=3, buf=0xfec970, n=60, flags=-1, flags@entry=16384) at ../sysdeps/unix/sysv/linux/x86_64/send.c:31
#1  0x00002b00eb2583c1 in __GI___vsyslog_chk (pri=<optimized out>, flag=1, fmt=0x2b00ebebd71e "%s", ap=ap@entry=0x7fffcbf11058) at ../misc/syslog.c:279
#2  0x00002b00eb258682 in __syslog_chk (pri=<optimized out>, flag=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:129
#3  0x00002b00ebeb07b3 in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#4  0x00002b00ebeb22be in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#5  0x00002b00edc9e61a in ?? () from /usr/lib/x86_64-linux-gnu/sasl2/libdigestmd5.so
#6  0x00002b00ebeb70f4 in ?? () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#7  0x00002b00ebeb1cd7 in sasl_done () from /usr/lib/x86_64-linux-gnu/libsasl2.so.2
#8  0x000000000048aff8 in slap_sasl_destroy () at sasl.c:1200
#9  slap_destroy () at init.c:253
#10 0x000000000040c0fd in main (argc=<optimized out>, argv=<optimized out>) at main.c:1053

No more threads.

@hyc
Contributor

hyc commented Sep 6, 2015

You have some other bug in your environment then. Notice pthread_key_delete http://www.eglibc.org/cgi-bin/viewvc.cgi/branches/eglibc-2_19/libc/nptl/pthread_key_delete.c?revision=25243&view=markup invalidates the tsd by changing the sequence number on the key.

nptl_deallocate_tsd only calls the destructor if the sequence number is identical to its original value, which it cannot be after pthread_key_delete() was called.

http://www.eglibc.org/cgi-bin/viewvc.cgi/branches/eglibc-2_19/libc/nptl/pthread_create.c?revision=25243&view=markup
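
The two points made so far (pthread_key_delete() runs no destructor itself, and a deleted key's destructor is no longer called at thread exit) can be checked with a small C sketch. The names `count_destructor()`, `worker()` and `demo_key_delete()` are hypothetical, and the mutex/condition pair only forces the ordering "value set, key deleted, thread exits". It covers the non-racy case only, where the key is deleted well before the thread starts exiting.

```c
#include <pthread.h>

static pthread_key_t key;
static int destructor_calls;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int stage;  /* 0: start, 1: worker has set its value, 2: key deleted */

static void count_destructor(void *ptr)
{
    (void)ptr;
    destructor_calls++;
}

static void set_stage(int s)
{
    pthread_mutex_lock(&lock);
    stage = s;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

static void wait_stage(int s)
{
    pthread_mutex_lock(&lock);
    while (stage < s)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    pthread_setspecific(key, arg);  /* non-NULL value, so a destructor would fire */
    set_stage(1);
    wait_stage(2);                  /* stay alive until main has deleted the key */
    return NULL;
}

/* Returns the total number of destructor calls, or -1 on setup failure. */
int demo_key_delete(void)
{
    static int dummy;
    pthread_t tid;
    int during_delete;

    destructor_calls = 0;
    stage = 0;
    if (pthread_key_create(&key, count_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, &dummy) != 0) {
        pthread_key_delete(key);
        return -1;
    }
    wait_stage(1);
    pthread_key_delete(key);        /* does not invoke the destructor itself */
    during_delete = destructor_calls;
    set_stage(2);
    pthread_join(tid, NULL);        /* thread exit after deletion */
    return during_delete + destructor_calls;
}
```

On a POSIX-conforming implementation `demo_key_delete()` should return 0. The sketch says nothing about a thread that is already inside its exit path when the key is deleted.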

@erthink
Owner Author

erthink commented Sep 6, 2015

Yes, you are right.

This coredump is from slapd (branch 2.4) while running test058-syncrepl-asymmetric.

@erthink
Owner Author

erthink commented Sep 6, 2015

Hm, I got another coredump, this time from the 2.5 branch.

It is the same segfault on slapd shutdown at the end of test058-syncrepl-asymmetric.
Both cores come from CI buzz-testing (massive, parallel, ramfs, high loadavg).

@erthink
Owner Author

erthink commented Sep 6, 2015

I found the bug, and it is in the LMDB core after all. It is simply a race condition between:

  • munmap() from mdb_env_close()
  • mdb_env_reader_dest() from the thread cleanup code
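
One way to picture closing this window is to serialize the destructor and the unmap step behind a lock that outlives the table, and to have the destructor check that the table still exists before touching it. The C sketch below is an illustration under those assumptions, not the fix applied in LMDB: every name in it is hypothetical, `calloc()`/`free()` stand in for mmap()/munmap(), and the thread-specific value is a slot index plus one rather than a pointer into the table.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct reader_slot { int in_use; };

/* Process-lifetime lock: it stays valid after the table has been released. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t table_closed = PTHREAD_COND_INITIALIZER;
static struct reader_slot *table;   /* NULL once the table is "unmapped" */
static size_t table_len;
static int skipped_releases;        /* destructor calls that found no table */
static pthread_key_t slot_key;

/* Destructor: touches the table only while holding the lock and only if it exists. */
static void release_slot(void *ptr)
{
    size_t idx = (size_t)(uintptr_t)ptr - 1;  /* value is slot index + 1 */

    pthread_mutex_lock(&table_lock);
    if (table != NULL && idx < table_len)
        table[idx].in_use = 0;
    else
        skipped_releases++;
    pthread_mutex_unlock(&table_lock);
}

/* Stand-in for the unmap step of closing the environment. */
static void table_close(void)
{
    pthread_mutex_lock(&table_lock);
    free(table);
    table = NULL;
    table_len = 0;
    pthread_cond_broadcast(&table_closed);
    pthread_mutex_unlock(&table_lock);
}

static void *reader_thread(void *arg)
{
    pthread_setspecific(slot_key, arg);
    pthread_mutex_lock(&table_lock);
    while (table != NULL)             /* outlive the table on purpose */
        pthread_cond_wait(&table_closed, &table_lock);
    pthread_mutex_unlock(&table_lock);
    return NULL;                      /* destructor runs after the close */
}

/* Returns the number of skipped releases, or -1 on setup failure. */
int demo_guarded_close(void)
{
    pthread_t tid;

    skipped_releases = 0;
    table_len = 4;
    table = calloc(table_len, sizeof *table);
    if (table == NULL)
        return -1;
    if (pthread_key_create(&slot_key, release_slot) != 0) {
        table_close();
        return -1;
    }
    if (pthread_create(&tid, NULL, reader_thread, (void *)(uintptr_t)1) != 0) {
        pthread_key_delete(slot_key);
        table_close();
        return -1;
    }
    table_close();                    /* close while the reader is still alive */
    pthread_join(tid, NULL);
    pthread_key_delete(slot_key);
    return skipped_releases;
}
```

In this sketch the late destructor call becomes a no-op instead of a write into released memory, so `demo_guarded_close()` returns 1. A real readers table is shared between processes, so a process-local lock like this one only covers the in-process part of the problem.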

@hyc
Contributor

hyc commented Sep 6, 2015

This still implies some other problem, since slapd_daemon() waits for all threads to exit before returning, and backends aren't shut down until after that. I.e., there should not be any other live threads by the time mdb_env_close() is called.

@erthink
Owner Author

erthink commented Sep 6, 2015

Hmm, as I understand it, that was the intent, but it does not hold in practice... To prevent crashes I have had to add "crutches", for example #45 and #15.

On the other hand, this really is a race bug in LMDB and should be fixed.
