Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No response after authenticated allocation request - Server broken until restart #894

Closed
joushx opened this issue Mar 11, 2022 · 9 comments
Labels

Comments

@joushx
Copy link

joushx commented Mar 11, 2022

Problem

Our TURN-Server stops working randomly. After the initial request without credentials and the 401 response, the authenticated request does not get an answer. This happens for all sessions until the process is restarted. This happens on multiple servers.

Bildschirmfoto vom 2022-03-07 08-25-53

The logs only show when the clients finally give up and close the socket.

Steps to reproduce

  1. Have coTURN with shared secret auth running
  2. Let it run for some time (can be minutes or days between occurrences)
  3. Try to allocate a port
@korayvt
Copy link
Contributor

korayvt commented Mar 21, 2022

I have the same issue.

@korayvt
Copy link
Contributor

korayvt commented Mar 23, 2022

I realised that authentication related threads deadlock while getting sqlite connection as the following.

Installed sqlite version is pretty old in my environment: sqlite-3.7.17-8.el7_7.1.x86_64

#0  0x00007fa81c2e854d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fa81c2e3e9b in _L_lock_883 ()  from /lib64/libpthread.so.0
#2  0x00007fa81c2e3d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fa81e0ded1d in sqlite3_initialize () from /lib64/libsqlite3.so.0
#4  0x00007fa81e14adf9 in openDatabase () from /lib64/libsqlite3.so.0
#5  0x0000000000431d55 in get_sqlite_connection () at src/apps/relay/dbdrivers/dbd_sqlite.c:185
#6  0x0000000000431e45 in sqlite_get_auth_secrets (sl=0x7fa6d67fb6e0, realm=0x7fa6d67fba25 "myturn.com") at src/apps/relay/dbdrivers/dbd_sqlite.c:213
#7  0x000000000042c042 in get_auth_secrets (sl=0x7fa6d67fb6e0, realm=0x7fa6d67fba25 "myturn.com") at src/apps/relay/userdb.c:318
#8  0x000000000042c7a8 in get_user_key (in_oauth=0, out_oauth=0x7fa6d67fb81c, max_session_time=0x7fa6d67fb820, usname=0x7fa6d67fb824 "1647932598:123456789", realm=0x7fa6d67fba25 "myturn.com",
    key=0x7fa6d67fb7d0 "", nbh=0x7fa7e8038d10) at src/apps/relay/userdb.c:544
#9  0x0000000000418894 in auth_server_receive_message (bev=0x7fa6c0002010, ptr=0x68d0a0 <authserver+864>) at src/apps/relay/netengine.c:463
#10 0x00007fa81cd7173e in bufferevent_run_deferred_callbacks_unlocked () from /lib64/libevent-2.0.so.5
#11 0x00007fa81cd67b05 in event_base_loop () from /lib64/libevent-2.0.so.5
#12 0x000000000041c230 in run_events (eb=0x7fa6c00008f0, e=0x0) at src/apps/relay/netengine.c:1578
#13 0x000000000041caaa in run_auth_server_thread (arg=0x68d0a0 <authserver+864>) at src/apps/relay/netengine.c:1792
#14 0x00007fa81c2e1ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fa81c00ab0d in clone () from /lib64/libc.so.6

@yeya
Copy link

yeya commented Mar 29, 2022

@korayvt You are referencing 15 different issues that unrelated to this one.
Please edit your last comment and set marks around your logs so it will displayed like code.

@korayvt
Copy link
Contributor

korayvt commented Mar 29, 2022

@korayvt You are referencing 15 different issues that unrelated to this one. Please edit your last comment and set marks around your logs so it will displayed like code.

Done

@ggarber ggarber added the bug label Aug 2, 2022
@ggarber
Copy link
Contributor

ggarber commented Aug 2, 2022

Problem

Our TURN-Server stops working randomly. After the initial request without credentials and the 401 response, the authenticated request does not get an answer. This happens for all sessions until the process is restarted. This happens on multiple servers.

Bildschirmfoto vom 2022-03-07 08-25-53

The logs only show when the clients finally give up and close the socket.

Steps to reproduce

  1. Have coTURN with shared secret auth running
  2. Let it run for some time (can be minutes or days between occurrences)
  3. Try to allocate a port

Do you see something in the server logs?
When it fails is the STUN part (no TURN allocations) still working?

@eakraly
Copy link
Collaborator

eakraly commented Jan 22, 2023

Hi @joushx !
Did you find a solution? Was your issue the same as for @korayvt (old sqlite)?

@korayvt
Copy link
Contributor

korayvt commented Jan 23, 2023

After seeing sqlite lock in the core dump, I compiled coturn excluding sqlite and I haven't had this problem again.

@eakraly
Copy link
Collaborator

eakraly commented Jan 29, 2023

Closing it. Pls open a new issue if you see the problem again. TY!

@eakraly eakraly closed this as completed Jan 29, 2023
@jonesmz
Copy link
Contributor

jonesmz commented Jan 22, 2024

@eakraly Can you please reopen this?

I was able to reproduce this problem with coturn tag 4.5.1.3. It impacted one of my production deployments in December, and I've been researching the underlying issue ever since. Only just able to "catch it in the wild" today.

The issue appears to be some kind of mutex synchronization issue where two threads are both waiting on a mutex.

Not explicitly seen since i had to redact some of the stack variable contents for confidentiality reasons, but the username for the auth of the two stuck threads is the same. This may be relevant, but i can't say for certain.

In a packet capture, the behavior was that the turnserver would respond to an ALLOCATE with the standard error+nonce, but then would ignore ALLOCATE+nonce packets. Each ALLOCATE+nonce appears to be associated with an increase in the amount of memory consumed by the process (appears to be 896 bytes every time, but i only measured this a couple times, so it's not necessarily always this size). I haven't studied how the auth code works yet, but a wild speculation might be that it's queueing the auth processing, waiting on currently in-process sqlite queries to finish before starting anew.

We've tryed to reproduce this with only one success. So far we've launched approximately 1,000 new instances, and only one has had the problem. It appears to only occur on startup, and then impact the process for the duration of execution.

No log statements are made when this bug is triggered.

Moving forward: I am going to compile the turnserver with all of the DB engines disabled, since my organization does not use them.

The version of sqlite3 currently compiled in is 3.7.17, which is the version vended with Amazon Linux 2. It may be that this problem is a bug in sqlite3 that has subsequently been fixed. It's hard to say, and I'm not going to dig further since we're removing all of the compile-time optional functionality entirely.

./backtrace_pid.sh 5365 /opt/turnserver/bin/turnserver

Reading symbols from /opt/turnserver/bin/turnserver...done.
Attaching to program: /opt/turnserver/bin/turnserver, process 5365
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6

Thread 9 (Thread 0x7fb324dfa700 (LWP 5463)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x7fb304000b50, e=0x0) at src/apps/relay/netengine.c:1579
#4  0x000000000042fabf in run_admin_server_thread (arg=0x712f60 <adminserver>) at src/apps/relay/netengine.c:1813
#5  0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fb3257fb700 (LWP 5462)):
#0  0x00007fb3319783ad in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fb331971cfd in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fb333fc5ae4 in sqlite3Malloc () from /lib64/libsqlite3.so.0
#3  0x00007fb333fc5b18 in sqlite3MallocZero () from /lib64/libsqlite3.so.0
#4  0x00007fb333fcf198 in pthreadMutexAlloc () from /lib64/libsqlite3.so.0
#5  0x00007fb333fd2fc9 in sqlite3_initialize () from /lib64/libsqlite3.so.0
#6  0x00007fb33403cbd9 in openDatabase () from /lib64/libsqlite3.so.0
#7  0x0000000000445921 in get_sqlite_connection () at src/apps/relay/dbdrivers/dbd_sqlite.c:185
#8  0x0000000000445a11 in sqlite_get_auth_secrets (redacted) at src/apps/relay/dbdrivers/dbd_sqlite.c:213
#9  0x000000000043f252 in get_auth_secrets (redacted) at src/apps/relay/userdb.c:318
#10 0x000000000043f9b7 in get_user_key (redacted) at src/apps/relay/userdb.c:544
#11 0x000000000042b986 in auth_server_receive_message (bev=0x7fb310001470, ptr=0x6edcc0 <authserver+96>) at src/apps/relay/netengine.c:464
#12 0x00007fb3330a11be in bufferevent_run_deferred_callbacks_unlocked () from /lib64/libevent_core-2.0.so.5
#13 0x00007fb333097493 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#14 0x000000000042f20c in run_events (eb=0x7fb310000b50, e=0x0) at src/apps/relay/netengine.c:1579
#15 0x000000000042fa08 in run_auth_server_thread (arg=0x6edcc0 <authserver+96>) at src/apps/relay/netengine.c:1786
#16 0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#17 0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fb3261fc700 (LWP 5458)):
#0  0x00007fb3319783ad in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fb331971cfd in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fb333fc5ae4 in sqlite3Malloc () from /lib64/libsqlite3.so.0
#3  0x00007fb333fc5c6d in sqlite3DbMallocRaw () from /lib64/libsqlite3.so.0
#4  0x00007fb333fd3e9f in sqlite3StrAccumFinish () from /lib64/libsqlite3.so.0
#5  0x00007fb333fe06d2 in sqlite3VMPrintf () from /lib64/libsqlite3.so.0
#6  0x00007fb333fe6fd4 in sqlite3MPrintf () from /lib64/libsqlite3.so.0
#7  0x00007fb33402b339 in sqlite3CreateIndex.isra.511 () from /lib64/libsqlite3.so.0
#8  0x00007fb33402bcd5 in sqlite3AddPrimaryKey () from /lib64/libsqlite3.so.0
#9  0x00007fb33400fbc4 in sqlite3Parser () from /lib64/libsqlite3.so.0
#10 0x00007fb33401462c in sqlite3RunParser () from /lib64/libsqlite3.so.0
#11 0x00007fb334014ba5 in sqlite3Prepare () from /lib64/libsqlite3.so.0
#12 0x00007fb334014fe7 in sqlite3LockAndPrepare () from /lib64/libsqlite3.so.0
#13 0x00007fb334015073 in sqlite3_prepare () from /lib64/libsqlite3.so.0
#14 0x00007fb33401513e in sqlite3InitCallback () from /lib64/libsqlite3.so.0
#15 0x00007fb3340155dd in sqlite3_exec () from /lib64/libsqlite3.so.0
#16 0x00007fb3340239d9 in sqlite3InitOne () from /lib64/libsqlite3.so.0
#17 0x00007fb334023be0 in sqlite3Init () from /lib64/libsqlite3.so.0
#18 0x00007fb334023cb0 in sqlite3ReadSchema () from /lib64/libsqlite3.so.0
#19 0x00007fb334024072 in sqlite3StartTable () from /lib64/libsqlite3.so.0
#20 0x00007fb33400f5c5 in sqlite3Parser () from /lib64/libsqlite3.so.0
#21 0x00007fb33401462c in sqlite3RunParser () from /lib64/libsqlite3.so.0
#22 0x00007fb334014ba5 in sqlite3Prepare () from /lib64/libsqlite3.so.0
#23 0x00007fb334014fe7 in sqlite3LockAndPrepare () from /lib64/libsqlite3.so.0
#24 0x00007fb334015073 in sqlite3_prepare () from /lib64/libsqlite3.so.0
#25 0x0000000000445897 in init_sqlite_database (sqliteconnection=0x7fb318002198) at src/apps/relay/dbdrivers/dbd_sqlite.c:169
#26 0x000000000044599a in get_sqlite_connection () at src/apps/relay/dbdrivers/dbd_sqlite.c:195
#27 0x0000000000445a11 in sqlite_get_auth_secrets (redacted) at src/apps/relay/dbdrivers/dbd_sqlite.c:213
#28 0x000000000043f252 in get_auth_secrets (redacted) at src/apps/relay/userdb.c:318
#29 0x000000000043f9b7 in get_user_key (redacted) at src/apps/relay/userdb.c:544
#30 0x000000000042b986 in auth_server_receive_message (bev=0x7fb318001470, ptr=0x6edc90 <authserver+48>) at src/apps/relay/netengine.c:464
#31 0x00007fb3330a11be in bufferevent_run_deferred_callbacks_unlocked () from /lib64/libevent_core-2.0.so.5
#32 0x00007fb333097493 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#33 0x000000000042f20c in run_events (eb=0x7fb318000b50, e=0x0) at src/apps/relay/netengine.c:1579
#34 0x000000000042fa08 in run_auth_server_thread (arg=0x6edc90 <authserver+48>) at src/apps/relay/netengine.c:1786
#35 0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#36 0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fb326bfd700 (LWP 5457)):
#0  0x00007fb3319783ad in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fb331971cfd in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fb333fc5ae4 in sqlite3Malloc () from /lib64/libsqlite3.so.0
#3  0x00007fb333fc5c6d in sqlite3DbMallocRaw () from /lib64/libsqlite3.so.0
#4  0x00007fb333fc5c98 in sqlite3DbMallocZero () from /lib64/libsqlite3.so.0
#5  0x00007fb334014a2c in sqlite3Prepare () from /lib64/libsqlite3.so.0
#6  0x00007fb334014fe7 in sqlite3LockAndPrepare () from /lib64/libsqlite3.so.0
#7  0x00007fb334015073 in sqlite3_prepare () from /lib64/libsqlite3.so.0
#8  0x0000000000447fd5 in sqlite_reread_realms (realms_list=0x6f21e0 <realms_list>) at src/apps/relay/dbdrivers/dbd_sqlite.c:1028
#9  0x0000000000441489 in reread_realms () at src/apps/relay/userdb.c:1334
#10 0x000000000042f8d1 in run_auth_server_thread (arg=0x6edc60 <authserver>) at src/apps/relay/netengine.c:1754
#11 0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb3275fe700 (LWP 5398)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x7fb314000b50, e=0x7fb32c57b200) at src/apps/relay/netengine.c:1579
#4  0x000000000042f693 in run_general_relay_thread (arg=0x7fb32c57b010) at src/apps/relay/netengine.c:1700
#5  0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb327fff700 (LWP 5397)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x7fb31c000b50, e=0x7fb32c77c200) at src/apps/relay/netengine.c:1579
#4  0x000000000042f693 in run_general_relay_thread (arg=0x7fb32c77c010) at src/apps/relay/netengine.c:1700
#5  0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb32d37d700 (LWP 5396)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x7fb320000b50, e=0x7fb32d37e200) at src/apps/relay/netengine.c:1579
#4  0x000000000042f693 in run_general_relay_thread (arg=0x7fb32d37e010) at src/apps/relay/netengine.c:1700
#5  0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fb32df7f700 (LWP 5393)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x7fb328000b50, e=0x7fb32df80200) at src/apps/relay/netengine.c:1579
#4  0x000000000042f693 in run_general_relay_thread (arg=0x7fb32df80010) at src/apps/relay/netengine.c:1700
#5  0x00007fb33196f44b in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb3316aa52f in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb33446a000 (LWP 5365)):
#0  0x00007fb3316aa84c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fb3330aad64 in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x00007fb333096dde in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000000000042f20c in run_events (eb=0x1118ce0, e=0x7fb32e1d9048) at src/apps/relay/netengine.c:1579
#4  0x000000000042f2a8 in run_listener_server (ls=0x6d22d0 <turn_params+9328>) at src/apps/relay/netengine.c:1599
#5  0x0000000000429590 in main (argc=0, argv=0x7ffff4343cb0) at src/apps/relay/mainrelay.c:2515
Detaching from program: /opt/turnserver/bin/turnserver, process 5365

Looking at the auth code here:

static int get_auth_secrets(secrets_list_t *sl, uint8_t *realm) {

I'm a bit confused why any database queries are performed at all, when in my situation, the secrets are provided as part of the config file.

I think an improvement here would be to either:

  1. skip querying the DB provider if the secrets are provided in the config
  2. convert the config file secrets to another "db provider" and require the config to explicitly rank the db providers in order of preference. My organization would have ranked the config file as top preference and would have preferred the other DB providers be skipped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants