Thread stuck in busy loop when RDMA is being used #191

Open
cnanakos opened this issue Jan 23, 2017 · 0 comments

Hi,
This is easily reproducible over RDMA. The server side uses a single xio thread, which accepts more than 100 connections. Eventually no further connections are accepted and the thread stops making progress. Backtraces from the stuck thread follow.

(gdb) t 121
[Switching to thread 121 (Thread 0x7f1da37fe700 (LWP 6324))]
#0  0x00007f1ebb083b7e in xio_nexus_release_cb (data=<optimized out>) at ../common/xio_nexus.c:1096
1096	../common/xio_nexus.c: No such file or directory.
(gdb) bt
#0  0x00007f1ebb083b7e in xio_nexus_release_cb (data=<optimized out>) at ../common/xio_nexus.c:1096
#1  0x00007f1ebb053eed in xio_ev_loop_exec_scheduled (loop=loop@entry=0x7f1d9400fa20) at xio/xio_ev_loop.c:368
#2  0x00007f1ebb053f83 in xio_ev_loop_run_helper (loop_hndl=0x7f1d9400fa20, timeout=timeout@entry=-1) at xio/xio_ev_loop.c:412
#3  0x00007f1ebb0542fa in xio_ev_loop_run (loop_hndl=<optimized out>) at xio/xio_ev_loop.c:514
#4  0x00007f1ebb0567b5 in xio_context_run_loop (ctx=0x7f1d940082f0, timeout_ms=timeout_ms@entry=-1) at xio/xio_context.c:504

On another node with the same problem:

(gdb) t 131
[Switching to thread 131 (Thread 0x7fe15bfff700 (LWP 10981))]
#0  0x00007fe374969e89 in INIT_LIST_HEAD (list=<optimized out>) at ./linux/list.h:59
59	./linux/list.h: No such file or directory.
(gdb) bt
#0  0x00007fe374969e89 in INIT_LIST_HEAD (list=<optimized out>) at ./linux/list.h:59
#1  list_del_init (entry=<optimized out>) at ./linux/list.h:166
#2  xio_ev_loop_remove_event (evt=0x7fda44f4fb38) at xio/xio_ev_loop.c:332
#3  0x00007fe374969ee7 in xio_ev_loop_exec_scheduled (loop=loop@entry=0x7fe11c009810) at xio/xio_ev_loop.c:361
#4  0x00007fe374969f83 in xio_ev_loop_run_helper (loop_hndl=0x7fe11c009810, timeout=timeout@entry=-1) at xio/xio_ev_loop.c:412
#5  0x00007fe37496a2fa in xio_ev_loop_run (loop_hndl=<optimized out>) at xio/xio_ev_loop.c:514
#6  0x00007fe37496c7b5 in xio_context_run_loop (ctx=0x7fe11c0095f0, timeout_ms=timeout_ms@entry=-1) at xio/xio_context.c:504

More info included:

(gdb) print *ctx
$2 = {ev_loop = 0x7f408c00fa20, mempool = 0x7f408c0153e0, primary_tasks_pool = {0x7f408c1159c0, 0x0}, primary_pool_ops = {0x7f41a35337a0 <primary_tasks_pool_ops>, 0x0}, initial_tasks_pool = {0x7f408c10e8b0, 0x0}, initial_pool_ops = {0x7f41a3533800 <initial_tasks_pool_ops>, 0x0}, 
  msg_pool = 0x7f408c008790, poll_completions_ctx = 0x7f408c086f80, poll_completions_fn = 0x7f41a32fa620 <xio_rdma_poll_completions>, cpuid = 34, nodeid = 1, polling_timeout = 0, flags = 0, worker = 139915689793280, run_private = 0, is_running = 1, defered_destroy = 0, 
  prealloc_xio_inline_bufs = 0, register_internal_mempool = 0, resereved = 0, stats = {hertz = 3100000000, counter = {7768882, 7768882, 57314737262, 103080027066, 0, 599801601519266, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, name = {0x7f408c007f20 "TX_MSG", 0x7f408c0089d0 "RX_MSG", 
      0x7f408c008860 "TX_BYTES", 0x7f408c008880 "RX_BYTES", 0x7f408c0088a0 "DELAY", 0x7f408c0088c0 "APPDELAY", 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, user_context = 0x0, workqueue = 0x7f408c011a60, ctx_list = {next = 0x7f3ea8e17db8, prev = 0x7f3a7f74ff48}, 
  observable = {impl = 0x7f408c0082f0, observers_list = {next = 0x7f3b27e99fe8, prev = 0x7f408c0154d8}, observer_node = 0x0}, netlink_sock = 0x82, destroy_ctx_work = {function = 0x0, data = 0x0, destructor = 0x0, destructor_data = 0x0, flags = 0, pad = 0}, ctx_list_lock = 0, 
  max_conns_per_ctx = 100, rq_depth = 0, pad = 0}

$6 = {transport = 0x7f3a937b0e40, transport_hndl = 0x7f3b47075b80, primary_tasks_pool = 0x7f408c1159c0, initial_tasks_pool = 0x7f408c10e8b0, trans_observer = {impl = 0x0, notify = 0x0}, ctx_observer = {impl = 0x0, notify = 0x0}, srv_observer = {impl = 0x0, notify = 0x0}, 
  observable = {impl = 0x0, observers_list = {next = 0x7f3b44610a38, prev = 0x7f3b44610a38}, observer_node = 0x0}, kref = {refcount = {counter = 0}}, cid = 171047, state = XIO_NEXUS_STATE_DISCONNECTED, is_first_req = 0, reconnect_retries = 0, is_listener = 0, srq_enabled = 0, 
  close_time_hndl = {work = {function = 0x7f41a3312b20 <xio_nexus_release_cb>, data = 0x7f3b446109e0, destructor = 0x0, destructor_data = 0x0, flags = 0, pad = 0}, timer = {entry = {next = 0x7f3b44610a90, prev = 0x7f3b44610a90}, expires = 954759880295030}}, observers_htbl = {
    next = 0x7f3b44610aa8, prev = 0x7f3b44610aa8
  }, tx_queue = {next = 0x7f3b44610ab8, prev = 0x7f3b44610ab8}, server = 0x7f408c0156e0, server_cid = 0, server_cid_pad = 0, new_transport_hndl = 0x0, portal_uri = 0x0, out_if_addr = 0x0, trans_attr_mask = 0, trans_attr = {
    tos = 0 '\000', pad = "\000\000"
  }, destroy_event = {{ev_handler = 0x7f41a3312c70 <xio_nexus_destroy_handler>, handler = 0x7f41a3312c70 <xio_nexus_destroy_handler>}, {fd = 0, scheduled = 0}, reserved = 0, data = 0x7f3b446109e0, events_list_entry = {next = 0x7f3b44610b10, 
      prev = 0x7f3b44610b10}}, trans_error_event = {{ev_handler = 0x7f41a3313a30 <xio_nexus_trans_error_handler>, handler = 0x7f41a3313a30 <xio_nexus_trans_error_handler>}, {fd = 0, scheduled = 0}, reserved = 0, data = 0x0, events_list_entry = {next = 0x0, prev = 0x0}}, 
  nexus_obs_lock = 0, pad2 = 0, lock_connect = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}, nexus_htbl = {mpfld = {next = 0x7f3b44610b78, prev = 0x7f3b44610b78}, keycopy = {id = 171047, pad = "\000\000\000"}}}
(gdb) print *(struct xio_nexus*)0x7f16f962b930
$5 = {transport = 0x7f1704734490, transport_hndl = 0x7f1557257dd0, primary_tasks_pool = 0x7f1d941159c0, initial_tasks_pool = 0x7f1d9410e8b0, trans_observer = {impl = 0x0, notify = 0x0}, ctx_observer = {impl = 0x0, notify = 0x0}, srv_observer = {impl = 0x0, notify = 0x0}, 
  observable = {impl = 0x0, observers_list = {next = 0x7f16f962b988, prev = 0x7f16f962b988}, observer_node = 0x0}, kref = {refcount = {counter = 0}}, cid = 234147, state = XIO_NEXUS_STATE_DISCONNECTED, is_first_req = 0, reconnect_retries = 0, is_listener = 0, srq_enabled = 0, 
  close_time_hndl = {work = {function = 0x7f1ebb083b20 <xio_nexus_release_cb>, data = 0x7f16f962b930, destructor = 0x0, destructor_data = 0x0, flags = 0, pad = 0}, timer = {entry = {next = 0x7f16f962b9e0, prev = 0x7f16f962b9e0}, expires = 1293119591639180}}, observers_htbl = {
    next = 0x7f16f962b9f8, prev = 0x7f16f962b9f8
  }, tx_queue = {next = 0x7f16f962ba08, prev = 0x7f16f962ba08}, server = 0x7f1d940156e0, server_cid = 0, server_cid_pad = 0, new_transport_hndl = 0x0, portal_uri = 0x0, out_if_addr = 0x0, trans_attr_mask = 0, trans_attr = {
    tos = 0 '\000', pad = "\000\000"
  }, destroy_event = {{ev_handler = 0x7f1ebb083c70 <xio_nexus_destroy_handler>, handler = 0x7f1ebb083c70 <xio_nexus_destroy_handler>}, {fd = 0, scheduled = 0}, reserved = 0, data = 0x7f16f962b930, events_list_entry = {next = 0x7f16f962ba60, 
      prev = 0x7f16f962ba60}}, trans_error_event = {{ev_handler = 0x7f1ebb084a30 <xio_nexus_trans_error_handler>, handler = 0x7f1ebb084a30 <xio_nexus_trans_error_handler>}, {fd = 0, scheduled = 0}, reserved = 0, data = 0x0, events_list_entry = {next = 0x0, prev = 0x0}}, 
  nexus_obs_lock = 0, pad2 = 0, lock_connect = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}, nexus_htbl = {mpfld = {next = 0x7f16f962bac8, prev = 0x7f16f962bac8}, keycopy = {id = 234147, pad = "\000\000\000"}}}
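
A note for reading the dumps above: the second backtrace passes through list_del_init and INIT_LIST_HEAD. In Linux-style list helpers, list_del_init unlinks an entry and then resets it so that next and prev both point back at the entry itself. The sketch below is a minimal stand-in written for illustration, not code copied from Accelio; entry_is_unlinked and demo_schedule_then_remove are hypothetical helper names. If fields such as events_list_entry or tx_queue in the nexus dumps hold their own address in both next and prev (the dumps show next == prev but not the field addresses, so this is an assumption), they would be in this unlinked state.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for Linux-style list helpers (illustrative only). */
struct list_head {
    struct list_head *next, *prev;
};

/* An initialised, empty head (or unlinked entry) points at itself. */
static void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

/* Link entry just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink entry from its list, then reset it to the self-pointing state. */
static void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    INIT_LIST_HEAD(entry);
}

static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Hypothetical helper: true when both links point back at the entry. */
static int entry_is_unlinked(const struct list_head *entry)
{
    return entry->next == entry && entry->prev == entry;
}

/* Hypothetical demo: queue one event, remove it, and check both states.
 * Returns 1 when the list behaves as described, 0 otherwise. */
static int demo_schedule_then_remove(void)
{
    struct list_head events, evt;

    INIT_LIST_HEAD(&events);
    INIT_LIST_HEAD(&evt);

    list_add_tail(&evt, &events);
    if (list_empty(&events) || entry_is_unlinked(&evt))
        return 0;   /* a queued entry must be linked */

    list_del_init(&evt);
    return list_empty(&events) && entry_is_unlinked(&evt);
}
```

Calling demo_schedule_then_remove() from a small main() and checking that it returns 1 exercises both states. This only shows how to read the list fields in the dumps and makes no claim about the cause of the busy loop.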

Please let me know if you need more info. Thanks.
