Worker thread recovery for the executor #57

Closed
vertexclique opened this issue Nov 6, 2019 · 4 comments
Labels
bug Something isn't working

Comments

@vertexclique
Member

vertexclique commented Nov 6, 2019

The executor needs to recover worker threads when one of its internal worker threads panics.

Currently, we continue operating with the remaining threads that are mapped onto the cores. The replacement for a panicked thread should be assigned the same core affinity that the panicked thread had.

This is very much an edge case, but handling it is needed for fault tolerance.
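
To make the request concrete, here is a minimal sketch of respawning a panicked worker under the same core id, using only the standard library. The names `worker_body`, `supervise`, and `core_id` are hypothetical and are not Bastion APIs. Actual pinning needs a platform-specific affinity call, which this sketch deliberately leaves out, so the core id is only carried along as a value.

```rust
use std::thread;

/// Hypothetical worker body: panics on the first attempt to simulate a faulty worker.
fn worker_body(core_id: usize, attempt: usize) -> usize {
    // A real executor would pin the thread to `core_id` here (platform-specific,
    // omitted in this sketch) and then enter its run loop.
    if attempt == 0 {
        panic!("worker on core {} crashed", core_id);
    }
    core_id
}

/// Runs a worker for `core_id` and respawns it with the same id if it panics.
/// Returns the number of restarts that were needed, or `None` if the restart
/// budget ran out.
fn supervise(core_id: usize, max_restarts: usize) -> Option<usize> {
    for attempt in 0..=max_restarts {
        let handle = thread::Builder::new()
            .name(format!("worker-{}", core_id))
            .spawn(move || worker_body(core_id, attempt))
            .expect("failed to spawn worker thread");

        // `join` returns `Err` when the thread panicked; that is the recovery signal.
        match handle.join() {
            Ok(id) => {
                // The replacement thread kept the core id of the one it replaced.
                assert_eq!(id, core_id);
                return Some(attempt);
            }
            Err(_) => continue,
        }
    }
    None
}

fn main() {
    let restarts = supervise(2, 3);
    println!("restarts needed: {:?}", restarts);
}
```

The supervisor here blocks on `join`, which keeps the sketch short. A pool that must keep serving work would more likely watch its workers from a separate thread.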

@vertexclique vertexclique added the bug Something isn't working label Nov 6, 2019
@vertexclique vertexclique added this to Needs triage in Project Board via automation Nov 6, 2019
@vertexclique vertexclique added this to the Bastion 0.3.0 milestone Nov 6, 2019
@r3v2d0g r3v2d0g removed this from the Bastion 0.3.0 milestone Nov 22, 2019
@Relrin
Member

Relrin commented Jan 28, 2020

Could you remind me where bastion_executor::pool::Pool::recover_async_thread is supposed to be called? Is it somewhere in the System or the Worker struct?

@vertexclique
Member Author

@Relrin Good news: we don't need this anymore. Because the recoverable handle triggers the panic machinery and the application logic replaces faulty tasks on the fly, the async threads keep running in an orderly fashion. That said, we can actually remove it. Here is the call stack of one of the threads:

    2754 Thread_6479620: bastion-async-thread
    + 2754 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff6a75258f]
    +   2754 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7fff6a755d36]
    +     2754 std::sys::unix::thread::Thread::new::thread_start::hfd083295efc29c61  (in restart_strategy) + 142  [0x10dba140e]  boxed.rs:1015
    +       2754 _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::he5acc65097eb53de  (in restart_strategy) + 62  [0x10db9834e]  boxed.rs:1015
    +         2754 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h20423020184ddedf  (in restart_strategy) + 21  [0x10db5a135]  function.rs:232
    +           2754 std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::h8c5dac9f7c1d33fc  (in restart_strategy) + 318  [0x10db42b3e]  mod.rs:474
    +             2754 std::panic::catch_unwind::h481c2521d61b2ab6  (in restart_strategy) + 49  [0x10db3e501]  panic.rs:394
    +               2754 std::panicking::try::h4e12bfac1217faee  (in restart_strategy) + 231  [0x10db3a437]  panicking.rs:281
    +                 2754 __rust_maybe_catch_panic  (in restart_strategy) + 27  [0x10dba1dcb]  lib.rs:86
    +                   2754 std::panicking::try::do_call::ha5d713747227e745  (in restart_strategy) + 87  [0x10db3a617]  panicking.rs:305
    +                     2754 _$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h0dfe8278fa28380e  (in restart_strategy) + 49  [0x10db3e471]  panic.rs:318
    +                       2754 std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h26d00a1a159d2279  (in restart_strategy) + 49  [0x10db42f41]  mod.rs:475
    +                         2754 std::sys_common::backtrace::__rust_begin_short_backtrace::h327bd5bae056ace5  (in restart_strategy) + 49  [0x10db53491]  backtrace.rs:129
    +                           2754 bastion_executor::distributor::Distributor::assign::_$u7b$$u7b$closure$u7d$$u7d$::h3f90241aad896b9d  (in restart_strategy) + 106  [0x10db3cb8a]  distributor.rs:39
    +                             2754 bastion_executor::worker::main_loop::h05cd6647147c3b05  (in restart_strategy) + 245  [0x10db44695]  worker.rs:174
    +                               2754 bastion_executor::sleepers::Sleepers::wait::hb527910a2641d30c  (in restart_strategy) + 256  [0x10db4f0b0]  sleepers.rs:40
    +                                 2754 std::sync::condvar::Condvar::wait::hfda9899d18c08e1e  (in restart_strategy) + 109  [0x10db8e74d]  condvar.rs:200
    +                                   2754 std::sys_common::condvar::Condvar::wait::hdc16a8abbb3f400f  (in restart_strategy) + 50  [0x10db8c392]  condvar.rs:50
    +                                     2754 std::sys::unix::condvar::Condvar::wait::he5d623c1ed84ea79  (in restart_strategy) + 58  [0x10db8e6ca]  condvar.rs:73
    +                                       2754 _pthread_cond_wait  (in libsystem_pthread.dylib) + 701  [0x7fff6a756040]
    +                                         2754 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff6a695916]
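
The reasoning above (a panic is handled at the task level, so the worker thread itself never dies) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `Task` and `run_on_worker` are hypothetical names, and `std::panic::catch_unwind` stands in for the recoverable handle, whose actual implementation is not shown in this thread.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

/// Hypothetical task type: a boxed closure that produces a number.
type Task = Box<dyn FnOnce() -> u32 + Send + 'static>;

/// Runs every task on a single worker thread. A panicking task is caught and
/// reported as `None`, and the worker moves on to the next task.
fn run_on_worker(tasks: Vec<Task>) -> Vec<Option<u32>> {
    let (task_tx, task_rx) = mpsc::channel::<Task>();
    let (result_tx, result_rx) = mpsc::channel::<Option<u32>>();

    let worker = thread::spawn(move || {
        for task in task_rx {
            // The panic is contained at the task boundary, so it never
            // unwinds through the worker's own loop.
            let outcome = panic::catch_unwind(AssertUnwindSafe(task)).ok();
            result_tx.send(outcome).expect("result receiver dropped");
        }
    });

    for task in tasks {
        task_tx.send(task).expect("worker thread exited early");
    }
    drop(task_tx); // closing the channel lets the worker loop end

    let results: Vec<Option<u32>> = result_rx.iter().collect();
    worker.join().expect("worker thread itself panicked");
    results
}

fn main() {
    let tasks: Vec<Task> = vec![
        Box::new(|| 1u32),
        Box::new(|| -> u32 { panic!("faulty task") }),
        Box::new(|| 3u32),
    ];
    // The second task panics, yet the same worker still runs the third one.
    println!("{:?}", run_on_worker(tasks)); // prints [Some(1), None, Some(3)]
}
```

Because the worker thread survives every task failure, there is nothing left for a thread-level recovery hook to do, which matches the conclusion that the recovery call can be removed.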

@Relrin
Member

Relrin commented Jan 29, 2020

So the bastion_executor::pool::Pool::recover_async_thread method should be dropped as well, right?

@vertexclique
Member Author

yes :)

Project Board automation moved this from Needs triage to Closed Jan 30, 2020
3 participants