New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate block production loop when shutting down witness plugin #1314 #1332

Merged
merged 4 commits into from Oct 8, 2018

Conversation

Projects
4 participants
@cogutvalera
Member

cogutvalera commented Sep 17, 2018

PR for "Terminate block production loop when shutting down witness plugin #1314"

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Sep 17, 2018

Travis-CI build failed perhaps related to issue #1303

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Sep 17, 2018

I've restarted Travis-CI build for this PR, just to see if it will fail again or not

@abitmore

This comment has been minimized.

Member

abitmore commented Sep 17, 2018

Restarted again.

} catch(fc::exception& e) {
edump((e.to_detail_string()));
}
plugin_shutdown();

This comment has been minimized.

@abitmore

abitmore Sep 17, 2018

Member

This means plugin_shutdown() will be called twice when shutting down. I don't think it's correct.

This comment has been minimized.

@cogutvalera

cogutvalera Sep 17, 2018

Member

Thanks ! Absolutely agree ! Sorry, will fix now by new method.

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Sep 17, 2018

Thanks ! Now All checks have passed ...

@pmconrad

This comment has been minimized.

pmconrad commented Sep 19, 2018

Thread handling in fc is still a mystery to me.

That said, I think there's a race condition in your code. _block_production_task recreates itself by calling schedule_production_loop() from inside block_production_loop(). That means if block_production_loop is being executed while you call cancel, the cancel will only affect the currently running loop but will not prevent it from scheduling the next one. I may be wrong though.

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Sep 20, 2018

Thread handling in fc is still a mystery to me.

That said, I think there's a race condition in your code. _block_production_task recreates itself by calling schedule_production_loop() from inside block_production_loop(). That means if block_production_loop is being executed while you call cancel, the cancel will only affect the currently running loop but will not prevent it from scheduling the next one. I may be wrong though.

It will wait current block production execution and after that will cancel, so no race condition must be here.

@pmconrad

This comment has been minimized.

pmconrad commented Sep 21, 2018

cancel_and_wait sends the cancel signal and waits for the task to be canceled.
Where does it wait for the currently running task to finish before sending the cancel signal?

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Sep 21, 2018

cancel_and_wait sends the cancel signal and waits for the task to be canceled.
Where does it wait for the currently running task to finish before sending the cancel signal?

  1. Yes, right, it will wait currently running task after sending cancel signal.
  2. It won't schedule the next one loop, why it will ? So which another currently running task do you mean if cancel_and_wait will wait it ?
@pmconrad

This comment has been minimized.

pmconrad commented Sep 24, 2018

cancel_and_wait itself doesn't schedule the next loop.
The currently running task is executing block_production_loop(), and will call schedule_production_loop() as its final step. https://github.com/bitshares/bitshares-core/blob/master/libraries/plugins/witness/witness.cpp#L207

@pmconrad

This comment has been minimized.

pmconrad commented Sep 27, 2018

That change makes the window for the race condition smaller, but doesn't eliminate it.

@pmconrad

This comment has been minimized.

pmconrad commented Sep 28, 2018

It just doesn't work like that. You must either use locking or an atomic check-and-set of some kind.

@jmjatlanta

I believe the code as written will greatly reduce the chance of a production loop executing after shutdown. I have not walked through all scenarios, but I think this covers many of the bases. The comments below are just some opinions.

IMO: locking / check and set are overkill here. A volatile boolean would work if in this plugin we do not want to support a startup, shutdown, and then another startup. I'm not saying atomic_flag doesn't do the job, it is just some unneeded overhead.

I am not approving this PR, as I believe @pmconrad may be thinking of a scenario that I am not. So I will defer to him.

@@ -94,6 +117,7 @@ class witness_plugin : public graphene::app::plugin {
/// For tracking signing keys of specified witnesses, only update when applied a block
fc::flat_map< chain::witness_id_type, fc::optional<chain::public_key_type> > _witness_key_cache;
std::atomic_flag _lock_production_loop = ATOMIC_FLAG_INIT;

This comment has been minimized.

@jmjatlanta

jmjatlanta Oct 1, 2018

IMHO _production_loop_shutting_down is more descriptive

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ok, sure ! Thank you !

if (_block_production_task.valid() && _block_production_task.canceled()) return true;
_lock_production_loop.clear();

This comment has been minimized.

@jmjatlanta

jmjatlanta Oct 1, 2018

Why reset the flag?

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

to end atomic operation after test_and_set

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ah, sure we need also to reset the flag inside if statement also

Thanks !

@@ -74,7 +71,37 @@ class witness_plugin : public graphene::app::plugin {
inline const fc::flat_map< chain::witness_id_type, fc::optional<chain::public_key_type> >& get_witness_key_cache()
{ return _witness_key_cache; }
bool block_production_task_canceled()
{
while (_production_loop_shutting_down.test_and_set()) {}

This comment has been minimized.

@pmconrad

pmconrad Oct 1, 2018

Busy wait is evil, please use a mutex instead.

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ok sure ! Thanks !

private:
void stop_production_loop()
{
while (_production_loop_shutting_down.test_and_set()) {}

This comment has been minimized.

@pmconrad

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ok sure ! Thanks !

@@ -163,6 +164,9 @@ void witness_plugin::refresh_witness_key_cache()
void witness_plugin::schedule_production_loop()
{
// check to prevent possible race condition case
if (block_production_task_canceled()) return;

This comment has been minimized.

@pmconrad

pmconrad Oct 1, 2018

This is superfluous if schedule() and stop() are properly synchronized.

@@ -178,6 +182,9 @@ void witness_plugin::schedule_production_loop()
block_production_condition::block_production_condition_enum witness_plugin::block_production_loop()
{
// check to prevent possible race condition case
if (block_production_task_canceled()) return block_production_condition::block_production_condition_enum::canceled;

This comment has been minimized.

@pmconrad
@@ -74,7 +71,37 @@ class witness_plugin : public graphene::app::plugin {
inline const fc::flat_map< chain::witness_id_type, fc::optional<chain::public_key_type> >& get_witness_key_cache()
{ return _witness_key_cache; }
bool block_production_task_canceled()

This comment has been minimized.

@pmconrad

pmconrad Oct 1, 2018

Please move non-trivial implementations into the .cpp file.

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ok sure ! Thanks !

private:
void stop_production_loop()

This comment has been minimized.

@pmconrad

pmconrad Oct 1, 2018

Please move non-trivial implementations into the .cpp file.

This comment has been minimized.

@cogutvalera

cogutvalera Oct 1, 2018

Member

ok sure ! Thanks !

@pmconrad

This comment has been minimized.

pmconrad commented Oct 1, 2018

I believe there is still a race condition.
The general idea would be to use a mutex to protect access to _block_production_task in both schedule_production_loop and stop_production_loop.

@pmconrad

This comment has been minimized.

pmconrad commented Oct 1, 2018

@jmjatlanta you are right in that the implementation so far eliminates most of the risk. Nevertheless I think we should do it properly. Overhead should not be a problem since it affects only actual witnesses, and only once per second.

@abitmore abitmore added this to In progress in Feature release (201810) via automation Oct 1, 2018

@abitmore abitmore added this to the 201810 - Feature Release milestone Oct 1, 2018

if( _block_production_task.valid() )
{
fc::scoped_lock<boost::mutex> lock( _production_loop_mutex );
_block_production_task.cancel_and_wait(__FUNCTION__);

This comment has been minimized.

@pmconrad

pmconrad Oct 2, 2018

This will result in a deadlock.
The lock is held until cancel_and_wait returns, but schedule_production_loop blocks on the same lock, so it will never complete and cancel_and_wait will never return.

@@ -94,6 +92,7 @@ class witness_plugin : public graphene::app::plugin {
/// For tracking signing keys of specified witnesses, only update when applied a block
fc::flat_map< chain::witness_id_type, fc::optional<chain::public_key_type> > _witness_key_cache;
boost::mutex _production_loop_mutex;

This comment has been minimized.

@pmconrad

pmconrad Oct 2, 2018

Please use fc::mutex instead.

This comment has been minimized.

@cogutvalera

cogutvalera Oct 2, 2018

Member

ok sure. Thanks !

@abitmore

This comment has been minimized.

Member

abitmore commented Oct 2, 2018

Actually I didn't get why we're making it so complicated. Why need another stop_loop call to control the thread at all?

The production loop function is NOT a big task thus won't take much time to execute. So we can simply wait for it to exit if it is still running when trying to shutting it down. So my solution would be:

  • add a member variable e.g. bool shutting_down = false to the class
  • when shutting down, change it to true
  • when entering the production loop, check the variable, if it is true, don't schedule a new task.

Thoughts?

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 2, 2018

we need to use mutex because there maybe different threads if I understood correctly production loop

@abitmore

This comment has been minimized.

Member

abitmore commented Oct 2, 2018

IMHO we don't care whether the loop runs one more time. We don't need a mutex for the new variable because there is only one thread that writes to it and another thread only reads it. (I'm talking about my solution mentioned above, not your code)

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 2, 2018

perhaps production loop maybe multi-threaded so there maybe N-threads that write and N-threads that read, or am I wrong ?

@pmconrad

This comment has been minimized.

pmconrad commented Oct 3, 2018

There is at most one thread executing production_loop at any time, and there should be at most one thread calling stop_production_loop.
@abitmore 's solution should also work.

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 3, 2018

Thank you @abitmore @pmconrad will do it (I mean @abitmore 's solution)

@pmconrad

This comment has been minimized.

pmconrad commented Oct 3, 2018

IMO cancel_and_wait is still needed to prevent crash during shutdown.

@abitmore

This comment has been minimized.

Member

abitmore commented Oct 3, 2018

For my solution, better add a check for shutting_down in block_production_loop() as well.

(BTW actually I don't know how cancel_and_wait works, so @pmconrad's comment could be correct)

@abitmore

This comment has been minimized.

Member

abitmore commented Oct 5, 2018

@cogutvalera the code looks fine to me, but the commit message "review changes" doesn't help in case when revisiting the code/changes in the future. How to Write a Git Commit Message: https://chris.beams.io/posts/git-commit/ . I'm learning this as well.

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 5, 2018

@abitmore Thanks ! I've changed comment.

@@ -143,7 +143,21 @@ void witness_plugin::plugin_startup()
void witness_plugin::plugin_shutdown()
{
// nothing to do
_shutting_down = true;

This comment has been minimized.

@pmconrad

pmconrad Oct 5, 2018

I think this should be moved into stop_block_production()

This comment has been minimized.

@cogutvalera

cogutvalera Oct 5, 2018

Member

@pmconrad Thanks ! Moved in new commit.

@pmconrad

Ok, looks good and simple test worked ok. Thanks!

Feature release (201810) automation moved this from In progress to In Testing Oct 7, 2018

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 7, 2018

Thanks !

@abitmore abitmore merged commit 67f313d into bitshares:develop Oct 8, 2018

2 checks passed

ci/dockercloud Your tests passed in Docker Cloud
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

Feature release (201810) automation moved this from In Testing to Done Oct 8, 2018

@cogutvalera

This comment has been minimized.

Member

cogutvalera commented Oct 8, 2018

Thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment