ceph.in: use timeout when passed as an argument #21432

Closed

rishabh-d-dave
Contributor

@rishabh-d-dave rishabh-d-dave commented Apr 15, 2018

Makes `ceph ping` honor the connection timeout when one is passed as an argument. See http://tracker.ceph.com/issues/19348

Fixes: http://tracker.ceph.com/issues/19348
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Contributor

@tchaikov tchaikov left a comment

i am still observing the backtrace with this patch.

@rishabh-d-dave
Contributor Author

rishabh-d-dave commented May 2, 2018

@tchaikov

i am still observing the backtrace with this patch.

Are my steps correct? With my patch, this is what I get when I follow the reproduction recipe from the issue tracker:

$ # preliminary steps
$ MON=3 MGR=1 MDS=1 OSD=1 ../src/vstart.sh -d -x -i 127.0.0.1 -n
$ ps -e | grep ceph # get mon.c's pid
$ sudo kill -9 <mon.c's pid>
$ ps -e | grep ceph # make sure mon.c is dead
$ # actual test
$ date; ./bin/ceph ping mon.c --connect-timeout=5; date
Wed May  2 07:17:33 UTC 2018
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-05-02 07:17:33.616 7f7973dfa700 -1 WARNING: all dangerous and experimental features are enabled.
[errno 110] error calling ping_monitor
Wed May  2 07:17:38 UTC 2018
$ 

And this is what I get on the master branch:

$ date; ./bin/ceph ping mon.c --connect-timeout=5; date
Wed May  2 07:10:03 UTC 2018
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-05-02 07:10:03.363 7efc41a92700 -1 WARNING: all dangerous and experimental features are enabled.
^C(-4, None, 'Interrupted!')
/home/centos/repos/ceph/src/msg/async/Event.cc: In function 'EventCenter::~EventCenter()' thread 7efc39ffb700 time 2018-05-02 07:10:33.558969
/home/centos/repos/ceph/src/msg/async/Event.cc: 174: FAILED assert(time_events.empty())
 ceph version 13.0.2-1925-gb869bfa (b869bfadd9b0918410abf22cc94681186e1dbd64) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7efc47e68e9f]
 2: (()+0x289087) [0x7efc47e69087]
 3: (EventCenter::~EventCenter()+0x1fb) [0x7efc47fa061b]
 4: (PosixWorker::~PosixWorker()+0x5d) [0x7efc47fa688d]
 5: (PosixNetworkStack::~PosixNetworkStack()+0x8e) [0x7efc47fa6a3e]
 6: (void ceph::_any::op_func<StackSingleton>(ceph::_any::op, void*)+0x81) [0x7efc47f9e951]
 7: (std::_Rb_tree<std::pair<std::string, std::type_index>, std::pair<std::pair<std::string, std::type_index> const, ceph::immobile_any<576ul> >, std::_Select1st<std::pair<std::pair<std::string, std::type_index> const, ceph::immobile_any<576ul> > >, CephContext::associated_objs_cmp, std::allocator<std::pair<std::pair<std::string, std::type_index> const, ceph::immobile_any<576ul> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<std::string, std::type_index> const, ceph::immobile_any<576ul> > >*)+0x40) [0x7efc48066bd0]
 8: (CephContext::~CephContext()+0x1d) [0x7efc4806217d]
 9: (CephContext::put()+0x174) [0x7efc48062764]
 10: (librados::RadosClient::~RadosClient()+0x1e3) [0x7efc508d9db3]
 11: (librados::RadosClient::~RadosClient()+0x9) [0x7efc508d9e59]
 12: (rados_shutdown()+0x2e) [0x7efc508837ce]
 13: (()+0x175f1) [0x7efc50b9e5f1]
 14: (PyEval_EvalFrameEx()+0x730a) [0x7efc59eca0ca]
 15: (PyEval_EvalCodeEx()+0x7ed) [0x7efc59ecbefd]
 16: (PyEval_EvalFrameEx()+0x663c) [0x7efc59ec93fc]
 17: (PyEval_EvalFrameEx()+0x67bd) [0x7efc59ec957d]
 18: (PyEval_EvalCodeEx()+0x7ed) [0x7efc59ecbefd]
 19: (()+0x70858) [0x7efc59e55858]
 20: (PyObject_Call()+0x43) [0x7efc59e309a3]
 21: (()+0x5a995) [0x7efc59e3f995]
 22: (PyObject_Call()+0x43) [0x7efc59e309a3]
 23: (PyEval_CallObjectWithKeywords()+0x47) [0x7efc59ec27b7]
 24: (()+0x1156e2) [0x7efc59efa6e2]
 25: (()+0x7e25) [0x7efc59bd0e25]
 26: (clone()+0x6d) [0x7efc591f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted
Wed May  2 07:10:33 UTC 2018
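
The `date; ...; date` comparison in the two transcripts above can also be scripted. The following is a minimal sketch, not part of the patch: `timed_run`, `sleeper`, and the `hard_limit` parameter are hypothetical names introduced here for illustration. It measures how long a command takes and kills it after a hard limit, so a hang like the one seen on master does not block the check indefinitely.

```python
import subprocess
import sys
import time


def timed_run(cmd, hard_limit=60.0):
    """Run cmd and return (exit_code, elapsed_seconds).

    exit_code is None if the command was still running after hard_limit
    seconds; subprocess.run() kills the child in that case.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=hard_limit,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return code, time.monotonic() - start


def sleeper(seconds):
    """Stand-in command that only sleeps, for trying the helper without a cluster."""
    return [sys.executable, "-c", "import time; time.sleep(%r)" % seconds]
```

Against a vstart cluster with mon.c killed, the command from the transcript would be passed as `timed_run(["./bin/ceph", "ping", "mon.c", "--connect-timeout=5"])`. An elapsed time close to 5 seconds would match the patched behaviour shown above, while an exit code of `None` would indicate the hang.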

@tchaikov
Contributor

tchaikov commented May 2, 2018

@rishabh-d-dave have you tried ctrl-c before the timeout expires?

@rishabh-d-dave
Contributor Author

rishabh-d-dave commented May 2, 2018

@tchaikov Okay. What's the expected behaviour in this case? My first impression after reading the reproduction recipe was that we wanted the ping command to quit on timeout.

@tchaikov
Contributor

tchaikov commented May 2, 2018

@rishabh-d-dave please read the ticket's description

please note, we should allow SIGINT to terminate the waiting with the fix. see run_in_thread().
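
As a rough illustration of the requirement quoted above, the sketch below runs a blocking call in a worker thread and waits for it in short `join()` slices, so the main thread can still receive SIGINT (raised as `KeyboardInterrupt`) while it waits, and the wait ends once the timeout expires. This is not a copy of Ceph's `run_in_thread()`; the name `run_with_timeout` and the `poll` parameter are assumptions made for this example.

```python
import threading
import time


def run_with_timeout(func, timeout=None, poll=0.1):
    """Run func() in a daemon thread and wait at most `timeout` seconds.

    Returns (finished, result). Waiting happens in short join() calls so
    that Ctrl-C in the main thread is noticed promptly instead of being
    held up by one long blocking wait.
    """
    box = {}

    def target():
        try:
            box["result"] = func()
        except Exception as exc:  # hand the error back to the caller
            box["error"] = exc

    worker = threading.Thread(target=target)
    worker.daemon = True  # do not keep the process alive after a timeout
    worker.start()

    deadline = None if timeout is None else time.monotonic() + timeout
    while worker.is_alive():
        if deadline is None:
            remaining = poll
        else:
            remaining = min(poll, deadline - time.monotonic())
        if remaining <= 0:
            return False, None  # timed out; the worker is left behind
        worker.join(remaining)

    if "error" in box:
        raise box["error"]
    return True, box.get("result")
```

A call such as `run_with_timeout(lambda: some_blocking_ping(), timeout=5)` (where `some_blocking_ping` is a placeholder) would return `(False, None)` after about five seconds if the call never completes, and a Ctrl-C during the wait would surface as `KeyboardInterrupt` in the caller.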

@rishabh-d-dave
Contributor Author

rishabh-d-dave commented May 2, 2018

@tchaikov

On keyboard interrupt, do we want to shut down the cluster_handle/RADOS client or leave it as is? Shutting it down properly would require clearing time_events, so I would somehow need to tell the messaging service that the request has been aborted.
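
Purely as an illustration of the "shut down properly" option, the ordering could look like the sketch below: abort the pending request first, wait for the worker to finish, and only then shut the handle down. `FakeHandle` and all of its methods are hypothetical stand-ins and not the librados API. Whether Ceph should take this route is the open question in this comment.

```python
import threading


class FakeHandle(object):
    """Hypothetical stand-in for a cluster handle (not the librados API)."""

    def __init__(self, max_wait=30.0):
        self.max_wait = max_wait
        self.aborted = threading.Event()
        self.is_shut_down = False

    def ping(self):
        # Pretend to wait for a reply; return early if abort() is called.
        return "aborted" if self.aborted.wait(self.max_wait) else "no reply"

    def abort(self):
        self.aborted.set()

    def shutdown(self):
        self.is_shut_down = True


def poll_join(worker, poll=0.1):
    """Wait for worker in short slices so Ctrl-C is noticed promptly."""
    while worker.is_alive():
        worker.join(poll)


def ping_with_cleanup(handle, wait=poll_join):
    """Ping in a worker thread; on Ctrl-C abort the request, then shut down."""
    result = {}
    worker = threading.Thread(
        target=lambda: result.update(status=handle.ping()))
    worker.daemon = True
    worker.start()
    try:
        wait(worker)
    except KeyboardInterrupt:
        handle.abort()  # tell the pending request to stop first
        worker.join()   # nothing is in flight once this returns
    finally:
        handle.shutdown()  # safe now: no request is still pending
    return result["status"]
```

The `wait` argument exists only so the interrupt path can be exercised without a real Ctrl-C, for example by passing a function that raises `KeyboardInterrupt`.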

@stale

stale bot commented Oct 18, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Oct 18, 2018
@rishabh-d-dave
Contributor Author

I'll try to take a look and update this PR ASAP.

@stale stale bot removed the stale label Oct 22, 2018
@rishabh-d-dave
Contributor Author

@tchaikov On following the reproducing recipe (which is: spawn cluster, kill a mon and ping that mon), I don't see the traceback copied in the description any more. Can you please verify this? Perhaps this bug is somehow resolved.

@tchaikov
Contributor

@rishabh-d-dave yeah. i cannot reproduce the failure with master HEAD. also after reviewing this issue, i think a simpler fix is #24733, could you help test and review it?

@rishabh-d-dave
Contributor Author

@tchaikov Sure.

@rishabh-d-dave
Contributor Author

@tchaikov shall i proceed to close this PR?

@tchaikov
Contributor

@rishabh-d-dave please go on, we can continue working on the fix at #24733.

@rishabh-d-dave rishabh-d-dave deleted the fix-mon-ping-timeout branch July 23, 2019 03:59