New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ceph.in: Check return value when connecting #16130

Merged
merged 1 commit into from Jul 13, 2017

Conversation

Projects
None yet
5 participants
@fullerdj
Contributor

fullerdj commented Jul 5, 2017

The initial RADOS connection may time out. Detect this condition
and return an error message to the user instead of looping forever.

Signed-off-by: Douglas Fuller dfuller@redhat.com

@alfredodeza alfredodeza requested a review from dmick Jul 5, 2017

src/ceph.in Outdated
run_in_thread(cluster_handle.connect, timeout=timeout)
retval, outbuf, outs = run_in_thread(cluster_handle.connect,
timeout=timeout)
if retval == -errno.EINTR:

This comment has been minimized.

@tchaikov

tchaikov Jul 7, 2017

Contributor

this could also caused by a real KeyboardInterrupt.

This comment has been minimized.

@fullerdj

fullerdj Jul 7, 2017

Contributor

Good point. I adjusted the error message accordingly.

@tchaikov tchaikov added needs-qa and removed needs-review labels Jul 8, 2017

@alfredodeza

This comment has been minimized.

Contributor

alfredodeza commented Jul 10, 2017

@dmick these changes look ok to me, can you double check we aren't missing anything?

@liewegas

This comment has been minimized.

Member

liewegas commented Jul 11, 2017

http://pulpito.ceph.com/sage-2017-07-11_01:11:01-rados-wip-sage-testing-distro-basic-smithi did not go well:

2017-07-11T01:17:02.138 INFO:teuthology.orchestra.run.smithi099:Running: 'sudo ceph --cluster ceph osd crush tunables default'
2017-07-11T01:17:02.190 INFO:tasks.ceph.mgr.x.smithi099.stderr:2017-07-11 01:17:02.194595 7f5ecc818380 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.191 INFO:tasks.ceph.mgr.x.smithi099.stderr:2017-07-11 01:17:02.194740 7f5ecc818380 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.191 INFO:tasks.ceph.mon.a.smithi099.stderr:2017-07-11 01:17:02.195949 7f839508ac00 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.191 INFO:tasks.ceph.mon.a.smithi099.stderr:2017-07-11 01:17:02.196062 7f839508ac00 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.191 INFO:tasks.ceph.mgr.x.smithi099.stderr:2017-07-11 01:17:02.196712 7f5ecc818380 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.192 INFO:tasks.ceph.mon.a.smithi099.stderr:2017-07-11 01:17:02.198074 7f839508ac00 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.236 INFO:teuthology.orchestra.run.smithi099.stderr:2017-07-11 01:17:02.241184 7feff4a1f700 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:02.249 INFO:teuthology.orchestra.run.smithi099.stderr:2017-07-11 01:17:02.254776 7feff4a1f700 -1 WARNING: all dangerous and experimental features are enabled.
2017-07-11T01:17:03.112 INFO:teuthology.orchestra.run.smithi099.stderr:'NoneType' object is not iterable
2017-07-11T01:17:03.126 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-sage-testing/qa/tasks/ceph.py", line 325, in crush_setup
    'osd', 'crush', 'tunables', profile])
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi099 with status 1: 'sudo ceph --cluster ceph osd crush tunables default'
@fullerdj

This comment has been minimized.

Contributor

fullerdj commented Jul 11, 2017

Ok, @liewegas, I think this should fix the multiple ^C problem as well.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 11, 2017

@fullerdj your PR includes some submodule changes by accident.

ceph.in: Check return value when connecting
The initial RADOS connection may time out. Detect this condition
and return an error message to the user instead of looping forever.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
@fullerdj

This comment has been minimized.

Contributor

fullerdj commented Jul 11, 2017

Oops, sorry. Fixed.

@@ -899,7 +899,10 @@ def main():
try:
if childargs and childargs[0] == 'ping':
return ping_monitor(cluster_handle, childargs[1], timeout)
run_in_thread(cluster_handle.connect, timeout=timeout)
result = run_in_thread(cluster_handle.connect, timeout=timeout)
if type(result) is tuple and result[0] == -errno.EINTR:

This comment has been minimized.

@dmick

dmick Jul 12, 2017

Member

This looks correct with the way run_in_thread got modified in 7216b06, but I don't understand why it has to return different types depending on the return conditions...sigh...why isn't the "normal" case "t.retval, None, None" then...

I also wonder what, if anything, ever triggered the except KeyboardInterrupt below.

This comment has been minimized.

@dmick

dmick Jul 12, 2017

Member

I don't know that any of these require fixing, but I do wonder what you think about the last point

This comment has been minimized.

@fullerdj

fullerdj Jul 12, 2017

Contributor

I agree that the return value from run_in_thread is strange.

I'm not aware of the revision history here, but I think KeyboardInterrupt is very unlikely to be caught there since it's caught (and, strangely, raised so as to be caught) by run_in_thread (and ping_monitor primarily uses run_in_thread as well).

In fact, I think this issue is now more pervasive; squashing the KeyboardInterrupt leads to an infinite loop here and I suspect it may lead to bad behavior at other unchecked call sites.

This comment has been minimized.

@dmick

dmick Jul 12, 2017

Member

I'd just as soon regularize the return from run_in_thread to a tuple, then, to reduce confusion and verbosity of test.

a long time ago my intent was to propagate the KeyboardInterrupt as the "the thread was interrupted either by human or timeout". I don't know when that changed (could have been before I pushed originally) but it still seems like a plausible approach.

This comment has been minimized.

@dmick

dmick Jul 18, 2017

Member

or...not.

This comment has been minimized.

@fullerdj

fullerdj Jul 18, 2017

Contributor

Sorry, it got merged before I changed it. I’ll do it in a separate PR.

This comment has been minimized.

@fullerdj

fullerdj Jul 18, 2017

Contributor

or...not.

So, it turns out that run_in_thread can return darn near anything, depending on what you're running. So, it is not particularly amenable to standardization. We could have it return a dict with an "error" item, but otherwise we are stuck with this approach.

@liewegas liewegas merged commit c67f986 into ceph:master Jul 13, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@fullerdj fullerdj deleted the fullerdj:wip-djf-ceph-connect-timeout branch Jul 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment