Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: fix boost::asio::async_write() does not return error... #35904

Merged
merged 1 commit into from Jul 10, 2020

Conversation

mkogan1
Copy link
Contributor

@mkogan1 mkogan1 commented Jul 2, 2020

although remote has closed the connection

Fixes: https://tracker.ceph.com/issues/46332

Signed-off-by: Mark Kogan mkogan@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

Copy link
Contributor

@mattbenjamin mattbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not knowledgeable enough to say this works, but it looks very plausible; I take it this prevents further calls on the connection?

@mattbenjamin mattbenjamin self-requested a review July 2, 2020 17:06
Copy link
Contributor

@mattbenjamin mattbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, didn't mean to approve before getting feedback

@@ -69,6 +69,9 @@ class StreamIO : public rgw::asio::ClientIO {
yield[ec]);
if (ec) {
ldout(cct, 4) << "write_data failed: " << ec.message() << dendl;
if (ec==boost::asio::error::broken_pipe) {
stream.lowest_layer().shutdown(tcp::socket::shutdown_both, ec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're throwing ec as a rgw::io::Exception just below, but this call to shutdown() will mutate ec. maybe use a separate variable boost::system::error_code ec_ignored for the call to shutdown()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intention was to override the ec to indicate the specific condition has occured
for example the error printed to the log at

dout(0) << "ERROR: client_io->complete_request() returned "

changes from
7fb47eedc700 0 ERROR: client_io->complete_request() returned Broken pipe
to
7fc0e14da700 0 ERROR: client_io->complete_request() returned Transport endpoint is not connected

it is not essential, changing to ec_ignored

@cbodley
Copy link
Contributor

cbodley commented Jul 2, 2020

have you narrowed down exactly which parts of the RGWListBuckets op was making additional calls to write after errors were returned? it sounds like we're missing some error handling, and i'd like to address that too if we can

@mkogan1 mkogan1 force-pushed the wip-fix-endpoint-not-connected-r2 branch from be5e45a to f7d43f3 Compare July 6, 2020 10:51
@mkogan1
Copy link
Contributor Author

mkogan1 commented Jul 6, 2020

have you narrowed down exactly which parts of the RGWListBuckets op was making additional calls to write after errors were returned? it sounds like we're missing some error handling, and i'd like to address that too if we can

The flow is as follows

from process_request(...):

int process_request(rgw::sal::RGWRadosStore* const store,

the request is submited at:
std::tie(ret,c) = schedule_request(scheduler, s, op);

completing the requst:

ceph/src/rgw/rgw_process.cc

Lines 288 to 290 in 4425f3e

done:
try {
client_io->complete_request();

  • to:

    size_t write_data(const char* buf, size_t len) override {

  • the boost::asio::async_write(...) at this point of the flow, is called 3rd time, before the fix, an error was not returned here.

    auto bytes = boost::asio::async_write(stream, boost::asio::buffer(buf, len),

  • after the fix, for example like:

    if (ec==boost::asio::error::broken_pipe) {
    boost::system::error_code ec_ignored;
    stream.lowest_layer().shutdown(tcp::socket::shutdown_both, ec_ignored);

  • an error (also 'ec=system:104/Connection reset by peer') is thrown from:

    throw rgw::io::Exception(ec.value(), std::system_category());

and caught at:

} catch (rgw::io::Exception& e) {

@@ -291,6 +291,8 @@ int process_request(rgw::sal::RGWRadosStore* const store,
} catch (rgw::io::Exception& e) {
dout(0) << "ERROR: client_io->complete_request() returned "
<< e.what() << dendl;
perfcounter->inc(l_rgw_qlen, -1);
perfcounter->inc(l_rgw_qactive, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we address this issue in a separate pr? the counters should be encapsulated by the frontends, and shouldn't leak into process_request()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done & working on handling it at the frontend level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @mkogan1 , is the pr fix perfcounter->inc(l_rgw_qlen, -1); already submit? i search no result for it, if have not summit, i would like to submit a new pr about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joke-lee It's not merged yet, waiting on PR #36428 to see if it will facilitate a better solution than directly modifying counters in precess_request

@mkogan1 mkogan1 force-pushed the wip-fix-endpoint-not-connected-r2 branch from f7d43f3 to e02aff0 Compare July 7, 2020 09:52
although remote has closed the connection

Fixes: https://tracker.ceph.com/issues/46332

Signed-off-by: Mark Kogan <mkogan@redhat.com>
@mkogan1 mkogan1 force-pushed the wip-fix-endpoint-not-connected-r2 branch from e02aff0 to c997eb6 Compare July 9, 2020 08:40
@ivancich ivancich added the wip-eric-testing-1 for ivancich testing label Jul 9, 2020
@mkogan1
Copy link
Contributor Author

mkogan1 commented Jul 9, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants