osd: heartbeat with packets large enough to require working jumbo frames #15535

Merged
1 commit merged into ceph:master on Jun 14, 2017

Conversation

@gregsfortytwo
Member

gregsfortytwo commented Jun 7, 2017

Fixes: http://tracker.ceph.com/issues/20087

Signed-off-by: Greg Farnum gfarnum@redhat.com

@@ -797,6 +797,7 @@ OPTION(osd_heartbeat_interval, OPT_INT, 6) // (seconds) how often we ping
OPTION(osd_heartbeat_grace, OPT_INT, 20)
OPTION(osd_heartbeat_min_peers, OPT_INT, 10) // minimum number of peers
OPTION(osd_heartbeat_use_min_delay_socket, OPT_BOOL, false) // prio the heartbeat tcp socket and set dscp as CS6 on it if true
OPTION(osd_heartbeat_min_size, OPT_INT, 2000) // the minimum size of OSD heartbeat messages to send


@yuyuyu101

yuyuyu101 Jun 7, 2017

Member

So this will increase the default payload size? Why not let the user set it if jumbo frames are enabled?


@gregsfortytwo

gregsfortytwo Jun 7, 2017

Member

We want to protect against users who don't have the setup they think they do. Requiring them to turn on the test rather defeats the purpose. ;)


@bniver

bniver Jun 8, 2017

Contributor

@gregsfortytwo
Do we have some sort of global concept of min message size (or max) that we expect to travel as a segment?


@gregsfortytwo

gregsfortytwo Jun 8, 2017

Member

No, we're using TCP/IP so we don't care at all. This is entirely so that our ping sizes are transmitted in jumbo-sized packets when our other messages are. 100% to catch the case where a network misconfiguration results in the server sending packets the switch won't accept, and so we transmit no replication data but don't get marked down (as our heartbeat pings are under the limit).

@jdurgin


Member

jdurgin commented Jun 7, 2017

lgtm, though I'm unsure how the network stack breaks this down. Did you verify with tcpdump or something that it results in a large packet like the ping command in the tracker ticket?

@gregsfortytwo


Member

gregsfortytwo commented Jun 7, 2017

Unfortunately I'm just relying on user reports at this point -- our MPing messages get through but not ones with data.

I'm not quite sure how it could go out in packets smaller than the whole message, though (or the maximum packet size, which also satisfies our purposes and is why I set it here instead of nearer the max jumbo packet size).

@liewegas


Member

liewegas commented Jun 7, 2017

My quick googling indicates "normal" non-jumbo frames are in the ~1500 range, so 2000 ought to be enough to break a totally broken config. As a default that seems right to me.

I suspect that in practice "broken" jumbo frame environments aren't totally broken, but rather only fail with some frame sizes (maybe in a certain range)? I'm just speculating. If we get better feedback from the field we may eventually make this do something like randomize the padding within a range. Until we know more, though, this looks good to me!
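As a standalone illustration of the padding idea (plain std::vector in place of Ceph's bufferlist; the pad_to_min helper and the 64-byte body are made up for the example), the payload only needs to be filled past the 1500-byte non-jumbo limit:

#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative only: pad an encoded heartbeat payload so it exceeds the
// standard 1500-byte Ethernet MTU and therefore needs working jumbo frames
// (or IP fragmentation) to get through.
static void pad_to_min(std::vector<uint8_t>& payload, size_t min_size) {
  if (payload.size() < min_size)
    payload.resize(min_size, 0);  // append zero filler up to the minimum
}

int main() {
  std::vector<uint8_t> ping_payload(64, 0xab);  // a tiny "real" heartbeat body
  const size_t osd_heartbeat_min_size = 2000;   // default proposed in this PR
  pad_to_min(ping_payload, osd_heartbeat_min_size);
  std::cout << "padded payload: " << ping_payload.size()
            << " bytes (> 1500, so a broken jumbo path will drop it too)\n";
}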

@liewegas


Member

liewegas commented Jun 7, 2017

Hmm, I'm guessing the unittest_direct_messenger failure is related, though:

[ RUN      ] DirectMessenger.AsyncDispatch
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/Mutex.cc: In function 'Mutex::~Mutex()' thread 7f880700c080 time 2017-06-07 05:29:32.834450
/home/jenkins-build/build/workspace/ceph-pull-requests/src/common/Mutex.cc: 73: FAILED assert(nlock == 0)
 ceph version  12.0.2-2454-g3e71fec (3e71fecb27a8b42374710fe4c2f976fe25163dde) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f87fe259ac2]
 2: (()+0x26fd0b) [0x7f87fe22ad0b]
 3: (QueueStrategy::~QueueStrategy()+0x7b) [0x7f87fe3e6d1b]
 4: (DirectMessenger::~DirectMessenger()+0x6c) [0x558f831b664c]
 5: (DirectMessenger_AsyncDispatch_Test::TestBody()+0x21b) [0x558f831ab1bb]
 6: (void testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*)+0x43) [0x558f831dd003]
 7: (testing::Test::Run()+0xba) [0x558f831d3d1a]
 8: (testing::TestInfo::Run()+0x118) [0x558f831d3e68]
 9: (testing::TestCase::Run()+0xb5) [0x558f831d3f45]
 10: (testing::internal::UnitTestImpl::RunAllTests()+0x24f) [0x558f831d421f]
 11: (bool testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x43) [0x558f831dd4e3]
 12: (testing::UnitTest::Run()+0x44) [0x558f831d4554]
 13: (main()+0x92) [0x558f831a7e02]
 14: (__libc_start_main()+0xf0) [0x7f87fd476830]
 15: (_start()+0x29) [0x558f831a8299]
@gregsfortytwo


Member

gregsfortytwo commented Jun 8, 2017

I don't really see how the DirectMessenger tests could be influenced by these changes, but I did add initialization to MOSDPing in the default-constructed case. I don't think anybody looks at those values in this case but it could have been a problem if re-using messages to send back and forth or something else odd.
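For illustration, the kind of initialization fix described here amounts to giving a default-constructed message well-defined field values; this standalone sketch uses made-up field names rather than the real MOSDPing members:

#include <cstdint>
#include <ctime>

// Illustrative message type: without the default member initializers below, a
// default-constructed instance would carry indeterminate values, which can
// bite if such a message is ever re-used or encoded as-is.
struct PingLikeMessage {
  int8_t   op = 0;                // which ping operation this is
  time_t   stamp = 0;             // when it was sent
  uint32_t min_message_size = 0;  // padding target; 0 means "no padding"
};

int main() {
  PingLikeMessage m;  // default-constructed: all fields are now well defined
  return m.op;        // 0, not garbage
}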

@gregsfortytwo


Member

gregsfortytwo commented Jun 8, 2017

Passing unittest_direct_messenger on my local box and in Jenkins.

@liewegas


Member

liewegas commented Jun 9, 2017

In file included from /home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc:65:0:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/messages/MOSDPing.h: In member function ‘virtual void MOSDPing::decode_payload()’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/messages/MOSDPing.h:92:5: error: ‘class ceph::buffer::list::iterator’ has no member named ‘get_pos’
  (p.get_pos() - payload.begin().get_pos());
     ^
/home/jenkins-build/build/workspace/ceph-pull-requests/src/messages/MOSDPing.h:92:33: error: ‘class ceph::buffer::list::iterator’ has no member named ‘get_pos’
  (p.get_pos() - payload.begin().get_pos());
                                 ^
if (header.version >= 3) {
bufferlist size_bl;
::decode(size_bl, p);
min_message_size = size_bl.length() +


@liewegas

liewegas Jun 9, 2017

Member

min_message_size = payload.length(); is probably sufficient, I think?
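A standalone sketch of that suggestion, using a toy reader over a flat byte buffer instead of bufferlist::iterator (all names here are illustrative, not the Ceph API): rather than computing the padding's offset, the decoder can simply record the total payload length.

#include <cstdint>
#include <cstring>
#include <vector>

// Toy reader over a flat byte buffer (stand-in for bufferlist::iterator).
struct Reader {
  const std::vector<uint8_t>& buf;
  size_t off;
  uint32_t get_u32() {
    uint32_t v;
    std::memcpy(&v, buf.data() + off, sizeof(v));
    off += sizeof(v);
    return v;
  }
  void skip(size_t n) { off += n; }
};

// Decode a trailing length-prefixed padding blob. Instead of working out the
// padding's start position, simply remember how large the whole payload was:
// that is all the encode side needs to reproduce a same-sized message.
uint32_t decode_ping(const std::vector<uint8_t>& payload) {
  Reader p{payload, 0};
  uint32_t pad_len = p.get_u32();  // 4-byte length prefix of the filler blob
  p.skip(pad_len);                 // discard the filler itself
  return static_cast<uint32_t>(payload.size());  // "min_message_size = payload.length()"
}

int main() {
  std::vector<uint8_t> payload(2004, 0);
  uint32_t pad = 2000;
  std::memcpy(payload.data(), &pad, sizeof(pad));
  return decode_ping(payload) == 2004 ? 0 : 1;
}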

@@ -80,6 +99,10 @@ class MOSDPing : public Message {
::encode(op, payload);
::encode(peer_stat, payload);
::encode(stamp, payload);
if (min_message_size) {


@liewegas

liewegas Jun 9, 2017

Member

Decoding is conditional on header.version, but encoding is conditional on whether min_message_size >= 0. This should be unconditional (encode an empty bufferlist) or switch the decode condition to p.end() (probably the former?).
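The symmetry rule, sketched with plain byte vectors (the version number, helpers, and wire layout are illustrative, not Ceph's encode/decode API): whatever condition gates the decode must be matched on the encode side, so a version-3 encoder always writes the padding field, even when it is empty.

#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint8_t HEAD_VERSION = 3;

// Encoder: always writes the padding blob for version >= 3, even if empty,
// so a version-3 decoder never runs off the end of the payload.
std::vector<uint8_t> encode(uint32_t pad_len) {
  std::vector<uint8_t> out;
  out.push_back(HEAD_VERSION);
  uint32_t len = pad_len;  // may be 0: still write the length prefix
  out.insert(out.end(), reinterpret_cast<uint8_t*>(&len),
             reinterpret_cast<uint8_t*>(&len) + sizeof(len));
  out.insert(out.end(), static_cast<size_t>(pad_len), uint8_t{0});  // filler
  return out;
}

// Decoder: gated on the header version, matching the encode condition.
bool decode(const std::vector<uint8_t>& in) {
  size_t off = 0;
  uint8_t version = in[off++];
  if (version >= 3) {
    uint32_t len;
    std::memcpy(&len, in.data() + off, sizeof(len));
    off += sizeof(len) + len;
  }
  return off == in.size();  // consumed exactly what was encoded
}

int main() {
  return (decode(encode(0)) && decode(encode(2000))) ? 0 : 1;
}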

@gregsfortytwo


Member

gregsfortytwo commented Jun 9, 2017

Eek, sorry for pushing the broken commit; I forgot to --amend :(
Updated to deal with the encode/decode mismatch as well.

@@ -80,6 +99,8 @@ class MOSDPing : public Message {
::encode(op, payload);
::encode(peer_stat, payload);
::encode(stamp, payload);
bufferlist size_bl(MAX(min_message_size - payload.length(), 0));


@liewegas

liewegas Jun 12, 2017

Member

Pretty sure this doesn't actually put anything in the list; the ctor arg just sizes the append_buffer (how big we expect the bl to get).
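The capacity-versus-content distinction in standalone terms (std::string stands in for bufferlist here; this is not the Ceph API):

#include <cassert>
#include <string>

int main() {
  // Capacity hint: like a size argument to a container ctor, reserve()
  // pre-sizes the backing storage but puts nothing in the container...
  std::string size_hint_only;
  size_hint_only.reserve(2000);
  assert(size_hint_only.length() == 0);  // nothing was actually appended

  // ...whereas real padding requires appending actual bytes.
  std::string padded(2000, '\0');
  assert(padded.length() == 2000);
  return 0;
}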

osd: heartbeat with packets large enough to require working jumbo frames.

We get periodic reports that users somehow misconfigure one of their switches
so that it drops jumbo frames, yet the servers are still passing them along. In
that case, MOSDOp messages generally don't get through because they are much
larger than the 1500-byte non-jumbo limit, but the MOSDPing messages have kept
going (as they are very small and dispatched independently, even when the
server is willing to make jumbo frames). This means peer OSDs won't mark down
the ones behind the broken switch, despite all IO hanging.
Push the MOSDPing message size over the 1500-byte limit so that anybody in
this scenario will see the OSDs stuck behind a bad switch get marked down.

Fixes: http://tracker.ceph.com/issues/20087

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo


Member

gregsfortytwo commented Jun 13, 2017

Updated again. Verified the size by looking through logs and seeing they were 2004 bytes of payload (min 2000, plus 4 bytes of bufferlist header).
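For reference, the arithmetic behind that number, assuming a 4-byte encoded-length prefix in front of the padding blob:

#include <cstdint>
#include <iostream>

int main() {
  const uint32_t min_size = 2000;                   // osd_heartbeat_min_size default
  const uint32_t length_prefix = sizeof(uint32_t);  // 4-byte bufferlist length header
  std::cout << "expected payload: " << (min_size + length_prefix)
            << " bytes\n";                          // prints 2004
  return 0;
}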

@yuriw yuriw merged commit 5ad4ee8 into ceph:master Jun 14, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
arm64 make check: arm64 make check succeeded
make check: make check succeeded