Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msg/async: no need to set the new once before binding #11804

Closed
wants to merge 2 commits into from

Conversation

tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Nov 7, 2016

and Processor::bind() will set the new addr to myaddr once it binds
the new listening address.

Fixes: http://tracker.ceph.com/issues/17807
Signed-off-by: Willem Jan Withagen wjw@digiware.nl
Signed-off-by: Kefu Chai kchai@redhat.com

so the type of nonce in messengers is consistent with
entity_addr_t::nonce. and it's more spatial efficient than uint64_t.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

tchaikov commented Nov 7, 2016

@wjwithagen

@wjwithagen
Copy link
Contributor

@tchaikov
Will give it a testrun.
But perhaps even make a typedef?
typedef uint32_t nonce_t
So changes in the future should go easier?

wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 7, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
@wjwithagen
Copy link
Contributor

wjwithagen commented Nov 7, 2016

@tchaikov
With #11804 I'm still getting:

2016-11-07 12:11:22.556827 81306fe00  0 -- 127.0.0.1:0/4190861965 >> 127.0.0.1:6800/2048 conn(0x8131f7800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection:954 connect claims to be 127.0.0.1:6800/1002048 not 127.0.0.1:6800/2048 - wrong node!

And that on something simple as:
/usr/local/bin/python /usr/srcs/Ceph/work/ceph/build/bin/ceph tell osd.0 version

Now running with my patch

@wjwithagen
Copy link
Contributor

@tchaikov @yuyuyu101
Nope, that is not going to do it.

I actually needed to fix what I also did in #11706, and then cephtool-test-mon.sh runs to completion and passes .
I've undeleted my branch, but it still closed.

@tchaikov tchaikov changed the title msg/async: should rebind using the new nonce [DNM] msg/async: should rebind using the new nonce Nov 7, 2016
and Processor::bind() will set the new addr and nonce to myaddr once
it binds the new listening address.

Fixes: http://tracker.ceph.com/issues/17807
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

tchaikov commented Nov 7, 2016

@wjwithagen could you try this change again?

@wjwithagen
Copy link
Contributor

wjwithagen commented Nov 7, 2016

@tchaikov
I'm sort of wondering
With this change the nonce is not set at all directly in the local addr..
It is only thru aliasing of the reference that addr.nonce gets the new nonce??

But now I still get a hanging test:
/usr/srcs/Ceph/work/ceph/qa/workunits/cephtool/test.sh:67: retry_eagain: map_enxio_to_eagain ceph tell osd.0 version
But no more mentions in the log files of wrong node...
Now the ceph command itself complains:
2016-11-07 14:50:48.361985 81306fe00 0 -- 127.0.0.1:0/4278067750 >> 127.0.0.1:6800/20149 conn(0x8131f7800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection:954 connect claims to be 127.0.0.1:6800/1020149 not 127.0.0.1:6800/20149 - wrong node!

@wjwithagen
Copy link
Contributor

@tchaikov @yuyuyu101
In Processor::Rebind has to references to nonce.
One in the async-class and one in the addr class and they both need to be updated
This fragment is from bind:

  nonce += 1000000;
  addr.nonce = nonce;
  msgr->set_myaddr(addr);

And more or less the same needs to be done in rebind as well.
But then still I'm not getting the correct nonce in the addr if I do not use something like the old code.

I also note that there are a lot more uint64_t nonce declarations, not sure if they also conflict with the fact that in entity_addr_t nonce is declared as __u32. Both in the src/msg tree, but also at more places.

@tchaikov tchaikov self-assigned this Nov 8, 2016
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 11, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 11, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 14, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 18, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Nov 19, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
@tchaikov
Copy link
Contributor Author

@wjwithagen i will take a closer look at this issue tomorrow.

@tchaikov
Copy link
Contributor Author

@wjwithagen could you offer the steps to reproduce this issue?

@wjwithagen
Copy link
Contributor

@tchaikov
Right, just from memory, because I'll have to rerun stuff.
And I'm busy with dayjob ATM.

If I revert my change, I'm getting errors on tests where there is an OSD restart, and ceph or rados wants to connect again to that OSD. Connection is then refused due to a difference in the nonce.
That is easiest to see because the test will get stuck, or times out.
And (re)running the command will tell you about this problem.

It can also be found in the OSD logfiles if debug level is high enough. it also complains about not matching NONCE.
I'll see if I can redo the tests with the original code tonight.

@tchaikov
Copy link
Contributor Author

@wjwithagen so, in short, restarting OSD and then running ceph or rados will error out without your patch?

@wjwithagen
Copy link
Contributor

@tchaikov
On FreeBSD yes, although I'm not 100% sure that it is on ALL connections with rados and ceph.
Mainly because I then focussed on the test that failed.
I realise that documenting this in the commit would have been a smart thing.
You can take a shot, and otherwise I'll figure it out tonight.

@wjwithagen
Copy link
Contributor

wjwithagen commented Nov 30, 2016

@tchaikov

The command failing is:
ceph tell osd.0 version
whilest running:
src/test/cephtool-test-mon.sh

And running te command manually on the commandline:

sudo bin/ceph -c src/test/td/t-7202/ceph.conf tell osd.0 version
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2016-11-30 17:14:38.311288 801c16f00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.332033 801c1ab00 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-11-30 17:14:38.363600 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.565084 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:38.973728 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:39.803632 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:41.429071 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!
2016-11-30 17:14:44.660891 81306fe00  0 -- 127.0.0.1:0/4237874432 >> 127.0.0.1:6800/30470 conn(0x8131f4800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 127.0.0.1:6800/1030470 not 127.0.0.1:6800/30470 - wrong node!

This last line continues until the command gets cancelled or osd.0 goes away.

In the OSD log I find repeats of:

2016-11-30 17:17:44.924579 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=51 -
2016-11-30 17:17:44.925175 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_bulk peer close file descriptor 51
2016-11-30 17:17:44.925234 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).read_until read failed
2016-11-30 17:17:44.925330 b9c2d80  1 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0)._process_connection read peer banner and addr failed
2016-11-30 17:17:44.925420 b9c2d80  0 -- 127.0.0.1:6800/1030470 >> - conn(0xbc9a800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2016-11-30 17:17:44.925559 b9c2d80  1 -- 127.0.0.1:6800/1030470 reap_dead start

You'd need to up the loging to get it more verbose. But it tells you that the OSD has gone to 127.0.0.1:6800/1030470 for the instance of the socket. And the rest of the world does not seem to be updated. My fix actually does get the message out to the reset of the deamons, and ceph is able to pick up on that.

BTW:
It also shows why I'd like to get ride of all these DeveloperMode and feature warnings whilest developing.
Huge logfile pollution.

wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Dec 2, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Dec 4, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Dec 4, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
wjwithagen added a commit to wjwithagen/ceph that referenced this pull request Dec 4, 2016
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
@tchaikov tchaikov changed the title [DNM] msg/async: should rebind using the new nonce msg/async: no need to set the new once before binding Dec 8, 2016
@tchaikov tchaikov closed this Dec 14, 2016
@tchaikov tchaikov deleted the wip-17807 branch December 14, 2016 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants