New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msg/async/rdma: make Infiniband can be forkable #13525

Merged
merged 1 commit into from Mar 2, 2017

Conversation

Projects
None yet
3 participants
@yuyuyu101
Member

yuyuyu101 commented Feb 19, 2017

Signed-off-by: Haomai Wang haomai@xsky.com

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Feb 19, 2017

@Adirl this is another pr to fix rdma background start. it will make registered memory reset after forking. do we need to reset other resources like ibv_pd, ibv_cq after forking?

@Adirl

This comment has been minimized.

Adirl commented Feb 19, 2017

yes
every RDMA resource needs to be reseted

@orendu

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Feb 20, 2017

@Adirl commit updated, passed local test! replacing #13384

@Adirl

This comment has been minimized.

Adirl commented Feb 23, 2017

@yuyuyu101
was #13384 removed from master ? ( i remember it was merged but i don't find it anymore)

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Feb 23, 2017

@Adirl no, we don't merge that. because it fails in tests

@Adirl

This comment has been minimized.

Adirl commented Feb 23, 2017

@yuyuyu101
got it
thanks

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Mar 2, 2017

updated, local tested ok

msg/async/rdma: restart Infiniband resources to handle fork properly
Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Mar 2, 2017

passed cluster tests

@yuyuyu101 yuyuyu101 merged commit 595f945 into ceph:master Mar 2, 2017

2 of 3 checks passed

default Build started sha1 is merged.
Details
Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details

@yuyuyu101 yuyuyu101 deleted the yuyuyu101:wip-infiniband-fork branch Mar 2, 2017

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 2, 2017

Hi @yuyuyu101 ,

While working on the RDMA-CM, I hit 2 problems due to the pre/post fork:

  1. Can't call ibv_close_device() in the pre_fork handler when using librdmacm.
  2. Need to close all the listeners in pre_fork and recreate them on post_fork

This all makes me think if it isn't a sign that that the pre/post fork solution is not the right one, and that the first access to RDMA resources should be done after the daemonize().

What do you think about it? Do you have any suggestions?

Some more details about the problems:
In RDMA-CM, I need to use rdma_get_devices() instead of ibv_get_device_list()+ibv_open_device(). rdma_free_devices() only frees the device list - the devices resources are left opened. a later call to rdma_get_devices() will use the same opened devices (in our case, the device object that were opened before the fork).
It means that we need to use devices that were opened (ibv_open_device) and used before the fork, which is unsafe.

I tried to workaround it using a patch to librdmacm, to make rdma_free_devices() free the resources and close the devices. But then I hit the second problem. I need to close the RDMAServerSocket's that were opened before the fork and are using the old ibv_context. and recreate it after the fork, which is a mess.

Thanks,
Amir

@orendu @Adirl

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Apr 2, 2017

ok, maybe we can add a pre_fork/post_fork to RDMAServerSocket too? it would be a simpler way to handle as far as I think.

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 3, 2017

@yuyuyu101 ,

ok, maybe we can add a pre_fork/post_fork to RDMAServerSocket too? it would be a simpler way to handle as far as I think.

Yes, I started doing that, and it seems to be working - although it makes things more complicated - what if the port will get occupied between the pre_fork and the post_fork?

With that we could live, but my main concern is still: librdmacm doesn't have the option to close devices. So I can't actually close IB devices on the pre_fork.

The solutions that I thought for that, are all bad:

  1. Extend librdmacm - this is very simple, but will take a long time till the change will propagate, and customers won't be able to run ceph-rdma on all the old OFED's...
  2. Issue dlclose() on the pre_fork and dlopen() again on post_for. dlclose() is very complicated, and it is not promised that the destructor of the library will actually be called. Also, since the librdmacm.so is called implicitly by the ld.so when the program is started, I don't have the handle for it to be used in the close. So this solution is bad too.

Currently I'm working with a modified librdmacm, so I be able to continue develop - but I don't have a solution.

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 3, 2017

@yuyuyu101 ,
Just to make sure I understand:
The reason why we have to access RDMA resources before the fork, is because many tests are expecting the prints to be in certain order, and moving the bind to be after daemonize() will change the order of the prints?

Is there another reason for that?

Amir

@orendu @Adirl

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Apr 3, 2017

@amirv hmm, it's odd that rdma-cm doesn't support closing device... if so, it's really a problem. We don't want to rely on a developing feature.

The last option is we delay listen after forking.. maybe we could add "start" interface to NetworkStack and let asyncmessenger call start when ready

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 3, 2017

@yuyuyu101

@amirv hmm, it's odd that rdma-cm doesn't support closing device... if so, it's really a problem. We don't want to rely on a developing feature.

Yeh, it is ugly - don't know why it was done like that. Anyway, first time a process calls rdma_get_devices() it will call ucma_init() which will open the devices. ucma_uninit() will only be called on the shared library destructor. No way exists to call it directly.

The last option is we delay listen after forking.. maybe we could add "start" interface to NetworkStack and let asyncmessenger call start when ready

This should be perfect

@orendu @Adirl

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Apr 3, 2017

is that really unsafe to refer to device after forking?

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 3, 2017

@yuyuyu101

is that really unsafe to refer to device after forking?

I hope that I explain it accuratly:
Software communicate to HW through memory addresses that are mapped to device.
After the fork, there is a risk that the child will see a copy of the mapped address - so it will read/write to the copy, but the device won't see it. This is effective for every such resource that was opened before the fork, and ibv_open_device() opens many such resources.

This is the theoretical part (which @orendu and @Adirl know better than me) - in practice, calls to rdma_() and ibv_() functions failed, unless I used the patched librdmacm which actually close the device in pre_fork and open it again in post_fork.

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 4, 2017

@yuyuyu101 ,
Can you prepare the patch to delay the listen? I tried, and lost my way there...
If you are too busy, just send me a sketch and I will work it out into a patch.
We are on a tight schedule - so need to fix it ASAP...

Thanks,
Amir

@orendu @Adirl

@yuyuyu101

This comment has been minimized.

Member

yuyuyu101 commented Apr 4, 2017

@amirv refer to https://github.com/yuyuyu101/ceph/commits/wip-async-msgr-ready

we could implement prefork/postfork now. according to "is_ready" flag, decide to do actual listen or just queue it wait for ready. this is not a perfect way but I think it could work

@amirv

This comment has been minimized.

Contributor

amirv commented Apr 4, 2017

@yuyuyu101
I think I understand. Will take it and add the listen/bind queuein

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment