msg: add new async event driver based on poll() #46525

Merged
merged 2 commits into from Nov 17, 2022

@rafalop rafalop commented Jun 6, 2022

This is a new event driver for async IO based on poll(), intended to overcome the file descriptor limitations experienced by the select() based driver used by Windows clients.

On Windows, select() can only manage 64*async_op_threads file descriptors, limiting cluster OSD counts to around 200 with the default config (see the linked bug). Although there are better native event-based IO methods for Windows, using poll() allows us to keep an unmodified POSIX stack, so it integrates easily.
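As a rough illustration of where the "around 200" figure comes from, the arithmetic can be sketched as below. The default of 3 async op threads is an assumption (it is not stated in this thread), as is the simplification that each OSD connection consumes one descriptor; `max_select_fds` is a hypothetical helper name.

```python
# Per-select() descriptor limit on Windows, as cited in this PR.
FD_SETSIZE_WINDOWS = 64


def max_select_fds(async_op_threads: int, fd_setsize: int = FD_SETSIZE_WINDOWS) -> int:
    """Upper bound on descriptors when each worker thread runs its own select()."""
    return fd_setsize * async_op_threads


if __name__ == "__main__":
    # Assumed default of 3 async op threads (hypothetical for this sketch).
    print(max_select_fds(3))
```

With the assumed default this gives 192 descriptors, which is consistent with the "around 200" OSD ceiling described above.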

This new poll() driver can also be used on other niche/legacy systems that provide poll and do not have anything better than select at the moment.

Fixes: https://tracker.ceph.com/issues/55840

Driver to replace select() where useful. Currently this means
Windows clients, as select() is the only driver available there.
Windows is limited by the FD_SETSIZE hard limit of 64
descriptors. This driver uses poll() or WSAPoll() and maintains
pollfd structures to overcome select() limitations.

Fixes: https://tracker.ceph.com/issues/55840
Signed-off-by: Rafael Lopez <rafael.lopez@softiron.com>
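The idea in the commit message, keeping a registry of descriptors with their event masks and handing it to poll(), can be sketched in Python as follows. This is not the code from this PR: `PollDriver`, its method names, and `demo` are hypothetical, and Python's `select.poll` is only available on POSIX platforms, so the sketch illustrates the technique rather than the Windows WSAPoll() path.

```python
import select
import socket


class PollDriver:
    """Toy event driver that tracks fd -> event mask, similar in spirit to a pollfd array."""

    def __init__(self):
        self._poller = select.poll()
        self._masks = {}

    def add_event(self, fd, mask):
        # Merge the new mask with whatever is already registered for this fd.
        new_mask = self._masks.get(fd, 0) | mask
        if fd in self._masks:
            self._poller.modify(fd, new_mask)
        else:
            self._poller.register(fd, new_mask)
        self._masks[fd] = new_mask

    def del_event(self, fd, mask):
        if fd not in self._masks:
            return
        new_mask = self._masks[fd] & ~mask
        if new_mask:
            self._poller.modify(fd, new_mask)
            self._masks[fd] = new_mask
        else:
            # No events left to watch: drop the fd entirely.
            self._poller.unregister(fd)
            del self._masks[fd]

    def event_wait(self, timeout_ms):
        # Returns a list of (fd, revents) pairs for descriptors that are ready.
        return self._poller.poll(timeout_ms)


def demo(n_pairs=80):
    """Watch more than 64 descriptors at once and wake on the last one."""
    driver = PollDriver()
    pairs = [socket.socketpair() for _ in range(n_pairs)]
    try:
        for reader, _writer in pairs:
            driver.add_event(reader.fileno(), select.POLLIN)
        pairs[-1][1].send(b"x")
        ready = [fd for fd, ev in driver.event_wait(1000) if ev & select.POLLIN]
        return len(pairs), ready == [pairs[-1][0].fileno()]
    finally:
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

Calling `demo()` registers 80 readable ends in a single poller, more than the 64-descriptor select() limit discussed here, and checks that only the socket that received data is reported ready.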
@rafalop rafalop requested a review from a team as a code owner June 6, 2022 02:08
@ionutbalutoiu
Contributor

Hello everyone,

I tried to benchmark the performance impact of this change, using the benchmarking methodology described in this Cloudbase blog post.

I ran 5 consecutive benchmarks (with 1000 RBD volumes per benchmark) for each use case.

These are the results for each use case tested:

We notice that:

  • For the READ operation we have approx. 1385 MB/sec median bandwidth in both cases (clean main branch vs main branch with this change applied)
  • For the WRITE operation we have approx. 115 MB/sec median bandwidth in both cases (clean main branch vs main branch with this change applied)

Therefore, I didn't find any performance impact of this change in my testing efforts.

@djgalloway

@idryomov Might you be the right person to review this? I'm not really sure which team/lead would be best.

@idryomov
Contributor

@djgalloway The labeler is never wrong -- this belongs to Core.

@petrutlucian94 petrutlucian94 added the win32 Specifix changes for the windows platform label Jul 20, 2022
@petrutlucian94
Contributor

@tchaikov thanks for pushing the recent Windows related fixes. could you please review this one as well?

@djgalloway

@neha-ojha @jdurgin Can either of you please do a code review or delegate someone? Would love to get this merged. Thanks!

@tchaikov
Contributor

tchaikov commented Aug 3, 2022

> @tchaikov thanks for pushing the recent Windows related fixes. could you please review this one as well?

@petrutlucian94 sorry, i don't have enough bandwidth to review this change. just skimmed through it, though. there are quite a few formatting issues.

Signed-off-by: Rafael Lopez <rafael.lopez@softiron.com>
@petrutlucian94
Contributor

jenkins test make check arm64

@petrutlucian94
Contributor

jenkins test windows

@rafalop
Contributor Author

rafalop commented Aug 8, 2022

@petrutlucian94 are you familiar with the jenkins tests? I don't think the 'ceph windows tests' failure is an issue with the code based on console output.

@petrutlucian94
Contributor

petrutlucian94 commented Aug 8, 2022

> @petrutlucian94 are you familiar with the jenkins tests? I don't think the 'ceph windows tests' failure is an issue with the code based on console output.

indeed, it's unrelated. The job is spinning up a Windows vm using libvirt and apparently fails when trying to retrieve the IP address:

++ sudo virsh domifaddr --source agent --interface Ethernet --full ceph-win-ltsc2019-ceph-windows-pull-requests-11980
++ grep ipv4
++ awk '{print $4}'
++ cut -d / -f1
+ VM_IP=
+ echo 'Retrying in 10 seconds'

All recent jobs seem to have failed because of the same issue. @ionutbalutoiu any thoughts on this?
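The shell trace above pipes the `virsh domifaddr` output through `grep ipv4`, `awk '{print $4}'`, and `cut -d / -f1`, and retries when the result is empty. A minimal Python sketch of the same extraction and retry logic is shown below. It is not the actual CI script: `extract_ipv4` and `wait_for_ip` are hypothetical names, the caller supplies the command output through a callable, and unlike the shell pipeline this version returns only the first matching address.

```python
import time


def extract_ipv4(domifaddr_output):
    """Return the first IPv4 address (without prefix length) or None if absent."""
    for line in domifaddr_output.splitlines():
        if "ipv4" not in line:          # same filter as `grep ipv4`
            continue
        fields = line.split()
        if len(fields) >= 4:            # same column as `awk '{print $4}'`
            return fields[3].split("/")[0]  # same as `cut -d / -f1`
    return None


def wait_for_ip(fetch_output, retries=3, delay=10.0, sleep=time.sleep):
    """Poll fetch_output() until an address appears, waiting `delay` seconds between tries."""
    for _ in range(retries):
        ip = extract_ipv4(fetch_output())
        if ip:
            return ip
        sleep(delay)
    return None
```

An empty `VM_IP=` in the trace corresponds to `extract_ipv4` returning `None`, which is the case that triggers the "Retrying in 10 seconds" branch.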

@ionutbalutoiu
Contributor

> > @petrutlucian94 are you familiar with the jenkins tests? I don't think the 'ceph windows tests' failure is an issue with the code based on console output.
>
> indeed, it's unrelated. The job is spinning up a Windows vm using libvirt and apparently fails when trying to retrieve the IP address:
>
> ++ sudo virsh domifaddr --source agent --interface Ethernet --full ceph-win-ltsc2019-ceph-windows-pull-requests-11980
> ++ grep ipv4
> ++ awk '{print $4}'
> ++ cut -d / -f1
> + VM_IP=
> + echo 'Retrying in 10 seconds'
>
> All recent jobs seem to have failed because of the same issue. @ionutbalutoiu any thoughts on this?

I see that starting from the end of last week, this error appeared constantly on all the windows jobs.

I'm trying to see what's wrong.

@ionutbalutoiu
Contributor

@petrutlucian94 @rafalop

It seems that the Jenkins job failed to properly get the VM IP address on some of the libvirt hosts from the Jenkins infra.
I submitted this PR to improve the VM IP detection: ceph/ceph-build#2045

Once that gets fixed, I'll retry the tests here.

@ionutbalutoiu
Contributor

jenkins test windows

@djgalloway

[2022-08-10T20:06:32.000Z] [googletest] unittest_crush.exe failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\unittest_crush.exe --gtest_output=xml:C:\workspace\test_results\unittest_crush_results.xml  > C:\workspace\test_results\unittest_crush_results.log 2>&1'".
[2022-08-10T20:06:32.000Z] [googletest] unittest_crush_wrapper.exe failed. Error: Command returned non-zero code(3): "cmd /c 'C:\ceph\unittest_crush_wrapper.exe --gtest_output=xml:C:\workspace\test_results\unittest_crush_wrapper_results.xml  > C:\workspace\test_results\unittest_crush_wrapper_results.log 2>&1'".

Are these unrelated?

@ionutbalutoiu
Contributor

> [2022-08-10T20:06:32.000Z] [googletest] unittest_crush.exe failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\unittest_crush.exe --gtest_output=xml:C:\workspace\test_results\unittest_crush_results.xml  > C:\workspace\test_results\unittest_crush_results.log 2>&1'".
> [2022-08-10T20:06:32.000Z] [googletest] unittest_crush_wrapper.exe failed. Error: Command returned non-zero code(3): "cmd /c 'C:\ceph\unittest_crush_wrapper.exe --gtest_output=xml:C:\workspace\test_results\unittest_crush_wrapper_results.xml  > C:\workspace\test_results\unittest_crush_wrapper_results.log 2>&1'".
>
> Are these unrelated?

I'm not sure.

@rafalop @petrutlucian94 - Are these failures related to the code changes from here ?

@petrutlucian94
Contributor

petrutlucian94 commented Aug 26, 2022

> @rafalop @petrutlucian94 - Are these failures related to the code changes from here ?

those are unrelated, I've just submitted a fix: #47818

@petrutlucian94
Contributor

jenkins test windows

@ionutbalutoiu
Contributor

jenkins test windows

@petrutlucian94
Contributor

The windows job was failing because of some leftover apt sources:

E: Failed to fetch https://4.chacra.ceph.com/r/cortx-motr/master/39f89fa1c6945040433a913f2687c4b4e6cbeb3f/ubuntu/jammy/flavors/default/dists/jammy/main/binary-amd64/Packages  404  Not Found [IP: 158.69.64.56 443]

It's a bit unfortunate that CI jobs aren't isolated in containers or vms, thus subsequent jobs can be impacted if certain files are leaked by the previous jobs.

In the meantime, my colleague @ionutbalutoiu has manually cleaned up some of the Jenkins slaves.

@tchaikov
Contributor

the cortx-motr repo hosted on chacra was wiped out somehow.

@djgalloway

> the cortx-motr repo hosted on chacra was wiped out somehow.

If you don't want them wiped after two weeks, the packages need to be pushed to chacra.ceph.com instead. This is what we do for libboost, for example.

@petrutlucian94
Contributor

jenkins test windows

Contributor

@petrutlucian94 petrutlucian94 left a comment

The code looks good, the formatting issues have been addressed. The CI tests are passing and we thoroughly tested the patch as well.

This change only affects Windows: the HAVE_POLL definition is currently guarded by a Windows platform check. Given that, we should merge this fix as soon as possible.

@neha-ojha neha-ojha requested review from rzarzynski and removed request for rzarzynski November 11, 2022 15:51
@neha-ojha
Copy link
Member

@rzarzynski It would be great if you could take a look at this PR before we merge it. No teuthology testing is needed for it.

@petrutlucian94 petrutlucian94 merged commit 9fcc474 into ceph:main Nov 17, 2022
Labels
build/ops core win32 Specifix changes for the windows platform