-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
msg: add new async event driver based on poll() #46525
Conversation
Driver to replace select() where useful, currently this is windows clients as select is the only available driver for it. Windows is limited by the FD_SETSIZE hard limit of 64 descriptors. This driver Uses poll() or WSAPoll() and maintains pollfd structures to overcome select() limitations. Fixes: https://tracker.ceph.com/issues/55840 Signed-off-by: Rafael Lopez <rafael.lopez@softiron.com>
Hello everyone, I tried to benchmark the performance impact of this change, using the benchmarking methodology described in this Cloudbase blog post. I ran 5 consecutive benchmarks (with 1000 RBD volumes / each benchmark) for each use case. The are the results for each use case tested:
We notice that:
Therefore, I didn't find any performance impact of this change in my testing efforts. |
@idryomov Might you be the right person to review this? I'm not really sure which team/lead would be best. |
@djgalloway The labeler is never wrong -- this belongs to Core. |
@tchaikov thanks for pushing the recent Windows related fixes. could you please review this one as well? |
@neha-ojha @jdurgin Can either of you please do a code review or delegate someone? Would love to get this merged. Thanks! |
@petrutlucian94 sorry, i don't have enough bandwidth reviewing this change. just skimmed through it, though. there are quite a few formatting issues. |
Signed-off-by: Rafael Lopez <rafael.lopez@softiron.com>
jenkins test make check arm64 |
jenkins test windows |
@petrutlucian94 are you familiar with the jenkins tests? I don't think the 'ceph windows tests' failure is an issue with the code based on console output. |
indeed, it's unrelated. The job is spinning up a Windows vm using libvirt and apparently fails when trying to retrieve the IP address:
All recent jobs seem to have failed because of the same issue. @ionutbalutoiu any thoughts on this? |
I see that starting from the end of last week, this error appeared constantly on all the windows jobs. I'm trying to see what's wrong. |
It seems that the Jenkins job failed to properly get the VM IP address on some of the libvirt hosts from the Jenkins infra. Once that gets fixed, I'll retry the tests here. |
jenkins test windows |
3 similar comments
jenkins test windows |
jenkins test windows |
jenkins test windows |
Are these unrelated? |
I'm not sure. @rafalop @petrutlucian94 - Are these failures related to the code changes from here ? |
those are unrelated, I've just submitted a fix: #47818 |
jenkins test windows |
jenkins test windows |
The windows job was failing because of some leftover apt sources:
It's a bit unfortunate that CI jobs aren't isolated in containers or vms, thus subsequent jobs can be impacted if certain files are leaked by the previous jobs. In the meantime, my colleague @ionutbalutoiu has manually cleaned up some of the Jenkins slaves. |
the cortx-motr repo hosted on chacra was wiped out somehow. |
If you don't want them wiped after two weeks, the packages need to be pushed to chacra.ceph.com instead. This is what we do for libboost, for example. |
jenkins test windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, the formatting issues have been addressed. The CI tests are passing and we thoroughly tested the patch as well.
This change only affects Windows, the HAVE_POLL
definition is currently surrounded by a Windows platform check. That being considered, we should merge this fix as soon as possible.
@rzarzynski It will be great if you could take a look at this PR before we merge it. There is no teuthology testing needed on it. |
This is a new event driver for async IO based on poll(), intended to overcome the file descriptor limitations experienced by the select() based driver used by Windows clients.
In Windows, select() can only manage 64*async_op_threads file descriptors, limiting cluster OSD counts to around 200 with default config - see linked bug. Although there are better native event based IO methods for windows, using poll() allows us to keep an unmodified posix stack so it integrates easily.
This new poll() driver can also be used on other niche/legacy systems that provide poll and do not have anything better than select at the moment.
Fixes: https://tracker.ceph.com/issues/55840