osd: avoid out-of-bound vector access in PeeringState::init_hb_stamps #51160
When tricking the cluster with the 'primary-temp' command coupled with OSD addition/removal, an OSD might get into a state where it considers itself primary for a certain PG while the received acting set doesn't include this OSD. As a result, PeeringState::init_hb_stamps() might allocate too short an hb_stamps vector and erroneously access out-of-bound entries. E.g. primary OSD.0 gets acting set [3,5,1], performs hb_stamps.resize(2), and then tries to add three hb stamps, since no primary entry is skipped in the loop over the acting set.
Fixes: https://tracker.ceph.com/issues/59491
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
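The size mismatch described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual patch and not the real PeeringState code: `collect_hb_peers` and `kNoOsd` are hypothetical names, and plain OSD ids stand in for the hb stamp references. Instead of pre-sizing the vector to `acting.size() - 1` and writing by index, it grows the vector as entries pass the filter, so the result stays in bounds even when the primary is absent from the acting set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical placeholder for an empty acting-set slot (an assumption of
// this sketch, not taken from the pull request).
constexpr int kNoOsd = 0x7fffffff;

// Hypothetical model of the loop described above: return the OSD ids the
// primary would track heartbeat stamps for. Empty slots and the primary
// itself are skipped.
std::vector<int> collect_hb_peers(int primary, const std::vector<int>& acting) {
  std::vector<int> peers;
  // Reserve for the worst case: the primary is NOT in the acting set,
  // so every non-empty entry is kept.
  peers.reserve(acting.size());
  for (int osd : acting) {
    if (osd == kNoOsd || osd == primary) {
      continue;
    }
    // push_back grows the vector as needed, so no index can run past the end.
    peers.push_back(osd);
  }
  assert(peers.size() <= acting.size());
  return peers;
}
```

With primary 0 and acting set [3,5,1], this returns three entries, one more than the `acting.size() - 1 == 2` slots that the pre-sized variant would have allocated. With primary 3 and the same acting set, it returns [5,1].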
jenkins test api |
jenkins test windows |
I backported this fix to 16.2.11 and could not reproduce the issue anymore. |
@rzarzynski mind reviewing? |
From your description, it seems like the steps to reproduce would be something like:
The result would be that the pg above has an acting set not containing 0, but get_primary().osd still evaluates to 0? That's a larger problem than this segfault -- it means that the primary is no longer a member of the acting set. Can you attach an osdmap with this property to the bug? I'd like to take a look. |
osdmap_after_osd14_remove.zip
I reproduced this on a Pacific 16.2.9 cluster. I will try to attach the osdmaps to this issue: osdmap_no_pg_temp is the OSDmap before any changes were made, osdmap_pgtemp is the OSDmap after the pgtemp changes were made, and osdmap_after_osd14_remove is the OSDmap after osd.14 was set out and purged from the cluster. That was also the moment the bug was triggered. I have logs from these OSDs, so if you want to see any of them, or need other info, please let me know. Note: I added the ".zip" extension to the files to be able to upload them, but they are actually just the binary files osdmaptool produces. |
@hydro-b I uploaded them to the bug. I probably won't have time to look at this this week, but I'll try to get to it when I get back from vacation the week of the 20th. |
Apologies, I still haven't had time to look at this. I hope to get to it in the next two weeks. |
Sure, no problem. Just let me know if you need more info or access to a machine that is affected.
I will be on PTO starting August 10th. Let me know if you need more information, or access to an affected system. |
@hydro-b I just responded on the bug -- I'm not sure the maps you attached are the right ones. |
I updated the Redmine tracker ticket. The attached maps are from the affected cluster and are all I currently have. |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |