
osd: avoid out-of-bound vector access in PeeringState::init_hb_stamps #51160

Closed

Conversation

@ifed01 (Contributor) commented Apr 20, 2023

When manipulating the cluster with the 'primary-temp' command in combination with OSD addition/removal, an OSD can end up in a state where it considers itself the primary for a certain PG while the acting set it receives does not include that OSD. As a result, PeeringState::init_hb_stamps() may allocate an hb_stamps vector that is too short and erroneously access out-of-bound entries. E.g. primary OSD.0 gets acting set [3,5,1], performs hb_stamps.resize(2), and then tries to add three hb stamps because no primary entry is skipped in the loop over the acting set.

Fixes: https://tracker.ceph.com/issues/59491

Signed-off-by: Igor Fedotov igor.fedotov@croit.io
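
For readers unfamiliar with the code path, here is a minimal, self-contained sketch of the sizing mismatch described above. It is not the actual PeeringState code: the function names, the Stamp alias, and the "safe" variant are simplified stand-ins (the fix proposed in this PR may take a different shape), but it shows how resizing to acting.size() - 1 while the loop skips nothing leads to an out-of-bound write when the primary is absent from the acting set.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in: the real code is PeeringState::init_hb_stamps() in
// src/osd/PeeringState.cc and works with heartbeat stamp references, not ints.
using Stamp = int;

// Buggy shape: sizes the vector assuming the primary is always a member of
// the acting set, then fills it while skipping the primary's entry.
std::vector<Stamp> init_hb_stamps_buggy(const std::vector<int>& acting, int primary)
{
  std::vector<Stamp> hb_stamps(acting.size() - 1);  // too short if primary is absent
  size_t i = 0;
  for (int osd : acting) {
    if (osd == primary)
      continue;              // skips nothing when the primary is not in acting...
    hb_stamps[i++] = osd;    // ...so the last iteration writes out of bounds
  }
  return hb_stamps;
}

// One possible defensive shape: size to the full acting set and trim to the
// number of entries actually written.
std::vector<Stamp> init_hb_stamps_safe(const std::vector<int>& acting, int primary)
{
  std::vector<Stamp> hb_stamps(acting.size());
  size_t i = 0;
  for (int osd : acting) {
    if (osd == primary)
      continue;
    hb_stamps[i++] = osd;
  }
  hb_stamps.resize(i);       // keep only the entries that were filled in
  return hb_stamps;
}

int main()
{
  // Primary OSD.0 receives acting set [3,5,1] that does not contain it:
  // the buggy variant would write three stamps into a two-element vector.
  auto stamps = init_hb_stamps_safe({3, 5, 1}, /*primary=*/0);
  assert(stamps.size() == 3);
  return 0;
}
```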


@ifed01 ifed01 requested a review from a team as a code owner April 20, 2023 10:31
@theanalyst theanalyst assigned theanalyst and unassigned theanalyst Apr 20, 2023
@ifed01 ifed01 changed the title osd: avoid out-of-bound vector access on PeeringState::init_hb_stamps osd: avoid out-of-bound vector access in PeeringState::init_hb_stamps Apr 20, 2023
@ifed01 (Contributor, Author) commented Apr 24, 2023

jenkins test api

@ifed01 (Contributor, Author) commented Apr 24, 2023

jenkins test windows

@hydro-b (Contributor) commented Apr 27, 2023

I backported this fix to 16.2.11 and could not reproduce the issue anymore.

@ifed01 ifed01 requested a review from rzarzynski June 2, 2023 12:02
@ifed01 (Contributor, Author) commented Jun 2, 2023

@rzarzynski mind reviewing?

@athanatos (Contributor) commented Jun 2, 2023

From your description, it seems like the steps to reproduce would be something like:

  • create cluster with osds 0,...,3
  • set a pg with a replica on 0 to use 0 as the primary via primary-temp
  • stop osd 0 and let it be marked down

The result would be that the pg above has an acting set not containing 0, but get_primary().osd still evaluates to 0?

That's a larger problem than this segfault -- it means that the primary is no longer a member of the acting set. Can you attach an osdmap with this property to the bug? I'd like to take a look.

@hydro-b (Contributor) commented Jun 7, 2023

osdmap_after_osd14_remove.zip
osdmap_pgtemp.zip
osdmap_no_pg_temp.zip

> From your description, it seems like the steps to reproduce would be something like:
>
>   • create cluster with osds 0,...,3
>   • set a pg with a replica on 0 to use 0 as the primary via primary-temp
>   • stop osd 0 and let it be marked down
>
> The result would be that the pg above has an acting set not containing 0, but get_primary().osd still evaluates to 0?
>
> That's a larger problem than this segfault -- it means that the primary is no longer a member of the acting set. Can you attach an osdmap with this property to the bug? I'd like to take a look.

I reproduced this on a Pacific 16.2.9 cluster. I will try to attach the osdmaps to this issue: osdmap_no_pg_temp is the OSDmap before any changes were made, osdmap_pgtemp is the OSDmap after the pg-temp changes were made, and osdmap_after_osd14_remove is the OSDmap after osd.14 was set out and purged from the cluster. That was also the moment the bug was triggered. I have logs of these OSDs, so if you want to see any of them, please let me know. Also let me know if you need other info.

Note: I added the ".zip" extension to the files to be able to upload them, but they are actually just the binaries that osdmaptool produces.

@athanatos (Contributor) commented

@hydro-b I uploaded them to the bug. I probably won't have time to look at this this week, but I'll try to get to it when I get back from vacation the week of the 20th.

@athanatos (Contributor) commented

Apologies, I still haven't had time to look at this. I hope to get to it in the next two weeks.

@hydro-b (Contributor) commented Aug 2, 2023

> Apologies, I still haven't had time to look at this. I hope to get to it in the next two weeks.

Sure, no problem. Just let me know if you need more info or access to a machine that is affected.

I will be on PTO starting August 10th. Let me know if you need more information, or access to an affected system.

@athanatos (Contributor) commented

@hydro-b I just responded on the bug -- I'm not sure the maps you attached are the right ones.

@hydro-b (Contributor) commented Aug 10, 2023

> @hydro-b I just responded on the bug -- I'm not sure the maps you attached are the right ones.

I updated the redmine tracker ticket. The attached maps are from the affected cluster and currently all I have.

github-actions bot commented Oct 9, 2023

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Oct 9, 2023
github-actions bot commented Nov 8, 2023

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Nov 8, 2023