src/osd/PrimaryLogPG.cc: add check for norecover flag in maybe_kick_r…#54212

Closed
amathuria wants to merge 1 commit intoceph:mainfrom
amathuria:wip-amathuria-bz-2134786-fix-norecover

Conversation

@amathuria
Contributor

@amathuria amathuria commented Oct 26, 2023

…ecovery

Adds a check for the norecover flag in maybe_kick_recovery(). maybe_kick_recovery() starts recovery for a particular object when there is an attempt to read a missing object or write to a degraded object. However, this workflow currently does not check the norecover flag before starting recovery for an object.

Fixes: bz#2134786


Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
@amathuria amathuria requested a review from a team as a code owner October 26, 2023 13:15
@github-actions github-actions bot added the core label Oct 26, 2023
    return;

  if (osd->recovery_is_paused()) {
    dout(10) << "In maybe_kick_recovery: NORECOVER flag is set" << dendl;
Contributor

How frequent is the call to maybe_kick_recovery()? Will this spam the logs?

@athanatos
Contributor

Hmm, maybe_kick_recovery() is used mainly in the case where an IO has arrived and is blocked on a degraded or missing object. It's really not clear to me that norecover should apply in that situation. The goal of norecover is typically to mitigate short-term throughput/latency issues with client IO. With this change, client IOs on those objects would hang indefinitely, which seems worse.

@amathuria
Contributor Author

Hmm, maybe_kick_recovery() is used mainly in the case where an IO has arrived and is blocked on a degraded or missing object. It's really not clear to me that norecover should apply in that situation. The goal of norecover is typically to mitigate short-term throughput/latency issues with client IO. With this change, client IOs on those objects would hang indefinitely, which seems worse.

@athanatos I have added some more details in this tracker, https://tracker.ceph.com/issues/63334, in addition to the BZ. Basically, the issue was that recovery was happening while the norecover flag was set and client IO was in progress. From the logs it was evident that there were degraded objects: the autoscaler had kicked in and PG splitting was taking place, which was causing maybe_kick_recovery() to schedule recovery.
I think what you're saying makes sense, so should we leave things the way they are and document this? Or do you have another suggestion for how this can be addressed?

@athanatos
Contributor

It really doesn't make sense to increase the number of PGs with norecover or nobackfill set. Each newly split PG will need to be backfilled to a new location. It will also trigger peering, and therefore leave a few objects degraded that were being written to when the map change happened.

  1. If there is documentation that recommends setting nobackfill or norecover while increasing pg_num, remove it. We should probably prevent pg_num from being changed while nobackfill or norecover are set.
  2. The documentation for increasing pg_num should explicitly warn that it will trigger backfill/recovery IO proportional to the number of affected PGs. To limit the impact on client IO, pg_num should be changed in very small increments, waiting until the cluster is active+clean each time.
  3. We should document that IOs on degraded/missing objects will still trigger recovery even with norecover set -- the alternative would be for IO to hang completely.

@amathuria
Contributor Author

It really doesn't make sense to increase the number of PGs with norecover or nobackfill set. Each newly split PG will need to be backfilled to a new location. It will also trigger peering, and therefore leave a few objects degraded that were being written to when the map change happened.

1. If there is documentation that recommends setting nobackfill or norecover while increasing pg_num, remove it. We should probably prevent pg_num from being changed while nobackfill or norecover are set.

2. The documentation for increasing pg_num should explicitly warn that it will trigger backfill/recovery IO proportional to the number of affected PGs. To limit the impact on client IO, pg_num should be changed in _very small_ increments, waiting until the cluster is active+clean each time.

3. We should document that IOs on degraded/missing objects will still trigger recovery even with norecover set -- the alternative would be for IO to hang completely.

Thanks for these points! In the case where this issue was raised, the autoscaler was kicking in and increasing the number of PGs. Does it make sense to mention that as well?

@athanatos
Contributor

athanatos commented Oct 30, 2023

@amathuria Ah, nice observation. The autoscaler should not run if norecover or nobackfill are set. If it does, that's the most important thing to fix.

@amathuria amathuria closed this Nov 3, 2023
@amathuria
Contributor Author

@athanatos thanks for clearing that up. I will close this PR and open a new one for the autoscaler change and probably another one for all the potential documentation changes.

