
mon/OSDMonitor: do not propose on error in prepare_update #50502

Merged
merged 1 commit into ceph:main from i58972 on Aug 28, 2023

Conversation

@batrick (Member) commented Mar 13, 2023:

Fixes: https://tracker.ceph.com/issues/58972

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@batrick added the core label Mar 13, 2023
@github-actions bot added the mon label Mar 13, 2023
@batrick force-pushed the i58972 branch 2 times, most recently from f048021 to 863a45d, April 6, 2023 12:56
@batrick marked this pull request as ready for review April 6, 2023 12:56
@batrick requested a review from a team as a code owner April 6, 2023 12:56
@github-actions (bot):
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved


@ljflores (Contributor):

jenkins test make check

@ljflores (Contributor):

Hey @batrick, I'm seeing some failures that seem related to this PR. I'm not entirely sure, though, so I've asked Yuri to drop this PR from the batch and rebase so I can confirm. In the meantime, can you check whether these jobs look related to your changes?

http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311772/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311736/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311737/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311739/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311738/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311749/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311750/
http://pulpito.front.sepia.ceph.com/yuriw-2023-06-22_15:05:06-rados-wip-yuri7-testing-2023-06-12-1220-distro-default-smithi/7311752/

(There are several more failures not listed here, but it seemed redundant to list them all.) It looks like a lot, but they all have one thing in common: every failure happens while creating erasure-coded pools. I'm guessing a mon command involved in creating erasure pools has a problem.

  case -EAGAIN:
    wait_for_finished_proposal(op, new C_RetryMessage(this, op));
-   return true;
+   return false;
@batrick (Member Author):

@ljflores @rzarzynski The reason for failure 7311772 in

#50502 (comment)

is this line. The method OSDMonitor::prepare_pool_crush_rule actually changes the crush rules and then signals -EAGAIN. ::prepare_command_impl is expected to return true; here to trigger a Paxos proposal and then retry the pool creation.

This is really strange to me. Is it actually necessary to split this pool creation into two OSDMap epochs?
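
To make that flow concrete, here is a minimal, self-contained toy model of the propose-and-retry pattern described above. ToyMon, prepare_rule, prepare_create, and commit are illustrative stand-ins, not Ceph source: the prepare step stages a change in the pending map, signals -EAGAIN, and the caller must return true so the staged change is proposed before the command is retried.

// Toy model only; names are hypothetical, not the actual Ceph interfaces.
#include <cerrno>
#include <cstdio>
#include <map>
#include <string>

struct ToyMon {
  std::map<std::string, int> committed_rules;  // rules in the committed map
  std::map<std::string, int> pending_rules;    // rules staged for proposal

  // Models prepare_pool_crush_rule: stages a rule, then asks for a retry.
  int prepare_rule(const std::string& pool) {
    if (committed_rules.count(pool))
      return 0;                  // rule already visible; creation can proceed
    pending_rules[pool] = 1;     // side effect staged in the pending map
    return -EAGAIN;              // caller must propose before retrying
  }

  // Models the tail of prepare_command_impl: true means "propose pending".
  bool prepare_create(const std::string& pool) {
    switch (prepare_rule(pool)) {
    case 0:
      std::printf("creating pool %s\n", pool.c_str());
      return true;               // propose the pool creation itself
    case -EAGAIN:
      return true;               // propose the staged rule; retry later
    default:
      return false;              // real error: do not propose
    }
  }

  // Models a Paxos proposal committing the pending incremental.
  void commit() {
    committed_rules.insert(pending_rules.begin(), pending_rules.end());
    pending_rules.clear();
  }
};

int main() {
  ToyMon mon;
  mon.prepare_create("unique_pool_0");  // stages the rule, asks to propose
  mon.commit();                         // epoch 1: rule committed
  mon.prepare_create("unique_pool_0");  // retry succeeds; pool created
  mon.commit();                         // epoch 2: pool committed
}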

@batrick (Member Author):

2023-06-22T15:48:31.674+0000 7f79267f3700  1 -- v1:172.21.15.182:6789/0 <== client.4347 v1:172.21.15.182:0/3391191032 8 ==== mon_command({"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"} v 0) v1 ==== 191+0+0 (unknown 3185792114 0 0) 0x558ec2cf4900 con 0x558ec2f0cc00
2023-06-22T15:48:31.674+0000 7f79267f3700 20 mon.a@0(leader) e1 _ms_dispatch existing session 0x558ec30a2b40 for client.?
2023-06-22T15:48:31.674+0000 7f79267f3700 20 mon.a@0(leader) e1  entity_name client.admin global_id 4347 (new_ok) caps allow *
2023-06-22T15:48:31.674+0000 7f79267f3700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"} v 0) v1
2023-06-22T15:48:31.674+0000 7f79267f3700 20 MonCap is_capable service=osd command=osd pool create read write addr v1:172.21.15.182:0/3391191032 on cap allow *
2023-06-22T15:48:31.674+0000 7f79267f3700 20 MonCap  allow so far , doing grant allow *
2023-06-22T15:48:31.674+0000 7f79267f3700 20 MonCap  allow all
2023-06-22T15:48:31.674+0000 7f79267f3700 10 mon.a@0(leader) e1 _allowed_command capable
2023-06-22T15:48:31.674+0000 7f79267f3700  0 log_channel(audit) log [INF] : from='client.? v1:172.21.15.182:0/3391191032' entity='client.admin' cmd=[{"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"}]: dispatch
2023-06-22T15:48:31.674+0000 7f79267f3700  1 -- v1:172.21.15.182:6789/0 --> v1:172.21.15.182:6789/0 -- log(1 entries from seq 78 at 2023-06-22T15:48:31.675841+0000) v1 -- 0x558ec3060a80 con 0x558ec1bd1800
2023-06-22T15:48:31.674+0000 7f79267f3700 10 mon.a@0(leader).paxosservice(osdmap 1..13) dispatch 0x558ec2cf4900 mon_command({"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"} v 0) v1 from client.4347 v1:172.21.15.182:0/3391191032 con 0x558ec2f0cc00
2023-06-22T15:48:31.674+0000 7f79267f3700  5 mon.a@0(leader).paxos(paxos active c 1..51) is_readable = 1 - now=2023-06-22T15:48:31.675871+0000 lease_expire=2023-06-22T15:48:36.349426+0000 has v0 lc 51
2023-06-22T15:48:31.674+0000 7f79267f3700 10 mon.a@0(leader).osd e13 preprocess_query mon_command({"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"} v 0) v1 from client.4347 v1:172.21.15.182:0/3391191032
2023-06-22T15:48:31.674+0000 7f79267f3700  7 mon.a@0(leader).osd e13 prepare_update mon_command({"prefix": "osd pool create", "pool": "unique_pool_0", "pg_num": 1, "pgp_num": 1, "pool_type": "erasure", "erasure_code_profile": "backfill_toofull"} v 0) v1 from client.4347 v1:172.21.15.182:0/3391191032
2023-06-22T15:48:31.674+0000 7f79267f3700  1 mon.a@0(leader).osd e13 implicitly use rule named after the pool: unique_pool_0
2023-06-22T15:48:31.674+0000 7f79267f3700 10 mon.a@0(leader).osd e13 prepare_pool_crush_rule returns -11

from the mon log (note: the return value -11 is -EAGAIN, matching the retry path discussed above)

@batrick (Member Author):

I took another look. There doesn't seem to be a good reason, except that some code would need to be adjusted to look at the pending incremental OSDMap for the erasure code profiles.

For the purposes of this PR I will undo this change but add a comment protesting the current state of the code :)
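
A rough sketch of the kind of adjustment described above (PendingInc, CommittedMap, and lookup_profile are hypothetical stand-ins, not the actual Ceph types): resolving an erasure code profile against the pending incremental first would make a just-staged profile visible within the same prepare cycle, avoiding the second OSDMap epoch.

// Hypothetical sketch; illustrative names only, not Ceph's real API.
#include <map>
#include <optional>
#include <string>

using Profile = std::map<std::string, std::string>;

struct PendingInc {                   // stands in for the pending OSDMap incremental
  std::map<std::string, Profile> new_erasure_code_profiles;
};

struct CommittedMap {                 // stands in for the committed OSDMap
  std::map<std::string, Profile> erasure_code_profiles;
};

// Check the pending incremental before falling back to the committed map,
// so a profile staged earlier in the same prepare cycle is found immediately.
std::optional<Profile> lookup_profile(const std::string& name,
                                      const PendingInc& pending_inc,
                                      const CommittedMap& osdmap) {
  if (auto it = pending_inc.new_erasure_code_profiles.find(name);
      it != pending_inc.new_erasure_code_profiles.end())
    return it->second;                // staged but not yet committed
  if (auto it = osdmap.erasure_code_profiles.find(name);
      it != osdmap.erasure_code_profiles.end())
    return it->second;                // already committed
  return std::nullopt;                // unknown profile
}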

@ljflores requested a review from rzarzynski June 28, 2023 22:09
@ljflores (Contributor):

@rzarzynski requesting your review if you have a moment.

@rzarzynski (Contributor):

Review-in-progress.

@ljflores (Contributor):

jenkins test api

@ljflores (Contributor):

Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomcbn3IMWEB1783-wip-yuri7-testing-2023-06-23-1022-old-wip-yuri7-testing-2023-06-12-1220-old-wip-yuri7-testing-2023-06-09-1607

Apologies, this comment was confusing. I had finished that batch but forgot this PR had been dropped from it. It needs to be added to a new batch.

@batrick (Member Author) commented Aug 3, 2023:

jenkins test api

Fixes: https://tracker.ceph.com/issues/58972
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick merged commit 8043b3d into ceph:main Aug 28, 2023
11 checks passed
@batrick deleted the i58972 branch August 28, 2023 13:29