osd/PG: store osd_async_recovery_min_cost in pg_info_t #43534

Closed
wants to merge 1 commit into from

Conversation

YiteGu
Contributor

@YiteGu YiteGu commented Oct 14, 2021

A PG can peer forever when async recovery is triggered and a single OSD's
osd_async_recovery_min_cost config value differs from that of its peer OSDs.

For example, with up [9, 8, 27] and acting [9, 8, 27]: osd_async_recovery_min_cost
is 20 on osd.9 but 100 on osd.8 and osd.27. If osd.9 goes down and restarts after
some time, osd.9 computes acting [8, 27] because it chooses async recovery. osd.8
then becomes primary, but osd.9 is not erased from osd.8's want set, so osd.8
reports pg_temp [9, 8, 27] to the monitor, and the PG peers forever.

Fix: select the minimum value of osd_async_recovery_min_cost across all peer shards
when choosing async recovery.
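
A minimal, self-contained sketch of that selection rule; the names here are illustrative stand-ins rather than the actual patch, and the real check is the candidates_by_cost comparison in choose_async_recovery_replicated quoted later in this conversation:

    // Hypothetical sketch: take the minimum osd_async_recovery_min_cost
    // across the primary's own value and every peer shard, so all OSDs
    // evaluate async recovery against the same threshold.
    #include <algorithm>
    #include <cstdint>
    #include <map>

    struct peer_info_stub {              // stand-in for the extended pg_info_t
      uint64_t async_recovery_min_cost;  // the peer's local config value
    };

    uint64_t min_async_recovery_cost(
        uint64_t my_value,                                  // primary's own config value
        const std::map<int, peer_info_stub>& peer_infos) {  // keyed by a shard-id stand-in
      uint64_t min_cost = my_value;
      for (const auto& [shard, info] : peer_infos)
        min_cost = std::min(min_cost, info.async_recovery_min_cost);
      return min_cost;
    }

    // A shard becomes an async-recovery candidate only when its approximate
    // missing-object count exceeds this shared minimum.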

Fixes: https://tracker.ceph.com/issues/52925
Signed-off-by: Yite Gu <ess_gyt@qq.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@github-actions github-actions bot added the core label Oct 14, 2021
@YiteGu
Contributor Author

YiteGu commented Oct 15, 2021

@neha-ojha hi Ojha, please take a look at my patch when you have time :)

@neha-ojha neha-ojha self-requested a review October 20, 2021 18:30
@YiteGu
Contributor Author

YiteGu commented Oct 22, 2021

I want to add a new member to the PG class instead:
map<pg_shard_t, uint64_t> peer_armc;
It would store each peer's osd_async_recovery_min_cost rather than storing the value in pg_info_t;
a replica would carry its osd_async_recovery_min_cost in pg_notify_t when it sends its notify to the primary.
What do you think? @neha-ojha
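
A rough, self-contained model of this alternative; peer_armc and the notify field name are taken from the comment above but remain hypothetical, not the actual patch:

    // Hypothetical model: the primary records each replica's
    // osd_async_recovery_min_cost reported via its notify message, then
    // uses the minimum of its own value and all recorded peer values.
    #include <algorithm>
    #include <cstdint>
    #include <map>

    using shard_id = int;                    // stand-in for pg_shard_t

    // Filled on the primary as notifies arrive, e.g.
    // peer_armc[from] = notify.async_recovery_min_cost;
    std::map<shard_id, uint64_t> peer_armc;

    uint64_t effective_min_cost(uint64_t my_value) {  // my_value: primary's config value
      uint64_t min_cost = my_value;
      for (const auto& [shard, cost] : peer_armc)
        min_cost = std::min(min_cost, cost);
      return min_cost;
    }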

Member

@neha-ojha neha-ojha left a comment

From the description in the commit message and in the tracker, it sounds like we are trying to make this change to address a very specific use case, where min cost on one OSD is intentionally set to a different value. The result observed in the tracker is a known behavior. Ideally we do not recommend changing the default value of osd_async_recovery_min_cost (it is an advanced config option), and if one needs to change it, they should apply it to all OSDs.

Are there any other advantages of this PR?

@YiteGu
Contributor Author

YiteGu commented Oct 28, 2021

osd_async_recovery_min_cost controls how much asynchronous recovery is done: a higher value means less asynchronous recovery, whereas a lower value means more. In some cases, users want to adjust this value at runtime to further reduce the impact of normal recovery.
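
Illustration only, using the numbers discussed later in this conversation: with roughly 113 approximate missing objects, a threshold of 100 makes the shard an async-recovery candidate, while a threshold of 200 keeps it in normal recovery.

    // Toy illustration of how the threshold gates async recovery; the
    // numbers (113 missing objects, thresholds 100 and 200) come from
    // the test case later in this conversation.
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t approx_missing_objects = 113;
      for (uint64_t min_cost : {100, 200}) {
        std::cout << "osd_async_recovery_min_cost=" << min_cost << ": "
                  << (approx_missing_objects > min_cost
                          ? "async recovery candidate"
                          : "normal recovery")
                  << "\n";
      }
    }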

"one needs to change it, they should apply to all osds", good idea. I only want to avoid happend accident about this PR.

@YiteGu
Contributor Author

YiteGu commented Oct 28, 2021

I really did set it to a different value intentionally :( I didn't expect it to have such a heavy impact. I think that even if it is set to different values, we should have a way to let it work normally. What do you think, professor Ojha?

@YiteGu
Contributor Author

YiteGu commented Nov 1, 2021

After a simple investigation: ceph tell osd.* config set osd_async_recovery_min_cost can apply the value to all OSDs, but if one OSD restarts, its value can still end up different from the other OSDs.

I'm thinking that, as you say, changing osd_async_recovery_min_cost is not recommended. Could we remove osd_async_recovery_min_cost and write a fixed value in the code instead? @neha-ojha

@neha-ojha
Member

After a simple investigation: ceph tell osd.* config set osd_async_recovery_min_cost can apply the value to all OSDs, but if one OSD restarts, its value can still end up different from the other OSDs.

Can you please write a test case to demonstrate the problems this could cause?

I'm thinking that, as you say, changing osd_async_recovery_min_cost is not recommended. Could we remove osd_async_recovery_min_cost and write a fixed value in the code instead? @neha-ojha

Not sure I understand what you mean by this

@YiteGu
Contributor Author

YiteGu commented Nov 2, 2021

up [30, 11, 16] acting [30, 11, 16]
Test: ceph tell osd.* config set osd_async_recovery_min_cost

  1. Set osd_async_recovery_min_cost for all OSDs:
     ceph tell osd.* config set osd_async_recovery_min_cost 200
  2. Check the config of osd.30:
     ceph config show osd.30 osd_async_recovery_min_cost
     200
  3. The osd.30 daemon goes down.
  4. The cluster takes some write traffic.
  5. The osd.30 daemon starts again; its osd_async_recovery_min_cost is back to the default 100.
  6. Say approx_missing_objects is 113 on osd.30; this produces pg_temp [11, 16].
  7. On osd.11: because approx_missing_objects of the up_primary exceeds osd_force_auth_primary_missing_objects, the want set is [11, 30, 16], and pg_temp changes to [11, 30, 16].
  8. osd_async_recovery_min_cost on osd.11 is 200, so it does normal recovery.
up_primary: 30) has approximate 113(>100) missing objects, osd.11 selected as primary instead
calc_replicated_acting primary is osd.11 with 25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)
 osd.30 (up) accepted 25.188( v 1963708'36604 (11466'32057,1963708'36604] local-lis/les=1963858/1963859 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)
 osd.16 (up) accepted 25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)

2021-11-02 14:33:33.992 7fe0af48b700 20 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_async_recovery_replicated candidates by cost are:
2021-11-02 14:33:33.992 7fe0af48b700 20 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_async_recovery_replicated result want=[11,30,16] async_recovery=
2021-11-02 14:33:33.992 7fe0af48b700 10 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_acting want [11,30,16] != acting [11,16], requesting pg_temp change

Conclusion: if osd_force_auth_primary_missing_objects differs between peers, it also causes the PG to peer forever. So what I mean is:

    if (static_cast<uint64_t>(approx_missing_objects)  >
       cct->_conf.get_val<uint64_t>("osd_async_recovery_min_cost")) {
      candidates_by_cost.emplace(approx_missing_objects, shard_i);
    }

change to

    if (static_cast<uint64_t>(approx_missing_objects)  > 100) {
      candidates_by_cost.emplace(approx_missing_objects, shard_i);
    }

Even if the user has changed osd_async_recovery_min_cost to inconsistent values, it would have no impact. @neha-ojha

@YiteGu
Contributor Author

YiteGu commented Dec 2, 2021

@neha-ojha hmm~ After thinking about it carefully: config options such as osd_async_recovery_min_cost or osd_force_auth_primary_missing_objects cannot be modified carelessly; administrators must know what they are doing. Please close this PR.

@neha-ojha
Member

@neha-ojha hmm~ After thinking about it carefully: config options such as osd_async_recovery_min_cost or osd_force_auth_primary_missing_objects cannot be modified carelessly; administrators must know what they are doing. Please close this PR.

exactly, thanks for your patience!

@neha-ojha neha-ojha closed this Dec 7, 2021