osd/PG: store osd_async_recovery_min_cost in pg_info_t #43534

Closed
wants to merge 1 commit into from

Conversation

YiteGu
Contributor

@YiteGu YiteGu commented Oct 14, 2021

A PG can peer forever when async recovery is triggered and a single OSD's
osd_async_recovery_min_cost config value differs from that of its peer OSDs.

For example, with up [9, 8, 27] and acting [9, 8, 27]: osd_async_recovery_min_cost
is 20 on osd.9 but 100 on osd.8 and osd.27. If osd.9 goes down and restarts after
some time, osd.9 computes acting [8, 27] because it chooses async recovery. osd.8
then becomes primary, but osd.9 is not erased from osd.8's want set, so osd.8
reports pg_temp [9, 8, 27] to the monitor, and the PG peers forever.

Fix: select the minimum value of osd_async_recovery_min_cost across all peer shards
when choosing async recovery.
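
A minimal, self-contained sketch of that selection rule; the names here are illustrative stand-ins rather than the actual patch, and the real check is the candidates_by_cost comparison in choose_async_recovery_replicated quoted later in this conversation:

    // Hypothetical sketch: take the minimum osd_async_recovery_min_cost
    // across the primary's own value and every peer shard, so all OSDs
    // evaluate async recovery against the same threshold.
    #include <algorithm>
    #include <cstdint>
    #include <map>

    struct peer_info_stub {              // stand-in for the extended pg_info_t
      uint64_t async_recovery_min_cost;  // the peer's local config value
    };

    uint64_t min_async_recovery_cost(
        uint64_t my_value,                                  // primary's own config value
        const std::map<int, peer_info_stub>& peer_infos) {  // keyed by a shard-id stand-in
      uint64_t min_cost = my_value;
      for (const auto& [shard, info] : peer_infos)
        min_cost = std::min(min_cost, info.async_recovery_min_cost);
      return min_cost;
    }

    // A shard becomes an async-recovery candidate only when its approximate
    // missing-object count exceeds this shared minimum.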

Fixes: https://tracker.ceph.com/issues/52925
Signed-off-by: Yite Gu <ess_gyt@qq.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@github-actions github-actions bot added the core label Oct 14, 2021
@YiteGu
Contributor Author

YiteGu commented Oct 15, 2021

@neha-ojha hi Ojha, please take a look at my patch when you have time :)

@neha-ojha neha-ojha self-requested a review October 20, 2021 18:30
@YiteGu
Contributor Author

YiteGu commented Oct 22, 2021

I want to add a new member to the PG class instead:
map<pg_shard_t, uint64_t> peer_armc;
It would store each peer's osd_async_recovery_min_cost rather than storing the value in pg_info_t;
a replica would carry its osd_async_recovery_min_cost in pg_notify_t when it sends its notify to the primary.
What do you think? @neha-ojha
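
A rough, self-contained model of this alternative; peer_armc and the notify field name are taken from the comment above but remain hypothetical, not the actual patch:

    // Hypothetical model: the primary records each replica's
    // osd_async_recovery_min_cost reported via its notify message, then
    // uses the minimum of its own value and all recorded peer values.
    #include <algorithm>
    #include <cstdint>
    #include <map>

    using shard_id = int;                    // stand-in for pg_shard_t

    // Filled on the primary as notifies arrive, e.g.
    // peer_armc[from] = notify.async_recovery_min_cost;
    std::map<shard_id, uint64_t> peer_armc;

    uint64_t effective_min_cost(uint64_t my_value) {  // my_value: primary's config value
      uint64_t min_cost = my_value;
      for (const auto& [shard, cost] : peer_armc)
        min_cost = std::min(min_cost, cost);
      return min_cost;
    }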

Member

@neha-ojha neha-ojha left a comment

From the description in the commit message and in the tracker, it sounds like we are trying to make this change to address a very specific use case, where min cost on one OSD is intentionally set to a different value. The result observed in the tracker is a known behavior. Ideally we do not recommend changing the default value of osd_async_recovery_min_cost (it is an advanced config option), and if one needs to change it, they should apply it to all OSDs.

Are there any other advantages of this PR?

@YiteGu
Contributor Author

YiteGu commented Oct 28, 2021

osd_async_recovery_min_cost controls how much asynchronous recovery is done: a higher value means less asynchronous recovery, whereas a lower value means more. In some cases, users want to adjust this value at runtime to further reduce the impact of normal recovery.
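
Illustration only, using the numbers discussed later in this conversation: with roughly 113 approximate missing objects, a threshold of 100 makes the shard an async-recovery candidate, while a threshold of 200 keeps it in normal recovery.

    // Toy illustration of how the threshold gates async recovery; the
    // numbers (113 missing objects, thresholds 100 and 200) come from
    // the test case later in this conversation.
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t approx_missing_objects = 113;
      for (uint64_t min_cost : {100, 200}) {
        std::cout << "osd_async_recovery_min_cost=" << min_cost << ": "
                  << (approx_missing_objects > min_cost
                          ? "async recovery candidate"
                          : "normal recovery")
                  << "\n";
      }
    }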

"one needs to change it, they should apply to all osds", good idea. I only want to avoid happend accident about this PR.

@YiteGu
Contributor Author

YiteGu commented Oct 28, 2021

I really did set it to a different value intentionally :( I didn't expect it to have such a heavy impact. I think that even if it is set to different values, we should have a way to let it work normally. What do you think, professor Ojha?

@YiteGu
Contributor Author

YiteGu commented Nov 1, 2021

After a simple investigation: ceph tell osd.* config set osd_async_recovery_min_cost can apply the value to all OSDs, but if one OSD restarts, its value can still end up different from the other OSDs.

I'm thinking that, as you say, changing osd_async_recovery_min_cost is not recommended. Could we remove osd_async_recovery_min_cost and write a fixed value in the code instead? @neha-ojha

@neha-ojha
Member

After a simple investigation: ceph tell osd.* config set osd_async_recovery_min_cost can apply the value to all OSDs, but if one OSD restarts, its value can still end up different from the other OSDs.

Can you please write a test case to demonstrate the problems this could cause?

I'm thinking that, as you say, changing osd_async_recovery_min_cost is not recommended. Could we remove osd_async_recovery_min_cost and write a fixed value in the code instead? @neha-ojha

Not sure I understand what you mean by this

@YiteGu
Contributor Author

YiteGu commented Nov 2, 2021

up [30, 11, 16] acting [30, 11, 16]
Test: ceph tell osd.* config set osd_async_recovery_min_cost

  1. Set osd_async_recovery_min_cost for all OSDs:
     ceph tell osd.* config set osd_async_recovery_min_cost 200
  2. Check the config of osd.30:
     ceph config show osd.30 osd_async_recovery_min_cost
     200
  3. The osd.30 daemon goes down.
  4. The cluster takes some write traffic.
  5. The osd.30 daemon starts again; its osd_async_recovery_min_cost is back to the default 100.
  6. Say approx_missing_objects is 113 on osd.30; this produces pg_temp [11, 16].
  7. On osd.11: because approx_missing_objects of the up_primary exceeds osd_force_auth_primary_missing_objects, the want set is [11, 30, 16], and pg_temp changes to [11, 30, 16].
  8. osd_async_recovery_min_cost on osd.11 is 200, so it does normal recovery.
up_primary: 30) has approximate 113(>100) missing objects, osd.11 selected as primary instead
calc_replicated_acting primary is osd.11 with 25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)
 osd.30 (up) accepted 25.188( v 1963708'36604 (11466'32057,1963708'36604] local-lis/les=1963858/1963859 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)
 osd.16 (up) accepted 25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963)

2021-11-02 14:33:33.992 7fe0af48b700 20 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_async_recovery_replicated candidates by cost are:
2021-11-02 14:33:33.992 7fe0af48b700 20 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_async_recovery_replicated result want=[11,30,16] async_recovery=
2021-11-02 14:33:33.992 7fe0af48b700 10 osd.11 pg_epoch: 1963963 pg[25.188( v 1963961'36717 (1959382'33700,1963961'36717] local-lis/les=1963960/1963961 n=32 ec=11459/11459 lis/c 1963960/1963858 les/c/f 1963961/1963859/0 1963962/1963963/1963963) [30,11,16]/[11,16] r=0 lpr=1963963 pi=[1963858,1963963)/1 crt=1963961'36717 lcod 1963961'36716 mlcod 0'0 remapped+peering mbc={}] choose_acting want [11,30,16] != acting [11,16], requesting pg_temp change

Conclusion: if osd_force_auth_primary_missing_objects differs between peers, it also causes the PG to peer forever. So what I mean is:

    if (static_cast<uint64_t>(approx_missing_objects)  >
       cct->_conf.get_val<uint64_t>("osd_async_recovery_min_cost")) {
      candidates_by_cost.emplace(approx_missing_objects, shard_i);
    }

change to

    if (static_cast<uint64_t>(approx_missing_objects)  > 100) {
      candidates_by_cost.emplace(approx_missing_objects, shard_i);
    }

Even if the user has changed osd_async_recovery_min_cost to inconsistent values, it would have no impact. @neha-ojha

@YiteGu
Contributor Author

YiteGu commented Dec 2, 2021

@neha-ojha hmm~ After thinking about it carefully: config options such as osd_async_recovery_min_cost or osd_force_auth_primary_missing_objects cannot be modified carelessly; administrators must know what they are doing. Please close this PR.

@neha-ojha
Member

@neha-ojha hmm~ After thinking about it carefully: config options such as osd_async_recovery_min_cost or osd_force_auth_primary_missing_objects cannot be modified carelessly; administrators must know what they are doing. Please close this PR.

exactly, thanks for your patience!

@neha-ojha neha-ojha closed this Dec 7, 2021