
osd/PG: introduce all_missing_unfound helper #27205

Merged
merged 2 commits into ceph:master from wip-38784 on Mar 30, 2019

Conversation

xiexingguo
Member

We use pg_log.missing to track each peer's missing objects separately,
whereas missing_loc records the locations of all (probably existing) good copies
of both the primary's and the replicas' missing objects. Hence items from
pg_log.missing and missing_loc have different meanings and are not directly comparable.

During recovery, we can skip recovering the primary only if

  • the primary is good, i.e., has no missing objects at all,
  • or all of the primary's missing objects do exist in missing_loc and are
    currently unfound.

Obviously, the current "all missing objects are unfound" check is broken.
Fix this by introducing an independent all_missing_unfound helper so that the
count of missing objects that are currently unfound is correct.

Fixes: http://tracker.ceph.com/issues/38784
Signed-off-by: xie xingguo xie.xingguo@zte.com.cn

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
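To make the description above concrete, here is a minimal sketch of the gating change in the primary recovery path, paraphrased from the snippets quoted later in this conversation (not the verbatim merged diff):

// Before: "all of the primary's missing objects are unfound" was inferred
// by comparing two unrelated counters.
if (num_missing == num_unfound) {
  // num_missing counts the primary's own pg_log missing items, while
  // num_unfound counts unfound entries in missing_loc, so equality does not
  // guarantee that every primary-missing object is unfound.
  started = recover_replicas(max, handle, &recovery_started);
}

// After: ask the question directly.
if (!missing.have_missing() ||  // primary has nothing missing at all,
    all_missing_unfound()) {    // or every primary-missing object is truly unfound
  started = recover_replicas(max, handle, &recovery_started);
}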

@xiexingguo
Member Author

@neha-ojha I ended up fixing http://tracker.ceph.com/issues/38784 in a slightly different way. If this fix looks good, would you like me to add your Signed-off-by line too (since it is mainly based on your analysis :-)?

if (num_missing == num_unfound) {
// All of the missing objects we have are unfound.
if (!missing.have_missing() || // Primary does not have missing
all_missing_unfound()) { // or all of the missing objects are unfound.

@xiexingguo Considering the scenario described in http://tracker.ceph.com/issues/38784, this will help recover the missing object on the primary before calling recover_replicas(), but wouldn't all_missing_unfound() return false (when we have 1 unfound object that is not accounted for in the missing), when checking d949713#diff-d43117703c33bebe41fc39224c8025adR1614?

@neha-ojha
Member

neha-ojha commented Mar 27, 2019

@neha-ojha I ended up fixing http://tracker.ceph.com/issues/38784 in a slightly different way. If this fix looks good, would you like me to add your Signed-off-by line too (since it is mainly based on your analysis :-)?

Thanks for putting up a PR to fix this, it has been on my to-do list :) Don't worry about adding my signature.

@xiexingguo xiexingguo added DNM and removed DNM labels Mar 28, 2019
@xiexingguo
Member Author

2019-03-14 01:34:25.830 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_primary recovering 0 in pg, missing missing(2 may_include_deletes = 1)
2019-03-14 01:34:25.830 7fe778511700 25 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_primary {3:c8460e07:::benchmark_data_smithi131_76364_object3062:head=327'1653(196'181) flag
s = delete,3:caaf3368:::benchmark_data_smithi131_76364_object3032:head=327'1652(196'180) flags = delete}

At first, the primary has two missing objects (both are deletes):

  • 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head
  • 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head
2019-03-14 01:34:25.831 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] start_recovery_op 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head

The primary starts to recover 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head first, because it is the older of the two.

2019-03-14 01:34:25.835 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] finish_degraded_object 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head

Recovery of 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head is done.

2019-03-14 01:34:25.853 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_replicas(1)

The primary should continue to recover 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head from its own missing list, since it is also a delete and therefore obviously recoverable. But it then switches to recovering replicas...
Look at the code:

if (num_missing == num_unfound) {
  // All of the missing objects we have are unfound.
  // Recover the replicas.
  started = recover_replicas(max, handle, &recovery_started);
}

Since we still have one missing object in the primary's missing list, the only possibility is that we now have an unfound object in the primary's missing_loc.
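For reference, the two counters in that check are derived roughly like this (a sketch; the exact accessor names in the tree may differ slightly):

uint64_t num_missing = missing.num_missing();  // the primary's own missing, from pg_log
uint64_t num_unfound = get_num_unfound();      // unfound objects tracked by missing_loc
// They count different sets of objects, so they can be equal even when the
// primary's missing object is perfectly recoverable and a *different* object
// is the unfound one -- which is exactly the situation in this log.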

The log continues, until we finally hit the following assert:

 ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) ||
             (it_objects != pg_log.get_log().objects.end() &&
              it_objects->second->op == pg_log_entry_t::LOST_REVERT))

which asserts that the object being recovered must not also be missing on the primary, unless it is a LOST_REVERT object.

2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas(1)
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.0(1) missing 1 objects.
2019-03-14 01:34:27.470 7fe774509700 20 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.0(1) missing {3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head=277'1606 flags = none}
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head still unfound
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.3(2) missing 11 objects.
2019-03-14 01:34:27.470 7fe774509700 20 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.3(2) missing {3:c8041a8a:::benchmark_data_smithi131_76364_object3870:head=345'1659 flags = delete,3:c822e451:::benchmark_data_
smithi131_76364_object3251:head=342'1654(196'189) flags = delete,3:c8460e07:::benchmark_data_smithi131_76364_object3062:head=327'1653(196'181) flags = delete,3:c85d03f
2:::benchmark_data_smithi131_76364_object27740:head=277'1606 flags = none,3:c96bdc45:::benchmark_data_smithi131_76364_object3424:head=342'1655 flags = delete,3:c99a90f
e:::benchmark_data_smithi131_76364_object3482:head=343'1656 flags = delete,3:ca143f5a:::benchmark_data_smithi131_76364_object4004:head=346'1661 flags = delete,3:ca3a12
e3:::benchmark_data_smithi131_76364_object3511:head=343'1657 flags = delete,3:cb6ddaac:::benchmark_data_smithi131_76364_object3652:head=343'1658 flags = delete,3:cb8d3
33c:::benchmark_data_smithi131_76364_object4006:head=346'1662 flags = delete,3:cbf2a2e7:::benchmark_data_smithi131_76364_object3929:head=345'1660 flags = delete}
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head still unfound
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head is a delete, removing
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] prep_object_replica_deletes: on 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head

Continuing through the log above, we can finally see that:

  • there is indeed an unfound object, 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head, which makes num_missing == num_unfound == 1. That explains why the primary suddenly switches to recovering replicas!
  • 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head, which remains in the current primary's missing list, also needs to be recovered on other replicas, i.e., it is also in peer osd.3(2)'s missing list (see above)
  • as recovery goes on, recover_replicas finally kicks off the replica recovery of object 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head, which then crashes the osd...

@neha-ojha I think this patch should make the above bug go away, because we never count a recovery_delete object as unfound.

Show me the code, again 👻

   bool is_unfound(const hobject_t &hoid) const {
      auto it = needs_recovery_map.find(hoid);
      if (it == needs_recovery_map.end()) {
        return false;
      }
      if (it->second.is_delete()) {
        return false;
      }
      auto mit = missing_loc.find(hoid);
      return mit == missing_loc.end() || !(*is_recoverable)(mit->second);
    }

So if there is a recovery_delete object in the primary's missing list, the new all_missing_unfound helper should instead return false and hence make the primary continue to recover itself first. Right?

@neha-ojha
Member

@xiexingguo Your analysis looks right and aligns with mine. I guess what I was suggesting is the following:

bool all_missing_unfound() const {
  const auto& missing = pg_log.get_missing();
  if (!missing.have_missing()) // Primary does not have missing
    return true;
  for (auto& m : missing.get_items()) {
    if (!missing_loc.is_unfound(m.first))
      return false;
  }
  return true;
}

if (all_missing_unfound()) {
  // Recover replicas

might want to rename all_missing_unfound() to something else in that case.

Your version looks fine too!

In purge_strays(), we'll aggressively clear stray_set and
add all related peers into peer_purged.

However, if the corresponding peer is down and comes
up again, (unconditionally) adding it to peer_purged
will prevent the primary from re-purging it.
(See Active::react(const MNotifyRec& notevt))

On consuming a new osdmap, let's move any down peers out of
peer_purged simultaneously. This way we can lower the risk
of leaving any leftover PGs behind.

Related-to: http://tracker.ceph.com/issues/38931
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
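For illustration, a rough sketch of the idea in this second commit. The helper name and the exact osdmap-advance hook it would be called from are hypothetical here, not necessarily the merged code:

// Hypothetical helper: when a new osdmap is consumed, drop any peer that is
// now down from peer_purged, so it can be purged again if it comes back up.
void PG::drop_down_peers_from_purged(const OSDMapRef& osdmap)
{
  for (auto p = peer_purged.begin(); p != peer_purged.end(); ) {
    if (!osdmap->is_up(p->osd)) {
      p = peer_purged.erase(p);   // allow re-purge after the peer restarts
    } else {
      ++p;
    }
  }
}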

@xiexingguo xiexingguo merged commit fc46584 into ceph:master Mar 30, 2019
@xiexingguo xiexingguo deleted the wip-38784 branch March 30, 2019 08:46