
osd/PG: introduce all_missing_unfound helper #27205

Merged
merged 2 commits into ceph:master from wip-38784 on Mar 30, 2019

Conversation

xiexingguo
Member

We use pg_log.missing to track each peer's missing objects separately,
whereas missing_loc records the locations of all (probably existing) good copies
of both the primary's and the replicas' missing objects. Hence items from
pg_log.missing and missing_loc have different meanings and are not directly comparable.

During recovery, we can skip recovering the primary only if

  • the primary is good, i.e., has no missing objects at all,
  • or all of the primary's missing objects do exist in missing_loc and are
    currently unfound.

Obviously, the current "all missing objects are unfound" check is broken.
Fix this by introducing an independent all_missing_unfound helper so that the
count of missing objects that are currently unfound is correct.

Fixes: http://tracker.ceph.com/issues/38784
Signed-off-by: xie xingguo xie.xingguo@zte.com.cn

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug
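To make the description above concrete, here is a minimal sketch of the gating change in the primary recovery path, paraphrased from the snippets quoted later in this conversation (not the verbatim merged diff):

// Before: "all of the primary's missing objects are unfound" was inferred
// by comparing two unrelated counters.
if (num_missing == num_unfound) {
  // num_missing counts the primary's own pg_log missing items, while
  // num_unfound counts unfound entries in missing_loc, so equality does not
  // guarantee that every primary-missing object is unfound.
  started = recover_replicas(max, handle, &recovery_started);
}

// After: ask the question directly.
if (!missing.have_missing() ||  // primary has nothing missing at all,
    all_missing_unfound()) {    // or every primary-missing object is truly unfound
  started = recover_replicas(max, handle, &recovery_started);
}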

@xiexingguo
Member Author

@neha-ojha I ended up fixing http://tracker.ceph.com/issues/38784 in a slightly different way. If this fix looks good, would you like me to add your Signed-off-by line too (since it is mainly based on your analysis :-)?

if (num_missing == num_unfound) {
// All of the missing objects we have are unfound.
if (!missing.have_missing() || // Primary does not have missing
all_missing_unfound()) { // or all of the missing objects are unfound.

@xiexingguo Considering the scenario described in http://tracker.ceph.com/issues/38784, this will help recover the missing object on the primary before calling recover_replicas(), but wouldn't all_missing_unfound() return false (when we have 1 unfound object that is not accounted for in the missing), when checking d949713#diff-d43117703c33bebe41fc39224c8025adR1614?

@neha-ojha
Member

neha-ojha commented Mar 27, 2019

@neha-ojha I ended up fixing http://tracker.ceph.com/issues/38784 in a slightly different way. If this fix looks good, would you like me to add your Signed-off-by line too (since it is mainly based on your analysis :-)?

Thanks for putting up a PR to fix this, it has been on my to-do list :) Don't worry about adding my signature.

@xiexingguo xiexingguo added DNM and removed DNM labels Mar 28, 2019
@xiexingguo
Member Author

2019-03-14 01:34:25.830 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_primary recovering 0 in pg, missing missing(2 may_include_deletes = 1)
2019-03-14 01:34:25.830 7fe778511700 25 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_primary {3:c8460e07:::benchmark_data_smithi131_76364_object3062:head=327'1653(196'181) flag
s = delete,3:caaf3368:::benchmark_data_smithi131_76364_object3032:head=327'1652(196'180) flags = delete}

At first, the primary has two missing objects (both are deletes):

  • 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head
  • 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head
2019-03-14 01:34:25.831 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 317'1651 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=2 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] start_recovery_op 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head

The primary starts to recover 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head first, because it is the older of the two.

2019-03-14 01:34:25.835 7fe778511700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] finish_degraded_object 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head

Recovery of 3:caaf3368:::benchmark_data_smithi131_76364_object3032:head is done.

2019-03-14 01:34:25.853 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=63,(1+1)=13},1={(0+0)=60,(0+1)=3,(1+1)=13},2={(0+0)=1,(0+1)=75}}] recover_replicas(1)

The primary should continue to recover 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head from its own missing list, since it is also a delete and therefore obviously recoverable. But it then switches to recovering replicas...
Look at the code:

if (num_missing == num_unfound) {
  // All of the missing objects we have are unfound.
  // Recover the replicas.
  started = recover_replicas(max, handle, &recovery_started);
}

Since we still have one missing object in the primary's missing list, the only possibility is that we now have an unfound object in the primary's missing_loc.
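For reference, the two counters in that check are derived roughly like this (a sketch; the exact accessor names in the tree may differ slightly):

uint64_t num_missing = missing.num_missing();  // the primary's own missing, from pg_log
uint64_t num_unfound = get_num_unfound();      // unfound objects tracked by missing_loc
// They count different sets of objects, so they can be equal even when the
// primary's missing object is perfectly recoverable and a *different* object
// is the unfound one -- which is exactly the situation in this log.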

The log continues, until we finally hit the following assert:

 ceph_assert(attrs || !pg_log.get_missing().is_missing(soid) ||
             (it_objects != pg_log.get_log().objects.end() &&
              it_objects->second->op == pg_log_entry_t::LOST_REVERT))

which asserts that the object being recovered must not also be missing on the primary, unless it is a LOST_REVERT object.

2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas(1)
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.0(1) missing 1 objects.
2019-03-14 01:34:27.470 7fe774509700 20 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.0(1) missing {3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head=277'1606 flags = none}
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head still unfound
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.3(2) missing 11 objects.
2019-03-14 01:34:27.470 7fe774509700 20 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}]  peer osd.3(2) missing {3:c8041a8a:::benchmark_data_smithi131_76364_object3870:head=345'1659 flags = delete,3:c822e451:::benchmark_data_
smithi131_76364_object3251:head=342'1654(196'189) flags = delete,3:c8460e07:::benchmark_data_smithi131_76364_object3062:head=327'1653(196'181) flags = delete,3:c85d03f
2:::benchmark_data_smithi131_76364_object27740:head=277'1606 flags = none,3:c96bdc45:::benchmark_data_smithi131_76364_object3424:head=342'1655 flags = delete,3:c99a90f
e:::benchmark_data_smithi131_76364_object3482:head=343'1656 flags = delete,3:ca143f5a:::benchmark_data_smithi131_76364_object4004:head=346'1661 flags = delete,3:ca3a12
e3:::benchmark_data_smithi131_76364_object3511:head=343'1657 flags = delete,3:cb6ddaac:::benchmark_data_smithi131_76364_object3652:head=343'1658 flags = delete,3:cb8d3
33c:::benchmark_data_smithi131_76364_object4006:head=346'1662 flags = delete,3:cbf2a2e7:::benchmark_data_smithi131_76364_object3929:head=345'1660 flags = delete}
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head still unfound
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] recover_replicas: 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head is a delete, removing
2019-03-14 01:34:27.470 7fe774509700 10 osd.3 pg_epoch: 365 pg[3.13s0( v 350'1669 lc 327'1652 (0'0,350'1669] local-lis/les=348/349 n=372 ec=237/193 lis/c 348/193 les/c
/f 349/194/0 347/348/341) [3,0,2147483647]/[3,0,3]p3(0) r=0 lpr=348 pi=[193,348)/10 crt=350'1669 mlcod 197'247 active+recovering+degraded+remapped m=1 u=1 mbc={0={(1+0
)=1},1={(0+0)=1},2={(0+0)=1}}] prep_object_replica_deletes: on 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head

Continuing through the log above, we can finally see that:

  • there is indeed an unfound object, 3:c85d03f2:::benchmark_data_smithi131_76364_object27740:head, which makes num_missing == num_unfound == 1. That explains why the primary suddenly switches to recovering replicas!
  • 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head, which remains in the current primary's missing list, also needs to be recovered on other replicas, i.e., it is also in peer osd.3(2)'s missing list (see above)
  • as recovery goes on, recover_replicas finally kicks off the replica recovery of object 3:c8460e07:::benchmark_data_smithi131_76364_object3062:head, which then crashes the osd...

@neha-ojha I think this patch should make the above bug go away, because we never count a recovery_delete object as unfound.

Show me the code, again 👻

   bool is_unfound(const hobject_t &hoid) const {
      auto it = needs_recovery_map.find(hoid);
      if (it == needs_recovery_map.end()) {
        return false;
      }
      if (it->second.is_delete()) {
        return false;
      }
      auto mit = missing_loc.find(hoid);
      return mit == missing_loc.end() || !(*is_recoverable)(mit->second);
    }

So if there is a recovery_delete object in the primary's missing list, the new all_missing_unfound helper should instead return false and hence make the primary continue to recover itself first. Right?

@neha-ojha
Member

@xiexingguo Your analysis looks right and aligns with mine. I guess what I was suggesting is the following:

bool all_missing_unfound() const {
  const auto& missing = pg_log.get_missing();
  if (!missing.have_missing()) // Primary does not have missing
    return true;
  for (auto& m : missing.get_items()) {
    if (!missing_loc.is_unfound(m.first))
      return false;
  }
  return true;
}

if (all_missing_unfound()) {
  // Recover replicas

might want to rename all_missing_unfound() to something else in that case.

Your version looks fine too!

In purge_strays(), we'll aggressively clear stray_set and
add all related peers into peer_purged.

However, if the corresponding peer is down and comes
up again, (unconditionally) adding it to peer_purged
will prevent the primary from re-purging it.
(See Active::react(const MNotifyRec& notevt))

On consuming a new osdmap, let's move any down peers out of
peer_purged simultaneously. This way we can lower the risk
of leaving any leftover PGs behind.

Related-to: http://tracker.ceph.com/issues/38931
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
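For illustration, a rough sketch of the idea in this second commit. The helper name and the exact osdmap-advance hook it would be called from are hypothetical here, not necessarily the merged code:

// Hypothetical helper: when a new osdmap is consumed, drop any peer that is
// now down from peer_purged, so it can be purged again if it comes back up.
void PG::drop_down_peers_from_purged(const OSDMapRef& osdmap)
{
  for (auto p = peer_purged.begin(); p != peer_purged.end(); ) {
    if (!osdmap->is_up(p->osd)) {
      p = peer_purged.erase(p);   // allow re-purge after the peer restarts
    } else {
      ++p;
    }
  }
}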

@xiexingguo xiexingguo merged commit fc46584 into ceph:master Mar 30, 2019
@xiexingguo xiexingguo deleted the wip-38784 branch March 30, 2019 08:46