Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: advance the pg log pointer during recovery even the object is no… #6266

Closed
wants to merge 1 commit into from
Closed

Conversation

guangyy
Copy link
Contributor

@guangyy guangyy commented Oct 14, 2015

…t in missing list

If we mark the unfound objects as lost, there is a chance that the missing list in PG log
is empty but the log pointer does not reflect that properly, which result in crash. This
patch move the log pointer manipulate part non-conditional.

Fixes: 13468
Signed-off-by: Guang Yang yguang@yahoo-inc.com

@guangyy
Copy link
Contributor Author

guangyy commented Oct 14, 2015

Hi @athanatos , does this look correct?

if (missing.missing.empty()) {
log.complete_to = log.log.end();
info.last_complete = info.last_update;
}
while (log.complete_to != log.log.end()) {
if (missing.missing[missing.rmissing.begin()->second].need <=
log.complete_to->version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change we run teh risk of (less efficiently) running through this loop to advance to the end of the log on the last recovered object. Perhaps we should leave this hunk behind and just add the second (nearly duplicate) case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Yeah.. I will update the commit.

Other than other, does the analysis make sense for this problem?

…t in missing list

If we mark the unfound objects as lost, there is a chance that the missing list in PG log
is empty but the log pointer does not reflect that properly, which result in crash. This
patch move the log pointer manipulate part non-conditional.

Fixes: 13468
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
@guangyy
Copy link
Contributor Author

guangyy commented Oct 15, 2015

Hi @liewegas , updated according to the review comments, please help to review. Thanks!

@liewegas
Copy link
Member

I think it'll cover the case. What I'm not sure is whether it is better to patch things up here or deal with it explicitly in the mark unfound code path.. it seems like it might better there. Is this the recover_got() call in recover_primary() in the case pg_log_entry_t::LOST_REVERT? Why not put it there and leave recover_got alone?

@guangyy
Copy link
Contributor Author

guangyy commented Oct 15, 2015

the recover_got was called with the following stack trace:

2015-10-10 08:13:34.208791 7f20c2c5f700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::recover_got(hobject_t, eversion_t)' thread 7f20c2c5f700 time 2015-10-10 08:13:34.184066
osd/ReplicatedPG.cc: 9128: FAILED assert(info.last_complete == info.last_update)
ceph version 0.87 (c51c8f9)
1: (ReplicatedPG::recover_got(hobject_t, eversion_t)+0x539) [0x873f39]
2: (ReplicatedPG::on_local_recover(hobject_t const&, object_stat_sum_t const&, ObjectRecoveryInfo const&, std::tr1::shared_ptr, ObjectStore::Transaction_)+0x2bc) [0x8a8bbc]
3: (ECBackend::handle_recovery_push(PushOp&, RecoveryMessages_)+0xf08) [0x9aa3f8]
4: (ECBackend::handle_message(std::tr1::shared_ptr)+0x1ae) [0x9b698e]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr&, ThreadPool::TPHandle&)+0x181) [0x874931]
6: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x178) [0x6505d8]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x567) [0x650f37]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x742) [0x9e8a32]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x9ec310]
10: (()+0x7a51) [0x7f20efc9da51]
11: (clone()+0x6d) [0x7f20eec2d9ad]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

@guangyy
Copy link
Contributor Author

guangyy commented Oct 15, 2015

That object is removed from missing (coming from the mark found as lost commands):
ReplicatedPG::mark_all_unfound_lost
-> PGLog::missing_add_event
-> pg_missing_t::add_next_event
-> pg_missing_t::rm

And here we don't check/update the last_complete pointer.

The other way to fix looks like to update the pointer at pg_missing_t::add_next_event after calling rm?

Thanks @liewegas for the review.

@guangyy
Copy link
Contributor Author

guangyy commented Oct 16, 2015

Hi @liewegas , with the above trace, it looks like doing it from within recover_got is more appropriate for this case? Thanks.

@liewegas
Copy link
Member

Yeah, I think you're right!

@liewegas
Copy link
Member

@athanatos can you take a look?

@liewegas
Copy link
Member

passed testing.

@athanatos
Copy link
Contributor

I would like to merge this with a ceph-qa-suite test.

@liewegas
Copy link
Member

@guangyy still need a teuthology test for this before we can merge

@athanatos
Copy link
Contributor

@guangyy Please reach out in #sepia and #ceph-devel if you need help developing a test.

@guangyy
Copy link
Contributor Author

guangyy commented Nov 17, 2015

Hi @athanatos @liewegas , sorry for the delayed response, I worked on something else over the last month so a bit delay for this one.

I already have #sepia access, let me add a test case this week.

@athanatos
Copy link
Contributor

Sounds good, let us know when the test is ready (link to the ceph-qa-suite PR).

// raise last_complete?
if (missing.missing.empty()) {
log.complete_to = log.log.end();
info.last_complete = info.last_update;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an unnecessary whitespace change?

@guangyy
Copy link
Contributor Author

guangyy commented Dec 7, 2015

Looks like #6841 is a better fix on root cause, close this one and use that for further tracking.

@guangyy guangyy closed this Dec 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants