osd: On EIO from read recover the primary replica from another copy #14760

Merged
jdurgin merged 20 commits into ceph:master from dzafman:wip-19657 on Jun 29, 2017

Conversation

@dzafman
Member

dzafman commented Apr 25, 2017

No description provided.

@dzafman

Member

dzafman commented Apr 25, 2017

Not yet handling EIO on reading data for push.

@dzafman dzafman requested review from jdurgin, liewegas and tchaikov Apr 25, 2017

@dzafman

Member

dzafman commented Apr 25, 2017

@jdurgin @liewegas @tchaikov There may be lots of potential races here. I'm calling maybe_kick_recovery(), which I'm not sure is ok if recovery isn't already in progress. During scrub repair we queue DoRecovery(), which transitions the state machine and gets reservations just like activation does.

My testing involved 1000 objects all getting EIO errors from simultaneous reads. It didn't crash and fixed all of them. I'm not sure whether having missing objects outside of recovery is a good idea. This happens outside of reservations, which should be ok because very few objects will be involved unless a disk is failing badly.

I'll run rados suite tomorrow.

@dzafman

Member

dzafman commented Apr 25, 2017

Jenkins re-test this please

@dzafman

Member

dzafman commented Apr 25, 2017

Passed rados:thrash run but need to add some injected data errors:

http://pulpito.ceph.com/dzafman-2017-04-25_10:05:29-rados:thrash-wip-zafman-testing-distro-basic-smithi

@dzafman

Member

dzafman commented May 1, 2017

retest this please

}
if (op_shards.empty()) {
  dout(0) << __func__ << " No other replicas available for " << soid << dendl;
  return -EIO;

@jdurgin

jdurgin May 17, 2017

Member

we shouldn't be checking for unfound here, or returning EIO - clients can't handle it. this should behave just as if the object were already missing - the op will retry once it's recovered (or if it stays unfound, block)

@dzafman

dzafman May 17, 2017

Member

There isn't anything else we can do. The client doesn't resend, as I've discovered. We'd have to add the op to waiting_for_unreadable_object, but in this case it isn't clear when it could ever be processed.

@jdurgin

jdurgin May 17, 2017

Member

it's the same as any other unfound object. the op will complete when an osd with a copy of it comes up, or the admin marks the unfound objects lost or otherwise does a manual recovery. returning eio causes corruption for clients like rbd

for (auto &&i : op_shards)
  missing_loc.add_location(soid, i);
waiting_for_unreadable_object[soid].push_back(op);

@jdurgin

jdurgin May 17, 2017

Member

this is doing the same bookkeeping as primary_error() - just call that once we read (or fail to read) the version off disk. It looks like primary_error() should be calling pg_log.set_last_requested(0) to reset the recovery point, and requeuing recovery

failed_push(fl, oid);
primary_error(oid, v);
backfills_in_flight.erase(oid);
missing_loc.add_missing(oid, v, eversion_t());

@jdurgin

jdurgin May 17, 2017

Member

this missing_loc.add_missing() update should just be part of primary_error() - the only caller that doesn't do it is prep_object_replica_pushes(), and that's a bug - the object isn't marked missing on the primary in that case, but should be
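
To make the two review comments above concrete, the consolidated bookkeeping might look roughly like the following. This is only a sketch: pg_log.missing_add() and queue_recovery() are assumed helper names based on the surrounding diff and comments, not necessarily what the tree provides.

// Sketch: fold the bookkeeping into primary_error() so the EIO paths and
// prep_object_replica_pushes() all mark the object the same way.
void PrimaryLogPG::primary_error(const hobject_t &soid, eversion_t v)
{
  // record the object as missing on the primary (assumed helper name)
  pg_log.missing_add(soid, v, eversion_t());
  // reset the recovery point so this object is considered again
  pg_log.set_last_requested(0);
  // track it in missing_loc as well, per the comment above
  missing_loc.add_missing(soid, v, eversion_t());
  // and make sure recovery is requeued to pull another copy (assumed hook)
  queue_recovery();
}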

@@ -4613,6 +4625,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
} else {
  int r = pgbackend->objects_read_sync(
    soid, op.extent.offset, op.extent.length, op.flags, &osd_op.outdata);
  if (r == -EIO) r = rep_repair_primary_object(soid, ctx->op);

@jdurgin

jdurgin May 17, 2017

Member

nit: condition bodies should be on the next line and use curly braces

@dzafman

dzafman May 17, 2017

Member

fixing
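
For reference, the reformatted condition from the diff above would look something like this (same logic, just laid out per the style nit):

int r = pgbackend->objects_read_sync(
  soid, op.extent.offset, op.extent.length, op.flags, &osd_op.outdata);
if (r == -EIO) {
  r = rep_repair_primary_object(soid, ctx->op);
}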

{
  dout(20) << __func__ << ": " << soid << " from " << from << dendl;
  list<pg_shard_t> fl = { from };
  get_parent()->failed_push(fl, soid);

@jdurgin

jdurgin May 17, 2017

Member

if you're renaming to failed_pull, you may as well rename PrimaryLogPG::failed_push() too

@dzafman

dzafman May 17, 2017

Member

Not sure what you mean here. What should failed_push() be called? Sometimes failed_push() is called when errors occur during a push, and sometimes it is called from _failed_pull() when a pull failed to read at the remote OSD.

@jdurgin

jdurgin May 17, 2017

Member

nevermind, I was confused by the push/pull naming in ReplicatedBackend. having the primary process a push message in _do_pull_response() instead of e.g. _do_push_primary() confused me

waiting_for_unreadable_object[soid].push_back(op);
op->mark_delayed("waiting for missing object");
maybe_kick_recovery(soid);

@jdurgin

jdurgin May 17, 2017

Member

we might be clean. we need to start recovery from scratch just like scrub_finish() does, by queuing a DoRecovery peering event. using maybe_kick_recovery here would bypass recovery throttling and not update the pg state

@dzafman

dzafman May 18, 2017

Member

The reason to bypass recovery throttling is that we are handling a client request and this is a low impact pull of a single object. My only concern would be a failing disk constantly returning EIO.

@jdurgin

jdurgin May 18, 2017

Member

with bluestore giving EIO on checksum failure, a hard power failure causing checksum mismatches across many objects seems likely. in that kind of case, throttling recovery from EIO becomes important.

The bigger problem is that maybe_kick_recovery() doesn't handle retrying recovery in the unfound case, since if we aren't in recovery/backfill already, nothing else will queue it. For example, when the object is unfound, the call chain ends up being

maybe_kick_recovery->
  recover_missing->PULL_NONE
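
For context, restarting recovery the way scrub_finish() does means queuing a DoRecovery peering event rather than calling maybe_kick_recovery() directly. Roughly along these lines (a sketch only; the exact event and ref types may differ from the tree):

// Sketch: requeue recovery through the peering state machine, as scrub
// repair does, so reservations and recovery throttling still apply.
queue_peering_event(
  CephPeeringEvtRef(
    std::make_shared<CephPeeringEvt>(
      get_osdmap()->get_epoch(),
      get_osdmap()->get_epoch(),
      DoRecovery())));
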
@dzafman

Member

dzafman commented May 18, 2017

retest this please

@dzafman

Member

dzafman commented May 18, 2017

@liewegas This pull request should not have passed Jenkins without #15161

@dzafman

Member

dzafman commented Jun 6, 2017

@jdurgin Per your request I've tried, but failed, to start recovery in rep_repair_primary_object(). The code as it stands now works similarly to scrub repair; admittedly, repair is hacky and will be moving out of the osd. I can show you the code I was trying to make work if you like.

@@ -10466,6 +10465,7 @@ void PrimaryLogPG::mark_all_unfound_lost(
} // otherwise, just do what we used to do
dout(10) << e << dendl;
log_entries.push_back(e);
missing_loc.recovered(oid);

@jdurgin

jdurgin Jun 8, 2017

Member

this should go in the callback passed to submit_log_entries() below - calling missing_loc.recovered() here clears the object from needs_recovery_map, so ops can come in for the object before the delete is performed

@dzafman

dzafman Jun 8, 2017

Member

Unfortunately, I'd need to have all the LOST_DELETE deleted OIDs available to the callback function.
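
One way to make that work would be to collect the LOST_DELETE oids while the log entries are built and capture them in the completion callback; a hypothetical sketch (oids_to_clear and the callback shape are illustrative, not the merged code):

// Hypothetical: gather the oids as the LOST_DELETE entries are appended,
// next to log_entries.push_back(e) in the loop above.
std::vector<hobject_t> oids_to_clear;
oids_to_clear.push_back(oid);

// Defer the missing_loc cleanup until the log entries have committed, so
// ops can't come in for the object before the delete is performed.
std::function<void()> on_complete = [this, oids_to_clear]() {
  for (const auto &o : oids_to_clear)
    missing_loc.recovered(o);
};
// pass on_complete as the callback argument to submit_log_entries() below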

dzafman added some commits May 9, 2017

mon: Bring daemonization in line with other daemons
ceph-mon doesn't use global_init_daemonize() as the other
daemons do.  This resulted in the daemon staying in the session
of the process that started it.  In testing, if I terminated a
test script, only ceph-mon would exit.

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: cleanup: Remove unused code - waiting_for_all_missing list and wait_for_all_missing()

Signed-off-by: David Zafman <dzafman@redhat.com>
test: Use correct argument --read-percent
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: On EIO from read recover the primary replica from another copy
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: cleanup: Rename _failed_push to _failed_pull for ReplicatedBackend
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handle read failures when pushing at primary
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handle read errors during backfill
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handle repair when no object_info_t so eversion_t not known at primary

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: cleanup: Use available get_head()
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Cancel backfill when it can't proceed due to errors
Add new transition CancelBackfill (Backfilling -> NotBackfilling).
When giving up on backfill due to errors, use the new transition,
which includes scheduling a retry of backfill.

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Cancel recovering when no more progress can be made
Add new CancelRecovery transition (Recovering -> NotRecovering).
When giving up on recovery due to errors, use the new transition,
which includes scheduling a retry of recovery.

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Don't disable cache if doing multiple reads
Handle the unusual case that a read fails after the first one

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handle errors on reads of subsequent file segment
Signed-off-by: David Zafman <dzafman@redhat.com>

dzafman added some commits May 17, 2017

osd: Handle read error when pushing after pull completion
Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Handle bad object info decode in repair_object()
Signed-off-by: David Zafman <dzafman@redhat.com>
objectstore: Remove allow_eio arg from read, we allow it to be returned
Signed-off-by: David Zafman <dzafman@redhat.com>
osd debug: Add testing support for random read EIO errors
filestore_debug_random_read_err: specify the percentage of filestore reads that return EIO

Signed-off-by: David Zafman <dzafman@redhat.com>
osd debug: Add testing support by adding random read error only when building push ops

add osd_debug_random_push_read_error config option

Signed-off-by: David Zafman <dzafman@redhat.com>
test: Add two new singleton test yamls random-eio and thrash-eio
New option "random_eio" to Thrasher sets the random read error percentage on one osd
New option "objectsize" to radosbench task (the bench -o option)
New option "type" to radosbench to specify write, seq or rand

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman

Member

dzafman commented Jun 29, 2017

Testing PASSED

None of these failures seem to be related to this change.

dzafman-2017-06-23_10:52:23-rados-wip-zafman-testing-distro-basic-smithi

1318636 swift/virtualenv/bin/nosetests 2 errors
1318672 SELinux denials found
1318737 apt-get error
1318791 reached maximum tries (105) after waiting for 630 seconds
1318888 common/ceph_crypto.cc: 71: FAILED assert(crypto_context != __null)
1318900 qa/workunits/rados/test.sh timed out after 3 hours

@jdurgin

looks good!

@jdurgin jdurgin merged commit 4bcd6f6 into ceph:master Jun 29, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
arm64 make check: arm64 make check succeeded
make check: make check succeeded

@dzafman dzafman deleted the dzafman:wip-19657 branch Jun 29, 2017
