New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: cmpext operator should ignore -ENOENT on read #16622

Merged
merged 1 commit into from Aug 1, 2017

Conversation

Projects
None yet
6 participants
@dillaman
Contributor

dillaman commented Jul 27, 2017

The operator already handles the case where the object is truncated. If
an RBD user performs a read + cmpext/write of a sparse image, the read
of the missing object would return a zeroed buffer. Using that zeroed
buffer for the cmpext test would fail since it wasn't ignoring the
-ENOENT read failure.

Signed-off-by: Jason Dillaman dillaman@redhat.com

@dillaman dillaman added this to the luminous milestone Jul 27, 2017

@liewegas

This comment has been minimized.

Member

liewegas commented Jul 27, 2017

remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834403 osd.0 osd.0 172.21.15.83:6801/92919 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834562 osd.5 osd.5 172.21.15.115:6808/20364 1 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834403 osd.0 osd.0 172.21.15.83:6801/92919 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834562 osd.5 osd.5 172.21.15.115:6808/20364 1 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834025 osd.3 osd.3 172.21.15.115:6800/20361 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi083/log/ceph.log.gz:2017-07-27 18:57:45.834025 osd.3 osd.3 172.21.15.115:6800/20361 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi115/log/ceph.log.gz:2017-07-27 18:56:46.997082 mon.a mon.0 172.21.15.83:6789/0 5 : cluster [INF] HEALTH_ERR; no osds
remote/smithi115/log/ceph.log.gz:2017-07-27 18:57:45.834403 osd.0 osd.0 172.21.15.83:6801/92919 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi115/log/ceph.log.gz:2017-07-27 18:57:45.834562 osd.5 osd.5 172.21.15.115:6808/20364 1 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head
remote/smithi115/log/ceph.log.gz:2017-07-27 18:57:45.834025 osd.3 osd.3 172.21.15.115:6800/20361 8 : cluster [ERR] Error -2 reading object 56:ed67ef63:test-rados-api-smithi083-94428-61::foo:head

appears in the cluster log, see
/a/sage-2017-07-27_18:08:03-rados-wip-sage-testing-distro-basic-smithi/1453451

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 27, 2017

Yeah -- I saw that as well but it's coming from the low-level ECBackend async reader code. Did you want me to silence -ENOENT error logs down there as part of this?

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 27, 2017

@liewegas appended a commit to hide the cluster error log message on -ENOENT

@liewegas

This comment has been minimized.

Member

liewegas commented Jul 28, 2017

test with this included hit

2017-07-28T04:42:15.171 INFO:tasks.ceph.osd.3.smithi095.stderr: ceph version 12.1.1-795-g788745f (788745f47d9e87946bf9ccbae9172c003ccfb467) luminous (rc)
2017-07-28T04:42:15.171 INFO:tasks.ceph.osd.3.smithi095.stderr: 1: (()+0xa9a424) [0x5565fd451424]
2017-07-28T04:42:15.171 INFO:tasks.ceph.osd.3.smithi095.stderr: 2: (()+0x113e0) [0x7f79f01693e0]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 3: (CopyFromFinisher::execute()+0xc) [0x5565fd0b03bc]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 4: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector >&)+0x2f4d) [0x5565fd0518ad]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 5: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0xbf) [0x5565fd06124f]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 6: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x2db) [0x5565fd061acb]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 7: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x308b) [0x5565fd0665ab]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 8: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0xe73) [0x5565fd020463]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 9: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9) [0x5565fcea5239]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x5565fd13b517]
2017-07-28T04:42:15.172 INFO:tasks.ceph.osd.3.smithi095.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x5565fcecc69e]
2017-07-28T04:42:15.173 INFO:tasks.ceph.osd.3.smithi095.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5565fd498c64]
2017-07-28T04:42:15.173 INFO:tasks.ceph.osd.3.smithi095.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5565fd49bca0]
2017-07-28T04:42:15.173 INFO:tasks.ceph.osd.3.smithi095.stderr: 14: (()+0x770a) [0x7f79f015f70a]
2017-07-28T04:42:15.173 INFO:tasks.ceph.osd.3.smithi095.stderr: 15: (clone()+0x6d) [0x7f79ef1d682d]

but i didn't include http://tracker.ceph.com/issues/20783 (merge conflict). /a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455227

@liewegas liewegas requested a review from jdurgin Jul 28, 2017

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@dillaman Why isn't this handled by RBD? Is the cmpext operator an RBD only operation or is it a general purpose rados operation? If it is general purpose, I would argue that it isn't RADOS' job to massage that error.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 28, 2017

@dzafman because it would be next to impossible for RBD to do that (i.e. if it sees the -ENOENT it could do a "CREATE (does not already exist)" op, then another retry of CMPEXT+WRITE ops -- but then you are just saying an object that exists but is empty it is treated as all zeros) --- which is what this PR formalizes in all cases.

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@dillaman I don't see why RADOS is responsible for generating these semantics. Why doesn't RBD always create objects truncated to the proper size? As a matter of fact, in order for RBD to detected a corruption that is causing objects to be lost, it should treat ENOENT as an actual error.

Recently, a tracker was filed where someone thought they had repaired a PG, but actually lots of filestore replicas were missing. This resulted in not enough EC shards to be present to decode the data. This was a case of real data loss and ENOENT to be generated. Why shouldn't RBD instead of rados decide if it wants to return zeros or return an error?

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 28, 2017

@dzafman Since it seems you are concerned about the change to cluster error logging for -ENOENT on EC reads, I'll point out that RBD doing a plain read of a sparse block device would cause those errors to flood the cluster log as librbd / krbd iterated through (correctly) non-existent objects. We don't have cluster error logging when replicated objects are read that don't exist.

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@dillaman I have a solution. In order to differentiate between a non-existent object and a case where some shards are missing, ENOENT will be translated to another error code when we know that the object is supposed to be there, but not enough shards exist. If any shard returns EIO or non-ENOENT we'll return any non-ENOENT error. If all errors are ENOENT (at filestore level) we'll translate that to something else (ENXIO?). If the error you see in your checking is ENOENT, then you know it is an object that doesn't exist. I still think that RBD should be handling non-existent objects itself instead of having RADOS special casing ENOENT.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented Jul 28, 2017

This PR just clarifies the cmpext semantics so that a non-existent object is equivalent to zeroes at all ranges, right? Don't we zero-fill in other RADOS ops, like if you do a read that crosses a hole?

I'm not sure how the lost objects in filestore is relevant to this — user-visible ENOENT (which this is) is a statement about the system, whereas lost shards should result in hanging ops or something...?

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@gregsfortytwo A missing shard (ENOENT from objectstore) will currently get back to the client. Since ErasureCode::minimum_to_decode() returns EIO when there aren't enough shards to read an object, I'm going to always use that error code even though we've collected any and all errors from each shard's attempt to read.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented Jul 28, 2017

@dzafman, doesn't that only happen for replicated pools where we're (incorrectly) not noticing missing data though? eg ec pools won't produce that error since they're reading from multiple shard — or maybe it fails because the primary also managed to lose the whole object and didn't have an object_info to reference? :/
If there's a ticket I can look at explaining the issue that would be good.

Anyway, my main concern about the semantics here would just be around whether people might be using CMPEXT and want to get an ENOENT out of it, or what happens when it's combined with other ops. The incorrect-ENOENT result is a whole other can of worms.

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@gregsfortytwo If you use cmpext to read an object that doesn't exist, you should still get ENOENT I assume based on the primary having no shard. Also, if OSDs are down such that there aren't enough readable shards available that is handled earlier in the process (maybe that case hangs). Finally, if we get past those two cases and reading shards get ENOENT from the objectstore, we need to decide what error code to return. I hadn't thought about the fact that a client can't tell if ENOENT means that an object doesn't exist, or there are not enough shards because one or more shards got ENOENT when attempting to read from the objectstore.

get_parent()->clog_error() << "Error " << r
<< " reading object "
<< i->first;
if (r != -ENOENT) {

This comment has been minimized.

@dzafman

dzafman Jul 28, 2017

Member

@dillaman @gregsfortytwo I don't think we want to skip ENOENT logging here.

See #16663. My pull request makes it possible for cmpext to trust ENOENT.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented Jul 28, 2017

Oh I see, so your concern @dzafman is more about the internal changes where we're turning ENOENT into not-a-fatal-sounding-error?

Is there a way we can adjust the internal interfaces so that in situations where ENOENT is okay we don't throw a big alert, but still preserve a user-visible behavior where a non-existent object is zero-filled for the purposes of CMPEXT?

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 28, 2017

@gregsfortytwo That is what I want. The handle_sub_read() errors are filestore errors which needs a "big alert" as you call it. It is NOT the case of a simple non-existent object.

@@ -4748,8 +4762,11 @@ int PrimaryLogPG::do_extent_cmp(OpContext *ctx, OSDOp& osd_op)
read_op.op.extent.truncate_size = op.extent.truncate_size;
int result = do_osd_ops(ctx, read_ops);
if (result < 0) {
derr << "do_extent_cmp do_osd_ops failed " << result << dendl;
if (result == -ENOENT) {

This comment has been minimized.

@dzafman

dzafman Jul 28, 2017

Member

Won't the check in finish() clear ENOENT so it will never be true here?

This comment has been minimized.

@dillaman

dillaman Jul 28, 2017

Contributor

This is the sync read path -- do_osp_ops would invoke and return the result from do_read, which returns the result from pgbackend->objects_read_sync

This comment has been minimized.

@dzafman

dzafman Jul 28, 2017

Member

got it!

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented Jul 28, 2017

Okay, so from your perspective @dzafman then we should only be issuing objects_read_async() against extents we have reason to believe actually exist?

And the way CMPEXT is set up I think it just blindly shoots them off.

That sounds to me like something we can change, although I don't actually know if that's the right place for this barrier to be set up or not. @jdurgin, thoughts?

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 28, 2017

@gregsfortytwo That cannot be a correct assertion -- an external user should be able to to issue a read against a non-existent object w/o the cluster spamming false positives.

if (r == -ENOENT) {
osd_op.rval = 0;
read_bl.clear();
delete fill_extent_ctx;

This comment has been minimized.

@dzafman

dzafman Jul 28, 2017

Member

Won't the destructor change do the delete now?

This comment has been minimized.

@dillaman

dillaman Jul 28, 2017

Contributor

Yes -- but that was just to silence coverity. If I avoid the deletion here, I would need to move the nullptr assignment within the else branch. I can do it -- but at least it's consistent that the object is destructed within this method now.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 28, 2017

@dzafman OK -- finally figured out what were saying via play-testing. I'll add a conditional to not perform the async read if the object doesn't exist / is a whiteout.

osd: cmpext operator should ignore -ENOENT on read
The operator already handles the case where the object is truncated. If
an RBD user performs a read + cmpext/write of a sparse image, the read
of the missing object would return a zeroed buffer. Using that zeroed
buffer for the cmpext test would fail since it wasn't ignoring the
-ENOENT read failure.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jul 30, 2017

@dzafman

This comment has been minimized.

Member

dzafman commented Jul 31, 2017

If this merges for Luminous so should #16663.

@jdurgin

jdurgin approved these changes Aug 1, 2017

@jdurgin jdurgin merged commit 61908ac into ceph:master Aug 1, 2017

4 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@dillaman dillaman deleted the dillaman:wip-cmpext-dne branch Aug 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment