Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd/PrimaryLogPG: fix the oi size mismatch with real object size #21408

Merged
merged 1 commit into from Apr 23, 2018

Conversation

hsepeng
Copy link
Contributor

@hsepeng hsepeng commented Apr 13, 2018

oi (object_info_t) size mismatch with the real object size on the
persistent backend, which was introduced by the old write with
a smaller truncate_seq falsefully modified the oi size.

Fixes: http://tracker.ceph.com/issues/23701
Signed-off-by: Peng Xie peng.hse@xtaotech.com

oi (object_info_t) size mismatch with the real object size on the
persistent backend, which was introduced by the old write with
a smaller truncate_seq falsefully modified the oi size.

Fixes: http://tracker.ceph.com/issues/23701
Signed-off-by: Peng Xie peng.hse@xtaotech.com
@hsepeng
Copy link
Contributor Author

hsepeng commented Apr 18, 2018

hi @batrick , is there anyone who will review my fix. it is important since the bug will introduce data and metadata inconsistent problem causing potential data corruption.

@@ -7899,7 +7899,8 @@ void PrimaryLogPG::write_update_size_and_usage(object_stat_sum_t& delta_stats, o
} else if (length)
ch.insert(offset, length);
modified.union_of(ch);
if (write_full || offset + length > oi.size) {
if (write_full ||
Copy link
Contributor

@tchaikov tchaikov Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this issue exists in master. see 732a950#diff-fb41013d27e932534adb50eb3de2aaa5R7790

my guess is that the backend filestore did a short read, so the returned result was less than left. in the case of trimtrunc, this is expected. because the size of file representing the object is 0. so i think we should probably backport part of 732a950#diff-fb41013d27e932534adb50eb3de2aaa5R7790

@jdurgin what do you think?

Copy link
Contributor Author

@hsepeng hsepeng Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov
i exactly pinpoint the time sequence of the events that causing this bug , please follow my tracker link.
In summary:

  1. CEPHFS Client first issue a write op to the according object in the osd with truncate_seq 0,
    the write was writeback cached in the Client side objectcacher for latter flush
  2. latter on , client sent a truncate op to the mds, mds issued the truncate op with truncate_seq 2 to the same object.
  3. on the osd side, it first receive the truncate op, set the oi.truncate_seq to 2, the truncate
    the according object to size 0.
  4. the client side old write arrives (with truncate_seq 0), and its write offset start at 4063232,
    with len for example 100. however, it found that the object has already got truncated to 0,
    so shrink its write data buffer length to 0
  5. finally, during write_update_size_and_usage, the code falsefully update the oi.size to the
    offset (4063232)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old previous write op, out of range of the truncated object, from the cephfs fuse client should not modify the oi.size. and persist the oi metadata onto the disk, since it caused mismatch among oi.size, oi.truncated_size and real object data size on the backend. it is
not relevant to whether the sparse async read or not. @jdurgin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov @hsepeng You're both right. The particular assert won't be triggered in luminous or later due to the changes to COPY_GET handling to update the cursor with the logical size, not the size read off disk. However, it doesn't make sense to change the logical object size when the write has been superseded by a trimtrunc.

Thus, I think this patch does make sense for master, even though it won't trigger the same assert in cache tiering. It fixes a flaw in the usage accounting for this trimtrunc + write race case. Also, it points out an area where we need more testing - trimtrunc handling.

@tchaikov
Copy link
Contributor

@jdurgin since it's a bug fix. and relatively low-risk. can we have it in mimic?

@jdurgin
Copy link
Member

jdurgin commented Apr 23, 2018

@tchaikov yeah I think this is fine for mimic

@tchaikov tchaikov merged commit 9f35318 into ceph:master Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants