New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

librbd: add compare and write API #14868

Closed
wants to merge 12 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@wangzhengyong
Contributor

wangzhengyong commented Apr 28, 2017

Librbd: Add cmpext_write API support for scsi CMPARE_AND_WRITE command
There are many places in this version that need to be improved

ToDo:

  1. rbd cache support and testcase
  2. add more testcases: fsx

@dillaman dillaman self-requested a review Apr 28, 2017

@dillaman

dillaman requested changes May 2, 2017 edited

This implementation is not actually atomic -- it's no different than using the existing read operation followed by a write (except that the compare is offloaded to the OSD instead of your local client).

What really needs to occur is that the new CompareAndWrite operation should chain a "cmpext" RADOS op followed by a "write" RADOS op within the same transaction. That way, if the "cmpext" fails, the "write" will be skipped.

This also implies that you should only support extents ranges that encompass a single object extent range -- otherwise you might successfully update one backing object but fail to update another.

@@ -109,6 +110,23 @@ void ObjectRequest<I>::complete(int r)
if (m_hide_enoent && r == -ENOENT) {
r = 0;
}
if (strcmp(get_op_type(), "cmpext") == 0 && r <= -MAX_ERRNO) {

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Add a new virtual method that is overridden in ObjectCmpextRequest for this logic

ImageCtx *image_ctx = this->m_ictx;
uint64_t offset = -MAX_ERRNO - r;
__u32 object_size = image_ctx->layout.object_size;

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Don't re-implement the striping logic -- just re-use Striper::extent_to_file to convert the object offset to the image offset

uint64_t stripeno = offset / su + objectsetno * stripes_per_object;
uint64_t blockno = stripeno * stripe_count + stripepos;
uint64_t extent_off = blockno * su + off_in_block;
r = -MAX_ERRNO - extent_off;

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

I think the librbd API should accept an optional uint64_t pointer that would be populated with the mismatch offset. The librados API could overload the meaning of the return code since RADOS objects cannot be too large. Images, however, could easily overflow a 32bit signed integer.

This comment has been minimized.

@wangzhengyong

wangzhengyong May 4, 2017

Contributor

@dillaman I think ssize_t is OK for librbd API. In 64-bit system, uint64_t is 8byte, and ssize_t is 8byte too. we only need to change the int to ssize_t
If add an optional uint64_t pointer, we should assign the pointer in callback function, from objrequest to imagerequest, it is more complex

This comment has been minimized.

@dillaman

dillaman May 4, 2017

Contributor

@wangzhengyong Eh -- I really don't want to have a ssize_t return code from the librbd API where the user needs to interpret the error code and extract the offset. It really isn't hard to pass a uint64_t pointer -- especially since a single API request will result in only a single ObjectRequest for CompareAndWrite.

@@ -657,8 +675,137 @@ void ObjectWriteSameRequest::send_write() {
AbstractObjectWriteRequest::send_write();
}
template <typename I>
ObjectCmpextRequest<I>::ObjectCmpextRequest(ImageCtx *ictx, const std::string &oid,
uint64_t object_no, uint64_t object_off, const ceph::bufferlist &cmp_bl,

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: indent params to align with previous line's param starting column

uint64_t object_no, uint64_t object_off, const ceph::bufferlist &cmp_bl,
Context *completion)
: ObjectRequest<I>(util::get_image_ctx(ictx), oid, object_no, object_off, cmp_bl.length(), CEPH_NOSNAP,
completion, false),

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: indent params to align with previous line's param starting column

@@ -686,6 +686,9 @@ CEPH_RBD_API ssize_t rbd_write2(rbd_image_t image, uint64_t ofs, size_t len,
CEPH_RBD_API int rbd_discard(rbd_image_t image, uint64_t ofs, uint64_t len);
CEPH_RBD_API ssize_t rbd_writesame(rbd_image_t image, uint64_t ofs, size_t len,
const char *buf, size_t data_len, int op_flags);
CEPH_RBD_API ssize_t rbd_cmpext_write(rbd_image_t image, uint64_t ofs, size_t len,

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: I'd prefer naming it something along the lines of "compare_and_write``` here and below and in the C++ API. Also, don't forget to update the Python API as well.

@@ -711,6 +714,10 @@ CEPH_RBD_API int rbd_aio_discard(rbd_image_t image, uint64_t off, uint64_t len,
CEPH_RBD_API int rbd_aio_writesame(rbd_image_t image, uint64_t off, size_t len,
const char *buf, size_t data_len,
rbd_completion_t c, int op_flags);
CEPH_RBD_API ssize_t rbd_aio_cmpext_write(rbd_image_t image, uint64_t off, size_t len,
const char *cmp_buf, const char *buf,

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: indentation

@@ -371,13 +371,17 @@ class CEPH_RBD_API Image
ssize_t write2(uint64_t ofs, size_t len, ceph::bufferlist& bl, int op_flags);
int discard(uint64_t ofs, uint64_t len);
ssize_t writesame(uint64_t ofs, size_t len, ceph::bufferlist &bl, int op_flags);
ssize_t cmpext_write(uint64_t ofs, size_t len, ceph::bufferlist &cmp_bl,
ceph::bufferlist& bl, int op_flags);

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: indentation

int aio_write(uint64_t off, size_t len, ceph::bufferlist& bl, RBD::AioCompletion *c);
/* @param op_flags see librados.h constants beginning with LIBRADOS_OP_FLAG */
int aio_write2(uint64_t off, size_t len, ceph::bufferlist& bl,
RBD::AioCompletion *c, int op_flags);
int aio_writesame(uint64_t off, size_t len, ceph::bufferlist& bl,
RBD::AioCompletion *c, int op_flags);
int aio_cmpext_write(uint64_t off, size_t len, ceph::bufferlist& cmp_bl,
ceph::bufferlist& bl, RBD::AioCompletion *c, int op_flags);

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: indentation

@@ -161,12 +165,17 @@ void AioCompletion::complete_request(ssize_t r)
assert(ictx != nullptr);
CephContext *cct = ictx->cct;
if (r < 0 && r <= -MAX_ERRNO) {

This comment has been minimized.

@dillaman

dillaman May 2, 2017

Contributor

Nit: unneeded once you add an optional pointer in the API for returning the mismatch offset

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented May 3, 2017

@dillaman thanks for your comments, i will take CompareAndWrite operation atomic
but, IMO, " chain a "cmpext" RADOS op followed by a "write" RADOS op" is not correct.
If the match fails, we need to calculate the minimum mismatch_offset, which means that all the objects are required to be completed before the cmpext can get the value, even if there is a successful object, nor write. If the match is successful, but also need all the objects are successful before they can write

@dillaman

This comment has been minimized.

Contributor

dillaman commented May 3, 2017

@wangzhengyong This new API method can only work if the offset and length are aligned within a single backing object. Otherwise, it won't be atomic. I spoke w/ @mikechristie about this yesterday and he said such a restriction is fine in order to get this atomic operation into TCMU (i.e. ESX will only write a single sector atomically).

@dillaman

This comment has been minimized.

Contributor

dillaman commented May 3, 2017

@wangzhengyong This is what I mean by "chain" the operations where a guarded write to a clone would first execute an operation to assert that the clone object exists [1] and only if that test passes does it continue w/ the write operations [2].

[1] https://github.com/ceph/ceph/blob/master/src/librbd/io/ObjectRequest.cc#L369
[2] https://github.com/ceph/ceph/blob/master/src/librbd/io/ObjectRequest.cc#L584

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented May 4, 2017

@dillaman I still have questions, if we enable the striping feature, and the stripe_unit size is less than a sector, then cmpext opercation will still cross object. Whether we need to add a new restriction that stripe_unit should be greater than a sector or disable

@dillaman

This comment has been minimized.

Contributor

dillaman commented May 4, 2017

@wangzhengyong In the ImageRequest class for CompareAndWrite, if the number of object requests is > 1, you should fail the request. The stripe unit already needs to be a multiple of the object size, but if someone is using very tiny values, I think that's an understandable limitation.

@wangzhengyong wangzhengyong changed the title from Librbd: Add cmpext_write API to Librbd: Add compare_and_write API May 18, 2017

@wangzhengyong wangzhengyong changed the title from Librbd: Add compare_and_write API to Librbd: Add CompareAndWrite API May 18, 2017

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented May 18, 2017

@dillaman I modified the interface implementation, do some modifies on the basis of ordinary write to let the operation in the same transaction

  1. add the rados op "compext" in add_write_ops
  2. get the return value "mismatch_offset" if mismatch
  3. do not support object cache, because objectcacher will convert the compare_and_write to write op, it is unreasonable

However, there are some problems needed to resolve:

  1. The code can not pass compile, "error: wrong number of template arguments (21, should be 20)"
    because the boost::variant<> used in
    https://github.com/ceph/ceph/blob/master/src/librbd/journal/Types.h#L386
    is limited to 20 defined by BOOST_MPL_LIMIT_LIST_SIZE, could you give me some suggestions?
  2. About the testcase, i don't know how to disable the rbd_cache for compare_and_write test without affecting other use cases
@dillaman

This comment has been minimized.

Contributor

dillaman commented May 19, 2017

@wangzhengyong

The code can not pass compile, "error: wrong number of template arguments (21, should be 20)"
because the boost::variant<> used in
https://github.com/ceph/ceph/blob/master/src/librbd/journal/Types.h#L386
is limited to 20 defined by BOOST_MPL_LIMIT_LIST_SIZE, could you give me some suggestions?

You should be able to switch to a sequence-based variant via boost::make_variant_over:

#include <boost/mpl/vector.hpp>
...
typedef boost::mpl::vector<
  AioDiscardEvent,
...
  AioWriteSameEvent,
  UnknownEvent> EventVector;
typedef boost::make_variant_over<EventVector>::type Event;

About the testcase, i don't know how to disable the rbd_cache for compare_and_write test without affecting other use cases

If the cache is enabled and a writesame is issued, you just need to invalidate the corresponding extents within the cache before sending the writesame ops to the OSDs. The cache discard operation just drops the specified extents from the cache. This can be handle using a call to image_ctx.object_cacher->discard_set(image_ctx.object_set, object_extents).

@dillaman

This comment has been minimized.

Contributor

dillaman commented May 30, 2017

@wangzhengyong Please use "git rebase origin/master" and force push your branch back up instead of merging master into your branch.

if (read_op.outdata.length() != osd_op.indata.length())
return -EINVAL;
if (read_op.outdata.length() != osd_op.indata.length()) {
read_op.outdata.append_zero(osd_op.indata.length() - read_op.outdata.length());

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Seems like this check should be moved to after the for loop, limit the loop to read_op.outdata.length(), and then return (-MAX_ERRNO - read_op.outdata.length()) if the in/out buffers aren't the same length.

This comment has been minimized.

@wangzhengyong

wangzhengyong May 31, 2017

Contributor

@dillaman there will be some "HOLE" while there is no data in the (offset, length). IMO, that is valid, i only need fill it with zero-data.

This comment has been minimized.

@dillaman

dillaman May 31, 2017

Contributor

@wangzhengyong While that is true that RBD treats missing data as zeros, that might not always be the case. In the generic sense, it might be better to return the offset where the first mismatch occurs (be it caused by either data or truncation).

static ObjectRequest* create_compare_and_write(ImageCtxT *ictx, const std::string &oid,
uint64_t object_no,
uint64_t object_off,
const ceph::bufferlist &cmp_data,
const ceph::bufferlist &write_data,
const ::SnapContext &snapc,
Context *completion, int op_flags);
int op_flags,

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Merge error

uint64_t object_off,
const ceph::bufferlist &cmp_data,
const ceph::bufferlist &write_data,
const ::SnapContext &snapc,

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Pass uint64_t *mismatch_offset for optional mismatch result.

ObjectCompareAndWriteRequest(ImageCtx *ictx, const std::string &oid, uint64_t object_no,
uint64_t object_off, const ceph::bufferlist &cmp_bl, const ceph::bufferlist &write_bl,
const ::SnapContext &snapc, Context *completion,

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Pass uint64_t *mismatch_offset for optional mismatch result.

@@ -657,6 +670,70 @@ void ObjectWriteSameRequest::send_write() {
AbstractObjectWriteRequest::send_write();
}
void ObjectCompareAndWriteRequest::add_write_ops(librados::ObjectWriteOperation *wr) {

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: update to new param style

aio_comp->set_request_count(1);
C_AioRequest *req_comp = new C_AioRequest(aio_comp);
image_ctx.image_cache->aio_compare_and_write(std::move(this->m_image_extents),
std::move(m_cmp_bl), std::move(m_bl), m_op_flags, req_comp);

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: indentation

@@ -318,5 +362,7 @@ extern template class librbd::io::ImageWriteRequest<librbd::ImageCtx>;
extern template class librbd::io::ImageDiscardRequest<librbd::ImageCtx>;
extern template class librbd::io::ImageFlushRequest<librbd::ImageCtx>;
extern template class librbd::io::ImageWriteSameRequest<librbd::ImageCtx>;
extern template class librbd::io::ImageCompareAndWriteRequest<librbd::ImageCtx>;

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: extraneous newline

typedef std::vector<std::pair<uint64_t, uint64_t> > Extents;
ObjectCompareAndWriteRequest(ImageCtx *ictx, const std::string &oid, uint64_t object_no,
uint64_t object_off, const ceph::bufferlist &cmp_bl, const ceph::bufferlist &write_bl,

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: indentation

@@ -76,6 +76,13 @@ class ObjectRequest : public ObjectRequestHandle {
const ceph::bufferlist &data,
const ::SnapContext &snapc,
Context *completion, int op_flags);
static ObjectRequest* create_compare_and_write(ImageCtxT *ictx, const std::string &oid,
uint64_t object_no,

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: indentation

// object extent compare mismatch
uint64_t offset = -MAX_ERRNO - r;
Striper::extent_to_file(image_ctx->cct, &image_ctx->layout,
this->m_object_no, offset, this->m_object_len, file_extents);

This comment has been minimized.

@dillaman

dillaman May 30, 2017

Contributor

Nit: indentation

@dillaman dillaman changed the title from Librbd: Add CompareAndWrite API to librbd: add compare and write API May 30, 2017

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented Jun 6, 2017

@dillaman
Ping!

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jun 6, 2017

@wangzhengyong Sorry -- I don't receive notifications when you push if you don't make a comment. It looks like there is still lots of indentation issues throughout the patchset and a rebase against master is also required.

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented Jun 8, 2017

@dillaman
Sorry about the indentation issues , now i fix them and rebase against the master branch

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jun 9, 2017

@wangzhengyong Thanks! Will review ASAP

@dillaman

Mostly nits -- looks very close

@@ -173,6 +188,20 @@ void ImageRequest<I>::send() {
return;
}
if (get_aio_type() == AIO_TYPE_COMPARE_AND_WRITE) {

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

I'd rather see this logic deleted and the existing prune_object_extents be modified to return a potential error code. ImageCompareAndWriteRequest would override it and return an -EINVAL if there is more than a single object extent.

This comment has been minimized.

@wangzhengyong

wangzhengyong Jun 9, 2017

Contributor

@dillaman that make sense, i will modify this ASAP

uint64_t tid = 0;
assert(this->m_image_extents.size() == 1);
for (auto &extent : this->m_image_extents) {

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: instead of the loop, just use auto &extent = this->m_image_extents.front()

template <typename I>
ObjectRequestHandle *ImageCompareAndWriteRequest<I>::create_object_request(
const ObjectExtent &object_extent,

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: just indent 4 spaces for these params

bufferlist bl;
assemble_extent(object_extent, &bl);
ObjectRequest<I> *req = ObjectRequest<I>::create_compare_and_write(
&image_ctx, object_extent.oid.name,

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: just indent 4 spaces for these params

bufferlist &&bl, uint64_t *mismatch_offset,
int op_flags, const ZTracer::Trace &parent_trace)
: AbstractImageWriteRequest<ImageCtxT>(
image_ctx, aio_comp, std::move(image_extents), "compare_and_write", parent_trace),

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: indent one level past AbstractImageWriteRequest on previous line

}
int r = ictx->io_work_queue->compare_and_write(ofs, len, bufferlist{cmp_bl},
bufferlist{bl}, mismatch_off, op_flags);

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: indentation

}
ictx->io_work_queue->aio_compare_and_write(get_aio_completion(c), off, len,
bufferlist{cmp_bl}, bufferlist{bl}, mismatch_off, op_flags, false);

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: indentation

bufferlist bl;
bl.push_back(create_write_raw(ictx, buf, len));
ictx->io_work_queue->aio_compare_and_write(get_aio_completion(comp), off, len,
std::move(cmp_bl), std::move(bl), mismatch_off, op_flags, false);

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: indentation

if (read_op.outdata[p] != osd_op.indata[p]) {
return (-MAX_ERRNO - p);
}
}
if (read_op.outdata.length() != osd_op.indata.length()) {

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Thinking about this more, even though I think this is likely the correct behavior for the OSD (i.e. you said to compare beyond the extent of the object), this will complicate RBD where truncated/missing objects are treated as zeroed. If a client app reads extents 0~512 of a non-existent object, sees that it's currently zero, and uses that zero buffer as a compare_and_write, the compare and write will fail with mismatch at offset 0. Therefore, I think we should keep the original loop bounds (on the indata, but if p >= read_op.outdata.length(), fail if indata[p] != 0.

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

@jdurgin @liewegas You guys have any thoughts about this? Other alternatives would be to keep the extent check in-place but (1) override it via a flag when issuing the cmpext op, or (2) perhaps allow cmpext to pass a smaller compare buffer than the write buffer (which will allow librbd to trim any trailing zeroes from the compare buffer before sending to the OSD).

@@ -1551,6 +1594,7 @@ TEST_F(TestLibRBD, TestIO)
}
}

This comment has been minimized.

@dillaman

dillaman Jun 9, 2017

Contributor

Nit: remove added blank line

wangzhengyong added some commits Jun 2, 2017

librbd: add CompareAndWrite ObjectRequest
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
librbd: add compare_and_write ImageRequest
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
librbd: handle compare_and_write imagerequest in ImageRequestWQ
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
librbd: add compare_and_write perfcounter
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
librbd: add compare_and_write/aio_compare_and_write API
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
test/librbd: handle cmpext request in LibradosTestStub
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
librbd/journal: handle compare_and_write event
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
ceph osd: fix cmpext bug
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
test/librbd: handle compare_and_write op in test_mock_Replay
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>

wangzhengyong added some commits Jun 12, 2017

test/librbd: add Mock Image cache for compare_and_write
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
test/librbd: add compare_and_write testcase
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
test/librbd: add compare_and_write test for fsx
Signed-off-by: Zhengyong Wang <wangzhengyong@cmss.chinamobile.com>
@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented Jun 13, 2017

@dillaman update

@dillaman

lgtm

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jun 14, 2017

@wangzhengyong few test failures related to compare and write [1] on EC data pools. I am going to open a PR to fix that OSD issue and then this PR should be good-to-go. Thanks.

@wangzhengyong

This comment has been minimized.

Contributor

wangzhengyong commented Jul 6, 2017

@dillaman is there any update?

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 6, 2017

@wangzhengyong Negative -- no movement on the OSD fix. Perhaps you can just update the tests to skip the test when a data pool is in-use and pool_requires_alignment2 indicates it's an EC pool.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Jul 26, 2017

Manually merged after resolving conflicts

@dillaman dillaman closed this Jul 26, 2017

dillaman added a commit that referenced this pull request Jul 26, 2017

Merge pull request #14868 from wangzhengyong/librbd
librbd: add compare and write API

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment