-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
librados: support copy-less writes in librados C API #24943
Conversation
This commit introduces rados_write2(), rados_write_full2(), rados_writesame2(), rados_append2() and rados_setxattr2() which do not copy user buffers into temporary bufferptrs, reducing memory and CPU usage, at expense of requiring that provied buffers don't change through entire write/append, which should perform synchronously anyway. Otherwise these "2" function perform as their non-"2" equivalents. Signed-off-by: Piotr Dałek <git@predictor.org.pl>
This commit introduces rados_aio_write2(), rados_aio_write_full2(), rados_aio_writesame2(), rados_aio_append2() and rados_aio_setxattr2() which do not copy user buffers into temporary bufferptrs, reducing memory and CPU usage, at expense of requiring that provied buffers don't change through entire write/append. Otherwise these "2" function perform as their non-"2" equivalents. Signed-off-by: Piotr Dałek <git@predictor.org.pl>
3baa22f
to
4630896
Compare
* Cancel ops to the old osd when ops are redirected, so buffer will not be used after it has been freed. * out_seq is only updated for new messages, and messages from sent queue are sent using their original sequence. So even when we remove ops from sent queue, the out_seq is correct. Signed-off-by: liyichao <liyichao.good@gmail.com> Reviewed-by: Li Wang <laurence.liwang@gmail.com>
@liewegas please take a look, we think librbd will benefit from it |
* @param off byte offset in the object to begin writing at | ||
* @returns 0 on success, negative error code on failure | ||
*/ | ||
CEPH_RADOS_API int rados_write2(rados_ioctx_t io, const char *oid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I question whether or not new methods are required. In theory it should be safe to assume that the contents of buf
for an AIO method must remain the same until the call completes (similar to aio_write
in Linux [1]). Plus, removing the const-ness of the buf
seems odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should I update these functions rados_write, rados_write_full, rados_writesame, rados_append, rados_setxattr, rados_aio_write, rados_aio_append, rados_aio_write_full, rados_aio_writesame, rados_aio_setxattr
? Or what of these should be a new rados_xxx2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman it turns out the input buffer is indeed deep copied by librados. see buffer::list::append(const char *data, unsigned len)
and buffer::ptr::append(const char *p, unsigned l)
.
and i am not sure we can make this semantic change even we bump up the soversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the original rados_write
functions deep copy the buf
because of buffer::list::append
, I agree with @dillaman that it is safe to assume the buf passed in rados_write
, rados_aio_write
, etc should be valid before io finishes, just like write
and aio_write
in Linux. Thus I suggest updating the original rados_write
by replacing buffer::list::append
with buffer::create_static
.
I also intend to remove the copy in librbd aio_write when these functions are updated.
Then, how should we make progress on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liyichao if we are sure that all the librados clients can ensure the the buffer passed to these write calls will be valid until the call completes. there is no reason we cannot do the switch.
but to be 100% sure about this, i'd suggest check the callers of librados, like librbd,libcephfs,python-rados and qemu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
librbd and QEMU don't use this interface. fio's rados engine does use it but it assumes the buffer needs to be untouched until the completion callback. I cannot speak for any users of python-rados or libcephfs, though.
@@ -1078,7 +1111,9 @@ void ProtocolV1::randomize_out_seq() { | |||
ssize_t ProtocolV1::write_message(Message *m, bufferlist &bl, bool more) { | |||
FUNCTRACE(cct); | |||
ceph_assert(connection->center->in_thread()); | |||
m->set_seq(++out_seq); | |||
if (m->get_seq() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if a message was being sent to OSD a
with seq 1 but was then redirected to OSD b
, it would keep seq 1 even if the connection to OSD b
had an unrelated in-flight seq 1 message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in _prepare_osd_op, the resent message will be a new message with sequence 0.
@adamemerson could you help me understand any overlap or implications for rados-unleashed? thanks! |
|
||
void cancel_ops(const set<ceph_tid_t> &ops) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider alternatives to std::set here. It pulls in a big dynamically allocated tree for what may not even live that long. boost::container::flat_set has the same semantics and performs well (assuming you insert things in order) and isn't nearly as Heapy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very worthwhile and interesting work.
for (auto const& p: cancel_ops) { | ||
p.first->cancel_ops(p.second); | ||
} | ||
|
||
for (list<LingerOp*>::iterator iter = unregister_lingers.begin(); | ||
iter != unregister_lingers.end(); | ||
++iter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely want this commit. I can use it to interfaces for RADOS_unleashed that just take byte ranges or spans or what-have-you without the caller needing to provide a bufferlist at all.
@mattbenjamin The third commit that changes objecter and messenger will let us safely have write interfaces that work with completely unmanaged buffers through the same uses of create static. |
@liyichao @dillaman Is there a way we can try to fast track this? The third commit could be separated out if there's disagreement about the changes to the librados C interface. It would allow bufferlists made by create_static to be used safely from the C++ interface all on its own, from what it looks like. |
I too would like to see the third commit merged. But (1) I think we would need to kill off the simple messenger or fix it as well, and (2) I am pretty sure there is still a possibility for touching freed memory in |
@adamemerson @dillaman , I have separated the third commit into #25689 , please have a look. @dillaman I have fixed (1) and (2). To fix (2), I use a send_lock to make |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
ping @dillaman |
This is going to need to wait until after the Nautilus release has branched from master. We either need #25689 merged first or pull it into this PR and close the other one. |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
Based on work #12216 , and fix the problem as mentioned in yuyuyu101@794b49b#r15707924 .