Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

librados: support copy-less writes in librados C API #24943

Closed
wants to merge 3 commits into from

Conversation

liyichao
Copy link
Contributor

@liyichao liyichao commented Nov 6, 2018

Based on work #12216 , and fix the problem as mentioned in yuyuyu101@794b49b#r15707924 .

  • Cancel ops to the old osd when ops are redirected, so buffer will not be used when it is freed.
  • out_seq is only updated for new messages, and messages from sent queue are sent using their original sequence. So even when we remove ops from sent queue, the out_seq is correct.

This commit introduces rados_write2(), rados_write_full2(),
rados_writesame2(), rados_append2() and rados_setxattr2() which
do not copy user buffers into temporary bufferptrs, reducing
memory and CPU usage, at expense of requiring that provied
buffers don't change through entire write/append, which
should perform synchronously anyway. Otherwise these "2"
function perform as their non-"2" equivalents.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
This commit introduces rados_aio_write2(), rados_aio_write_full2(),
rados_aio_writesame2(), rados_aio_append2() and rados_aio_setxattr2()
which do not copy user buffers into temporary bufferptrs, reducing
memory and CPU usage, at expense of requiring that provied
buffers don't change through entire write/append. Otherwise these "2"
function perform as their non-"2" equivalents.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
* Cancel ops to the old osd when ops are redirected, so buffer will not be used after it has been freed.
* out_seq is only updated for new messages, and messages from sent queue are sent using their original sequence. So even when we remove ops from sent queue, the out_seq is correct.

Signed-off-by: liyichao <liyichao.good@gmail.com>

Reviewed-by: Li Wang <laurence.liwang@gmail.com>
@dragonylffly
Copy link
Contributor

@liewegas please take a look, we think librbd will benefit from it

* @param off byte offset in the object to begin writing at
* @returns 0 on success, negative error code on failure
*/
CEPH_RADOS_API int rados_write2(rados_ioctx_t io, const char *oid,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I question whether or not new methods are required. In theory it should be safe to assume that the contents of buf for an AIO method must remain the same until the call completes (similar to aio_write in Linux [1]). Plus, removing the const-ness of the buf seems odd.

[1] http://man7.org/linux/man-pages/man3/aio_write.3.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I update these functions rados_write, rados_write_full, rados_writesame, rados_append, rados_setxattr, rados_aio_write, rados_aio_append, rados_aio_write_full, rados_aio_writesame, rados_aio_setxattr ? Or what of these should be a new rados_xxx2?

Copy link
Contributor

@tchaikov tchaikov Nov 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dillaman it turns out the input buffer is indeed deep copied by librados. see buffer::list::append(const char *data, unsigned len) and buffer::ptr::append(const char *p, unsigned l).

and i am not sure we can make this semantic change even we bump up the soversion.

Copy link
Contributor Author

@liyichao liyichao Dec 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the original rados_write functions deep copy the buf because of buffer::list::append, I agree with @dillaman that it is safe to assume the buf passed in rados_write, rados_aio_write, etc should be valid before io finishes, just like write and aio_write in Linux. Thus I suggest updating the original rados_write by replacing buffer::list::append with buffer::create_static.

I also intend to remove the copy in librbd aio_write when these functions are updated.

Then, how should we make progress on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyichao if we are sure that all the librados clients can ensure the the buffer passed to these write calls will be valid until the call completes. there is no reason we cannot do the switch.

but to be 100% sure about this, i'd suggest check the callers of librados, like librbd,libcephfs,python-rados and qemu.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

librbd and QEMU don't use this interface. fio's rados engine does use it but it assumes the buffer needs to be untouched until the completion callback. I cannot speak for any users of python-rados or libcephfs, though.

@@ -1078,7 +1111,9 @@ void ProtocolV1::randomize_out_seq() {
ssize_t ProtocolV1::write_message(Message *m, bufferlist &bl, bool more) {
FUNCTRACE(cct);
ceph_assert(connection->center->in_thread());
m->set_seq(++out_seq);
if (m->get_seq() == 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if a message was being sent to OSD a with seq 1 but was then redirected to OSD b, it would keep seq 1 even if the connection to OSD b had an unrelated in-flight seq 1 message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in _prepare_osd_op, the resent message will be a new message with sequence 0.

@mattbenjamin
Copy link
Contributor

@adamemerson could you help me understand any overlap or implications for rados-unleashed? thanks!


void cancel_ops(const set<ceph_tid_t> &ops) override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider alternatives to std::set here. It pulls in a big dynamically allocated tree for what may not even live that long. boost::container::flat_set has the same semantics and performs well (assuming you insert things in order) and isn't nearly as Heapy.

Copy link
Contributor

@adamemerson adamemerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very worthwhile and interesting work.

for (auto const& p: cancel_ops) {
p.first->cancel_ops(p.second);
}

for (list<LingerOp*>::iterator iter = unregister_lingers.begin();
iter != unregister_lingers.end();
++iter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely want this commit. I can use it to interfaces for RADOS_unleashed that just take byte ranges or spans or what-have-you without the caller needing to provide a bufferlist at all.

@adamemerson
Copy link
Contributor

@mattbenjamin The third commit that changes objecter and messenger will let us safely have write interfaces that work with completely unmanaged buffers through the same uses of create static.

@tchaikov tchaikov self-requested a review December 7, 2018 03:17
@adamemerson
Copy link
Contributor

@liyichao @dillaman Is there a way we can try to fast track this? The third commit could be separated out if there's disagreement about the changes to the librados C interface. It would allow bufferlists made by create_static to be used safely from the C++ interface all on its own, from what it looks like.

@dillaman
Copy link

dillaman commented Dec 7, 2018

I too would like to see the third commit merged. But (1) I think we would need to kill off the simple messenger or fix it as well, and (2) I am pretty sure there is still a possibility for touching freed memory in ProtocolV1::write_event since it pops a message from out_q while holding the lock, but touches the assembled bufferlist while not holding the lock.

@liyichao
Copy link
Contributor Author

@adamemerson @dillaman , I have separated the third commit into #25689 , please have a look.

@dillaman I have fixed (1) and (2). To fix (2), I use a send_lock to make Connection::_try_send and cancel_ops mutual exclusive, and deep copy the buffer in outcoming_bl when cancelled message is already in outcoming_bl.

@stale
Copy link

stale bot commented Feb 22, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Feb 22, 2019
@liyichao
Copy link
Contributor Author

ping @dillaman

@stale stale bot removed the stale label Feb 25, 2019
@dillaman
Copy link

This is going to need to wait until after the Nautilus release has branched from master. We either need #25689 merged first or pull it into this PR and close the other one.

@stale
Copy link

stale bot commented Apr 28, 2019

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Apr 28, 2019
@stale
Copy link

stale bot commented Jul 27, 2019

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@stale stale bot closed this Jul 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants