DRAFT: Questions for asynchronous IO #6208

Closed
Wanted to merge 7 commits.

Conversation

micvbang (Contributor)

Hello,

As noted in the title, this PR is a draft and is not meant to be merged. I'm posting my progress on adding asynchronous IO to libtorrent and want to give some context for a few questions I have regarding whether and how to continue this work.

For context, I saw your blog post from 2012 about async IOs and got interested in trying to improve the way that IOs are done in libtorrent. I know I'm almost 10 years late to the party, but it seems to me that the implementation of IOs in libtorrent hasn't architecturally changed a lot from what you described in that post.

I'm working on the open source project xNVMe which provides a simple API that abstracts over underlying IO paths such as io_uring, libaio, and posix aio to provide all of the benefits of truly asynchronous IOs while requiring the user (libtorrent in this case) to implement only a single API. The xNVMe API then makes it possible to change the actual IO path at runtime.

This PR contains a simple xNVMe disk_interface implementation that is mostly identical to the existing posix disk interface, except that async_read and async_write have been implemented using xNVMe. This implementation is not yet "truly asynchronous" because IOs are reaped immediately after they are posted.

I tried briefly to implement this in a "truly asynchronous" fashion, but my limited understanding of the libtorrent architecture made it difficult to come up with a good design. Instead of hacking on it for too long, I thought I'd start by asking you whether this is something you think is worthwhile to work on and might be interested in merging into libtorrent when it's fully functional, or whether the current thread pool model is fine as it is.
If you think it's worthwhile, I'd love to ask for some pointers in how you think this could fit in with the existing architecture:

The main problem I've had in implementing truly asynchronous IOs so far has been identifying when it's sensible to reap IOs.

Other than that, I've struggled a bit with identifying which existing functionality can be reused in a non-blocking context (e.g. excellent helpers like readwritev), as many of them (naturally) assume blocking behavior.

Do you have any insights or code from your previous experiments with this that could make my life easier?

Thanks!

@@ -213,6 +213,8 @@ rule linking ( properties * )
result += <library>/try_signal//try_signal/<link>static ;
}

result += <linkflags>"-lxnvme -laio -luuid -lnuma -pthread -lrt" ;
Contributor

What about android/windows/darwin? Will it build there?

micvbang (Contributor, Author)

We are currently working on Windows support for xNVMe and expect a release in a couple of months. It'll be using I/O Completion Ports for the async IO.

xNVMe is currently untested on Darwin, but does run on FreeBSD. Android is also untested, but at least the POSIX interfaces should work, and then depending on the kernel it should support libaio and io_uring. I expect that we'll get test coverage for both Darwin and Android :)

arvidn (Owner) left a comment

This is quite exciting!
I haven't had time to go through everything in detail yet, and I will have more questions later, but I figured I shouldn't delay posting these too long.

I'm curious about your take on async disk I/O. I kind of dismissed it for these reasons:

  1. open()/close()/stat() are all synchronous in all async I/O models (iirc)
  2. I expect storage to be moving towards Direct Access, i.e. where it's treated more like persistent RAM, mapped into the address space. With this in mind, multi-threaded mmap seems to be somewhat future proof.

include/libtorrent/aux_/xnvme_storage.hpp (outdated review thread, resolved)
src/xnvme_disk_io.cpp (outdated review thread, resolved)
auto const len2 = v2_block ? std::min(default_block_size, piece_size2 - offset) : 0;

iovec_t b = {buffer.data(), std::max(len, len2)};
int const ret = st->readv(m_settings, b, piece, offset, error);
arvidn (Owner)

I imagine you could do something similar to the regular async read case here, and move the hashing to the completion handler, right? To avoid a blocking read call.

micvbang (Contributor, Author)

I haven't worked on this yet, but it should definitely be possible.

xnvme_storage* st = m_torrents[storage].get();
storage_error ec;
status_t ret;
std::tie(ret, p) = st->move_storage(p, flags, ec);
arvidn (Owner)

This is still blocking, as I'm sure you understand. I think it's fine to leave "fringe" features like this blocking. However, if this becomes "production quality", it would probably have to move to a separate thread, as it would otherwise block the network loop.

micvbang (Contributor, Author)

Yep, agreed. I was considering doing the read/write IOs asynchronously and using a thread pool to perform the file system operations.

std::string m_xnvme_backend;
};

TORRENT_EXPORT std::unique_ptr<disk_interface> xnvme_disk_io_constructor(
arvidn (Owner)

Instead of making this a free function, you could make it a function object. That way, you can pass arbitrary arguments into the disk subsystem implementation, for example the xnvme backend, and you don't need a new settings_pack field nor a new field in session_params.

You can take a look at this example, where I use a custom disk I/O for simulation tests.

The test_disk object is a kind of parameter pack, as well as a function object, it implements operator(), here. Implemented here.

To use this you construct a test_disk object and assign it to session_params::disk_constructor (example).


auto search = m_file_handles->find(fname);
if (search != m_file_handles->end()) {
return search->second;
arvidn (Owner)

Are all file handles opened in both read and write mode? If not, you may get a read-only file here despite asking for write mode. If files are always opened in read/write mode, you may fail to seed files from read-only media, or volumes your user only has read access to.

If you need a more sophisticated file cache, see file_view_pool.

micvbang (Contributor, Author)

are all file handles opened in both read and write mode?

Currently they are

If files are always opened in read/write mode, you may fail to just seed files from read-only media. or volumes your user only has read-access to.

Great point! I'll look into the file_view_pool cache.

include/libtorrent/aux_/xnvme_storage.hpp (outdated review thread, resolved)
src/xnvme_disk_io.cpp (outdated review threads, resolved)
});

int res = m_torrents[storage]->writev(m_settings, b, r.piece, r.start, *error, whandler);
TORRENT_ASSERT(res >= 0);
arvidn (Owner)

It looks like both writev() and readv() have failure paths where the handler is not called, which is problematic. It would also trigger this assert.

micvbang (Contributor, Author)

First of all, thanks for the great comments on the code! I'll make sure to address everything noted.

In regards to your questions about async disk I/O:

  1. open()/close()/stat() are all synchronous in all async I/O models (iirc)

Yep, that's pretty much the case, except on Linux, which can do some of these things via io_uring. But nothing is defined in POSIX or otherwise generally available in a portable fashion.

I'm wondering why the lack of asynchronous versions of said calls is enough to dismiss async I/O? There must be some part of the workflow that I'm missing - e.g., is torrenting largely bottlenecked by metadata I/O operations?

My gut feeling would be that caching of file descriptors could amortize the cost of blocking metadata operations to be small enough that the increased performance from async read/write calls would more than make up for it. But I guess I'm basing this on the assumption that many torrents contain files that are big enough to benefit from a higher I/O depth.

This gut feeling is based on nothing, so some of your insight on this would be much appreciated!

  2. I expect storage to be moving towards Direct Access, i.e. where it's treated more like persistent RAM, mapped into the address space. With this in mind, multi-threaded mmap seems to be somewhat future proof.

What do you mean by Direct Access? Do you mean bypassing the file system using raw block devices? Or do you mean new hardware such as persistent memory? Or something else?

Generally, I think it will be an excellent long-term solution to provide an IO path that utilizes async I/Os because the $/GB of HDD and SSDs is likely to stay much lower than hardware that natively supports byte-addressing/unaligned access.

arvidn (Owner) commented May 21, 2021

I'm wondering why the lack of asynchronous versions of said calls is enough to dismiss async I/O? There must be some part of the workflow that I'm missing - e.g., is torrenting largely bottlenecked by metadata I/O operations?

If I made blocking calls in the network thread, yes, it would be.

stat() is probably the most notable example. When starting a torrent with 1000 files in it, the first thing we need to do is make 1000 calls to stat() to see which files exist.

It's not a showstopper; you just have to make those calls in a thread pool. However, once you have a thread pool, it's natural to use that for the actual I/O as well. Synchronizing opening and closing files with issuing async reads and writes becomes complicated.

Additionally, hashing pieces ought to be done in separate threads too, so it's hard to escape the thread pool.

My gut feeling would be that caching of file descriptors could amortize the cost of blocking metadata operations to be small enough that the increased performance from async read/write calls would more than make up for it. But I guess I'm basing this on the assumption that many torrents contain files that are big enough to benefit from a higher I/O depth.

Any blocking call is problematic in the main thread, since it would prevent all sockets from responding to events and peer requests from being serviced. I think most torrents probably have a few large files, but there are so many torrents in the world that even the minority of many-small-files torrents is still a large number. Both extremes need to be handled well.

What do you mean by Direct Access? Do you mean bypassing the file system using raw block devices? Or do you mean new hardware such as persistent memory? Or something else?

What Linux calls DAX: see here and here.

arvidn (Owner) commented May 21, 2021

By the way, please don't interpret any of my comments as discouragement!
I think it would be awesome to get this to work and see how it does.

micvbang (Contributor, Author)

By the way, please don't interpret any of my comments as discouragement!
I think it would be awesome to get this to work and see how it does.

Definitely not, your honest feedback is much appreciated!
I think async I/O has the potential to improve the I/O path of libtorrent. But we'll have to test whether that's true :)

src/xnvme_storage.cpp (outdated review threads, resolved)
src/xnvme_disk_io.cpp (outdated review thread, resolved)
micvbang (Contributor, Author)

I'm currently working towards making reads and writes truly asynchronous, i.e. moving the current usage of xnvme_queue_wait to somewhere that we can reap IO completions. Is there a main loop where stuff like this can be done? Do you have a suggestion for a good place to do this?

arvidn (Owner) commented May 27, 2021

I'm currently working towards making reads and writes truly asynchronous, i.e. moving the current usage of xnvme_queue_wait to somewhere that we can reap IO completions. Is there a main loop where stuff like this can be done? Do you have a suggestion for a good place to do this?

I see two approaches:

  1. Spawn a thread that reaps completions and post callbacks to the io context
  2. Somehow integrate with the io context message loop. This is probably a bit complicated to do in a platform-independent manner, but something like an eventfd could probably be used on Linux. To get a callback when a file descriptor becomes readable, you can use stream_descriptor and wait on it by calling async_wait(). On Windows I would expect you could integrate with a HANDLE (say, a Win32 Event) by using object_handle.

micvbang (Contributor, Author)

I've been wondering - how do you do performance testing of the disk_interfaces?
I first tried setting up a tracker locally and moving data between two clients, but that doesn't seem to give me very consistent numbers. I've been trying out connection_tester as well, which is a lot easier to get started with.

What would you recommend?

arvidn (Owner) commented Jun 10, 2021

There's this python script that runs transfer benchmarks. This covers more than just the disk I/O, but I would expect that to be an important part of it. This script is not run regularly and may have experienced some bit rot. I would also expect you'd want to tweak certain aspects of it for your specific setup.

https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/run_benchmark.py

There's also a test that benchmarks checking of files. This is much more disk I/O centric.

https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/benchmark_checking.py

micvbang (Contributor, Author)

I'll try to play around with this - thank you!

micvbang (Contributor, Author)

Hey again!

I've been attempting to make the IOs "truly" async by not forcing IO reaping inside async_read and async_write. This has brought progress in both the right and wrong direction. The IOs are asynchronous now, which is awesome, but I've messed something up in my implementation that I've had a hard time tracking down.

To get the async IOs started, I went with your suggestion of having a separate thread do the polling. As xNVMe IO queues aren't thread safe, this has required me to add some locking. I think the locking is currently a bit too pessimistic, but an iteration or two should make it clear whether that's the case. Overall I think this works well.

I've messed around with the implementation enough to get the test_storage tests passing. That helped me to find a few bugs, which was awesome. There's still a bunch of other tests that don't pass, so there's an obvious avenue for future work.

My current problem (or symptom of my problem) is that when I run client_test I can get it to actually download torrent data -- it progresses through the torrent (slowly), but a lot of bytes end up in "fail" because piece hashing fails, I think. At least I'm often getting logs like "ubuntu-20.04.2.0-desktop-amd64.iso hash for piece xxx failed", which leads me to believe that I screwed up async_hash somehow. My implementation of async_hash is a bit complicated, so it's very likely that there's something there I'm missing.

The readv2 and writev implementations are so simple that I'm pretty confident they aren't the issue. The only exception to this statement is that readv2 and writev both utilize my prepare_ios function, which mimics and replaces readwritev, and I'm still not entirely certain whether what I'm doing is safe. I think it is, though.

Ah, and I definitely see how messed up my implementation of open_file_xnvme is; it has gotten even more hacky since you last saw it, so a next step is definitely to look into file_view_pool. I haven't done so yet, as I don't think it relates to the problem I'm currently focusing on.

My current focus is to try to get torrents downloading without seeing any data in the "fail" bucket. My theory is that the problem is with async_hash.
Would you have time to review async_hash and prepare_ios to see if there's something obvious that I'm messing up? Or just do a general review to spot the parts that I haven't realized myself are problematic?

Commit notes:
The code is not currently "truly async" as it waits on all outstanding IOs before returning to the caller. The performance should improve substantially once this behavior is changed to "truly async".
WIP: refactor and test prepare_ios
WIP: remove unused storage_error parameter from readv2, writev
arvidn (Owner) commented Jul 16, 2021

One thing that makes async_hash complicated is that it's expected to "synchronize" with any previous writes that have been posted. So if writes are issued and started as async tasks, and then an async_hash arrives, it's supposed to see all previous writes, even if they haven't completed yet.

In the mmap_disk_io implementation I use a store_buffer (see include/libtorrent/aux_/store_buffer.hpp) where all pending writes are stored while waiting for them to complete. Any read or hash operation always looks there first, before issuing reads from disk.

It's similar to store buffers in CPUs.

micvbang (Contributor, Author)

Ahhh, this makes a lot of sense! That definitely could explain why I'm seeing random fails.

Thanks for the hint - I'll look into this!

stale bot commented Oct 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
