DRAFT: Questions for asynchronous IO #6208

Closed
Wanted to merge 7 commits.

Conversation

micvbang (Contributor)

Hello,

As noted in the title, this PR is a draft and is not meant to be merged. I'm posting my progress on adding asynchronous IO to libtorrent and want to give some context for a few questions I have regarding whether and how to continue this work.

For context, I saw your blog post from 2012 about async IOs and got interested in trying to improve the way that IOs are done in libtorrent. I know I'm almost 10 years late to the party, but it seems to me that the implementation of IOs in libtorrent hasn't architecturally changed a lot from what you described in that post.

I'm working on the open source project xNVMe which provides a simple API that abstracts over underlying IO paths such as io_uring, libaio, and posix aio to provide all of the benefits of truly asynchronous IOs while requiring the user (libtorrent in this case) to implement only a single API. The xNVMe API then makes it possible to change the actual IO path at runtime.

This PR contains a simple xNVMe disk_interface implementation that is mostly identical to the existing posix disk interface, except that async_read and async_write have been implemented using xNVMe. This implementation is not yet "truly asynchronous" because IOs are reaped immediately after they are posted.

I tried briefly to implement this in a "truly asynchronous" fashion, but my limited understanding of the libtorrent architecture made it difficult to come up with a good design. Instead of hacking on it for too long, I thought I'd start by asking you whether this is something you think is worthwhile to work on and might be interested in merging into libtorrent when it's fully functional, or whether the current thread pool model is fine as it is.
If you think it's worthwhile, I'd love to ask for some pointers in how you think this could fit in with the existing architecture:

The main problem I've had in implementing truly asynchronous IOs so far has been identifying when it's sensible to reap IOs.

Other than that, I've struggled a bit with identifying which existing functionality can be reused in a non-blocking context (e.g. excellent helpers like readwritev), as many of them (naturally) assume blocking behavior.

Do you have any insights or code from your previous experiments with this that could make my life easier?

Thanks!

@@ -213,6 +213,8 @@ rule linking ( properties * )
result += <library>/try_signal//try_signal/<link>static ;
}

result += <linkflags>"-lxnvme -laio -luuid -lnuma -pthread -lrt" ;
Contributor

What about android/windows/darwin? Will it build there?

micvbang (Contributor, Author)

We are currently working on Windows support for xNVMe and expect a release in a couple of months. It'll be using I/O Completion Ports for the async IO.

xNVMe is currently untested on Darwin, but does run on FreeBSD. Android is also untested, but at least the POSIX interfaces should work, and then depending on the kernel it should support libaio and io_uring. I expect that we'll get test coverage for both Darwin and Android :)

arvidn (Owner) left a comment

This is quite exciting!
I haven't had time to go through everything in detail yet, and I will have more questions later, but I figured I shouldn't delay posting these too long.

I'm curious about your take on async disk I/O. I kind of dismissed it for these reasons:

  1. open()/close()/stat() are all synchronous in all async I/O models (iirc)
  2. I expect storage to be moving towards Direct Access, i.e. where it's treated more like persistent RAM, mapped into the address space. With this in mind, multi-threaded mmap seems to be somewhat future proof.

include/libtorrent/aux_/xnvme_storage.hpp (outdated review thread, resolved)
src/xnvme_disk_io.cpp (outdated review thread, resolved)
auto const len2 = v2_block ? std::min(default_block_size, piece_size2 - offset) : 0;

iovec_t b = {buffer.data(), std::max(len, len2)};
int const ret = st->readv(m_settings, b, piece, offset, error);
arvidn (Owner)

I imagine you could do something similar to the regular async read case here, and move the hashing to the completion handler, right? To avoid a blocking read call.

micvbang (Contributor, Author)

I haven't worked on this yet, but it should definitely be possible.

xnvme_storage* st = m_torrents[storage].get();
storage_error ec;
status_t ret;
std::tie(ret, p) = st->move_storage(p, flags, ec);
arvidn (Owner)

This is still blocking, as I'm sure you understand. I think it's fine to leave "fringe" features like this blocking. However, if this becomes "production quality", it would probably have to move to a separate thread, as it would otherwise block the network loop.

micvbang (Contributor, Author)

Yep, agreed. I was considering doing the read/write IOs asynchronously and using a thread pool to perform the file system operations.

std::string m_xnvme_backend;
};

TORRENT_EXPORT std::unique_ptr<disk_interface> xnvme_disk_io_constructor(
arvidn (Owner)

Instead of making this a free function, you could make it a function object. That way, you can pass arbitrary arguments into the disk subsystem implementation, for example the xnvme backend, and you don't need a new settings_pack field nor a new field in session_params.

You can take a look at this example, where I use a custom disk I/O for simulation tests.

The test_disk object is a kind of parameter pack, as well as a function object, it implements operator(), here. Implemented here.

To use this you construct a test_disk object and assign it to session_params::disk_constructor (example).


auto search = m_file_handles->find(fname);
if (search != m_file_handles->end()) {
return search->second;
arvidn (Owner)

Are all file handles opened in both read and write mode? If not, you may get a read-only file here despite asking for write mode. If files are always opened in read/write mode, you may fail to seed files from read-only media, or volumes your user only has read access to.

If you need a more sophisticated file cache, see file_view_pool.

micvbang (Contributor, Author)

are all file handles opened in both read and write mode?

Currently they are

If files are always opened in read/write mode, you may fail to just seed files from read-only media. or volumes your user only has read-access to.

Great point! I'll look into the file_view_pool cache.

include/libtorrent/aux_/xnvme_storage.hpp (outdated review thread, resolved)
src/xnvme_disk_io.cpp (outdated review threads, resolved)
});

int res = m_torrents[storage]->writev(m_settings, b, r.piece, r.start, *error, whandler);
TORRENT_ASSERT(res >= 0);
arvidn (Owner)

It looks like both writev() and readv() have failure paths where the handler is not called, which is problematic. It would also trigger this assert.

micvbang (Contributor, Author)

First of all, thanks for the great comments on the code! I'll make sure to address everything noted.

In regards to your questions about async disk I/O:

  1. open()/close()/stat() are all synchronous in all async I/O models (iirc)

Yep, that's pretty much the case, except on Linux, which can do some of these things via io_uring. But nothing is defined in POSIX or otherwise generally available in a portable fashion.

I'm wondering why the lack of asynchronous versions of said calls is enough to dismiss async I/O? There must be some part of the workflow that I'm missing - e.g., is torrenting largely bottlenecked by metadata I/O operations?

My gut feeling would be that caching of file descriptors could amortize the cost of blocking metadata operations to be small enough that the increased performance from async read/write calls would more than make up for it. But I guess I'm basing this on the assumption that many torrents contain files that are big enough to benefit from a higher I/O depth.

This gut feeling is based on nothing, so some of your insight on this would be much appreciated!

  2. I expect storage to be moving towards Direct Access, i.e. where it's treated more like persistent RAM, mapped into the address space. With this in mind, multi-threaded mmap seems to be somewhat future proof.

What do you mean by Direct Access? Do you mean bypassing the file system using raw block devices? Or do you mean new hardware such as persistent memory? Or something else?

Generally, I think it will be an excellent long-term solution to provide an IO path that utilizes async I/Os because the $/GB of HDD and SSDs is likely to stay much lower than hardware that natively supports byte-addressing/unaligned access.

arvidn (Owner) commented May 21, 2021

I'm wondering why the lack of asynchronous versions of said calls is enough to dismiss async I/O? There must be some part of the workflow that I'm missing - e.g., is torrenting largely bottlenecked by metadata I/O operations?

If I made blocking calls in the network thread, yes, it would be.

stat() is probably the most notable example. When starting a torrent with 1000 files in it, the first thing we need to do is make 1000 calls to stat() to see which files exist.

It's not a showstopper; you just have to make those calls in a thread pool. However, once you have a thread pool, it's natural to use that for the actual I/O as well. Synchronizing opening and closing files with issuing async reads and writes becomes complicated.

Additionally, hashing pieces ought to be done in separate threads too, so it's hard to escape the thread pool.

My gut feeling would be that caching of file descriptors could amortize the cost of blocking metadata operations to be small enough that the increased performance from async read/write calls would more than make up for it. But I guess I'm basing this on the assumption that many torrents contain files that are big enough to benefit from a higher I/O depth.

Any blocking call is problematic in the main thread, since it would prevent all sockets from responding to events and peer requests from being serviced. I think most torrents probably have a few large files, but there are so many torrents in the world that even the minority of many-small-files torrents is still a large number. Both extremes need to be handled well.

What do you mean by Direct Access? Do you mean bypassing the file system using raw block devices? Or do you mean new hardware such as persistent memory? Or something else?

What Linux calls DAX: see here and here.

arvidn (Owner) commented May 21, 2021

By the way, please don't interpret any of my comments as discouragement!
I think it would be awesome to get this to work and see how it does.

micvbang (Contributor, Author)

By the way, please don't interpret any of my comments as discouragement!
I think it would be awesome to get this to work and see how it does.

Definitely not, your honest feedback is much appreciated!
I think async I/O has the potential to improve the I/O path of libtorrent. But we'll have to test whether that's true :)

src/xnvme_storage.cpp (outdated review threads, resolved)
src/xnvme_disk_io.cpp (outdated review thread, resolved)
micvbang (Contributor, Author)

I'm currently working towards making reads and writes truly asynchronous, i.e. moving the current usage of xnvme_queue_wait to somewhere that we can reap IO completions. Is there a main loop where stuff like this can be done? Do you have a suggestion for a good place to do this?

arvidn (Owner) commented May 27, 2021

I'm currently working towards making reads and writes truly asynchronous, i.e. moving the current usage of xnvme_queue_wait to somewhere that we can reap IO completions. Is there a main loop where stuff like this can be done? Do you have a suggestion for a good place to do this?

I see two approaches:

  1. Spawn a thread that reaps completions and post callbacks to the io context
  2. Somehow integrate with the io context message loop. This is probably a bit complicated to do in a platform-independent manner, but something like an eventfd could probably be used on Linux. To get a callback when a file descriptor becomes readable, you can use stream_descriptor and wait on it by calling async_wait(). On Windows I would expect you could integrate with a HANDLE (say, a Win32 Event) by using object_handle.

micvbang (Contributor, Author)

I've been wondering - how do you do performance testing of the disk_interfaces?
I first tried setting up a tracker locally and moving data between two clients, but that doesn't seem to give me very consistent numbers. I've been trying out connection_tester as well, which is a lot easier to get started with.

What would you recommend?

arvidn (Owner) commented Jun 10, 2021

There's this python script that runs transfer benchmarks. This covers more than just the disk I/O, but I would expect that to be an important part of it. This script is not run regularly and may have experienced some bit rot. I would also expect you'd want to tweak certain aspects of it for your specific setup.

https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/run_benchmark.py

There's also a test that benchmarks checking of files. This is much more disk I/O centric.

https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/benchmark_checking.py

micvbang (Contributor, Author)

I'll try to play around with this - thank you!

micvbang (Contributor, Author)

Hey again!

I've been attempting to make the IOs "truly" async by not forcing IO reaping inside async_read and async_write. This has brought progress in both the right and wrong direction. The IOs are asynchronous now, which is awesome, but I've messed something up in my implementation that I've had a hard time tracking down.

To get the async IOs started, I went with your suggestion of having a separate thread do the polling. As xNVMe IO queues aren't thread safe, this has required me to add some locking. I think the locking is currently a bit too pessimistic, but an iteration or two should make it clear whether that's the case. Overall I think this works well.

I've messed around with the implementation enough to get the test_storage tests passing. That helped me to find a few bugs, which was awesome. There's still a bunch of other tests that don't pass, so there's an obvious avenue for future work.

My current problem (or symptom of my problem) is that when I run client_test I can get it to actually download torrent data -- it progresses through the torrent (slowly), but a lot of bytes end up in "fail" because piece hashing fails, I think. At least I'm often getting logs like "ubuntu-20.04.2.0-desktop-amd64.iso hash for piece xxx failed", which leads me to believe that I screwed up async_hash somehow. My implementation of async_hash is a bit complicated, so it's very likely that there's something there I'm missing.

The readv2 and writev implementations are so simple that I'm pretty confident they aren't the issue. The only exception to this statement is that readv2 and writev both utilize my prepare_ios function, which mimics and replaces readwritev, and I'm still not entirely certain whether what I'm doing is safe. I think it is, though.

Ah, and I definitely see how messed up my implementation of open_file_xnvme is; it has gotten even more hacky since you last saw it, so a next step is definitely to look into file_view_pool. I haven't done so yet, as I don't think it relates to the problem I'm currently focusing on.

My current focus is to try to get torrents downloading without seeing any data in the "fail" bucket. My theory is that the problem is with async_hash.
Would you have time to review async_hash and prepare_ios to see if there's something obvious that I'm messing up? Or just do a general review to spot the parts that I haven't realized myself are problematic?

Commit notes:
The code is not currently "truly async" as it waits on all outstanding IOs before returning to the caller. The performance should improve substantially once this behavior is changed to "truly async".
WIP: refactor and test prepare_ios
WIP: remove unused storage_error parameter from readv2, writev
arvidn (Owner) commented Jul 16, 2021

One thing that makes async_hash complicated is that it's expected to "synchronize" with any previous writes that have been posted. So if writes are issued and started as async tasks, and then an async_hash arrives, it's supposed to see all previous writes, even if they haven't completed yet.

In the mmap_disk_io implementation I use a store_buffer (see include/libtorrent/aux_/store_buffer.hpp) where all pending writes are stored while waiting for them to complete. Any read or hash operation always looks there first, before issuing reads from disk.

It's similar to store buffers in CPUs.

micvbang (Contributor, Author)

Ahhh, this makes a lot of sense! That definitely could explain why I'm seeing random fails.

Thanks for the hint - I'll look into this!

stale bot commented Oct 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
