
Design Discussion of Async Support #3254

Open
siying opened this issue Dec 12, 2017 · 25 comments

Comments


siying commented Dec 12, 2017

We decided to have the design discussion of async support here. Let's see how it works.

The initial discussion started with the prototyping work by @yuslepukhin:
#3063
and
#2727


siying commented Dec 12, 2017

I thought about the code, and I propose we refactor it in the following way to make the code more maintainable for both async and sync. These are just thoughts; let me know your opinion.

Get(): have a global context and pass it everywhere.
The global context could look something like this:

struct GetContext {
  // input, output
  // some global variables: timers, merge context, current get context, etc.
  // VersionSet::Get context (FilePicker etc.)
  // TableCache::Get context
  // BlockBasedTableReader::Get context
};

Along the path, each function is rewritten so that the I/O step is clearly split out. If at most one I/O is possible, it can look like this:

void TableCache::Get(GetContext* get_context) {
  pre_io_step(get_context);
  table_reader.Get(get_context);
  post_io_step(get_context);
}

In this way, the callback OnTableReaderGetFinish() can simply be post_io_step().
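
For illustration, here is a minimal, self-contained sketch of that idea (GetContextSketch, TableCacheGetSync/Async, and the inline callback invocation are hypothetical names, not RocksDB's actual interfaces): the async entry point runs the same pre_io_step(), and the I/O completion callback is exactly the post_io_step() of the sync flow.

#include <functional>

struct GetContextSketch {
  bool found_result = false;  // plus input key, output value, timers, etc.
};

void pre_io_step(GetContextSketch* ctx) { /* cache lookup, prepare the read */ }
void post_io_step(GetContextSketch* ctx) { /* parse the block, fill the result */ }

// Sync flow: natural top-to-bottom order.
void TableCacheGetSync(GetContextSketch* ctx) {
  pre_io_step(ctx);
  // table_reader.Get(ctx): the blocking read would happen here.
  post_io_step(ctx);
}

// Async flow: the read completion drives the second half via the callback.
void OnTableReaderGetFinish(GetContextSketch* ctx) { post_io_step(ctx); }

void TableCacheGetAsync(GetContextSketch* ctx,
                        std::function<void(GetContextSketch*)> on_io_done) {
  pre_io_step(ctx);
  // A real table_reader.AsyncGet(ctx, on_io_done) would register the callback
  // and return; the sketch invokes it inline only to stay runnable.
  on_io_done(ctx);
}
// Typical wiring: TableCacheGetAsync(&ctx, OnTableReaderGetFinish);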

Similarly, if there is a loop with possible I/O inside it, it can look like this:

void VersionSet::Get(GetContext* get_context) {
  construct_file_picker(get_context);
  do {
    pick_next_file(get_context);
    TableCache::Get(get_context);
  } while (!get_context->found_result);
}

In this way, I hope the sync code path will be even simpler and the performance regression will be smaller.

Iterators
I hope all the callback contexts are stored in the iterators themselves; there shouldn't be extra context passed around.
An example of a block-based table iterator seek could look like this (it is currently implemented as a two-level iterator, but I translate it into plain terms to be clear):

void BlockBasedTableIterator::Seek() {
  FindDataBlock();
  PrepareReadDataBlock();
  if (need_read_data_block_from_file_) {
    ReadDataBlock();
  }
  SearchInDataBlock();
}

In this way, the async path's OnReadDataBlockFinish() can simply be BlockBasedTableIterator::SearchInDataBlock().
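
A corresponding sketch for the iterator (again with hypothetical stubs; only the step names above come from this comment): the async variant issues the read and lets OnReadDataBlockFinish() run SearchInDataBlock() when the block arrives.

#include <functional>

class BlockBasedTableIteratorSketch {
 public:
  void SeekAsync() {
    FindDataBlock();
    PrepareReadDataBlock();
    if (need_read_data_block_from_file_) {
      // Issue the read and return; the completion callback finishes the Seek.
      AsyncReadDataBlock([this]() { OnReadDataBlockFinish(); });
      return;
    }
    SearchInDataBlock();
  }

 private:
  void OnReadDataBlockFinish() { SearchInDataBlock(); }

  // Stubs so the sketch compiles; the real steps live in the table reader.
  void FindDataBlock() {}
  void PrepareReadDataBlock() { need_read_data_block_from_file_ = true; }
  void AsyncReadDataBlock(std::function<void()> on_done) { on_done(); }
  void SearchInDataBlock() {}

  bool need_read_data_block_from_file_ = false;
};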

To make the code change more incremental, I suggest we first focus on three things: (1) refactor the code step by step, where each refactoring should by itself improve the maintainability of the code base; (2) make the data block reading step async first; other possible I/O paths, like the index block and table reader construction, can be added later; (3) focus on the block-based table reader first and skip the other table readers for now.

siying added this to Design Discussion in Async Support Dec 12, 2017

yuslepukhin commented Dec 13, 2017

@siying The general principles are sound. Both 1) and 2) are implemented in the PRs that were submitted.
Iterators do not have a separate context. The picture is complicated by the fact that a) some iterators recursively call into their own public interfaces, and b) they contain huge pieces of code that are reused from different public interfaces and thus require callback wiring.

I have contemplated introducing the global context but rejected the idea for the following reasons: a) maintenance: it would introduce full-stack dependencies on the same class; b) most of the context would go unused, since a lot of operations are satisfied from the cache.
For an example of a semi-global context, see BlockBasedReader Open(), where the operation invokes many async reads but a single context serves all of them as the table reader building progresses.
Finally, the loop example is oversimplified.


siying commented Dec 14, 2017

@yuslepukhin of course it is oversimplified. I just want to show the idea: if the code is structured like a state machine, the callback function can perform the next step of the state machine, while the sync function can simply flow in the natural way.

How about (3)? I think that can help us move incrementally.


yuslepukhin commented Dec 15, 2017

@siying Can you give an example of code that depicts the async functionality, so I can better understand the idea? Breaking a function into pre- and post-I/O parts is good, but it is still sync code.

Basic design questions before we move forward:

  • Do we want to avoid code duplication and make the same code sync/async?
  • If we do, then the same code should account for possible scenarios:
    • We run sync
    • We run async but some operations complete sync
    • We run async but everything completes sync

If we want a separate async path that only reuses parts of the sync code, it would be extremely hard on development, maintenance, and testing.
With the same code, I was able to use the same sync tests to validate the shared code functionality.
At the same time, I keep having to replicate main sync path changes into my PR, which is a nightmare.
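
One way to let a single code path cover the three scenarios listed above (purely a sketch with hypothetical names, not the API from the PRs) is to have the I/O step report whether it completed inline or went asynchronous, with the same continuation used either way:

#include <functional>

enum class IOOutcome { kCompletedInline, kPending };

using Completion = std::function<void(int /*bytes_read*/)>;

// Stand-in reader: this stub always completes inline, but a real async reader
// would sometimes return kPending and invoke `done` later from the I/O thread.
IOOutcome StartRead(const Completion& done) {
  done(/*bytes_read=*/4096);
  return IOOutcome::kCompletedInline;
}

void SearchBlock(int bytes_read) { /* shared post-I/O work for sync and async */ }

void GetStep() {
  IOOutcome outcome = StartRead([](int bytes) { SearchBlock(bytes); });
  if (outcome == IOOutcome::kCompletedInline) {
    // "Run sync" and "async but completed sync": the work already ran inline.
  } else {
    // "Run async": return now; the completion callback resumes the operation.
  }
}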

@yuslepukhin

@siying I am throwing together a prototype, in the form of a unit test, that makes use of coroutines. It has many advantages, such as not needing to refactor the code or change its sync nature.
The con, of course, is that it is not a standard even in C++17 and is not implemented by many compilers, although it runs experimentally on LLVM/Clang. So the proposal may change, which could render my prototype non-functional. However, unless they change it drastically, it should not be too much trouble to adjust.

@yuslepukhin

@siying See a working example with coroutines in #3325

For now, ignore the platform-dependent code. Look at the test in async/coroutine_random_access_test.cc and at RandomAccessFileReader::RequestRead(), which is a clone of Read() for demonstration purposes.


publicocean0 commented Jan 16, 2018

It is a good idea, but pay attention to how the API for coroutines or fibers is designed in RocksDB: it may not be compatible with JNI extensions. In an old test I tried to use Boost fibers inside JNI. The internal thread state in C++ is not compatible with the JVM thread model, and it crashed the JVM. The better solution is to abstract the library providing the fibers or coroutines, so that it can come from Java or from C++ depending on what you are using. You can also find an old discussion of mine on Stack Overflow.

@yuslepukhin

@publicocean0 Thanks for the input; I will check that out. The coroutines proposal as implemented by MS is significantly different from fibers, Boost fibers, and Boost coroutines. It is probable that at the JNI boundary the chain of coroutines would need to be stopped and converted either back to a sync model or to a callback model as supported by the JVM. In either case, I do not see how it would be possible to reuse the Java thread pool for I/O completion, although internally the JVM does just that.


publicocean0 commented Jan 16, 2018

The JVM is surely not in a good place at the moment; it is at a crossroads, and Sun is too slow to propose new solutions. In any case, we can't wait for the distant future here; the better approach is probably to design a flexible solution. My opinion is that fibers should be provided at the OS level, not at the language level, but seeing that in Windows, Linux, and macOS will probably not happen soon. For example, there are already new experimental OSes that completely remove the old user-level thread concept and replace it with fibers.

Actually, the most important problem with fibers is locks. Locks block the thread, not just the fiber, so they have a bad side effect on concurrency. Java Quasar (a fibers library for Java) implemented a special extension of locks (with the same API) for fibers; in any case, all synchronized blocks would have to be replaced. What I verified personally, because I have long wished to use fibers with RocksDB, is that using Boost fibers in C++ and calling them via JNI does not work: it is not possible to call back into a Java method from inside a C++ fiber; the JVM crashes. It seems the JVM checks stack consistency at the JNI boundary, so with fibers/coroutines it finds an inconsistency relative to the old thread model. You can wait until the result is ready and then return it, but if you wait for the result at the JNI level (a synchronous model), you effectively limit the power offered by fibers/coroutines. I am not sure, but with coroutines the state is not attached to thread-local storage as it is with fibers, so it may be simpler with coroutines to use the same thread in C++ and in Java without conflicts in the thread state. In any case, you would have to use two different coroutine schedulers in a pipeline, one in Java and one in C++, each isolated from the other.

@kennthhz-zz

Now that coroutines are in C++20 and io_uring is available on Linux, maybe this effort should be revitalized.

@riversand963

Thanks @kennthhz for reviving this discussion.
I believe RocksDB will continue to rely on C++11 (let me know otherwise). Besides, coroutine support is still a work in progress even for C++20, with a lot of differing opinions.
For io_uring, @siying and @anand1976 have added some support via liburing.


kennthhz-zz commented Jul 21, 2021 via email

@kennthhz-zz

IMHO the right thing to do is to create a separate coroutine API chain for non-blocking async operations. Opportunistically reuse some sync code, but reduce the call stack depth and use nested co_await. It is pretty much like C#.


kennthhz-zz commented Nov 3, 2021

I want to fully revive this issue/work. My company uses RocksDB as the storage engine for a large cloud-based NoSQL service, where efficiency and performance are critical, so supporting a non-blocking async IO API is critical. We believe RocksDB should take advantage of new technologies such as io_uring and C++ coroutines for this. We have a fork that is a work in progress: https://github.com/kennthhz/rocksdb
It still has some gaps: 1) reduce some code duplication between the async and sync code paths; 2) handle perf counters in async mode; 3) support an internal io_uring completion background job (one per DB); 4) add unit tests.

@jay-zhuang

@kennthhz RocksDB already supports io_uring; it's enabled by default:

if test -z "$ROCKSDB_USE_IO_URING"; then

For moving to C++20, RocksDB needs to consider users who still use compilers that do not support C++20.

@kennthhz-zz

@jay-zhuang RocksDB's current support for io_uring is not truly async: it still blocks the thread while waiting for IO completion. That is why it only makes sense for the MultiGet API, not Get. It is also not supported in flushing and compaction. To be truly async, we need to use coroutines. I agree it is premature to require all clients to upgrade to gcc 10; that is why we will use #define guards to isolate the coroutine code.
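
A header-style sketch of that #define isolation (the macro name ROCKSDB_USE_COROUTINES, the Task type, and DBSketch/AsyncGet are hypothetical, not RocksDB's API): users on pre-C++20 compilers never see a coroutine return type.

#include <string>

#if defined(ROCKSDB_USE_COROUTINES)   // hypothetical opt-in macro
template <typename T>
struct Task;  // C++20 coroutine task type, defined only in coroutine builds
#endif

class DBSketch {
 public:
  // Existing synchronous API shape: always compiled.
  std::string Get(const std::string& key);

#if defined(ROCKSDB_USE_COROUTINES)
  // Truly non-blocking variant: only compiled for users who opt in, so the
  // rest of the code base keeps building with older compilers.
  Task<std::string> AsyncGet(const std::string& key);
#endif
};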

@yuslepukhin

In the past I created a fully async implementation of Get/MultiGet that does not block on IO on Windows; there is only a very small Windows-specific part in it having to do with async IO. I faced multiple issues with it. First of all, it required a lot of code refactoring to make it run in both sync and async modes. Making iterators async presented a particular challenge. It worked, but it yielded only marginal improvements under specific circumstances. In exchange, you would get a programming paradigm that is unfamiliar to many and difficult to maintain.
I had another version of async based on coroutines. That had the distinct advantage of much smaller code changes: the structure of the code remained virtually intact, and it was maintainable. However, at that point coroutines, in both the standard and the MS implementation, were not sufficiently mature. In particular, a callback-based async implementation led to stack overflows with coroutines.


kennthhz-zz commented Nov 9, 2021

Thanks @yuslepukhin. I did look at your implementation back then. Coroutines are definitely a better programming paradigm, not to mention an actual language standard. My current fork with coroutines + io_uring (https://github.com/kennthhz/rocksdb) doesn't have the callback stack overflow issue (thanks to promise chaining) and works as expected. It has been almost 4 years since your last effort, and coroutines in the gcc and MS compilers have improved to the point that they can be used in production; through my ex-Microsoft colleagues I know that is the case at Microsoft. My current focus is coroutines on Linux with io_uring for async IO.

We use RocksDB, with high performance in mind, to support cloud-based multi-tenant partitioned storage (in our new NoSQL DB) at the CPU level. Each DB instance represents a tenant's partition of a collection (i.e., table) to achieve isolation. A node with 10 vCores (we use K8s containers) can host tens or even a hundred-plus partitions (each partition can be as small as 10-20GB on a 1TB disk, so a node can host up to 100 RocksDB instances). RocksDB is used as a library, with threading controlled by our DB service's resource governor logic, because it knows how to manage resource usage better. The threading model is similar to Seastar in that there is a single worker thread per CPU managing all requests at the task level through a lock-free task queue, so having non-blocking async IO is critical. Coroutines + io_uring give us much better latency and throughput because we use far fewer threads, achieve higher CPU utilization, and in most cases run in a share-nothing, lock-free pattern. This architectural approach is becoming more and more common in database systems (ScyllaDB, YottaByte, etc.), and RocksDB is increasingly used by databases as the default storage engine. Hence coroutine support is important to keep up with this evolution.

My proposal is:

  1. Have the coroutine async path follow the same layering structure as the sync code path, so it is easier to compare and maintain.
  2. Refactor and extract the common "sync" code shared between the async and sync scenarios so that both use it. The async and sync code paths then mainly differ in calls like co_await / co_return versus plain function calls and return statements.
  3. Fix the perf collector, which currently only works in the sync path because it uses thread-local storage.
  4. Support async completion and callbacks in internal and external modes. In internal mode, RocksDB owns the io_uring and the completion callback. In external mode, the io_uring is passed in by the client and the client supplies a delegate that handles IO completion (one possible shape is sketched after this list).
  5. Add compiler conditional directives to isolate the coroutine code so that it doesn't force users of RocksDB to upgrade to gcc 10+.
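
For point 4, one possible shape (all names here are hypothetical illustrations, not a proposed RocksDB header) is a small options struct that records who owns the ring; exactly how completions are routed back in external mode is the open design question:

#include <cstdint>
#include <functional>

struct AsyncIoOptions {
  enum class Mode { kInternal, kExternal };
  // Internal: RocksDB creates and polls its own io_uring per DB instance.
  Mode mode = Mode::kInternal;

  // External mode only: the client owns the io_uring; this fd and handler are
  // placeholders for the wiring between the client's completion loop and
  // RocksDB's pending requests described in point 4.
  int external_ring_fd = -1;
  std::function<void(uint64_t user_data, int result)> completion_delegate;
};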

People in this thread, please share your thoughts.

@riversand963

@kennthhz I haven't checked your implementation yet. A few questions:

  • How well is coroutine-related code separated from existing code?
  • How are exceptions thrown by coroutine code handled?

@kennthhz-zz

@riversand963
The coroutine-related code at the function level is completely separate from the sync code. Coroutines have special keyword-driven return types, so they can't be mixed with existing code in the same function. But they do share some common non-async code in the function body, so I refactor out that common non-async code where it makes sense so that both the sync and async paths use it (to avoid code duplication). This is done on a case-by-case basis; such refactoring in my implementation is still ongoing, but you can see the direction.

Coroutine exception handling has two parts: 1) handled exceptions are no different than in a normal function, in that you need to use try-catch; 2) unhandled exceptions: the coroutine promise has an unhandled-exception hook, but I haven't used it, since the current RocksDB code uses return codes for error handling and I don't see any internal try-catch.
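
To make the second point concrete, here is a self-contained C++20 sketch (StatusTask is a hypothetical type, and the string stands in for a RocksDB-style status) of a coroutine promise whose unhandled_exception() hook folds an escaping exception back into return-code error handling:

#include <coroutine>
#include <stdexcept>
#include <string>
#include <utility>

struct StatusTask {
  struct promise_type {
    std::string status = "OK";
    StatusTask get_return_object() {
      return StatusTask{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_value(std::string s) { status = std::move(s); }
    // An escaping exception becomes an error status instead of propagating
    // across a Status-based API boundary.
    void unhandled_exception() { status = "IOError: unhandled exception"; }
  };

  explicit StatusTask(std::coroutine_handle<promise_type> h) : handle_(h) {}
  StatusTask(StatusTask&& other) noexcept
      : handle_(std::exchange(other.handle_, {})) {}
  ~StatusTask() { if (handle_) handle_.destroy(); }
  std::string status() const { return handle_.promise().status; }

  std::coroutine_handle<promise_type> handle_;
};

StatusTask MayThrow() {
  throw std::runtime_error("simulated disk error");  // caught by the promise
  co_return "OK";                                    // never reached
}
// Usage: MayThrow().status() yields "IOError: unhandled exception".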


wwj6591812 commented Apr 7, 2022

@kennthhz
Hi, I downloaded your code from the master branch of https://github.com/kennthhz/rocksdb.

After reading the "async get with coroutine" code, I have a couple of questions:
(1) Could you give some examples of how a user calls AsyncGet() in db_impl.cc?
(2) How is the coroutine (AsyncRead() in io_posix.cc) resumed?

Thanks.


subains commented Oct 5, 2022

Hi,

I ported the C++20 coroutine and io_uring changes from the https://github.com/kennthhz/rocksdb fork to the TiKV RocksDB fork, which IIRC has the latest merges from upstream. My one concern was around thread-local variables, and the first one I've come across is the SuperVersion thread-local infrastructure. I've hacked around it, but it's unsatisfactory (for now). This will be a generic issue with any thread-local state that is shared across multiple invocations of Async[Multi]Get() from the same thread, etc. Are there other similar issues I should be aware of? I think this is a very cool feature, and kennthhz has done great work with his prototype; it's a good starting point.

Regards,
-sunny


siying commented Oct 5, 2022

@subains RocksDB uses thread-local storage for SuperVersion and in some other places. SuperVersion doesn't have to be thread-local; as long as it can be fetched efficiently from a pool of SVs, it will be fine. Getting it from a pool rather than from thread-local storage has other benefits too. Feel free to give that approach a try. You could also help us develop the pooled SuperVersion implementation.
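
A minimal sketch of the pooled idea (SuperVersionStub and SuperVersionPool are illustrative stand-ins, and a real version would need to be lock-free or sharded to match thread-local performance and to discard stale SuperVersions):

#include <memory>
#include <mutex>
#include <vector>

struct SuperVersionStub {};  // stand-in for rocksdb::SuperVersion

class SuperVersionPool {
 public:
  std::unique_ptr<SuperVersionStub> Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!free_list_.empty()) {
      auto sv = std::move(free_list_.back());
      free_list_.pop_back();
      return sv;
    }
    return std::make_unique<SuperVersionStub>();  // grow on demand
  }

  void Release(std::unique_ptr<SuperVersionStub> sv) {
    std::lock_guard<std::mutex> lock(mu_);
    free_list_.push_back(std::move(sv));  // a real pool would drop stale SVs here
  }

 private:
  std::mutex mu_;
  std::vector<std::unique_ptr<SuperVersionStub>> free_list_;
};

Because a suspended coroutine may finish its work on a different thread, checking a SuperVersion out of such a pool for the duration of one operation avoids pinning state to the issuing thread the way TLS does.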


subains commented Oct 5, 2022

@siying thanks for the quick reply. I will give the pool version a try. My current hack passes Valgrind, but it's still a hack :-). I'm not very familiar with the RocksDB code. My immediate plan is to improve the hack so that I can test the changes in a larger context within TiKV and also test out the C++ <-> Rust bindings with async control flow.


beef9999 commented Feb 10, 2023

Even though MS defeated Google in getting their stackless coroutine model accepted by the standards committee, we must admit that the stackful model remains a strong competitor. It has a much longer history and has been widely used, for instance in Boost coroutines. It is also less intrusive to legacy code: we don't have to add the co_await and co_return keywords to every piece of code.

This is an experiment we did to integrate stackful coroutines into RocksDB. It is not only focused on MultiGet, but is an overall transformation of locks, I/O, and task scheduling. The async runtime is driven by io_uring.

200 lines of code to rewrite the 600'000 lines RocksDB into a coroutine program
