
Design Discussion of Async Support #3254

Open
siying opened this issue Dec 12, 2017 · 25 comments

Comments


siying commented Dec 12, 2017

We decided to have the design discussion of async support here. Let's see how it works.

The initial discussion started with the prototyping work by @yuslepukhin:
#3063
and
#2727


siying commented Dec 12, 2017

I thought about the code, and I propose we refactor it in the following way to make the code more maintainable for both async and sync. These are just thoughts; let me know your opinion.

Get(): have a global context and pass it everywhere.
The global context could look something like this:

struct GetContext {
  // input, output
  // some global variables: timers, merge context, current get context, etc.
  // VersionSet::Get context (FilePicker etc.)
  // TableCache::Get context
  // BlockBasedTableReader::Get context
};

Along the path, each function is rewritten so that the I/O step is clearly split out. If at most one I/O is possible, it can look like this:

void TableCache::Get(GetContext* get_context) {
  pre_io_step(get_context);
  table_reader.Get(get_context);
  post_io_step(get_context);
}

In this way, the callback OnTableReaderGetFinish() can simply be post_io_step().
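
For illustration, here is a minimal, self-contained sketch of that idea (GetContextSketch, TableCacheGetSync/Async, and the inline callback invocation are hypothetical names, not RocksDB's actual interfaces): the async entry point runs the same pre_io_step(), and the I/O completion callback is exactly the post_io_step() of the sync flow.

#include <functional>

struct GetContextSketch {
  bool found_result = false;  // plus input key, output value, timers, etc.
};

void pre_io_step(GetContextSketch* ctx) { /* cache lookup, prepare the read */ }
void post_io_step(GetContextSketch* ctx) { /* parse the block, fill the result */ }

// Sync flow: natural top-to-bottom order.
void TableCacheGetSync(GetContextSketch* ctx) {
  pre_io_step(ctx);
  // table_reader.Get(ctx): the blocking read would happen here.
  post_io_step(ctx);
}

// Async flow: the read completion drives the second half via the callback.
void OnTableReaderGetFinish(GetContextSketch* ctx) { post_io_step(ctx); }

void TableCacheGetAsync(GetContextSketch* ctx,
                        std::function<void(GetContextSketch*)> on_io_done) {
  pre_io_step(ctx);
  // A real table_reader.AsyncGet(ctx, on_io_done) would register the callback
  // and return; the sketch invokes it inline only to stay runnable.
  on_io_done(ctx);
}
// Typical wiring: TableCacheGetAsync(&ctx, OnTableReaderGetFinish);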

Similarly, if there is a loop with possible I/O inside it, it can look like this:

void VersionSet::Get(GetContext* get_context) {
  construct_file_picker(get_context);
  do {
    pick_next_file(get_context);
    TableCache::Get(get_context);
  } while (!get_context->found_result);
}

In this way, I hope the sync code path will be even simpler and the performance regression will be smaller.

Iterators
I hope all the callback contexts are stored in the iterators themselves; there shouldn't be extra context passed around.
An example of a block-based table iterator seek could look like this (it is currently implemented as a two-level iterator, but I translate it into plain terms to be clear):

void BlockBasedTableIterator::Seek() {
  FindDataBlock();
  PrepareReadDataBlock();
  if (need_read_data_block_from_file_) {
    ReadDataBlock();
  }
  SearchInDataBlock();
}

In this way, the async path's OnReadDataBlockFinish() can simply be BlockBasedTableIterator::SearchInDataBlock().
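
A corresponding sketch for the iterator (again with hypothetical stubs; only the step names above come from this comment): the async variant issues the read and lets OnReadDataBlockFinish() run SearchInDataBlock() when the block arrives.

#include <functional>

class BlockBasedTableIteratorSketch {
 public:
  void SeekAsync() {
    FindDataBlock();
    PrepareReadDataBlock();
    if (need_read_data_block_from_file_) {
      // Issue the read and return; the completion callback finishes the Seek.
      AsyncReadDataBlock([this]() { OnReadDataBlockFinish(); });
      return;
    }
    SearchInDataBlock();
  }

 private:
  void OnReadDataBlockFinish() { SearchInDataBlock(); }

  // Stubs so the sketch compiles; the real steps live in the table reader.
  void FindDataBlock() {}
  void PrepareReadDataBlock() { need_read_data_block_from_file_ = true; }
  void AsyncReadDataBlock(std::function<void()> on_done) { on_done(); }
  void SearchInDataBlock() {}

  bool need_read_data_block_from_file_ = false;
};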

To make the code change more incremental, I suggest we first focus on three things: (1) refactor the code step by step, where each refactoring should by itself improve the maintainability of the code base; (2) make the data block reading step async first; other possible I/O paths, like the index block and table reader construction, can be added later; (3) focus on the block-based table reader first and skip the other table readers for now.

siying added this to Design Discussion in Async Support Dec 12, 2017

yuslepukhin commented Dec 13, 2017

@siying The general principles are sound. Both 1) and 2) are implemented in the PRs that were submitted.
Iterators do not have a separate context. The picture is complicated by the fact that a) some iterators recursively call into their own public interfaces, and b) they contain huge pieces of code that are reused from different public interfaces and thus require callback wiring.

I have contemplated introducing the global context but rejected the idea for the following reasons: a) maintenance: it would introduce full-stack dependencies on the same class; b) most of the context would go unused, since a lot of operations are satisfied from the cache.
For an example of a semi-global context, see BlockBasedReader Open(), where the operation invokes many async reads but a single context serves all of them as the table reader building progresses.
Finally, the loop example is oversimplified.


siying commented Dec 14, 2017

@yuslepukhin of course it is oversimplified. I just want to show the idea: if the code is structured like a state machine, the callback function can perform the next step of the state machine, while the sync function can simply flow in the natural way.

How about (3)? I think that can help us move incrementally.


yuslepukhin commented Dec 15, 2017

@siying Can you give an example of code that depicts the async functionality, so I can better understand the idea? Breaking a function into pre- and post-I/O parts is good, but it is still sync code.

Basic design questions before we move forward:

  • Do we want to avoid code duplication and make the same code sync/async?
  • If we do, then the same code should account for possible scenarios:
    • We run sync
    • We run async but some operations complete sync
    • We run async but everything completes sync

If we want a separate async path that only reuses parts of the sync code, it would be extremely hard on development, maintenance, and testing.
With the same code, I was able to use the same sync tests to validate the shared code functionality.
At the same time, I keep having to replicate main sync path changes into my PR, which is a nightmare.
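
One way to let a single code path cover the three scenarios listed above (purely a sketch with hypothetical names, not the API from the PRs) is to have the I/O step report whether it completed inline or went asynchronous, with the same continuation used either way:

#include <functional>

enum class IOOutcome { kCompletedInline, kPending };

using Completion = std::function<void(int /*bytes_read*/)>;

// Stand-in reader: this stub always completes inline, but a real async reader
// would sometimes return kPending and invoke `done` later from the I/O thread.
IOOutcome StartRead(const Completion& done) {
  done(/*bytes_read=*/4096);
  return IOOutcome::kCompletedInline;
}

void SearchBlock(int bytes_read) { /* shared post-I/O work for sync and async */ }

void GetStep() {
  IOOutcome outcome = StartRead([](int bytes) { SearchBlock(bytes); });
  if (outcome == IOOutcome::kCompletedInline) {
    // "Run sync" and "async but completed sync": the work already ran inline.
  } else {
    // "Run async": return now; the completion callback resumes the operation.
  }
}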

@yuslepukhin

@siying I am throwing together a prototype, in the form of a unit test, that makes use of coroutines. It has many advantages, such as not needing to refactor the code or change its sync nature.
The con, of course, is that it is not a standard even in C++17 and is not implemented by many compilers, although it runs experimentally on LLVM/Clang. So the proposal may change, which could render my prototype non-functional. However, unless they change it drastically, it should not be too much trouble to adjust.

@yuslepukhin

@siying See a working example with coroutines in #3325

For now, ignore the platform-dependent code. Look at the test in async/coroutine_random_access_test.cc and at RandomAccessFileReader::RequestRead(), which is a clone of Read() for demonstration purposes.


publicocean0 commented Jan 16, 2018

It is a good idea, but pay attention to how the API for coroutines or fibers is designed in RocksDB: it may not be compatible with JNI extensions. In an old test I tried to use Boost fibers inside JNI. The internal thread state in C++ is not compatible with the JVM thread model, and it crashed the JVM. The better solution is to abstract the library providing the fibers or coroutines, so that it can come from Java or from C++ depending on what you are using. You can also find an old discussion of mine on Stack Overflow.

@yuslepukhin

@publicocean0 Thanks for the input; I will check that out. The coroutines proposal as implemented by MS is significantly different from fibers, Boost fibers, and Boost coroutines. It is probable that at the JNI boundary the chain of coroutines would need to be stopped and converted either back to a sync model or to a callback model as supported by the JVM. In either case, I do not see how it would be possible to reuse the Java thread pool for I/O completion, although internally the JVM does just that.


publicocean0 commented Jan 16, 2018

The JVM is surely not in a good place at the moment; it is at a crossroads, and Sun is too slow to propose new solutions. In any case, we can't wait for the distant future here; the better approach is probably to design a flexible solution. My opinion is that fibers should be provided at the OS level, not at the language level, but seeing that in Windows, Linux, and macOS will probably not happen soon. For example, there are already new experimental OSes that completely remove the old user-level thread concept and replace it with fibers.

Actually, the most important problem with fibers is locks. Locks block the thread, not just the fiber, so they have a bad side effect on concurrency. Java Quasar (a fibers library for Java) implemented a special extension of locks (with the same API) for fibers; in any case, all synchronized blocks would have to be replaced. What I verified personally, because I have long wished to use fibers with RocksDB, is that using Boost fibers in C++ and calling them via JNI does not work: it is not possible to call back into a Java method from inside a C++ fiber; the JVM crashes. It seems the JVM checks stack consistency at the JNI boundary, so with fibers/coroutines it finds an inconsistency relative to the old thread model. You can wait until the result is ready and then return it, but if you wait for the result at the JNI level (a synchronous model), you effectively limit the power offered by fibers/coroutines. I am not sure, but with coroutines the state is not attached to thread-local storage as it is with fibers, so it may be simpler with coroutines to use the same thread in C++ and in Java without conflicts in the thread state. In any case, you would have to use two different coroutine schedulers in a pipeline, one in Java and one in C++, each isolated from the other.

@kennthhz-zz

Now that coroutines are in C++20 and io_uring is available on Linux, maybe this effort should be revitalized.

@riversand963

Thanks @kennthhz for reviving this discussion.
I believe RocksDB will continue to rely on C++11 (let me know otherwise). Besides, coroutine support is still a work in progress even for C++20, with a lot of differing opinions.
For io_uring, @siying and @anand1976 have added some support via liburing.


kennthhz-zz commented Jul 21, 2021 via email

@kennthhz-zz

IMHO the right thing to do is to create a separate coroutine API chain for non-blocking async operations. Opportunistically reuse some sync code, but reduce the call stack depth and use nested co_await. It is pretty much like C#.


kennthhz-zz commented Nov 3, 2021

I want to fully revive this issue/work. My company uses RocksDB as the storage engine for a large cloud-based NoSQL service, where efficiency and performance are critical, so supporting a non-blocking async IO API is critical. We believe RocksDB should take advantage of new technologies such as io_uring and C++ coroutines for this. We have a fork that is a work in progress: https://github.com/kennthhz/rocksdb
It still has some gaps: 1) reduce some code duplication between the async and sync code paths; 2) handle perf counters in async mode; 3) support an internal io_uring completion background job (one per DB); 4) add unit tests.

@jay-zhuang

@kennthhz RocksDB already supports io_uring; it's enabled by default:

if test -z "$ROCKSDB_USE_IO_URING"; then

For moving to C++20, RocksDB needs to consider users who still use compilers that do not support C++20.

@kennthhz-zz

@jay-zhuang RocksDB's current support for io_uring is not truly async: it still blocks the thread while waiting for IO completion. That is why it only makes sense for the MultiGet API, not Get. It is also not supported in flushing and compaction. To be truly async, we need to use coroutines. I agree it is premature to require all clients to upgrade to gcc 10; that is why we will use #define guards to isolate the coroutine code.
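
A header-style sketch of that #define isolation (the macro name ROCKSDB_USE_COROUTINES, the Task type, and DBSketch/AsyncGet are hypothetical, not RocksDB's API): users on pre-C++20 compilers never see a coroutine return type.

#include <string>

#if defined(ROCKSDB_USE_COROUTINES)   // hypothetical opt-in macro
template <typename T>
struct Task;  // C++20 coroutine task type, defined only in coroutine builds
#endif

class DBSketch {
 public:
  // Existing synchronous API shape: always compiled.
  std::string Get(const std::string& key);

#if defined(ROCKSDB_USE_COROUTINES)
  // Truly non-blocking variant: only compiled for users who opt in, so the
  // rest of the code base keeps building with older compilers.
  Task<std::string> AsyncGet(const std::string& key);
#endif
};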

@yuslepukhin

In the past I created a fully async implementation of Get/MultiGet that does not block on IO on Windows; there is only a very small Windows-specific part in it having to do with async IO. I faced multiple issues with it. First of all, it required a lot of code refactoring to make it run in both sync and async modes. Making iterators async presented a particular challenge. It worked, but it yielded only marginal improvements under specific circumstances. In exchange, you would get a programming paradigm that is unfamiliar to many and difficult to maintain.
I had another version of async based on coroutines. That had the distinct advantage of much smaller code changes: the structure of the code remained virtually intact, and it was maintainable. However, at that point coroutines, in both the standard and the MS implementation, were not sufficiently mature. In particular, a callback-based async implementation led to stack overflows with coroutines.


kennthhz-zz commented Nov 9, 2021

Thanks @yuslepukhin. I did look at your implementation back then. Coroutines are definitely a better programming paradigm, not to mention an actual language standard. My current fork with coroutines + io_uring (https://github.com/kennthhz/rocksdb) doesn't have the callback stack overflow issue (thanks to promise chaining) and works as expected. It has been almost 4 years since your last effort, and coroutines in the gcc and MS compilers have improved to the point that they can be used in production; through my ex-Microsoft colleagues I know that is the case at Microsoft. My current focus is coroutines on Linux with io_uring for async IO.

We use RocksDB, with high performance in mind, to support cloud-based multi-tenant partitioned storage (in our new NoSQL DB) at the CPU level. Each DB instance represents a tenant's partition of a collection (i.e., table) to achieve isolation. A node with 10 vCores (we use K8s containers) can host tens or even a hundred-plus partitions (each partition can be as small as 10-20GB on a 1TB disk, so a node can host up to 100 RocksDB instances). RocksDB is used as a library, with threading controlled by our DB service's resource governor logic, because it knows how to manage resource usage better. The threading model is similar to Seastar in that there is a single worker thread per CPU managing all requests at the task level through a lock-free task queue, so having non-blocking async IO is critical. Coroutines + io_uring give us much better latency and throughput because we use far fewer threads, achieve higher CPU utilization, and in most cases run in a share-nothing, lock-free pattern. This architectural approach is becoming more and more common in database systems (ScyllaDB, YottaByte, etc.), and RocksDB is increasingly used by databases as the default storage engine. Hence coroutine support is important to keep up with this evolution.

My proposal is:

  1. Have the coroutine async path follow the same layering structure as the sync code path, so it is easier to compare and maintain.
  2. Refactor and extract the common "sync" code shared between the async and sync scenarios so that both use it. The async and sync code paths then mainly differ in calls like co_await / co_return versus plain function calls and return statements.
  3. Fix the perf collector, which currently only works in the sync path because it uses thread-local storage.
  4. Support async completion and callbacks in internal and external modes. In internal mode, RocksDB owns the io_uring and the completion callback. In external mode, the io_uring is passed in by the client and the client supplies a delegate that handles IO completion (one possible shape is sketched after this list).
  5. Add compiler conditional directives to isolate the coroutine code so that it doesn't force users of RocksDB to upgrade to gcc 10+.
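
For point 4, one possible shape (all names here are hypothetical illustrations, not a proposed RocksDB header) is a small options struct that records who owns the ring; exactly how completions are routed back in external mode is the open design question:

#include <cstdint>
#include <functional>

struct AsyncIoOptions {
  enum class Mode { kInternal, kExternal };
  // Internal: RocksDB creates and polls its own io_uring per DB instance.
  Mode mode = Mode::kInternal;

  // External mode only: the client owns the io_uring; this fd and handler are
  // placeholders for the wiring between the client's completion loop and
  // RocksDB's pending requests described in point 4.
  int external_ring_fd = -1;
  std::function<void(uint64_t user_data, int result)> completion_delegate;
};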

People in this thread, please share your thoughts.

@riversand963

@kennthhz I haven't checked your implementation yet. A few questions:

  • How well is coroutine-related code separated from existing code?
  • How are exceptions thrown by coroutine code handled?

@kennthhz-zz

@riversand963
The coroutine-related code at the function level is completely separate from the sync code. Coroutines have special keyword-driven return types, so they can't be mixed with existing code in the same function. But they do share some common non-async code in the function body, so I refactor out that common non-async code where it makes sense so that both the sync and async paths use it (to avoid code duplication). This is done on a case-by-case basis; such refactoring in my implementation is still ongoing, but you can see the direction.

Coroutine exception handling has two parts: 1) handled exceptions are no different than in a normal function, in that you need to use try-catch; 2) unhandled exceptions: the coroutine promise has an unhandled-exception hook, but I haven't used it, since the current RocksDB code uses return codes for error handling and I don't see any internal try-catch.
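
To make the second point concrete, here is a self-contained C++20 sketch (StatusTask is a hypothetical type, and the string stands in for a RocksDB-style status) of a coroutine promise whose unhandled_exception() hook folds an escaping exception back into return-code error handling:

#include <coroutine>
#include <stdexcept>
#include <string>
#include <utility>

struct StatusTask {
  struct promise_type {
    std::string status = "OK";
    StatusTask get_return_object() {
      return StatusTask{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_value(std::string s) { status = std::move(s); }
    // An escaping exception becomes an error status instead of propagating
    // across a Status-based API boundary.
    void unhandled_exception() { status = "IOError: unhandled exception"; }
  };

  explicit StatusTask(std::coroutine_handle<promise_type> h) : handle_(h) {}
  StatusTask(StatusTask&& other) noexcept
      : handle_(std::exchange(other.handle_, {})) {}
  ~StatusTask() { if (handle_) handle_.destroy(); }
  std::string status() const { return handle_.promise().status; }

  std::coroutine_handle<promise_type> handle_;
};

StatusTask MayThrow() {
  throw std::runtime_error("simulated disk error");  // caught by the promise
  co_return "OK";                                    // never reached
}
// Usage: MayThrow().status() yields "IOError: unhandled exception".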


wwj6591812 commented Apr 7, 2022

@kennthhz
Hi, I downloaded your code from the master branch of https://github.com/kennthhz/rocksdb.

After reading the "async get with coroutine" code, I have a couple of questions:
(1) Could you give some examples of how a user calls AsyncGet() in db_impl.cc?
(2) How is the coroutine (AsyncRead() in io_posix.cc) resumed?

Thanks.


subains commented Oct 5, 2022

Hi,

I ported the C++20 coroutine and io_uring changes from the https://github.com/kennthhz/rocksdb fork to the TiKV RocksDB fork, which IIRC has the latest merges from upstream. My one concern was around thread-local variables, and the first one I've come across is the SuperVersion thread-local infrastructure. I've hacked around it, but it's unsatisfactory (for now). This will be a generic issue with any thread-local state that is shared across multiple invocations of Async[Multi]Get() from the same thread, etc. Are there other similar issues I should be aware of? I think this is a very cool feature, and kennthhz has done great work with his prototype; it's a good starting point.

Regards,
-sunny


siying commented Oct 5, 2022

@subains RocksDB uses thread-local storage for SuperVersion and in some other places. SuperVersion doesn't have to be thread-local; as long as it can be fetched efficiently from a pool of SVs, it will be fine. Getting it from a pool rather than from thread-local storage has other benefits too. Feel free to give that approach a try. You could also help us develop the pooled SuperVersion implementation.
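
A minimal sketch of the pooled idea (SuperVersionStub and SuperVersionPool are illustrative stand-ins, and a real version would need to be lock-free or sharded to match thread-local performance and to discard stale SuperVersions):

#include <memory>
#include <mutex>
#include <vector>

struct SuperVersionStub {};  // stand-in for rocksdb::SuperVersion

class SuperVersionPool {
 public:
  std::unique_ptr<SuperVersionStub> Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!free_list_.empty()) {
      auto sv = std::move(free_list_.back());
      free_list_.pop_back();
      return sv;
    }
    return std::make_unique<SuperVersionStub>();  // grow on demand
  }

  void Release(std::unique_ptr<SuperVersionStub> sv) {
    std::lock_guard<std::mutex> lock(mu_);
    free_list_.push_back(std::move(sv));  // a real pool would drop stale SVs here
  }

 private:
  std::mutex mu_;
  std::vector<std::unique_ptr<SuperVersionStub>> free_list_;
};

Because a suspended coroutine may finish its work on a different thread, checking a SuperVersion out of such a pool for the duration of one operation avoids pinning state to the issuing thread the way TLS does.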


subains commented Oct 5, 2022

@siying thanks for the quick reply. I will give the pool version a try. My current hack passes Valgrind, but it's still a hack :-). I'm not very familiar with the RocksDB code. My immediate plan is to improve the hack so that I can test the changes in a larger context within TiKV and also test out the C++ <-> Rust bindings with async control flow.


beef9999 commented Feb 10, 2023

Even though MS defeated Google in getting their stackless coroutine model accepted by the standards committee, we must admit that the stackful model remains a strong competitor. It has a much longer history and has been widely used, for instance in Boost coroutines. It is also less intrusive to legacy code: we don't have to add the co_await and co_return keywords to every piece of code.

This is an experiment we did to integrate stackful coroutines into RocksDB. It is not only focused on MultiGet, but is an overall transformation of locks, I/O, and task scheduling. The async runtime is driven by io_uring.

200 lines of code to rewrite the 600'000 lines RocksDB into a coroutine program
