Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider CLOCK_BOOTTIME (monotonic, but accurate over suspend) over CLOCK_MONOTONIC #369

Closed
jorangreef opened this issue Jun 22, 2021 · 19 comments

Comments

@jorangreef
Copy link
Contributor

I understand that io_uring currently uses CLOCK_MONOTONIC internally for measuring elapsed time for the purpose of timeouts.

This is mostly fine, but where timeouts are long-lived, e.g. an hour or two, and where the system or VM is suspended, then io_uring will measure these timeout intervals incorrectly, because CLOCK_MONOTONIC excludes the elapsed time during the suspend.

CLOCK_BOOTTIME might be a better choice because it's the higher resolution, more resilient interface of the two. For example, any system that needs to measure monotonic time intervals that also correspond to actual time passing across systems, such as a distributed databases tracking leader leases, would probably want to use CLOCK_BOOTTIME when submitting timeouts to io_uring.

@jorangreef jorangreef changed the title Consider CLOCK_BOOTTIME (monotonic, accurate over suspend) over CLOCK_MONOTONIC Consider CLOCK_BOOTTIME (monotonic, but accurate over suspend) over CLOCK_MONOTONIC Jun 22, 2021
@YoSTEALTH
Copy link
Contributor

Hmm this is an interesting situation. If you have a countdown and you suspend your system, you would want to resume the countdown right? vs just going boom when you unsuspend! :D

Wouldn't it make more sense to add this databases tracking outside of suspending vm?

@jorangreef
Copy link
Contributor Author

Yes, but again, it all depends on the application context. Some applications might want a timeout N hours into the future, and might want this to reflect the passing of actual time, not just non-suspend time. Others might not. It depends. The argument cuts both ways.

The kernel was actually going to switch the behavior of CLOCK_MONOTONIC to CLOCK_BOOTTIME at one point but this broke applications. However, io_uring only works with CLOCK_MONOTONIC at present, without exposing CLOCK_BOOTTIME, so the question is, which is a better building block?

@YoSTEALTH
Copy link
Contributor

It would be nice if you can set your own flags, Like there is a sqe->timeout_flags flag option but right now it seems its only for
0 Relative Timeout
1 IORING_TIMEOUT_ABS

It would be nice if you could pass CLOCK_* flags into it and well. https://elixir.bootlin.com/linux/v5.13-rc6/source/include/uapi/linux/time.h#L49

It could also be that @axboe had already planned for this and just forgot to mention/implement it

@jorangreef
Copy link
Contributor Author

@YoSTEALTH that would be perfect.

@YoSTEALTH
Copy link
Contributor

Tags @axboe

@axboe
Copy link
Owner

axboe commented Aug 27, 2021

I think we should just change it, should be identical apart from the suspend aspect (which is important).

@axboe
Copy link
Owner

axboe commented Aug 27, 2021

What I'm saying is that my initial reaction was to make it selectable, but is there really any point in that? Should just use CLOCK_BOOTTIME.

@axboe
Copy link
Owner

axboe commented Aug 27, 2021

Well, I guess there is risk in that... Something like this should work, just set IORING_TIMEOUT_BOOTTIME in the timeout_flags. I would prefer just making it the default, but I can see valid use cases for both - eg a request that wants monotonic behavior across suspend, vs a request that has a hard timeout and should just be failed immediately on resume if too much time has passed.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0637717d0a92..017d636889a3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -508,6 +508,7 @@ struct io_timeout_data {
 	struct hrtimer			timer;
 	struct timespec64		ts;
 	enum hrtimer_mode		mode;
+	u32				flags;
 };
 
 struct io_accept {
@@ -5897,7 +5898,10 @@ static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
 	req->timeout.off = 0; /* noseq */
 	data = req->async_data;
 	list_add_tail(&req->timeout.list, &ctx->timeout_list);
-	hrtimer_init(&data->timer, CLOCK_MONOTONIC, mode);
+	if (data->flags & IORING_TIMEOUT_BOOTTIME)
+		hrtimer_init(&data->timer, CLOCK_BOOTTIME, mode);
+	else
+		hrtimer_init(&data->timer, CLOCK_MONOTONIC, mode);
 	data->timer.function = io_timeout_fn;
 	hrtimer_start(&data->timer, timespec64_to_ktime(*ts), mode);
 	return 0;
@@ -5979,7 +5983,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (off && is_timeout_link)
 		return -EINVAL;
 	flags = READ_ONCE(sqe->timeout_flags);
-	if (flags & ~IORING_TIMEOUT_ABS)
+	if (flags & ~(IORING_TIMEOUT_ABS | IORING_TIMEOUT_BOOTTIME))
 		return -EINVAL;
 
 	req->timeout.off = off;
@@ -5991,12 +5995,16 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	data = req->async_data;
 	data->req = req;
+	data->flags = flags;
 
 	if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr)))
 		return -EFAULT;
 
 	data->mode = io_translate_timeout_mode(flags);
-	hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode);
+	if (flags & IORING_TIMEOUT_BOOTTIME)
+		hrtimer_init(&data->timer, CLOCK_BOOTTIME, data->mode);
+	else
+		hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode);
 
 	if (is_timeout_link) {
 		struct io_submit_link *link = &req->ctx->submit_state.link;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index afd43c5cd822..60526fcf3be2 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -155,6 +155,7 @@ enum {
  */
 #define IORING_TIMEOUT_ABS	(1U << 0)
 #define IORING_TIMEOUT_UPDATE	(1U << 1)
+#define IORING_TIMEOUT_BOOTTIME	(1U << 2)
 
 /*
  * sqe->splice_flags

@YoSTEALTH
Copy link
Contributor

Could you not make it so developers can pass their own CLOCK_* flag? This way they can choose what type of timing they need for their application, also not break things currently implemented.

@axboe
Copy link
Owner

axboe commented Aug 27, 2021

The only way to do that would be to set aside some timeout flags to do just that. Which is indeed possible, we currently have ~12 clock sources, which would mean 4 bits in the flags. It's a bit ugly though, but would be more flexible.

@YoSTEALTH
Copy link
Contributor

Giving user that flexibility and not breaking current timeout is a +1

@axboe
Copy link
Owner

axboe commented Aug 27, 2021

The proposed patch doesn't risk any breakage, but it does only allow choosing monotonic or boottime. My worry here is the interface, then you'll have to do things like:

io_uring_prep_timeout(sqe, &ts, 0, flags | (clocksource << some_shift);

and some flags might conflict with the clock source, and the clock source would have to be validated as well. I'm just worried it's too much flexibility or over-designing it, when monotonic/boottime might be the only truly useful cases because a lot of the other sources don't really make sense for this use case. realtime might be an exception, not sure.

@YoSTEALTH
Copy link
Contributor

Wait, I actually use the realtime/CLOCK_REALTIME/0 currently or whats called relative time?

ok then, how about realtime, monotonic, boottime for now then add other feature as users request it?

It would also be nice to get another person or two's point of view.

@axboe
Copy link
Owner

axboe commented Aug 28, 2021

Since I think REALTIME is the only other sane option, it actually ends up not being too bad as we can just keep them as bits. Here's an updated version:

https://lore.kernel.org/io-uring/0e110ae2-f744-df44-e017-bf603f481348@kernel.dk/

@YoSTEALTH
Copy link
Contributor

I might be wrong here but don't you already have relative / IORING_TIMEOUT_REALTIME set 0?
Like https://elixir.bootlin.com/linux/v5.13-rc6/source/include/uapi/linux/time.h#L49 also you already have tests written for it https://github.com/axboe/liburing/blob/master/test/timeout.c#L980 also this man https://github.com/axboe/liburing/blob/master/man/io_uring_enter.2#L331

@axboe
Copy link
Owner

axboe commented Aug 28, 2021

Relative is independent of clocksource, it can be set or not for either clocksource. Don't confuse the bit values in the timeout mask with the absolute clock source values, that's why io_timeout_get_clock() translates them.

The posted patch will break nothing, the default behavior is the same as before. If you set nothing, then monotonic is used. If you set _ABS, then absolute values is used for monotonic.

@YoSTEALTH
Copy link
Contributor

True, was confusing IORING_TIMEOUT_* with CLOCK_* naming/values. io_timeout_get_clock makes sense now.

Thank for the patch and clarification :)

ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Aug 28, 2021
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.

Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.

Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.

Link: axboe/liburing#369
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@jorangreef
Copy link
Contributor Author

Thanks @axboe and @YoSTEALTH, awesome to see this.

fengguang pushed a commit to 0day-ci/linux that referenced this issue Aug 30, 2021
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.

Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.

Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.

Link: axboe/liburing#369
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@YoSTEALTH
Copy link
Contributor

@axboe close, thanks.

@axboe axboe closed this as completed Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants