Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panicked at 'overflow when subtracting duration from instant' #45

Closed
kumy opened this issue Sep 21, 2023 · 20 comments
Closed

panicked at 'overflow when subtracting duration from instant' #45

kumy opened this issue Sep 21, 2023 · 20 comments

Comments

@kumy
Copy link
Contributor

kumy commented Sep 21, 2023

I have included bkt in one of our internal tool. I got at least 5 complaints already from users having such error (MacOs users):

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

or (distinct column number on this one) Here it's bkt v0.7.1

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Clearing the bkt-cache folder was our quick fix.

I'm sorry I don't have a backtrace yet, but will try to provide it next time it happen. I thought it might be worth to already open that issue.

@dimo414
Copy link
Owner

dimo414 commented Sep 21, 2023

Ooh interesting, yes please let me know if you uncover a backtrace. I have never observed this but we should definitely fix it. Will take a deeper look this evening but if you identify a fix in the meantime feel free to send a PR.

@dimo414
Copy link
Owner

dimo414 commented Sep 22, 2023

I think this is rust-lang/rust#100141 - surprising!

@kumy
Copy link
Contributor Author

kumy commented Sep 22, 2023

Sounds to be our issue. Do you think the workaround will help?

@dimo414
Copy link
Owner

dimo414 commented Sep 22, 2023

Yeah I'm sure we can work around it :) If you're able to get a backtrace or reproduction steps that'll help pinpoint things but I'll do some more investigating this weekend.

@kumy
Copy link
Contributor Author

kumy commented Sep 22, 2023

Unfortunately not soon. I'm a Linux user, so I'm not concerned. And no one reported it since I asked internally…

🤞

@dimo414
Copy link
Owner

dimo414 commented Sep 22, 2023

Could you add RUST_BACKTRACE=1 to your internal tool (if you haven't already)?

@dimo414
Copy link
Owner

dimo414 commented Sep 23, 2023

@kumy can you also share any more details about how your tool is invoking bkt? E.g. is it using a long --ttl, or passing any other flags? I tried to reproduce the trace by caching a command with a long TTL and rebooting, but I think my Mac isn't affected by this bug - it appears to use a signed type on my machine rather than a u64.

I suspect the most reasonable fix will unfortunately be to stop using Instant and instead swap to chrono::DateTime or similar. I don't see a reasonable way to fix a line like this one without changing either the type or the behavior.

@kumy
Copy link
Contributor Author

kumy commented Sep 24, 2023

(From memory, more details to come on monday)
From the last user report, the failing call was with a --ttl 15min. But it may not be related especially to this cache entry, I think this is the first call we have in all the script, so maybe related to pruning routine when bkt is launched.
I hope to be able to provide a trace as soon as I have a new user report.

EDIT 2023-09-29: no user reported the issue yet

@kumy
Copy link
Contributor Author

kumy commented Sep 25, 2023

can you also share any more details about how your tool is invoking bkt?

We're using that pattern:

bkt \
  --scope="${ENVIRONMENT}.${params[0]}.foo" \
  --ttl="${TTL}" \
  --discard-failures \
  -- \
  $THE_COMMAND "${params[@]}"

is it using a long --ttl

We're using ttl: 15min, 1h, 1d, 1w, depending on the queried data

@kumy
Copy link
Contributor Author

kumy commented Nov 8, 2023

Hi @dimo414
Someone had the issue this morning. This is the stack trace.
We still have a copy of the cache directory, in case you need something else.

Hope it'll be useful 🤞

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:33
stack backtrace:
   0:        0x106f5dde4 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hb34d93b518a69020
   1:        0x106f9caeb - core::fmt::write::ha76b1d04ecc430e1
   2:        0x106f67e90 - std::io::Write::write_fmt::h4e16facc4367e611
   3:        0x106f5dc00 - std::sys_common::backtrace::print::hdb50332c504929b8
   4:        0x106f687f5 - std::panicking::default_hook::{{closure}}::h59fe0a3fe5f24ecc
   5:        0x106f685d1 - std::panicking::default_hook::h907db83ced102c19
   6:        0x106f68dfa - std::panicking::rust_panic_with_hook::h428ca3503467e89a
   7:        0x106f5e0b3 - std::panicking::begin_panic_handler::{{closure}}::h051de3881f27cd34
   8:        0x106f5deb9 - std::sys_common::backtrace::__rust_end_short_backtrace::hda22941d8908552b
   9:        0x106f689ad - _rust_begin_unwind
  10:        0x106faef33 - core::panicking::panic_fmt::hed2e16a438b6f44b
  11:        0x106faed2a - core::option::expect_failed::haa40f9b5b7a5c9b9
  12:        0x106f77dd7 - <std::time::Instant as core::ops::arith::Sub<core::time::Duration>>::sub::hcccab8d2f746e34c
  13:        0x106ef443c - bkt::Bkt::retrieve_impl::hcf0543033a636198
  14:        0x106f0080f - bkt::main::hf321a34913da4dc1
  15:        0x106ef0ca6 - std::sys_common::backtrace::__rust_begin_short_backtrace::h6f324c6f687f5aea
  16:        0x106ef9cdc - std::rt::lang_start::{{closure}}::h6e3e120d3a431146
  17:        0x106f68894 - std::panicking::try::ha03e99799e87108d
  18:        0x106f8a9de - std::rt::lang_start_internal::hbf92d065385ce277
  19:        0x106f0494c - _main
  20:     0x7ff803e053a6 - <unknown>

@kumy
Copy link
Contributor Author

kumy commented Nov 9, 2023

Someone else had it this morning

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: core::panicking::panic_display
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:72:5
   3: core::panicking::panic_str
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:56:5
   4: core::option::expect_failed
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:1880:5
   5: core::option::Option<T>::expect
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:738:21
   6: <std::time::Instant as core::ops::arith::Sub<core::time::Duration>>::sub
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/time.rs:424:9
   7: bkt::Bkt::retrieve_streaming
   8: bkt::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

And with RUST_BACKTRACE=full

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:9
stack backtrace:
   0:        0x10d4389f2 - std::backtrace_rs::backtrace::libunwind::trace::h3274e49b4ebf7e53
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x10d4389f2 - std::backtrace_rs::backtrace::trace_unsynchronized::hd211d632c7624604
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x10d4389f2 - std::sys_common::backtrace::_print_fmt::h1e774fa26a8ee490
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:66:5
   3:        0x10d4389f2 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd2e8cbde22b780fc
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:45:22
   4:        0x10d455bba - core::fmt::write::hd6692086cdd356a7
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/fmt/mod.rs:1202:17
   5:        0x10d43577c - std::io::Write::write_fmt::h6043124a2486acbb
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/io/mod.rs:1679:15
   6:        0x10d43a12b - std::sys_common::backtrace::_print::h3d78a31fac988d9b
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:48:5
   7:        0x10d43a12b - std::sys_common::backtrace::print::hdfd67a42b253a0c5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:35:9
   8:        0x10d43a12b - std::panicking::default_hook::{{closure}}::h87a12b8b06887dd7
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:295:22
   9:        0x10d439e37 - std::panicking::default_hook::h7f68dad17e0bfaa4
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:314:9
  10:        0x10d43a74f - std::panicking::rust_panic_with_hook::hd57427cbbfc3717a
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:698:17
  11:        0x10d43a693 - std::panicking::begin_panic_handler::{{closure}}::h33aab6d04e2bba70
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:588:13
  12:        0x10d438e88 - std::sys_common::backtrace::__rust_end_short_backtrace::h0e7a76f927db9964
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:138:18
  13:        0x10d43a35d - rust_begin_unwind
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
  14:        0x10d461403 - core::panicking::panic_fmt::hcf6f3c517c6f3cb3
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
  15:        0x10d4546eb - core::panicking::panic_display::ha4c5bde01ffb9415
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:72:5
  16:        0x10d45469c - core::panicking::panic_str::h93da7ca3f4eb85fd
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:56:5
  17:        0x10d461299 - core::option::expect_failed::h7ed1f1962cc2062e
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:1880:5
  18:        0x10d4385c2 - core::option::Option<T>::expect::h43a9229dec635650
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:738:21
  19:        0x10d4385c2 - <std::time::Instant as core::ops::arith::Sub<core::time::Duration>>::sub::h71b7739ee54a949a
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/time.rs:424:9
  20:        0x10d3b02b5 - bkt::Bkt::retrieve_streaming::h77a38c05e52cbe27
  21:        0x10d3ba5f0 - bkt::main::hd77d3c47a29551e3
  22:        0x10d3b2c66 - std::sys_common::backtrace::__rust_begin_short_backtrace::hba5bd1edd81f5832
  23:        0x10d3b678c - std::rt::lang_start::{{closure}}::h0f91855c29255625
  24:        0x10d43001e - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hb30f3c124d6422bd
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:283:13
  25:        0x10d43001e - std::panicking::try::do_call::h6ab65f1fbf11caa5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
  26:        0x10d43001e - std::panicking::try::hb963b3a8541f11db
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
  27:        0x10d43001e - std::panic::catch_unwind::hd044952603e5f56c
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
  28:        0x10d43001e - std::rt::lang_start_internal::{{closure}}::h3a2dc7ba533401e5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:48
  29:        0x10d43001e - std::panicking::try::do_call::h88480942a6edc3bf
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
  30:        0x10d43001e - std::panicking::try::ha84f131c957dae86
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
  31:        0x10d43001e - std::panic::catch_unwind::hab613c35aec6b94f
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
  32:        0x10d43001e - std::rt::lang_start_internal::ha5deaf08dab8765b
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:20
  33:        0x10d3bf76f - _main
  34:     0x7ff802ee841f - <unknown>

@kumy
Copy link
Contributor Author

kumy commented Nov 22, 2023

Anything else we can do to help here?

@dimo414
Copy link
Owner

dimo414 commented Nov 23, 2023

😆 tell my employer to prioritize this bug! I will try to figure out the right fix for this soon, it's a little messy.

@dimo414
Copy link
Owner

dimo414 commented Nov 27, 2023

Took a deeper look at this just now. The good news is it looks like the only problematic line is the one linked above, where we construct the CacheStatus::Hit. The bad news is I don't think there's a simple fix. I see a couple of options (assuming Rust doesn't fix the Instance implementation on their end...):

  1. Swap to Instant::checked_sub(). This is the least-invasive change, but it's not clear what to do with failures. The only safe value to return is Instant::now() (i.e. reporting that the cached data is brand new, regardless of its age) which would break --warm - the data should eventually still be cleared when the TTL expires.
    • We could invalidate the cache when this happens, which is probably harmless but I worry this could invalidate data too often. The linked Rust bug indicates the issue occurs on reboots, but I'm not sure if that's the only circumstance where this happens. If I could reproduce this locally that would help.
  2. If we can confirm this issue is tied to reboots we could adjust the default cache dir to not carry over across reboots (/tmp is typically deleted on boot, but given that the reported cases above aren't using a custom dir it sounds like OSX doesn't reliably do so).
    • This could even be a usage-specific fix, i.e. instead of changing bkt callers could use a different --cache_dir per-boot. Not elegant, of course, but it might be easiest 😄
  3. The implication in the Rust bug is that Instant is simply not the right type to use. This is a bit surprising to me and the intended alternative isn't clear, but we could stop using Instant. This would be a breaking API change so I'd rather avoid / delay it unless we're clearly migrating to a type that is intended to support bkt's use case and won't have these sort of surprising sharp edges.
    • Maybe this isn't so risky - bkt relies on SystemTime today to determine when the TTL is expired, so perhaps we should just use that consistently and not worry about situations where SystemTime doesn't work as expected (namely when the system clock is changed / updated).

@kumy
Copy link
Contributor Author

kumy commented Nov 28, 2023

  1. Given the few number of occurrences we had on our side. I think we could leave with: destroying the failing cache entry and return the real content from "the command". Maybe displaying a line on stderr to warn when this error arise 🤷 . (That is my own personal preferred choice)
  2. I asked users, and they cold boot every morning. We are not storing the cache in a volatile dir, and we would like to keep it's content between reboots.
  3. Sorry I'm not in the internal of bkt, I won't comment.

Thanks for bkt and for all your time!

@dimo414
Copy link
Owner

dimo414 commented Nov 28, 2023

Thanks for the context on #2, that does make sense. I would actually suggest changing the cache dir to be cleared on reboot to align with the behavior your Linux users experience. According to this post /tmp is cleared on boot while /var/tmp (which I believe is what OSX uses by default as $TMPDIR) is not, so passing --cache_dir=/tmp for your OSX users might be sufficient to prevent this issue.

I played around a little more with swapping to SystemTime everywhere, and despite its limitations it's looking like a viable change. I could swap to e.g. time but I'm not yet sure if there's sufficient value to doing so. Will flesh it out and publish a fix soon, probably this weekend.

@kumy
Copy link
Contributor Author

kumy commented Nov 28, 2023 via email

@dimo414
Copy link
Owner

dimo414 commented Nov 28, 2023

Oh you didn't mention using a custom cache directory above. In that case your OSX users will need to pick up this fix (once it's released).

@kumy
Copy link
Contributor Author

kumy commented Nov 29, 2023

FTR, one user reported the error this morning after powering his Mac...

@dimo414
Copy link
Owner

dimo414 commented Jan 13, 2024

@kumy apologies for the delay resolving this, I just pushed a commit that should mitigate this issue. I will cut a new release soon but if you / any of your users would like to test the updated binary before I do so that would be great! Artifacts can be found here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants