New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MatchData#offset return utf8 character offset #2581
base: trunk
Are you sure you want to change the base?
Conversation
.haystack | ||
.char_indices() | ||
.position(|n| n.0 >= self.region.offset()) | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this unwrap
is firing which is causing the Crash workflow to fail. Commit 0c99a8c.
This PR failing the crash worfklow with a panic in impl Drop for Guard
is a bit concerning and appears to be a lurking bit of unsoundness.
https://github.com/artichoke/artichoke/actions/runs/5133653787/jobs/9294514151?pr=2581
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/matchdata/mod.rs:315:18
stack backtrace:
0: rust_begin_unwind
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
1: core::panicking::panic_fmt
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
2: core::panicking::panic
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:114:5
3: core::option::Option<T>::unwrap
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:942:21
4: artichoke_backend::extn::core::matchdata::MatchData::offset
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/matchdata/mod.rs:311:26
5: artichoke_backend::extn::core::matchdata::MatchData::begin
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/matchdata/mod.rs:179:35
6: artichoke_backend::extn::core::matchdata::trampoline::begin
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/matchdata/trampoline.rs:17:17
7: artichoke_backend::extn::core::matchdata::mruby::matchdata_begin
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/matchdata/mruby.rs:49:18
8: mrb_vm_exec
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1792:18
9: mrb_vm_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1282:12
10: mrb_top_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:3115:7
11: mrb_load_exec
at /home/runner/work/artichoke/artichoke/artichoke-backend/mrbgems/mruby-compiler/core/parse.y:6918:7
12: mrb_load_nstring_cxt
at /home/runner/work/artichoke/artichoke/artichoke-backend/mrbgems/mruby-compiler/core/parse.y:6990:10
13: <artichoke_backend::sys::protect::Eval as artichoke_backend::sys::protect::Protect>::run
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:115:9
14: protect_body
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:13:10
15: mrb_protect_error
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:431:14
16: mrb_protect
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:20:10
17: artichoke_backend::sys::protect::protect
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:46:17
18: artichoke_backend::sys::protect::eval
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:25:5
19: artichoke_backend::eval::<impl artichoke_core::eval::Eval for artichoke_backend::artichoke::Artichoke>::eval::{{closure}}
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/eval.rs:27:42
20: artichoke_backend::artichoke::Artichoke::with_ffi_boundary
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:98:26
21: artichoke_backend::eval::<impl artichoke_core::eval::Eval for artichoke_backend::artichoke::Artichoke>::eval
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/eval.rs:27:13
22: artichoke_backend::load::<impl artichoke_core::load::LoadSources for artichoke_backend::artichoke::Artichoke>::load_source
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/load.rs:109:9
23: artichoke_backend::extn::core::kernel::require::load
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/kernel/require.rs:34:22
24: artichoke_backend::extn::core::kernel::trampoline::load
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/kernel/trampoline.rs:114:19
25: artichoke_backend::extn::core::kernel::mruby::kernel_load
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/kernel/mruby.rs:61:18
26: mrb_vm_exec
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1792:18
27: mrb_vm_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1282:12
28: mrb_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:3100:10
29: mrb_funcall_with_block
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:681:13
30: mrb_funcall_argv
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:692:10
31: <artichoke_backend::sys::protect::Funcall as artichoke_backend::sys::protect::Protect>::run
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:91:13
32: protect_body
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:13:10
33: mrb_protect_error
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:431:14
34: mrb_protect
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:20:10
35: artichoke_backend::sys::protect::protect
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:46:17
36: artichoke_backend::sys::protect::funcall
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:16:5
37: <artichoke_backend::value::Value as artichoke_core::value::Value>::funcall::{{closure}}
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/value.rs:159:17
38: artichoke_backend::artichoke::Artichoke::with_ffi_boundary
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:98:26
39: <artichoke_backend::value::Value as artichoke_core::value::Value>::funcall
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/value.rs:158:13
40: spec_runner::mspec::run
at ./src/mspec.rs:93:18
41: spec_runner::try_main
at ./src/main.rs:172:24
42: spec_runner::main
at ./src/main.rs:110:18
43: core::ops::function::FnOnce::call_once
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'main' panicked at 'Dropping Guard with no State', /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:233:45
stack backtrace:
0: 0x556f0000f28a - std::backtrace_rs::backtrace::libunwind::trace::ha9053a9a07ca49cb
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x556f0000f28a - std::backtrace_rs::backtrace::trace_unsynchronized::h9c2852a457ad564e
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x556f0000f28a - std::sys_common::backtrace::_print_fmt::h457936fbfaa0070f
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:65:5
3: 0x556f0000f28a - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5779d7bf7f70cb0c
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:44:22
4: 0x556f0003009e - core::fmt::write::h5a4baaff1bcd3eb5
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/fmt/mod.rs:1232:17
5: 0x556f0000c875 - std::io::Write::write_fmt::h4bc1f301cb9e9cce
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/io/mod.rs:1684:15
6: 0x556f0000f055 - std::sys_common::backtrace::_print::h5fcdc36060f177e8
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:47:5
7: 0x556f0000f055 - std::sys_common::backtrace::print::h54ca9458b876c8bf
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:34:9
8: 0x556f000106af - std::panicking::default_hook::{{closure}}::hbe471161c7664ed6
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:271:22
9: 0x556f000103eb - std::panicking::default_hook::ha3500da57aa4ac4f
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:290:9
10: 0x556f00010c58 - std::panicking::rust_panic_with_hook::h50c09d000dc561d2
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:692:13
11: 0x556f00010b12 - std::panicking::begin_panic_handler::{{closure}}::h9e2b2176e00e0d9c
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:581:13
12: 0x556f0000f6f6 - std::sys_common::backtrace::__rust_end_short_backtrace::h5739b8e512c09d02
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:150:18
13: 0x556f00010862 - rust_begin_unwind
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
14: 0x556eff8c6163 - core::panicking::panic_fmt::hf33a1475b4dc5c3e
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
15: 0x556effad40b0 - <artichoke_backend::artichoke::Guard as core::ops::drop::Drop>::drop::{{closure}}::h03acbacb8665d386
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:233:45
16: 0x556effb18290 - core::option::Option<T>::unwrap_or_else::h98c260c82ae7a891
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:992:21
17: 0x556effad3fba - <artichoke_backend::artichoke::Guard as core::ops::drop::Drop>::drop::h5c639a5a03628e50
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:233:21
18: 0x556effb8d8eb - core::ptr::drop_in_place<artichoke_backend::artichoke::Guard>::h37be7690eb755a0a
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ptr/mod.rs:490:1
19: 0x556effc6e7d9 - artichoke_backend::extn::core::kernel::mruby::kernel_load::h8e83fce385d4f283
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/extn/core/kernel/mruby.rs:66:1
20: 0x556effce0a8e - mrb_vm_exec
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1792:18
21: 0x556effcdd291 - mrb_vm_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:1282:12
22: 0x556effcf16b5 - mrb_run
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:3100:10
23: 0x556effcdb955 - mrb_funcall_with_block
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:681:13
24: 0x556effcdba14 - mrb_funcall_argv
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:692:10
25: 0x556effb518fd - <artichoke_backend::sys::protect::Funcall as artichoke_backend::sys::protect::Protect>::run::had268d8aff26327a
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:91:13
26: 0x556effd0696d - protect_body
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:13:10
27: 0x556effcdaac9 - mrb_protect_error
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/src/vm.c:431:14
28: 0x556effd069d1 - mrb_protect
at /home/runner/work/artichoke/artichoke/artichoke-backend/vendor/mruby/mrbgems/mruby-error/src/exception.c:20:10
29: 0x556effb50c40 - artichoke_backend::sys::protect::protect::h2564e1bb47afdc74
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:46:17
30: 0x556effb50a47 - artichoke_backend::sys::protect::funcall::h63bc21215de2d8f1
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/sys/protect.rs:16:5
31: 0x556effb39c55 - <artichoke_backend::value::Value as artichoke_core::value::Value>::funcall::{{closure}}::h1192b775fafb7460
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/value.rs:159:17
32: 0x556effac52ec - artichoke_backend::artichoke::Artichoke::with_ffi_boundary::h84192cd4a228a955
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/artichoke.rs:98:26
33: 0x556effb3987c - <artichoke_backend::value::Value as artichoke_core::value::Value>::funcall::hee37548fe29604bf
at /home/runner/work/artichoke/artichoke/artichoke-backend/src/value.rs:158:13
34: 0x556eff9171ba - spec_runner::mspec::run::h893fe048f9a10b30
at /home/runner/work/artichoke/artichoke/spec-runner/src/mspec.rs:93:18
35: 0x556eff918959 - spec_runner::try_main::h8200c89b51f585c3
at /home/runner/work/artichoke/artichoke/spec-runner/src/main.rs:172:24
36: 0x556eff917a46 - spec_runner::main::h8e1a9ceafc1dd7c0
at /home/runner/work/artichoke/artichoke/spec-runner/src/main.rs:110:18
37: 0x556eff9101bb - core::ops::function::FnOnce::call_once::hbfb58f9e389056b5
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
38: 0x556eff90a0fe - std::sys_common::backtrace::__rust_begin_short_backtrace::h762524ad39c17153
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:134:18
39: 0x556eff91afc1 - std::rt::lang_start::{{closure}}::ha85c3fe14162c26b
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:166:18
40: 0x556f0000749c - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hd6efcd3bec896f2c
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:287:13
41: 0x556f0000749c - std::panicking::try::do_call::hce04e543bb1f4cbb
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
42: 0x556f0000749c - std::panicking::try::h3342dd4e1f680968
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
43: 0x556f0000749c - std::panic::catch_unwind::h148ce1e59ac0cee7
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
44: 0x556f0000749c - std::rt::lang_start_internal::{{closure}}::h25f9dda2057a67fe
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:48
45: 0x556f0000749c - std::panicking::try::do_call::h7caaaeaf9401650b
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
46: 0x556f0000749c - std::panicking::try::he7d15285746cbbc2
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
47: 0x556f0000749c - std::panic::catch_unwind::h89fb4f50c0301fe0
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
48: 0x556f0000749c - std::rt::lang_start_internal::h078acd489417d3c1
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:20
49: 0x556eff91af9a - std::rt::lang_start::ha418d6f7c511ad91
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:165:17
50: 0x556eff919ace - main
51: 0x7f6189e73d90 - <unknown>
52: 0x7f6189e73e40 - __libc_start_main
53: 0x556eff8c67f5 - _start
54: 0x0 - <unknown>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the offset isn't guaranteed to exist in the MatchData
. The None
case needs to be handled.
I believe this PR is blocked on #1909. I'll try to work on it this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the yak shave it's looking like this PR will turn into, but this part of the code needs love.
Regexp
and MatchData
were the first custom types added to Artichoke and I've developed better patterns for impling native APIs since these types were added.
If you're interested in digging in deep here, the spinoso-regexp
crate is where I'd love to see some additional investment.
let offset = self.region.offset(); | ||
let offset = self | ||
.haystack | ||
.char_indices() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling bstr
directly here is buggy for a couple of reasons:
- It presumes that the
String
value the haystack bytes were derived from has a UTF-8 string encoding. bstr
'sCharIndices
iterator, like all of the library's UTF-8 oriented methods, has different behavior than Artichoke for some types of invalid UTF-8 byte sequences.bstr
uses the substitution of maximal subparts replacement strategy which means some invalid UTF-8 byte sequences like\xF0\x9F\x8
get collapsed to one replacement character, whereas Artichoke treats these 3 bytes as 3 characters.
end | ||
|
||
def offset_returns_utf8_character_index | ||
raise unless 'тест'.match('с').offset(0) == [2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
echoing from discord: can we add a test similar to this which includes an invalid UTF-8 byte sequence?
Something like this:
raise unless "\xF0\x9F\x87тест".match('с').offset(0) == [4, 5]
Similarly, if we could duplicate all of the test cases for binary strings (e.g. 'тест'.b
), we'd be able to ensure we are supporting non UTF-8 encodings as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using current stable MRI (3.2.2) the test case that you've proposed raises invalid byte sequence in UTF-8 (ArgumentError)
, while Artichoke raises ArgumentError (regex crate utf8 backend for Regexp only supports UTF-8 haystacks)
due to an error returned from str::from_utf8
. So other than the different message wording, this seems to be in line with MRI behavior.
@@ -308,7 +308,12 @@ impl MatchData { | |||
} else { | |||
haystack.len() | |||
}; | |||
let offset = self.region.offset(); | |||
let offset = self | |||
.haystack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having the haystack
be stored in the MatchData
value as a Vec<u8>
instead of a spinoso_string::String
is a mistake that will make fixing this bug correctly impossible.
See also this related ticket: #614.
If you're up for it, we'll want to change the MatchData
rust struct to not store the haystack and regexp within it. Additionally, we should be storing the Value
s that represent the haystack and regexp as instance variables on the MatchData
's Value
(Setting these Value
s as ivars is necessary so mruby can correctly mark these stored values during GC due to the FFI internals of the Rust/C MatchData
bridge. we can chat about this on Discord if you're curious about learning more.)
I think setting ivars on values is not something Artichoke supports right now. This is a missing API in artichoke-core
: https://artichoke.github.io/artichoke/artichoke_core/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A hack to workaround this gap would be to store the spinoso_string::String
on the Rust MatchData
struct instead of a Vec<u8>
.
const FUNCTIONAL_TEST: &[u8] = include_bytes!("matchdata_test.rb"); | ||
|
||
#[test] | ||
fn functional() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this!
Fixes #2580