-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MatchData#offset return utf8 character offset #2581
base: trunk
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# frozen_string_literal: true | ||
|
||
def spec | ||
offset_returns_utf8_character_index | ||
|
||
true | ||
end | ||
|
||
def offset_returns_utf8_character_index | ||
raise unless 'тест'.match('с').offset(0) == [2, 3] | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,7 +14,7 @@ use std::fmt::Write as _; | |
use std::ops::{Bound, RangeBounds}; | ||
use std::str; | ||
|
||
use bstr::BString; | ||
use bstr::{BString, ByteSlice}; | ||
use scolapasta_string_escape::format_debug_escape_into; | ||
|
||
use crate::convert::{implicitly_convert_to_int, implicitly_convert_to_string}; | ||
|
@@ -308,7 +308,12 @@ impl MatchData { | |
} else { | ||
haystack.len() | ||
}; | ||
let offset = self.region.offset(); | ||
let offset = self | ||
.haystack | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. having the See also this related ticket: #614. If you're up for it, we'll want to change the I think setting ivars on values is not something Artichoke supports right now. This is a missing API in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A hack to workaround this gap would be to store the |
||
.char_indices() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. calling
|
||
.position(|n| n.0 >= self.region.offset()) | ||
.unwrap(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like this This PR failing the crash worfklow with a panic in https://github.com/artichoke/artichoke/actions/runs/5133653787/jobs/9294514151?pr=2581
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the offset isn't guaranteed to exist in the |
||
|
||
Ok(Some([offset + begin, offset + end])) | ||
} else { | ||
Ok(None) | ||
|
@@ -363,3 +368,20 @@ impl MatchData { | |
self.regexp.inner().capture0(haystack) | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use crate::test::prelude::*; | ||
|
||
const SUBJECT: &str = "MatchData"; | ||
const FUNCTIONAL_TEST: &[u8] = include_bytes!("matchdata_test.rb"); | ||
|
||
#[test] | ||
fn functional() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks for adding this! |
||
let mut interp = interpreter(); | ||
let result = interp.eval(FUNCTIONAL_TEST); | ||
unwrap_or_panic_with_backtrace(&mut interp, SUBJECT, result); | ||
let result = interp.eval(b"spec"); | ||
unwrap_or_panic_with_backtrace(&mut interp, SUBJECT, result); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
echoing from discord: can we add a test similar to this which includes an invalid UTF-8 byte sequence?
Something like this:
Similarly, if we could duplicate all of the test cases for binary strings (e.g.
'тест'.b
), we'd be able to ensure we are supporting non UTF-8 encodings as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using current stable MRI (3.2.2) the test case that you've proposed raises
invalid byte sequence in UTF-8 (ArgumentError)
, while Artichoke raisesArgumentError (regex crate utf8 backend for Regexp only supports UTF-8 haystacks)
due to an error returned fromstr::from_utf8
. So other than the different message wording, this seems to be in line with MRI behavior.