Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behaviour for unmatched subpatterns in Regex::Matchdata #12806

Closed
straight-shoota opened this issue Nov 30, 2022 · 1 comment · Fixed by #12810
Closed

Inconsistent behaviour for unmatched subpatterns in Regex::Matchdata #12806

straight-shoota opened this issue Nov 30, 2022 · 1 comment · Fixed by #12810
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. status:discussion topic:stdlib:text

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Nov 30, 2022

A regex match with unused optional subpatterns leads to inconsistent and apparently unexpected behaviour when querying information about those subpatterns in Regex::Matchdata.

md = /f(?<x>x)?o/.match("foo").not_nil!

md # => Regex::MatchData("fo" 1:nil)

md[1]?   # => nil (OK)
md["x"]? # => nil (OK)
md[1]   # IndexError: Capture group 1 was not matched (OK)
md["x"] # KeyError: Capture group 'x' was not matched (OK)
md.begin(1) # NilAssertionError: Nil assertion failed (😒)
md.end(1)   # NilAssertionError: Nil assertion failed (😒)
md.byte_begin(1) # => -1 (BUG)
md.byte_end(1)   # => -1 (BUG)

The #[]? and #[] methods handle this condition explicitly, returning nil or raising IndexError or KexError with an error message indicating the problem. A minor downside is that the error type does not distinquish between index out of range and an unused subpattern, but that's probably okay. And you can tell from the error message which error it is.

#begin and #end raise an unspecific NilAssertionError. This should be improved. I think it makes sense to raise IndexError like #[] does. I would also suggest non-raising variants #begin? and #end? which return nil if the subpattern was unmatched (or index is out of range).

#byte_begin and #byte_end both return -1 which should be considered a bug. This is clearly incorrect behaviour.
Error handling for these methods should be like #begin and #end raising IndexError. The additional non-raising variants would also make sense.

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. status:discussion topic:stdlib:text labels Nov 30, 2022
@straight-shoota
Copy link
Member Author

Actually, instead of non-raising variants of #begin and #end methods, maybe a bit different approach could be better. I'm thinking about a combined method that returns both begin and end as a single Range(Int32, Int32) value. Retrieving both indices doesn't add any considerable overhead, and I think the API might be a bit more concise with less methods overall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. status:discussion topic:stdlib:text
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant