-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
set UTF8 match flag to compute utf8 captures correctly and fix core dump #8
base: master
Are you sure you want to change the base?
Conversation
dupe() must use the old pattern() as the StringPiece object is not the pattern. This caused core dumps in perl 5.34 with gcc 11 However, I just looked at the code to re-engine-PCRE2 and it doesn't make a new object. I don't know enough about regex+XS to know if this causes a leak in re-engine-RE2 or if PCRE2 is wrong, but one seems to be incorrect |
there is 1 other call to new RE2(...) that I'm not sure about. It is trying to "stringify()", but it seems like that logic should just be simply passing the const char *exp variable as in |
This will allow you to remove the unicode handling from the BUGS section in README too |
Hi Todd, I came across your pull request when researching the bugs section of this module. Since we handle many Unicode emails, we are hesitant to put re2 in production. I'm wondering if you are using your patched version in a live system? |
Hi, this is still something I'd like to address, sorry I hadn't responded sooner...
The fix here doesn't handle the fact a string in perl can be either UTF8 or latin1. It's not just compiling the regexp that matters, but a UTF-8 compiled regexp can be matched with Latin1 text and the other way around. In particular these TODO tests would be a starting point: https://github.com/dgl/re-engine-RE2/blob/master/t/07.utf8.t but I think there would need to be more tests too. I'm concerned this change doesn't change any tests, in particular it's using the UTF8 flag of the regexp to decide if the match should be UTF8 or not, which seems like a misunderstanding of how this works. Although it may appear to work because UTF-8 text is far more common than latin1 thesedays, I'd like to see tests exercising all combinations of match string and regexp with UTF8.
If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this). That's kind of what the BUGS section is trying to get at. I know this code is used in various email systems already. |
Thank you! I'm a little confused by your comment: "If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this)." Could you give me an example of where/how they could be different? |
The test I mentioned is the easiest, just remove the TODO lines and try this: use strict;
use Test::More tests => 3;
#use re::engine::RE2;
{
use utf8;
# Match UTF-8 LHS against UTF-8 RHS
ok "£" =~ /^£$/;
}
{
use utf8;
ok "\x{a3}" =~ /£/;
}
{
use utf8;
ok "£" =~ /^\x{a3}$/;
} I commented out |
Ahh, I see.. based on your examples, I believe our regex will be negatively impacted by this bug. I'm no where near an expert in how to fix this, but thank you for looking at it. I'm happy to test any fixes if you get a chance to take a stab at it! |
@amit777 we use this fix for 100s million to billions of emails per day in production so it is stress tested extensively. However, the calling code is for URL extraction and so we have 2 different regexes compiled - one latin1 regex to use against latin1 text and a much more complex/slow utf8 regex to use against utf8 text. As a result, I haven't verified all the cases mentioned above @dgl i based these changes on code form the PCRE and PCRE2 XS implementations. It definitely fixes completely incorrect match offset values and crashes in our 2 regex case. There might be further work for other cases, but if it doesn't cause regressions, it will at least help those who ever need to act on utf8 text with utf8 regexes |
RXf_MATCH_UTF8 flag is required so that the matches ($1, $2...) seen by perl are computed correctly. Without this, the current code will have strings with wrong starting offset