Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

todd-richmond
Copy link

RXf_MATCH_UTF8 flag is required so that the matches ($1, $2...) seen by perl are computed correctly. Without this, the current code will have strings with wrong starting offset

@todd-richmond
Copy link
Author

todd-richmond commented Oct 27, 2021

dupe() must use the old pattern() as the StringPiece object is not the pattern. This caused core dumps in perl 5.34 with gcc 11

However, I just looked at the code to re-engine-PCRE2 and it doesn't make a new object. I don't know enough about regex+XS to know if this causes a leak in re-engine-RE2 or if PCRE2 is wrong, but one seems to be incorrect
regexp *re = RegSV(rx);
return re->pprivate;

@todd-richmond todd-richmond changed the title set UTF8 match flag so that captures are computed properly set UTF8 match flag to compute utf8 captures correctly and fix core dump Oct 27, 2021
@todd-richmond
Copy link
Author

todd-richmond commented Oct 27, 2021

there is 1 other call to new RE2(...) that I'm not sure about. It is trying to "stringify()", but it seems like that logic should just be simply passing the const char *exp variable as in
new RE2(exp, otions)
Nevermind - I think that would lose any changes from stringify() call

@todd-richmond
Copy link
Author

This will allow you to remove the unicode handling from the BUGS section in README too

@amit777
Copy link

amit777 commented Oct 21, 2022

Hi Todd, I came across your pull request when researching the bugs section of this module. Since we handle many Unicode emails, we are hesitant to put re2 in production. I'm wondering if you are using your patched version in a live system?

@dgl
Copy link
Owner

dgl commented Oct 21, 2022

Hi, this is still something I'd like to address, sorry I hadn't responded sooner...

This will allow you to remove the unicode handling from the BUGS section in README too

The fix here doesn't handle the fact a string in perl can be either UTF8 or latin1. It's not just compiling the regexp that matters, but a UTF-8 compiled regexp can be matched with Latin1 text and the other way around.

In particular these TODO tests would be a starting point: https://github.com/dgl/re-engine-RE2/blob/master/t/07.utf8.t but I think there would need to be more tests too.

I'm concerned this change doesn't change any tests, in particular it's using the UTF8 flag of the regexp to decide if the match should be UTF8 or not, which seems like a misunderstanding of how this works. Although it may appear to work because UTF-8 text is far more common than latin1 thesedays, I'd like to see tests exercising all combinations of match string and regexp with UTF8.

Since we handle many Unicode emails, we are hesitant to put re2 in production.

If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this). That's kind of what the BUGS section is trying to get at. I know this code is used in various email systems already.

@amit777
Copy link

amit777 commented Oct 21, 2022

Thank you! I'm a little confused by your comment: "If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this)."

Could you give me an example of where/how they could be different?

@dgl
Copy link
Owner

dgl commented Oct 21, 2022

Could you give me an example of where/how they could be different?

The test I mentioned is the easiest, just remove the TODO lines and try this:

use strict;
use Test::More tests => 3;
#use re::engine::RE2;

{
  use utf8;

  # Match UTF-8 LHS against UTF-8 RHS
  ok "£" =~ /$/;
}

{
  use utf8;

  ok "\x{a3}" =~ /£/;
}

{
  use utf8;

  ok "£" =~ /^\x{a3}$/;
}

I commented out use re::engine::RE2 and they all passed. Uncomment and you'll see the last two fail. This test is depending on the fact that "\x{a3}" is a pound sign but perl will use latin1 encoding for it. Usually perl hides this from you, but you can use Dump from Devel::Peek to see what the value of the UTF8 flag is.

@amit777
Copy link

amit777 commented Oct 21, 2022

Ahh, I see.. based on your examples, I believe our regex will be negatively impacted by this bug. I'm no where near an expert in how to fix this, but thank you for looking at it. I'm happy to test any fixes if you get a chance to take a stab at it!

@todd-richmond
Copy link
Author

@amit777 we use this fix for 100s million to billions of emails per day in production so it is stress tested extensively. However, the calling code is for URL extraction and so we have 2 different regexes compiled - one latin1 regex to use against latin1 text and a much more complex/slow utf8 regex to use against utf8 text. As a result, I haven't verified all the cases mentioned above

@dgl i based these changes on code form the PCRE and PCRE2 XS implementations. It definitely fixes completely incorrect match offset values and crashes in our 2 regex case. There might be further work for other cases, but if it doesn't cause regressions, it will at least help those who ever need to act on utf8 text with utf8 regexes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants