set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

todd-richmond · 2021-10-27T21:33:15Z

RXf_MATCH_UTF8 flag is required so that the matches ($1, $2...) seen by perl are computed correctly. Without this, the current code will have strings with wrong starting offset

todd-richmond · 2021-10-27T21:49:51Z

dupe() must use the old pattern() as the StringPiece object is not the pattern. This caused core dumps in perl 5.34 with gcc 11

However, I just looked at the code to re-engine-PCRE2 and it doesn't make a new object. I don't know enough about regex+XS to know if this causes a leak in re-engine-RE2 or if PCRE2 is wrong, but one seems to be incorrect
regexp *re = RegSV(rx);
return re->pprivate;

todd-richmond · 2021-10-27T22:15:05Z

there is 1 other call to new RE2(...) that I'm not sure about. It is trying to "stringify()", but it seems like that logic should just be simply passing the const char *exp variable as in
new RE2(exp, otions)
Nevermind - I think that would lose any changes from stringify() call

todd-richmond · 2021-12-18T01:27:01Z

This will allow you to remove the unicode handling from the BUGS section in README too

amit777 · 2022-10-21T03:45:24Z

Hi Todd, I came across your pull request when researching the bugs section of this module. Since we handle many Unicode emails, we are hesitant to put re2 in production. I'm wondering if you are using your patched version in a live system?

dgl · 2022-10-21T04:12:40Z

Hi, this is still something I'd like to address, sorry I hadn't responded sooner...

This will allow you to remove the unicode handling from the BUGS section in README too

The fix here doesn't handle the fact a string in perl can be either UTF8 or latin1. It's not just compiling the regexp that matters, but a UTF-8 compiled regexp can be matched with Latin1 text and the other way around.

In particular these TODO tests would be a starting point: https://github.com/dgl/re-engine-RE2/blob/master/t/07.utf8.t but I think there would need to be more tests too.

I'm concerned this change doesn't change any tests, in particular it's using the UTF8 flag of the regexp to decide if the match should be UTF8 or not, which seems like a misunderstanding of how this works. Although it may appear to work because UTF-8 text is far more common than latin1 thesedays, I'd like to see tests exercising all combinations of match string and regexp with UTF8.

Since we handle many Unicode emails, we are hesitant to put re2 in production.

If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this). That's kind of what the BUGS section is trying to get at. I know this code is used in various email systems already.

amit777 · 2022-10-21T05:24:39Z

Thank you! I'm a little confused by your comment: "If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this)."

Could you give me an example of where/how they could be different?

dgl · 2022-10-21T05:47:17Z

Could you give me an example of where/how they could be different?

The test I mentioned is the easiest, just remove the TODO lines and try this:

use strict;
use Test::More tests => 3;
#use re::engine::RE2;

{
  use utf8;

  # Match UTF-8 LHS against UTF-8 RHS
  ok "£" =~ /^£$/;
}

{
  use utf8;

  ok "\x{a3}" =~ /£/;
}

{
  use utf8;

  ok "£" =~ /^\x{a3}$/;
}

I commented out use re::engine::RE2 and they all passed. Uncomment and you'll see the last two fail. This test is depending on the fact that "\x{a3}" is a pound sign but perl will use latin1 encoding for it. Usually perl hides this from you, but you can use Dump from Devel::Peek to see what the value of the UTF8 flag is.

amit777 · 2022-10-21T06:46:16Z

Ahh, I see.. based on your examples, I believe our regex will be negatively impacted by this bug. I'm no where near an expert in how to fix this, but thank you for looking at it. I'm happy to test any fixes if you get a chance to take a stab at it!

todd-richmond · 2022-10-24T19:17:59Z

@amit777 we use this fix for 100s million to billions of emails per day in production so it is stress tested extensively. However, the calling code is for URL extraction and so we have 2 different regexes compiled - one latin1 regex to use against latin1 text and a much more complex/slow utf8 regex to use against utf8 text. As a result, I haven't verified all the cases mentioned above

@dgl i based these changes on code form the PCRE and PCRE2 XS implementations. It definitely fixes completely incorrect match offset values and crashes in our 2 regex case. There might be further work for other cases, but if it doesn't cause regressions, it will at least help those who ever need to act on utf8 text with utf8 regexes

todd-richmond added 2 commits October 27, 2021 14:30

set UTF8 match flag so that captures are computed properly

1f2cf65

use the old pattern when duping

b8d378d

todd-richmond changed the title ~~set UTF8 match flag so that captures are computed properly~~ set UTF8 match flag to compute utf8 captures correctly and fix core dump Oct 27, 2021

todd-richmond added 2 commits October 27, 2021 15:31

use original pattern string

3c2707a

revert previous change

81afff7

todd-richmond mentioned this pull request Jun 7, 2023

re::engine::RE2 ignores the target string utf8 flag #14

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

todd-richmond commented Oct 27, 2021

todd-richmond commented Oct 27, 2021 •

edited

Loading

todd-richmond commented Oct 27, 2021 •

edited

Loading

todd-richmond commented Dec 18, 2021

amit777 commented Oct 21, 2022

dgl commented Oct 21, 2022

amit777 commented Oct 21, 2022

dgl commented Oct 21, 2022

amit777 commented Oct 21, 2022

todd-richmond commented Oct 24, 2022

set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

Are you sure you want to change the base?

set UTF8 match flag to compute utf8 captures correctly and fix core dump #8

Conversation

todd-richmond commented Oct 27, 2021

todd-richmond commented Oct 27, 2021 • edited Loading

todd-richmond commented Oct 27, 2021 • edited Loading

todd-richmond commented Dec 18, 2021

amit777 commented Oct 21, 2022

dgl commented Oct 21, 2022

amit777 commented Oct 21, 2022

dgl commented Oct 21, 2022

amit777 commented Oct 21, 2022

todd-richmond commented Oct 24, 2022

todd-richmond commented Oct 27, 2021 •

edited

Loading

todd-richmond commented Oct 27, 2021 •

edited

Loading