use pcre instead of oniguruma? #31

Open
191919 opened this Issue Jan 16, 2012 · 6 comments

3 participants

@191919

I know you use oniguruma's POSIX interface, but it bloats the size of the final distribution. Since most servers have libpcre pre-installed, would you please consider switching to pcre? (It should only mean changing 2-3 lines of code.) Thank you.

@deepfryed

Oniguruma has some nice features such as named captures, though I'm not sure if they're used here. If the regular expression engine can be made configurable at compile time, maybe it would be nice to have a choice between oniguruma, pcre and re2.

@191919

I agree: named captures are useful for an HTTP server when handling rewrites, but pcre has named captures too. pcre-8.20 has JIT support, which makes it almost the fastest regex engine in C, and the upcoming pcre-8.30 includes support for UTF-16 strings (so-called pcre16). RE2 is written in C++, so it would need a good wrapper before it could be introduced into a C codebase.

@ellzey
Owner

I agree with all the commentary here. We should abstract the regex layer so that we can use the base features we need from whichever engine is selected.

With both -DENABLE_REGEX_* compile-time options plus a vtbl for pcre, onig, posix, and re2, I think we can all be happy.

Right now the only feature of onig I use is the POSIX regex (for portability).
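For illustration, the vtbl idea described above could be sketched roughly as follows. This is only a minimal sketch, not code from this project: the `rx_vtbl` type, the `rx_*` and `posix_*` names, and the exact `ENABLE_REGEX_PCRE` macro are all hypothetical, and only a POSIX `<regex.h>` backend is filled in.

```c
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical engine-neutral interface: one table of function pointers per backend. */
typedef struct rx_vtbl {
    const char *name;
    void *(*compile)(const char *pattern);         /* returns NULL on error */
    int   (*match)(void *re, const char *subject); /* 1 = match, 0 = no match */
    void  (*release)(void *re);
} rx_vtbl;

/* POSIX backend: wraps regcomp/regexec/regfree from <regex.h>. */
static void *posix_compile(const char *pattern) {
    regex_t *re = malloc(sizeof *re);
    if (re == NULL) return NULL;
    if (regcomp(re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        free(re);
        return NULL;
    }
    return re;
}

static int posix_match(void *re, const char *subject) {
    return regexec((const regex_t *)re, subject, 0, NULL, 0) == 0;
}

static void posix_release(void *re) {
    if (re != NULL) {
        regfree((regex_t *)re);
        free(re);
    }
}

static const rx_vtbl rx_posix = { "posix", posix_compile, posix_match, posix_release };

#if defined(ENABLE_REGEX_PCRE)
/* Assumed to be provided by a separate (hypothetical) pcre backend file. */
extern const rx_vtbl rx_pcre;
#endif

/* Compile-time selection of the default engine; the macro name is a placeholder. */
const rx_vtbl *rx_default(void) {
#if defined(ENABLE_REGEX_PCRE)
    return &rx_pcre;
#else
    return &rx_posix;
#endif
}
```

Callers would then only go through the table, e.g. `rx_default()->compile("^/user/[0-9]+$")`, and never include an engine-specific header directly.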

@ellzey
Owner

If anyone is interested, here are some stats covering various regex engines. Searches were done against a copy of the complete works of Mark Twain.

-----------------
Regex: 'Twain'
[oniguruma] time:    10 ms (2388 matches)
[     pcre] time:    20 ms (2388 matches)
[ pcre-dfa] time:    20 ms (2388 matches)
[      tre] time:   540 ms (2388 matches)
[      re2] time:    10 ms (2388 matches)
[ pcre-jit] time:    20 ms (2388 matches)
-----------------
Regex: '^Twain'
[oniguruma] time:    10 ms (100 matches)
[     pcre] time:   140 ms (100 matches)
[ pcre-dfa] time:   160 ms (100 matches)
[      tre] time:   280 ms (100 matches)
[      re2] time:    50 ms (100 matches)
[ pcre-jit] time:    30 ms (100 matches)
-----------------
Regex: '.*fence.*'
[oniguruma] time:   220 ms (284 matches)
[     pcre] time:   490 ms (284 matches)
[ pcre-dfa] time:   720 ms (284 matches)
[      tre] time:  1120 ms (284 matches)
[      re2] time:    50 ms (284 matches)
[ pcre-jit] time:   100 ms (284 matches)
-----------------
Regex: '.*one day? will'
[oniguruma] time:   230 ms (1 matches)
[     pcre] time:   510 ms (1 matches)
[ pcre-dfa] time:   730 ms (1 matches)
[      tre] time:  1140 ms (1 matches)
[      re2] time:    60 ms (1 matches)
[ pcre-jit] time:   190 ms (1 matches)
-----------------
Regex: 'Twain$'
[oniguruma] time:    20 ms (127 matches)
[     pcre] time:    40 ms (127 matches)
[ pcre-dfa] time:    40 ms (127 matches)
[      tre] time:   800 ms (127 matches)
[      re2] time:     0 ms (127 matches)
[ pcre-jit] time:    20 ms (127 matches)
-----------------
Regex: 'Huck[a-zA-Z]+|Finn[a-zA-Z]+'
[oniguruma] time:    50 ms (83 matches)
[     pcre] time:    30 ms (83 matches)
[ pcre-dfa] time:    30 ms (83 matches)
[      tre] time:  1330 ms (83 matches)
[      re2] time:    60 ms (83 matches)
[ pcre-jit] time:    20 ms (83 matches)
-----------------
Regex: 'Tom|Sawyer|Huckleberry|Finn'
[oniguruma] time:    80 ms (3015 matches)
[     pcre] time:    70 ms (3015 matches)
[ pcre-dfa] time:    70 ms (3015 matches)
[      tre] time:  2110 ms (3015 matches)
[      re2] time:    80 ms (3015 matches)
[ pcre-jit] time:    40 ms (3015 matches)
-----------------
Regex: '[a-zA-Z]+ing'
[oniguruma] time:  1610 ms (95863 matches)
[     pcre] time:  1480 ms (95863 matches)
[ pcre-dfa] time:  2070 ms (95863 matches)
[      tre] time:   870 ms (95863 matches)
[      re2] time:    70 ms (95863 matches)
[ pcre-jit] time:   340 ms (95863 matches)
-----------------
Regex: '[a-zA-Z]+ing$'
[oniguruma] time:  1420 ms (5360 matches)
[     pcre] time:  1450 ms (5360 matches)
[ pcre-dfa] time:  2180 ms (5360 matches)
[      tre] time:   860 ms (5360 matches)
[      re2] time:    50 ms (5360 matches)
[ pcre-jit] time:   350 ms (5360 matches)
-----------------
Regex: '^.{1,3}$'
[oniguruma] time:   550 ms (296 matches)
[     pcre] time:   170 ms (296 matches)
[ pcre-dfa] time:   200 ms (296 matches)
[      tre] time:   390 ms (296 matches)
[      re2] time:    60 ms (296 matches)
[ pcre-jit] time:    40 ms (296 matches)
-----------------
Regex: '([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]'
[oniguruma] time:   410 ms (675 matches)
[     pcre] time:  1750 ms (675 matches)
[ pcre-dfa] time:  1860 ms (675 matches)
[      tre] time:  1900 ms (675 matches)
[      re2] time:    80 ms (675 matches)
[ pcre-jit] time:   200 ms (675 matches)
-----------------
Regex: 'Tom.{0,30}river|river.{0,30}Tom'
[oniguruma] time:   110 ms (4 matches)
[     pcre] time:    80 ms (4 matches)
[ pcre-dfa] time:   110 ms (4 matches)
[      tre] time:  1170 ms (4 matches)
[      re2] time:    60 ms (4 matches)
[ pcre-jit] time:    40 ms (4 matches)

@191919

I took a look at regex-test/oniguruma.c:

    r    = onig_search(reg, subject, end, start, range, region, ONIG_OPTION_NONE);

    if (r >= 0) {
        found = 1;
    } else {
        found = 0;
    }

found is either 1 or 0. Have you updated your source to give the number of matched results?
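To show what counting would involve, here is a minimal sketch that counts non-overlapping matches by searching again from the end of each hit. It uses POSIX `<regex.h>` as a stand-in rather than `onig_search`, and `count_matches` is a hypothetical helper; the actual test harness may well count differently (for example, per line).

```c
#include <regex.h>
#include <stddef.h>

/* Count non-overlapping matches of `pattern` in `subject`.
 * Returns -1 if the pattern does not compile. */
long count_matches(const char *pattern, const char *subject) {
    regex_t re;
    regmatch_t m;
    long count = 0;
    const char *p = subject;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;

    /* After the first search, p no longer points at the start of the
     * subject, so tell regexec not to treat it as beginning-of-line. */
    while (regexec(&re, p, 1, &m, p == subject ? 0 : REG_NOTBOL) == 0) {
        count++;
        /* Resume after this match; step one extra byte after an empty
         * match so the loop cannot spin forever. */
        p += m.rm_eo;
        if (m.rm_so == m.rm_eo) {
            if (*p == '\0') break;
            p++;
        }
    }

    regfree(&re);
    return count;
}
```

The same loop shape applies to other engines: keep the end offset of the last match and pass it as the next start position instead of stopping at the first hit.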

I saved http://news.sina.com.cn/ (an HTML page) as mark.txt and ran ./runtest with a simple regex that is closer to what an HTTP server would use:

/([0-9a-z]+)\?(.*)

The result is:

Regex: '/([0-9a-z]+)\?(.*)'
[     pcre] time:     7 ms (116 matches)
[ pcre-dfa] time:    11 ms (116 matches)
[      tre] time:    44 ms (116 matches)
[      re2] time:     2 ms (116 matches)
[ pcre-jit] time:     2 ms (116 matches)
[oniguruma] time:    41 ms (1 matches)

@ellzey
Owner

The code has since been updated, hence the results above.

I'll be keeping this feature request on the back burner for now. Issue 30 covers an option to disable regex support, while another issue covers finding a local POSIX-compatible regex engine before compiling onig.

If you really want to utilize another engine, you can use one of the many hooks to implement it. That is one of the benefits of having all the per-request and per-connection hook mechanisms.
