grep: use PCRE v2 under the hood for -G & -E for big performance gain
Change the underlying engine powering POSIX basic & extended patterns
to be PCRE v2 under the hood.

This relies on an experimental, SVN-trunk-only PCRE v2 API which Philip
Hazel (the PCRE maintainer) wrote in response to a feature request
I filed[1].

This allows us to use pcre2_pattern_convert() to power all grep regex
matches by converting the POSIX patterns into PCRE syntax before
compiling them.
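
To give a feel for what "converting POSIX patterns into PCRE syntax" means, here is a toy sketch. It is not PCRE's converter and does not use the PCRE API; the function name toy_bre_to_pcre and its rules are made up for illustration. It only swaps the escaping of grouping, alternation, interval and (following the GNU extensions) `+`/`?` operators, and copies bracket expressions through without handling `[:class:]` forms:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bounded output buffer; `overflow` is set once a write does not fit. */
struct outbuf {
    char *buf;
    size_t len, cap;
    int overflow;
};

static void put(struct outbuf *o, char c)
{
    if (o->len + 1 < o->cap)
        o->buf[o->len++] = c;   /* always leaves room for the NUL */
    else
        o->overflow = 1;
}

/*
 * Hypothetical helper: rewrite a basic (-G style) pattern so that its
 * operators use PCRE-style spelling. Returns 0 on success, -1 if the
 * output buffer was too small (the output is then truncated).
 */
static int toy_bre_to_pcre(const char *in, char *out, size_t cap)
{
    struct outbuf o = { out, 0, cap, 0 };
    size_t i = 0;

    assert(in && out && cap > 0);
    while (in[i]) {
        char c = in[i++];

        if (c == '[') {
            /* Copy a bracket expression verbatim, including a
             * leading '^' and a literal ']' in first position. */
            put(&o, c);
            if (in[i] == '^')
                put(&o, in[i++]);
            if (in[i] == ']')
                put(&o, in[i++]);
            while (in[i] && in[i] != ']')
                put(&o, in[i++]);
            if (in[i] == ']')
                put(&o, in[i++]);
        } else if (c == '\\' && in[i]) {
            char n = in[i++];

            if (strchr("()|{}+?", n)) {
                put(&o, n);     /* "\(" in basic syntax is "(" in PCRE */
            } else {
                put(&o, '\\');  /* any other escape is kept as-is */
                put(&o, n);
            }
        } else if (strchr("()|{}+?", c)) {
            put(&o, '\\');      /* a bare "(" is a literal in basic syntax */
            put(&o, c);
        } else {
            put(&o, c);
        }
    }
    out[o.len] = '\0';
    return o.overflow ? -1 : 0;
}
```

With this sketch, the basic pattern from test 7820.13 below comes out spelled like the extended/perl pattern from 7820.14 and 7820.15. The real pcre2_pattern_convert() has to deal with many more corner cases than this.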

Because PCRE is generally faster than the POSIX regex functions and,
most importantly, because of its JIT feature (where available), this
speeds up grep by a *lot*: by up to roughly 79% in the benchmarks
below.

The improvements to the "perl" tests are already a part of this
series, but all the other benchmarks show improvements made by this
change alone:

    $ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease USE_LIBPCRE2_BUNDLED=Y CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run v2.13.0 HEAD -- p*grep*
    [...]
    Test                                           v2.13.0             HEAD
    -----------------------------------------------------------------------------------------
    7810.1: grep worktree, cheap regex            0.19(0.35+0.62)     0.18(0.34+0.57) -5.3%
    7810.2: grep worktree, expensive regex        4.35(30.52+0.32)    0.92(5.55+0.39) -78.9%
    7810.3: grep --cached, cheap regex            2.92(2.83+0.06)     2.83(2.75+0.07) -3.1%
    7810.4: grep --cached, expensive regex        21.12(21.02+0.08)   6.28(6.15+0.10) -70.3%
    7820.1: basic grep how.to                     0.28(1.27+0.41)     0.19(0.33+0.58) -32.1%
    7820.2: extended grep how.to                  0.28(1.19+0.49)     0.19(0.32+0.55) -32.1%
    7820.3: perl grep how.to                      0.27(1.10+0.52)     0.19(0.26+0.63) -29.6%
    7820.5: basic grep ^how to                    0.27(1.22+0.43)     0.18(0.28+0.63) -33.3%
    7820.6: extended grep ^how to                 0.27(1.20+0.44)     0.18(0.29+0.62) -33.3%
    7820.7: perl grep ^how to                     0.48(2.81+0.39)     0.18(0.29+0.61) -62.5%
    7820.9: basic grep [how] to                   0.42(2.19+0.44)     0.21(0.34+0.65) -50.0%
    7820.10: extended grep [how] to               0.41(2.18+0.43)     0.21(0.36+0.63) -48.8%
    7820.11: perl grep [how] to                   0.47(2.63+0.38)     0.20(0.29+0.70) -57.4%
    7820.13: basic grep \(e.t[^ ]*\|v.ry\) rare   0.55(3.25+0.43)     0.19(0.53+0.52) -65.5%
    7820.14: extended grep (e.t[^ ]*|v.ry) rare   0.54(3.30+0.42)     0.19(0.51+0.53) -64.8%
    7820.15: perl grep (e.t[^ ]*|v.ry) rare       0.88(5.77+0.41)     0.19(0.52+0.53) -78.4%
    7820.17: basic grep m\(ú\|u\)lt.b\(æ\|y\)te   0.28(1.28+0.47)     0.18(0.35+0.57) -35.7%
    7820.18: extended grep m(ú|u)lt.b(æ|y)te      0.28(1.32+0.43)     0.18(0.28+0.64) -35.7%
    7820.19: perl grep m(ú|u)lt.b(æ|y)te          0.32(1.62+0.43)     0.18(0.28+0.64) -43.8%
    7821.1: fixed grep int                        0.50(1.76+0.58)     0.39(1.16+0.69) -22.0%
    7821.2: basic grep int                        0.55(1.83+0.72)     0.41(1.10+0.68) -25.5%
    7821.3: extended grep int                     0.56(1.78+0.73)     0.47(1.17+0.75) -16.1%
    7821.4: perl grep int                         0.52(1.57+0.80)     0.47(1.29+0.64) -9.6%
    7821.6: fixed grep -i int                     0.55(2.06+0.64)     0.44(1.32+0.65) -20.0%
    7821.7: basic grep -i int                     0.59(2.10+0.67)     0.51(1.40+0.76) -13.6%
    7821.8: extended grep -i int                  0.56(2.08+0.67)     0.53(1.45+0.72) -5.4%
    7821.9: perl grep -i int                      0.58(2.12+0.59)     0.51(1.36+0.78) -12.1%
    7821.11: fixed grep æ                         0.30(1.34+0.40)     0.18(0.27+0.64) -40.0%
    7821.12: basic grep æ                         0.30(1.21+0.53)     0.18(0.31+0.59) -40.0%
    7821.13: extended grep æ                      0.30(1.30+0.44)     0.18(0.28+0.63) -40.0%
    7821.14: perl grep æ                          0.29(1.22+0.52)     0.18(0.33+0.57) -37.9%
    7821.16: fixed grep -i æ                      0.23(0.86+0.51)     0.18(0.27+0.63) -21.7%
    7821.17: basic grep -i æ                      0.23(0.86+0.50)     0.18(0.28+0.62) -21.7%
    7821.18: extended grep -i æ                   0.24(0.89+0.49)     0.18(0.28+0.62) -25.0%
    7821.19: perl grep -i æ                       0.22(0.80+0.48)     0.18(0.30+0.60) -18.2%
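
For reading the table: the last column appears to be the relative change in the first (elapsed) figure of each cell, with negative values meaning HEAD is faster. A minimal sketch of that arithmetic, assuming this interpretation (the helper name percent_change is made up here):

```c
#include <assert.h>

/*
 * Relative change from `before` to `after`, in percent.
 * Negative means `after` is smaller, i.e. the run got faster.
 */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, percent_change(4.35, 0.92) for test 7810.2 gives about -78.85, which matches the -78.9% shown once rounded to one decimal place.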

Caveats & other things to mention:

 * This will expose PCRE v2 (as opposed to C library reg(comp|exec)) to
   the network via gitweb in its default configuration. See
   <CACBZZX6V8qbnrZAdhRvPthy5Z91iEG8rrJ=Sf9tdkOt52M9j1Q@mail.gmail.com>
   for a discussion of security & other caveats related to that.

 * I'm checking for PCRE2_CONVERT_POSIX_BASIC to enable this, but the
   experimental API of pcre2_pattern_convert() may change before it
   makes it into a release.

   If we think this patch is awesome enough to get into a git release
   regardless, it should be guarded by some other method so we don't
   rudely tie upstream PCRE to this API lest they break git versions
   in the wild.

 * One way to do that would be to guard this via the
   USE_LIBPCRE2_BUNDLED flag, but see the above E-Mail thread for
   concerns about shipping an embedded PCRE, and for ways that could
   be made OK.

 * We could ship some copy of just the logic in
   pcre2_pattern_convert() & use the system PCRE instead. I haven't
   tried splitting it off from the PCRE codebase, and don't know how
   hard that would be.

 * There are outstanding bugs in the pcre2_pattern_convert()
   function. Grepping with -G and -E for every ASCII character from
   1..127, as both "$char" and "\\$char", produces numerous
   differences. These are mostly obscure cases; I'm working out fixes
   for them with Philip.
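
The patterns used for that comparison can be generated mechanically. A minimal sketch, assuming one pattern per character in bare and backslash-escaped form (the helper name make_char_pattern is made up here, and this is not the script that was actually used):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write the one-character pattern for ASCII code `c` into `out`,
 * prefixed with a backslash when `escaped` is non-zero.
 * Returns the pattern length, excluding the terminating NUL.
 */
static size_t make_char_pattern(int c, int escaped, char out[3])
{
    size_t n = 0;

    assert(c >= 1 && c <= 127);
    if (escaped)
        out[n++] = '\\';
    out[n++] = (char)c;
    out[n] = '\0';
    return n;
}
```

Looping `c` from 1 to 127 with `escaped` set to 0 and then 1 yields 254 patterns; each would then be run with both -G and -E against a build with and without the conversion, and the results compared.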

1. https://bugs.exim.org/show_bug.cgi?id=2106

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
avar committed May 29, 2017
1 parent 7dd367e commit a3cc090
Showing 5 changed files with 89 additions and 32 deletions.
96 changes: 67 additions & 29 deletions grep.c
@@ -470,8 +470,47 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
int options = PCRE2_MULTILINE;
const uint8_t *character_tables = NULL;
int jitret;
int icase = opt->regflags & REG_ICASE || p->ignore_case;
PCRE2_SPTR pattern = (PCRE2_SPTR)p->pattern;
PCRE2_SIZE length = p->patternlen;
int copied_pattern = 0;
struct strbuf pattern_sb = STRBUF_INIT;
#ifdef PCRE2_CONVERT_POSIX_BASIC
int convret;
PCRE2_UCHAR *convpatbuf = NULL;
PCRE2_SIZE convpatlen;
int converted_pattern = 0;
#endif

assert(opt->pcre2);
if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) {
if (icase)
strbuf_add(&pattern_sb, "(?i)", 4);
if (opt->fixed)
strbuf_add(&pattern_sb, "\\Q", 2);
strbuf_add(&pattern_sb, p->pattern, p->patternlen);
if (opt->fixed)
strbuf_add(&pattern_sb, "\\E", 2);

pattern = (PCRE2_SPTR)pattern_sb.buf;
length = pattern_sb.len;
copied_pattern = 1;
} else if (opt->pcre2_posix_emulation) {
#ifdef PCRE2_CONVERT_POSIX_BASIC
convret = pcre2_pattern_convert(pattern, length,
(opt->regflags & REG_EXTENDED
? PCRE2_CONVERT_POSIX_EXTENDED
: PCRE2_CONVERT_POSIX_BASIC),
&convpatbuf, &convpatlen, NULL);
if (convret != 0) {
pcre2_get_error_message(convret, errbuf, sizeof(errbuf));
compile_regexp_failed(p, (const char *)&errbuf);
}
pattern = convpatbuf;
length = convpatlen;
converted_pattern = 1;
#endif
} else
assert(opt->pcre2);

p->pcre2_compile_context = NULL;

@@ -486,11 +525,16 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
if (is_utf8_locale() && has_non_ascii(p->pattern))
options |= PCRE2_UTF;

p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
p->patternlen, options, &error, &erroffset,
p->pcre2_compile_context);
p->pcre2_pattern = pcre2_compile(pattern, length, options, &error,
&erroffset, p->pcre2_compile_context);

if (p->pcre2_pattern) {
if (copied_pattern)
strbuf_release(&pattern_sb);
#ifdef PCRE2_CONVERT_POSIX_BASIC
if (converted_pattern)
pcre2_converted_pattern_free(convpatbuf);
#endif
p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, NULL);
if (!p->pcre2_match_data)
die("Couldn't allocate PCRE2 match data");
@@ -582,7 +626,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
static void free_pcre2_pattern(struct grep_pat *p)
{
}
#endif /* !USE_LIBPCRE2 */

static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
{
@@ -604,41 +647,21 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
compile_regexp_failed(p, errbuf);
}
}
#endif /* !USE_LIBPCRE2 */

static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
{
#ifndef USE_LIBPCRE2
int icase, ascii_only;
#endif
int err;

p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
#ifndef USE_LIBPCRE2
icase = opt->regflags & REG_ICASE || p->ignore_case;
ascii_only = !has_non_ascii(p->pattern);

#ifdef USE_LIBPCRE2
if (has_null(p->pattern, p->patternlen)) {
struct strbuf sb = STRBUF_INIT;
if (icase)
strbuf_add(&sb, "(?i)", 4);
if (opt->fixed)
strbuf_add(&sb, "\\Q", 2);
strbuf_add(&sb, p->pattern, p->patternlen);
if (opt->fixed)
strbuf_add(&sb, "\\E", 2);

p->pattern = sb.buf;
p->patternlen = sb.len;

/* FIXME: Check in compile_pcre2_pattern() that we're
* using basic rx using !opt->pcre2 && <something>
*/
opt->pcre2 = 1;

compile_pcre2_pattern(p, opt);
return;
}
#endif

/*
* Even when -F (fixed) asks us to do a non-regexp search, we
* may not be able to correctly case-fold when -i
@@ -672,12 +695,26 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
compile_fixed_regexp(p, opt);
return;
}
#endif

if (opt->pcre2) {
compile_pcre2_pattern(p, opt);
return;
}

#ifdef USE_LIBPCRE2
if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) {
compile_pcre2_pattern(p, opt);
return;
}

#ifdef PCRE2_CONVERT_POSIX_BASIC
opt->pcre2_posix_emulation = 1;
compile_pcre2_pattern(p, opt);
return;
#endif
#endif

if (opt->pcre1) {
compile_pcre1_regexp(p, opt);
return;
@@ -690,6 +727,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
regfree(&p->regexp);
compile_regexp_failed(p, errbuf);
}
return;
}

static struct grep_expr *compile_pattern_or(struct grep_pat **);
5 changes: 5 additions & 0 deletions grep.h
@@ -29,6 +29,9 @@ typedef int pcre2_compile_context;
typedef int pcre2_match_context;
typedef int pcre2_jit_stack;
#endif
#ifndef PCRE2_CONVERT_POSIX_EXTENDED
typedef int pcre2_convert_context;
#endif
#include "kwset.h"
#include "thread-utils.h"
#include "userdiff.h"
@@ -73,6 +76,7 @@ struct grep_pat {
pcre_jit_stack *pcre1_jit_stack;
const unsigned char *pcre1_tables;
int pcre1_jit_on;
pcre2_convert_context *pcre2_convert_context;
pcre2_code *pcre2_pattern;
pcre2_match_data *pcre2_match_data;
pcre2_compile_context *pcre2_compile_context;
@@ -143,6 +147,7 @@ struct grep_opt {
int use_reflog_filter;
int pcre1;
int pcre2;
int pcre2_posix_emulation;
int relative;
int pathname;
int null_following_name;
6 changes: 6 additions & 0 deletions t/README
@@ -820,6 +820,12 @@ use these, and "test_set_prereq" for how to define your own.
USE_LIBPCRE2=YesPlease. Wrap any PCRE using tests that for some
reason need v2 of the PCRE library instead of v1 in these.

- LIBPCRE2_BUNDLED

Git was compiled with the bundled PCRE v2 support via
USE_LIBPCRE2=YesPlease &
USE_LIBPCRE2_BUNDLED=IWantPatternConvertAwesomeSauce.

- CASE_INSENSITIVE_FS

Test is run on a case insensitive file system.
13 changes: 10 additions & 3 deletions t/t7008-grep-binary.sh
@@ -100,9 +100,16 @@ test_expect_success 'git grep ile a' '
git grep ile a
'

test_expect_failure 'git grep .fi a' '
git grep .fi a
'
if test_have_prereq LIBPCRE2_BUNDLED
then
test_expect_success 'git grep .fi a' '
git grep .fi a
'
else
test_expect_failure 'git grep .fi a' '
git grep .fi a
'
fi

nul_match 1 '-F' 'yQf'
nul_match 0 '-F' 'yQx'
1 change: 1 addition & 0 deletions t/test-lib.sh
@@ -1018,6 +1018,7 @@ test -z "$NO_PYTHON" && test_set_prereq PYTHON
test -n "$USE_LIBPCRE1$USE_LIBPCRE2" && test_set_prereq PCRE
test -n "$USE_LIBPCRE1" && test_set_prereq LIBPCRE1
test -n "$USE_LIBPCRE2" && test_set_prereq LIBPCRE2
test -n "$USE_LIBPCRE2_BUNDLED" && test_set_prereq LIBPCRE2_BUNDLED
test -z "$NO_GETTEXT" && test_set_prereq GETTEXT

# Can we rely on git's output in the C locale?