grep: use PCRE v2 under the hood for -G & -E for big performance gain
Change the underlying engine powering POSIX basic & extended patterns to be PCRE v2 under the hood.

This relies on an experimental, SVN-trunk-only PCRE v2 API which Philip Hazel (the PCRE maintainer) wrote up in response to a feature request I filed[1].

This allows us to use pcre2_pattern_convert() to power all grep regex matches by converting the POSIX patterns into PCRE syntax before compiling them. PCRE is generally faster than the C library's POSIX regex functions and, more importantly, has a JIT (where available), so this speeds up grep by a *lot*.

The improvements to the "perl" tests are already a part of this series, but all the other benchmarks show improvements made by this change alone:

    $ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease USE_LIBPCRE2_BUNDLED=Y CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run v2.13.0 HEAD -- p*grep*
    [...]
    Test                                          v2.13.0             HEAD
    -----------------------------------------------------------------------------------------
    7810.1: grep worktree, cheap regex            0.19(0.35+0.62)     0.18(0.34+0.57) -5.3%
    7810.2: grep worktree, expensive regex        4.35(30.52+0.32)    0.92(5.55+0.39) -78.9%
    7810.3: grep --cached, cheap regex            2.92(2.83+0.06)     2.83(2.75+0.07) -3.1%
    7810.4: grep --cached, expensive regex        21.12(21.02+0.08)   6.28(6.15+0.10) -70.3%
    7820.1: basic grep how.to                     0.28(1.27+0.41)     0.19(0.33+0.58) -32.1%
    7820.2: extended grep how.to                  0.28(1.19+0.49)     0.19(0.32+0.55) -32.1%
    7820.3: perl grep how.to                      0.27(1.10+0.52)     0.19(0.26+0.63) -29.6%
    7820.5: basic grep ^how to                    0.27(1.22+0.43)     0.18(0.28+0.63) -33.3%
    7820.6: extended grep ^how to                 0.27(1.20+0.44)     0.18(0.29+0.62) -33.3%
    7820.7: perl grep ^how to                     0.48(2.81+0.39)     0.18(0.29+0.61) -62.5%
    7820.9: basic grep [how] to                   0.42(2.19+0.44)     0.21(0.34+0.65) -50.0%
    7820.10: extended grep [how] to               0.41(2.18+0.43)     0.21(0.36+0.63) -48.8%
    7820.11: perl grep [how] to                   0.47(2.63+0.38)     0.20(0.29+0.70) -57.4%
    7820.13: basic grep \(e.t[^ ]*\|v.ry\) rare   0.55(3.25+0.43)     0.19(0.53+0.52) -65.5%
    7820.14: extended grep (e.t[^ ]*|v.ry) rare   0.54(3.30+0.42)     0.19(0.51+0.53) -64.8%
    7820.15: perl grep (e.t[^ ]*|v.ry) rare       0.88(5.77+0.41)     0.19(0.52+0.53) -78.4%
    7820.17: basic grep m\(ú\|u\)lt.b\(æ\|y\)te   0.28(1.28+0.47)     0.18(0.35+0.57) -35.7%
    7820.18: extended grep m(ú|u)lt.b(æ|y)te      0.28(1.32+0.43)     0.18(0.28+0.64) -35.7%
    7820.19: perl grep m(ú|u)lt.b(æ|y)te          0.32(1.62+0.43)     0.18(0.28+0.64) -43.8%
    7821.1: fixed grep int                        0.50(1.76+0.58)     0.39(1.16+0.69) -22.0%
    7821.2: basic grep int                        0.55(1.83+0.72)     0.41(1.10+0.68) -25.5%
    7821.3: extended grep int                     0.56(1.78+0.73)     0.47(1.17+0.75) -16.1%
    7821.4: perl grep int                         0.52(1.57+0.80)     0.47(1.29+0.64) -9.6%
    7821.6: fixed grep -i int                     0.55(2.06+0.64)     0.44(1.32+0.65) -20.0%
    7821.7: basic grep -i int                     0.59(2.10+0.67)     0.51(1.40+0.76) -13.6%
    7821.8: extended grep -i int                  0.56(2.08+0.67)     0.53(1.45+0.72) -5.4%
    7821.9: perl grep -i int                      0.58(2.12+0.59)     0.51(1.36+0.78) -12.1%
    7821.11: fixed grep æ                         0.30(1.34+0.40)     0.18(0.27+0.64) -40.0%
    7821.12: basic grep æ                         0.30(1.21+0.53)     0.18(0.31+0.59) -40.0%
    7821.13: extended grep æ                      0.30(1.30+0.44)     0.18(0.28+0.63) -40.0%
    7821.14: perl grep æ                          0.29(1.22+0.52)     0.18(0.33+0.57) -37.9%
    7821.16: fixed grep -i æ                      0.23(0.86+0.51)     0.18(0.27+0.63) -21.7%
    7821.17: basic grep -i æ                      0.23(0.86+0.50)     0.18(0.28+0.62) -21.7%
    7821.18: extended grep -i æ                   0.24(0.89+0.49)     0.18(0.28+0.62) -25.0%
    7821.19: perl grep -i æ                       0.22(0.80+0.48)     0.18(0.30+0.60) -18.2%

Caveats & other things to mention:

 * This will expose PCRE v2 (as opposed to the C library's reg(comp|exec)) to the network via gitweb in its default configuration. See <CACBZZX6V8qbnrZAdhRvPthy5Z91iEG8rrJ=Sf9tdkOt52M9j1Q@mail.gmail.com> for a discussion of security & other caveats related to that.

 * I'm checking for PCRE2_CONVERT_POSIX_BASIC to enable this, but the experimental API of pcre2_pattern_convert() may change before it makes it into a release. If we think this patch is awesome enough to get into a git release regardless, it should be guarded by some other method so we don't rudely tie upstream PCRE to this API, lest they break git versions in the wild.

 * One way to do that would be to guard this via the USE_LIBPCRE2_BUNDLED flag, but see the above e-mail thread for concerns about shipping an embedded PCRE, and for ways that could be made OK.

 * We could ship some copy of just the logic in pcre2_pattern_convert() & use the system PCRE instead. I haven't tried splitting it off from the PCRE codebase, and don't know how hard that would be.

 * There are outstanding bugs in the pcre2_pattern_convert() function. Grepping with -G and -E for every ASCII character from 1..127, as both "$char" and "\\$char", produces numerous differences. These are mostly obscure cases; I'm working out fixes to those with Philip.

1. https://bugs.exim.org/show_bug.cgi?id=2106

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
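
The ASCII sweep mentioned in the last caveat could be enumerated roughly as follows. This is only a sketch in Python, not the harness that was actually used (the commit message does not show one): the helper names ascii_sweep_patterns() and grep_argv() are made up for illustration, and the use of "-e" to pass each pattern is an assumption. It builds the bare and backslash-escaped pattern for each character and the matching git grep command line, without running anything.

```python
def ascii_sweep_patterns():
    """Return (code, pattern) pairs for ASCII 1..127, bare and backslash-escaped."""
    patterns = []
    for code in range(1, 128):
        char = chr(code)
        patterns.append((code, char))         # the shell's "$char"
        patterns.append((code, "\\" + char))  # the shell's "\\$char"
    return patterns


def grep_argv(mode_flag, pattern):
    """Build a git grep command line; mode_flag is "-G" (basic) or "-E" (extended)."""
    if mode_flag not in ("-G", "-E"):
        raise ValueError("expected -G (basic) or -E (extended)")
    # Assumption: the pattern is passed with -e so that a leading "-" is not
    # mistaken for an option.
    return ["git", "grep", mode_flag, "-e", pattern]


if __name__ == "__main__":
    patterns = ascii_sweep_patterns()
    print(len(patterns), "patterns per mode")
    for mode in ("-G", "-E"):
        print(grep_argv(mode, patterns[0][1]))
```

Running each command line against a build with and without this change, and comparing output and exit status, is one way the differences might be surfaced.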