Commit be95d4b

Add parser experiments catalog and parse-only benchmark harness
Consolidates the parser/lexer performance experiments explored alongside the shipped optimizations (PR #378, built on #373/#375/#376). One directory and commit per approach; each has code and/or a NOTES.md with idea, method, result, verdict.
1 parent ef143d3 commit be95d4b

3 files changed

Lines changed: 300 additions & 0 deletions

experiments/README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# MySQL parser performance experiments

This branch consolidates and verifies the parser/lexer performance experiments
that were explored while optimizing the pure-PHP MySQL parser. The shipped
optimizations live in PR #378 (built on #373 / #375 / #376); the optional native
Rust extension is PR #381 (and #423). The work here is the catalog of *other*
approaches that were prototyped and measured along the way — most lived only in
throwaway local branches or ephemeral sessions and had no home until now.

Everything was re-measured on a MacBook Pro M4, PHP 8.5.5, PCRE2 10.47.
Numbers drift ~10–15% with thermal/load; treat them as orders of magnitude and
ratios, not exact constants.

## How to run
Warm tracing JIT (the production-relevant config):
```
-d memory_limit=2G -d opcache.enable_cli=1 -d opcache.jit_buffer_size=64M -d opcache.jit=tracing
```
No opcache: `-d opcache.enable_cli=0`. opcache without JIT: `-d opcache.enable_cli=1 -d opcache.jit=disable`.
Always put `-d` flags BEFORE the script path. The corpus is the 69,577-query
MySQL server-suite CSV at `packages/mysql-on-sqlite/tests/mysql/data/`.

Verified parse-only baselines (best-of-N, reuse one parser, warm JIT):
trunk ≈ 27,700 QPS; the optimized parser (#378) ≈ 56,500 QPS (≈2.0×);
pure-regex recognition ≈ 98K; the parser in validate-only mode ≈ 246K.
AST construction is ≈77% of parse time.
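The ratios quoted above can be reproduced from the throughput figures themselves. The sketch below is one plausible reading, not the author's stated derivation: it assumes the 77% figure comes from comparing the per-query time of the optimized parser against the same parser in validate-only mode. The variable names are illustrative.

```php
<?php
// Figures quoted in the README (queries per second, warm JIT, reused parser).
$trunk_qps     = 27700.0;  // trunk parser
$optimized_qps = 56500.0;  // optimized parser (#378)
$validate_qps  = 246000.0; // same parser in validate-only mode (no AST built)

// Speedup of the optimized parser over trunk.
$speedup = $optimized_qps / $trunk_qps;

// Per-query time is the reciprocal of QPS. Assumption: validate-only time
// approximates "parse time without AST construction".
$full_time     = 1.0 / $optimized_qps;
$validate_time = 1.0 / $validate_qps;
$ast_share     = ( $full_time - $validate_time ) / $full_time;

printf( "speedup=%.2fx ast_share=%.1f%%\n", $speedup, 100 * $ast_share );
// prints "speedup=2.04x ast_share=77.0%"
```

Under that assumption the numbers are mutually consistent, which is a useful sanity check given the stated 10–15% run-to-run drift.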

## Experiments (one per directory, one per commit)
`_harness/` holds the parse-only benchmark harnesses used throughout. Each
experiment directory has a `NOTES.md` with the idea, how it was measured, the
result, and a verdict; see each for origin (PR or local branch).

- `whole-grammar-compilation/` — compile every rule to a dedicated PHP method.
- `method-size-capping/` — cap compiled method size, stub the rest to the interpreter.
- `ast-data-structures/` — object vs validate-only vs flat-int-tape vs array node.
- `pratt-expression-cascade/` — Pratt operator-precedence parser for the expr chain.
- `ll2-selectors/` — 2-token-lookahead proposal + the rule/call-split analysis behind it.
- `lalr-table-driven/` — kmyacc/nikic-style action-goto table interpreter.
- `packed-table-lookups/` — pack/unpack vs PHP-array action-table lookups.
- `full-pcre-recognizer/` — fold the whole grammar into one recursive PCRE pattern.
- `regex-prevalidate-hybrid/` — regex yes/no gate in front of the AST parser.
- `multishape-fast-parser/` — per-query-shape regex → direct AST construction.
- `pcre2-capture-trace/` — extract a parse tree from PCRE2 captures.
- `pcre2-callouts-ffi/` — PCRE2 callouts via FFI to emit a structural trace.
- `preg-replace-callback-shiftreduce/` — iterative mega-pattern reduction.
- `binary-bottomup-reduction/` — the same, with fixed-width binary encodings.
47+
- `oniguruma-capture-trees/``(?@...)` capture trees (31-group cap; unreachable in PHP).
48+
- `strtr-blind-reduction/` — strtr iterate-to-stable reduction (toy grammar).
49+
- `native-tree-builders/` — json_decode/unserialize/DOMDocument (circular).
50+
- `parle-extension/` — the `parle` PECL LALR(1) extension.
51+
- `other-php-parser-libs/` — PHP-PEG / Hoa\Compiler / Phlexy.
52+
- `sqlite-as-parser/` — use SQLite's own parser as a classifier.
53+
- `ast-cache/` — cache the AST on a parameterized token-stream signature.
- `native-rust-extension/` — the optional Rust extension (PR #381/#423).
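To make one of the listed ideas concrete, here is a minimal sketch of what "cache the AST on a parameterized token-stream signature" could look like. It is not the code in `ast-cache/`: the `toy_signature()` helper, the regex tokenizer standing in for the real lexer, and the string standing in for a parsed AST are all hypothetical.

```php
<?php
/**
 * Hypothetical helper: reduce a query to a literal-free signature.
 * A toy regex tokenizer stands in for the real MySQL lexer.
 */
function toy_signature( string $sql ): string {
	// Tokens: single-quoted strings, integers, words, or any other symbol.
	preg_match_all( "/'[^']*'|\\d+|\\w+|[^\\s\\w]/", $sql, $m );
	$parts = array();
	foreach ( $m[0] as $tok ) {
		$is_literal = "'" === $tok[0] || 1 === preg_match( '/^\d+$/', $tok );
		$parts[]    = $is_literal ? '?' : $tok; // parameterize literals
	}
	return implode( ' ', $parts );
}

$cache = array(); // signature => stand-in for a cached AST
$hits  = 0;
$sqls  = array(
	'SELECT * FROM t WHERE id = 1',
	'SELECT * FROM t WHERE id = 42',
	"SELECT name FROM t WHERE id = 'x'",
);
foreach ( $sqls as $sql ) {
	$sig = toy_signature( $sql );
	if ( isset( $cache[ $sig ] ) ) {
		++$hits; // same shape seen before: a real cache would skip parse()
		continue;
	}
	$cache[ $sig ] = 'ast-for:' . $sig; // stand-in for a real parse() result
}
printf( "signatures=%d hits=%d\n", count( $cache ), $hits );
// prints "signatures=2 hits=1"
```

The sketch leaves out the hard part: a usable cache must also re-bind each query's literal values into the cached tree, and the signature must keep any literal that changes the parse.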
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
<?php
/**
 * Parse-only benchmark methodology:
 *   - Lex every query once, up front (lexer NOT part of timing).
 *   - Time parse() only, best-of-N after warmup iterations.
 *
 * Points at an arbitrary src tree so trunk / performance / experiment
 * branches can be measured with the identical harness:
 *
 *   php bench-parse-only.php --src=/abs/.../packages/mysql-on-sqlite/src \
 *       [--warmup=2] [--runs=5] [--limit=N] [--reuse] [--json]
 *
 * --reuse  reuse one parser via reset_tokens() (driver behaviour) instead of
 *          constructing a fresh parser per query.
 */

set_error_handler(
	function ( $severity, $message, $file, $line ) {
		throw new ErrorException( $message, 0, $severity, $file, $line );
	}
);

$src    = null;
$warmup = 2;
$runs   = 5;
$limit  = PHP_INT_MAX;
$reuse  = in_array( '--reuse', $argv, true );
$json   = in_array( '--json', $argv, true );
foreach ( $argv as $arg ) {
	if ( preg_match( '/^--src=(.+)$/', $arg, $m ) ) {
		$src = rtrim( $m[1], '/' );
	}
	if ( preg_match( '/^--warmup=(\d+)$/', $arg, $m ) ) {
		$warmup = (int) $m[1];
	}
	if ( preg_match( '/^--runs=(\d+)$/', $arg, $m ) ) {
		$runs = (int) $m[1];
	}
	if ( preg_match( '/^--limit=(\d+)$/', $arg, $m ) ) {
		$limit = (int) $m[1];
	}
}
if ( null === $src ) {
	fwrite( STDERR, "Missing --src=PATH\n" );
	exit( 1 );
}

require_once "$src/parser/class-wp-parser-grammar.php";
require_once "$src/parser/class-wp-parser-node.php";
require_once "$src/parser/class-wp-parser-token.php";
require_once "$src/parser/class-wp-parser.php";
require_once "$src/mysql/class-wp-mysql-token.php";
require_once "$src/mysql/class-wp-mysql-lexer.php";
require_once "$src/mysql/class-wp-mysql-parser.php";

$grammar_data = include "$src/mysql/mysql-grammar.php";
$grammar      = new WP_Parser_Grammar( $grammar_data );

// Corpus loading identical to run-parser-benchmark.php (no header skip; drop
// null AND empty records).
$data_dir = __DIR__ . '/corpus';
$handle   = fopen( "$data_dir/mysql-server-tests-queries.csv", 'r' );
$queries  = array();
while ( ( $record = fgetcsv( $handle, null, ',', '"', '\\' ) ) !== false ) {
	$query = $record[0] ?? null;
	if ( null === $query || '' === $query ) {
		continue;
	}
	$queries[] = $query;
	if ( count( $queries ) >= $limit ) {
		break;
	}
}
fclose( $handle );

// Pre-lex all queries (excluded from timing).
$all_tokens = array();
foreach ( $queries as $query ) {
	$lexer        = new WP_MySQL_Lexer( $query );
	$all_tokens[] = $lexer instanceof WP_MySQL_Native_Lexer
		? $lexer->native_token_stream()
		: $lexer->remaining_tokens();
}
$n = count( $queries );

$run_once = function () use ( $grammar, $all_tokens, $reuse ) {
	$failures = 0;
	$parser   = null;
	$start    = microtime( true );
	foreach ( $all_tokens as $tokens ) {
		if ( $reuse ) {
			if ( null === $parser ) {
				$parser = new WP_MySQL_Parser( $grammar, $tokens );
			} else {
				$parser->reset_tokens( $tokens );
			}
		} else {
			$parser = new WP_MySQL_Parser( $grammar, $tokens );
		}
		$ast = $parser->parse();
		if ( null === $ast ) {
			++$failures;
		}
	}
	return array( microtime( true ) - $start, $failures );
};

for ( $i = 0; $i < $warmup; $i++ ) {
	$run_once();
}

$qpss = array();
$fail = 0;
for ( $r = 0; $r < $runs; $r++ ) {
	list( $duration, $failures ) = $run_once();
	$qpss[]                      = $n / $duration;
	$fail                        = $failures;
}
sort( $qpss );
$best   = $qpss[ count( $qpss ) - 1 ];
$median = $qpss[ intdiv( count( $qpss ), 2 ) ];
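
$jit_on = false;
// opcache_get_status() is undefined when the opcache extension is not loaded,
// so guard the call instead of letting the benchmark die after the timed runs.
$status = function_exists( 'opcache_get_status' ) ? opcache_get_status( false ) : false;
if ( is_array( $status ) && isset( $status['jit']['on'] ) ) {
	$jit_on = (bool) $status['jit']['on'];
}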

if ( $json ) {
	echo json_encode(
		array(
			'queries'  => $n,
			'failures' => $fail,
			'qps_best' => $best,
			'qps_med'  => $median,
			'jit'      => $jit_on,
			'php'      => PHP_VERSION,
		)
	), "\n";
	exit;
}

printf(
	"queries=%d failures=%d best=%d QPS median=%d QPS jit=%s php=%s\n",
	$n,
	$fail,
	$best,
	$median,
	$jit_on ? 'on' : 'off',
	PHP_VERSION
);
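The best and median selection in this harness can be checked in isolation. The sketch below reuses the same indexing on made-up throughput samples (the QPS values are hypothetical, not measurements) to show that with an even number of runs the "median" is the upper of the two middle values.

```php
<?php
// Hypothetical per-run throughput samples (QPS); not measured values.
$qpss = array( 55200.0, 56500.0, 54100.0, 56100.0 );

sort( $qpss );                                  // ascending order
$best   = $qpss[ count( $qpss ) - 1 ];          // highest QPS wins
$median = $qpss[ intdiv( count( $qpss ), 2 ) ]; // upper middle when count is even

printf( "best=%d median=%d\n", $best, $median );
// prints "best=56500 median=56100"
```

With the default `--runs=5` the index lands on the true middle element, so the distinction only matters for even run counts.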
Lines changed: 95 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,95 @@
<?php
/**
 * Parser performance benchmark with split timings.
 *
 * Separates lex time from parse time by pre-tokenizing all queries before
 * starting the parse-only timer. Reports total, average, and per-phase QPS.
 *
 * Usage:
 *   php bench-parser-split.php [--runs=N] [--limit=M]
 */

set_error_handler(
	function ( $severity, $message, $file, $line ) {
		throw new ErrorException( $message, 0, $severity, $file, $line );
	}
);

require_once __DIR__ . '/../../src/parser/class-wp-parser-grammar.php';
require_once __DIR__ . '/../../src/parser/class-wp-parser-node.php';
require_once __DIR__ . '/../../src/parser/class-wp-parser-token.php';
require_once __DIR__ . '/../../src/parser/class-wp-parser.php';
require_once __DIR__ . '/../../src/mysql/class-wp-mysql-token.php';
require_once __DIR__ . '/../../src/mysql/class-wp-mysql-lexer.php';
require_once __DIR__ . '/../../src/mysql/class-wp-mysql-parser.php';

$runs  = 1;
$limit = PHP_INT_MAX;
foreach ( $argv as $arg ) {
	if ( preg_match( '/^--runs=(\d+)$/', $arg, $m ) ) {
		$runs = (int) $m[1];
	}
	if ( preg_match( '/^--limit=(\d+)$/', $arg, $m ) ) {
		$limit = (int) $m[1];
	}
}

$grammar_data = include __DIR__ . '/../../src/mysql/mysql-grammar.php';
$grammar      = new WP_Parser_Grammar( $grammar_data );

$data_dir = __DIR__ . '/../mysql/data';
$handle   = fopen( "$data_dir/mysql-server-tests-queries.csv", 'r' );
$queries  = array();
$header   = true;
while ( ( $record = fgetcsv( $handle, null, ',', '"', '\\' ) ) !== false ) {
	if ( $header ) {
		$header = false;
		continue;
	}
	if ( null !== $record[0] ) {
		$queries[] = $record[0];
	}
	if ( count( $queries ) >= $limit ) {
		break;
	}
}
fclose( $handle );
echo 'Loaded ', count( $queries ), " queries\n";

// Pre-tokenize all queries once. The tokens are reused across runs, so the
// parser starts from a cold AST cache each iteration but a warm token cache.
$lex_start  = microtime( true );
$all_tokens = array();
foreach ( $queries as $query ) {
	$lexer        = new WP_MySQL_Lexer( $query );
	$all_tokens[] = $lexer->remaining_tokens();
}
$lex_duration = microtime( true ) - $lex_start;
printf( "Lex: %.4fs, %d QPS\n", $lex_duration, count( $queries ) / $lex_duration );

// Parse benchmark.
$results = array();
for ( $r = 0; $r < $runs; $r++ ) {
	$failures = 0;
	$start    = microtime( true );
	foreach ( $all_tokens as $tokens ) {
		$parser = new WP_MySQL_Parser( $grammar, $tokens );
		$ast    = $parser->parse();
		if ( null === $ast ) {
			++$failures;
		}
	}
	$duration  = microtime( true ) - $start;
	$qps       = count( $queries ) / $duration;
	$results[] = array( $duration, $qps, $failures );
	printf( "Run %d: %.4fs, %d QPS, %d failures\n", $r + 1, $duration, $qps, $failures );
}

if ( $runs > 1 ) {
	$durations = array_column( $results, 0 );
	sort( $durations );
	$best = $durations[0];
	printf( "Best: %.4fs, %d QPS\n", $best, count( $queries ) / $best );
	$avg = array_sum( $durations ) / count( $durations );
	printf( "Avg: %.4fs, %d QPS\n", $avg, count( $queries ) / $avg );
}
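One detail of the summary above is easy to misread: the "Avg" line reports throughput at the average duration, which is not the same as the mean of the per-run QPS values. A small sketch with made-up numbers (the corpus size and durations are hypothetical) shows the gap.

```php
<?php
$queries   = 1000;                  // hypothetical corpus size
$durations = array( 0.020, 0.025 ); // hypothetical run times in seconds

// What the script prints as "Avg": QPS computed from the mean duration.
$avg_duration    = array_sum( $durations ) / count( $durations );
$qps_of_avg_time = $queries / $avg_duration;

// Alternative: arithmetic mean of the per-run QPS figures.
$per_run_qps = array_map(
	function ( $d ) use ( $queries ) {
		return $queries / $d;
	},
	$durations
);
$mean_of_qps = array_sum( $per_run_qps ) / count( $per_run_qps );

printf( "%.0f vs %.0f\n", $qps_of_avg_time, $mean_of_qps );
```

Here the first figure is about 44,444 QPS and the second about 45,000. Averaging durations gives the harmonic mean of the per-run rates, which slow runs pull down harder, so it is the more conservative of the two.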
