[fix] Rewrite search queries to avoid parenthesised-step slow path #178

Merged
line-o merged 1 commit into eXist-db:master from joewiz:fix/search-query-perf on May 6, 2026

@joewiz joewiz commented May 6, 2026

Summary

Each search function in modules/app.xqm was using the shape $app:data//(branch1 | branch2 | ...) -- a parenthesised union step expression. At runtime this defeats the structural-index fast path: the engine materialises the full descendant axis under each $app:data root and applies the union as a generic step, instead of dispatching each branch by qname through the structural index.

This PR rewrites each function to the equivalent split form $app:data//branch1 | $app:data//branch2 | ..., where each branch is an independent path with its own structural-index lookup.
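As a rough illustration of the two shapes, the sketch below runs both against a tiny in-memory document. The element layout, the `$data` stand-in for `$app:data`, and the use of plain `contains()` are assumptions made for the example; the actual predicates in app.xqm are not reproduced here, and an in-memory tree will not show the index-related speed difference.

```xquery
xquery version "3.1";

(: Illustrative stand-in for $app:data; the element layout is an assumption. :)
let $data :=
    document {
        <module>
            <name>util</name>
            <function>
                <name>util:eval</name>
                <description>Dynamically evaluates a query.</description>
            </function>
        </module>
    }
let $q := "eval"   (: hypothetical search term :)

(: Before: parenthesised step, single branch and two-branch union :)
let $name-before := $data//(name[contains(., $q)])
let $any-before  := $data//(name[contains(., $q)] | description[contains(., $q)])

(: After: parens dropped, union distributed over $data// :)
let $name-after := $data//name[contains(., $q)]
let $any-after  := $data//name[contains(., $q)] | $data//description[contains(., $q)]

return
    <shapes>
        <single before="{ count($name-before) }" after="{ count($name-after) }"/>
        <union before="{ count($any-before) }" after="{ count($any-after) }"/>
    </shapes>
```

The before and after counts should agree in each pair; only the way the engine is able to evaluate the path differs.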

What changed

5 functions in src/main/xar-resources/modules/app.xqm:

  • search-in-module-location -- single-branch parenthesised step, parens dropped
  • search-in-module-name -- single-branch parenthesised step, parens dropped
  • search-in-description -- 2-branch union, distributed over $app:data//
  • search-in-signature -- 2-branch union, distributed
  • search-everywhere -- 9-branch union, distributed

Plus a comment block explaining why the rewrite matters and pointing to the upstream optimiser PR.

Why both forms are equivalent

XPath's | is a set union with document-order sort and duplicate elimination. For any node sequence $P and relative path expressions A and B (with or without predicates):

$P//(A | B)   ≡   $P//A | $P//B

The right-hand form evaluates each path independently and unions the results; the left-hand form materialises the descendant axis once and applies the union per-node. Both produce the same node-set in document order.
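One way to spot-check the equivalence on real data is to compare the two results node by node. The helper `local:same-nodes` and the sample document below are hypothetical and written only for this sketch; it tests node identity and order, not just counts.

```xquery
xquery version "3.1";

(: Hypothetical helper: true when both sequences hold the same nodes in the same order. :)
declare function local:same-nodes($a as node()*, $b as node()*) as xs:boolean {
    count($a) eq count($b)
    and (every $i in 1 to count($a) satisfies $a[$i] is $b[$i])
};

(: Illustrative data; the element names are assumptions. :)
let $P :=
    document {
        <module>
            <function>
                <name>one</name>
                <signature>one() as xs:integer</signature>
                <description>Returns one.</description>
            </function>
            <function>
                <name>two</name>
                <signature>two() as xs:integer</signature>
                <description>Returns one plus one.</description>
            </function>
        </module>
    }
let $left  := $P//(name[contains(., "one")] | description[contains(., "one")])
let $right := $P//name[contains(., "one")] | $P//description[contains(., "one")]
return local:same-nodes($left, $right)
```

Because the union operator already returns its result in document order without duplicates, the positional comparison is enough; no extra sort is needed before comparing.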

Numbers

Synthetic xqdoc-shaped corpus (200 modules, 30 functions each = 6,000 functions, ngram-indexed on description/name/signature/param/return), measured against an embedded eXist running develop:

| function | shape | before (parens) | after (split) |
|---|---|---|---|
| search-in-description | 2-branch | ~38 ms | ~3 ms |
| search-in-signature | 2-branch | ~34 ms | ~3 ms |
| search-everywhere | 9-branch | ~35 ms | ~5 ms |
| search-in-module-location | 1-branch parens | ~30 ms | ~3 ms |
| search-in-module-name | 1-branch parens | ~30 ms | ~3 ms |

The function-reference UI's keystroke latency on large corpora drops correspondingly.
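For anyone who wants to reproduce a comparison of this kind, a minimal timing sketch for eXist might look like the following. It is not the harness used for the numbers above: the collection path, the helper `local:time-ms`, the unprefixed element names, and the use of plain `contains()` instead of the app's real predicates are all assumptions. It relies on eXist's `util:system-dateTime()`, which returns the wall-clock time at the moment of the call.

```xquery
xquery version "3.1";

import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical data location; substitute wherever the xqdoc documents are stored. :)
declare variable $local:data := collection("/db/apps/fundocs/data");

(: Hypothetical helper: runs the query once and reports hit count and elapsed milliseconds. :)
declare function local:time-ms($query as function() as node()*) as map(*) {
    let $start := util:system-dateTime()
    let $hits := count($query())
    let $end := util:system-dateTime()
    return map {
        "hits": $hits,
        "ms": ($end - $start) div xs:dayTimeDuration("PT0.001S")
    }
};

let $q := "exist_home"
return map {
    "parens": local:time-ms(function() {
        $local:data//(description[contains(., $q)] | name[contains(., $q)])
    }),
    "split": local:time-ms(function() {
        $local:data//description[contains(., $q)] | $local:data//name[contains(., $q)]
    })
}
```

A single run is noisy; repeating each query several times and discarding the first (cold) run would give steadier figures.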

Related work

Companion PR upstream: eXist-db/exist#6303 -- adds an Optimizer pass that automatically unwraps the single-step parens shape //(name), so future code that accidentally uses parens around a single step gets the win for free. The union-of-steps distribution that this PR does by hand is left as an upstream follow-up because it requires more invasive AST rewriting (distributing the parent path over union branches needs either a PathExpr.replaceAllSteps-style API or rewriting the outer PathExpr at its parent).

Investigation thread: eXist-db/exist#6295 -- @line-o reported residual ngram performance issues in this app after #6300 merged. Diagnosis pinned the slow path to the parenthesised-step shape in this app's queries, not to ngram or the optimizer's predicate-rewriting. This PR fixes the app side; #6303 fixes the engine side as far as it can.

Test plan

  • Visual diff review of app.xqm: 5 functions rewritten, semantics-preserving
  • Cypress E2E (fundoc_spec.cy.js includes a search-everywhere case for "exist_home") -- run by maintainer / CI

[This PR was prepared with Claude Code. -Joe]

🤖 Generated with Claude Code

Each search function in app.xqm was using the shape
`$app:data//(branch1 | branch2 | ...)` -- a parenthesised union step
expression. At runtime this defeats the structural index fast path:
the engine materialises the full descendant axis under each
`$app:data` root, then applies the parenthesised expression as a
generic step, instead of dispatching each branch by qname through the
structural index.

The split form `$app:data//branch1 | $app:data//branch2 | ...` is
semantically identical (XPath's `|` is set union with document-order
sort and dedup) but evaluates each branch as an independent path with
its own structural-index lookup.

On a synthetic xqdoc corpus (~6,000 functions) the full
`search-everywhere` query goes from ~35ms (parenthesised form) down
to ~5ms (split form). The function reference UI's keystroke latency
on large corpora drops correspondingly.

Two of the functions (`search-in-module-location`,
`search-in-module-name`) had a single-branch parenthesised step that
also hit the same slow path; they're rewritten by simply dropping the
unnecessary parens.

For the upstream optimiser-side companion fix (which addresses the
single-step `//(name)` shape automatically), see
eXist-db/exist#6303. The union-of-steps
distribution that this commit performs by hand is left as an upstream
follow-up because it requires more invasive AST rewriting than the
parser/optimiser currently support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duncdrum duncdrum added this to v7.0.0 May 6, 2026
@line-o line-o requested review from a team May 6, 2026 23:01
@line-o line-o left a comment

sigh yes, let's do this

@line-o line-o merged commit 2ba3170 into eXist-db:master May 6, 2026
2 checks passed
@github-project-automation github-project-automation Bot moved this to Done in v7.0.0 May 6, 2026
@joewiz joewiz deleted the fix/search-query-perf branch May 7, 2026 01:45