Implement Accountable and expose Levenshtein automata RAM cost in FuzzyQuery #16031

Closed
drempapis wants to merge 2 commits into apache:main from drempapis:fuzzy-query-accountable

Conversation

@drempapis
Contributor

The need

AutomatonQuery and its subclasses (RegexpQuery, WildcardQuery, PrefixQuery, TermRangeQuery) build a CompiledAutomaton eagerly in their constructor and retain it as a field. Because AutomatonQuery implements Accountable, callers can perform request-scoped memory accounting by reading Accountable#ramBytesUsed() at construction time.

FuzzyQuery, in contrast, constructs its Levenshtein automata lazily inside FuzzyTermsEnum and stores them on the search-scoped AttributeSource so they can be shared across segments during a single rewrite. As a result, there is currently no public way to ask a FuzzyQuery how much RAM its automata will cost without actually executing it: FuzzyQuery does not implement Accountable, and the AutomatonAttribute mechanism inside FuzzyTermsEnum is private.

Changes

This PR introduces two additions and one visibility relaxation, with no behavioural changes:

  1. FuzzyQuery now implements Accountable. ramBytesUsed() returns a stable value: shallowSizeOfInstance(FuzzyQuery.class) + term.ramBytesUsed(). It excludes the Levenshtein automata because the query does not retain them; they live per-search on an AttributeSource. Folding them into ramBytesUsed() would make the value jump from 0 to N after the first execution and inflate query-cache accounting with memory the query does not own (see LUCENE-9350).
  2. A new public method FuzzyQuery#computeAutomataRamBytes(AttributeSource atts) returns the aggregate RAM cost of the CompiledAutomaton[] used to execute the query, building them on the supplied AttributeSource via the same sharing mechanism FuzzyTermsEnum already uses across segments. Returns 0L when maxEdits == 0. Calling getTermsEnum(terms, atts) afterwards with the same AttributeSource reuses the primed automata instead of rebuilding them.
  3. The FuzzyTermsEnum.AutomatonAttribute and FuzzyTermsEnum.AutomatonAttributeImpl are widened from private to package-private so FuzzyQuery can install/read the same attribute type. They remain non-public Lucene internals.

How clients can use it

Clients who run FuzzyQuery and need to bound or report memory now have two complementary, non-disruptive primitives:

  • ramBytesUsed(): cheap, stable, query-identity-preserving accounting of the query object itself. Safe to fold into existing Accountable walks.
  • computeAutomataRamBytes(atts): a pre-flight handle on the dominant cost (the CompiledAutomaton[] transition tables). Two usage patterns:
    1. Pre-flight, reuse path. Pass an AttributeSource into computeAutomataRamBytes, account/charge the returned bytes, then pass the same AttributeSource into getTermsEnum. The automata are built once and reused across all segments, with no duplicate work.
    2. In-flight observation. Subclass FuzzyQuery and override getTermsEnum(Terms, AttributeSource); after super.getTermsEnum returns, call computeAutomataRamBytes(atts) on the same atts. init is idempotent on a primed attribute, so this only walks the already-built array (no second build). This lets callers attach accounting to query execution without changing the Lucene execution path.
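
The "build once, then reuse" behaviour that both usage patterns rely on can be modelled without Lucene. The sketch below is only an illustration under stated assumptions: SharedAutomataHolder is a hypothetical stand-in for the per-search attribute, a long[] of byte sizes stands in for the CompiledAutomaton[] array, and the sizes are made up. It is not the PR's implementation.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-search attribute holding lazily built automata. */
final class SharedAutomataHolder {
  // Stand-in for CompiledAutomaton[]: each entry is one automaton's size in bytes.
  private long[] automataSizes;
  private int builds;

  /** Idempotent: invokes the builder on the first call only, then reuses the array. */
  long[] init(Supplier<long[]> builder) {
    if (automataSizes == null) {
      automataSizes = builder.get();
      builds++;
    }
    return automataSizes;
  }

  /** Sums the sizes, building the array first if nobody has primed the holder yet. */
  long ramBytes(Supplier<long[]> builder) {
    long total = 0;
    for (long size : init(builder)) {
      total += size;
    }
    return total;
  }

  int buildCount() {
    return builds;
  }
}

final class SharedAutomataHolderDemo {
  public static void main(String[] args) {
    SharedAutomataHolder holder = new SharedAutomataHolder();
    // Made-up sizes for the maxEdits + 1 automata of one query.
    Supplier<long[]> builder = () -> new long[] {1_000L, 6_000L, 40_000L};

    long preflight = holder.ramBytes(builder); // pre-flight: triggers the only build
    holder.init(builder);                      // "execution": reuses the primed array
    long inflight = holder.ramBytes(builder);  // in-flight: walks the existing array

    System.out.println(preflight + " / " + inflight + " bytes, builds=" + holder.buildCount());
  }
}
```

In this model, measuring before or after execution yields the same number and the builder runs once, which is the property the two patterns above depend on.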

@drempapis
Contributor Author

@rmuir can you please have a look?

github-actions bot added this to the 11.0.0 milestone May 4, 2026
@rmuir
Member

rmuir commented May 4, 2026

I'm not sure we need the complexity. The algorithm constructs a DFA, linear in space and time with respect to the input string.

@drempapis
Contributor Author

@rmuir thank you for your feedback.

Fair point on the complexity. Let me clarify the trade-off, because I think the change is smaller than it reads, and the gap it fills is real.

On complexity. The PR doesn't introduce any new construction, caching, or sharing mechanism. The two real additions are implements Accountable with a stable shallowSize + term.ramBytesUsed() (three lines) and computeAutomataRamBytes(AttributeSource), which simply walks the existing AutomatonAttribute path that FuzzyTermsEnum already uses to share automata across segments: same supplier, same idempotent init, same array. The only structural change is widening AutomatonAttribute / AutomatonAttributeImpl from private to package-private so FuzzyQuery can install/read the same attribute type the enum already installs. They remain non-public Lucene internals.

On linear in input string. Agreed: per query, the DFA is O(n) in the term length n, with k ≤ 2 edits and a small constant. The motivation for the API isn't a single FuzzyQuery, it's the aggregate. In real workloads, a parsed query is often a BooleanQuery with many fuzzy clauses (say m of them) against fields with long terms; multiply by concurrent requests and the aggregate O(m · n) heap charge is large enough, and unpredictable enough from the host's point of view, to push the JVM toward OOM with no graceful degradation. Lucene rightly doesn't bound any of this (m, n, the per-segment fan-out, or the number of in-flight rewrites), so the host has to.

What this enables. Hosts already do request-scoped accounting for the rest of the family (RegexpQuery, WildcardQuery, PrefixQuery, TermRangeQuery), all of which expose their compiled-automaton cost through Accountable#ramBytesUsed() because they retain the automaton. Currently they can't extend the same scheme to FuzzyQuery, because FuzzyQuery deliberately doesn't retain its automata on the query. The important part is that computeAutomataRamBytes(atts) lets a host pre-compute the automata cost on demand, charge it against a circuit breaker (or any other metric), and then run the query, passing the same AttributeSource into rewrite so the automata are not built twice. So the marginal CPU cost of measurement is zero in the steady-state path: you'd have built those automata during search anyway.
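
To make the "charge it against a circuit breaker" step concrete, here is a minimal sketch of a request-scoped budget. RequestMemoryBudget and its methods are hypothetical names chosen for illustration; they are not part of Lucene or of this PR, and a real host would plug in whatever breaker it already has.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical request-scoped memory budget; not a Lucene class. */
final class RequestMemoryBudget {
  private final long limitBytes;
  private final AtomicLong usedBytes = new AtomicLong();

  RequestMemoryBudget(long limitBytes) {
    this.limitBytes = limitBytes;
  }

  /** Reserves the bytes, or throws and leaves the reserved total unchanged. */
  void charge(long bytes) {
    if (bytes < 0) {
      throw new IllegalArgumentException("bytes must be >= 0, got: " + bytes);
    }
    long updated = usedBytes.addAndGet(bytes);
    if (updated > limitBytes) {
      usedBytes.addAndGet(-bytes); // roll back the failed reservation
      throw new IllegalStateException(
          "budget exceeded: " + updated + " > " + limitBytes + " bytes");
    }
  }

  /** Returns previously charged bytes, e.g. when the search finishes. */
  void release(long bytes) {
    usedBytes.addAndGet(-bytes);
  }

  long used() {
    return usedBytes.get();
  }
}

final class RequestMemoryBudgetDemo {
  public static void main(String[] args) {
    RequestMemoryBudget budget = new RequestMemoryBudget(1_000L);
    budget.charge(600L); // first fuzzy clause fits
    try {
      budget.charge(500L); // second clause would exceed the limit
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    System.out.println("still reserved: " + budget.used());
  }
}
```

The value passed to charge would be whatever the pre-flight measurement returns; rejecting the request before the query runs is the graceful-degradation path that the host otherwise lacks.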

What it does not change. No effect on equals/hashCode (so query-cache keys are unchanged), no behavioral change to rewrite, scoring, or cross-segment automata sharing, no public-API change on FuzzyTermsEnum, no default-path change for callers that don't touch the new API.

@rmuir
Member

rmuir commented May 5, 2026

> Hosts already do request-scoped accounting for the rest of the family (RegexpQuery, WildcardQuery, PrefixQuery, TermRangeQuery), all of which expose their compiled-automaton cost through Accountable#ramBytesUsed() because they retain the automaton. Currently they can't extend the same scheme to FuzzyQuery, because FuzzyQuery deliberately doesn't retain its automata on the query.

I guess this is what I'm disputing. It isn't really "in the family" since it builds a small DFA correctly. It doesn't have unpredictable size and runtime like Regexp or Wildcard.

I'm worried if we expose too many guts for these small things, it will hinder future improvements. For this one, a good future improvement would be to look at removing tableization if we don't need it.

@drempapis
Contributor Author

I don't think FuzzyQuery really belongs in the same group as RegexpQuery, WildcardQuery, PrefixQuery, and TermRangeQuery. Those queries take user patterns, and the compiled automaton can grow without a clear bound. That's why they keep the automaton on the query and report it through Accountable#ramBytesUsed().

FuzzyQuery is different. Its DFA depends only on termLength and maxEdits. maxEdits is capped at 2, and prefixLength is clipped to the term length. So the size and the build time are bounded and predictable.

I ran a small JMH benchmark to show this:

import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.util.AttributeSource;
import org.openjdk.jmh.annotations.AuxCounters;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1, warmups = 1)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class FuzzyAutomatonRamBenchmark {

  @Param({"16", "32", "64", "128", "256"})
  public int termLength;

  @Param({"1", "2"})
  public int maxEdits;

  private FuzzyQuery query;

  @AuxCounters(AuxCounters.Type.EVENTS)
  @State(Scope.Thread)
  public static class AutomataRam {
    public long ramBytes;
  }

  @Setup(Level.Trial)
  public void setup() {
    StringBuilder sb = new StringBuilder(termLength);
    for (int i = 0; i < termLength; i++) {
      sb.append((char) ('a' + (i % 26)));
    }
    query =
        new FuzzyQuery(
            new Term("f", sb.toString()),
            maxEdits,
            0,
            FuzzyQuery.defaultMaxExpansions,
            true);
  }

  @Benchmark
  public long buildAutomata(AutomataRam metrics) {
    long bytes = query.computeAutomataRamBytes(new AttributeSource());
    metrics.ramBytes = bytes;
    return bytes;
  }
}

maxEdits  termLength  build µs/op   retained bytes/build
    1         16          71.5         ~49 KB
    1         32         180.5        ~115 KB
    1         64         351.2        ~225 KB
    1        128         705.8        ~445 KB
    1        256       1,424.4        ~885 KB
    2         16         882.8        ~309 KB
    2         32       2,589.5        ~770 KB
    2         64       5,819.1       ~1.52 MB
    2        128      10,838.3       ~3.04 MB
    2        256      25,701.7       ~6.10 MB

Two things are clear:

  • For a fixed maxEdits, the memory grows roughly linearly with termLength.
  • Going from maxEdits=1 to maxEdits=2 adds a constant factor of about 6–7×. It is not the kind of blow-up we see with regexp or wildcard.

So the DFA is small and predictable.

> exposing too many guts may hinder future improvements (e.g. removing tableization)

I agree that removing tableization sounds like a good change, and it would lower these numbers. But that is a separate improvement. It changes the constant, not the fact that the cost depends on the input.

What mostly matters to me is concurrency. The numbers above are the cost of one query. The same automata are shared across segments inside that query, so segment count does not multiply the cost, but the per-query cost is still held for the full lifetime of the search. On a busy server you have many fuzzy queries running at the same time, and long terms (names, emails, IDs close to length 256) do happen in real workloads. A few MB per query, times many concurrent searches, adds up. Today this cost is invisible to any per-request accounting. A future change that removes tableization will make the constant smaller, but the variability per request is still there as long as termLength and maxEdits come from the user. I think giving callers a way to measure this cost per request is exactly what accounting hooks are for.
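
As a back-of-the-envelope illustration of "adds up", the sketch below multiplies one per-query figure from the table above by assumed load numbers. The clause count and the concurrency level are assumptions picked only for the arithmetic, not measurements, and MB is read as 10^6 bytes.

```java
/** Rough aggregate estimate; every input except the per-query figure is an assumption. */
final class AggregateAutomataEstimate {

  /** Bytes held at once = bytes per fuzzy query x fuzzy queries per request x concurrent requests. */
  static long aggregateBytes(
      long bytesPerFuzzyQuery, int fuzzyQueriesPerRequest, int concurrentRequests) {
    long perRequest = Math.multiplyExact(bytesPerFuzzyQuery, (long) fuzzyQueriesPerRequest);
    return Math.multiplyExact(perRequest, (long) concurrentRequests);
  }

  public static void main(String[] args) {
    // ~3.04 MB: the termLength=128, maxEdits=2 row of the table above.
    long perFuzzyQuery = 3_040_000L;
    // Assumed load: 5 fuzzy clauses per request, 100 requests in flight.
    long total = aggregateBytes(perFuzzyQuery, 5, 100);
    System.out.println(total / 1_000_000 + " MB held across in-flight searches");
  }
}
```

Under those assumed inputs the estimate is 1520 MB; with shorter terms or maxEdits=1 it drops by the factors shown in the table.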

@rmuir
Member

rmuir commented May 5, 2026

> It changes the constant, not the fact that the cost depends on the input.

This is not correct.

@rmuir
Member

rmuir commented May 5, 2026

> for (int i = 0; i < termLength; i++) {
>   sb.append((char) ('a' + (i % 26)));
> }

I don't think these are good realistic test strings. It will inflate the numbers due to huge alphabet size. How many words have 20+ unique letters in them?

@rmuir
Member

rmuir commented May 5, 2026

> On a busy server you have many fuzzy queries running at the same time, and long terms (names, emails, IDs close to length 256) do happen in real workloads. A few MB per query, times many concurrent searches, adds up.

I don't buy that there is a single real actual use-case of someone using fuzzy query with massive concurrency on terms of length 256. If someone is doing this, they should change their tokenization and search strategy.

@drempapis
Contributor Author

> I don't think these are good realistic test strings. It will inflate the numbers due to huge alphabet size. How many words have 20+ unique letters in them?

Fair point. I re-ran with alphabet in {4, 8, 12, 26} so that both the realistic case (~8 unique letters) and the worst case (26) are visible. Per-build retained bytes (sum of CompiledAutomaton.ramBytesUsed() across the maxEdits+1 automata):

termLength  edits  alpha=4   alpha=8   alpha=12  alpha=26
    16        1     ~36 KB    ~41 KB    ~46 KB    ~50 KB
    32        1     ~69 KB    ~78 KB    ~87 KB   ~118 KB
    64        1    ~134 KB   ~151 KB   ~169 KB   ~231 KB
   128        1    ~264 KB   ~299 KB   ~333 KB   ~456 KB
   256        1    ~523 KB   ~593 KB   ~662 KB   ~906 KB
    16        2    ~230 KB   ~259 KB   ~288 KB   ~316 KB
    32        2    ~466 KB   ~525 KB   ~583 KB   ~789 KB
    64        2    ~938 KB  ~1.06 MB  ~1.18 MB  ~1.59 MB
   128        2   ~1.88 MB  ~2.12 MB  ~2.36 MB  ~3.19 MB
   256        2   ~3.77 MB  ~4.25 MB  ~4.73 MB  ~6.39 MB

// Imports are the same as in the first benchmark above.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1, warmups = 1)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class FuzzyAutomatonRamBenchmark {

  @Param({"16", "32", "64", "128", "256"})
  public int termLength;

  @Param({"1", "2"})
  public int maxEdits;

  @Param({"4", "8", "12", "26"})
  public int alphabet;

  private FuzzyQuery query;

  @AuxCounters(AuxCounters.Type.EVENTS)
  @State(Scope.Thread)
  public static class AutomataRam {
    public long ramBytes;
  }

  @Setup(Level.Trial)
  public void setup() {
    if (alphabet < 1 || alphabet > 26) {
      throw new IllegalArgumentException("alphabet must be in [1, 26], got: " + alphabet);
    }
    StringBuilder sb = new StringBuilder(termLength);
    for (int i = 0; i < termLength; i++) {
      sb.append((char) ('a' + (i % alphabet)));
    }
    query =
        new FuzzyQuery(
            new Term("f", sb.toString()),
            maxEdits,
            0,
            FuzzyQuery.defaultMaxExpansions,
            true);
  }

  @Benchmark
  public long buildAutomata(AutomataRam metrics) {
    long bytes = query.computeAutomataRamBytes(new AttributeSource());
    metrics.ramBytes = bytes;
    return bytes;
  }
}

  • The alphabet does inflate the constant, but not by much. From realistic (alphabet=8) to saturated (alphabet=26) the bytes grow by ~20–50% (about 50% for termLength ≥ 32). Going from 4 to 8 grows them by ~10–15%. So my earlier numbers were biased, but not by a multiple; the shape and order of magnitude are the same.

  • The two scaling properties hold at every alphabet: bytes are linear in termLength (each doubling of length doubles the bytes, within a few percent), and going from maxEdits=1 to maxEdits=2 is a bounded ~7× regardless of alphabet.
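
The term generator used in the benchmark can be checked in isolation. The sketch below reuses the same loop and counts the distinct letters it actually produces; the class and method names are hypothetical helpers, not part of the benchmark.

```java
/** Checks how many distinct letters the benchmark's term generator really yields. */
final class BenchmarkTermShape {

  /** Same loop as the benchmark's setup(): cycles through the first `alphabet` lowercase letters. */
  static String buildTerm(int termLength, int alphabet) {
    if (alphabet < 1 || alphabet > 26) {
      throw new IllegalArgumentException("alphabet must be in [1, 26], got: " + alphabet);
    }
    StringBuilder sb = new StringBuilder(termLength);
    for (int i = 0; i < termLength; i++) {
      sb.append((char) ('a' + (i % alphabet)));
    }
    return sb.toString();
  }

  /** Number of distinct characters in the term. */
  static long distinctLetters(String term) {
    return term.chars().distinct().count();
  }

  public static void main(String[] args) {
    for (int termLength : new int[] {16, 32}) {
      for (int alphabet : new int[] {4, 8, 12, 26}) {
        long distinct = distinctLetters(buildTerm(termLength, alphabet));
        System.out.println(
            "termLength=" + termLength + " alphabet=" + alphabet + " distinct=" + distinct);
      }
    }
  }
}
```

With this generator a 16-character term can hold at most 16 distinct letters, so the alphabet=26 setting only takes full effect from termLength=32 upward. That may be part of why the alpha=26 column sits closer to the others in the termLength=16 rows.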

So the cost is driven by termLength and maxEdits. The realistic case is still meaningful in absolute terms. A 64-character term with 8 unique letters at maxEdits=2 retains ~1 MB per build, and a 128-character term retains ~2 MB. These are realistic shapes for many fields people index as keyword and run fuzzy queries against:

  • Identifiers: UUIDs, ISBNs, and other kinds of IDs
  • URLs and file paths
  • Hostnames and DNS labels
  • Tokenized phrases
  • etc

"Long fuzzy terms" is not a corner case. On a busy node with many concurrent fuzzy queries, and more stuff running in parallel, the per-request automata cost is real, and today it is invisible to per-request accounting. This PR is tied to a concrete use case where a single query includes many tokens, and some of those tokens are large.

I agree that ideally the user should fix their tokenization and search strategy. But on the server side we cannot count on that. The implementation has to be defensive and protect the node regardless of how clients shape their queries; that is exactly what request-scoped accounting is for.

If you see a better way to expose this, I'm open to it, happy to change direction.

@rmuir
Member

rmuir commented May 5, 2026

Sorry that analysis is not correct. Put the llm away and we can discuss it.

Alphabet size definitely drives the memory size if we remove tableization.

@drempapis drempapis closed this May 7, 2026