Implement Accountable and expose Levenshtein automata RAM cost in FuzzyQuery #16031
drempapis wants to merge 2 commits into
|
@rmuir can you please have a look? |
|
I'm not sure we need the complexity. The algorithm constructs a DFA, linear in space and time with respect to the input string. |
|
@rmuir thank you for your feedback. Fair point on the complexity. Let me clarify the trade-off, because I think the change is smaller than it reads, and the gap it fills is real.
On complexity. The PR doesn't introduce any new construction, caching, or sharing mechanism. The two real additions are implements
On linear in input string. Agreed, per query, the DFA is
What this enables. Hosts that already do request-scoped accounting for the rest of the family,
What it does not change. No effect on |
I guess this is what I'm disputing. It isn't really "in the family" since it builds a small DFA correctly. It doesn't have unpredictable size and runtime like Regexp or Wildcard. I'm worried if we expose too many guts for these small things, it will hinder future improvements. For this one, a good future improvement would be to look at removing tableization if we don't need it. |
|
I don't think The
I ran a small JMH benchmark to show this
Two things are clear:
So the DFA is small and predictable.
I agree that removing tableization sounds like a good change, and it would lower these numbers. But that is a separate improvement. It changes the constant, not the fact that the cost depends on the input.
What mostly matters to me is concurrency. The numbers above are the cost of one query. The same automata are shared across segments inside that query, so segment count does not multiply the cost, but the per-query cost is still held for the full lifetime of the search. On a busy server you have many fuzzy queries running at the same time, and long terms (names, emails, IDs close to length 256) do happen in real workloads. A few MB per query, times many concurrent searches, adds up. Today this cost is invisible to any per-request accounting.
A future change that removes tableization will make the constant smaller, but the variability per request is still there as long as termLength and maxEdits come from the user. I think giving callers a way to measure this cost per request is exactly what accounting hooks are for. |
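A minimal sketch of the kind of per-request accounting meant here, assuming a hypothetical `RequestBudget` class (not a Lucene API) and placeholder byte figures that are not measurements:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical node-wide memory budget shared by concurrent requests; not a Lucene class. */
final class RequestBudget {
  private final long limitBytes;
  private final AtomicLong usedBytes = new AtomicLong();

  RequestBudget(long limitBytes) {
    this.limitBytes = limitBytes;
  }

  /** Reserves bytes for one request, or throws if the limit would be exceeded. */
  void charge(long bytes) {
    long updated = usedBytes.addAndGet(bytes);
    if (updated > limitBytes) {
      usedBytes.addAndGet(-bytes); // roll back the failed reservation
      throw new IllegalStateException(
          "budget exceeded: " + updated + " > " + limitBytes + " bytes");
    }
  }

  /** Returns bytes to the budget once the request has finished. */
  void release(long bytes) {
    usedBytes.addAndGet(-bytes);
  }

  long used() {
    return usedBytes.get();
  }
}

final class RequestBudgetDemo {
  public static void main(String[] args) {
    RequestBudget budget = new RequestBudget(10L * 1024 * 1024); // 10 MiB, illustrative
    long perQueryBytes = 3L * 1024 * 1024; // placeholder figure, not a measurement
    int admitted = 0;
    for (int i = 0; i < 5; i++) {
      try {
        budget.charge(perQueryBytes);
        admitted++;
      } catch (IllegalStateException e) {
        System.out.println("rejected query " + i + ": " + e.getMessage());
      }
    }
    System.out.println("admitted=" + admitted + " usedBytes=" + budget.used());
  }
}
```

The budget can only reject a request if the per-query cost is known before the memory is held, which is the gap being discussed.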
This is not correct. |
I don't think these are good realistic test strings. It will inflate the numbers due to huge alphabet size. How many words have 20+ unique letters in them? |
I don't buy that there is a single real actual use-case of someone using fuzzy query with massive concurrency on terms of length 256. If someone is doing this, they should change their tokenization and search strategy. |
Fair point. I re-ran with alphabet sizes in {4, 8, 12, 26} so the realistic case (~8 unique letters) and the worst case (26) are both visible. Per-build retained bytes (sum of CompiledAutomaton.ramBytesUsed() across the maxEdits+1 automata):
So the cost is driven by
"Long fuzzy terms" is not a corner case. On a busy node with many concurrent fuzzy queries, and more stuff running in parallel, the per-request automata cost is real, and today it is invisible to per-request accounting. This PR is tied to a concrete use case where a single query includes many tokens, and some of those tokens are large. I agree that ideally the user should fix their tokenization and search strategy. But on the server side we cannot count on that. The implementation has to be defensive and protect the node regardless of how clients shape their queries, that is exactly what request-scoped accounting is for. If you see a better way to expose this, I'm open to it, happy to change direction. |
|
Sorry that analysis is not correct. Put the llm away and we can discuss it. Alphabet size definitely drives the memory size if we remove tableization. |
The need
AutomatonQuery and its subclasses (RegexpQuery, WildcardQuery, PrefixQuery, TermRangeQuery) build a CompiledAutomaton eagerly in their constructor and retain it as a field. Because AutomatonQuery implements Accountable, callers can perform request-scoped memory accounting by reading Accountable#ramBytesUsed() at construction time.
FuzzyQuery constructs the Levenshtein automata lazily inside FuzzyTermsEnum, storing them on the search-scoped AttributeSource so they can be shared across segments during a single rewrite. As a result there is currently no public way to ask a FuzzyQuery how much RAM its automata will cost without actually executing it; FuzzyQuery does not implement Accountable, and the AutomatonAttribute mechanism inside FuzzyTermsEnum is private.
Changes
This PR introduces two additions and one visibility relaxation, with no behavioural changes:
- FuzzyQuery now implements Accountable. ramBytesUsed() returns a stable value: shallowSizeOfInstance(FuzzyQuery.class) + term.ramBytesUsed(). It excludes the Levenshtein automata: those are not retained by the query, they live per-search on an AttributeSource. Folding them into ramBytesUsed() would make the value jump from 0 to N after the first execution and inflate query-cache accounting with memory the query does not own (see LUCENE-9350).
- FuzzyQuery#computeAutomataRamBytes(AttributeSource atts) returns the aggregate RAM cost of the CompiledAutomaton[] used to execute the query, building them on the supplied AttributeSource via the same sharing mechanism FuzzyTermsEnum already uses across segments. Returns 0L when maxEdits == 0. Calling getTermsEnum(terms, atts) afterwards with the same AttributeSource reuses the primed automata instead of rebuilding them.
- FuzzyTermsEnum.AutomatonAttribute and FuzzyTermsEnum.AutomatonAttributeImpl are widened from private to package-private so FuzzyQuery can install/read the same attribute type. They remain non-public Lucene internals.
How clients can use it
Clients who run FuzzyQuery and need to bound or report memory now have two complementary, non-disruptive primitives:
- ramBytesUsed(): cheap, stable, query-identity-preserving accounting of the query object itself. Safe to fold into existing Accountable walks.
- computeAutomataRamBytes(atts): a pre-flight handle on the dominant cost (the CompiledAutomaton[] transition tables). Two usage patterns:
  - Pass an AttributeSource into computeAutomataRamBytes, account/charge the returned bytes, then pass the same AttributeSource into getTermsEnum. The automata are built once and reused across all segments, with no duplicate work.
  - Subclassing FuzzyQuery and overriding getTermsEnum(Terms, AttributeSource); after super.getTermsEnum returns, call computeAutomataRamBytes(atts) on the same atts. init is idempotent on a primed attribute, so this only walks the already-built array (no second build). This lets callers attach accounting to query execution without changing the Lucene execution path.
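The prime-then-reuse shape of the first usage pattern can be sketched without Lucene on the classpath. Everything below is a hypothetical stand-in: `AutomataHolder` plays the role of the search-scoped AttributeSource, `FuzzySketch` plays the role of the query, and the size formula in `build()` is an arbitrary placeholder rather than a model of real automaton size. Only the build-once, idempotent `init` behaviour is meant to match the description above.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a search-scoped attribute holder; not Lucene's AttributeSource. */
final class AutomataHolder {
  private long[] automataBytes; // one entry per automaton; null until primed
  private int builds;

  /** Builds on the first call only; later calls return the cached array. */
  long[] init(Supplier<long[]> builder) {
    if (automataBytes == null) {
      automataBytes = builder.get();
      builds++;
    }
    return automataBytes;
  }

  int builds() {
    return builds;
  }
}

/** Hypothetical stand-in for the query; mirrors only the prime-then-reuse shape. */
final class FuzzySketch {
  private final int termLength;
  private final int maxEdits;

  FuzzySketch(int termLength, int maxEdits) {
    this.termLength = termLength;
    this.maxEdits = maxEdits;
  }

  /** Sums the per-automaton sizes, building them on the holder if needed. */
  long computeAutomataRamBytes(AutomataHolder atts) {
    if (maxEdits == 0) {
      return 0L; // follows the description: 0L when maxEdits == 0
    }
    long total = 0L;
    for (long bytes : atts.init(this::build)) {
      total += bytes;
    }
    return total;
  }

  /** Stands in for execution: reuses whatever the holder already contains. */
  int execute(AutomataHolder atts) {
    return atts.init(this::build).length;
  }

  private long[] build() {
    long[] sizes = new long[maxEdits + 1];
    for (int edits = 0; edits <= maxEdits; edits++) {
      // Arbitrary placeholder, not a model of real Levenshtein automaton size.
      sizes[edits] = 64L * termLength * (edits + 1);
    }
    return sizes;
  }
}

final class FuzzySketchDemo {
  public static void main(String[] args) {
    AutomataHolder atts = new AutomataHolder();
    FuzzySketch query = new FuzzySketch(10, 2);
    long charged = query.computeAutomataRamBytes(atts); // prime and measure
    int automata = query.execute(atts); // reuses the primed array
    System.out.println(
        "charged=" + charged + " automata=" + automata + " builds=" + atts.builds());
  }
}
```

Because `init` is a no-op on an already-primed holder, measuring before or after execution gives the same figure and never triggers a second build.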