add NGramSynonymTokenizer [LUCENE-5253] #6317
Description
I'd like to propose another n-gram tokenizer that can process synonyms: NGramSynonymTokenizer. Note that in this ticket the gram size is fixed, i.e. minGramSize = maxGramSize.
Today, I think we have the following problems when using SynonymFilter with NGramTokenizer.
For the purposes of illustration, assume a synonym setting "ABC, DEFG" with expand=true and N = 2 (2-gram).
- There is no consensus (I think :-) on how we should assign offsets to the generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
- If the query pattern looks like XABC or ABCY, it cannot be matched even if the index contains a document "…XABCY…" when autoGeneratePhraseQueries is set to true, because there are no "XA" or "CY" tokens in the index.
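To make the second problem concrete, the query pattern XABC decomposes into the 2-grams XA/AB/BC, but with the synonym setting above the index never contains the leading gram XA, so a phrase query over those grams cannot match. A minimal helper in plain Java (a hypothetical sketch, not part of Lucene) that enumerates the grams a phrase query would require:

```java
import java.util.*;

// Hypothetical helper (NOT Lucene API): list the n-grams that a phrase
// query over the given pattern would need to find in the index.
public class PhraseGramSketch {
    static List<String> ngrams(String s, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n)); // contiguous n-character gram
        }
        return out;
    }
}
```

Here ngrams("XABC", 2) yields XA/AB/BC and ngrams("ABCY", 2) yields AB/BC/CY; the boundary grams XA and CY are exactly the tokens missing from the index in the scenario described above.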
NGramSynonymTokenizer solves these problems in the following ways.
- NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does not tokenize registered words, e.g.
| source text | NGramTokenizer+SynonymFilter | NGramSynonymTokenizer |
|---|---|---|
| ABC | AB/DE/BC/EF/FG | ABC/DEFG |
- Immediately before and after a registered word, NGramSynonymTokenizer generates extra tokens with posInc=0, e.g.
| source text | NGramTokenizer+SynonymFilter | NGramSynonymTokenizer |
|---|---|---|
| XYZABC123 | XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23 | XY/YZ/Z/ABC/DEFG/1/12/23 |
In the above sample, "Z" and "1" are the extra tokens.
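For n = 2, the extra tokens amount to the last character of the run before a registered word and the first character of the run after it. A toy sketch of that boundary behavior in plain Java (an assumed simplification for a single registered word, not the real tokenizer, and omitting the DEFG expansion):

```java
import java.util.*;

// Toy sketch (NOT the real tokenizer): 2-gram the runs around a single
// registered word and add the extra boundary tokens ("Z" and "1" in the
// example above), which the real tokenizer emits with posInc=0.
public class BoundarySketch {
    static List<String> tokenize(String text, String registered) {
        int n = 2;
        List<String> out = new ArrayList<>();
        int pos = text.indexOf(registered);
        String before = text.substring(0, pos);
        String after = text.substring(pos + registered.length());
        for (int k = 0; k + n <= before.length(); k++) {
            out.add(before.substring(k, k + n));
        }
        if (!before.isEmpty()) {
            out.add(before.substring(before.length() - 1)); // extra token, posInc=0
        }
        out.add(registered);                                // registered word whole
        if (!after.isEmpty()) {
            out.add(after.substring(0, 1));                 // extra token, posInc=0
        }
        for (int k = 0; k + n <= after.length(); k++) {
            out.add(after.substring(k, k + n));
        }
        return out;
    }
}
```

Applied to "XYZABC123" with "ABC" registered, this yields XY/YZ/Z/ABC/1/12/23, matching the table row above apart from the omitted DEFG synonym; the extra tokens Z and 1 let boundary-crossing query grams like ZA or C1 still find a phrase match at the word edges.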
Migrated from LUCENE-5253 by Koji Sekiguchi (@kojisekig), resolved Oct 02 2013