
Proposal for the implementation of analyze considering kuromoji_tokenizer #46955

Closed
chie8842 opened this issue Sep 21, 2019 · 3 comments
Labels
:Search/Analysis How text is split into tokens Team:Search Meta label for search team

Comments

@chie8842

There is an inconvenience when using kuromoji_tokenizer's search mode and the synonym token filter together.

Problem Description

I have the following analyzer setting:

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search"
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼きぎょうざ, 焼き餃子, 焼餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}

When I tried to create an index with the above analyzer, the following error appeared:

/var/log/elasticsearch/searcher.log

```
[2019-09-21T16:05:10,498][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [n01] failed to put template [products_template]
java.lang.IllegalArgumentException: failed to build synonyms
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:138) ~[?:?]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.getChainAwareTokenFilterFactory(SynonymTokenFilterFactory.java:90) ~[?:?]
	at org.elasticsearch.index.analysis.AnalyzerComponents.createComponents(AnalyzerComponents.java:84) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.CustomAnalyzerProvider.create(CustomAnalyzerProvider.java:63) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.CustomAnalyzerProvider.build(CustomAnalyzerProvider.java:50) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.produceAnalyzer(AnalysisRegistry.java:584) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:534) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:216) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:180) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:411) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:563) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:512) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.validateAndAddTemplate(MetaDataIndexTemplateService.java:235) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.access$300(MetaDataIndexTemplateService.java:65) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService$2.execute(MetaDataIndexTemplateService.java:176) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.3.2.jar:7.3.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.text.ParseException: Invalid synonym rule at line 4
	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?]
	... 27 more
Caused by: java.lang.IllegalArgumentException: term: 焼餃子 analyzed to a token (焼餃子) with position increment != 1 (got: 0)
	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?]
	... 27 more
```

I researched this error and understand the following points.

1. kuromoji_tokenizer's search mode outputs the following three tokens for 焼餃子:

curl -XPOST 'http://localhost:9200/my_index/_analyze?pretty' -H "Content-type: application/json" -d '{"analyzer": "ja_analyzer", "text": "焼餃子"}'
{
  "tokens" : [
    {
      "token" : "焼",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "焼餃子",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "餃子",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

2. ESSolrSynonymParser checks the position increment of each analyzed token.

ESSolrSynonymParser.java#L55
SynonymMap.java

In the case above, the tokens 焼 and 焼餃子 have the same position number, so ESSolrSynonymParser fails the position increment check.
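To illustrate the failure, here is a small Python sketch of the position-increment check (an illustration of the behavior, not the actual Lucene code; the function name and data layout are my own). It walks the token stream from the _analyze output above and rejects any token whose position increment is not 1, which is exactly what happens to the second token of 焼餃子:

```python
def check_synonym_term(tokens):
    """tokens: list of (token, position) pairs, as in the _analyze output.

    Mimics the check in SynonymMap.Parser.analyze: every token must advance
    the position by exactly 1, i.e. the stream must be a single path.
    Raises ValueError otherwise.
    """
    prev_position = -1
    for token, position in tokens:
        increment = position - prev_position
        if increment != 1:
            raise ValueError(
                f"analyzed to a token ({token}) with position increment != 1 "
                f"(got: {increment})")
        prev_position = position

# Tokens for 焼餃子 from kuromoji_tokenizer's search mode (see the output above):
# the compound token 焼餃子 repeats position 0, so its increment is 0.
graph_tokens = [("焼", 0), ("焼餃子", 0), ("餃子", 1)]
try:
    check_synonym_term(graph_tokens)
except ValueError as e:
    print(e)
```

A single-path stream such as `[("焼", 0), ("餃子", 1)]` passes this check, which is why pre-tokenizing the synonym entries (or avoiding the multi-path output) makes the error go away.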

I was able to create the index by updating the synonym content.
from:

                        "synonyms": [
                            '焼きぎょうざ, 焼き餃子, 焼餃子'
                        ]

to:

                        "synonyms": [
                            '焼 きぎ ょうざ, 焼き餃子, 焼 餃子'
                        ]

That is, we need to tokenize the synonym words ourselves and put a space between each token.
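For reference, here is a sketch of the complete workaround settings (the same structure as the original settings above, with only the synonym rule pre-tokenized):

```json
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search"
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼 きぎ ょうざ, 焼き餃子, 焼 餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}
```

The downside is that the synonym file must be kept in sync with the tokenizer's segmentation, which is fragile across dictionary updates.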

My Opinion

I think many Japanese users want to use kuromoji_tokenizer's search mode and the synonym token filter together, so I would be glad if the analyzer were improved not to fail with this combination.

Or, at the very least, this behavior should be described in the Elasticsearch documentation. At the moment I could not find a good description of this problem, and I could only understand the cause after reading the source code.

Elasticsearch version (bin/elasticsearch --version):
7.2
Plugins installed: []

JVM version (java -version):
openjdk 11.0.4 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Ubuntu 18.04.3 LTS

@dnhatn dnhatn added the :Search/Analysis How text is split into tokens label Sep 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@chie8842
Author

I found a PR related to the solution for this issue:
#34331 (comment)

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@jimczi
Contributor

jimczi commented Nov 23, 2020

The Kuromoji tokenizer now supports a discard_compound_token option that can be used in conjunction with the search mode to output a single path in the tokenization. When this option is set to true, the synonym graph filter should work fine so I am closing this issue.
Please reopen if the provided solution doesn't work as expected.
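A sketch of settings using that option (assuming an Elasticsearch version recent enough to support `discard_compound_token`; the tokenizer and analyzer names are carried over from the example above):

```json
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search",
                    "discard_compound_token": true
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼きぎょうざ, 焼き餃子, 焼餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}
```

With `discard_compound_token` set to `true`, the original compound token (焼餃子 in the example above) is dropped when it is decompounded, leaving a single-path token stream that the synonym parser can consume without pre-tokenizing the synonym rules.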

@jimczi jimczi closed this as completed Nov 23, 2020