
Proposal for the implementation of analyze considering kuromoji_tokenizer #46955

Closed
chie8842 opened this issue Sep 21, 2019 · 3 comments
Labels
:Search/Analysis How text is split into tokens Team:Search Meta label for search team

Comments

@chie8842

There is an inconvenience when using kuromoji_tokenizer's search mode and the synonym token filter together.

Problem Description

I have the following analyzer setting:

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search"
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼きぎょうざ, 焼き餃子, 焼餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}

When I tried to create an index with the above analyzer, the following error appeared:

/var/log/elasticsearch/searcher.log

```
[2019-09-21T16:05:10,498][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [n01] failed to put template [products_template]
java.lang.IllegalArgumentException: failed to build synonyms
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:138) ~[?:?]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.getChainAwareTokenFilterFactory(SynonymTokenFilterFactory.java:90) ~[?:?]
	at org.elasticsearch.index.analysis.AnalyzerComponents.createComponents(AnalyzerComponents.java:84) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.CustomAnalyzerProvider.create(CustomAnalyzerProvider.java:63) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.CustomAnalyzerProvider.build(CustomAnalyzerProvider.java:50) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.produceAnalyzer(AnalysisRegistry.java:584) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:534) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:216) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:180) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:411) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:563) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:512) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.validateAndAddTemplate(MetaDataIndexTemplateService.java:235) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.access$300(MetaDataIndexTemplateService.java:65) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService$2.execute(MetaDataIndexTemplateService.java:176) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.3.2.jar:7.3.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.text.ParseException: Invalid synonym rule at line 4
	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?]
	... 27 more
Caused by: java.lang.IllegalArgumentException: term: 焼餃子 analyzed to a token (焼餃子) with position increment != 1 (got: 0)
	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41]
	at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?]
	... 27 more
```

I researched this error and understand the following points.

1. kuromoji_tokenizer's search mode outputs the following three tokens for 焼餃子:

curl -XPOST 'http://localhost:9200/my_index/_analyze?pretty' -H "Content-type: application/json" -d '{"analyzer": "ja_analyzer", "text": "焼餃子"}'
{
  "tokens" : [
    {
      "token" : "焼",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "焼餃子",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "餃子",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

2. ESSolrSynonymParser checks the position increment of each analyzed token.

ESSolrSynonymParser.java#L55
SynonymMap.java

In the case above, the tokens 焼 and 焼餃子 have the same position number, so ESSolrSynonymParser fails the position increment check.
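To illustrate the failure, here is a small Python sketch of the position-increment check (an illustration of the behavior, not the actual Lucene code; the function name and data layout are my own). It walks the token stream from the _analyze output above and rejects any token whose position increment is not 1, which is exactly what happens to the second token of 焼餃子:

```python
def check_synonym_term(tokens):
    """tokens: list of (token, position) pairs, as in the _analyze output.

    Mimics the check in SynonymMap.Parser.analyze: every token must advance
    the position by exactly 1, i.e. the stream must be a single path.
    Raises ValueError otherwise.
    """
    prev_position = -1
    for token, position in tokens:
        increment = position - prev_position
        if increment != 1:
            raise ValueError(
                f"analyzed to a token ({token}) with position increment != 1 "
                f"(got: {increment})")
        prev_position = position

# Tokens for 焼餃子 from kuromoji_tokenizer's search mode (see the output above):
# the compound token 焼餃子 repeats position 0, so its increment is 0.
graph_tokens = [("焼", 0), ("焼餃子", 0), ("餃子", 1)]
try:
    check_synonym_term(graph_tokens)
except ValueError as e:
    print(e)
```

A single-path stream such as `[("焼", 0), ("餃子", 1)]` passes this check, which is why pre-tokenizing the synonym entries (or avoiding the multi-path output) makes the error go away.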

I was able to create the index by updating the synonym content.
from:

                        "synonyms": [
                            '焼きぎょうざ, 焼き餃子, 焼餃子'
                        ]

to:

                        "synonyms": [
                            '焼 きぎ ょうざ, 焼き餃子, 焼 餃子'
                        ]

That is, we need to tokenize the synonym words ourselves and put a space between each token.
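For reference, here is a sketch of the complete workaround settings (the same structure as the original settings above, with only the synonym rule pre-tokenized):

```json
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search"
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼 きぎ ょうざ, 焼き餃子, 焼 餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}
```

The downside is that the synonym file must be kept in sync with the tokenizer's segmentation, which is fragile across dictionary updates.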

My Opinion

I think many Japanese users want to use kuromoji_tokenizer's search mode and the synonym token filter together, so I would be glad if the analyzer were improved not to fail with this combination.

Or, at the very least, this behavior should be described in the Elasticsearch documentation. At the moment I could not find a good description of this problem, and I could only understand the cause after reading the source code.

Elasticsearch version (bin/elasticsearch --version):
7.2
Plugins installed: []

JVM version (java -version):
openjdk 11.0.4 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Ubuntu 18.04.3 LTS

@dnhatn dnhatn added the :Search/Analysis How text is split into tokens label Sep 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@chie8842
Author

I found a PR related to the solution for this issue:
#34331 (comment)

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@jimczi
Contributor

jimczi commented Nov 23, 2020

The Kuromoji tokenizer now supports a discard_compound_token option that can be used in conjunction with the search mode to output a single path in the tokenization. When this option is set to true, the synonym graph filter should work fine so I am closing this issue.
Please reopen if the provided solution doesn't work as expected.
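A sketch of settings using that option (assuming an Elasticsearch version recent enough to support `discard_compound_token`; the tokenizer and analyzer names are carried over from the example above):

```json
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search",
                    "discard_compound_token": true
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "焼きぎょうざ, 焼き餃子, 焼餃子"
                    ]
                }
            },
            "analyzer": {
                "ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "ja_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            }
        }
    }
}
```

With `discard_compound_token` set to `true`, the original compound token (焼餃子 in the example above) is dropped when it is decompounded, leaving a single-path token stream that the synonym parser can consume without pre-tokenizing the synonym rules.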

@jimczi jimczi closed this as completed Nov 23, 2020