
Korean (nori) Analysis Synonym Filter build failed #37751

Closed
AnSungHyun opened this issue Jan 23, 2019 · 5 comments

Assignees: jimczi
Labels: >bug, :Search/Analysis (How text is split into tokens), Team:Search (Meta label for search team)

@AnSungHyun

An error occurs when creating an index that combines a synonym filter with the Korean (nori) analysis plugin.

Elasticsearch version (bin/elasticsearch --version): 6.5.3

Plugins installed: [ analysis-nori ]

JVM version (java -version): java version "1.8.0_121"

OS version (uname -a if on a Unix-like system):
Linux search 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

1. Install the Korean (nori) analysis plugin:
bin/elasticsearch-plugin install analysis-nori
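
You can optionally verify that the plugin is installed with the standard plugin listing command:

bin/elasticsearch-plugin list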

2. Index settings
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}

3. Error Message

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[node-test][192.168.0.1:9300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "parse_exception: Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 풋사과 analyzed to a token (풋) with position increment != 1 (got: 0)"
      }
    }
  },
  "status": 400
}

4. I tried the synonym_graph filter, but the problem was not resolved.
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym_graph"
            ]
          }
        }
      }
    }
  }
}

5. Analyze token result after removing the synonym filter from the analyzer.
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

Try Analyze:

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "풋사과"
}

Result:

{
  "tokens" : [
    {
      "token" : "풋사과",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "풋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사과",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

"풋사과" is compound words
Can not use synonyms in compound words?

@elasticmachine
Collaborator

Pinging @elastic/es-search

@romseygeek romseygeek added the :Search/Analysis How text is split into tokens label Jan 23, 2019
@jimczi
Contributor

jimczi commented Jan 23, 2019

Thanks for reporting this problem @AnSungHyun.
This is similar to #34331, except that it occurs in a tokenizer. The synonym filter checks that each input synonym can be analyzed into a single form and fails to build if it cannot. Since the mixed mode of the Korean tokenizer preserves both the compound and the split form, it is currently not possible to add a compound word to a synonym dictionary. I discussed this with @romseygeek offline and we think the same workaround as #34331 can be added for tokenizers. This would allow us to change the tokenizer option when we build the synonym map; in this case we'd change the mixed mode to discard (which removes the compound token) in order to make it compatible with the synonym building.
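
For reference, with decompound_mode set to discard the same analyzer is expected to emit only the decompounded tokens on a single path, roughly like the sketch below (derived from the mixed-mode output in the issue description, not verified here):

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "풋사과"
}

Expected result (sketch):

{
  "tokens" : [
    {
      "token" : "풋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사과",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}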

@jimczi jimczi added the >bug label Jan 23, 2019
@jimczi jimczi self-assigned this Jan 23, 2019
@jimczi
Contributor

jimczi commented Jan 23, 2019

I forgot that the output should also contain both the compound and the decompounded form of the expanded synonyms. Unfortunately this is not possible in the synonym filter, so the proposed solution above wouldn't work. Another possibility is to extract the decompounding into a separate token filter instead of doing it in the tokenizer. This way it would be possible to place the synonym filter before the decompounding filter, and the tokenizer would always output a single path.
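
To illustrate the proposed ordering only, the analysis chain could look roughly like the sketch below. The nori_decompound filter type used here is hypothetical and does not exist in the analysis-nori plugin; it only stands in for the separate decompounding filter proposed above, while the tokenizer runs with decompound_mode set to none so that it outputs a single path:

PUT nori_proposed_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_single_path": {
            "type": "nori_tokenizer",
            "decompound_mode": "none"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          },
          "decompound": {
            "type": "nori_decompound"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_single_path",
            "filter": [
              "synonym",
              "decompound"
            ]
          }
        }
      }
    }
  }
}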

@seohoryu

seohoryu commented Aug 1, 2019

@AnSungHyun
I am not sure this is the right way to solve this issue, but I believe it can be a workaround for you. In my case I registered "대한민국,한국,코리아" as synonyms and ran into the same issue: "대한민국" is a compound word, so it produces exactly the same error. However, after I added "대한민국" to the user dictionary, the error went away.

Here are my settings:

PUT test    
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt",
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter" : ["synonym"]
          }
        },
        "filter" : { 
          "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonyms.txt" 
          } 
        }
      }
    }
  },
...
}

Then I added "대한민국" to userdict_ko.txt.
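
For illustration, the two files referenced in these settings could contain something like the following sketch; a single-word entry in the nori user dictionary registers the word as one token, so it is no longer decompounded:

userdict_ko.txt:

대한민국

analysis/synonyms.txt:

대한민국,한국,코리아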

I hope this is helpful for you.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@jimczi
Contributor

jimczi commented Dec 16, 2020

I am closing this issue as won't fix for now. Using the mixed mode of the nori tokenizer doesn't work with multi-word synonyms, but this is a broader problem. The solution for now is to use the discard mode in order to ensure that a single path is produced.
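
For completeness, here is a sketch of the original settings from this issue with decompound_mode switched to discard, which should allow the synonym map to build (not verified here):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "discard"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}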

@jimczi jimczi closed this as completed Dec 16, 2020