
Korean (nori) Analysis Synonym Filter build failed #37751

Closed
AnSungHyun opened this issue Jan 23, 2019 · 5 comments

Assignees: jimczi
Labels: >bug, :Search/Analysis (How text is split into tokens), Team:Search (Meta label for search team)

@AnSungHyun

An error occurs when creating an index that combines a synonym filter with the Korean (nori) analysis plugin.

Elasticsearch version (bin/elasticsearch --version): 6.5.3

Plugins installed: [ analysis-nori ]

JVM version (java -version): java version "1.8.0_121"

OS version (uname -a if on a Unix-like system):
Linux search 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

1. Install the Korean (nori) analysis plugin:
bin/elasticsearch-plugin install analysis-nori
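
You can optionally verify that the plugin is installed with the standard plugin listing command:

bin/elasticsearch-plugin list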

2. Index settings
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}

3. Error Message

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[node-test][192.168.0.1:9300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "parse_exception: Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 풋사과 analyzed to a token (풋) with position increment != 1 (got: 0)"
      }
    }
  },
  "status": 400
}

4. I tried the synonym_graph filter, but the problem was not resolved.
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym_graph"
            ]
          }
        }
      }
    }
  }
}

5. Analyze token result after removing the synonym filter from the analyzer.
Index Create:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
        "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

Try Analyze:

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "풋사과"
}

Result:

{
  "tokens" : [
    {
      "token" : "풋사과",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "풋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사과",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

"풋사과" is compound words
Can not use synonyms in compound words?

@elasticmachine
Collaborator

Pinging @elastic/es-search

@romseygeek romseygeek added the :Search/Analysis How text is split into tokens label Jan 23, 2019
@jimczi
Contributor

jimczi commented Jan 23, 2019

Thanks for reporting this problem @AnSungHyun.
This is similar to #34331, except that it occurs in a tokenizer. The synonym filter checks that each input synonym can be analyzed into a single form and fails to build if it cannot. Since the mixed mode of the Korean tokenizer preserves both the compound and the split form, it is currently not possible to add a compound word to a synonym dictionary. I discussed this with @romseygeek offline and we think the same workaround as #34331 can be added for tokenizers. This would allow us to change the tokenizer option when we build the synonym map; in this case we'd change the mixed mode to discard (which removes the compound token) in order to make it compatible with the synonym building.
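
For reference, with decompound_mode set to discard the same analyzer is expected to emit only the decompounded tokens on a single path, roughly like the sketch below (derived from the mixed-mode output in the issue description, not verified here):

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "풋사과"
}

Expected result (sketch):

{
  "tokens" : [
    {
      "token" : "풋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사과",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}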

@jimczi jimczi added the >bug label Jan 23, 2019
@jimczi jimczi self-assigned this Jan 23, 2019
@jimczi
Contributor

jimczi commented Jan 23, 2019

I forgot that the output should also contain both the compound and the decompounded form of the expanded synonyms. Unfortunately this is not possible in the synonym filter, so the proposed solution above wouldn't work. Another possibility is to extract the decompounding into a separate token filter instead of doing it in the tokenizer. This way it would be possible to place the synonym filter before the decompounding filter, and the tokenizer would always output a single path.
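
To illustrate the proposed ordering only, the analysis chain could look roughly like the sketch below. The nori_decompound filter type used here is hypothetical and does not exist in the analysis-nori plugin; it only stands in for the separate decompounding filter proposed above, while the tokenizer runs with decompound_mode set to none so that it outputs a single path:

PUT nori_proposed_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_single_path": {
            "type": "nori_tokenizer",
            "decompound_mode": "none"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          },
          "decompound": {
            "type": "nori_decompound"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_single_path",
            "filter": [
              "synonym",
              "decompound"
            ]
          }
        }
      }
    }
  }
}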

@seohoryu

seohoryu commented Aug 1, 2019

@AnSungHyun
I am not sure this is the right way to solve this issue, but I believe it can be a workaround for you. In my case I registered "대한민국,한국,코리아" as synonyms and ran into the same issue: "대한민국" is a compound word, so it produces exactly the same error. However, after I added "대한민국" to the user dictionary, the error went away.

Here are my settings:

PUT test    
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt",
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter" : ["synonym"]
          }
        },
        "filter" : { 
          "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonyms.txt" 
          } 
        }
      }
    }
  },
...
}

Then I added "대한민국" to userdict_ko.txt.
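
For illustration, the two files referenced in these settings could contain something like the following sketch; a single-word entry in the nori user dictionary registers the word as one token, so it is no longer decompounded:

userdict_ko.txt:

대한민국

analysis/synonyms.txt:

대한민국,한국,코리아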

I hope this is helpful for you.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@jimczi
Contributor

jimczi commented Dec 16, 2020

I am closing this issue as won't fix for now. Using the mixed mode of the nori tokenizer doesn't work with multi-word synonyms, but this is a broader problem. The solution for now is to use the discard mode in order to ensure that a single path is produced.
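
For completeness, here is a sketch of the original settings from this issue with decompound_mode switched to discard, which should allow the synonym map to build (not verified here):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "discard"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}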

@jimczi jimczi closed this as completed Dec 16, 2020