
[Feature Request][Nori ] Add custom dictionary terms on index setting. #35842

Closed
kimjmin opened this issue Nov 23, 2018 · 1 comment
Labels: >enhancement, :Search/Analysis How text is split into tokens

kimjmin commented Nov 23, 2018

Describe the feature:

Currently, the only way to use a custom dictionary with Nori is to save the dictionary in a file and set its path in the index analysis settings.
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori-tokenizer.html

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
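
For reference, once the index exists, the custom dictionary can be verified through the _analyze API (a minimal sketch, assuming userdict_ko.txt contains a compound rule such as 세종시 세종 시):

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"
}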

Unfortunately, there is no way for users of Elastic Cloud or ECE to add a custom dictionary file to their Elasticsearch cluster. On Elastic Cloud the custom plugin menu can be used to upload a file, but it only takes effect when the cluster is created, which means the dictionary cannot be refreshed on a production system.

For the synonym token filter, there is already a feature that lets users add their custom rules directly in the index settings.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["my_stop", "synonym"]
                    }
                },
                "filter" : {
                        "my_stop": {
                                "type" : "stop",
                                "stopwords": ["bar"]
                        },
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}
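
As with the Nori example, the inline rules can be checked with the _analyze API (a sketch; with lenient enabled, the stopword bar should be dropped from the rule, leaving an effective foo => baz mapping):

GET /test_index/_analyze
{
  "analyzer": "synonym",
  "text": "foo"
}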

We need to add this kind of setting functionality to Nori (and to other analysis plugins if needed) so that users in Elastic Cloud environments can use custom dictionaries for their search use cases.

@kimjmin kimjmin added the :Search/Analysis How text is split into tokens label Nov 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search

@kimjmin kimjmin changed the title from [Feature Request][Nori ] to [Feature Request][Nori ] Add custom dictionary terms on index setting. Nov 23, 2018
jimczi added a commit to jimczi/elasticsearch that referenced this issue Nov 30, 2018
This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
for the Korean tokenizer directly in the settings (instead of using a file).

Closes elastic#35842
jimczi added a commit that referenced this issue Dec 7, 2018
Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
for the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
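
Based on the commit description, the new option should allow something like the following (a sketch mirroring the file-based example above; the inline rules are assumed to use the same format as the lines of userdict_ko.txt):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}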