Tokenizer "vi_tokenizer" doesn't work with character filter "html_strip" #98

Closed
seta-hainguyen opened this issue Apr 24, 2021 · 3 comments

@seta-hainguyen

Here are my analyzer settings:

"vn_html_analyzer": {
  "type": "custom",
  "char_filter": [ "html_strip" ],
  "tokenizer": "vi_tokenizer",
  "filter": [ "icu_folding" ]
}

When I run:

GET localhost:9200/question/_analyze
{
  "analyzer": "vn_html_analyzer",
  "text": "<p>đỗ đại học</p>"
}

It throws this error:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[7d1c0721c1d6][172.17.0.2:9300][indices:admin/analyze[s]]"
      }
    ],
    "type": "string_index_out_of_bounds_exception",
    "reason": "String index out of range: -1"
  },
  "status": 500
}

When I replace the tokenizer "vi_tokenizer" with "standard", the error does not occur.
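The comparison between the two tokenizers can also be scripted. Below is a minimal sketch, not part of the original report, that sends the same text through the `_analyze` API with an inline custom analyzer, once per tokenizer. It assumes Elasticsearch is reachable at `http://localhost:9200`, that the index is named `question` as in the request above, and that both plugins are installed. The helper names `analyze_body`, `analyze`, and `compare_tokenizers` are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumptions: host and index name are taken from the request in this issue.
ES_URL = "http://localhost:9200"
INDEX = "question"
SAMPLE_TEXT = "<p>đỗ đại học</p>"


def analyze_body(tokenizer, text):
    """Build an _analyze request body that defines the analyzer inline.

    It mirrors vn_html_analyzer (html_strip char filter, icu_folding token
    filter) but lets the tokenizer vary so the two cases can be compared.
    """
    return {
        "tokenizer": tokenizer,
        "char_filter": ["html_strip"],
        "filter": ["icu_folding"],
        "text": text,
    }


def analyze(tokenizer, text):
    """POST the body to /<index>/_analyze and return (status, parsed JSON)."""
    data = json.dumps(analyze_body(tokenizer, text)).encode("utf-8")
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Elasticsearch reports analysis failures as JSON with a non-2xx status.
        return err.code, json.loads(err.read().decode("utf-8"))


def compare_tokenizers():
    """Print the status and response for vi_tokenizer and for standard."""
    for tokenizer in ("vi_tokenizer", "standard"):
        status, payload = analyze(tokenizer, SAMPLE_TEXT)
        print(tokenizer, status, json.dumps(payload, ensure_ascii=False))
```

Calling `compare_tokenizers()` against a running node should show whether the failure is specific to the combination of `vi_tokenizer` and `html_strip`; dropping `"char_filter"` from the body is another quick way to narrow it down.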

I'm using Elasticsearch 7.3.1 and elasticsearch-analysis-vietnamese 7.3.1, installed with this Dockerfile:

FROM elasticsearch:7.3.1

COPY elasticsearch-analysis-vietnamese-7.3.1.zip /usr/share/elasticsearch/

RUN cd /usr/share/elasticsearch && \
    bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-vietnamese-7.3.1.zip && \
    bin/elasticsearch-plugin install analysis-icu

@seta-hainguyen changed the title from "Tokenizer "vi_tokenizer" doesn't work with character filer "html_strip"" to "Tokenizer "vi_tokenizer" doesn't work with character filter "html_strip"" on Apr 24, 2021
@duydo
Owner

duydo commented Apr 25, 2021

@seta-hainguyen Version 7.3.1, which uses the old VnTokenizer, has a lot of issues. I switched the plugin to a different tokenizer from the CocCoc team, so I no longer maintain the VnTokenizer-based plugin.

The plugin is currently compatible with ES v7.4.0 and later. You can refer to the documentation to build the plugin for the version you need.

@seta-hainguyen
Author

@duydo Thank you. Is there a particular Java version required to build the CocCoc tokenizer project and your project?

@duydo
Owner

duydo commented Apr 26, 2021

@seta-hainguyen The CocCoc tokenizer is written in C++, so you have to build it as a shared library on the Elasticsearch node where you intend to install the plugin.
The ES plugin is compatible with Java 8 and later.

@duydo duydo closed this as completed May 24, 2021