add nori_number token filter in analysis-nori #53583

danmuzi · 2020-03-15T14:40:00Z

KoreanNumberFilter has been included in Nori as of Lucene 8.2.0 (LUCENE-8812).
However, the Nori analysis plugin does not support it yet (Kuromoji already supports kuromoji_number).
It seems to have been left out (#30397) because KoreanNumberFilter did not exist in Lucene 7.4.0, the first version to support Nori.
This PR adds it.

elasticmachine · 2020-03-16T12:40:07Z

Pinging @elastic/es-search (:Search/Analysis)

elasticmachine · 2020-03-16T12:40:09Z

Pinging @elastic/es-docs (>docs)

jimczi

Thanks for adding this filter, danmuzi! I left one comment regarding the discard_punctuation option of the tokenizer, which was added to handle the number filter correctly.

docs/plugins/analysis-nori.asciidoc

danmuzi · 2020-03-19T17:35:24Z

Thanks for your review, @jimczi 👍
I added some commits reflecting your comments.
But I'm not sure it's right to include the discard_punctuation option in this PR.
Because this PR is for nori_number token filter.
So I submitted the commits separately.
If you think a new PR for discard_punctuation is better, I will open one.
In that case, I'll remove the discard_punctuation commit from this PR and rebase onto the master branch after the discard_punctuation PR is submitted.
What do you think?

jimczi

I left two minor comments but the change looks good to me.

> But I'm not sure it's right to include the discard_punctuation option in this PR.
> Because this PR is for nori_number token filter.

I think it's OK, since the discard_punctuation option was added specifically for the number token filter. Let's add both in the same PR. Thanks for separating the commits, though.

plugins/analysis-nori/src/test/java/org/elasticsearch/index/analysis/NoriAnalysisTests.java

jimczi · 2020-03-19T20:10:36Z

docs/plugins/analysis-nori.asciidoc

+This filter does this kind of normalization and allows a search for 3200 to match ３．２천 in text,
+but can also be used to make range facets based on the normalized numbers and so on.
+
+Notice that this analyzer uses a token composition scheme and relies on punctuation tokens


Maybe add a NOTE: to emphasize this part?

danmuzi · 2020-03-20T18:39:30Z

Thanks Jim.
I added a commit based on your review.
It adds a NOTE in the asciidoc and test cases for discard_punctuation.
Please check it!

jimczi · 2020-03-23T13:24:47Z

@elasticmachine ok to test

danmuzi · 2020-03-23T17:09:21Z

I'm not sure why elasticsearch-ci/2, elasticsearch-ci/bwc, and elasticsearch-ci/default-distro failed.
The failures don't seem related to this PR, but I found and fixed an indentation problem in it.
I also rebased this patch onto the latest master branch commit.

danmuzi · 2020-03-23T18:06:50Z

Because of my rebase mistake, the previous Jenkins build history has been lost from this conversation.
So I'm attaching the failed builds from before the rebase.
elasticsearch-ci/2 : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/19332/
elasticsearch-ci/bwc : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-bwc/19021/
elasticsearch-ci/default-distro : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request+default-distro/18876/

After the rebase, all Jenkins builds passed.

jimczi

LGTM, thanks @danmuzi!

danmuzi · 2020-03-23T18:19:20Z

Thanks for your kind reviews! @jimczi

This change adds the `nori_number` token filter. It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.
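
For readers who want to see how the two pieces fit together, here is a minimal sketch of index analysis settings, assuming the options behave as described in this PR. The names `my_nori_tokenizer` and `my_analyzer` are hypothetical and chosen only for illustration; `nori_tokenizer`, `discard_punctuation`, and `nori_number` are the names used in this change.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "my_nori_tokenizer": {
            "type": "nori_tokenizer",
            "discard_punctuation": false
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_nori_tokenizer",
            "filter": ["nori_number"]
          }
        }
      }
    }
  }
}
```

Setting `discard_punctuation` to `false` is the point of the sketch: as the documentation excerpt in the review says, the number filter relies on punctuation tokens, so a decimal such as ３．２천 can only be composed into a single number if the tokenizer keeps the punctuation.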

jrodewig added the :Search Relevance/Analysis and >docs labels Mar 16, 2020

jimczi added the >feature, v7.7.0, and v8.0.0 labels and removed the >docs label Mar 16, 2020

jimczi reviewed Mar 17, 2020

docs/plugins/analysis-nori.asciidoc

jimczi reviewed Mar 19, 2020

danmuzi added 5 commits March 24, 2020 02:06

add nori_number token filter

b6221c5

add discard_punctuation option in nori_tokenizer

d6dbe51

add description about using discard_punctuation in nori_number

9bd0ebe

add note in asciidoc and test cases for discard_punctuation

6ed33a1

fix wrong indentation in nori_number test

1a4367e

danmuzi force-pushed the add-nori-number-filter branch from 3864137 to 1a4367e Compare March 23, 2020 17:08

jimczi approved these changes Mar 23, 2020

jimczi merged commit 8d4ff29 into elastic:master Mar 23, 2020

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket elastic/elasticsearch-net#4525

Closed

38 tasks

codebrain mentioned this pull request Apr 15, 2020

Add nori_number token filter in analysis-nori and discard_punctuation to filter elastic/elasticsearch-net#4591

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
