token_count type : add enable_position_increments option (fix #23227) #24175

fbaligand · 2017-04-19T08:16:50Z

Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)

Fix #23227

elasticmachine · 2017-04-19T08:16:52Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

jimczi

This looks great, thanks @fbaligand
I left one comment regarding the meaning of the option. The title of the PR is misleading I think, enable_position_increments should not take 0-increment token into account so it's still the number of positions that we are after. We simply don't want to count positions where all tokens have been filtered.

Can you also add a description of the option in the docs:
https://github.com/elastic/elasticsearch/blob/master/docs/reference/mapping/types/token-count.asciidoc

jimczi · 2017-04-19T08:56:38Z

core/src/main/java/org/elasticsearch/index/mapper/TokenCountFieldMapper.java

+     * @param fieldValue field value to pass to analyzer
+     * @return number of tokens in a token stream
+     * @throws IOException if tokenStream throws it
+     */


enablePositionIncrement should not take 0-increment into account.
I think you can just add a param to countPositions and instead of count += position.getPositionIncrement(); you can do: count += enablePosIncr ? Math.min(1, pos.getPosIncr()) : pos.getPosIncr(); ?

jimczi · 2017-04-19T08:59:32Z

core/src/test/java/org/elasticsearch/index/mapper/TokenCountFieldMapperTests.java

+                    return new TokenStreamComponents(new MockTokenizer(), tokenStream);
+                }
+            };
+        assertThat(TokenCountFieldMapper.countTokens(analyzer, "", ""), equalTo(3));


If we don't count token with 0-increment this should be equal to 2

Hi @jimczi
First, thanks for this first review.

Then, concerning your comment : the feature global goal is to have an option to count the number of tokens after analysis, whatever their positions. In that unit test, there are 3 tokens in the stream, so that's why I expect 3 as a result.

Given your comment, maybe the option "enable_position_increments" is misnamed for this feature goal ?

I think the naming is fine. This feature as described in the issue is about not counting tokens filtered from the token stream. This is what enable_position_increments=false does and I think it's all we need to do here. If your analyzer adds alternative tokens to each position they should not alter the final count since we're looking for the number of tokens in the original text.

OK, I get what you mean.
That makes sense, especially when you use synonym filter.

=> I will fix all your requested changes and push them soon.

fbaligand · 2017-04-20T09:21:34Z

Hi @jimczi

I fixed your requested changes.

By the way, I fixed also some compilation warnings on TokenCount classes.
If you want that I cancel this last commit, tell me.

jimczi · 2017-04-20T20:05:41Z

jenkins please test this

jimczi

This is ready I think. I left one comment regarding the docs that could clarify the usage.
This PR is actually tested by our internal CI so I'll merge it if it passes the test.
Thank you very much for your contribution @fbaligand !

jimczi · 2017-04-20T20:27:31Z

docs/reference/mapping/types/token-count.asciidoc

 counting tokens. This means that even if the analyzer filters out stop
-words they are included in the count.
+words they are included in the count.  
+If you don't want to count stop words, set option `enable_position_increments` to `false`


this is redundant with the definition below and it's a little bit confusing since it impacts not only the stop filter ?

You're right.
I see 2 options :

remove the note.

only keep this part to just explain the mechanism :

Technically, the token_count type sums position increments rather than counting tokens.

Which do you prefer ?

Moreover, I can add this precision to enable_position_increments definition :

Set to true if you want to count all tokens computed by analyzer tokenizer.

What do you think ?

I think you can just remove the note. The definition of the param below is descriptive enough.

So, for enable_position_increments definition, you mean to not add "Set to true if you want to count all tokens computed by analyzer tokenizer" ?

Yep the current definition is ok.

Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of tokens (not position increments count)

fbaligand · 2017-04-20T22:16:50Z

OK. I removed the note, rebased my code on master branch, and pushed it !

Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of tokens (not position increments count)

fbaligand · 2017-04-21T07:52:31Z

Great !
Thanks @jimczi for the review, merge into master and inclusion in 5.5.0 release !
It was a pleasure to "work" on this PR with you :)

* master: (61 commits) Build: Move plugin cli and tests to distribution tool (elastic#24220) Peer Recovery: remove maxUnsafeAutoIdTimestamp hand off (elastic#24243) Adds version 5.3.2 and backwards compatibility indices for 5.3.1 Add utility method to parse named XContent objects with typed prefix (elastic#24240) MultiBucketsAggregation.Bucket should not extend Writeable (elastic#24216) Don't expose cleaned-up tasks as pending in PrioritizedEsThreadPoolExecutor (elastic#24237) Adds declareNamedObjects methods to ConstructingObjectParser (elastic#24219) ESIntegTestCase.indexRandom should not introduce types. (elastic#24202) Tests: Extend InternalStatsTests (elastic#24212) IndicesQueryCache should delegate the scorerSupplier method. (elastic#24209) Speed up parsing of large `terms` queries. (elastic#24210) [TEST] make sure that the random query_string query generator defines a default_field or a list of fields token_count type : add an option to count tokens (fix elastic#23227) (elastic#24175) Query string default field (elastic#24214) Make Aggregations an abstract class rather than an interface (elastic#24184) [TEST] ensure expected sequence no and version are set when index/delete engine operation has a document failure Extract batch executor out of cluster service (elastic#24102) Add 5.3.1 to bwc versions Added "release-state" support to plugin docs Added examples to cross cluster search of using cluster settings ...

fbaligand mentioned this pull request Apr 19, 2017

token_count type : add an option to count tokens (and not positions) #23227

Closed

jimczi added :Search Foundations/Mapping Index mappings, including merging and defining field types >feature v5.5.0 v6.0.0-alpha1 labels Apr 19, 2017

jimczi requested changes Apr 19, 2017

View reviewed changes

fbaligand force-pushed the master branch from 3b26788 to d7ff899 Compare April 20, 2017 09:19

jimczi approved these changes Apr 20, 2017

View reviewed changes

fbaligand added 4 commits April 21, 2017 00:14

token_count type : add an option to count tokens (fix elastic#23227)

17ed746

Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of tokens (not position increments count)

fix enable_position_increments=false to count only distinct positions

e099446

add documentation for new option 'enable_position_increments'

d174943

remove compilation warning on TokenCount classes

34a031b

fbaligand force-pushed the master branch from d7ff899 to 34a031b Compare April 20, 2017 22:15

jimczi merged commit 4a45579 into elastic:master Apr 20, 2017

clintongormley added >enhancement and removed >feature labels Apr 24, 2017

fbaligand mentioned this pull request Apr 27, 2017

token_count type : add an option to count tokens (fix #23227) #24372

Closed

fbaligand changed the title ~~token_count type : add an option to count tokens (fix #23227)~~ token_count type : add enable_position_increments option (fix #23227) Sep 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

token_count type : add enable_position_increments option (fix #23227) #24175

token_count type : add enable_position_increments option (fix #23227) #24175

fbaligand commented Apr 19, 2017 •

edited

Loading

elasticmachine commented Apr 19, 2017

jimczi left a comment

jimczi Apr 19, 2017

jimczi Apr 19, 2017

fbaligand Apr 19, 2017

jimczi Apr 19, 2017

fbaligand Apr 19, 2017

fbaligand commented Apr 20, 2017 •

edited

Loading

jimczi commented Apr 20, 2017

jimczi left a comment

jimczi Apr 20, 2017

fbaligand Apr 20, 2017 •

edited

Loading

jimczi Apr 20, 2017

fbaligand Apr 20, 2017 •

edited

Loading

jimczi Apr 20, 2017

fbaligand commented Apr 20, 2017

fbaligand commented Apr 21, 2017

token_count type : add enable_position_increments option (fix #23227) #24175

token_count type : add enable_position_increments option (fix #23227) #24175

Conversation

fbaligand commented Apr 19, 2017 • edited Loading

elasticmachine commented Apr 19, 2017

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbaligand commented Apr 20, 2017 • edited Loading

jimczi commented Apr 20, 2017

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbaligand Apr 20, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbaligand Apr 20, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbaligand commented Apr 20, 2017

fbaligand commented Apr 21, 2017

fbaligand commented Apr 19, 2017 •

edited

Loading

fbaligand commented Apr 20, 2017 •

edited

Loading

fbaligand Apr 20, 2017 •

edited

Loading

fbaligand Apr 20, 2017 •

edited

Loading