Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

token_count type : add enable_position_increments option (fix #23227) #24175

Merged
merged 4 commits into from
Apr 20, 2017

Conversation

fbaligand
Copy link
Contributor

@fbaligand fbaligand commented Apr 19, 2017

Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)

Fix #23227

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @fbaligand
I left one comment regarding the meaning of the option. The title of the PR is misleading I think, enable_position_increments should not take 0-increment token into account so it's still the number of positions that we are after. We simply don't want to count positions where all tokens have been filtered.

Can you also add a description of the option in the docs:
https://github.com/elastic/elasticsearch/blob/master/docs/reference/mapping/types/token-count.asciidoc

* @param fieldValue field value to pass to analyzer
* @return number of tokens in a token stream
* @throws IOException if tokenStream throws it
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enablePositionIncrement should not take 0-increment into account.
I think you can just add a param to countPositions and instead of count += position.getPositionIncrement(); you can do: count += enablePosIncr ? Math.min(1, pos.getPosIncr()) : pos.getPosIncr(); ?

return new TokenStreamComponents(new MockTokenizer(), tokenStream);
}
};
assertThat(TokenCountFieldMapper.countTokens(analyzer, "", ""), equalTo(3));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't count token with 0-increment this should be equal to 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jimczi
First, thanks for this first review.

Then, concerning your comment : the feature global goal is to have an option to count the number of tokens after analysis, whatever their positions. In that unit test, there are 3 tokens in the stream, so that's why I expect 3 as a result.

Given your comment, maybe the option "enable_position_increments" is misnamed for this feature goal ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming is fine. This feature as described in the issue is about not counting tokens filtered from the token stream. This is what enable_position_increments=false does and I think it's all we need to do here. If your analyzer adds alternative tokens to each position they should not alter the final count since we're looking for the number of tokens in the original text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I get what you mean.
That makes sense, especially when you use synonym filter.

=> I will fix all your requested changes and push them soon.

@fbaligand
Copy link
Contributor Author

fbaligand commented Apr 20, 2017

Hi @jimczi

I fixed your requested changes.

By the way, I fixed also some compilation warnings on TokenCount classes.
If you want that I cancel this last commit, tell me.

@jimczi
Copy link
Contributor

jimczi commented Apr 20, 2017

jenkins please test this

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready I think. I left one comment regarding the docs that could clarify the usage.
This PR is actually tested by our internal CI so I'll merge it if it passes the test.
Thank you very much for your contribution @fbaligand !

counting tokens. This means that even if the analyzer filters out stop
words they are included in the count.
words they are included in the count.
If you don't want to count stop words, set option `enable_position_increments` to `false`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is redundant with the definition below and it's a little bit confusing since it impacts not only the stop filter ?

Copy link
Contributor Author

@fbaligand fbaligand Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.
I see 2 options :

  • remove the note.
  • only keep this part to just explain the mechanism :

Technically, the token_count type sums position increments rather than counting tokens.

Which do you prefer ?

Moreover, I can add this precision to enable_position_increments definition :

Set to true if you want to count all tokens computed by analyzer tokenizer.

What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just remove the note. The definition of the param below is descriptive enough.

Copy link
Contributor Author

@fbaligand fbaligand Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for enable_position_increments definition, you mean to not add "Set to true if you want to count all tokens computed by analyzer tokenizer" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep the current definition is ok.

@fbaligand
Copy link
Contributor Author

OK. I removed the note, rebased my code on master branch, and pushed it !

@jimczi jimczi merged commit 4a45579 into elastic:master Apr 20, 2017
jimczi pushed a commit that referenced this pull request Apr 20, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
@fbaligand
Copy link
Contributor Author

Great !
Thanks @jimczi for the review, merge into master and inclusion in 5.5.0 release !
It was a pleasure to "work" on this PR with you :)

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 21, 2017
* master: (61 commits)
  Build: Move plugin cli and tests to distribution tool (elastic#24220)
  Peer Recovery: remove maxUnsafeAutoIdTimestamp hand off (elastic#24243)
  Adds version 5.3.2 and backwards compatibility indices for 5.3.1
  Add utility method to parse named XContent objects with typed prefix (elastic#24240)
  MultiBucketsAggregation.Bucket should not extend Writeable (elastic#24216)
  Don't expose cleaned-up tasks as pending in PrioritizedEsThreadPoolExecutor (elastic#24237)
  Adds declareNamedObjects methods to ConstructingObjectParser (elastic#24219)
  ESIntegTestCase.indexRandom should not introduce types. (elastic#24202)
  Tests: Extend InternalStatsTests (elastic#24212)
  IndicesQueryCache should delegate the scorerSupplier method. (elastic#24209)
  Speed up parsing of large `terms` queries. (elastic#24210)
  [TEST] make sure that the random query_string query generator defines a default_field or a list of fields
  token_count type : add an option to count tokens (fix elastic#23227) (elastic#24175)
  Query string default field (elastic#24214)
  Make Aggregations an abstract class rather than an interface (elastic#24184)
  [TEST] ensure expected sequence no and version are set when index/delete engine operation has a document failure
  Extract batch executor out of cluster service (elastic#24102)
  Add 5.3.1 to bwc versions
  Added "release-state" support to plugin docs
  Added examples to cross cluster search of using cluster settings
  ...
@fbaligand fbaligand changed the title token_count type : add an option to count tokens (fix #23227) token_count type : add enable_position_increments option (fix #23227) Sep 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v5.5.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

token_count type : add an option to count tokens (and not positions)
4 participants