Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

token_count type : add an option to count tokens (and not positions) #23227

Closed
fbaligand opened this issue Feb 17, 2017 · 10 comments · Fixed by #24175
Closed

token_count type : add an option to count tokens (and not positions) #23227

fbaligand opened this issue Feb 17, 2017 · 10 comments · Fixed by #24175
Labels
discuss :Search/Analysis How text is split into tokens

Comments

@fbaligand
Copy link
Contributor

fbaligand commented Feb 17, 2017

Elasticsearch version: 2.4.2 and 5.2.1

Description of the problem including expected versus actual behavior:
Currently, if I have a "token_count" field, based on an analyzer containing a stop filter, the indexed "token_count" field counts stop words.
It would be great to have an option to only count analysis result tokens.

Steps to reproduce:

  1. I have this index configuration :
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop_words"
            ]
          }
        },
        "filter": {
          "stop_words": {
            "type": "stop",
            "stopwords": [
              "this", "is", "a"
            ]
          }
        }
      }
    }
  },
  "mappings": {
      "properties": {
        "mytext": {
        	"index": "analyzed",
         	"type": "string",
         	"analyzer": "default",
         	"fields": {
         		"length": {
         			"type": "token_count",
         			"analyzer": "default",
         			"store": "yes"
         		}
         	}
        }
      }
    }
  }
}
  1. I index this document :
{
 "mytext": "this is a cat"
}
  1. I make this search query :
GET _search
{
  "query": {
    "query_string": {
      "query": "mytext.length:1"
    }
  }
}
  1. It should return 1 result. But it actually returns 0 result.
@clintongormley
Copy link

This limitation is documented:

Technically the token_count type sums position increments rather than counting tokens. This means that even if the analyzer filters out stop words they are included in the count.

@clintongormley clintongormley added :Search/Analysis How text is split into tokens discuss labels Feb 17, 2017
@clintongormley
Copy link

@nik9000 how feasible would it be to change this behaviour?

@jpountz
Copy link
Contributor

jpountz commented Feb 17, 2017

Maybe we could add a discount_overlaps option like similarities have.

@nik9000
Copy link
Member

nik9000 commented Feb 17, 2017

I'm not sure! It has been years since I looked at the APIs this is built on.

@fbaligand
Copy link
Contributor Author

Is it possible to have an option so that :

  • first, the field source content pass through all analyzer chain (char_filter, tokenizer, token filter),
  • and at the end, we count the result tokens and store it in the indexed field

By this way, we are sure that token_count equals the analyzer tokens count

@jimczi
Copy link
Contributor

jimczi commented Feb 17, 2017

@fbaligand that's the current behavior except that we count the number of positions created by the analyzer chain. The stop filter removes some token but increment the position on the token right after it in order to be able to build accurate phrase query.
I think that discount_overlaps is what we currently do by counting the positions and not the tokens. Though it would not solve this issue which is about position increment greater than 1.
enable_position_increments is probably what we're looking after (like query parsers have) so that we don't count position holes ?

@fbaligand
Copy link
Contributor Author

@jimczi
Thanks a lot for this explanation ! I understand better now.
An option like enable_position_increments would be great !

@nik9000
Copy link
Member

nik9000 commented Feb 17, 2017

Yeah, like enable_position_increments: false to get the behavior that you are asking for. What we have is true.

@fbaligand
Copy link
Contributor Author

That would be perfect !

@fbaligand fbaligand changed the title token_count does not count well with an analyzer containing stop filter token_count type : add an option to count tokens (and not positions) Apr 16, 2017
@fbaligand
Copy link
Contributor Author

fbaligand commented Apr 19, 2017

Hi @clintongormley, @jpountz, @nik9000, @jimczi

I just submitted a PR (#24175) which adds the new boolean option "enable_position_increments", as described by @nik9000.
By default, option value is true (which means that positions increments are counted).
And if option is set to false, only result tokens (after full analysis) are counted.

fbaligand added a commit to fbaligand/elasticsearch that referenced this issue Apr 20, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
jimczi pushed a commit that referenced this issue Apr 20, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
jimczi pushed a commit that referenced this issue Apr 20, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
jasontedor added a commit to jasontedor/elasticsearch that referenced this issue Apr 21, 2017
* master: (61 commits)
  Build: Move plugin cli and tests to distribution tool (elastic#24220)
  Peer Recovery: remove maxUnsafeAutoIdTimestamp hand off (elastic#24243)
  Adds version 5.3.2 and backwards compatibility indices for 5.3.1
  Add utility method to parse named XContent objects with typed prefix (elastic#24240)
  MultiBucketsAggregation.Bucket should not extend Writeable (elastic#24216)
  Don't expose cleaned-up tasks as pending in PrioritizedEsThreadPoolExecutor (elastic#24237)
  Adds declareNamedObjects methods to ConstructingObjectParser (elastic#24219)
  ESIntegTestCase.indexRandom should not introduce types. (elastic#24202)
  Tests: Extend InternalStatsTests (elastic#24212)
  IndicesQueryCache should delegate the scorerSupplier method. (elastic#24209)
  Speed up parsing of large `terms` queries. (elastic#24210)
  [TEST] make sure that the random query_string query generator defines a default_field or a list of fields
  token_count type : add an option to count tokens (fix elastic#23227) (elastic#24175)
  Query string default field (elastic#24214)
  Make Aggregations an abstract class rather than an interface (elastic#24184)
  [TEST] ensure expected sequence no and version are set when index/delete engine operation has a document failure
  Extract batch executor out of cluster service (elastic#24102)
  Add 5.3.1 to bwc versions
  Added "release-state" support to plugin docs
  Added examples to cross cluster search of using cluster settings
  ...
fbaligand added a commit to fbaligand/elasticsearch that referenced this issue Apr 27, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of positions
(position increments are not counted)
fbaligand added a commit to fbaligand/elasticsearch that referenced this issue Apr 27, 2017
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of positions
(position increments are not counted)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss :Search/Analysis How text is split into tokens
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants