token_count type : add an option to count tokens (and not positions) #23227

fbaligand · 2017-02-17T10:26:47Z

Elasticsearch version: 2.4.2 and 5.2.1

Description of the problem including expected versus actual behavior:
Currently, if I have a "token_count" field based on an analyzer that contains a stop filter, the indexed "token_count" value still counts the stop words.
It would be great to have an option to count only the tokens that the analysis actually produces.

Steps to reproduce:

I have this index configuration:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop_words"
            ]
          }
        },
        "filter": {
          "stop_words": {
            "type": "stop",
            "stopwords": [
              "this", "is", "a"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mytext": {
        "index": "analyzed",
        "type": "string",
        "analyzer": "default",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "default",
            "store": "yes"
          }
        }
      }
    }
  }
}

I index this document:

{
 "mytext": "this is a cat"
}

I run this search query:

GET _search
{
  "query": {
    "query_string": {
      "query": "mytext.length:1"
    }
  }
}

It should return 1 result, but it actually returns 0 results.

clintongormley · 2017-02-17T10:35:59Z

This limitation is documented:

Technically the token_count type sums position increments rather than counting tokens. This means that even if the analyzer filters out stop words they are included in the count.

clintongormley · 2017-02-17T10:36:39Z

@nik9000 how feasible would it be to change this behaviour?

jpountz · 2017-02-17T13:10:05Z

Maybe we could add a discount_overlaps option like similarities have.

nik9000 · 2017-02-17T13:19:17Z

I'm not sure! It has been years since I looked at the APIs this is built on.

fbaligand · 2017-02-17T14:17:30Z

Would it be possible to have an option that works like this:

first, the field's source content passes through the whole analyzer chain (char filters, tokenizer, token filters),
then, at the end, we count the resulting tokens and store that count in the indexed field.

That way, we can be sure that token_count equals the number of tokens the analyzer produces.

jimczi · 2017-02-17T14:52:35Z

@fbaligand that is the current behavior, except that we count the number of positions created by the analyzer chain. The stop filter removes some tokens but increases the position increment of the token right after them, so that accurate phrase queries can still be built.
I think discount_overlaps is what we already do by counting positions rather than tokens. It would not solve this issue, though, which is about position increments greater than 1.
enable_position_increments is probably what we are looking for (like the query parsers have), so that we don't count position holes?
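
To make the difference between positions and tokens concrete, here is a minimal Python sketch. It is not Elasticsearch or Lucene code: `analyze` and `token_count` are hypothetical helpers that imitate a whitespace tokenizer, lowercasing, and a stop filter recording position increments, using the stop words from this issue. Trailing stop words are ignored for simplicity.

```python
# Hypothetical stand-in for the analyzer in this issue (not the real implementation).
STOP_WORDS = {"this", "is", "a"}


def analyze(text):
    """Return (token, position_increment) pairs, imitating a stop filter."""
    tokens = []
    increment = 1
    for word in text.lower().split():
        if word in STOP_WORDS:
            # A removed token leaves a hole: the next kept token
            # carries a larger position increment.
            increment += 1
            continue
        tokens.append((word, increment))
        increment = 1
    return tokens


def token_count(text, enable_position_increments=True):
    """Sum position increments (True) or count emitted tokens only (False)."""
    tokens = analyze(text)
    if enable_position_increments:
        return sum(inc for _, inc in tokens)
    return len(tokens)


print(token_count("this is a cat"))                                    # 4
print(token_count("this is a cat", enable_position_increments=False))  # 1
```

With the increments summed, "this is a cat" yields 4, which is why the query for `mytext.length:1` finds nothing. Counting only the emitted tokens yields 1.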

fbaligand · 2017-02-17T19:49:41Z

@jimczi
Thanks a lot for this explanation! I understand it better now.
An option like enable_position_increments would be great!

nik9000 · 2017-02-17T19:51:53Z

Yeah, something like enable_position_increments: false would give you the behavior you are asking for. What we have today corresponds to true.

fbaligand · 2017-02-17T19:53:30Z

That would be perfect!

fbaligand · 2017-04-19T08:21:52Z

Hi @clintongormley, @jpountz, @nik9000, @jimczi

I just submitted a PR (#24175) that adds the new boolean option "enable_position_increments", as described by @nik9000.
By default, the option is true, which means that position increments are counted.
If the option is set to false, only the tokens that remain after the full analysis are counted.
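
As an illustration of how the option described above might be used, the sketch below builds the `length` sub-field definition from this issue as a Python dict and prints it as JSON. The option name and its default come from this thread. The variable name `length_field` is only illustrative, and the surrounding index-creation request is deliberately left out.

```python
import json

# Sub-field definition from the issue, plus the option proposed in the PR.
# Assumption: the option is accepted at the same level as "analyzer".
length_field = {
    "type": "token_count",
    "analyzer": "default",
    "enable_position_increments": False,  # count emitted tokens only
}

# Print the fragment as it would appear under "fields" in the mapping.
print(json.dumps({"length": length_field}, indent=2))
```

Leaving the option out, or setting it to true, keeps the existing behavior of summing position increments.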

Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of tokens (not position increments count)

Add option "enable_position_increments" with default value true. If option is set to false, indexed value is the number of positions (position increments are not counted)

clintongormley added :Search/Analysis How text is split into tokens discuss labels Feb 17, 2017

fbaligand changed the title ~~token_count does not count well with an analyzer containing stop filter~~ token_count type : add an option to count tokens (and not positions) Apr 16, 2017

fbaligand mentioned this issue Apr 19, 2017

token_count type : add enable_position_increments option (fix #23227) #24175

Merged

jimczi closed this as completed in #24175 Apr 20, 2017
