term vector request #3115

brwe · 2013-05-29T18:12:06Z

Returns information and statistics on terms in the fields of a particular document as stored in the index.

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).
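
As a minimal sketch of the two ways to select fields, the following Python snippet builds the URL form and the request-body form without contacting a server. The helper names `url_with_fields` and `body_with_fields` are hypothetical and exist only for this illustration.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the examples in this description.
BASE = "http://localhost:9200/twitter/tweet/1/_termvector"

def url_with_fields(fields):
    """Select fields through the comma-separated `fields` URL parameter."""
    # safe="," keeps the commas literal instead of percent-encoding them.
    return BASE + "?" + urlencode({"fields": ",".join(fields)}, safe=",")

def body_with_fields(fields):
    """Select fields through the JSON request body instead."""
    return json.dumps({"fields": list(fields)})

print(url_with_fields(["text", "fullname"]))
print(body_with_fields(["text", "fullname"]))
```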

Term information

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payload ("payloads" : true)

If the requested information wasn't stored in the index, it will be omitted without further warning.
See mapping on how to configure your index to store term vectors.

Term statistics

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Here is a sample request that returns everything for the field "text":

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
        "fields" : ["text"],
        "offsets" : true,
        "payloads" : true,
        "positions" : true,
        "term_statistics" : true,
        "field_statistics" : true
    }'

This will return all the information described above for the field "text" in document "1" of type "tweet" in index "twitter".

Behavior

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.
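
To illustrate why shard-local numbers are only meaningful relative to each other, here is a toy Python model. It is not how Elasticsearch computes anything internally: the four documents and their split over two shards are invented for this sketch, and deleted documents are not modelled.

```python
from collections import Counter

def doc_freq(docs):
    """Document frequency per term for whitespace-tokenized, lowercased texts."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df

# Hypothetical split of four documents over two shards.
shard_a = ["twitter test", "test again"]
shard_b = ["another test", "twitter feed"]

global_df = doc_freq(shard_a + shard_b)  # what an index-wide count would be
local_df = doc_freq(shard_a)             # what a request served by shard A sees

# The absolute values differ, but "test" is still more frequent than "twitter".
print(global_df["test"], local_df["test"])
print(global_df["twitter"], local_df["twitter"])
```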

Next steps

A term vector request for more than one document, similar to multi get.

Closes #3114

s1monw · 2013-05-29T19:10:17Z

I think it would be nice to see the response format. The description only shows the request format. Britta, can you add that to the commit?

clintongormley · 2013-05-30T07:45:00Z

Very nice indeed - this will be useful!

brwe · 2013-05-30T10:57:29Z

I added an example to the commit message, see below. Is that what you meant?

Example
-------------------------

First, we create an index that stores term vectors, payloads etc.:

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "store" : "yes",
                        "index_analyzer" : "fulltext_analyzer"
                    },
                    "fullname": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "index_analyzer" : "fulltext_analyzer"
                    }
                }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                "analyzer": {
                    "fulltext_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "type_as_payload"
                        ]
                    }
                }
            }
        }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "

    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."

    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
        "fields" : ["text"],
        "offsets" : true,
        "payloads" : true,
        "positions" : true,
        "term_statistics" : true,
        "field_statistics" : true
    }'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "word", "word", "word" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "word" ]
            }
          }
        }
      }
    }
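
The numbers in this response can be cross-checked by hand. The Python sketch below recomputes them from the two sample documents, assuming the analyzer behaves like a plain split on whitespace followed by lowercasing. The `analyze` helper is a hypothetical stand-in for the `fulltext_analyzer`, not Elasticsearch code, and it ignores payloads.

```python
import re
from collections import Counter

# The "text" field of the two sample documents indexed above.
docs = {
    "1": "twitter test test test ",
    "2": "Another twitter test ...",
}

def analyze(text):
    """Approximate whitespace tokenizer + lowercase filter.

    Returns (term, position, start_offset, end_offset) tuples.
    """
    return [
        (m.group().lower(), pos, m.start(), m.end())
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]

ttf = Counter()       # total term frequency across all documents
doc_freq = Counter()  # number of documents containing each term
for text in docs.values():
    terms = [term for term, _, _, _ in analyze(text)]
    ttf.update(terms)
    doc_freq.update(set(terms))

field_statistics = {
    "doc_count": len(docs),
    "sum_doc_freq": sum(doc_freq.values()),
    "sum_ttf": sum(ttf.values()),
}

# Entry for the term "test" in document "1".
hits = [tok for tok in analyze(docs["1"]) if tok[0] == "test"]
test_entry = {
    "doc_freq": doc_freq["test"],
    "ttf": ttf["test"],
    "term_freq": len(hits),
    "pos": [pos for _, pos, _, _ in hits],
    "start": [start for _, _, start, _ in hits],
    "end": [end for _, _, _, end in hits],
}

print(field_statistics)
print(test_entry)
```

With a single shard and no deleted documents, as in this example, the recomputed values agree with the response: the four distinct terms have document frequencies 2, 2, 1 and 1 (sum 6), and each document contributes four tokens (sum 8).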

s1monw · 2013-05-30T11:34:20Z

good stuff! yes that is what I was talking about!

jpountz · 2013-05-30T12:42:39Z

I'm worried that this API exposes top-level terms statistics because it forces this API to perform one random seek for term vectors and another one in the terms dictionary. So maybe top-level field statistics should be an opt-in option or a different API?

s1monw · 2013-05-30T12:44:12Z

@jpountz you are talking about term_statistics like docFreq? this is disabled by default so you have to opt in.

jpountz · 2013-05-30T12:45:43Z

Sorry, I completely missed the options in the query!

s1monw · 2013-05-30T12:54:09Z

no worries!

kimchy · 2013-05-30T12:56:36Z

src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java

+        String currentFieldName = null;
+        XContentParser parser;
+
+        parser = XContentFactory.xContent(cont).createParser(cont);


need to wrap in try ... finally block and close the parser

================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Behavior
-------------------------

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads etc.:

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "store" : "yes",
                        "index_analyzer" : "fulltext_analyzer"
                    },
                    "fullname": {
                        "type": "string",
                        "term_vector": "with_positions_offsets_payloads",
                        "index_analyzer" : "fulltext_analyzer"
                    }
                }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                "analyzer": {
                    "fulltext_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "type_as_payload"
                        ]
                    }
                }
            }
        }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
        "fullname" : "John Doe",
        "text" : "twitter test test test "
    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
        "fullname" : "Jane Doe",
        "text" : "Another twitter test ..."
    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
        "fields" : ["text"],
        "offsets" : true,
        "payloads" : true,
        "positions" : true,
        "term_statistics" : true,
        "field_statistics" : true
    }'

Equivalently, all parameters can be passed as URI parameters:

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "d29yZA==" ]
            }
          }
        }
      }
    }

Further changes:
-------------------------

XContentBuilder
new method public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
to put an integer array.

IndicesAnalysisService
make token filter for saving payloads available in elasticsearch

AbstractFieldMapper/TypeParser
make term vector options string available and also fix the parsing of this string:
with_positions_payloads is actually allowed as can be seen in TermVectorsConsumerPerFields.

Closes elastic#3114
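
The payloads in this commit message are returned as base64 encoded bytes. As a small sketch of how a client could read them, the Python snippet below decodes the values from the response using only the standard library. It assumes the payload bytes are UTF-8 text, which holds for this example because `type_as_payload` stores the token type.

```python
import base64

# Payload values copied from the "test" term in the response.
payloads = ["d29yZA==", "d29yZA==", "d29yZA=="]

# Decode base64 to bytes, then interpret the bytes as UTF-8 text.
decoded = [base64.b64decode(p).decode("utf-8") for p in payloads]

print(decoded)  # ['word', 'word', 'word']
```

The decoded value "word" matches the unencoded payloads shown in the earlier response in this thread.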

s1monw · 2013-06-07T19:06:28Z

+1 on the latest commit version! LGTM push it to master and backport to 0.90

See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/docs-termvectors.html

Related:
* elastic/elasticsearch#3114
* elastic/elasticsearch#3115

kimchy reviewed May 30, 2013

brwe closed this Jun 13, 2013

bleskes mentioned this pull request Jul 16, 2013

API to get the terms used in a "more_like_this" query #596

Closed
