Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

term vector request #3115

Closed
wants to merge 1 commit into from
Closed

term vector request #3115

wants to merge 1 commit into from

Conversation

brwe
Copy link
Contributor

@brwe brwe commented May 29, 2013

Returns information and statistics on terms in the fields of a particular document as stored in the index.

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or adding by adding the requested fields in the request body (see example below).

Term information

  • term frequency in the field (always returned)
  • term positions ("positions" : true)
  • start and end offsets ("offsets" : true)
  • term payload ("payloads" : true)

If the requested information wasn't stored in the index, it will be omitted without further warning.
See mapping on how to configure your index to store term vectors.

Term statistics

Setting "term_statistics" to "true" (default is "false") will return

  • total term frequency (how often a term occurs in all documents)
  • document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics

Setting "field_statistics" to "false" (default is "true") will omit

  • document count (how many documents contain this field)
  • sum of document frequencies (the sum of document frequencies for all terms in this field)
  • sum of total term frequencies (the sum of total term frequencies of each term in this field)

Here is a sample request that returns everything for the field "text":

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d
    '{
            "fields" : ["text"],
            "offsets" : true,
            "payloads" : true,
            "positions" : true,
            "term_statistics" : true,
            "field_statistics" : true
    }'

This will return all the information described above for the field "text" in document "1" of type "tweet" in index "twitter".

Behavior

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.

Next steps

A term vector request for more than one document similar to multi get

Closes #3114

@s1monw
Copy link
Contributor

s1monw commented May 29, 2013

I think it would be nice to see the response format. The description only show the request format. Britta can you add that to the commit?

@clintongormley
Copy link

Very nice indeed - this will be useful!

@brwe
Copy link
Contributor Author

brwe commented May 30, 2013

I added an example to the commit message, see below. Is that what you meant?

Example
-------------------------

First, we create an index that stores term vectors, payloads etc. :

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "store" : "yes",
                                "index_analyzer" : "fulltext_analyzer"
                         },
                     "fullname": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "index_analyzer" : "fulltext_analyzer"
                         }
                 }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                    "analyzer": {
                        "fulltext_analyzer": {
                            "type": "custom",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "type_as_payload"
                            ]
                        }
                    }
            }
         }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "

    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."

    }'

The following request returns all information and statistics for firld "text" in document "1" (John Doe):

     curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
                    "fields" : ["text"],
                    "offsets" : true,
                    "payloads" : true,
                    "positions" : true,
                    "term_statistics" : true,
                    "field_statistics" : true
            }'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "word", "word", "word" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "word" ]
            }
          }
        }
      }
    }

@s1monw
Copy link
Contributor

s1monw commented May 30, 2013

good stuff! yes that is what I was talking about!

@jpountz
Copy link
Contributor

jpountz commented May 30, 2013

I'm worried that this API exposes top-level terms statistics because it forces this API to perform one random seek for term vectors and another one in the terms dictionary. So maybe top-level field statistics should be an opt-in option or a different API?

@s1monw
Copy link
Contributor

s1monw commented May 30, 2013

@jpountz you are talking about term_statistics like docFreq? this is disabled by default so you have to opt in.

@jpountz
Copy link
Contributor

jpountz commented May 30, 2013

Sorry, I completely missed the options in the query!

@s1monw
Copy link
Contributor

s1monw commented May 30, 2013

no worries!

String currentFieldName = null;
XContentParser parser;

parser = XContentFactory.xContent(cont).createParser(cont);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to wrap in try ... finally block and close the parser

================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

        curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Tree types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

	curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or adding by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Behavior
-------------------------

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads etc. :

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "store" : "yes",
                                "index_analyzer" : "fulltext_analyzer"
                         },
                     "fullname": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "index_analyzer" : "fulltext_analyzer"
                         }
                 }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                    "analyzer": {
                        "fulltext_analyzer": {
                            "type": "custom",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "type_as_payload"
                            ]
                        }
                    }
            }
         }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "

    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."

    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

     curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
                    "fields" : ["text"],
                    "offsets" : true,
                    "payloads" : true,
                    "positions" : true,
                    "term_statistics" : true,
                    "field_statistics" : true
            }'
Equivalently, all parameters can be passed as URI parameters:
     curl -GET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

  {
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1",
    "_version" : 1,
    "exists" : true,
    "term_vectors" : {
      "text" : {
        "field_statistics" : {
          "sum_doc_freq" : 6,
          "doc_count" : 2,
          "sum_ttf" : 8
        },
        "terms" : {
          "test" : {
            "doc_freq" : 2,
            "ttf" : 4,
            "term_freq" : 3,
            "pos" : [ 1, 2, 3 ],
            "start" : [ 8, 13, 18 ],
            "end" : [ 12, 17, 22 ],
            "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
          },
          "twitter" : {
            "doc_freq" : 2,
            "ttf" : 2,
            "term_freq" : 1,
            "pos" : [ 0 ],
            "start" : [ 0 ],
            "end" : [ 7 ],
            "payload" : [ "d29yZA==" ]
          }
        }
      }
    }
  }

Further changes:
-------------------------

XContentBuilder
new method
public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
to put an integer array.

IndicesAnalysisService
make token filter for saving payloads available in elasticsearch

AbstractFieldMapper/TypeParser
make term vector options string available and also fix the parsing of this string:
with_positions_payloads is actually allowed as can be seen in TermVectorsConsumerPerFields.

Closes elastic#3114
@s1monw
Copy link
Contributor

s1monw commented Jun 7, 2013

+1 on the latest commit version! LGTM push it to master and backport to 0.90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

return term vectors and some statistics for a document
5 participants