term vector request #3115
I think it would be nice to see the response format. The description only shows the request format. Britta, can you add that to the commit?
Very nice indeed - this will be useful!
I added an example to the commit message, see below. Is that what you meant?
Example
First, we create an index that stores term vectors, payloads etc.:
Second, we add some documents:
The following request returns all information and statistics for field "text" in document "1" (John Doe):
Response:
good stuff! yes that is what I was talking about!
I'm worried that this API exposes top-level terms statistics because it forces this API to perform one random seek for term vectors and another one in the terms dictionary. So maybe top-level field statistics should be an opt-in option or a different API?
@jpountz you are talking about
Sorry, I completely missed the options in the query!
no worries!
String currentFieldName = null;
XContentParser parser;

parser = XContentFactory.xContent(cont).createParser(cont);
need to wrap in try ... finally block and close the parser
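A minimal sketch of the pattern this review comment asks for. `FakeParser` is a hypothetical stand-in for `XContentParser`, used only so the example runs without Elasticsearch on the classpath; the point is that the parser is closed in a `finally` block, so it is released even when parsing throws.

```java
import java.io.Closeable;

// Hypothetical stand-in for XContentParser (not the real class).
class FakeParser implements Closeable {
    boolean closed = false;
    private final String content;

    FakeParser(String content) {
        this.content = content;
    }

    // Pretend parsing step; throws on empty input to simulate a parse failure.
    String nextFieldName() {
        if (content.isEmpty()) {
            throw new IllegalStateException("nothing to parse");
        }
        return content;
    }

    @Override
    public void close() {
        closed = true;
    }
}

class ParserCloseSketch {
    // Creates a parser, reads one field name, and always closes the parser.
    // The parser is exposed through 'holder' so callers can inspect its state.
    static String parseFirstField(String content, FakeParser[] holder) {
        FakeParser parser = new FakeParser(content);
        holder[0] = parser;
        try {
            return parser.nextFieldName();
        } finally {
            parser.close(); // runs on normal return and on exception
        }
    }

    public static void main(String[] args) {
        FakeParser[] holder = new FakeParser[1];
        System.out.println(parseFirstField("fields", holder) + " closed=" + holder[0].closed);
        try {
            parseFirstField("", holder);
        } catch (IllegalStateException e) {
            System.out.println("parse failed, closed=" + holder[0].closed);
        }
    }
}
```

On Java 7 or later the same effect can be had with try-with-resources, provided the parser type implements `Closeable` or `AutoCloseable`.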
================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning. See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Behavior
-------------------------

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads etc.:

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
      "mappings": {
        "tweet": {
          "properties": {
            "text": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "store" : "yes",
              "index_analyzer" : "fulltext_analyzer"
            },
            "fullname": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "index_analyzer" : "fulltext_analyzer"
            }
          }
        }
      },
      "settings" : {
        "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 0
        },
        "analysis": {
          "analyzer": {
            "fulltext_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "type_as_payload"
              ]
            }
          }
        }
      }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "
    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."
    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
      "fields" : ["text"],
      "offsets" : true,
      "payloads" : true,
      "positions" : true,
      "term_statistics" : true,
      "field_statistics" : true
    }'

Equivalently, all parameters can be passed as URI parameters:

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "d29yZA==" ]
            }
          }
        }
      }
    }

Further changes:
-------------------------

- XContentBuilder: new method public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value) to put an integer array.
- IndicesAnalysisService: make token filter for saving payloads available in elasticsearch
- AbstractFieldMapper/TypeParser: make term vector options string available and also fix the parsing of this string: with_positions_payloads is actually allowed as can be seen in TermVectorsConsumerPerFields.

Closes elastic#3114
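As a sanity check on the example response in the commit message, the term and field statistics can be recomputed by hand from the two indexed documents. The sketch below is not how Elasticsearch or Lucene computes them; it assumes the whitespace-plus-lowercase analysis from the example mapping, a single shard and no deleted documents. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TermStatsSketch {
    // Whitespace tokenization plus lowercasing, mirroring the example analyzer.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // ttf: how often each term occurs across all documents.
    static Map<String, Integer> totalTermFreq(List<String> docs) {
        Map<String, Integer> ttf = new TreeMap<>();
        for (String doc : docs) {
            for (String token : analyze(doc)) {
                ttf.merge(token, 1, Integer::sum);
            }
        }
        return ttf;
    }

    // doc_freq: number of documents containing each term at least once.
    static Map<String, Integer> docFreq(List<String> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(analyze(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Sums per-term counts to get the field-level totals.
    static int sum(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("twitter test test test ", "Another twitter test ...");
        Map<String, Integer> ttf = totalTermFreq(docs);
        Map<String, Integer> df = docFreq(docs);
        // prints "test: doc_freq=2 ttf=4"
        System.out.println("test: doc_freq=" + df.get("test") + " ttf=" + ttf.get("test"));
        // prints "sum_doc_freq=6 sum_ttf=8"
        System.out.println("sum_doc_freq=" + sum(df) + " sum_ttf=" + sum(ttf));
    }
}
```

The totals match the response because the terms "another" and "..." from document 2 each contribute 1 to both sums, even though they do not appear in the term vector of document 1.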
+1 on the latest commit version! LGTM, push it to master and backport to 0.90
Returns information and statistics on terms in the fields of a particular document as stored in the index.
Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.
Optionally, you can specify the fields for which the information is retrieved, either with a parameter in the URL or by adding the requested fields to the request body (see example below).
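Term information
- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes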
If the requested information wasn't stored in the index, it will be omitted without further warning.
See mapping on how to configure your index to store term vectors.
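Payloads come back as base64 encoded bytes, so a client has to decode them before use. A minimal sketch, assuming the payload bytes are UTF-8 text (which holds for the `d29yZA==` value in the example response in the commit message, but is not guaranteed for payloads in general); the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class PayloadDecodeSketch {
    // Decodes one base64 payload string and interprets the bytes as UTF-8.
    // Assumption: the payload holds text; binary payloads need other handling.
    static String decodePayload(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodePayload("d29yZA==")); // prints "word"
    }
}
```

The decoded value "word" is consistent with the `type_as_payload` filter in the example analyzer, which appears to store each token's type as its payload.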
Term statistics
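Setting "term_statistics" to "true" (default is "false") will return
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)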
By default these values are not returned since term statistics can have a serious performance impact.
Field statistics
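Setting "field_statistics" to "false" (default is "true") will omit
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)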
Here is a sample request that returns everything for the field "text":
This will return all the information described above for the field "text" in document "1" of type "tweet" in index "twitter".
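The sample request itself did not survive in this description. As a hedged illustration, the sketch below assembles the URI-parameter form of such a request, using the host, index, type, id and parameter names from the example in the commit message. The helper names are hypothetical, and parameter values are assumed to be URL-safe (no encoding is applied).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class TermVectorUriSketch {
    // Parameters that request all term information and statistics for one field.
    static Map<String, String> everythingFor(String field) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("pretty", "true");
        params.put("fields", field);
        params.put("offsets", "true");
        params.put("payloads", "true");
        params.put("positions", "true");
        params.put("term_statistics", "true");
        params.put("field_statistics", "true");
        return params;
    }

    // Builds <base>/<index>/<type>/<id>/_termvector?<key=value&...>.
    static String termVectorUri(String base, String index, String type, String id,
                                Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        return base + "/" + index + "/" + type + "/" + id + "/_termvector?" + query;
    }

    public static void main(String[] args) {
        System.out.println(termVectorUri("http://localhost:9200", "twitter", "tweet", "1",
                everythingFor("text")));
    }
}
```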
Behavior
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.
Next steps
A term vector request for more than one document similar to multi get
Closes #3114