make term statistics and term vectors accessible in scripts #4161
Closed
7 commits
4489f9c do not call score() twice (brwe)
38688d3 make term statistics accessible in scripts (brwe)
7718b8d implemented comments from @s1monw, @clintongormley, @martijnvg (brwe)
0cd0028 implemented @s1monw comments (brwe)
b132824 save term frequency to avoid null checks (brwe)
de7a768 randomize number of shards in some tests and remove unneeded benchmar… (brwe)
4a0fd70 remove todo in doc and some minor changes (brwe)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
[[modules-advanced-scripting]] | ||
== Text scoring in scripts | ||
|
||
|
||
Text features, such as the term frequency or document frequency of a specific term, can be accessed in scripts (see <<modules-scripting, scripting documentation>>) with the `_shard` variable. This is useful if, for example, you want to implement your own scoring model with a script inside a <<query-dsl-function-score-query,function score query>>. | |
Statistics over the document collection are computed *per shard*, not per | ||
index. | ||
|
||
[float] | ||
=== Nomenclature: | ||
|
||
|
||
[horizontal] | ||
`df`:: | ||
|
||
document frequency. The number of documents a term appears in. Computed | ||
per field. | ||
|
||
|
||
`tf`:: | ||
|
||
term frequency. The number of times a term appears in a field in one specific | |
document. | ||
|
||
`ttf`:: | ||
|
||
total term frequency. The number of times this term appears in all | ||
documents, that is, the sum of `tf` over all documents. Computed per | ||
field. | ||
|
||
`df` and `ttf` are computed per shard and therefore these numbers can vary | ||
depending on the shard the current document resides in. | ||
|
||
|
||
[float] | ||
=== Shard statistics: | ||
|
||
`_shard.numDocs()`:: | ||
|
||
Number of documents in shard. | ||
|
||
`_shard.maxDoc()`:: | ||
|
||
Maximal document number in shard. | ||
|
||
`_shard.numDeletedDocs()`:: | ||
|
||
Number of deleted documents in shard. | ||
|
||
|
||
[float] | ||
=== Field statistics: | ||
|
||
Field statistics can be accessed with a subscript operator like this: | ||
`_shard['FIELD']`. | ||
|
||
|
||
`_shard['FIELD'].docCount()`:: | ||
|
||
Number of documents containing the field `FIELD`. Does not take deleted documents into account. | ||
|
||
`_shard['FIELD'].sumttf()`:: | ||
|
||
Sum of `ttf` over all terms that appear in field `FIELD` in all documents. | ||
|
||
`_shard['FIELD'].sumdf()`:: | ||
|
||
Sum of `df` over all terms that appear in field `FIELD` in all | |
documents. | ||
|
||
|
||
Field statistics are computed per shard and therefore these numbers can vary | |
depending on the shard the current document resides in. | ||
The number of terms in a field cannot be accessed using the `_shard` variable. See <<mapping-core-types, word count mapping type>> on how to do that. | ||
|
||
[float] | ||
=== Term statistics: | ||
|
||
Term statistics for a field can be accessed with a subscript operator like | ||
this: `_shard['FIELD']['TERM']`. This will never return null, even if the term or field does not exist. | |
If you do not need the term frequency, call `_shard['FIELD'].get('TERM', 0)` | |
to avoid unnecessary initialization of the frequencies. The flag only has an | |
effect if you set the `index_options` to `docs` (see <<mapping-core-types, mapping documentation>>). | |
|
||
|
||
`_shard['FIELD']['TERM'].df()`:: | ||
|
||
`df` of term `TERM` in field `FIELD`. Will be returned, even if the term | ||
is not present in the current document. | ||
|
||
`_shard['FIELD']['TERM'].ttf()`:: | ||
|
||
The sum of term frequencies of term `TERM` in field `FIELD` over all | |
documents. Will be returned, even if the term is not present in the | ||
current document. | ||
|
||
`_shard['FIELD']['TERM'].tf()`:: | ||
|
||
`tf` of term `TERM` in field `FIELD`. Will be 0 if the term is not present | ||
in the current document. | ||
|
||
|
||
[float] | ||
=== Term positions, offsets and payloads: | ||
|
||
If you need information on the positions of terms in a field, call | ||
`_shard['FIELD'].get('TERM', flag)` where flag can be | ||
|
||
[horizontal] | ||
`_POSITIONS`:: if you need the positions of the term | ||
`_OFFSETS`:: if you need the offsets of the term | |
`_PAYLOADS`:: if you need the payloads of the term | ||
`_CACHE`:: if you need to iterate over all positions several times | ||
|
||
The iterator uses the underlying Lucene classes to iterate over positions. For efficiency reasons, you can only iterate over positions once. If you need to iterate over the positions several times, set the `_CACHE` flag. | |
|
||
You can combine the flags with a `|` if you need more than one kind of information. For | |
example, the following will return an object holding the positions and payloads, | ||
as well as all statistics: | ||
|
||
|
||
`_shard['FIELD'].get('TERM', _POSITIONS | _PAYLOADS)` | ||
|
||
|
||
Positions can be accessed with an iterator that returns an object | ||
(`POS_OBJECT`) holding position, offsets and payload for each term position. | ||
|
||
`POS_OBJECT.position`:: | ||
|
||
The position of the term. | ||
|
||
`POS_OBJECT.startOffset`:: | ||
|
||
The start offset of the term. | ||
|
||
`POS_OBJECT.endOffset`:: | ||
|
||
The end offset of the term. | ||
|
||
`POS_OBJECT.payload`:: | ||
|
||
The payload of the term. | ||
|
||
`POS_OBJECT.payloadAsInt(missingValue)`:: | ||
|
||
The payload of the term converted to integer. If the current position has | ||
no payload, the `missingValue` will be returned. Call this only if you | ||
know that your payloads are integers. | ||
|
||
`POS_OBJECT.payloadAsFloat(missingValue)`:: | ||
|
||
The payload of the term converted to float. If the current position has no | ||
payload, the `missingValue` will be returned. Call this only if you know | ||
that your payloads are floats. | ||
|
||
`POS_OBJECT.payloadAsString()`:: | ||
|
||
The payload of the term converted to string. If the current position has | ||
no payload, `null` will be returned. Call this only if you know that your | ||
payloads are strings. | ||
|
||
|
||
Example: sums up all payloads for the term `foo`. | ||
|
||
[source,mvel] | ||
--------------------------------------------------------- | ||
termInfo = _shard['my_field'].get('foo',_PAYLOADS); | ||
score = 0; | ||
for (pos : termInfo) { | ||
score = score + pos.payloadAsInt(0); | ||
} | ||
return score; | ||
--------------------------------------------------------- | ||
|
||
|
||
[float] | ||
=== Term vectors: | ||
|
||
The `_shard` variable can only be used to gather statistics for single terms. If you want to use information on all terms in a field, you must store the term vectors (set `term_vector` in the mapping as described in the <<mapping-core-types,mapping documentation>>). To access them, call | ||
`_shard.getTermVectors()` to get a | ||
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Fields] | ||
instance. This object can then be used as described in the https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Lucene documentation] to iterate over fields and, for each field, over each term in the field. | |
The method will return null if the term vectors were not stored. | ||
|
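
To illustrate why the documentation stresses that `df` and `ttf` are computed per shard, here is a minimal sketch in plain Java. It is not the scripting API from this pull request: the class `ShardTfIdfSketch`, the method `tfIdf`, and all numbers are hypothetical, and the formula is one common tf-idf variant chosen only for illustration. The parameters stand in for what a script would read via `_shard['FIELD']['TERM'].tf()`, `_shard['FIELD']['TERM'].df()` and `_shard.numDocs()`.

```java
// Plain-Java illustration only; none of these names are Elasticsearch APIs.
final class ShardTfIdfSketch {

    /**
     * One common tf-idf variant: sqrt(tf) * (1 + ln(numDocs / (df + 1))).
     * tf      - occurrences of the term in the current document's field
     * df      - number of documents on this shard containing the term
     * numDocs - number of documents on this shard
     */
    static double tfIdf(long tf, long df, long numDocs) {
        if (tf == 0) {
            return 0.0; // term absent from the document: no contribution
        }
        double idf = 1.0 + Math.log((double) numDocs / (double) (df + 1));
        return Math.sqrt((double) tf) * idf;
    }

    public static void main(String[] args) {
        // Hypothetical statistics: same tf, same shard size, different df.
        double onShardA = tfIdf(4, 9, 1000);
        double onShardB = tfIdf(4, 99, 1000);
        System.out.println("shard A: " + onShardA);
        System.out.println("shard B: " + onShardB);
    }
}
```

With identical term frequency, the document on the shard where the term is rarer receives the higher weight, which is the practical consequence of shard-local statistics.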
71 changes: 71 additions & 0 deletions
src/main/java/org/elasticsearch/common/util/MinimalMap.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
/* | ||
* Licensed to ElasticSearch and Shay Banon under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. ElasticSearch licenses this | ||
* file to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
package org.elasticsearch.common.util; | ||
|
||
import java.util.Collection; | ||
import java.util.Map; | ||
import java.util.Set; | ||
|
||
public abstract class MinimalMap<K, V> implements Map<K, V> { | ||
|
||
public boolean isEmpty() { | ||
throw new UnsupportedOperationException("isEmpty() not supported!"); | |
} | ||
|
||
public V put(K key, V value) { | ||
throw new UnsupportedOperationException("put(Object, Object) not supported!"); | ||
} | ||
|
||
public void putAll(Map<? extends K, ? extends V> m) { | ||
throw new UnsupportedOperationException("putAll(Map<? extends K, ? extends V>) not supported!"); | ||
} | ||
|
||
public V remove(Object key) { | ||
throw new UnsupportedOperationException("remove(Object) not supported!"); | ||
} | ||
|
||
public void clear() { | ||
throw new UnsupportedOperationException("clear() not supported!"); | ||
} | ||
|
||
public Set<K> keySet() { | ||
throw new UnsupportedOperationException("keySet() not supported!"); | ||
} | ||
|
||
public Collection<V> values() { | ||
throw new UnsupportedOperationException("values() not supported!"); | ||
} | ||
|
||
public Set<Entry<K, V>> entrySet() { | ||
throw new UnsupportedOperationException("entrySet() not supported!"); | ||
} | ||
|
||
public boolean containsValue(Object value) { | ||
throw new UnsupportedOperationException("containsValue(Object) not supported!"); | ||
} | ||
|
||
public int size() { | ||
throw new UnsupportedOperationException("size() not supported!"); | ||
} | ||
|
||
public boolean containsKey(Object k) { | ||
throw new UnsupportedOperationException("containsKey(Object) not supported!"); | ||
} | ||
} |
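
Since `MinimalMap` leaves only `get` abstract, a subclass presumably serves as a lookup-only view. The sketch below shows how such a subclass might look. It assumes nothing about the real subclasses in this pull request: `MinimalMapSketch` is a trimmed stand-in for the class above so that the example compiles on its own, and `TermLengthLookup` is a hypothetical subclass invented for illustration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Trimmed stand-in for MinimalMap: everything except get() is unsupported.
abstract class MinimalMapSketch<K, V> implements Map<K, V> {
    private static UnsupportedOperationException unsupported(String method) {
        return new UnsupportedOperationException(method + " not supported!");
    }
    public boolean isEmpty() { throw unsupported("isEmpty()"); }
    public V put(K key, V value) { throw unsupported("put(Object, Object)"); }
    public void putAll(Map<? extends K, ? extends V> m) { throw unsupported("putAll(Map)"); }
    public V remove(Object key) { throw unsupported("remove(Object)"); }
    public void clear() { throw unsupported("clear()"); }
    public Set<K> keySet() { throw unsupported("keySet()"); }
    public Collection<V> values() { throw unsupported("values()"); }
    public Set<Entry<K, V>> entrySet() { throw unsupported("entrySet()"); }
    public boolean containsValue(Object value) { throw unsupported("containsValue(Object)"); }
    public int size() { throw unsupported("size()"); }
    public boolean containsKey(Object k) { throw unsupported("containsKey(Object)"); }
}

// Hypothetical subclass: computes a value on first access and caches it.
final class TermLengthLookup extends MinimalMapSketch<String, Integer> {
    private final Map<String, Integer> cache = new HashMap<>();

    @Override
    public Integer get(Object key) {
        if (!(key instanceof String)) {
            return null; // only string keys are meaningful here
        }
        String term = (String) key;
        Integer cached = cache.get(term);
        if (cached == null) {
            cached = term.length(); // placeholder for an expensive lookup
            cache.put(term, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        TermLengthLookup lookup = new TermLengthLookup();
        System.out.println(lookup.get("foo"));
        try {
            lookup.size();
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Callers should stick to `get`: on Java 8 and later, inherited default methods such as `getOrDefault` may call `containsKey` and would therefore throw.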
I like!