Add a feature_vector field. #31102

Merged
merged 5 commits into from
Jun 7, 2018
4 changes: 3 additions & 1 deletion docs/reference/mapping/types.asciidoc
@@ -42,6 +42,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>

<<feature>>:: Record numeric features to boost hits at query time.

<<feature-vector>>:: Record numeric feature vectors to boost hits at query time.

[float]
=== Multi-fields

@@ -90,4 +92,4 @@ include::types/parent-join.asciidoc[]

include::types/feature.asciidoc[]


include::types/feature-vector.asciidoc[]
64 changes: 64 additions & 0 deletions docs/reference/mapping/types/feature-vector.asciidoc
@@ -0,0 +1,64 @@
[[feature-vector]]
=== Feature vector datatype

A `feature_vector` field can index numeric feature vectors, so that they can
later be used to boost documents in queries with a
<<query-dsl-feature-query,`feature`>> query.

It is analogous to the <<feature,`feature`>> datatype but is better suited
when the list of features is sparse so that it wouldn't be reasonable to add
one field to the mappings for each of them.

[source,js]
--------------------------------------------------
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"topics": {
"type": "feature_vector" <1>
}
}
}
}
}

PUT my_index/_doc/1
{
"topics": { <2>
"politics": 20,
"economics": 50.8
}
}

PUT my_index/_doc/2
{
"topics": {
"politics": 5.2,
"sports": 80.1
}
}

GET my_index/_search
{
"query": {
"feature": {
"field": "topics.politics"
}
}
}
--------------------------------------------------
Review comment (Contributor):

I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?

Reply (Contributor Author):

Agreed, @gibrown had a similar comment

// CONSOLE
<1> Feature vector fields must use the `feature_vector` field type
<2> Feature vector fields must be a hash with string keys and strictly positive numeric values

NOTE: `feature_vector` fields only support single-valued features and strictly
positive values. Multi-valued fields and zero or negative values will be rejected.
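
The constraints in the callouts and the note above can be illustrated with a small client-side pre-check. This is only a sketch under stated assumptions: `flatten_feature_vector` is a hypothetical helper, not part of Elasticsearch or of this change, and the `field.key` naming simply mirrors the `topics.politics` field used in the example query.

```python
from numbers import Real


def flatten_feature_vector(field: str, vector: dict) -> dict:
    """Validate a feature vector and return {"<field>.<key>": value}.

    Hypothetical helper: mirrors the documented rules (string keys,
    single-valued, strictly positive numbers); not Elasticsearch code.
    """
    flat = {}
    for key, value in vector.items():
        if not isinstance(key, str):
            raise TypeError(f"feature name must be a string, got {key!r}")
        # Lists (multi-valued features) and non-numeric values land here.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"feature {key!r} must be a single number")
        if value <= 0:
            raise ValueError(f"feature {key!r} must be strictly positive")
        flat[f"{field}.{key}"] = float(value)
    return flat


if __name__ == "__main__":
    print(flatten_feature_vector("topics", {"politics": 20, "economics": 50.8}))
```

Running such a check before indexing surfaces zero, negative, or multi-valued features early instead of relying on the index request being rejected.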

NOTE: `feature_vector` fields do not support sorting or aggregating and may
only be queried using <<query-dsl-feature-query,`feature`>> queries.

NOTE: `feature_vector` fields only preserve 9 significant bits for the
precision, which translates to a relative error of about 0.4%.
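
The 0.4% figure can be sanity-checked with a rough sketch. It assumes the value is held as a 32-bit float whose mantissa is truncated to 8 explicit bits (9 significant bits counting the implicit leading one), so the relative error stays below 2^-8, about 0.39%. The helper names are hypothetical and the encoding is an assumption for illustration, not taken from this change.

```python
import struct


def truncate_to_9_significant_bits(value: float) -> float:
    """Keep sign, exponent and the top 8 mantissa bits of a float32.

    Assumption for illustration: the low 15 mantissa bits are dropped.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    bits &= 0xFFFF8000  # clear the 15 least significant mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def relative_error(value: float) -> float:
    """Relative difference between a value and its truncated form."""
    return abs(value - truncate_to_9_significant_bits(value)) / value


if __name__ == "__main__":
    for v in (20.0, 50.8, 5.2, 80.1):
        stored = truncate_to_9_significant_bits(v)
        print(v, stored, f"{relative_error(v):.4%}")
```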

90 changes: 72 additions & 18 deletions docs/reference/query-dsl/feature-query.asciidoc
@@ -2,9 +2,10 @@
=== Feature Query

The `feature` query is a specialized query that only works on
-<<feature,`feature`>> fields. Its goal is to boost the score of documents based
-on the values of numeric features. It is typically put in a `should` clause of
-a <<query-dsl-bool-query,`bool`>> query so that its score is added to the score
+<<feature,`feature`>> fields and <<feature-vector,`feature_vector`>> fields.
+Its goal is to boost the score of documents based on the values of numeric
+features. It is typically put in a `should` clause of a
+<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
of the query.
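
How a `should` clause adds to the score can be sketched as follows. This is a simplified model under stated assumptions: the feature clause is taken to use a saturation function `S / (S + pivot)` scaled by its boost, a `bool` query is taken to sum the scores of its matching clauses, and the pivot and text score values are made up for illustration.

```python
def saturation(value: float, pivot: float, boost: float = 1.0) -> float:
    """Saturation score: grows with the feature value, bounded by `boost`."""
    return boost * value / (value + pivot)


def bool_score(must_score: float, should_scores: list) -> float:
    """Model of a bool query: matching clause scores are summed."""
    return must_score + sum(should_scores)


if __name__ == "__main__":
    text_score = 1.2  # hypothetical score of the text match
    pagerank_bonus = saturation(50.3, pivot=8.0)  # hypothetical pivot
    print(bool_score(text_score, [pagerank_bonus]))
```

Because the saturation score is bounded by the boost, a feature clause can reorder hits with similar text scores without overwhelming the text match.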

Compared to using <<query-dsl-function-score-query,`function_score`>> or other
@@ -13,7 +14,16 @@ efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
spectacular.

-Here is an example:
+Here is an example that indexes various features:
+- https://en.wikipedia.org/wiki/PageRank[`pagerank`], a measure of the
+importance of a website,
+- `url_length`, the length of the URL, which typically correlates negatively
+with relevance,
+- `topics`, which associates a list of topics with every document alongside a
+measure of how well the document is connected to this topic.
+
+The example then runs a query that searches for `"2016"` and boosts
+based on `pagerank`, `url_length` and the `sports` topic.

[source,js]
--------------------------------------------------
@@ -28,6 +38,9 @@ PUT test
"url_length": {
"type": "feature",
"positive_score_impact": false
},
"topics": {
"type": "feature_vector"
}
}
}
@@ -36,32 +49,73 @@ PUT test

PUT test/_doc/1
{
"pagerank": 10,
"url_length": 50
"url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
"content": "Rio 2016",
"pagerank": 50.3,
"url_length": 42,
"topics": {
"sports": 50,
"brazil": 30
}
}

PUT test/_doc/2
{
"pagerank": 100,
"url_length": 20
"url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
"content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
"pagerank": 50.3,
"url_length": 47,
"topics": {
"sports": 35,
"formula one": 65,
"brazil": 20
}
}

POST test/_refresh

GET test/_search
PUT test/_doc/3
{
"query": {
"feature": {
"field": "pagerank"
}
"url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
"content": "Deadpool is a 2016 American superhero film",
"pagerank": 50.3,
"url_length": 37,
"topics": {
"movies": 60,
"super hero": 65
}
}

GET test/_search
POST test/_refresh

GET test/_search
{
"query": {
"feature": {
"field": "url_length"
"bool": {
"must": [
{
"match": {
"content": "2016"
}
}
],
"should": [
{
"feature": {
"field": "pagerank"
}
},
{
"feature": {
"field": "url_length",
"boost": 0.1
}
},
{
"feature": {
"field": "topics.sports",
"boost": 0.4
}
}
]
}
}
}
@@ -165,8 +165,7 @@ public Query existsQuery(QueryShardContext context) {

@Override
public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
-failIfNoDocValues();
-return new DocValuesIndexFieldData.Builder();
+throw new UnsupportedOperationException("[feature] fields do not support sorting, scripting or aggregating");
}

@Override
@@ -229,10 +228,6 @@ protected String contentType() {
protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
super.doXContentBody(builder, includeDefaults, params);

-if (includeDefaults || fieldType().nullValue() != null) {
-builder.field("null_value", fieldType().nullValue());
-}

if (includeDefaults || fieldType().positiveScoreImpact() == false) {
builder.field("positive_score_impact", fieldType().positiveScoreImpact());
}