New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a `feature_vector` field. #31102

Merged
merged 5 commits into from Jun 7, 2018

Conversation

Projects
None yet
5 participants
@jpountz
Contributor

jpountz commented Jun 5, 2018

This field is similar to the feature field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates #27552

Add a `feature_vector` field.
This field is similar to the `feature` field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates #27552
@elasticmachine

This comment has been minimized.

Show comment
Hide comment
@elasticmachine

elasticmachine commented Jun 5, 2018

@gibrown

This comment has been minimized.

Show comment
Hide comment
@gibrown

gibrown Jun 5, 2018

Contributor

This looks exciting and does somewhat sound better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though would be limited to integers).

But I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. Not fully clear to me from the current docs if zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't say it is just a Java float? I guess maybe this has to do with how the performance gets optimized?

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

Contributor

gibrown commented Jun 5, 2018

This looks exciting and does somewhat sound better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though would be limited to integers).

But I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. Not fully clear to me from the current docs if zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't say it is just a Java float? I guess maybe this has to do with how the performance gets optimized?

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 6, 2018

Contributor

But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there

What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.

supporting negative numbers would be pretty useful also

This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using function_score.

Not fully clear to me from the current docs if zero and floats are supported.

I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision, only 9 bits which translates to (roughly) a 0.4% error. This allows to store term frequencies on 16 bits, which helps keep the index space-efficient.

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

This is good feedback, thanks!

Contributor

jpountz commented Jun 6, 2018

But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there

What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.

supporting negative numbers would be pretty useful also

This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using function_score.

Not fully clear to me from the current docs if zero and floats are supported.

I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision, only 9 bits which translates to (roughly) a 0.4% error. This allows to store term frequencies on 16 bits, which helps keep the index space-efficient.

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

This is good feedback, thanks!

@mayya-sharipova

Thanks Adrien, very interesting and useful feature

throw new IllegalArgumentException("[feature_vector] fields do not support indexing multiple values for the same " +
"feature [" + key + "] in the same document");
}
context.doc().addWithKey(key, new FeatureField(name(), feature, value));

This comment has been minimized.

@mayya-sharipova

mayya-sharipova Jun 6, 2018

Contributor

Did I understand correctly - if we index in ES:

"topics": {
	"politics": 20,
	"economics": 50
}

in Lucene, it will translate to a field with a name topics, with values: "politics" with freq 20, and "economics" with freq 50?

@mayya-sharipova

mayya-sharipova Jun 6, 2018

Contributor

Did I understand correctly - if we index in ES:

"topics": {
	"politics": 20,
	"economics": 50
}

in Lucene, it will translate to a field with a name topics, with values: "politics" with freq 20, and "economics" with freq 50?

This comment has been minimized.

@jpountz

jpountz Jun 6, 2018

Contributor

The actual term frequency will actually be Float.floatToIntBits(20) >>> 15 rather than 20, but other than that this is correct. You can look at FeatureField.tokenStream for more details. So basically it is a float but with only 8 bits for the mantissa rather than 23.

@jpountz

jpountz Jun 6, 2018

Contributor

The actual term frequency will actually be Float.floatToIntBits(20) >>> 15 rather than 20, but other than that this is correct. You can look at FeatureField.tokenStream for more details. So basically it is a float but with only 8 bits for the mantissa rather than 23.

@colings86

@jpountz I left some comments on the documentation but this LGTM and seems like a great enhancement on top of the feature field 😄

Show outdated Hide outdated docs/reference/mapping/types/feature-vector.asciidoc
}
}
}
--------------------------------------------------

This comment has been minimized.

@colings86

colings86 Jun 6, 2018

Member

I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?

@colings86

colings86 Jun 6, 2018

Member

I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?

This comment has been minimized.

@jpountz

jpountz Jun 6, 2018

Contributor

Agreed, @gibrown had a similar comment

@jpountz

jpountz Jun 6, 2018

Contributor

Agreed, @gibrown had a similar comment

@gibrown

This comment has been minimized.

Show comment
Hide comment
@gibrown

gibrown Jun 6, 2018

Contributor

That all makes sense to me, thanks for the explanations. Very cool, and ya I think this would address a number of my use cases.

Contributor

gibrown commented Jun 6, 2018

That all makes sense to me, thanks for the explanations. Very cool, and ya I think this would address a number of my use cases.

@jpountz jpountz merged commit 458bca1 into elastic:master Jun 7, 2018

3 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details

@jpountz jpountz deleted the jpountz:feature_vector branch Jun 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment