Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use index-prefix fields for terms of length min_chars - 1 #36703

Merged
merged 8 commits into from Dec 19, 2018

Conversation

Projects
None yet
5 participants
@romseygeek
Copy link
Contributor

commented Dec 17, 2018

The default index_prefix settings will index prefixes of between 2 and 5 characters in length. Currently, if a prefix search falls outside of this range at either end we fall back to a standard prefix expansion, which is still very expensive for single character prefixes. However, we have an option here to use a wildcard expansion rather than a prefix expansion, so that a query of a* gets remapped to a? against the _index_prefix field - likely to be a very small set of terms, and certain to be much smaller than a* against the whole index.

This pull request adds this extra level of mapping for any prefix term whose length is one less than the min_chars parameter of the index_prefixes field.

A possible follow-up could be to disallow single-character wildcards against a field unless index_prefix is enabled with a min_char settings of 2.

@elasticmachine

This comment has been minimized.

Copy link

commented Dec 17, 2018

@romseygeek romseygeek requested review from jpountz and jtibshirani Dec 17, 2018

@romseygeek

This comment has been minimized.

Copy link
Contributor Author

commented Dec 17, 2018

@elasticmachine retest this please

@jpountz
Copy link
Contributor

left a comment

This is a great idea! I left some suggestions.

if (strValue.length() >= minChars) {
return super.termQuery(value, context);
}
WildcardQuery query = new WildcardQuery(new Term(name(), value + "?"));

This comment has been minimized.

Copy link
@jpountz

jpountz Dec 17, 2018

Contributor

I think it'd be safer to create an automaton manually and then instantiate an AutomatonQuery, otherwise there could be surprises if value contains ? or *.

@@ -360,7 +361,7 @@ PrefixFieldType setAnalyzer(NamedAnalyzer delegate) {
}

boolean accept(int length) {
return length >= minChars && length <= maxChars;
return length >= minChars - 1 && length <= maxChars;

This comment has been minimized.

Copy link
@jpountz

jpountz Dec 17, 2018

Contributor

Let's go even further, change this to just length <= maxChars, and then append minChars - prefixTerm.length wildcards to the automaton that is used for querying?

romseygeek added some commits Dec 17, 2018

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 17, 2018

Now that I'm thinking about it again... I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 17, 2018

It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

@romseygeek

This comment has been minimized.

Copy link
Contributor Author

commented Dec 17, 2018

It probably also means that we should not try to support prefixes whose length is less than minChars - 1 like I suggested above.

What about text fields that don't have index_prefixes set on them? Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

@jtibshirani
Copy link
Member

left a comment

I had a couple questions for my knowledge, to help understand the trade-offs we're making: in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

@@ -370,6 +375,23 @@ void doXContent(XContentBuilder builder) throws IOException {
builder.endObject();
}

public Query termQuery(Object value, MultiTermQuery.RewriteMethod method, QueryShardContext context) {

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 17, 2018

Member

Maybe we should rename this method, now that it is not a pure term query (for example prefixQuery could make more sense)?

This comment has been minimized.

Copy link
@romseygeek

romseygeek Dec 18, 2018

Author Contributor

I made this just override prefixQuery, it makes much more sense!

q = fieldType.prefixQuery("internationalisatio", CONSTANT_SCORE_REWRITE, queryShardContext);
assertEquals(new PrefixQuery(new Term("field", "internationalisatio")), q);

q = fieldType.prefixQuery("g", CONSTANT_SCORE_REWRITE, queryShardContext);

This comment has been minimized.

Copy link
@jtibshirani

jtibshirani Dec 17, 2018

Member

It seems like these detailed query construction tests would fit better in a unit test like TextFieldTypeTests.

This comment has been minimized.

Copy link
@romseygeek

romseygeek Dec 18, 2018

Author Contributor

++

@romseygeek

This comment has been minimized.

Copy link
Contributor Author

commented Dec 18, 2018

Thanks @jtibshirani, have pushed some changes to address your comments.

in what circumstances would users adjust the min_chars setting, and why does it default to 2 as opposed to 1?

The reasoning is that each extra ngram length adds to index size; so min_chars of 1 will end up with a very large index indeed, and 2 seems to be a reasonable default. But if you know that you will only ever do prefix searches of length 4 or more, for example, then you can up the min_chars setting to save on disk space.

@romseygeek romseygeek merged commit dd540ef into elastic:master Dec 19, 2018

7 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci-1 Build finished.
Details
elasticsearch-ci-2 Build finished.
Details
elasticsearch-ci/default-distro Build finished.
Details
elasticsearch-ci/docbldesx Build finished.
Details
elasticsearch-ci/oss-distro-docs Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details

@romseygeek romseygeek deleted the romseygeek:single-char-prefix branch Dec 19, 2018

@jtibshirani

This comment has been minimized.

Copy link
Member

commented Dec 19, 2018

@romseygeek I'm wondering if you addressed @jpountz's comment above?

I'm afraid that this refactoring will fail to match terms whose length is exactly minChars-1? my_text_field:a* should actually be parsed to something like my_text_field:a OR my_text_field.index_prefix:a??

@romseygeek

This comment has been minimized.

Copy link
Contributor Author

commented Dec 20, 2018

I'm wondering if you addressed @jpountz's comment above?

I had missed that, thank you! Will open a separate PR to deal with it.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 20, 2018

Should we disallow prefix queries below a certain length entirely there, and say that if you want prefix search for short prefixes you need to use index_prefixes?

I can't find the Github issue now, but it has been occasionally asked that we add a flag that allows to disable slow queries entirely, such as multi-term queries that match lots of terms. We could have a switch for all queries rather than only prefix queries, eg. by enforcing a rewrite method that fails if more than X terms match? And index_prefixes would be a way to avoid hitting this limit for prefix queries?

@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.