Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix alias resolution runtime complexity. #40263

Merged
merged 10 commits into from
Apr 3, 2019

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Mar 20, 2019

A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the can_match phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the can_match phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the took went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248

A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes elastic#40248
@jpountz jpountz added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.2.0 labels Mar 20, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jasontedor jasontedor self-requested a review March 20, 2019 15:32
@jpountz jpountz requested review from s1monw and martijnvg and removed request for jasontedor March 20, 2019 15:35
@jpountz
Copy link
Contributor Author

jpountz commented Mar 20, 2019

@martijnvg @s1monw Git blame is pointing to you as being familiar with this part of the code base. Would you mind having a look?

@jpountz jpountz added v6.7.1 and removed v6.7.0 labels Mar 21, 2019
@@ -314,17 +317,17 @@ public String resolveDateMathExpression(String dateExpression) {
* <p>Only aliases with filters are returned. If the indices list contains a non-filtering reference to
* the index itself - null is returned. Returns {@code null} if no filtering is required.
*/
public String[] filteringAliases(ClusterState state, String index, String... expressions) {
return indexAliases(state, index, AliasMetaData::filteringRequired, false, expressions);
public Function<String, String[]> filteringAliases(ClusterState state, String... expressions) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: please pay special attention to changes in this file where I reorganized loops to make things perform better.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change and will improve alias resolution. Not that it has been a while since I worked on this part of the code base, but this does look good to me.

Looks like the pr build needs to be kicked off again?

if (aliasMetaData != null) {
for (ObjectCursor<AliasMetaData> aliasMetaDataCursor : indexMetaData.getAliases().values()) {
AliasMetaData aliasMetaData = aliasMetaDataCursor.value;
if (resolvedExpressions.contains(aliasMetaData.alias())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jpountz
Copy link
Contributor Author

jpountz commented Mar 21, 2019

@elasticmachine please test it

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good to me while I'm not very happy with the Function. I wonder if we should extract the expression resolving into something like Set<String> responveExpression(String expressionString) that we can then pass into the relevant functions? Might also be easier to unittest that way. If we need a dedicated class for type safety I am ok with that too.

@@ -116,9 +116,10 @@ public TransportSearchAction(ThreadPool threadPool, TransportService transportSe
private Map<String, AliasFilter> buildPerIndexAliasFilter(SearchRequest request, ClusterState clusterState,
Index[] concreteIndices, Map<String, AliasFilter> remoteAliasMap) {
final Map<String, AliasFilter> aliasFilterMap = new HashMap<>();
final Function<String, AliasFilter> indexToAliasFilter = searchService.buildAliasFilter(clusterState, request.indices());;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit extra ;

@jpountz jpountz requested a review from s1monw March 22, 2019 08:05
@jpountz
Copy link
Contributor Author

jpountz commented Mar 22, 2019

@s1monw updated

jpountz added a commit to jpountz/elasticsearch that referenced this pull request Mar 22, 2019
`Index` interns its name and uuid. My guess is that the main goal is to avoid
having duplicate strings in the representation of the cluster state. However
I doubt it helps much given that we have many other objects in the cluster state
that we don't try to reuse, and interning has some cost. When looking into
 elastic#40263 my profiler pointed to string interning because of the `Index` object
that is created in `QueryShardContext` as one of the bottlenecks of the
`can_match` phase.
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false);
List<String> resolvedExpressions = Arrays.asList(expressions);
for (ExpressionResolver expressionResolver : expressionResolvers) {
resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also fix the ExpressionResolver to take and return a Set. I can see WildcardExpressionResolver is already converting from set to list internally. I am ok with doing this as a followup. Either way is fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I added a TODO in WildcardExpressionResolver. I'd like to keep this change contained if possible since I intend to backport it to stable branches, so I'd rather like to do this as a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

jpountz added a commit that referenced this pull request Mar 27, 2019
`Index` interns its name and uuid. My guess is that the main goal is to avoid
having duplicate strings in the representation of the cluster state. However
I doubt it helps much given that we have many other objects in the cluster state
that we don't try to reuse, and interning has some cost. When looking into
 #40263 my profiler pointed to string interning because of the `Index` object
that is created in `QueryShardContext` as one of the bottlenecks of the
`can_match` phase.
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Mar 27, 2019
`Index` interns its name and uuid. My guess is that the main goal is to avoid
having duplicate strings in the representation of the cluster state. However
I doubt it helps much given that we have many other objects in the cluster state
that we don't try to reuse, and interning has some cost. When looking into
 elastic#40263 my profiler pointed to string interning because of the `Index` object
that is created in `QueryShardContext` as one of the bottlenecks of the
`can_match` phase.
@jpountz
Copy link
Contributor Author

jpountz commented Mar 27, 2019

I was unhappy about the fact that this new way of resolving aliases might become slower in the case that an index has many aliases, so I refactored it in such a way that we decide how to resolve aliases based on the number of resolved expressions vs the number of aliases of the index that is being considered. Would you mind having another look?

jpountz added a commit that referenced this pull request Mar 28, 2019
`Index` interns its name and uuid. My guess is that the main goal is to avoid
having duplicate strings in the representation of the cluster state. However
I doubt it helps much given that we have many other objects in the cluster state
that we don't try to reuse, and interning has some cost. When looking into
 #40263 my profiler pointed to string interning because of the `Index` object
that is created in `QueryShardContext` as one of the bottlenecks of the
`can_match` phase.
@colings86 colings86 added v6.7.2 and removed v6.7.1 labels Mar 30, 2019
@jpountz jpountz removed the v6.6.3 label Apr 2, 2019
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I left a comment regarding to testing.

}

// pkg-private for testing
Boolean forceIterateIndexAliases;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rather have a method that takes the two sets and returns a boolean value that we can override in the tests? I don't like this test only variable.

Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false);
List<String> resolvedExpressions = Arrays.asList(expressions);
for (ExpressionResolver expressionResolver : expressionResolvers) {
resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@jpountz jpountz requested a review from s1monw April 3, 2019 12:58
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jpountz jpountz merged commit a78d79f into elastic:master Apr 3, 2019
@jpountz jpountz deleted the fix/alias_resolution_runtime branch April 3, 2019 14:54
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 3, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes elastic#40248
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 3, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes elastic#40248
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 3, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes elastic#40248
jpountz added a commit that referenced this pull request Apr 4, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248
jpountz added a commit that referenced this pull request Apr 4, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248
jpountz added a commit that referenced this pull request Apr 4, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes elastic#40248
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v6.7.2 v7.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resolving queried indices runs in quadratic time with the number of queried indices
6 participants