-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix alias resolution runtime complexity. #40263
Conversation
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes elastic#40248
Pinging @elastic/es-search |
@martijnvg @s1monw Git blame is pointing to you as being familiar with this part of the code base. Would you mind having a look? |
@@ -314,17 +317,17 @@ public String resolveDateMathExpression(String dateExpression) { | |||
* <p>Only aliases with filters are returned. If the indices list contains a non-filtering reference to | |||
* the index itself - null is returned. Returns {@code null} if no filtering is required. | |||
*/ | |||
public String[] filteringAliases(ClusterState state, String index, String... expressions) { | |||
return indexAliases(state, index, AliasMetaData::filteringRequired, false, expressions); | |||
public Function<String, String[]> filteringAliases(ClusterState state, String... expressions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: please pay special attention to changes in this file where I reorganized loops to make things perform better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change and will improve alias resolution. Not that it has been a while since I worked on this part of the code base, but this does look good to me.
Looks like the pr build needs to be kicked off again?
if (aliasMetaData != null) { | ||
for (ObjectCursor<AliasMetaData> aliasMetaDataCursor : indexMetaData.getAliases().values()) { | ||
AliasMetaData aliasMetaData = aliasMetaDataCursor.value; | ||
if (resolvedExpressions.contains(aliasMetaData.alias())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@elasticmachine please test it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks good to me while I'm not very happy with the Function. I wonder if we should extract the expression resolving into something like Set<String> responveExpression(String expressionString)
that we can then pass into the relevant functions? Might also be easier to unittest that way. If we need a dedicated class for type safety I am ok with that too.
@@ -116,9 +116,10 @@ public TransportSearchAction(ThreadPool threadPool, TransportService transportSe | |||
private Map<String, AliasFilter> buildPerIndexAliasFilter(SearchRequest request, ClusterState clusterState, | |||
Index[] concreteIndices, Map<String, AliasFilter> remoteAliasMap) { | |||
final Map<String, AliasFilter> aliasFilterMap = new HashMap<>(); | |||
final Function<String, AliasFilter> indexToAliasFilter = searchService.buildAliasFilter(clusterState, request.indices());; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit extra ;
@s1monw updated |
`Index` interns its name and uuid. My guess is that the main goal is to avoid having duplicate strings in the representation of the cluster state. However I doubt it helps much given that we have many other objects in the cluster state that we don't try to reuse, and interning has some cost. When looking into elastic#40263 my profiler pointed to string interning because of the `Index` object that is created in `QueryShardContext` as one of the bottlenecks of the `can_match` phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false); | ||
List<String> resolvedExpressions = Arrays.asList(expressions); | ||
for (ExpressionResolver expressionResolver : expressionResolvers) { | ||
resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also fix the ExpressionResolver to take and return a Set. I can see WildcardExpressionResolver
is already converting from set to list internally. I am ok with doing this as a followup. Either way is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I added a TODO in WildcardExpressionResolver. I'd like to keep this change contained if possible since I intend to backport it to stable branches, so I'd rather like to do this as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
`Index` interns its name and uuid. My guess is that the main goal is to avoid having duplicate strings in the representation of the cluster state. However I doubt it helps much given that we have many other objects in the cluster state that we don't try to reuse, and interning has some cost. When looking into #40263 my profiler pointed to string interning because of the `Index` object that is created in `QueryShardContext` as one of the bottlenecks of the `can_match` phase.
`Index` interns its name and uuid. My guess is that the main goal is to avoid having duplicate strings in the representation of the cluster state. However I doubt it helps much given that we have many other objects in the cluster state that we don't try to reuse, and interning has some cost. When looking into elastic#40263 my profiler pointed to string interning because of the `Index` object that is created in `QueryShardContext` as one of the bottlenecks of the `can_match` phase.
I was unhappy about the fact that this new way of resolving aliases might become slower in the case that an index has many aliases, so I refactored it in such a way that we decide how to resolve aliases based on the number of resolved expressions vs the number of aliases of the index that is being considered. Would you mind having another look? |
`Index` interns its name and uuid. My guess is that the main goal is to avoid having duplicate strings in the representation of the cluster state. However I doubt it helps much given that we have many other objects in the cluster state that we don't try to reuse, and interning has some cost. When looking into #40263 my profiler pointed to string interning because of the `Index` object that is created in `QueryShardContext` as one of the bottlenecks of the `can_match` phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM I left a comment regarding to testing.
} | ||
|
||
// pkg-private for testing | ||
Boolean forceIterateIndexAliases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rather have a method that takes the two sets and returns a boolean value that we can override in the tests? I don't like this test only variable.
Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false); | ||
List<String> resolvedExpressions = Arrays.asList(expressions); | ||
for (ExpressionResolver expressionResolver : expressionResolvers) { | ||
resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes elastic#40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes elastic#40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes elastic#40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes #40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes #40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes #40248
A user reported that the same query that takes ~900ms when querying an index pattern only takes ~50ms when only querying indices that have matches. The query is a date range query and we confirmed that the `can_match` phase works as expected. I was able to reproduce this issue locally with a single node: with 900 1-shard indices, a query to an index pattern that matches all indices runs in ~90ms while a query to the only index that has matches runs in 0-1ms. This ended up not being related to the `can_match` phase but to the cost of resolving aliases when querying an index pattern that matches lots of indices. In that case, we first resolve the index pattern to a list of concrete indices and then for each concrete index, we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get exponentially slower as an index pattern matches more indices. I reorganized alias resolution into a one-step operation that runs in linear time with the number of matches indices, and then a per-index operation that runs in linear time with the number of aliases of this index. This makes alias resolution run is O(num_indices * num_aliases_per_index) overall instead. When testing the scenario described above, the `took` went down from ~90ms to ~10ms. It is still more than the 0-1ms latency that one gets when only querying the single index that has data, but still much better than what we had before. Closes elastic#40248
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the
can_match
phase worksas expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.
This ended up not being related to the
can_match
phase but to the cost ofresolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get exponentially slower as an index pattern matches more indices.
I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matches indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run is O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the
took
went down from ~90ms to ~10ms.It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.
Closes #40248