Fix alias resolution runtime complexity. #40263

jpountz · 2019-03-20T15:18:46Z

A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the can_match phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the can_match phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So this step gets quadratically slower as an index pattern matches more indices.

I reorganized alias resolution into a single up-front step that runs in linear
time with the number of matched indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run in O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248
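
To illustrate the complexity argument above, here is a minimal sketch and not the actual Elasticsearch code. `AliasResolutionSketch`, `resolve`, `perIndexResolution`, `upFrontResolution` and `sampleAliases` are hypothetical names, cluster state is reduced to a map from index name to alias names, and only trailing-`*` patterns are handled. The alias-filter and `null` semantics of the real method are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AliasResolutionSketch {

    // Hypothetical resolver: expands a trailing-'*' pattern against every known
    // index or alias name. Its cost is linear in the number of known names.
    static Set<String> resolve(String pattern, Map<String, List<String>> aliasesByIndex) {
        boolean wildcard = pattern.endsWith("*");
        String prefix = wildcard ? pattern.substring(0, pattern.length() - 1) : pattern;
        Set<String> resolved = new HashSet<>();
        for (Map.Entry<String, List<String>> entry : aliasesByIndex.entrySet()) {
            List<String> names = new ArrayList<>(entry.getValue());
            names.add(entry.getKey());
            for (String name : names) {
                if (wildcard ? name.startsWith(prefix) : name.equals(prefix)) {
                    resolved.add(name);
                }
            }
        }
        return resolved;
    }

    // Aliases of one index that are part of the resolved expressions.
    static List<String> matchedAliases(List<String> indexAliases, Set<String> resolved) {
        List<String> matched = new ArrayList<>();
        for (String alias : indexAliases) {
            if (resolved.contains(alias)) {
                matched.add(alias);
            }
        }
        return matched;
    }

    // Before: resolution is repeated for every index, so O(num_indices^2) overall.
    static Map<String, List<String>> perIndexResolution(String pattern, Map<String, List<String>> aliasesByIndex) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : aliasesByIndex.entrySet()) {
            Set<String> resolved = resolve(pattern, aliasesByIndex); // repeated linear work
            result.put(entry.getKey(), matchedAliases(entry.getValue(), resolved));
        }
        return result;
    }

    // After: resolve once, then do O(num_aliases_per_index) work per index.
    static Map<String, List<String>> upFrontResolution(String pattern, Map<String, List<String>> aliasesByIndex) {
        Set<String> resolved = resolve(pattern, aliasesByIndex); // done once
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : aliasesByIndex.entrySet()) {
            result.put(entry.getKey(), matchedAliases(entry.getValue(), resolved));
        }
        return result;
    }

    // Builds n indices named logs-<i>, each with a single alias logs-alias-<i>.
    static Map<String, List<String>> sampleAliases(int n) {
        Map<String, List<String>> aliasesByIndex = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            List<String> aliases = new ArrayList<>();
            aliases.add("logs-alias-" + i);
            aliasesByIndex.put("logs-" + i, aliases);
        }
        return aliasesByIndex;
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = sampleAliases(900);
        boolean same = perIndexResolution("logs-*", data).equals(upFrontResolution("logs-*", data));
        System.out.println("both variants agree: " + same);
    }
}
```

Both variants return the same mapping. The difference is only in how often the pattern is expanded: once per index in the first variant, once per request in the second.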

elasticmachine · 2019-03-20T15:18:48Z

Pinging @elastic/es-search

jpountz · 2019-03-20T15:36:23Z

@martijnvg @s1monw Git blame is pointing to you as being familiar with this part of the code base. Would you mind having a look?

jpountz · 2019-03-21T08:09:19Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexNameExpressionResolver.java

@@ -314,17 +317,17 @@ public String resolveDateMathExpression(String dateExpression) {
     * <p>Only aliases with filters are returned. If the indices list contains a non-filtering reference to
     * the index itself - null is returned. Returns {@code null} if no filtering is required.
     */
-    public String[] filteringAliases(ClusterState state, String index, String... expressions) {
-        return indexAliases(state, index, AliasMetaData::filteringRequired, false, expressions);
+    public Function<String, String[]> filteringAliases(ClusterState state, String... expressions) {


Note to reviewers: please pay special attention to changes in this file where I reorganized loops to make things perform better.

martijnvg

This is a good change and will improve alias resolution. Note that it has been a while since I worked on this part of the code base, but this does look good to me.

Looks like the pr build needs to be kicked off again?

martijnvg · 2019-03-21T07:55:15Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexNameExpressionResolver.java

-            if (aliasMetaData != null) {
+        for (ObjectCursor<AliasMetaData> aliasMetaDataCursor : indexMetaData.getAliases().values()) {
+            AliasMetaData aliasMetaData = aliasMetaDataCursor.value;
+            if (resolvedExpressions.contains(aliasMetaData.alias())) {


jpountz · 2019-03-21T08:29:33Z

@elasticmachine please test it

s1monw

It looks good to me, though I'm not very happy with the `Function`. I wonder if we should extract the expression resolving into something like `Set<String> resolveExpression(String expressionString)` that we can then pass into the relevant functions? It might also be easier to unit test that way. If we need a dedicated class for type safety I am ok with that too.
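
The two API shapes discussed here can be contrasted with a small sketch. Everything in it is an assumption made for illustration: `ResolverApiShapes`, `filteringAliasesFunction` and the simplified `filteringAliases` are hypothetical, expressions are taken literally instead of being expanded, and cluster state is reduced to a map from index name to alias names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class ResolverApiShapes {

    // Hypothetical stand-in for expression resolution: expressions are taken literally.
    static Set<String> resolveExpressions(String... expressions) {
        return new HashSet<>(Arrays.asList(expressions));
    }

    // Shape A: return a Function that closes over the resolved expressions.
    static Function<String, String[]> filteringAliasesFunction(Map<String, List<String>> aliasesByIndex,
                                                               String... expressions) {
        Set<String> resolved = resolveExpressions(expressions);
        return index -> filteringAliases(aliasesByIndex, index, resolved);
    }

    // Shape B: the caller resolves once and passes the Set explicitly.
    static String[] filteringAliases(Map<String, List<String>> aliasesByIndex, String index,
                                     Set<String> resolvedExpressions) {
        List<String> matched = new ArrayList<>();
        for (String alias : aliasesByIndex.getOrDefault(index, Collections.emptyList())) {
            if (resolvedExpressions.contains(alias)) {
                matched.add(alias);
            }
        }
        return matched.toArray(new String[0]);
    }
}
```

With shape B the resolved set is a plain value, so a unit test can build it by hand and call the per-index method directly without going through resolution.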

s1monw · 2019-03-21T13:44:39Z

server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java

@@ -116,9 +116,10 @@ public TransportSearchAction(ThreadPool threadPool, TransportService transportSe
    private Map<String, AliasFilter> buildPerIndexAliasFilter(SearchRequest request, ClusterState clusterState,
                                                              Index[] concreteIndices, Map<String, AliasFilter> remoteAliasMap) {
        final Map<String, AliasFilter> aliasFilterMap = new HashMap<>();
+        final Function<String, AliasFilter> indexToAliasFilter = searchService.buildAliasFilter(clusterState, request.indices());;


nit: extra `;`

jpountz · 2019-03-22T08:06:06Z

@s1monw updated

`Index` interns its name and uuid. My guess is that the main goal is to avoid having duplicate strings in the representation of the cluster state. However I doubt it helps much given that we have many other objects in the cluster state that we don't try to reuse, and interning has some cost. When looking into elastic#40263 my profiler pointed to string interning because of the `Index` object that is created in `QueryShardContext` as one of the bottlenecks of the `can_match` phase.

s1monw

LGTM

s1monw · 2019-03-22T13:38:30Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexNameExpressionResolver.java

+        Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false);
+        List<String> resolvedExpressions = Arrays.asList(expressions);
+        for (ExpressionResolver expressionResolver : expressionResolvers) {
+            resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions);


I think we should also fix the `ExpressionResolver` to take and return a `Set`. I can see `WildcardExpressionResolver` is already converting from set to list internally. I am ok with doing this as a follow-up. Either way is fine.

Agreed, I added a TODO in `WildcardExpressionResolver`. I'd like to keep this change contained if possible since I intend to backport it to stable branches, so I'd rather do this as a follow-up.

jpountz · 2019-03-27T15:23:27Z

I was unhappy about the fact that this new way of resolving aliases might become slower in the case that an index has many aliases, so I refactored it in such a way that we decide how to resolve aliases based on the number of resolved expressions vs the number of aliases of the index that is being considered. Would you mind having another look?
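
A minimal sketch of that idea, under stated assumptions: `AliasLookupStrategy` is a hypothetical class, alias metadata is reduced to a map from alias name to whether the alias carries a filter, and the size comparison is only one plausible way to pick a strategy. It is not the code from this pull request.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AliasLookupStrategy {

    // indexAliases maps alias name -> whether that alias has a filter (simplified metadata).
    // The result is sorted so that it does not depend on which collection was iterated.
    static Set<String> filteringAliases(Map<String, Boolean> indexAliases, Set<String> resolvedExpressions) {
        Set<String> result = new TreeSet<>();
        if (indexAliases.size() <= resolvedExpressions.size()) {
            // Few aliases on this index: iterate them and probe the resolved expressions.
            for (Map.Entry<String, Boolean> alias : indexAliases.entrySet()) {
                if (alias.getValue() && resolvedExpressions.contains(alias.getKey())) {
                    result.add(alias.getKey());
                }
            }
        } else {
            // Many aliases on this index: iterate the expressions and probe the alias map.
            for (String expression : resolvedExpressions) {
                Boolean hasFilter = indexAliases.get(expression);
                if (hasFilter != null && hasFilter) {
                    result.add(expression);
                }
            }
        }
        return result;
    }
}
```

Either branch does work proportional to the smaller of the two collections, with one hash lookup per element into the larger one.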

s1monw

LGTM. I left a comment regarding testing.

s1monw · 2019-04-03T12:08:04Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexNameExpressionResolver.java

    }

+    // pkg-private for testing
+    Boolean forceIterateIndexAliases;


Can we rather have a method that takes the two sets and returns a boolean value that we can override in the tests? I don't like this test-only variable.
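
The suggestion could look roughly like the sketch below. `OverridableResolverSketch` and its method names are hypothetical, and both inputs are modelled as plain sets of names. The point is only the pattern: a package-private decision method replaces the nullable test-only field.

```java
import java.util.Set;
import java.util.TreeSet;

class OverridableResolverSketch {

    // pkg-private for testing: a test subclass can override this to force either code path.
    boolean iterateIndexAliases(Set<String> indexAliases, Set<String> resolvedExpressions) {
        return indexAliases.size() <= resolvedExpressions.size();
    }

    // Returns the aliases of the index that are also resolved expressions, sorted by name.
    Set<String> matchedAliases(Set<String> indexAliases, Set<String> resolvedExpressions) {
        boolean aliasDriven = iterateIndexAliases(indexAliases, resolvedExpressions);
        Set<String> iterated = aliasDriven ? indexAliases : resolvedExpressions;
        Set<String> probed = aliasDriven ? resolvedExpressions : indexAliases;
        Set<String> result = new TreeSet<>();
        for (String name : iterated) {
            if (probed.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

A test can then subclass the resolver twice, once returning `true` and once returning `false` from `iterateIndexAliases`, and check that both paths produce the same result for the same input.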

s1monw · 2019-04-03T12:08:16Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexNameExpressionResolver.java

+        Context context = new Context(state, IndicesOptions.lenientExpandOpen(), true, false);
+        List<String> resolvedExpressions = Arrays.asList(expressions);
+        for (ExpressionResolver expressionResolver : expressionResolvers) {
+            resolvedExpressions = expressionResolver.resolve(context, resolvedExpressions);


jasontedor

LGTM.

jpountz added the >bug, :Search/Search, v8.0.0 and v7.2.0 labels Mar 20, 2019

jpountz added v7.0.0 v6.7.0 v6.6.3 and removed v7.2.0 v8.0.0 labels Mar 20, 2019

jasontedor self-requested a review March 20, 2019 15:32

jpountz requested review from s1monw and martijnvg and removed request for jasontedor March 20, 2019 15:35

jpountz added v6.7.1 and removed v6.7.0 labels Mar 21, 2019

martijnvg approved these changes Mar 21, 2019

s1monw reviewed Mar 21, 2019

Refactor to avoid using a Function.

b31c693

jpountz requested a review from s1monw March 22, 2019 08:05

jpountz mentioned this pull request Mar 22, 2019

Remove String interning from o.e.index.Index. #40350

Merged

s1monw approved these changes Mar 22, 2019

jpountz added 2 commits March 22, 2019 16:26

iter

dcff4b2

iter

058c5a1

Merge branch 'master' into fix/alias_resolution_runtime

0c679d5

jpountz mentioned this pull request Mar 27, 2019

Remove String interning from o.e.index.Index. (#40350) #40517

Merged

Remove test dependency on order.

775a8e3

jpountz requested review from s1monw and jasontedor March 27, 2019 15:23

colings86 added v6.7.2 and removed v6.7.1 labels Mar 30, 2019

iter

3f8d3e4

jpountz removed the v6.6.3 label Apr 2, 2019

s1monw approved these changes Apr 3, 2019

jpountz added 2 commits April 3, 2019 14:52

Merge branch 'master' into fix/alias_resolution_runtime

530add8

Use pkg-private method instead of member.

0d83ee4

jpountz requested a review from s1monw April 3, 2019 12:58

jasontedor approved these changes Apr 3, 2019

jpountz merged commit a78d79f into elastic:master Apr 3, 2019

jpountz deleted the fix/alias_resolution_runtime branch April 3, 2019 14:54

javanna mentioned this pull request Apr 8, 2019

Lookup concrete indices and alias information via a single lookup #12173

Closed

jaymode mentioned this pull request Feb 25, 2020

ExpressionResolver uses sets instead of lists #52788

Closed

dnhatn mentioned this pull request Jun 26, 2022

Alias aggregation queries are used in large quantities without es query cache, and the performance is poor #86827

Closed
