Speed up Building Indices Lookup in Metadata #83241

original-brownbear · 2022-01-28T09:55:19Z

We can speed this up by almost a third for a large index count
where most indices are part of a data stream, by avoiding the double lookup
for the data stream abstraction in the way done here.
Also simplified the loop iteration a little to use the slightly faster cursor
and removed a needless conditional in the loop.

relates #77466
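
A minimal sketch of the idea, assuming the "double lookup" means fetching the data stream for an index from the hash map and then fetching its abstraction from the sorted lookup by name. `DataStreamAbstraction`, `ConcreteIndex` and `buildIndicesLookup` are hypothetical stand-ins for illustration, not the actual `Metadata` code; only `indicesLookup`, `indexToDataStreamLookup` and `dsAbstraction` are names taken from the diff.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class IndicesLookupSketch {
    // Hypothetical stand-in for IndexAbstraction.DataStream.
    record DataStreamAbstraction(String name, List<String> backingIndices) {}

    // Hypothetical stand-in for a concrete index entry; parent is null for standalone indices.
    record ConcreteIndex(String name, DataStreamAbstraction parent) {}

    static SortedMap<String, Object> buildIndicesLookup(Map<String, List<String>> dataStreams, List<String> indexNames) {
        SortedMap<String, Object> indicesLookup = new TreeMap<>();
        Map<String, DataStreamAbstraction> indexToDataStreamLookup = new HashMap<>();

        for (Map.Entry<String, List<String>> entry : dataStreams.entrySet()) {
            DataStreamAbstraction dsAbstraction = new DataStreamAbstraction(entry.getKey(), entry.getValue());
            Object existing = indicesLookup.put(dsAbstraction.name(), dsAbstraction);
            if (existing != null) {
                throw new IllegalStateException("duplicate data stream for " + dsAbstraction.name());
            }
            for (String backingIndex : dsAbstraction.backingIndices()) {
                // Store the abstraction itself, so the loop below needs no second tree-map lookup.
                indexToDataStreamLookup.put(backingIndex, dsAbstraction);
            }
        }

        for (String name : indexNames) {
            // One hash-map get yields the parent abstraction directly (null for standalone indices).
            DataStreamAbstraction parent = indexToDataStreamLookup.get(name);
            indicesLookup.put(name, new ConcreteIndex(name, parent));
        }
        return indicesLookup;
    }

    public static void main(String[] args) {
        SortedMap<String, Object> lookup = buildIndicesLookup(
            Map.of("logs", List.of(".ds-logs-000001", ".ds-logs-000002")),
            List.of(".ds-logs-000001", ".ds-logs-000002", "standalone")
        );
        lookup.forEach((name, abstraction) -> System.out.println(name + " -> " + abstraction));
    }
}
```

The point of the sketch is only that the hash map hands back the parent abstraction in one step; the real method also handles aliases and other details omitted here.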

elasticmachine · 2022-01-28T09:55:22Z

Pinging @elastic/es-data-management (Team:Data Management)

original-brownbear · 2022-01-28T09:55:56Z

server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java

+                final String name = entry.key;
+                final IndexMetadata indexMetadata = entry.value;
+                final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
+                assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))


assertion change is unrelated, just felt the need to clean this up and make it a little clearer what we're asserting :)

henningandersen

LGTM.

henningandersen · 2022-01-28T11:18:10Z

server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java

+                final String name = entry.key;
+                final IndexMetadata indexMetadata = entry.value;
+                final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
+                assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))


totally a nit and feel free to ignore, but since we are using streams, I would prefer to keep .map(Index::getName), i.e.:

Suggested change

assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))

assert parent == null || parent.getIndices().stream().map(Index::getName).anyMatch(name::equals)
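
For reference, the two forms of the assertion are equivalent in behavior, so the choice is purely stylistic. A small self-contained sketch, where `Index` is a hypothetical stand-in with only a `getName()` accessor and `lambdaForm` / `mapForm` are illustrative names:

```java
import java.util.List;

class AssertionFormsSketch {
    // Hypothetical stand-in for the Index class; only getName() matters here.
    static final class Index {
        private final String name;

        Index(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Form used in the PR: compare inside the anyMatch lambda.
    static boolean lambdaForm(List<Index> indices, String name) {
        return indices.stream().anyMatch(index -> name.equals(index.getName()));
    }

    // Suggested form: map to names first, then match with a method reference.
    static boolean mapForm(List<Index> indices, String name) {
        return indices.stream().map(Index::getName).anyMatch(name::equals);
    }

    public static void main(String[] args) {
        List<Index> indices = List.of(new Index(".ds-logs-000001"), new Index(".ds-logs-000002"));
        System.out.println(lambdaForm(indices, ".ds-logs-000002") == mapForm(indices, ".ds-logs-000002"));
        System.out.println(lambdaForm(indices, "missing") == mapForm(indices, "missing"));
    }
}
```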

henningandersen · 2022-01-28T11:23:48Z

server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java

                    IndexAbstraction existing = indicesLookup.put(
                        dataStream.getName(),
                        new IndexAbstraction.DataStream(dataStream, aliases)
                    );
                    assert existing == null : "duplicate data stream for " + dataStream.getName();

                    for (Index i : dataStream.getIndices()) {
-                        indexToDataStreamLookup.put(i.getName(), dataStream);
+                        indexToDataStreamLookup.put(i.getName(), dsAbstraction);


I wonder if we need to build indexToDataStreamLookup at all? Could we not simply populate indicesLookup directly here instead and ignore those below (asserting that they are data stream owned instead of the duplicate check)?

Right, we don't ... now comes the interesting question of whether this would actually be faster.

Looking at the profiling for the full method (after my change here) for a completely degenerate case where we have about as many indices as data streams (one index per DS), the hash-map building barely shows up relative to the tree-map.

So the question becomes: is it cheaper to do two puts (or, I guess, one put and one computeIfAbsent) into the tree-map, where the second put in the indices loop would go into a larger tree-map than the one in the DS loop, or to do one put into an on-average smaller tree-map and build the technically redundant hash-map instead?

I'd say profiling suggests that keeping the hash-map is faster. And in the real world, where you'd have multiple indices per DS, the decision would be even more in favour of having the hash-map, I think.
The problem we're running into is that the tree-map is very lopsided, because pretty much all indices share the .ds prefix and then share some more for the specific DS, leading to loads of comparisons.
We should've used a different naming schema for the internal DS indices I guess ... (in fact I think we still could).
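
To illustrate the shared-prefix cost described above, here is a rough sketch that counts how many characters a comparator inspects while filling a `TreeMap`, once with a long common key prefix and once without. The counting comparator, the class name, and the example prefix are assumptions made for illustration; this is not the profiling from this PR, and `String.compareTo` in a real JVM is optimized differently.

```java
import java.util.TreeMap;

class PrefixComparisonSketch {
    // Total number of characters inspected by the illustrative comparator.
    static long charsCompared;

    // Character-by-character comparison that counts every inspected position.
    static int countingCompare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            charsCompared++;
            int d = a.charAt(i) - b.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return a.length() - b.length();
    }

    // Inserts `count` zero-padded keys with the given prefix and returns the characters inspected.
    static long charsComparedWhileInserting(String prefix, int count) {
        charsCompared = 0;
        TreeMap<String, Integer> map = new TreeMap<>(PrefixComparisonSketch::countingCompare);
        for (int i = 0; i < count; i++) {
            map.put(prefix + String.format("%06d", i), i);
        }
        return charsCompared;
    }

    public static void main(String[] args) {
        // Hypothetical prefix in the spirit of the shared ".ds" naming mentioned above.
        long withPrefix = charsComparedWhileInserting(".ds-logs-app-2022.01.28-", 10_000);
        long withoutPrefix = charsComparedWhileInserting("", 10_000);
        System.out.println("with shared prefix:    " + withPrefix);
        System.out.println("without shared prefix: " + withoutPrefix);
    }
}
```

Both runs perform the same number of key comparisons, since the suffixes alone decide the ordering; the shared prefix only makes each comparison walk further before it finds a difference.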

original-brownbear · 2022-01-28T12:28:25Z

Thanks Henning! I went with my solution for now because of how painful our tree-map is. But I think we might want to improve that in a follow-up.

original-brownbear added the >non-issue, :Data Management/Indices APIs, and v8.1.0 labels Jan 28, 2022

elasticmachine added the Team:Data Management label Jan 28, 2022

original-brownbear requested review from martijnvg and henningandersen January 28, 2022 09:56

henningandersen approved these changes Jan 28, 2022

original-brownbear merged commit 52d4c89 into elastic:master Jan 28, 2022

original-brownbear deleted the faster-building-index-lookup branch January 28, 2022 12:28

original-brownbear mentioned this pull request Jan 28, 2022

Fix Large Shard Count Scalability Issues #77466

Open

97 tasks
