Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Building Indices Lookup in Metadata #83241

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Jan 28, 2022

We can speed this up by almost a third in case of a large index count
where most indices are part of a datastream by avoiding the double lookup
for the ds abstraction in the way done here.
Also, simplified the loop iteration a little to use the slightly faster cursor
and removed some needless conditional in the loop.

relates #77466

We can speed this up by almost a third in case of a large index count
where most indices are part of a datastream by avoiding the double lookup
for the ds abstraction in the way done here.
Also, simplified the loop iteration a little to use the slightly faster cursor
and removed some needless conditional in the loop.
@original-brownbear original-brownbear added >non-issue :Data Management/Indices APIs APIs to create and manage indices and templates v8.1.0 labels Jan 28, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jan 28, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

final String name = entry.key;
final IndexMetadata indexMetadata = entry.value;
final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertion change is unrelated, just felt the need to clean this up and make it a little clearer what we're asserting :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally a nit and feel free to ignore, but since we are using streams, I would prefer to keep .map(Index::getName), i.e.:

Suggested change
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
assert parent == null || parent.getIndices().stream().map(Index::getName).anyMatch(name::equals)

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

final String name = entry.key;
final IndexMetadata indexMetadata = entry.value;
final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally a nit and feel free to ignore, but since we are using streams, I would prefer to keep .map(Index::getName), i.e.:

Suggested change
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
assert parent == null || parent.getIndices().stream().map(Index::getName).anyMatch(name::equals)

IndexAbstraction existing = indicesLookup.put(
dataStream.getName(),
new IndexAbstraction.DataStream(dataStream, aliases)
);
assert existing == null : "duplicate data stream for " + dataStream.getName();

for (Index i : dataStream.getIndices()) {
indexToDataStreamLookup.put(i.getName(), dataStream);
indexToDataStreamLookup.put(i.getName(), dsAbstraction);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to build indextToDataStreamLookup at all? Could we not simply populate indicesLookup directly here instead and ignore those below (asserting that they are data stream owned instead of the duplicate check)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right we don't ... now comes the interesting question of whether this would actually be faster.

Looking at the profiling for the full method (after my change here):

image

for a completely degenerate case where we have about as many indices as datastreams (one index per DS) ...
the hash map building barely shows up relative to the tree-map.

So the question becomes, is it cheaper to do 2 puts (or I guess one put and one computeIfAbsent) to the tree-map (the second put in the indices loop would be to a larger tree map than the one in the DS loop) or do one put into an on average smaller tree map and instead build the technically redundant hash-map.

Profiling suggests that this is faster I'd say. And in the real world, where you'd have multiple indices per DS the decision would be even more in favour of having the hash-map I think.
The problem we're running into is that the tree-map is super lob sided because we have the shared .ds prefix for all indices pretty much and then some more sharing for the specific DS, leading to loads of comparisons.
We should've used a different naming schema for the internal DS indices I guess ... (in fact I think we still could).

@original-brownbear
Copy link
Member Author

Thanks Henning! I went with my solution for now because of the way our tree-map is painful. But I think we might want to improve that in a follow-up.

@original-brownbear original-brownbear merged commit 52d4c89 into elastic:master Jan 28, 2022
@original-brownbear original-brownbear deleted the faster-building-index-lookup branch January 28, 2022 12:28
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jan 31, 2022
* upstream/master: (100 commits)
  Avoid duplicate _type fields in v7 compat layer (elastic#83239)
  Bump bundled JDK to 17.0.2+8 (elastic#83243)
  [DOCS] Correct header syntax (elastic#83275)
  Add unit tests for indices.recovery.max_bytes_per_sec default values (elastic#83261)
  [DOCS] Add note that write indices are not replicated (elastic#82997)
  Add notes on indexing to kNN search guide (elastic#83188)
  Fix get-snapshot-api :docs:integTest (elastic#83273)
  FilterPathBasedFilter support match fieldname with dot (elastic#83178)
  Fix compilation issues in example-plugins (elastic#83258)
  fix ClusterStateListener javadoc (elastic#83246)
  Speed up Building Indices Lookup in Metadata (elastic#83241)
  Mute whole suite for elastic#82502 (elastic#83252)
  Make PeerFinder log messages happier (elastic#83222)
  [Docs] Add supported _terms_enum field types (elastic#83244)
  Add an aggregator for IPv4 and IPv6 subnets (elastic#82410)
  [CI] Fix 70_time_series/default sort yaml test failures (elastic#83217)
  Update test-failure Issue Template to include "needs:triage" label elastic#83226
  Add an index->step cache to the PolicyStepsRegistry (elastic#82316)
  Improve support for joda datetime to java datetime transition in Painless (elastic#83099)
  Fix joda migration for week based methods in Painless (elastic#83232)
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/TransportRollupAction.java
idegtiarenko pushed a commit to idegtiarenko/elasticsearch that referenced this pull request Feb 1, 2022
We can speed this up by almost a third in case of a large index count
where most indices are part of a datastream by avoiding the double lookup
for the ds abstraction in the way done here.
Also, simplified the loop iteration a little to use the slightly faster cursor
and removed some needless conditional in the loop.
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Mar 3, 2022
We can speed this up by almost a third in case of a large index count
where most indices are part of a datastream by avoiding the double lookup
for the ds abstraction in the way done here.
Also, simplified the loop iteration a little to use the slightly faster cursor
and removed some needless conditional in the loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates >non-issue Team:Data Management Meta label for data/management team v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants