Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index commit id to searcher #63963

Merged
merged 18 commits into from
Dec 12, 2020
Merged

Add index commit id to searcher #63963

merged 18 commits into from
Dec 12, 2020

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 21, 2020

This commit adds an id, which is composed of the ids of segment files, to ElasticsearchDirectoryReader. With this id, we can retry search requests on another shard copy if its latest "snapshot" has the same segment files as the failing shard.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 21, 2020

I am labeling this WIP, although it's ready as I am not sure the approach in the PR. The way we compose the searcher id might be too aggressive. If we need to improve search resiliency for searchable snapshots only, then we can use the commit id instead.

@dnhatn dnhatn added WIP :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Oct 21, 2020
@dnhatn dnhatn marked this pull request as ready for review October 21, 2020 17:31
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Engine)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Oct 21, 2020
Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are two goals we could have with this id:

  1. The ability to repeat the fetch phase if it fails. Here we need identical doc-ids on the replica.
  2. The ability to continue to use a PIT despite failures in the cluster. Here we do not need identical doc-ids.

I wonder if the second part is better served by maxSeqNo instead whenever gcp==maxSeqNo? Plus primary term for correctness.

It might be two completely separate ids and thus handled in another PR. I only wanted to raise this here to gather consensus that we cannot come up with a single id or failover scheme that can serve both purposes.

// are always different although they can have the identical segment files.
final MessageDigest md = MessageDigests.sha256();
for (SegmentCommitInfo sci : segmentInfos) {
final byte[] segmentId = sci.getId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me that this is valid across different shard copies. The id generation starts somewhere random and then increments. I acknowledge the risk is small and I did not dig deeply into whether this increases the risk of collissions over using more standard uuid generation.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 26, 2020

@jimczi If we implement #56828 using a composition of indexUUID, shardId, and sequence number (instead of internal document ID), then PIT can go with Henning's option-2. WDYT?

@dnhatn dnhatn removed the request for review from jpountz November 17, 2020 18:36
@dnhatn
Copy link
Member Author

dnhatn commented Nov 18, 2020

@henningandersen @jimczi

We (Jim, Francisco, and I) discussed this PR and agreed to add two IDs (i.e., commitID and seqID) to searchers. I implemented the commitID using an external UUID that is generated when an engine flushes (see 2f410eb). However, I found that using an external UUID isn't better than the id of an index commit. Hence, I re-implemented it using the id of an index commit. Would you please take a look? Thank you!

@dnhatn dnhatn changed the title Add id to searcher Add index commit id to searcher Nov 18, 2020
@dnhatn
Copy link
Member Author

dnhatn commented Dec 2, 2020

Could we also set replicas=1 and then check that the recovered copy has the same commit id?

We flush after copying segment files in peer recovery to associate a new translog UUID, and this flush changes the commit id.

We can avoid this limitation by having an external commit id, and change it whenever the InternalEngine performs a flush. I implemented this, but I reverted it as it helps only this situation. I am happy to bring it back if you prefer.

@henningandersen
Copy link
Contributor

but I reverted it as it helps only this situation

Is the purpose of the change not that we can fail over a search to another replica copy containing the same data? I suppose the approach taken here could work for searchable snapshots, but it would be nice to have an approach that also works for shards that already went through file based peer recovery and did not see any changes since (like most warm indices today). Or is there a good reason that we cannot/should not handle that scenario?

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dnhatn
Copy link
Member Author

dnhatn commented Dec 10, 2020

Is the purpose of the change not that we can fail over a search to another replica copy containing the same data?

Yes, that's the original purpose of this PR.

I suppose the approach taken here could work for searchable snapshots.

Yes, it works with searchable snapshots.

It would be nice to have an approach that also works for shards that already went through file based peer recovery and did not see any changes since (like most warm indices today). Or is there a good reason that we cannot/should not handle that scenario?

The problem is that we flush to create a new index commit to associate with a new local translog on replicas after performing a file-based recovery. This generates a new commit_id, although the content of the index commit doesn't change. To circumvent this, we can have an external commit id that doesn't change in file-based recoveries (please see 5ec5781).

@henningandersen
Copy link
Contributor

henningandersen commented Dec 10, 2020

Thanks @dnhatn , I think we should go for the external commit id like you did in 5ec5781.

@henningandersen
Copy link
Contributor

Chatted with @dnhatn about the main issue of the external commit id: relying on lucene not merging when associating a new translog with the commit. In order to continue this path of development, let us stick to the approach you have here (using the commit id of the segment infos), or revisit your original proposal of relying on individual segment commit-ids.

@dnhatn
Copy link
Member Author

dnhatn commented Dec 12, 2020

Thanks everyone for reviewing. I am merging this PR as is and will revisit the previous approach later.

@dnhatn dnhatn merged commit 0e8e02f into elastic:master Dec 12, 2020
@dnhatn dnhatn deleted the add-searcher-id branch December 12, 2020 15:30
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 12, 2020
This change assigns the id of an index commit to a searcher, so we can
retry search requests on another shard copy if they have the same index
commit.
dnhatn added a commit that referenced this pull request Dec 12, 2020
This change assigns the id of an index commit to a searcher, so we can
retry search requests on another shard copy if they have the same index
commit.
dnhatn added a commit that referenced this pull request Dec 21, 2020
The commit id introduced in #63963 does not work well with searchable 
snapshots as we create a new index commit when restoring from snapshots.

This change revises an approach that generates an id using the ids of the
 segments of an index commit.

Relates #63963
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 21, 2020
The commit id introduced in elastic#63963 does not work well with searchable
snapshots as we create a new index commit when restoring from snapshots.

This change revises an approach that generates an id using the ids of the
 segments of an index commit.

Relates elastic#63963
dnhatn added a commit that referenced this pull request Dec 21, 2020
The commit id introduced in #63963 does not work well with searchable
snapshots as we create a new index commit when restoring from snapshots.

This change revises an approach that generates an id using the ids of the
segments of an index commit.

Relates #63963
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement Team:Distributed Meta label for distributed team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants