Use local checkpoint to calculate min translog gen for recovery #51905

dnhatn · 2020-02-05T03:23:18Z

Today we use the translog_generation of the safe commit as the minimum required translog generation for recovery. This approach has a limitation: we cannot clean up the translog unless we flush. Reopening an already recovered engine creates a new empty translog, which stays there until we force a flush.

This commit removes the translog_generation commit tag and uses the local checkpoint of the safe commit to calculate the minimum required translog generation for recovery instead.

Closes #49970
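To illustrate the idea, here is a minimal sketch of deriving the minimum required translog generation from the local checkpoint of the safe commit. The class `MinTranslogGenSketch`, the nested `Gen` type, and the method name are hypothetical and exist only for this example; they are not the Elasticsearch classes touched by this PR. The sketch assumes generations are listed in ascending order and that an empty generation reports a max sequence number of -1.

```java
import java.util.List;

final class MinTranslogGenSketch {

    /** Hypothetical summary of one translog generation file; not an Elasticsearch class. */
    static final class Gen {
        final long generation;
        final long maxSeqNo; // -1 when the generation holds no operations

        Gen(long generation, long maxSeqNo) {
            this.generation = generation;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /**
     * Returns the smallest generation that still holds an operation with a
     * sequence number above the local checkpoint of the safe commit.
     * If no generation holds such an operation, only the newest generation is required.
     * Assumes the list is sorted by ascending generation.
     */
    static long minRequiredGeneration(List<Gen> generations, long localCheckpointOfSafeCommit) {
        if (generations.isEmpty()) {
            throw new IllegalArgumentException("at least one translog generation is required");
        }
        long firstSeqNoToReplay = localCheckpointOfSafeCommit + 1;
        for (Gen gen : generations) {
            if (gen.maxSeqNo >= firstSeqNoToReplay) {
                return gen.generation;
            }
        }
        // Everything is already contained in the safe commit: keep only the newest generation.
        return generations.get(generations.size() - 1).generation;
    }
}
```

With generations 1 (max seq no 4), 2 (max seq no 9), and an empty generation 3, a safe-commit local checkpoint of 4 yields generation 2, and a local checkpoint of 9 yields generation 3, so the older generations become eligible for cleanup without a flush.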

elasticmachine · 2020-02-05T03:23:21Z

Pinging @elastic/es-distributed (:Distributed/Engine)

ywelsch

I think that the uncommittedSizeInBytes and uncommittedOperations metrics are pretty useless today, as they are not a measure of how much data needs to be recovered/replayed after a crash. I would rather base these metrics on the local checkpoint of the safe commit in all cases, and completely disregard the last commit (which is irrelevant for recovery).
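As a rough illustration of this suggestion, the sketch below counts as "uncommitted" only those operations whose sequence number is above the local checkpoint of the safe commit. The class and method names are hypothetical, and counting per operation is an assumption made for simplicity; the actual engine may compute the metric at the granularity of whole translog generations.

```java
import java.util.List;

final class UncommittedStatsSketch {

    /**
     * Counts the translog operations that would need to be replayed after a crash,
     * assuming recovery starts from the safe commit: every operation with a
     * sequence number above the safe commit's local checkpoint.
     */
    static long uncommittedOperations(List<Long> seqNosInTranslog, long localCheckpointOfSafeCommit) {
        long count = 0;
        for (long seqNo : seqNosInTranslog) {
            if (seqNo > localCheckpointOfSafeCommit) {
                count++;
            }
        }
        return count;
    }
}
```

In this model, the last commit plays no role: only the safe commit's local checkpoint determines how much of the translog counts as uncommitted.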

server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

dnhatn · 2020-02-07T14:24:42Z

> I think that the uncommittedSizeInBytes and uncommittedOperations metrics are pretty useless today, as they are not a measure of how much data needs to be recovered/replayed after a crash. I would rather base these metrics on the local checkpoint of the safe commit in all cases, and completely disregard the last commit (which is irrelevant for recovery).

++. I pushed c4bccea.

ywelsch · 2020-02-07T15:39:02Z

There are relevant test failures here

dnhatn · 2020-02-07T16:08:59Z

@ywelsch Thanks for looking at the test failures. I've addressed them in 5ea2cae.

henningandersen

This LGTM. I am not familiar enough with all of the 6.x development to say confidently that this will work in both rolling and full-restart upgrades, but my search for an issue did not reveal anything. So it is better to wait for Yannick's approval too.

server/src/main/java/org/elasticsearch/index/engine/NoOpEngine.java

server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java

server/src/test/java/org/elasticsearch/index/shard/IndexShardIT.java

server/src/test/java/org/elasticsearch/index/engine/NoOpEngineTests.java

ywelsch

LGTM

dnhatn · 2020-02-10T13:25:31Z

@ywelsch @henningandersen Thanks for reviewing.

We roll a new translog generation and trim operations that are above the global checkpoint during primary-replica resync. If initOperations is empty, then the stale operation on replica2 will be discarded, as it is the only operation in its translog file (since #51905, where we started using the local checkpoint to calculate the minimum required translog generation for recovery). Otherwise, the stale op will be retained along with initOperations but will be skipped in snapshots. Relates #51905 Closes #52148

Since #51905, we use the local checkpoint of the safe commit to calculate the number of uncommitted operations in the translog stats. If a periodic flush triggered by afterWriteOperation completes before we sync the translog, then the last commit is not safe. We also need to sync the translog through the Engine rather than directly on the translog so that we can advance the safe commit. Relates #51905 Closes #52223
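To make the notion of "the last commit is not safe" concrete, here is a small sketch of how a safe commit can be selected. It assumes the safe commit is the newest commit whose max sequence number is at or below the persisted global checkpoint, with a fallback to the oldest commit. The class and method names are hypothetical, and the input array is assumed to be ordered from oldest to newest commit.

```java
final class SafeCommitSketch {

    /**
     * Returns the index of the safe commit: the newest commit whose max sequence
     * number does not exceed the persisted global checkpoint. Falls back to the
     * oldest commit (index 0) when no commit qualifies.
     */
    static int indexOfSafeCommit(long[] maxSeqNoPerCommit, long persistedGlobalCheckpoint) {
        if (maxSeqNoPerCommit.length == 0) {
            throw new IllegalArgumentException("at least one commit is required");
        }
        for (int i = maxSeqNoPerCommit.length - 1; i >= 0; i--) {
            if (maxSeqNoPerCommit[i] <= persistedGlobalCheckpoint) {
                return i;
            }
        }
        return 0;
    }
}
```

For commits with max sequence numbers 4 and 9 and a persisted global checkpoint of 6, the method returns index 0: the newest commit was flushed before the global checkpoint caught up, so it is not yet safe. Once the persisted global checkpoint reaches 9, the method returns index 1.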

Since #51905, we skip translog recovery if the local checkpoint of the safe commit equals the global checkpoint. This change adjusts the test not to create a new snapshot in that case. Closes #52221 Relates #51905

We need to reduce the translog sync interval for indices that use the async translog setting so that we get the safe commit within the assertBusy interval. This is needed since #51905, where we started using the local checkpoint of the safe commit to calculate the number of uncommitted operations in the translog stats. Closes #52251 Relates #51905

Asserts that no new operations have been added to the translog since we re-opened the engine. Relates #51905 Closes #52410

Use local checkpoint to calculate min translog gen for recovery

fe67337

dnhatn added the >enhancement, :Distributed/Engine, v8.0.0, and v7.7.0 labels Feb 5, 2020

dnhatn requested review from ywelsch and henningandersen February 5, 2020 03:23

dnhatn added 2 commits February 5, 2020 09:44

Merge branch 'master' into seqno-tlog-policy

403a8bd

force flush in translog yaml test

1eb0d53

ywelsch reviewed Feb 7, 2020

server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

dnhatn added 3 commits February 7, 2020 08:38

Merge branch 'master' into seqno-tlog-policy

6572ed4

Use local checkpoint of safe commit to calculate committed stats

c4bccea

restore log

2c0b257

dnhatn requested a review from ywelsch February 7, 2020 14:24

fix test

5ea2cae

henningandersen approved these changes Feb 7, 2020

dnhatn added 5 commits February 7, 2020 15:04

Merge branch 'master' into seqno-tlog-policy

6baaf90

remove uncommitted translog

63c25e2

revert checkpoint sync

630da78

assert translog ops before trimming

02103b7

comment

8c91dcf

ywelsch approved these changes Feb 10, 2020

dnhatn merged commit ebc4681 into elastic:master Feb 10, 2020

dnhatn deleted the seqno-tlog-policy branch February 10, 2020 13:26

dnhatn added the backport pending label Feb 10, 2020

This was referenced Feb 10, 2020

[CI] Failure in IndexLevelReplicationTests.testSeqNoCollision #52148

Closed

Fix testSeqNoCollision #52154

Merged

This was referenced Feb 11, 2020

Fix testPrepareIndexForPeerRecovery #52245

Merged

[CI] IndexShardIT » testMaybeFlush #52223

Closed

Fix IndexShardIT#testMaybeFlush #52247

Merged

dnhatn mentioned this pull request Feb 12, 2020

Fix testFlushOnInactive #52275

Merged

dnhatn mentioned this pull request Feb 17, 2020

Fix testRestoreLocalHistoryFromTranslog #52441

Merged

dnhatn added a commit that referenced this pull request Feb 18, 2020

Fix testRestoreLocalHistoryFromTranslog (#52441)

8ec43df

sbourke pushed a commit to sbourke/elasticsearch that referenced this pull request Feb 19, 2020

Fix testRestoreLocalHistoryFromTranslog (elastic#52441)

2b3e715

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 26, 2020

Fix testRestoreLocalHistoryFromTranslog (elastic#52441)

3c42d64

dnhatn mentioned this pull request Feb 26, 2020

Use local checkpoint to calculate min translog gen for recovery #52841

Closed

dnhatn added a commit that referenced this pull request Feb 26, 2020

Fix testRestoreLocalHistoryFromTranslog (#52441)

5aa612c

dnhatn removed the backport pending label Feb 26, 2020

tlrx mentioned this pull request Mar 9, 2020

NoOpEngineTests failure #51303

Closed

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

dliappis mentioned this pull request Apr 10, 2020

org.elasticsearch.index.translog.TranslogTests#testStats failing on 6.8 #55064

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
