
Enable a long translog retention policy by default #25294

Merged
merged 29 commits into from Jun 22, 2017

4 participants
@bleskes (Member) commented Jun 19, 2017

#25147 added the translog deletion policy but didn't enable it by default. This PR enables a default retention size of 512MB (the same as the current maximum translog size) and an age of 12 hours (i.e., after 12 hours all translog files will be deleted). This increases the chance of an ops-based recovery, even if the primary flushed or the replica was offline for a few hours.

In order to see which parts of the translog are committed into Lucene, the translog stats are extended to include information about uncommitted operations.

Views now include all translog ops and guarantee, as before, that those will not go away. Snapshotting a view allows filtering out generations that are not relevant, based on a specific sequence number.

I still have to write some docs and add a migration note, but I think we can start reviewing.

Relates to #10708
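The defaults described above correspond to two index settings, index.translog.retention.size and index.translog.retention.age. A minimal sketch of overriding them via the update index settings API (PUT /my-index/_settings; the index name is illustrative):

```json
{
  "index.translog.retention.size": "512mb",
  "index.translog.retention.age": "12h"
}
```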

bleskes added some commits Jun 18, 2017

bleskes added some commits Jun 19, 2017

Merge remote-tracking branch 'upstream/master' into translog_retention_part2

# Conflicts:
#	core/src/main/java/org/elasticsearch/index/shard/IndexShard.java
@s1monw (Contributor) left a comment

left a bunch of comments

@@ -606,6 +608,9 @@ public synchronized void updateMetaData(final IndexMetaData metadata) {
if (refreshTask.getInterval().equals(indexSettings.getRefreshInterval()) == false) {
rescheduleRefreshTasks();
}
if (trimTranslogTask.getInterval().equals(indexSettings.getTranslogRetentionCheckInterval()) == false) {

@s1monw (Contributor) commented Jun 20, 2017
Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation(), and maybe also do it in IndexShard#flush? I'd love to rather remove tasks than add them?! It's just way easier to reason about things if they happen due to a user interaction from the outside.

@bleskes (Author, Member) commented Jun 20, 2017

Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation()

This is not instead, but on top of it. The problem is that if people have a 12h retention policy, we need to start cleaning stuff up 12 hours later (assuming they index at full speed and roll over to a new index).

I'd love to rather remove tasks than adding them?!

I'm totally with you. I spent some time thinking about ways to avoid the regular check, but everything I came up with ended up more complex. I considered things like scheduling on activity, etc., but always came back to this being the clearest/easiest. I'll be more than happy to hear about alternatives.

engine.getTranslog().rollGeneration();
final Translog translog = engine.getTranslog();
translog.rollGeneration();
translog.trimUnreferencedReaders();

@s1monw (Contributor) commented Jun 20, 2017

Hmm, you run translog.trimUnreferencedReaders() under the engine lock here, but you run it without a lock here, so I wonder when we have to run under the lock, and whether we can assert on that?

@bleskes (Author, Member) commented Jun 20, 2017

That's a great observation. Strictly speaking, we don't need to coordinate through the write lock here. In the place you linked to I was mostly interested in the error handling, and in the fact that we fail the engine if anything goes wrong (as we close the translog on error as well). I was worried about poisonous situations where we keep trying to get stuff from a closed translog. I will work on removing the read lock in the engine and fold this functionality (trim and roll) into the engine so we can be consistent. Agreed?

@s1monw (Contributor) commented Jun 21, 2017

I think the read lock is healthy here. I wonder if we should rather fold this part into the engine, not call it from the outside, and add a boolean rollGeneration parameter to the trim method?

@jasontedor (Member) commented Jun 21, 2017

I too find it weird that we have IndexShard#trimTranslog that goes through the engine (with error handling) but ultimately amounts to trimming unreferenced readers, and we have this method that does the same without going through the engine instead poking all the way through to the translog.
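The direction the reviewers converge on (route trimming through the engine so its error handling applies, and fail the engine if the translog misbehaves) can be sketched with a self-contained mock. All class and field names below are simplified stand-ins, not the actual Elasticsearch API:

```java
// Mock illustrating "trim goes through the engine": a translog failure
// during trimming marks the engine as failed instead of being swallowed.
public class EngineTrimSketch {
    static class TranslogClosedException extends RuntimeException {}

    static class MockTranslog {
        boolean open = true;

        void trimUnreferencedReaders() {
            if (!open) {
                throw new TranslogClosedException();
            }
            // ... delete generations no view or commit still references ...
        }
    }

    static class MockEngine {
        final MockTranslog translog = new MockTranslog();
        boolean failed = false;

        void trimTranslog() {
            try {
                translog.trimUnreferencedReaders();
            } catch (RuntimeException e) {
                failed = true; // stand-in for failEngine(): shard gets failed and recovered
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        MockEngine engine = new MockEngine();
        engine.trimTranslog();             // happy path: engine stays healthy
        engine.translog.open = false;
        try {
            engine.trimTranslog();
        } catch (RuntimeException expected) {
            // the failure propagated, and the engine was marked failed
        }
        System.out.println(engine.failed); // true
    }
}
```

The point of the design is that there is exactly one code path with the error handling, rather than one path through the engine and one poking straight at the translog.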

@@ -560,14 +594,33 @@ public Snapshot newSnapshot(long minGeneration) {
}
}

private Stream<? extends BaseTranslogReader> readersAboveMinSeqNo(long minSeqNo) {
assert readLock.isHeldByCurrentThread() || writeLock.isHeldByCurrentThread();

@s1monw (Contributor) commented Jun 20, 2017

Please have one assert per check; it's easier to see what failed.

@bleskes (Author, Member) commented Jun 20, 2017

I'm confused - this is an or check. I will add a message.

@s1monw (Contributor) commented Jun 21, 2017

Never mind, a message is good.
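The resolution here (keep the or-check as one assertion, but attach a message saying what failed) might look like the following self-contained sketch; the lock setup and message text are illustrative, not the actual Translog code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockAssertSketch {
    private static final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private static final ReentrantReadWriteLock.ReadLock readLock = rwl.readLock();
    private static final ReentrantReadWriteLock.WriteLock writeLock = rwl.writeLock();

    static boolean holdsEitherLock() {
        // ReadLock has no isHeldByCurrentThread(), so check the read hold count
        // on the owning ReentrantReadWriteLock instead.
        return rwl.getReadHoldCount() > 0 || writeLock.isHeldByCurrentThread();
    }

    static void readersAboveMinSeqNo() {
        // One assertion for the or-check, with a message naming the requirement:
        assert holdsEitherLock() : "readersAboveMinSeqNo requires the read or write lock";
    }

    public static void main(String[] args) {
        readLock.lock();
        try {
            readersAboveMinSeqNo(); // passes while the read lock is held
        } finally {
            readLock.unlock();
        }
    }
}
```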

@@ -1522,6 +1576,10 @@ public void trimUnreferencedReaders() throws IOException {
// we're shutdown potentially on some tragic event, don't delete anything
return;
}
if (readers.isEmpty()) {

@s1monw (Contributor) commented Jun 20, 2017

Maybe instead of adding this intermediate return we can invert the condition and only do the work if it's not empty?

@bleskes (Author, Member) commented Jun 20, 2017

Yeah, I can do that, and I thought about it. The reason I went with this is that to do something like shouldTrimTranslog, where I could add this shortcut, I would also have to go and ask the retention policy for the current trimming situation, add tests, etc. Since this method is not called on every write, but rather on every roll / flush / background check (every 10m), I felt this was simpler. Let me know what you think.

@s1monw (Contributor) commented Jun 21, 2017

Methods should have a single return statement, IMO; everything else is more complex in 99% of cases.
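The two shapes under discussion can be sketched side by side; the body is a stand-in for the actual trimming work, and both variants behave identically:

```java
import java.util.Arrays;
import java.util.List;

public class TrimShapes {
    // Early-return shape (as in the PR): bail out up front when there is
    // nothing to trim.
    static int trimEarlyReturn(List<String> readers) {
        if (readers.isEmpty()) {
            return 0;
        }
        // ... stand-in for the real work: trim everything but the newest reader ...
        return readers.size() - 1;
    }

    // Inverted single-return shape (as s1monw suggests): do the work only
    // when readers is non-empty, and fall through to one return.
    static int trimSingleReturn(List<String> readers) {
        int trimmed = 0;
        if (readers.isEmpty() == false) {
            trimmed = readers.size() - 1;
        }
        return trimmed;
    }

    public static void main(String[] args) {
        List<String> readers = Arrays.asList("gen-1", "gen-2", "gen-3");
        System.out.println(trimEarlyReturn(readers));  // 2
        System.out.println(trimSingleReturn(readers)); // 2
    }
}
```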

@@ -154,4 +153,8 @@ private long getMinTranslogGenRequiredByViews() {
public synchronized long getMinTranslogGenerationForRecovery() {
return minTranslogGenerationForRecovery;
}

synchronized long getViewCount(long viewGen) {
return translogRefCounts.getOrDefault(viewGen, Counter.newCounter(false)).get();

@s1monw (Contributor) commented Jun 20, 2017

You are creating a new counter anyway, even if it's already there. I think we can save that object and use computeIfAbsent?

@bleskes (Author, Member) commented Jun 20, 2017

++. will change
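The allocation difference s1monw points out comes from getOrDefault evaluating its default argument eagerly, while computeIfAbsent only invokes its supplier on a miss. A self-contained sketch (the Counter class here is a hypothetical stand-in for Lucene's Counter, instrumented to count its own allocations):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountDemo {
    static final AtomicInteger allocations = new AtomicInteger();

    // Hypothetical stand-in for Lucene's Counter; counts its own allocations.
    static final class Counter {
        static Counter newCounter() {
            allocations.incrementAndGet();
            return new Counter();
        }
        private long value;
        long get() { return value; }
    }

    static final Map<Long, Counter> refCounts = new HashMap<>();

    // getOrDefault evaluates its default argument eagerly: a Counter is
    // allocated on every call, even when the key is already present.
    static long viaGetOrDefault(long gen) {
        return refCounts.getOrDefault(gen, Counter.newCounter()).get();
    }

    // computeIfAbsent invokes the supplier only when the key is missing,
    // so lookups of an existing key allocate nothing.
    static long viaComputeIfAbsent(long gen) {
        return refCounts.computeIfAbsent(gen, g -> Counter.newCounter()).get();
    }

    public static void main(String[] args) {
        refCounts.put(1L, Counter.newCounter());  // 1 allocation
        viaGetOrDefault(1L);                      // wastes 1 more allocation
        System.out.println(allocations.get());    // 2
        viaComputeIfAbsent(1L);                   // allocates nothing
        System.out.println(allocations.get());    // still 2
    }
}
```

One caveat with computeIfAbsent on a read path: it inserts the counter when the key is missing, turning a pure lookup into a potential mutation; a plain get with a null check avoids both the allocation and the insertion.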

boolean isSequenceNumberBasedRecoveryPossible = request.startingSeqNo() != SequenceNumbersService.UNASSIGNED_SEQ_NO &&
isTranslogReadyForSequenceNumberBasedRecovery(translogView);

if (isSequenceNumberBasedRecoveryPossible) {
logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo());
startingSeqNo = request.startingSeqNo();

@s1monw (Contributor) commented Jun 20, 2017

The changes in here are unrelated to the trimming code. I wonder if we can do it in two steps and add the integration in a follow-up? That way reviews are much simpler and more effective.

@bleskes (Author, Member) commented Jun 20, 2017

Yes, this became less important than it used to be after some iterations. That said, the change is mostly related to reporting, and to the fact that acquiring a snapshot now takes a seq# in its constructor (to avoid scanning a large translog). I can try to roll that part back, but I was afraid it would change the system too much (we would stream more ops from the translog). I'm not sure it's worth it, tbh.

@s1monw (Contributor) commented Jun 21, 2017

+727 −336: I think we should try harder to make these changes more contained. Really!

@bleskes (Author, Member) commented Jun 20, 2017

@s1monw thanks. I responded to some of your comments. Will apply code changes tomorrow.

bleskes added some commits Jun 20, 2017

Revert "add time based trimming"
This reverts commit 286d4a3.
@jasontedor (Member) left a comment

I left a few comments.

@@ -1317,6 +1317,24 @@ public CommitId flush(boolean force, boolean waitIfOngoing) throws EngineExcepti
return new CommitId(newCommitId);
}

@Override
public void trimTranslog() throws EngineException {
try (ReleasableLock lock = readLock.acquire()) {

@jasontedor (Member) commented Jun 21, 2017

Would you mind renaming this to ignored? It keeps the IDE from complaining, and it makes it clear that the resource is intentionally unused through the scope of the try-with-resources block.


.filter(reader -> {
final long maxSeqNo = reader.getCheckpoint().maxSeqNo;
return maxSeqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO ||
maxSeqNo >= minSeqNo;

@jasontedor (Member) commented Jun 21, 2017

Nit: I think this fits on the preceding line without wrapping.
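The filter in the diff above keeps any generation that may still hold operations at or above minSeqNo; a generation whose checkpoint has no max seq# recorded yet must be kept conservatively. A simplified, self-contained sketch of those semantics (Reader is a stand-in for a translog reader, and -2 matches the UNASSIGNED_SEQ_NO sentinel used by the seq# code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SeqNoFilterSketch {
    static final long UNASSIGNED_SEQ_NO = -2; // sentinel: no seq# recorded yet

    // Simplified stand-in for a translog reader: just its checkpoint's maxSeqNo.
    static final class Reader {
        final String name;
        final long maxSeqNo;
        Reader(String name, long maxSeqNo) { this.name = name; this.maxSeqNo = maxSeqNo; }
    }

    // Keep any generation that may still contain ops at or above minSeqNo.
    static List<Reader> readersAboveMinSeqNo(List<Reader> readers, long minSeqNo) {
        return readers.stream()
            .filter(r -> r.maxSeqNo == UNASSIGNED_SEQ_NO || r.maxSeqNo >= minSeqNo)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Reader> readers = Arrays.asList(
            new Reader("gen-1", 99),                  // all ops below 100: filtered out
            new Reader("gen-2", 150),                 // may hold ops >= 100: kept
            new Reader("gen-3", UNASSIGNED_SEQ_NO));  // unknown: kept conservatively
        System.out.println(readersAboveMinSeqNo(readers, 100).size()); // 2
    }
}
```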

@@ -30,20 +31,29 @@

private long translogSizeInBytes;
private int numberOfOperations;
private long uncommittedSizeInBytes;
private int uncommittedOperations;

@jasontedor (Member) commented Jun 21, 2017

I wonder, since these are summed in node stats, if this should be a long (to reduce the chance of overflow)?

@bleskes (Author, Member) commented Jun 21, 2017

I agree, but I think we need to do this consistently and I didn't want to touch the other numberOfOperations field (and the bwc code that entails) in the same PR. That's why I did it like this.
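The overflow concern is easy to reproduce: summing per-shard int counters into a node-level total wraps around once it passes Integer.MAX_VALUE, while a long accumulator does not. A small sketch with made-up numbers:

```java
public class OverflowDemo {
    // Summing per-shard counts into a node-level total with an int accumulator.
    static int sumAsInt(int perShard, int shards) {
        int total = 0;
        for (int i = 0; i < shards; i++) total += perShard; // silently wraps
        return total;
    }

    // The same sum with a long accumulator does not overflow for these inputs.
    static long sumAsLong(int perShard, int shards) {
        long total = 0;
        for (int i = 0; i < shards; i++) total += perShard;
        return total;
    }

    public static void main(String[] args) {
        int perShard = 1_500_000_000;               // fits in an int on its own
        System.out.println(sumAsInt(perShard, 2));  // -1294967296 (wrapped)
        System.out.println(sumAsLong(perShard, 2)); // 3000000000
    }
}
```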

@jasontedor (Member) commented Jun 21, 2017

To be clear, you're saying that you'll address this in a follow-up?

@bleskes (Author, Member) commented Jun 21, 2017

Will do.

bleskes added some commits Jun 21, 2017

@bleskes (Author, Member) commented Jun 21, 2017

@s1monw I took out all the scheduled trimming (we can decide what to do about it later) and did the engine changes.

@jasontedor I addressed your feedback.

Can you please look again?

@jasontedor (Member) left a comment

LGTM.

@s1monw (Contributor) approved these changes Jun 22, 2017

LGTM 2

@bleskes merged commit d963882 into elastic:master Jun 22, 2017

2 checks passed

CLA: Commit author is a member of Elasticsearch
elasticsearch-ci: Build finished.
@bleskes (Author, Member) commented Jun 22, 2017

Thx @jasontedor @s1monw for the iterations.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 22, 2017

Merge branch 'master' into primary-context
* master: (56 commits)
  Initialize max unsafe auto ID timestamp on shrink
  Enable a long translog retention policy by default (elastic#25294)
  Remove `index.mapping.single_type=false` from core/tests (elastic#25331)
  test: single type defaults to true since alpha1 and not alpha3
  Get short path name for native controllers
  Live primary-replica resync (no rollback) (elastic#24841)
  Upgrade to lucene-7.0.0-snapshot-ad2cb77. (elastic#25349)
  percolator: Deprecate `document_type` parameter.
  [DOCS] Fixed typo.
  [rest-api-spec/indices.refresh] Remove old params
  Remove redundant and broken MD5 checksum from repository-s3 (elastic#25270)
  Initialize sequence numbers on a shrunken index
  Port most snapshot/restore static bwc tests to qa:full-cluster-restart (elastic#25296)
  Javadoc: ThreadPool doesn't reject while shutdown (elastic#23678)
  test: verify `size_to_upgrade_in_bytes` in assertBusy(...)
  Docs: Removed duplicated line in mapping docs
  Add backward compatibility indices for 5.4.2
  Update MockTransportService to the age of Transport.Connection (elastic#25320)
  Add version v5.4.2 after release
  IndexMetaData: Add internal format index setting (elastic#25292)
  ...

bleskes added a commit that referenced this pull request Jul 10, 2017

Add a scheduled translog retention check (#25622)
We currently check whether translog files can be trimmed whenever we create a new translog generation or close a view. However #25294 added a long translog retention period (12h, max 512MB by default), which means translog files should potentially be cleaned up long after there isn't any indexing activity to trigger flushes/the creation of new translog files. We therefore need a scheduled background check to clean up those files once they are no longer needed.

Relates to #10708

@clintongormley added the v6.0.0-beta1 label and removed the v6.0.0 label Jul 25, 2017
