Translog Operations with no sequence numbers fail recoveries after a rolling upgrade to 6.0 #27536

Closed
bleskes opened this issue Nov 27, 2017 · 4 comments


@bleskes
Member

commented Nov 27, 2017

During a rolling upgrade, shards on 6.0 nodes should be able to work with operations that have no sequence numbers while the primary is still on a 5.6 node. Once the primary moves to a 6.0 node, it will start generating sequence numbers for all shard copies. To reduce the number of edge cases we have to deal with, we designed the code to only make this transition in one direction - i.e., once a shard moves to the sequence numbers universe, it never goes back to operating in a non-sequence-numbers universe. We also added protections to verify that this doesn't happen.

Sadly, a rolling upgrade can lead to a situation that triggers those protections and fails recoveries. Here is a typical exception:

Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HS9TBPU9Vhf7shs_U', type='px-web-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned];

This can happen in the following scenario:

  1. A replica is on a 6.0 node while the primary is still on a 5.6 node, and indexing is ongoing.
  2. The replica writes operations with no seq# into its translog.
  3. The primary goes down (planned or not) and the replica becomes the new primary.
  4. The old primary comes back (or another shard copy is allocated) and starts recovery. While recovery is ongoing, new indexing operations come in.
  5. The recovering replica opens up its engine and processes the new indexing operations, switching to seq# mode.
  6. The old operations from the translog are replayed, and since they have no seq#s in them, the assumptions are violated and the recovery fails.

This problem is made worse by the fact that, when creating a replica, we now ship all the translog operations retained under the translog retention policy.

To work around it, people can reduce their translog retention policy to 0 (index.translog.retention.size) and then flush the primary shard. This should clean up the old ops from the translog. After that (and after setting index.translog.retention.size back to null), you can run a reroute command with the retry_failed flag: POST /_cluster/reroute?retry_failed=true.
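For reference, here is a minimal sketch of those steps, assuming the affected index is named my-index and the cluster is reachable on localhost:9200 (both are placeholders to adapt):

  # 1. Reduce the translog retention size to zero so old operations can be trimmed.
  curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
  {
    "index.translog.retention.size": "0b"
  }'

  # 2. Flush the index so the old translog generations are cleaned up.
  curl -XPOST 'localhost:9200/my-index/_flush'

  # 3. Set the retention size back to null to restore the default (512mb).
  curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
  {
    "index.translog.retention.size": null
  }'

  # 4. Retry the failed recoveries.
  curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'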

I'm still evaluating possible solutions. Based on the complexity, we can decide which version the solution should go into. It's also an open question why our tests didn't catch this.

@archanid

Contributor

commented Nov 28, 2017

@bleskes If index.translog.retention.size is set to null, what's the significance? How is it different from 0? Does it default to 512mb?

@bleskes

Member Author

commented Nov 28, 2017

@archanid setting it to null unsets it and restores the default, which in this case is indeed 512mb. Some parts of the system treat an unset setting differently from an explicitly set value, but not in this case.
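If you want to double check the effective value after unsetting it, the get settings API with include_defaults should show it, e.g. (same placeholder index name as above):

  # Once the explicit value is removed, index.translog.retention.size should show up
  # under the "defaults" section of the response with its default of 512mb.
  curl -XGET 'localhost:9200/my-index/_settings?include_defaults=true'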

bleskes added a commit that referenced this issue Nov 30, 2017
During a recovery the target shard may process both new indexing operations and old ones concurrently. When the primary is on a 6.0 node, the new indexing operations are guaranteed to have sequence numbers, but we don't have that guarantee for the old operations, as they may come from a period when the primary was on a pre-6.0 node. Having this mixture of old and new operations is something we do not support, and it triggers exceptions.

This PR adds a flush on primary promotion and primary relocation to make sure that any recovery from a primary on a 6.0 node is guaranteed to only need operations with sequence numbers. A recovery from store already flushes when we start the engine if there were any ops in the translog.

With these extra flushes in place we can now actively filter out operations that have no sequence numbers during recovery. Since filtering out operations is risky, I have opted to harden the logic in the recovery source handler to verify that no operations in the required sequence number range (from the local checkpoint in the commit onwards) are missed. This adds some extra complexity to the PR, but I think it's worth it.

Finally, I added two tests that reproduce the problems.

Closes #27536
@bleskes

Member Author

commented Dec 5, 2017

This has been forward-ported and is all fixed.

@bleskes bleskes closed this Dec 5, 2017
@woodlee


commented Dec 5, 2017

For anyone landing here from a search who's seeking concrete workaround steps for indexes affected by this issue, check out the thread at https://discuss.elastic.co/t/replica-shard-is-in-unallocated-state-after-upgrade-to-6-0-from-5-6-0
