Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All Translog inner closes should happen after tragedy exception is set #32674

Merged
merged 13 commits into from Aug 20, 2018

Conversation

@andrershov
Copy link
Contributor

commented Aug 7, 2018

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriter has closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This pull request fixes these 2 methods. To fix it, I've decided to pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.

@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Aug 7, 2018

@andrershov andrershov added the >bug label Aug 7, 2018

@andrershov andrershov changed the title All Translog inner closes should be because tragedy exception has happened All Translog inner closes should happen after tragedy exception is set Aug 7, 2018

@andrershov andrershov requested review from bleskes and ywelsch Aug 7, 2018

@andrershov andrershov added the v7.0.0 label Aug 7, 2018

@ywelsch
Copy link
Contributor

left a comment

Thanks for the PR. I've left some comments. I think we should also beef up the testing here to ensure that we indeed always have a tragic exception at hand in case a tragic event happened. A good starting point for this is TranslogTests#testWithRandomException, where we can sprinkle assertNotNull(failableTLog.getTragicException()); over the place (this would have actually found the rollGeneration one), and call some more of the methods on failableTLog (e.g. trimOperations) so that we get good coverage.

private synchronized void closeWithTragicEvent(final Exception ex) {
assert ex != null;
if (tragedy == null) {
tragedy = ex;
} else if (tragedy != ex) {

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 9, 2018

Contributor

this case got lost

@@ -310,8 +313,21 @@ public boolean isOpen() {
return closed.get() == false;
}

private boolean calledFromOutsideOrViaTragedyClose(){

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 9, 2018

Contributor

make this method static?

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 9, 2018

Contributor

Also, can we test this method somehow?

@bleskes

This comment has been minimized.

Copy link
Member

commented Aug 10, 2018

@ywelsch please also discuss how far this can be backported (I only see labels for 7.0.0)

@andrershov

This comment has been minimized.

Copy link
Contributor Author

commented Aug 17, 2018

@ywelsch

  • I've wrapped AtomicReference with the tragic exception with TragedyExceptionHolder to make sure we do not lose, but instead suppress subsequent exceptions
  • Made stack checking method static
  • Added a test for stack checking method - the best thing we can do here is subclassing and check for subclasses in stack checking method
  • Added non-null tragic exception assertions to tests that might fail during rollGeneration and trimOperations.

Could you please review my changes?

andrershov added 2 commits Aug 17, 2018
@ywelsch
Copy link
Contributor

left a comment

Thanks for the extra test. I've left some more comments.

@@ -51,7 +51,7 @@
/* the number of translog operations written to this file */
private volatile int operationCounter;
/* if we hit an exception that we can't recover from we assign it to this var and ship it with every AlreadyClosedException we throw */
private volatile Exception tragedy;
private TragicExceptionHolder tragedy;

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

make this final

@@ -2508,8 +2509,10 @@ public void testWithRandomException() throws IOException {
// fair enough
} catch (IOException ex) {
assertEquals(ex.getMessage(), "__FAKE__ no space left on device");
assertEquals(failableTLog.getTragicException(), ex);

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

what about the case where we catch TranslogException | MockDirectoryWrapper.FakeIOException?

@@ -2931,6 +2934,46 @@ public void testCloseSnapshotTwice() throws Exception {
}
}

public void testTranslogCloseInvariant() throws IOException {
//close method should never be called directly from Translog (the only exception is closeOnTragicEvent)

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

please add a space behind //

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

also move this comment outside of the method

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

finally, add an assumeTrue("test only works with assertions enabled", Assertions.ENABLED); to the beginning of the method

close();
} catch (final IOException | RuntimeException e) {
ex.addSuppressed(e);
if (tragedy.setTragicException(ex)) {

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

why change the semantics here to only call close when setting the tragedy the first time? Let's keep the existing semantics and make setTragicException return void

This comment has been minimized.

Copy link
@andrershov

andrershov Aug 20, 2018

Author Contributor

@ywelsch what is the reason to call close twice if every time we set tragic exception close is called?

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

simpler as it decouples these two things. And no need for having this method return a boolean.

This comment has been minimized.

Copy link
@andrershov

andrershov Aug 20, 2018

Author Contributor

@ywelsch ok, it makes sense. There is already "closed" atomic that will prevent multiple closes.

tragedy.set(ex);
return true;
} else {
if (tragedy.get() != ex) { //to ensure there is no self-suppression

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

add space after //

public boolean setTragicException(Exception ex){
assert ex != null;
if (tragedy.compareAndSet(null, ex)) {
tragedy.set(ex);

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

why double-set here? Isn't the previous line already doing this?

@andrershov

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

@ywelsch thank you for the review!
I've fixed everything except close once. I've commented on it.

@andrershov

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

@ywelsch done with changes

@ywelsch
Copy link
Contributor

left a comment

LGTM. Let's backport this to 6.4.1 and 6.5.0 as well.

* Sets the tragic exception or if the tragic exception is already set adds passed exception as suppressed exception
* @param ex tragic exception to set
*/
public void setTragicException(Exception ex){

This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

space missing between ) and {

try {
close();
} catch (final IOException | RuntimeException e) {
ex.addSuppressed(e);
}
}


This comment has been minimized.

Copy link
@ywelsch

ywelsch Aug 20, 2018

Contributor

remove extra newline

@andrershov andrershov merged commit 0749b18 into elastic:master Aug 20, 2018

0 of 2 checks passed

elasticsearch-ci Build started for merge commit.
Details
elasticsearch-ci/packaging-sample Build triggered for merge commit.
Details
@andrershov

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

Merged to master, pending backport to 6.5.0 and 6.4.1

jasontedor added a commit that referenced this pull request Aug 20, 2018
Merge branch 'master' into ccr
* master:
  Generalize remote license checker (#32971)
  Trim translog when safe commit advanced (#32967)
  Fix an inaccuracy in the dynamic templates documentation. (#32890)
  Logging: Use settings when building daemon threads (#32751)
  All Translog inner closes should happen after tragedy exception is set (#32674)
  HLREST: AwaitsFix ML Test
  Pass DiscoveryNode to initiateChannel (#32958)
  Add mzn and dz to unsupported locales (#32957)
  Use settings from the context in BootstrapChecks (#32908)
  Update docs for node specifications (#30468)
  HLRC: Forbid all Elasticsearch logging infra (#32784)
  Only configure publishing if it's applied externally (#32351)
  Fixes libs:dissect when in eclipse
  Protect ScriptedMetricIT test cases against failures on 0-doc shards (#32959) (#32968)
  [Kerberos] Add documentation for Kerberos realm (#32662)
  Watcher: Properly find next valid date in cron expressions (#32734)
  Fix some small issues in the getting started docs (#30346)
  Set forbidden APIs target compatibility to compiler java version   (#32935)
  Move connection listener to ConnectionManager (#32956)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Aug 21, 2018
Merge remote-tracking branch 'elastic/master' into allowed-index-stor…
…e-types

* elastic/master: (89 commits)
  Fix assertion in AbstractSimpleTransportTestCase (elastic#32991)
  [DOC] Splits role mapping APIs into separate pages (elastic#32797)
  HLRC: ML Close Job (elastic#32943)
  Generalize remote license checker (elastic#32971)
  Trim translog when safe commit advanced (elastic#32967)
  Fix an inaccuracy in the dynamic templates documentation. (elastic#32890)
  Logging: Use settings when building daemon threads (elastic#32751)
  All Translog inner closes should happen after tragedy exception is set (elastic#32674)
  HLREST: AwaitsFix ML Test
  Pass DiscoveryNode to initiateChannel (elastic#32958)
  Add mzn and dz to unsupported locales (elastic#32957)
  Use settings from the context in BootstrapChecks (elastic#32908)
  Update docs for node specifications (elastic#30468)
  HLRC: Forbid all Elasticsearch logging infra (elastic#32784)
  Only configure publishing if it's applied externally (elastic#32351)
  Fixes libs:dissect when in eclipse
  Protect ScriptedMetricIT test cases against failures on 0-doc shards (elastic#32959) (elastic#32968)
  [Kerberos] Add documentation for Kerberos realm (elastic#32662)
  Watcher: Properly find next valid date in cron expressions (elastic#32734)
  Fix some small issues in the getting started docs (elastic#30346)
  ...
andrershov added a commit that referenced this pull request Aug 21, 2018
All Translog inner closes should happen after tragedy exception is set (
#32674)

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.
In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
andrershov added a commit that referenced this pull request Aug 21, 2018
All Translog inner closes should happen after tragedy exception is set (
#32674)

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.
In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
@andrershov

This comment has been minimized.

Copy link
Contributor Author

commented Aug 21, 2018

Backported to 6.x and 6.4.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.