PrimaryNode: add configurable timeout to waitForAllRemotesToClose #11822

Merged: 1 commit merged on Oct 19, 2022

stevenschlansker
Contributor

Adds a configurable timeout for PrimaryNode waiting for remotes to close.
The default matches existing behavior. In the long run, maybe this wait loop goes away entirely, but I elected to make the most compatible change first.

Fixes #11674
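As a rough illustration of the change described above, here is a minimal, self-contained Java sketch of a close path that waits for outstanding remotes, bounded by a configurable timeout. It is not Lucene's `PrimaryNode` code: the class `RemoteCloseWaiter` and its methods are hypothetical, and the timeout semantics (-1 waits indefinitely as before, 0 does not wait, a positive value waits at most that many milliseconds) are assumptions consistent with this thread, not confirmed from the Lucene source.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not Lucene's PrimaryNode: tracks open remote (replica)
 * connections and lets close() wait for them with a configurable timeout.
 * Assumed semantics: -1 = wait indefinitely (legacy behavior),
 * 0 = do not wait, > 0 = wait at most that many milliseconds.
 */
class RemoteCloseWaiter {
  private int openRemotes; // guarded by "this"
  private long remoteCloseTimeoutMs = -1; // assumed default: wait forever, as before

  synchronized void setRemoteCloseTimeoutMs(long ms) {
    this.remoteCloseTimeoutMs = ms; // input validation omitted in this sketch
  }

  synchronized void remoteOpened() {
    openRemotes++;
  }

  synchronized void remoteClosed() {
    openRemotes--;
    notifyAll(); // wake a thread blocked in waitForAllRemotesToClose()
  }

  /** Returns true if every remote closed, false if we gave up (timeout or interrupt). */
  synchronized boolean waitForAllRemotesToClose() {
    if (remoteCloseTimeoutMs == 0) {
      return openRemotes == 0; // do not wait at all
    }
    long deadlineNs = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(remoteCloseTimeoutMs);
    try {
      while (openRemotes > 0) {
        if (remoteCloseTimeoutMs < 0) {
          wait(); // legacy behavior: block until every remote reports closed
        } else {
          long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNs - System.nanoTime());
          if (remainingMs <= 0) {
            return false; // timed out: stop waiting on replicas that may never close
          }
          wait(remainingMs);
        }
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }
}
```

The deadline is computed once from `System.nanoTime()`, so spurious wake-ups or notifications from other remotes do not extend the total wait. The `remainingMs <= 0` check has to come before `wait(remainingMs)`, because `Object.wait(0)` blocks indefinitely rather than returning immediately.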

Contributor

@zhaih zhaih left a comment

Thanks for making the change, overall LGTM, just have some small comments.

Also please don't forget to add an entry in CHANGES.txt

Member

@mikemccand mikemccand left a comment

This makes total sense -- tying Primary to the behavior of Replicas is dangerous coupling. It would be great to fix/improve the design so that Primary never has to wait for Replicas to do anything, but let's start with this great baby step.

@stevenschlansker stevenschlansker force-pushed the primarynode-close-timeout branch 2 times, most recently from a4ae16a to 4280bb7 on October 18, 2022 22:29
@stevenschlansker
Contributor Author

stevenschlansker commented Oct 18, 2022

I updated this PR to rename the field to include an Ms suffix and to throw an exception for values < -1.
I added test cases for both no timeout (0) and 1000ms. I verified that the test fails (doesn't terminate) without the new configuration option set.
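The validation rule described above (values below -1 are rejected) could look like the following sketch. The class and method names are hypothetical and this is not the actual Lucene setter; it assumes -1 remains a valid value (the default) and that a rejected value leaves the previous setting untouched.

```java
/**
 * Hypothetical sketch of the validation described in this comment:
 * -1 (assumed to mean "wait indefinitely"), 0, and positive values are accepted;
 * anything below -1 is rejected and the previous value is kept.
 */
class CloseTimeoutConfig {
  private long remoteCloseTimeoutMs = -1; // assumed default, matching the legacy behavior

  void setRemoteCloseTimeoutMs(long ms) {
    if (ms < -1) {
      throw new IllegalArgumentException("remoteCloseTimeoutMs must be >= -1, got " + ms);
    }
    this.remoteCloseTimeoutMs = ms;
  }

  long getRemoteCloseTimeoutMs() {
    return remoteCloseTimeoutMs;
  }
}
```

Failing fast in the setter keeps an obviously invalid value from reaching the wait loop, where a negative timeout other than the sentinel would otherwise have no defined meaning.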

I don't think this is a great candidate for a random test - randomization is wonderful for perturbing data and finding edge cases in algorithms. In this case, it is just an int we compare to a clock, so randomizing doesn't seem likely to uncover any helpful edge cases. Please let me know if this is still desired.

Contributor

@zhaih zhaih left a comment

Thanks @stevenschlansker, LGTM. I think this change contains no backward-incompatible API changes, so we should include it in 9.5?

Review thread on lucene/CHANGES.txt (outdated, resolved)
@zhaih zhaih merged commit f3d85be into apache:main Oct 19, 2022
@stevenschlansker stevenschlansker deleted the primarynode-close-timeout branch October 19, 2022 00:38
@zhaih
Contributor

zhaih commented Oct 19, 2022

I merged it, but it seems there's a test failure:

org.apache.lucene.index.TestIndexFileDeleter > test suite's output saved to /home/runner/work/lucene/lucene/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexFileDeleter.txt, copied below:
   >     org.apache.lucene.store.AlreadyClosedException: ReaderPool is already closed
   >         at __randomizedtesting.SeedInfo.seed([FF62209E9305A732:16FF57ACE5CC40CF]:0)
   >         at app//org.apache.lucene.index.ReaderPool.get(ReaderPool.java:400)
   >         at app//org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3922)
   >         at app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:592)
   >         at app//org.apache.lucene.index.IndexWriter$4.getReader(IndexWriter.java:6479)
   >         at app//org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:488)
   >         at app//org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:420)
   >         at app//org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef(TestIndexFileDeleter.java:485)
   >         at java.base@17.0.4.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base@17.0.4.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   >         at java.base@17.0.4.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base@17.0.4.1/java.lang.reflect.Method.invoke(Method.java:568)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base@17.0.4.1/java.lang.Thread.run(Thread.java:833)
  2> NOTE: reproduce with: gradlew test --tests TestIndexFileDeleter.testExcInDecRef -Dtests.seed=FF62209E9305A732 -Dtests.locale=is-IS -Dtests.timezone=Africa/Maseru -Dtests.asserts=true -Dtests.file.encoding=UTF-8

But it seems unrelated to this PR; I'll create another issue if the failure is reproducible.
Thanks @stevenschlansker !

@stevenschlansker
Contributor Author

OK. I did run ./gradlew check, so I don't think I broke anything, but please let me know if it does turn out to be related!

@zhaih
Contributor

zhaih commented Oct 20, 2022 via email

@rmuir rmuir added this to the 9.5.0 milestone Jan 17, 2023
Linked issue (may be closed by merging this pull request):

PrimaryNode close waits for replicas to close, but there is no guarantee they ever will [LUCENE-10638]
4 participants