-
Notifications
You must be signed in to change notification settings - Fork 684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GEODE-4990: Cluster Config StartUp Race Condition #1730
Conversation
108629a
to
b6e2f62
Compare
|
||
logger.info("Received request for configuration : {}", groups); | ||
// Return exception to the caller so startup fails fast. | ||
if (!internalLocator.isSharedConfigurationEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check for isSharedConfigurationRunning() here also? If it's enabled but not in a running state yet, we should return this error too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @jinmeiliao,
I don't think so, but I might be wrong here... the fact that the function is executed on a locator where the cluster configuration service is disabled (internalLocator.isSharedConfigurationEnabled() == false
) can be considered as unexpected and that's the reason why the IllegalStateException
is returned to the caller (ClusterConfigurationLoader
).
It's possible (and expected if members are started up concurrently), however, for the cluster configuration service to be enabled but not yet running because it's actually starting up (internalLocator.isSharedConfigurationEnabled() == true && internalLocator.isSharedConfigurationRunning() == false
), in which case we shouldn't throw an exception but return null
from the function. This allows the caller (ClusterConfigurationLoader
) to decide whether to fail fast or keep retrying for some time, as it's currently doing here.
Please let me know your thoughts.
Cheers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whether it return null or throw an exception to the caller, the caller ClusterConfigurationLoader.requestConfigurationFromOneLocator will already return null, so I don't think it will make a difference to the caller what's the cause of it (whether it's not enabled or not started yet). The caller will just need to move on to the next locator or try again.
Also, when we get the the list of the locators this function will be executed on, we made sure the locators are just the ones with cluster configuration enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jinmeiliao,
Thanks for the heads up, I haven't realized that we were already requesting the cluster configuration service only from locators where the service is actually enabled.
Either way, I have to disagree again... returning null
when the cluster configuration service is not enabled can slow down the startup process altogether: the server will retry six times with a delay of 10 seconds after each retry, meaning that the startup failure will happen only after one minute. This shouldn't happen as we only request the cluster configuration from locators that have it enabled, but it's better to be on the safer side and fail fast on startup:
int attempts = 6;
OUTER: while (attempts > 0) {
for (InternalDistributedMember locator : locatorList) {
logger.info("Attempting to retrieve cluster configuration from {} - {} attempts remaining",
locator.getName(), attempts);
response = requestConfigurationFromOneLocator(locator, groups);
if (response != null) {
break OUTER;
}
}
try {
Thread.sleep(10000);
} catch (InterruptedException ex) {
break;
}
attempts--;
}
Moreover, and thinking more about the GetClusterConfigurationFunction
API, is more clear to return different things for different situations: if we return null
on both scenarios (cluster configuration not enabled or cluster configuration enabled but not fully loaded yet) the user of the API (there's none at the moment other than ClusterConfigurationLoader
, but it could be in the future...) doesn't get a clear perspective of what's happening... throwing an IllegalStateException
when the cluster configuration service is not enabled and returning null
when the service is enabled but not yet fully loaded is far more clear.
We could even refactor the GetClusterConfigurationFunction
as a whole to return a typed response containing more accurate information about the status of the cluster configuration service without any exceptions at all, but that's another improvement, for the future maybe.
Long story short: it feels ambiguous to return the same null
value for two totally different scenarios: cluster configuration not enabled and cluster configuration enabled but not yet fully loaded. Please let me know what you think, I can make the changes if you still believe they are needed after reading the above.
if (!internalLocator.isSharedConfigurationEnabled()) { | ||
String errorMessage = "The cluster configuration service is not enabled on this member."; | ||
logger.warn(errorMessage); | ||
context.getResultSender().lastResult(new IllegalStateException(errorMessage)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A matter of personal preference, but I think this would read better if this block explicitly return
ed at the end instead of having a long else
block following it. The try
block ends up being five tabs in and that just feels a little heavily nested for my taste.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hear you :-), the change is done.
import org.apache.geode.test.junit.categories.UnitTest; | ||
|
||
@Category(UnitTest.class) | ||
@PowerMockIgnore("*.UnitTest") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really familiar with PowerMock
, but we don't have any classes that end with .UnitTest
. Or is that ignoring third-party classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure about the answer here, all tests that use PowerMock
have this annotation enabled, and I've decided to use it as well after reading an email thread between Kirk and Jens here.
If you remove the annotation, on the other hand, the task checkMissedTests
from gradle
fails when building the project with the following exception:
The test org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunctionTest.executeShouldReturnTheRequestConfigurationWhenClusterConfigurationServiceIsEnabled does not include a junit category.
org.gradle.api.GradleException: The test org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunctionTest.executeShouldReturnTheRequestConfigurationWhenClusterConfigurationServiceIsEnabled does not include a junit category.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247)
at test_eszlyywcom1yg233zvaftl84l$_run_closure3$_closure10$_closure40.doCall(/Users/jramos/git/geode/gradle/test.gradle:100)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
at groovy.lang.Closure.call(Closure.java:414)
at org.gradle.listener.ClosureBackedMethodInvocationDispatch.dispatch(ClosureBackedMethodInvocationDispatch.java:40)
at org.gradle.listener.ClosureBackedMethodInvocationDispatch.dispatch(ClosureBackedMethodInvocationDispatch.java:25)
at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:42)
at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:221)
at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:145)
at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:58)
at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:315)
at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:225)
at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:138)
at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:35)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
at com.sun.proxy.$Proxy73.beforeTest(Unknown Source)
at org.gradle.api.internal.tasks.testing.results.TestListenerAdapter.started(TestListenerAdapter.java:41)
at sun.reflect.GeneratedMethodAccessor467.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:42)
at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:221)
at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:145)
at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:58)
at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:315)
at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:225)
at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:138)
at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:35)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
at com.sun.proxy.$Proxy75.started(Unknown Source)
at org.gradle.api.internal.tasks.testing.results.StateTrackingTestResultProcessor.started(StateTrackingTestResultProcessor.java:47)
at org.gradle.api.internal.tasks.testing.results.AttachParentTestResultProcessor.started(AttachParentTestResultProcessor.java:38)
at sun.reflect.GeneratedMethodAccessor469.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.FailureHandlingDispatch.dispatch(FailureHandlingDispatch.java:29)
at org.gradle.internal.dispatch.AsyncDispatch.dispatchMessages(AsyncDispatch.java:132)
at org.gradle.internal.dispatch.AsyncDispatch.access$000(AsyncDispatch.java:33)
at org.gradle.internal.dispatch.AsyncDispatch$1.run(AsyncDispatch.java:72)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh. Good to know!
public void requestForClusterConfiguration() throws Exception { | ||
public void requestConfigurationFromLocatorsShouldCorrectlyLoadTheClusterConfigurationEvenAfterSeveralRetries() | ||
throws Exception { | ||
int mockLimit = new Random().nextInt(6) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like seeing non-deterministic behavior in a test, beyond that which is inherent to the system. Is there a reason this can't just be explicitly chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason at all, I've changed it to use the maximum amount of attempts.
(I meant to make this a parent-comment to the review, but mis-clicked...) The only thing I'm really concerned about is using |
@@ -308,7 +308,7 @@ public ConfigurationResponse requestConfigurationFromLocators(String groupList, | |||
return response; | |||
} | |||
|
|||
private ConfigurationResponse requestConfigurationFromOneLocator( | |||
protected ConfigurationResponse requestConfigurationFromOneLocator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 324 in this method, any exception we've got from the function execution, we log it and then a null ConfigResponse is returned. So even if you throw an exception when CC is not enabled, it's still going to try again. And like I mentioned before, on the locator where this function is enabled, the CC is enabled. I understand that you want to make the function work in both cases for future usage, but that may not be needed. Would be better to put a comment in the function saying that "this function is only executed on locators with CC enabled".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @jinmeiliao,
Sorry to disagree again, but this is not accurate...
The code Object result = ((ArrayList) resultCollector.getResult()).get(0);
(line 318) will actually throw a FunctionException
and, thus, line 324 doesn't come into play at all, the next execution step will be line 329. Altogether it implies that the ClusterConfigurationLoader
doesn't return null
and throws an exception instead, skipping the retry process and failing fast. This scenario is covered by test method requestConfigurationFromLocatorsShouldThrowExceptionWhenClusterConfigurationServiceIsNotEnabled
within class ClusterConfigurationLoaderIntegrationTest
, by the way.
Cheers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see. I thought the exception was wrapped in a CliFunctionResult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jinmeiliao!! 👍
@jujoramos the precheckin has some failures which I am not sure if it's environment related. Just to be safe, can you rebase your changes to the current develop and then do a force commit to trigger the precheckin to run again with all the latest fixes? Thanks! |
719f261
to
896b1bd
Compare
Done @jinmeiliao, I've rebased against develop and executed a |
- Added unit test for `GetClusterConfigurationFunction`. - Added integration tests for `ClusterConfigurationLoader`. - Set 'InternalLocator.isSharedConfigurationEnabled()' visibility as `public`. - Set `ClusterConfigurationLoader.requestConfigurationFromOneLocator()` visibility as `protected`. - `GetClusterConfigurationFunction` now throws an exception if the cluster configuraiton service is not enabled on the locator. - `GetClusterConfigurationFunction` now returns null if the cluster configuraiton service is enabled but not yet started. The caller can decide whether to retry or fail fast.
- Changed the log-level from ERROR to WARN within the GetClusterConfigurationFunction. LockServiceDestroyedException was being thrown before but it was logged using log-level=DEBUG, that's why test didn't fail before commit e98b4bf.
896b1bd
to
40c81d5
Compare
Hello @jinmeiliao I found where the problem is...
This is the reason why the test wasn't failing before: the |
GEODE-4990: Cluster Config StartUp Race Condition
GetClusterConfigurationFunction
.ClusterConfigurationLoader
.as
public
.ClusterConfigurationLoader.requestConfigurationFromOneLocator()
visibility as
protected
.GetClusterConfigurationFunction
now throws an exception if thecluster configuraiton service is not enabled on the locator.
GetClusterConfigurationFunction
now returns null if the clusterconfiguraiton service is enabled but not yet started. The caller can
decide whether to retry or fail fast.
Thank you for submitting a contribution to Apache Geode.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Has your PR been rebased against the latest commit within the target branch (typically
develop
)?Is your initial contribution a single, squashed commit?
Does
gradlew build
run cleanly?Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.