Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEODE-4990: Cluster Config StartUp Race Condition #1730

Merged
merged 3 commits into from Apr 13, 2018

Conversation

jujoramos
Copy link
Contributor

GEODE-4990: Cluster Config StartUp Race Condition

  • Added unit test for GetClusterConfigurationFunction.
  • Added integration tests for ClusterConfigurationLoader.
  • Set 'InternalLocator.isSharedConfigurationEnabled()' visibility
    as public.
  • Set ClusterConfigurationLoader.requestConfigurationFromOneLocator()
    visibility as protected.
  • GetClusterConfigurationFunction now throws an exception if the
    cluster configuraiton service is not enabled on the locator.
  • GetClusterConfigurationFunction now returns null if the cluster
    configuraiton service is enabled but not yet started. The caller can
    decide whether to retry or fail fast.

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • Has your PR been rebased against the latest commit within the target branch (typically develop)?

  • Is your initial contribution a single, squashed commit?

  • Does gradlew build run cleanly?

  • Have you written or updated unit tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.


logger.info("Received request for configuration : {}", groups);
// Return exception to the caller so startup fails fast.
if (!internalLocator.isSharedConfigurationEnabled()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check for isSharedConfigurationRunning() here also? If it's enabled but not in a running state yet, we should return this error too, right?

Copy link
Contributor Author

@jujoramos jujoramos Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @jinmeiliao,

I don't think so, but I might be wrong here... the fact that the function is executed on a locator where the cluster configuration service is disabled (internalLocator.isSharedConfigurationEnabled() == false) can be considered as unexpected and that's the reason why the IllegalStateException is returned to the caller (ClusterConfigurationLoader).
It's possible (and expected if members are started up concurrently), however, for the cluster configuration service to be enabled but not yet running because it's actually starting up (internalLocator.isSharedConfigurationEnabled() == true && internalLocator.isSharedConfigurationRunning() == false), in which case we shouldn't throw an exception but return null from the function. This allows the caller (ClusterConfigurationLoader) to decide whether to fail fast or keep retrying for some time, as it's currently doing here.
Please let me know your thoughts.
Cheers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether it return null or throw an exception to the caller, the caller ClusterConfigurationLoader.requestConfigurationFromOneLocator will already return null, so I don't think it will make a difference to the caller what's the cause of it (whether it's not enabled or not started yet). The caller will just need to move on to the next locator or try again.

Also, when we get the the list of the locators this function will be executed on, we made sure the locators are just the ones with cluster configuration enabled.

Copy link
Contributor Author

@jujoramos jujoramos Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jinmeiliao,

Thanks for the heads up, I haven't realized that we were already requesting the cluster configuration service only from locators where the service is actually enabled.
Either way, I have to disagree again... returning null when the cluster configuration service is not enabled can slow down the startup process altogether: the server will retry six times with a delay of 10 seconds after each retry, meaning that the startup failure will happen only after one minute. This shouldn't happen as we only request the cluster configuration from locators that have it enabled, but it's better to be on the safer side and fail fast on startup:

    int attempts = 6;
    OUTER: while (attempts > 0) {
      for (InternalDistributedMember locator : locatorList) {
        logger.info("Attempting to retrieve cluster configuration from {} - {} attempts remaining",
            locator.getName(), attempts);
        response = requestConfigurationFromOneLocator(locator, groups);
        if (response != null) {
          break OUTER;
        }
      }

      try {
        Thread.sleep(10000);
      } catch (InterruptedException ex) {
        break;
      }

      attempts--;
    }

Moreover, and thinking more about the GetClusterConfigurationFunction API, is more clear to return different things for different situations: if we return null on both scenarios (cluster configuration not enabled or cluster configuration enabled but not fully loaded yet) the user of the API (there's none at the moment other than ClusterConfigurationLoader, but it could be in the future...) doesn't get a clear perspective of what's happening... throwing an IllegalStateException when the cluster configuration service is not enabled and returning null when the service is enabled but not yet fully loaded is far more clear.
We could even refactor the GetClusterConfigurationFunction as a whole to return a typed response containing more accurate information about the status of the cluster configuration service without any exceptions at all, but that's another improvement, for the future maybe.
Long story short: it feels ambiguous to return the same null value for two totally different scenarios: cluster configuration not enabled and cluster configuration enabled but not yet fully loaded. Please let me know what you think, I can make the changes if you still believe they are needed after reading the above.

if (!internalLocator.isSharedConfigurationEnabled()) {
String errorMessage = "The cluster configuration service is not enabled on this member.";
logger.warn(errorMessage);
context.getResultSender().lastResult(new IllegalStateException(errorMessage));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A matter of personal preference, but I think this would read better if this block explicitly returned at the end instead of having a long else block following it. The try block ends up being five tabs in and that just feels a little heavily nested for my taste.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you :-), the change is done.

import org.apache.geode.test.junit.categories.UnitTest;

@Category(UnitTest.class)
@PowerMockIgnore("*.UnitTest")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really familiar with PowerMock, but we don't have any classes that end with .UnitTest. Or is that ignoring third-party classes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about the answer here, all tests that use PowerMock have this annotation enabled, and I've decided to use it as well after reading an email thread between Kirk and Jens here.
If you remove the annotation, on the other hand, the task checkMissedTests from gradle fails when building the project with the following exception:

The test org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunctionTest.executeShouldReturnTheRequestConfigurationWhenClusterConfigurationServiceIsEnabled does not include a junit category.
org.gradle.api.GradleException: The test org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunctionTest.executeShouldReturnTheRequestConfigurationWhenClusterConfigurationServiceIsEnabled does not include a junit category.
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
        at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247)
        at test_eszlyywcom1yg233zvaftl84l$_run_closure3$_closure10$_closure40.doCall(/Users/jramos/git/geode/gradle/test.gradle:100)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
        at groovy.lang.Closure.call(Closure.java:414)
        at org.gradle.listener.ClosureBackedMethodInvocationDispatch.dispatch(ClosureBackedMethodInvocationDispatch.java:40)
        at org.gradle.listener.ClosureBackedMethodInvocationDispatch.dispatch(ClosureBackedMethodInvocationDispatch.java:25)
        at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:42)
        at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:221)
        at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:145)
        at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:58)
        at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:315)
        at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:225)
        at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:138)
        at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:35)
        at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
        at com.sun.proxy.$Proxy73.beforeTest(Unknown Source)
        at org.gradle.api.internal.tasks.testing.results.TestListenerAdapter.started(TestListenerAdapter.java:41)
        at sun.reflect.GeneratedMethodAccessor467.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
        at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:42)
        at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:221)
        at org.gradle.internal.event.BroadcastDispatch$SingletonDispatch.dispatch(BroadcastDispatch.java:145)
        at org.gradle.internal.event.AbstractBroadcastDispatch.dispatch(AbstractBroadcastDispatch.java:58)
        at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:315)
        at org.gradle.internal.event.BroadcastDispatch$CompositeDispatch.dispatch(BroadcastDispatch.java:225)
        at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:138)
        at org.gradle.internal.event.ListenerBroadcast.dispatch(ListenerBroadcast.java:35)
        at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
        at com.sun.proxy.$Proxy75.started(Unknown Source)
        at org.gradle.api.internal.tasks.testing.results.StateTrackingTestResultProcessor.started(StateTrackingTestResultProcessor.java:47)
        at org.gradle.api.internal.tasks.testing.results.AttachParentTestResultProcessor.started(AttachParentTestResultProcessor.java:38)
        at sun.reflect.GeneratedMethodAccessor469.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
        at org.gradle.internal.dispatch.FailureHandlingDispatch.dispatch(FailureHandlingDispatch.java:29)
        at org.gradle.internal.dispatch.AsyncDispatch.dispatchMessages(AsyncDispatch.java:132)
        at org.gradle.internal.dispatch.AsyncDispatch.access$000(AsyncDispatch.java:33)
        at org.gradle.internal.dispatch.AsyncDispatch$1.run(AsyncDispatch.java:72)
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
        at org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:46)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. Good to know!

public void requestForClusterConfiguration() throws Exception {
public void requestConfigurationFromLocatorsShouldCorrectlyLoadTheClusterConfigurationEvenAfterSeveralRetries()
throws Exception {
int mockLimit = new Random().nextInt(6) + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like seeing non-deterministic behavior in a test, beyond that which is inherent to the system. Is there a reason this can't just be explicitly chosen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason at all, I've changed it to use the maximum amount of attempts.

@PurelyApplied
Copy link
Member

(I meant to make this a parent-comment to the review, but mis-clicked...)

The only thing I'm really concerned about is using Random in our tests. The other two are meant to be comments more than Requesting Changes items.

@@ -308,7 +308,7 @@ public ConfigurationResponse requestConfigurationFromLocators(String groupList,
return response;
}

private ConfigurationResponse requestConfigurationFromOneLocator(
protected ConfigurationResponse requestConfigurationFromOneLocator(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 324 in this method, any exception we've got from the function execution, we log it and then a null ConfigResponse is returned. So even if you throw an exception when CC is not enabled, it's still going to try again. And like I mentioned before, on the locator where this function is enabled, the CC is enabled. I understand that you want to make the function work in both cases for future usage, but that may not be needed. Would be better to put a comment in the function saying that "this function is only executed on locators with CC enabled".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @jinmeiliao,

Sorry to disagree again, but this is not accurate...
The code Object result = ((ArrayList) resultCollector.getResult()).get(0); (line 318) will actually throw a FunctionException and, thus, line 324 doesn't come into play at all, the next execution step will be line 329. Altogether it implies that the ClusterConfigurationLoader doesn't return null and throws an exception instead, skipping the retry process and failing fast. This scenario is covered by test method requestConfigurationFromLocatorsShouldThrowExceptionWhenClusterConfigurationServiceIsNotEnabled within class ClusterConfigurationLoaderIntegrationTest, by the way.
Cheers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see. I thought the exception was wrapped in a CliFunctionResult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jinmeiliao!! 👍

@jinmeiliao
Copy link
Member

@jujoramos the precheckin has some failures which I am not sure if it's environment related. Just to be safe, can you rebase your changes to the current develop and then do a force commit to trigger the precheckin to run again with all the latest fixes? Thanks!

@jujoramos
Copy link
Contributor Author

Done @jinmeiliao, I've rebased against develop and executed a push --force.
Hope everything's fine now, please let me know if something's missing!.
Cheers.

- Added unit test for `GetClusterConfigurationFunction`.
- Added integration tests for `ClusterConfigurationLoader`.
- Set 'InternalLocator.isSharedConfigurationEnabled()' visibility
  as `public`.
- Set `ClusterConfigurationLoader.requestConfigurationFromOneLocator()`
  visibility as `protected`.
- `GetClusterConfigurationFunction` now throws an exception if the
  cluster configuraiton service is not enabled on the locator.
- `GetClusterConfigurationFunction` now returns null if the cluster
  configuraiton service is enabled but not yet started. The caller can
  decide whether to retry or fail fast.
- Changed the log-level from ERROR to WARN within the
  GetClusterConfigurationFunction. LockServiceDestroyedException was
  being thrown before but it was logged using log-level=DEBUG, that's
  why test didn't fail before commit e98b4bf.
@jujoramos
Copy link
Contributor Author

Hello @jinmeiliao

I found where the problem is...
In the previous version of the GetClusterConfigurationFunction we were catching and logging only instances of the IOException class when calling the method clusterConfigurationService.createConfigurationResponse(Set<String> groups). That particular method, however, never throws that Exception anyway, so I directly removed the throws clause from the declaration and modified the GetClusterConfigurationFunction as below:

try {
  ConfigurationResponse response =
    clusterConfigurationService.createConfigurationResponse(groups);
  context.getResultSender().lastResult(response);
} catch (Exception exception) {
  logger.warn("Unable to retrieve the cluster configuration", exception);
  context.getResultSender().lastResult(exception);
}


This means that we are catching and logging (using log-level=ERROR) every single exception that might occur while creating the ConfigurationResponse. That said, the class MemberFunctionStreamingMessage already catches all exceptions thrown by the function execution, but it logs them using log-level=DEBUG:

...
  try {
...
      this.functionObject.execute(context);
      if (!this.replyLastMsg && this.functionObject.hasResult()) {
        throw new FunctionException(
            LocalizedStrings.ExecuteFunction_THE_FUNCTION_0_DID_NOT_SENT_LAST_RESULT
                .toString(functionObject.getId()));
      }
      stats.endFunctionExecution(start, this.functionObject.hasResult());
...
    } catch (Exception exception) {
      if (logger.isDebugEnabled()) {
        logger.debug("Exception occurred on remote member while executing Function: {}",
            this.functionObject.getId(), exception);
      }
      stats.endFunctionExecutionWithException(this.functionObject.hasResult());
      rex = new ReplyException(exception);
      replyWithException(dm, rex);
      // thr = e.getCause();
...

This is the reason why the test wasn't failing before: the LockServiceDestroyedException was being thrown way before my changes were applied, but it was logged using log-level=DEBUG (it seems to be expected at this point of the execution), because of that the LogConsumer class wasn't even receiving the event and, thus, no suspicious strings were logged at all so the test didn't fail. Considering that the exception is expected at this point, I've changed logger.error() to logger.warn() within the function implementation and everything works fine.
The changes have been rebased and committed already, please let me know if you want me to change something else (didn't squash everything into one single commit to keep the history, but I can do it if you want).
Cheers.

@jinmeiliao jinmeiliao merged commit d4a3689 into apache:develop Apr 13, 2018
@jujoramos jujoramos deleted the feature/GEODE-4990 branch May 24, 2019 23:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants