Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/GEODE-10228 DurableClientTestCase.testDurableHAFailover is failing #7608

Merged
merged 12 commits into from Apr 28, 2022

Conversation

mhansonp
Copy link
Contributor

@mhansonp mhansonp commented Apr 19, 2022

There were a couple of basic timing issues with the code.

  • Starting the cache server is not a synchronous event.
  • We were not waiting for the queues to clear before restarting, this allowed messages to "persist"
    across a restart of the durable client.

Basically all of the problems here are timing issues in the tests.

See my PR comments for key changes

- The test was failing because it didn't wait for the
HARegionQueue to clear before shutting down the durable
client for test. Thus when it came back up, there was
an extra message in the queue.

- Reverse the order of readyforevents and registerinterest
@@ -429,6 +431,10 @@ public void run2() throws CacheException {
@Test
public void testDurableNonHAFailover() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two tests here previously testing two things each.
Broke them out into four tests doing one thing.

}
});

registerInterest(durableClientVM, regionName, true, InterestResultPolicy.NONE);
// Verify the durable client is connected to server1
verifyDurableClientPresent(VERY_LONG_DURABLE_TIMEOUT_SECONDS, durableClientId, server1VM);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify the servers are up and ready before proceeding.

verifyDurableClientPresent(VERY_LONG_DURABLE_TIMEOUT_SECONDS, durableClientId, server1VM);
// Wait to until the HARegionQueue has emptied before disconnecting the Durable client
// to avoid having left over messages that processed, in particular, the key 0 message
waitUntilHARegionQueueSizeIsZero(server1VM);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait for the HA region queues to clear before disconnecting the durable client we want to process all the messages first, then disconnect, otherwise we will see them again when the client restarts.


// Verify the durable client received the updates before failover
if (redundancyLevel == 1) {
checkListenerEvents(4, 1, -1, durableClientVM);
checkListenerEvents(3, 1, -1, durableClientVM);
Copy link
Contributor Author

@mhansonp mhansonp Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These numbers were contradictory to the required record count because key 0 should not have been there to be counted.

Copy link
Contributor

@jake-at-work jake-at-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking a wack at cleanup. I would ask for consistency in the usage of AssertJ over junit assertions. Also, I feel like fail is smell and there is usually an appropriate assertion to replace it. On await please use the untilAsserted and use AssertJ assertions in the block to give meaningful failures.

Copy link
Contributor

@kirklund kirklund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with all of Jake's comments. Fix those and add in my approval.

Would need to do a lot more changes for this to
work which is just not a priority at this moment.
@jake-at-work
Copy link
Contributor

@mhansonp As you noticed, you can't convert DUnit based tests to JUni 5. You can rename them to the current naming convention though.

@jake-at-work
Copy link
Contributor

Can you also please rename the updated tests to their current naming conventions.

Copy link
Contributor

@jake-at-work jake-at-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor things to consider but nothing blocking from me now. Thanks!

await()
.untilAsserted(() -> assertThat(stats.get_queueDroppedCount()).isEqualTo(queueDropCount));
await().untilAsserted(
() -> assertThat(stats.get_eventEnqueuedWhileClientAwayCount()).isEqualTo(enqueueCount));
} catch (Exception e) {
fail("Exception thrown while executing checkStatisticsWithExpectedValues()", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little hesitant on this fail. I assume it was here to bubble up some proper task failure handling but the use of fail still smells funny. First off this catch block is going to catch that assertion failures and wrap them, which makes me curious about behavior on failure. We don't consistently catch exceptions and convert to failure, so maybe it isn't necessary. If it is I wonder if using the assertThatCode(...).doesNotThrowAnyException() is more appropriate replacement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertionError which is thrown by the assertions in this try block is an Error rather than an Exception so it won't be caught here and any assertions that fail will bubble up. I personally dislike the use of try/catch coupled with fail() since it's not really adding anything here that wouldn't be clear from the stack trace of the exception, and it introduces inconsistent behaviour depending on the type of error/exception we might encounter.

// Verify the durable client still exists on the server
verifyDurableClientPresent(VERY_LONG_DURABLE_TIMEOUT_SECONDS, durableClientId,
server1VM);
Thread.sleep(10000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still no keen on sleeps and would like to see this await on something.

Copy link
Contributor

@DonalEvans DonalEvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean-up! There's a lot of work still to do in these old tests, but some is always better than none, so none of my comments should be considered blocking.

Comment on lines +179 to 185
protected <V> V put() {
return (V) gemfireServerVm.invoke(() -> {
Cache cache = CacheFactory.getAnyInstance();
cache.getRegion("/Example").put("2", "TWO");
return cache.getRegion("/Example").get("2");
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warnings here can be fixed by using:

  protected String put() {
    return gemfireServerVm.invoke(() -> {
      Cache cache = CacheFactory.getAnyInstance();
      cache.getRegion("/Example").put("2", "TWO");
      return cache.<String, String>getRegion("/Example").get("2");
    });
  }

Since we explicitly create the Region as <String, String>, there's no need to use generics here.

Comment on lines +188 to 197
final long timeout = (System.currentTimeMillis()
+ ClientServerRegisterInterestsDUnitTest.WAIT_TIME_MILLISECONDS);

while (entryEvents.empty() && (System.currentTimeMillis() < timeout)) {
synchronized (this) {
try {
TimeUnit.MILLISECONDS.timedWait(this, Math.min(500, waitTimeMilliseconds));
TimeUnit.MILLISECONDS.timedWait(this, Math.min(500,
ClientServerRegisterInterestsDUnitTest.WAIT_TIME_MILLISECONDS));
} catch (InterruptedException ignore) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole method should probably just be replaced with:

      await().untilAsserted(() -> assertThat(entryEvents).isNotEmpty());

final int port = AvailablePortHelper.getRandomAvailableTCPPort();
vm0.invoke("create cache", () -> {
getSystem(getClientProperties());
PoolImpl p = (PoolImpl) PoolManager.createFactory().addServer(hostName, port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be the interface Pool rather than the concrete PoolImpl.

import org.apache.geode.test.junit.categories.ClientSubscriptionTest;

@Category({ClientSubscriptionTest.class})
public class DurableClientNoServerAvailabileDistributedTest extends JUnit4CacheTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here, should be "NoServerAvailable"

startAndCloseNonDurableClientCache(durableClientTimeout);
startAndCloseNonDurableClientCache(1); //////// -> Reconnection1
startAndCloseNonDurableClientCache();
startAndCloseNonDurableClientCache(); //////// -> Reconnection1
Wait.pause(1400); //////// -> Queue Dropped1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These waits should ideally be replaced with await() calls on some reasonable condition. For this one and the one below it the following seems right:

    await().untilAsserted(() -> assertThat(durableClientVM.invoke(() -> getClientCache().isClosed())).isTrue());

await()
.untilAsserted(() -> assertThat(stats.get_queueDroppedCount()).isEqualTo(queueDropCount));
await().untilAsserted(
() -> assertThat(stats.get_eventEnqueuedWhileClientAwayCount()).isEqualTo(enqueueCount));
} catch (Exception e) {
fail("Exception thrown while executing checkStatisticsWithExpectedValues()", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertionError which is thrown by the assertions in this try block is an Error rather than an Exception so it won't be caught here and any assertions that fail will bubble up. I personally dislike the use of try/catch coupled with fail() since it's not really adding anything here that wouldn't be clear from the stack trace of the exception, and it introduces inconsistent behaviour depending on the type of error/exception we might encounter.

Comment on lines +149 to 152
Enumeration<?> e = javaSystemProperties.propertyNames();

while (e.hasMoreElements()) {
String key = (String) e.nextElement();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block can be simplified to:

    if (javaSystemProperties != null && javaSystemProperties.size() > 0) {
      for (String key : javaSystemProperties.stringPropertyNames()) {
        System.setProperty(key, javaSystemProperties.getProperty(key));
      }
    }

Comment on lines 172 to 176
if (javaSystemProperties != null && javaSystemProperties.size() > 0) {
Enumeration e = javaSystemProperties.propertyNames();
Enumeration<?> e = javaSystemProperties.propertyNames();

while (e.hasMoreElements()) {
String key = (String) e.nextElement();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block can be simplified to:

    if (javaSystemProperties != null && javaSystemProperties.size() > 0) {
      for (String key : javaSystemProperties.stringPropertyNames()) {
        System.clearProperty(key);
      }
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point in checking the size, the for range loop won't do anything if it is empty.

factory.setScope(Scope.LOCAL);
factory.setPoolName(p.getName());
RegionAttributes attrs = factory.create();
RegionAttributes<Object, Object> attrs = factory.create();
cache.createRegion(regionName1, attrs);
cache.createRegion(regionName2, attrs);
pool = p;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addControlListener arguments to createCacheClientFromXmlN() on line 197 and createCacheClientFromXml() on line 219 are not used and can be removed.

Comment on lines +498 to +501
/**
* This method is not implemented to test event count matches the eventsToCheck.size() which is
* confusing. It will wait and return if it got something in the eventsToCheck or not.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either use correct javadoc semantics (add a @param tag for each argument and an @return tag for the returned value) or it should be a block comment rather than a javadoc comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree, I would rather have the hover over documentation with some information than none. Though completing the information should not be that hard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a bee in my bonnet about javadocs after the big clean-up I did lately, I think. I saw so many places where there was stuff like

/**
* Constructs the object
* /
public SomeClass(int param1, int param2) {
...

that provided no value as a javadoc and caused the linter to throw warnings. I think sometimes people just use javadoc comments because they're green in IntelliJ and so stand out more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh for sure, that comment is pointless. I delete those when I find them. I should have been more precise, I would rather have some useful comments missing params and return tags than no comments at all.

@@ -576,6 +577,7 @@ public void run2() throws CacheException {

@Test
public void testReadyForEventsNotCalledImplicitlyForRegisterInterestWithCacheXML() {
int[] ports = getRandomAvailableTCPPorts(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not used.

@mhansonp
Copy link
Contributor Author

So there is one remaining problem in the stress test and that is caused by the fact that the test DurableClientSimpleDUnitTest.testReadyForEventsNotCalledImplicitlyForRegisterInterestWithCacheXML uses a server with a defined port number of 10188. So when the test starts up it references an cache.xml that has the port defined in it. This means that the test cannot be run in parallel with itself as is done through the stress test and be ok. Depending on timing it will have resource contention. This has been in there for quite some time and is not really a problem except in stress test. If we do want to fix it, we will need to find a way to override it by generating it on the fly or modifying the XML on the fly. Generating it on the fly would be the easier path.

@jake-at-work
Copy link
Contributor

@mhansonp I think we can reasonably ignore that repeated test failure.

@mhansonp
Copy link
Contributor Author

I filed https://issues.apache.org/jira/browse/GEODE-10265 for the stress test failure.

@mhansonp mhansonp merged commit 5f1d358 into apache:develop Apr 28, 2022
@mhansonp mhansonp deleted the bugfix/GEODE-10228 branch April 28, 2022 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants