Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-9233 - Improve reliability of system integration tests #5787

Closed
wants to merge 3 commits into from

Conversation

greyp9
Copy link
Contributor

@greyp9 greyp9 commented Feb 22, 2022

Description of PR

Various code fixes related to Github/CI system test failures.

  • Capture troubleshooting data on job timeout.
  • Capture maven-failsafe-plugin output on job failure.
  • System test timing adjustments to reduce diagnostic noise.
  • Log PID on startup / shutdown to identify source of orphaned processes.
  • Fix for LoadBalanceSession handshake failure (which causes test timeouts).
  • Fix for LoadBalanceSession session wrapup (which can cause test timeouts).
  • Logging allowing for identification of load balance failures.
  • Retry NiFi process shutdown if needed during system test case teardown.
  • Fix logging typo.
  • Log IDs of both ends of LoadBalance connections (to help identify failure cases).
  • Retry ReportingTask query on incomplete response.
  • Do not mask InterruptedException while waiting for system test actions to complete.
  • Retry service queries until expected state is verified.
  • Short waits between actions (when needed) to allow FlowManager state to become consistent.
  • Comment out ReportingTask system test failures to allow NIFI-9717 to be addressed asynchronously.
  • Adjust system test cluster configuration to avoid test failure condition.
  • Adjust logback logging for system test profile (include timestamps).

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

@greyp9 greyp9 marked this pull request as ready for review February 22, 2022 21:16
@greyp9 greyp9 marked this pull request as draft February 23, 2022 00:00
@greyp9 greyp9 marked this pull request as ready for review February 23, 2022 14:27
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the system test issues @greyp9! I will defer to @markap14 for further review, but I noted couple minor things.

At a high level, it might be helpful to separate logging improvements into a separate pull request. Since some of the changes impact runtime code, and others are focused on the integration tests, some changes appear to be targeted toward general logging improvements, versus test stability updates. If the system integration stability improvements are still in process, it might be better to submit the logging changes separately for review.

if (socket != null) {
try {
logger.debug("closing socket");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging statement should include more details about the connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed (leftover debugging)

@@ -57,6 +57,6 @@ public void testDifferentResultsFromDifferentNodes() throws InterruptedException
// Second verification result will be verification results
assertEquals(Outcome.SUCCESSFUL.name(), resultList.get(1).getOutcome());
// Third verification result is for Fail On Primary Node
assertEquals(Outcome.FAILED.name(), resultList.get(2).getOutcome());
// assertEquals(Outcome.FAILED.name(), resultList.get(2).getOutcome()); // NIFI-9717
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than commenting out this assertion, it seems like it should be removed, then evaluated in the ticket mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My overarching goal has been to get the system tests green. Commenting out the assertion and removal are two strategies that both work toward this goal.

Another aspect of this is that this assertion passes upwards of 95% of the time. So when the issue is resolved, we would want to reinstate the assertion. My opinion is the "lift" required to do that is slightly easier using the "comment out" strategy.

There is some interesting discussion here:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the response @greyp9, that is helpful. I have seen a number of ignored tests here and there, so that was the initial reason for flagging the commented lines. Given that you already created a specific Jira issue and mentioned it in the comment, it seems acceptable to leave the commented line for now. Hopefully that issue can be addressed soon after resolving the general system test stability issues.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greyp9, in order to continue progress on this pull request, what do you think about separating the nifi-bootstrap changes into a separate PR? Including the Process ID looks like a useful runtime adjustment, regardless of system integration tests. Those changes could be reviewed and incorporated separately from the integration test improvements.

The runtime change to LoadBalanceSession also seems worth considering separately, but in that case, I'm not sure that hard-coding a one second read timeout makes sense in all cases.

@@ -1056,7 +1056,7 @@ private void waitForShutdown(final String pid, final Logger logger, final File s
logger.error("Failed to delete pid file {}; this file should be cleaned up manually", pidFile);
}

logger.info("NiFi has finished shutting down.");
logger.info("NiFi has finished shutting down. (PID={})", pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

References to the Process ID appear differently in this class versus ShutdownHook. Recommend keeping PID capitalized and adjusting the log as follows:

Suggested change
logger.info("NiFi has finished shutting down. (PID={})", pid);
logger.info("NiFi process [{}] shutdown completed", pid);

@@ -51,7 +53,7 @@ public void run() {
runner.setAutoRestartNiFi(false);
final int ccPort = runner.getNiFiCommandControlPort();
if (ccPort > 0) {
System.out.println("Initiating Shutdown of NiFi...");
System.out.println("Initiating Shutdown of NiFi... pid=" + pid);
Copy link
Contributor

@exceptionfactory exceptionfactory Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend the following in relationship to the comment on RunNiFi, as well as adjusting the method call to use printf() for formatting:

Suggested change
System.out.println("Initiating Shutdown of NiFi... pid=" + pid);
System.out.printf("NiFi process [%d] shutdown started%n", pid);

@@ -116,7 +116,7 @@
public static final String PID_KEY = "pid";

public static final int STARTUP_WAIT_SECONDS = 60;
public static final long GRACEFUL_SHUTDOWN_RETRY_MILLIS = 2000L;
public static final long GRACEFUL_SHUTDOWN_RETRY_MILLIS = 1000L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for reducing retry interval?

@@ -66,7 +68,7 @@ public void run() {
}

runner.notifyStop();
System.out.println("Waiting for Apache NiFi to finish shutting down...");
System.out.println("Waiting for Apache NiFi to finish shutting down... pid=" + pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also recommend adjusting this message:

Suggested change
System.out.println("Waiting for Apache NiFi to finish shutting down... pid=" + pid);
System.out.printf("NiFi process [%d] shutdown in progress...%n", pid);

@greyp9
Copy link
Contributor Author

greyp9 commented Mar 2, 2022

@greyp9, in order to continue progress on this pull request, what do you think about separating the nifi-bootstrap changes into a separate PR? Including the Process ID looks like a useful runtime adjustment, regardless of system integration tests. Those changes could be reviewed and incorporated separately from the integration test improvements.

Thanks @exceptionfactory for the feedback. Looking at this PR as a whole, I agree that it should be broken out, and I'm in the process of doing that. I'll close this PR in favor of those.

The runtime change to LoadBalanceSession also seems worth considering separately, but in that case, I'm not sure that hard-coding a one second read timeout makes sense in all cases.

Yep. The one second timeout only affects the server ACK part of LoadBalance protocol. I tried it out in response to direct evidence of an issue with that part of the protocol, and things ran much better on the github CI machines. But I agree it looks icky. Something is going on there that needs further investigation.

@greyp9
Copy link
Contributor Author

greyp9 commented Mar 3, 2022

Closing this PR; many of these changes are included in PRs: 5826, 5827, 5828, 5829.

@greyp9 greyp9 closed this Mar 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants