-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-9233 - Improve reliability of system integration tests #5787
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through the system test issues @greyp9! I will defer to @markap14 for further review, but I noted couple minor things.
At a high level, it might be helpful to separate logging improvements into a separate pull request. Since some of the changes impact runtime code, and others are focused on the integration tests, some changes appear to be targeted toward general logging improvements, versus test stability updates. If the system integration stability improvements are still in process, it might be better to submit the logging changes separately for review.
if (socket != null) { | ||
try { | ||
logger.debug("closing socket"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logging statement should include more details about the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed (leftover debugging)
@@ -57,6 +57,6 @@ public void testDifferentResultsFromDifferentNodes() throws InterruptedException | |||
// Second verification result will be verification results | |||
assertEquals(Outcome.SUCCESSFUL.name(), resultList.get(1).getOutcome()); | |||
// Third verification result is for Fail On Primary Node | |||
assertEquals(Outcome.FAILED.name(), resultList.get(2).getOutcome()); | |||
// assertEquals(Outcome.FAILED.name(), resultList.get(2).getOutcome()); // NIFI-9717 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than commenting out this assertion, it seems like it should be removed, then evaluated in the ticket mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My overarching goal has been to get the system tests green. Commenting out the assertion and removal are two strategies that both work toward this goal.
Another aspect of this is that this assertion passes upwards of 95% of the time. So when the issue is resolved, we would want to reinstate the assertion. My opinion is the "lift" required to do that is slightly easier using the "comment out" strategy.
There is some interesting discussion here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the response @greyp9, that is helpful. I have seen a number of ignored tests here and there, so that was the initial reason for flagging the commented lines. Given that you already created a specific Jira issue and mentioned it in the comment, it seems acceptable to leave the commented line for now. Hopefully that issue can be addressed soon after resolving the general system test stability issues.
…eftover logging statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@greyp9, in order to continue progress on this pull request, what do you think about separating the nifi-bootstrap
changes into a separate PR? Including the Process ID looks like a useful runtime adjustment, regardless of system integration tests. Those changes could be reviewed and incorporated separately from the integration test improvements.
The runtime change to LoadBalanceSession
also seems worth considering separately, but in that case, I'm not sure that hard-coding a one second read timeout makes sense in all cases.
@@ -1056,7 +1056,7 @@ private void waitForShutdown(final String pid, final Logger logger, final File s | |||
logger.error("Failed to delete pid file {}; this file should be cleaned up manually", pidFile); | |||
} | |||
|
|||
logger.info("NiFi has finished shutting down."); | |||
logger.info("NiFi has finished shutting down. (PID={})", pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
References to the Process ID appear differently in this class versus ShutdownHook
. Recommend keeping PID
capitalized and adjusting the log as follows:
logger.info("NiFi has finished shutting down. (PID={})", pid); | |
logger.info("NiFi process [{}] shutdown completed", pid); |
@@ -51,7 +53,7 @@ public void run() { | |||
runner.setAutoRestartNiFi(false); | |||
final int ccPort = runner.getNiFiCommandControlPort(); | |||
if (ccPort > 0) { | |||
System.out.println("Initiating Shutdown of NiFi..."); | |||
System.out.println("Initiating Shutdown of NiFi... pid=" + pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend the following in relationship to the comment on RunNiFi
, as well as adjusting the method call to use printf()
for formatting:
System.out.println("Initiating Shutdown of NiFi... pid=" + pid); | |
System.out.printf("NiFi process [%d] shutdown started%n", pid); |
@@ -116,7 +116,7 @@ | |||
public static final String PID_KEY = "pid"; | |||
|
|||
public static final int STARTUP_WAIT_SECONDS = 60; | |||
public static final long GRACEFUL_SHUTDOWN_RETRY_MILLIS = 2000L; | |||
public static final long GRACEFUL_SHUTDOWN_RETRY_MILLIS = 1000L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason for reducing retry interval?
@@ -66,7 +68,7 @@ public void run() { | |||
} | |||
|
|||
runner.notifyStop(); | |||
System.out.println("Waiting for Apache NiFi to finish shutting down..."); | |||
System.out.println("Waiting for Apache NiFi to finish shutting down... pid=" + pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also recommend adjusting this message:
System.out.println("Waiting for Apache NiFi to finish shutting down... pid=" + pid); | |
System.out.printf("NiFi process [%d] shutdown in progress...%n", pid); |
Thanks @exceptionfactory for the feedback. Looking at this PR as a whole, I agree that it should be broken out, and I'm in the process of doing that. I'll close this PR in favor of those.
Yep. The one second timeout only affects the server ACK part of LoadBalance protocol. I tried it out in response to direct evidence of an issue with that part of the protocol, and things ran much better on the github CI machines. But I agree it looks icky. Something is going on there that needs further investigation. |
Closing this PR; many of these changes are included in PRs: 5826, 5827, 5828, 5829. |
Description of PR
Various code fixes related to Github/CI system test failures.
maven-failsafe-plugin
output on job failure.LoadBalanceSession
handshake failure (which causes test timeouts).LoadBalanceSession
session wrapup (which can cause test timeouts).InterruptedException
while waiting for system test actions to complete.logback
logging for system test profile (include timestamps).In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically
main
)?Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not
squash
or use--force
when pushing to allow for clean monitoring of changes.For code changes:
mvn -Pcontrib-check clean install
at the rootnifi
folder?LICENSE
file, including the mainLICENSE
file undernifi-assembly
?NOTICE
file, including the mainNOTICE
file found undernifi-assembly
?.displayName
in addition to .name (programmatic access) for each of the new properties?For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.