Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3309. Add TimedOutTestsListener to surefire and add timeout to integration tests #813

Merged
merged 15 commits into from
May 28, 2020

Conversation

smengcl
Copy link
Contributor

@smengcl smengcl commented Apr 13, 2020

What changes were proposed in this pull request?

Add TimedOutTestsListener as a listener to maven-surefire-plugin like Hadoop does. And add timeout to integration tests that are not marked with @Ignore at the time of the patch.

Also previously discussed under #750.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3309

How was this patch tested?

The patch should make the timed out tests print out the stack trace (if any). Checks itself.

Bonus

To test if the TimedOutTestsListener is working, you can change the timeout of a class to an impossibly small value like 5 sec (for an integration test to initialize). Then run that integration test. Make sure the test isn't ignored or it won't run with mvn test.

Note that the timeout must be smaller than surefire.fork.timeout shown below otherwise the fork will still be killed with no stack trace or thread dump.

https://github.com/apache/hadoop-ozone/blob/2e9bae2f8636974c5c0e11a4a3e04d9cbfbb2288/pom.xml#L228

For example, in TestOzoneManagerRestart I removed the @Ignore annotation and set the timeout to 5000 ms. Then ran:

mvn test -e -Dtest=TestOzoneManagerRestart

Maven output:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.ozone.om.TestOzoneManagerRestart
[ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 15.223 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestOzoneManagerRestart
[ERROR] testRestartOMWithVolumeOperation(org.apache.hadoop.ozone.om.TestOzoneManagerRestart)  Time elapsed: 5.019 s  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:218)
...
[ERROR] testRestartOMWithBucketOperation(org.apache.hadoop.ozone.om.TestOzoneManagerRestart)  Time elapsed: 5.004 s  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:218)
...
[ERROR] testRestartOMWithKeyOperation(org.apache.hadoop.ozone.om.TestOzoneManagerRestart)  Time elapsed: 5.002 s  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:218)
...
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   TestOzoneManagerRestart.init:90 »  test timed out after 5000 milliseconds
[ERROR]   TestOzoneManagerRestart.init:90 »  test timed out after 5000 milliseconds
[ERROR]   TestOzoneManagerRestart.init:90 »  test timed out after 5000 milliseconds
[INFO]
[ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0

More importantly, the surefire-reports folder has the detailed logs with thread dump:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/smeng/repo/ozone/hadoop-ozone/integration-test/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There are test failures.

Please refer to /Users/smeng/repo/ozone/hadoop-ozone/integration-test/target/surefire-reports for the individual test results.
Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)

Example thread dump in surefire-reports/org.apache.hadoop.ozone.om.TestOzoneManagerRestart-output.txt:

====> TEST TIMED OUT. PRINTING THREAD DUMP. <====

Timestamp: 2020-05-22 06:37:20,788

"qtp494169951-212" daemon prio=5 tid=212 timed_waiting
java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:875)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:925)
        at java.lang.Thread.run(Thread.java:748)
"pool-56-thread-1"  prio=5 tid=192 timed_waiting
java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

@smengcl smengcl requested review from elek and adoroszlai April 13, 2020 17:30
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smengcl for working on this. I tested the change by configuring very short timeout. Here are my observations:

  1. The parallel-tests profile is not enabled by default. The listener should be configured globally, outside of profiles to take effect. I think it should be added to the root pom.xml here:

https://github.com/apache/hadoop-ozone/blob/c9a1925537312eb950d596dbae2c7e4d91f9fd59/pom.xml#L1686-L1724

  1. TimedOutTestsListener is not notified of "fork timeouts", only of individual test timeout, ie. the kind which you wanted to add in HDDS-3309. Add timeout to all integration tests #750. So I think we need both changes.

@smengcl
Copy link
Contributor Author

smengcl commented Apr 15, 2020

Thanks @smengcl for working on this. I tested the change by configuring very short timeout. Here are my observations:

  1. The parallel-tests profile is not enabled by default. The listener should be configured globally, outside of profiles to take effect. I think it should be added to the root pom.xml here:

https://github.com/apache/hadoop-ozone/blob/c9a1925537312eb950d596dbae2c7e4d91f9fd59/pom.xml#L1686-L1724

  1. TimedOutTestsListener is not notified of "fork timeouts", only of individual test timeout, ie. the kind which you wanted to add in HDDS-3309. Add timeout to all integration tests #750. So I think we need both changes.

Thanks for the comment @adoroszlai . I can confirm parallel-tests isn't enabled and we should add the listener to root. Will post update.

I also played with the maven surefire listener a bit. Some observations:

  1. testStarted() will always be executed at test start, as in its name. But the current TimedOutTestsListener doesn't implement it. We can flush the output to make the message appear in e.g. org.apache.hadoop.ozone.om.TestOzoneManagerRestart-output.txt even if the test is eventually shutdown by fork timeout (forkedProcessTimeoutInSeconds). This way we can print the test name at least. Even if the test fork is terminated by timeout (prints Killing self fork JVM. Received SHUTDOWN command from Maven shutdown hook. in <DATETIME>-jvmRunX.dump), we can know which test was it running in the output text file.

  2. testFailure will not be executed when fork JVM for testing timed out. There appears to be one forked JVM per test class (See help for . CMIIW).

  3. I'm wondering if we could utilize shutdownHook to capture the failed test name and other info. Probably not. But I'll dig a bit.

@smengcl
Copy link
Contributor Author

smengcl commented Apr 15, 2020

Looks like hadoop-hdds can't locate the class in hadoop-ozone

@adoroszlai
Copy link
Contributor

Looks like hadoop-hdds can't locate the class in hadoop-ozone

I started working on a similar listener (for debugging disk space usage in tests) and ran into this classpath issue. Solved it by moving these test utilities to a new module. I'll create a PR with the base change to unblock you. With that change you'll be able to add the TimedOutTestsListener globally.

@smengcl
Copy link
Contributor Author

smengcl commented Apr 15, 2020

Looks like hadoop-hdds can't locate the class in hadoop-ozone

I started working on a similar listener (for debugging disk space usage in tests) and ran into this classpath issue. Solved it by moving these test utilities to a new module. I'll create a PR with the base change to unblock you. With that change you'll be able to add the TimedOutTestsListener globally.

Thanks. btw It looks to me the root cause of the issue is that Apache Hadoop HDDS Config(hadoop-hdds-config) is compiled before Apache Hadoop Distributed Data Store Common(hadoop-hdds-common). The latter has the TimedOutTestsListener. One easy fix might just be to move the TimedOutTestsListener class to a module that is compiled early enough.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @smengcl for continuing with this. Having moved TimedOutTestsListener to a separate module, it is now possible to add it globally. Can you please update this PR by removing the unnecessary change in TimedOutTestsListener? Your change in pom.xml should be fine for a new CI run.

@smengcl
Copy link
Contributor Author

smengcl commented May 18, 2020

@adoroszlai I have rebased the jira and added commits similar to #750 . I have also left all ignored integration test classes untouched from the timeout change.

Please take a look. :)

@smengcl
Copy link
Contributor Author

smengcl commented May 18, 2020

Rebased again since HDDS-3602 is merged.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smengcl for continuing work on this. There are few files with whitespace-only change, and several with ignored test class but timeout being added. Please let me know if you would like a list of these. Otherwise it looks good to me.

@smengcl
Copy link
Contributor Author

smengcl commented May 22, 2020

Thanks @smengcl for continuing work on this. There are few files with whitespace-only change, and several with ignored test class but timeout being added. Please let me know if you would like a list of these. Otherwise it looks good to me.

Thanks @adoroszlai for another review. I have addressed the whitespace and removed timeout from more ignored tests.

@smengcl smengcl changed the title HDDS-3309. Add TimedOutTestsListener to surefire HDDS-3309. Add TimedOutTestsListener to surefire and add timeouts to integration tests May 22, 2020
@smengcl smengcl changed the title HDDS-3309. Add TimedOutTestsListener to surefire and add timeouts to integration tests HDDS-3309. Add TimedOutTestsListener to surefire and add timeout to integration tests May 22, 2020
@adoroszlai
Copy link
Contributor

I have addressed the whitespace and removed timeout from more ignored tests.

Thanks, the update looks good.

@smengcl
Copy link
Contributor Author

smengcl commented May 28, 2020

Thanks @elek for triggering the recheck. The latest it-client failure is caused by flakiness of TestKeyInputStream#testSeekRandomly and TestOzoneRpcClient.

But I see a green run on my dev branch: smengcl@843f1d8

Will merge shortly.

@smengcl smengcl merged commit b66869d into apache:master May 28, 2020
@elek
Copy link
Member

elek commented Jun 2, 2020

In case of flaky tests I would suggest disabling them (with opening a Jira + saving log + commit @ignore without review) and repeat test. I think it provides better safety.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants