
[FLINK-16480] Improve error reporting on AZP #11372

Closed

Conversation

rmetzger (Contributor):

What is the purpose of the change

While debugging unstable e2e tests and crashing JVMs remotely, I came up with a bunch of improvements to our testing scripts, which I would like to add to the main repo.

Brief change log

The following changes are contained here:

  • [FLINK-16480][e2e] Always check logs when a test has failed: the e2e tests now check the logs for exceptions, errors, etc. when a test has failed (previously, the logs were not checked in error cases for some tests; see the sketch after this list).
  • [FLINK-16480][AZP] Fix log upload condition: the log upload failed for e2e tests when a bash e2e test failed.
  • [FLINK-16480] Collect debug files when the JVM crashes (for IT cases): we now collect coredumps, JVM stack traces, and other debugging files.
  • [FLINK-16480][e2e] Collect Java crash reports for e2e tests as well: the same for e2e tests. This includes a refactoring of the log collection tools into a separate file.
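
For illustration, a minimal sketch of the first item (hedged: post_test_validation appears in the diff further down, but check_logs_for_errors and check_logs_for_exceptions are assumed helper names, not necessarily the actual script contents):

  # Sketch only: run the log checks on every failure instead of letting
  # individual tests opt out via a skip flag.
  function post_test_validation {
      local exit_code="$1"
      local description="$2"

      if [ "$exit_code" -ne 0 ]; then
          echo "[FAIL] '${description}' failed, checking logs for errors and exceptions."
          check_logs_for_errors       # assumed helper: grep collected logs for ERROR entries
          check_logs_for_exceptions   # assumed helper: grep collected logs for exception stack traces
      fi
  }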

Verifying this change

This run properly collects coredumps etc. for a test that used to fail on master a few days ago: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6137&view=results

Does this pull request potentially affect one of the following parts:

affects only tests

@flinkbot (Collaborator):

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 08aa4c1 (Tue Mar 10 21:15:33 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot (Collaborator) commented Mar 10, 2020:

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@@ -46,7 +46,7 @@ function run_test {
 
 function test_error() {
 echo "[FAIL] Test script contains errors."
-post_test_validation 1 "$description" "$skip_check_exceptions"
+post_test_validation 1 "$description"
Contributor:

do you know whether the bash tests upload the logs for failed tests?

Contributor:

My impression is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output and don't properly investigate the case, resulting in several superfluous JIRAs.

rmetzger (Contributor Author):

> do you know whether the bash tests upload the logs for failed tests?

No. All directories are cleaned up after a test (in any case).

> My impression is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output and don't properly investigate the case, resulting in several superfluous JIRAs.

That is definitely a risk, I agree. But I would prefer not to rework this change for a problem we have not confirmed to exist. It is quite tedious to improve these things because of all the testing cycles in between. I would prefer to add this as is and re-adjust later if we get the impression that people are not checking the logs carefully enough.

zentol (Contributor), Mar 17, 2020:

> But I would prefer not to rework this change for a problem we have not confirmed to exist.

We have repeatedly seen people not properly analyzing the CI output when the logs are being dumped; this is not a theory.

> It is quite tedious to improve these things because of all the testing cycles in between.

That comes with the territory, unfortunately. You will just have to find a way to reduce these cycles, e.g., only compile and test flink-core, or refactor the scripts such that you can run things offline.
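
For example, a shorter local iteration along the lines of this suggestion could look like the following (hedged; the module selection is only an illustration):

  # Build and test only flink-core plus the upstream modules it needs,
  # instead of waiting for a full CI run on every iteration.
  mvn test -pl flink-core -am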

rmetzger (Contributor Author):

Okay. I will look into a different approach for making the logs available.

fi
fi

echo "PRODUCED build artifacts."
Contributor:

as usual this would be easier to review if the moving of code were put into a separate commit.

rmetzger (Contributor Author):

That's what I tried to do. There's a commit introducing core dump collection for regular tests, and another commit introducing it for end-to-end tests.

I could not find any guidelines on how to exactly split the commits. How do you want it?

Contributor:

From what I can tell, you first moved a bunch of stuff from travis_watchdog.sh to common-logging.sh (commit #1, pure refactoring, no behavioral changes), and then modified run-nightly-tests.sh to make use of it (commit #2, behavioral changes).

@@ -177,6 +177,14 @@ print_stacktraces () {
 done
 }
 
+collect_coredumps() {
+echo "Searching for .dump, .dumpstream and related files in $($HERE/../)"
+for file in `find . -type f -regextype posix-extended -iregex '.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'`; do
Contributor:

A simpler solution may be to set up the JVM to directly write the dumps into ARTIFACTS_DIR.
https://stackoverflow.com/questions/1880166/is-it-possible-to-specify-where-jvms-crash-dumps-go

rmetzger (Contributor Author):

With this implementation, we can use this function to collect any JVM debugging files from the end-to-end tests as well. Yes, we could also inject JVM properties through environment variables into most of the end-to-end tests, but who guarantees that Kafka, Elasticsearch, and ZooKeeper all properly respect those environment variables?

With this change, I'm following a black-box approach: I don't trust the system I'm testing.

Also, I don't know if I would be spending my time well for the project if I refactored something that works. Coredumps (luckily) happen quite rarely. I just wanted to make life easier for others when they have to debug a crashing JVM on a CI system.

Contributor:

> But who guarantees that Kafka, Elasticsearch, and ZooKeeper all properly respect those environment variables?

There's no guarantee that these are run under FLINK_ROOT though; the Java e2e tests use a JUnit temporary folder to store and run these services, as well as the Flink distribution. At the very least you'd also have to search the entire tmp directory of the system, or modify java.io.tmpdir to point somewhere under FLINK_ROOT.

Have you tried the JAVA_TOOL_OPTIONS environment variable that the JVM itself evaluates? https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/envvars002.html

> I don't know if I would be spending my time well for the project if I refactored something that works.

This can be read in rather frightening ways: "Lol, let's use the first version that works now."
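
For reference, a sketch of the JAVA_TOOL_OPTIONS idea (hedged: ARTIFACTS_DIR is taken from the earlier comment, and whether every spawned service honors these options is exactly the open question in this thread):

  # Every HotSpot JVM started in this environment reads JAVA_TOOL_OPTIONS,
  # so crash reports and heap dumps would land directly in the artifacts directory.
  export JAVA_TOOL_OPTIONS="-XX:ErrorFile=${ARTIFACTS_DIR}/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${ARTIFACTS_DIR}"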

(Two review threads on tools/azure_controller.sh were marked outdated and resolved.)

SCRIPT_DIR="`dirname \"$0\"`"
SCRIPT_DIR="`( cd \"${SCRIPT_DIR}\" && pwd -P)`"
export FLINK_ROOT="`( cd \"${SCRIPT_DIR}/..\" && pwd -P)`"
Contributor:

this seems quite brittle

rmetzger (Contributor Author):

I agree. Do you have a good idea how to make it more stable?

zentol (Contributor), Mar 17, 2020:

AFAIK you can pass parameters when calling source, so you could explicitly pass in the desired directory.
Alternatively, set a property before sourcing it.
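
In bash that could look roughly like this (a sketch; common-logging.sh is the file name mentioned earlier in the review, and the paths are illustrative):

  # Option 1: pass the desired root as an argument when sourcing;
  # inside the sourced script: FLINK_ROOT="${1:-$FLINK_ROOT}"
  source "${SCRIPT_DIR}/common-logging.sh" "/path/to/flink"

  # Option 2: set the variable up front and have the sourced script
  # fall back to path guessing only when it is unset.
  export FLINK_ROOT="/path/to/flink"
  source "${SCRIPT_DIR}/common-logging.sh"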

zentol self-assigned this on Mar 16, 2020

In my tests, the regex matches the following files:

./flink-state-processing-api/target/surefire-reports/2020-03-05T15-44-03_851-jvmRun1.dump
./flink-state-processing-api/target/surefire-reports/hs_err_pid3432.log
./flink-state-processing-api/target/surefire-reports/2020-03-05T15-11-41_750-jvmRun2.dump
./flink-state-processing-api/target/surefire-reports/2020-03-05T16-44-23_231.dumpstream
./flink-state-processing-api/target/surefire-reports/hs_err_pid_3432.log
./flink-state-processing-api/target/surefire-reports/hs_err_pid.log
./flink-state-processing-api/target/core.11410
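
A quick way to reproduce such a check locally with GNU find (the dummy file names are illustrative):

  mkdir -p /tmp/coredump-regex-test/target && cd /tmp/coredump-regex-test
  touch target/hs_err_pid1234.log target/foo.dump target/bar.dumpstream target/core.11410 target/unrelated.txt
  find . -type f -regextype posix-extended -iregex '.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'
  # expected output: every file above except unrelated.txt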
rmetzger force-pushed the FLINK-16480-improve-error-reporting-pr branch from 08aa4c1 to b3ce8a2 on March 17, 2020 at 10:24.
@@ -54,7 +54,7 @@ print_system_info() {
 print_system_info
 
 # enable core dumps
-sudo ulimit -c unlimited
+ulimit -c unlimited
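
For context on the change itself (hedged; the exact error text depends on the sudo and shell versions):

  sudo ulimit -c unlimited   # usually fails: ulimit is a shell builtin, not a binary sudo can execute
  ulimit -c unlimited        # raises the core dump size limit for the current shell and its child processes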
Contributor:

I suppose the sudo removal belongs in another commit.

rmetzger (Contributor Author):

I'm closing this PR and re-opening parts of it as separate PRs.
