
[FLINK-16480] Improve error reporting on AZP #11372

Closed

Conversation

rmetzger (Contributor):

What is the purpose of the change

While debugging unstable e2e tests and crashing JVMs remotely, I came up with a bunch of improvements to our testing scripts, which I would like to add to the main repo.

Brief change log

The following changes are contained here:

  • [FLINK-16480][e2e] Always check logs when a test has failed: the e2e tests now check the logs for exceptions, errors, etc. when a test has failed (previously, the logs were not checked in error cases for some tests; see the sketch after this list).
  • [FLINK-16480][AZP] Fix log upload condition: the log upload failed for e2e tests when a bash e2e test failed.
  • [FLINK-16480] Collect debug files when the JVM crashes (for IT cases): we now collect coredumps, JVM stack traces, and other debugging files.
  • [FLINK-16480][e2e] Collect Java crash reports for e2e tests as well: the same for e2e tests. This includes a refactoring of the log collection tools into a separate file.
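
For illustration, a minimal sketch of the first item (hedged: post_test_validation appears in the diff further down, but check_logs_for_errors and check_logs_for_exceptions are assumed helper names, not necessarily the actual script contents):

  # Sketch only: run the log checks on every failure instead of letting
  # individual tests opt out via a skip flag.
  function post_test_validation {
      local exit_code="$1"
      local description="$2"

      if [ "$exit_code" -ne 0 ]; then
          echo "[FAIL] '${description}' failed, checking logs for errors and exceptions."
          check_logs_for_errors       # assumed helper: grep collected logs for ERROR entries
          check_logs_for_exceptions   # assumed helper: grep collected logs for exception stack traces
      fi
  }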

Verifying this change

This run properly collects coredumps etc. for a test that used to fail on master a few days ago: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6137&view=results

Does this pull request potentially affect one of the following parts:

affects only tests

@flinkbot (Collaborator):

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 08aa4c1 (Tue Mar 10 21:15:33 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot (Collaborator) commented Mar 10, 2020:

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@@ -46,7 +46,7 @@ function run_test {
 
 function test_error() {
 echo "[FAIL] Test script contains errors."
-post_test_validation 1 "$description" "$skip_check_exceptions"
+post_test_validation 1 "$description"
Contributor:

do you know whether the bash tests upload the logs for failed tests?

Contributor:

My impression is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output and don't properly investigate the case, resulting in several superfluous JIRAs.

rmetzger (Contributor Author):

> do you know whether the bash tests upload the logs for failed tests?

No. All directories are cleaned up after a test (in any case).

> My impression is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output and don't properly investigate the case, resulting in several superfluous JIRAs.

That is definitely a risk, I agree. But I would prefer not to rework this change for a problem we have not confirmed to exist. It is quite tedious to improve these things because of all the testing cycles in between. I would prefer to add this as is and re-adjust later if we get the impression that people are not checking the logs carefully enough.

zentol (Contributor), Mar 17, 2020:

> But I would prefer not to rework this change for a problem we have not confirmed to exist.

We have repeatedly seen people not properly analyzing the CI output when the logs are being dumped; this is not a theory.

> It is quite tedious to improve these things because of all the testing cycles in between.

That comes with the territory, unfortunately. You will just have to find a way to reduce these cycles, e.g., only compile and test flink-core, or refactor the scripts such that you can run things offline.
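
For example, a shorter local iteration along the lines of this suggestion could look like the following (hedged; the module selection is only an illustration):

  # Build and test only flink-core plus the upstream modules it needs,
  # instead of waiting for a full CI run on every iteration.
  mvn test -pl flink-core -am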

rmetzger (Contributor Author):

Okay. I will look into a different approach for making the logs available.

fi
fi

echo "PRODUCED build artifacts."
Contributor:

as usual this would be easier to review if the moving of code were put into a separate commit.

rmetzger (Contributor Author):

That's what I tried to do. There's a commit introducing core dump collection for regular tests, and another commit introducing it for end-to-end tests.

I could not find any guidelines on how to exactly split the commits. How do you want it?

Contributor:

From what I can tell, you first moved a bunch of stuff from travis_watchdog.sh to common-logging.sh (commit #1, pure refactoring, no behavioral changes), and then modified run-nightly-tests.sh to make use of it (commit #2, behavioral changes).

@@ -177,6 +177,14 @@ print_stacktraces () {
 done
 }
 
+collect_coredumps() {
+echo "Searching for .dump, .dumpstream and related files in $($HERE/../)"
+for file in `find . -type f -regextype posix-extended -iregex '.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'`; do
Contributor:

A simpler solution may be to set up the JVM to directly write the dumps into ARTIFACTS_DIR.
https://stackoverflow.com/questions/1880166/is-it-possible-to-specify-where-jvms-crash-dumps-go

rmetzger (Contributor Author):

With this implementation, we can use this function to collect any JVM debugging files from the end-to-end tests as well. Yes, we could also inject JVM properties through environment variables into most of the end-to-end tests, but who guarantees that Kafka, Elasticsearch, and ZooKeeper all properly respect those environment variables?

With this change, I'm following a black-box approach: I don't trust the system I'm testing.

Also, I don't know if I would be spending my time well for the project if I refactored something that works. Coredumps (luckily) happen quite rarely. I just wanted to make life easier for others when they have to debug a crashing JVM on a CI system.

Contributor:

> But who guarantees that Kafka, Elasticsearch, and ZooKeeper all properly respect those environment variables?

There's no guarantee that these are run under FLINK_ROOT though; the Java e2e tests use a JUnit temporary folder to store and run these services, as well as the Flink distribution. At the very least you'd also have to search the entire tmp directory of the system, or modify java.io.tmpdir to point somewhere under FLINK_ROOT.

Have you tried the JAVA_TOOL_OPTIONS environment variable that the JVM itself evaluates? https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/envvars002.html

> I don't know if I would be spending my time well for the project if I refactored something that works.

This can be read in rather frightening ways: "Lol, let's use the first version that works now."
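
For reference, a sketch of the JAVA_TOOL_OPTIONS idea (hedged: ARTIFACTS_DIR is taken from the earlier comment, and whether every spawned service honors these options is exactly the open question in this thread):

  # Every HotSpot JVM started in this environment reads JAVA_TOOL_OPTIONS,
  # so crash reports and heap dumps would land directly in the artifacts directory.
  export JAVA_TOOL_OPTIONS="-XX:ErrorFile=${ARTIFACTS_DIR}/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${ARTIFACTS_DIR}"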

(Two review threads on tools/azure_controller.sh were marked outdated and resolved.)

SCRIPT_DIR="`dirname \"$0\"`"
SCRIPT_DIR="`( cd \"${SCRIPT_DIR}\" && pwd -P)`"
export FLINK_ROOT="`( cd \"${SCRIPT_DIR}/..\" && pwd -P)`"
Contributor:

this seems quite brittle

rmetzger (Contributor Author):

I agree. Do you have a good idea how to make it more stable?

zentol (Contributor), Mar 17, 2020:

AFAIK you can pass parameters when calling source, so you could explicitly pass in the desired directory.
Alternatively, set a property before sourcing it.
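
In bash that could look roughly like this (a sketch; common-logging.sh is the file name mentioned earlier in the review, and the paths are illustrative):

  # Option 1: pass the desired root as an argument when sourcing;
  # inside the sourced script: FLINK_ROOT="${1:-$FLINK_ROOT}"
  source "${SCRIPT_DIR}/common-logging.sh" "/path/to/flink"

  # Option 2: set the variable up front and have the sourced script
  # fall back to path guessing only when it is unset.
  export FLINK_ROOT="/path/to/flink"
  source "${SCRIPT_DIR}/common-logging.sh"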

zentol self-assigned this on Mar 16, 2020

In my tests, the regex matches the following files:

./flink-state-processing-api/target/surefire-reports/2020-03-05T15-44-03_851-jvmRun1.dump
./flink-state-processing-api/target/surefire-reports/hs_err_pid3432.log
./flink-state-processing-api/target/surefire-reports/2020-03-05T15-11-41_750-jvmRun2.dump
./flink-state-processing-api/target/surefire-reports/2020-03-05T16-44-23_231.dumpstream
./flink-state-processing-api/target/surefire-reports/hs_err_pid_3432.log
./flink-state-processing-api/target/surefire-reports/hs_err_pid.log
./flink-state-processing-api/target/core.11410
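
A quick way to reproduce such a check locally with GNU find (the dummy file names are illustrative):

  mkdir -p /tmp/coredump-regex-test/target && cd /tmp/coredump-regex-test
  touch target/hs_err_pid1234.log target/foo.dump target/bar.dumpstream target/core.11410 target/unrelated.txt
  find . -type f -regextype posix-extended -iregex '.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'
  # expected output: every file above except unrelated.txt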
rmetzger force-pushed the FLINK-16480-improve-error-reporting-pr branch from 08aa4c1 to b3ce8a2 on March 17, 2020 at 10:24.
@@ -54,7 +54,7 @@ print_system_info() {
 print_system_info
 
 # enable core dumps
-sudo ulimit -c unlimited
+ulimit -c unlimited
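
For context on the change itself (hedged; the exact error text depends on the sudo and shell versions):

  sudo ulimit -c unlimited   # usually fails: ulimit is a shell builtin, not a binary sudo can execute
  ulimit -c unlimited        # raises the core dump size limit for the current shell and its child processes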
Contributor:

I suppose the sudo removal belongs in another commit.

rmetzger (Contributor Author):

I'm closing this PR and re-opening parts of it as separate PRs.
