New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-16480] Improve error reporting on AZP #11372
[FLINK-16480] Improve error reporting on AZP #11372
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 08aa4c1 (Tue Mar 10 21:15:33 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
@@ -46,7 +46,7 @@ function run_test { | |||
|
|||
function test_error() { | |||
echo "[FAIL] Test script contains errors." | |||
post_test_validation 1 "$description" "$skip_check_exceptions" | |||
post_test_validation 1 "$description" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know whether the bash tests upload the logs for failed tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impressions is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output, and don't properly investigate the case, resulting in several superfluous JIRAs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know whether the bash tests upload the logs for failed tests?
No. All directories are cleaned up after a test (in any case).
My impressions is that this will actually obfuscate errors. People pretty much only look at the last ~20 lines of the CI output, and don't properly investigate the case, resulting in several superfluous JIRAs.
That is definitely a risk, I agree. But I would prefer not to rework this change for a problem we have not confirmed to exist.
It is quite tedious to improve these things because of all the testing cycles in between.
I would prefer to add this as is and re-adjust later if we have the feeling that people are not checking the logs carefully enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I would prefer not to rework this change for a problem we have not confirmed to exist.
We have repeatedly seen people not properly analyzing the CI output when the logs are being dumped, this is not a theory.
It is quite tedious to improve these things because of all the testing cycles in between.
That comes with the territory unfortunately. You will just have to find a way to reduce these cycles, e.g., only compile and test flink-core or refactor the scripts such that you can run things offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I will look into a different approach for making the logs available.
fi | ||
fi | ||
|
||
echo "PRODUCED build artifacts." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as usual this would be easier to review if the moving of code were put into a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I tried to do. There's a commit introducing core dump collection for regular tests, and another commit for introducing it for end to end tests.
I could not find any guidelines on how to exactly split the commits. How do you want it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell you first moved a bunch of stuff from travis_watchdog.sh
to common-logging.sh
(commit#1, pure refactoring, no behavioral changes), and then modified run-nightly-tests.sh
to make use of it (commit#2, behavioral changes).
tools/travis_watchdog.sh
Outdated
@@ -177,6 +177,14 @@ print_stacktraces () { | |||
done | |||
} | |||
|
|||
collect_coredumps() { | |||
echo "Searching for .dump, .dumpstream and related files in $($HERE/../)" | |||
for file in `find . -type f -regextype posix-extended -iregex '.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'`; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a simpler solution may be to setup the JVM to directly write the dumps into ARTIFACTS_DIR
.
https://stackoverflow.com/questions/1880166/is-it-possible-to-specify-where-jvms-crash-dumps-go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this implementation, we can use this function also for collecting any JVM debugging files from the end to end tests.
Yes, we could also inject JVM properties through environment variables into most of the end to end tests. But who guarantees that Kafka, ElasticSearch, Zookeeper all properly respect the environment variables.
With this approach, I'm following a blackbox approach where I don't trust the system I'm testing.
Also, I don't know if I would spend my time well for the project if I would refactor something that works. Coredumps (luckily) happen quite rarely. I just wanted to make life for others easier when they have to debug a crashing JVM on a CI system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But who guarantees that Kafka, ElasticSearch, Zookeeper all properly respect the environment variables.
There's no guarantee that these are run under FLINK_ROOT though; the java e2e tests use a junit temporary folder to store and run these services, as well as the flink distribution.
At least you'd also have to search the entire tmp directory of the system, or modify java.io.tmpdir
to point somewhere under FLINK_ROOT.
Have you tried the JAVA_TOOL_OPTIONS
environment variable that the JVM itself evaluates? https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/envvars002.html
I don't know if I would spend my time well for the project if I would refactor something that works.
This can be read in rather frightening ways: "Lol let's use the first version that works now."
|
||
SCRIPT_DIR="`dirname \"$0\"`" | ||
SCRIPT_DIR="`( cd \"${SCRIPT_DIR}\" && pwd -P)`" | ||
export FLINK_ROOT="`( cd \"${SCRIPT_DIR}/..\" && pwd -P)`" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems quite brittle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Do you have a good idea how to make it more stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK you can pass parameters when calling source
, so you could explicitly pass in the desired directory.
Alternatively, set a property before sourcing it.
In my tests, the regex matches the following files: ./flink-state-processing-api/target/surefire-reports/2020-03-05T15-44-03_851-jvmRun1.dump ./flink-state-processing-api/target/surefire-reports/hs_err_pid3432.log ./flink-state-processing-api/target/surefire-reports/2020-03-05T15-11-41_750-jvmRun2.dump ./flink-state-processing-api/target/surefire-reports/2020-03-05T16-44-23_231.dumpstream ./flink-state-processing-api/target/surefire-reports/hs_err_pid_3432.log ./flink-state-processing-api/target/surefire-reports/hs_err_pid.log ./flink-state-processing-api/target/core.11410
08aa4c1
to
b3ce8a2
Compare
@@ -54,7 +54,7 @@ print_system_info() { | |||
print_system_info | |||
|
|||
# enable core dumps | |||
sudo ulimit -c unlimited | |||
ulimit -c unlimited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the sudo removal belongs into another commit.
I'm closing this PR, re-opening parts of this PR as separate PRs. |
What is the purpose of the change
While debugging unstable e2e tests and crashing JVMs remotely, I came up with a bunch of improvements to our testing scripts, which I would like to add to the main repo.
Brief change log
The following changes are contained here:
Verifying this change
This run properly collects coredumps etc of a test that used to fail on master a few days ago: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6137&view=results
Does this pull request potentially affect one of the following parts:
affects only tests