New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-19664][e2e] Upload logs before tests time out #13655
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 117ce2e (Thu Oct 15 18:09:52 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
The e2e test failed by timeout with 280 mins, maybe improving from 260 to 280 is still not enough |
echo "Running command '$COMMAND' with a timeout of $DEFINED_TIMEOUT_MINUTES minutes." | ||
|
||
function warning_watchdog { | ||
SLEEP_TIME=$(echo "scale=0; $DEFINED_TIMEOUT_SECONDS*0.8/1" | bc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain the various bits here? Why is the /1
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bash doesn't support floating point math, that's why I use bc.
doing just "$number/0.8" will return a floating point. The division and scale is for rounding to an integer: https://stackoverflow.com/a/20562313/568695
SLEEP_TIME=$(($DEFINED_TIMEOUT_SECONDS-$START_LOG_UPLOAD_SECONDS_FROM_END)) | ||
sleep $SLEEP_TIME | ||
echo "=======================================================================================================" | ||
echo "=== WARNING: This E2E Run will be killed in the next few minutes. Starting to upload the log output ===" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with "killed" you mean azure timing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I'll update the wording.
log_upload_watchdog & | ||
|
||
# ts from moreutils prepends the time to each line | ||
( $COMMAND & PID=$! ; wait $PID ) | ts | tee $OUTPUT_FILE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm...could we not actually kill PID after the timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel uncomfortable killing here, because my whole accounting for when to start uploading the log / showing the warning is somewhat fragile: We start the timeout in the test stage, not in the compile stage, however, the overall timeout is for compile+test.
Putting a kill here could severely affect the user experience and requires constant adjustment.
What I'm proposing in this PR is a best effort logging solution that we can potentially even disable again once Azure is reliably presenting logs again even in failure cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense 👍
Can we not define per-stage timeouts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually can. Let me investigate. I still won't introduce a hard kill, but it would make the behavior of the upload more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
What is the purpose of the change
Due to a bug in azure pipelines, we can not see the e2e output when a run times out.
This pull is to add some tooling for rescuing the logs before it's too late
Verifying this change
I've tested it here: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8473&view=artifacts&type=publishedArtifacts