[FLINK-19664][e2e] Upload logs before tests time out #13655

rmetzger · 2020-10-15T18:07:53Z

What is the purpose of the change

Due to a bug in Azure Pipelines, we cannot see the e2e output when a run times out.
This pull request adds tooling to rescue the logs before it's too late.

Verifying this change

I've tested it here: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8473&view=artifacts&type=publishedArtifacts

flinkbot · 2020-10-15T18:09:52Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 117ce2e (Thu Oct 15 18:09:52 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-10-15T18:22:26Z

CI report:

117ce2e Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

leonardBang · 2020-10-16T02:25:26Z

The e2e test failed due to a timeout at 280 minutes; raising the limit from 260 to 280 may still not be enough.

zentol · 2020-10-16T08:07:45Z

tools/azure-pipelines/e2e_uploading_watchdog.sh

+echo "Running command '$COMMAND' with a timeout of $DEFINED_TIMEOUT_MINUTES minutes."
+
+function warning_watchdog {
+	SLEEP_TIME=$(echo "scale=0; $DEFINED_TIMEOUT_SECONDS*0.8/1" | bc)


Could you explain the various bits here? Why is the /1 necessary?

Bash doesn't support floating-point math, which is why I use bc.
Computing just "$number*0.8" returns a floating-point value. The division by 1 combined with scale=0 truncates the result to an integer: https://stackoverflow.com/a/20562313/568695
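
To illustrate the explanation above, here is a minimal sketch comparing the bc expression from the script with plain bash integer arithmetic. The function names and the sample value are hypothetical and not part of the PR; the second function is only an assumed alternative, not something the author proposed.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only the bc expression is taken from the script under review.

# 80% of a number of seconds via bc: with scale=0, the trailing "/1" drops the fraction.
function eighty_percent_bc {
    echo "scale=0; $1*0.8/1" | bc
}

# Assumed alternative: bash integer arithmetic, where integer division also truncates.
function eighty_percent_int {
    echo $(( $1 * 8 / 10 ))
}

DEFINED_TIMEOUT_SECONDS=1001   # sample value chosen so that 80% is not a whole number
echo "integer arithmetic: $(eighty_percent_int "$DEFINED_TIMEOUT_SECONDS")"
if command -v bc >/dev/null 2>&1; then
    echo "bc: $(eighty_percent_bc "$DEFINED_TIMEOUT_SECONDS")"
fi
```

Both variants truncate rather than round to nearest, which is harmless for a sleep duration measured in seconds.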

zentol · 2020-10-16T08:09:11Z

tools/azure-pipelines/e2e_uploading_watchdog.sh

+	SLEEP_TIME=$(($DEFINED_TIMEOUT_SECONDS-$START_LOG_UPLOAD_SECONDS_FROM_END))
+	sleep $SLEEP_TIME
+	echo "======================================================================================================="
+	echo "=== WARNING: This E2E Run will be killed in the next few minutes. Starting to upload the log output ==="


With "killed", do you mean Azure timing out?

You are right. I'll update the wording.

zentol · 2020-10-16T08:13:39Z

tools/azure-pipelines/e2e_uploading_watchdog.sh

+log_upload_watchdog &
+
+# ts from moreutils prepends the time to each line
+( $COMMAND & PID=$! ; wait $PID ) | ts | tee $OUTPUT_FILE


hmm...could we not actually kill PID after the timeout?

I feel uncomfortable killing here, because my accounting for when to start uploading the logs and showing the warning is somewhat fragile: we start the timeout in the test stage, not in the compile stage, but the overall timeout covers compile + test.
Putting a kill here could severely affect the user experience and would require constant adjustment.

What I'm proposing in this PR is a best effort logging solution that we can potentially even disable again once Azure is reliably presenting logs again even in failure cases.

I see, makes sense 👍

Can we not define per-stage timeouts?

We actually can. Let me investigate. I still won't introduce a hard kill, but it would make the behavior of the upload more accurate.
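
The best-effort approach discussed in this thread can be sketched roughly as follows. This is not the script from the PR: the variable names, the very short durations, and the plain `cp` standing in for the artifact upload are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical values; a real CI script would derive them from the pipeline configuration.
TIMEOUT_SECONDS=4           # assumed overall budget for the command
UPLOAD_SECONDS_FROM_END=3   # start rescuing logs this long before the budget ends
OUTPUT_FILE=$(mktemp)
RESCUE_DIR=$(mktemp -d)     # stands in for the artifact upload location

# Sleeps until shortly before the timeout, then saves whatever has been logged so far.
function log_rescue_watchdog {
    sleep $((TIMEOUT_SECONDS - UPLOAD_SECONDS_FROM_END))
    echo "=== WARNING: this run is close to its timeout, saving the log output ==="
    cp "$OUTPUT_FILE" "$RESCUE_DIR/output.log"
}

log_rescue_watchdog &
WATCHDOG_PID=$!

# The command itself is never killed; its output is only mirrored into OUTPUT_FILE.
( echo "test started"; sleep 2; echo "test finished" ) | tee "$OUTPUT_FILE"

# Stop the watchdog in case the command finished before the watchdog fired.
kill "$WATCHDOG_PID" 2>/dev/null || true
wait "$WATCHDOG_PID" 2>/dev/null || true
```

Because the watchdog only copies files and never signals the command, a misjudged timeout costs at most an incomplete log copy rather than an aborted test run.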

zentol

+1

[FLINK-19664][e2e] Upload logs before tests time out

117ce2e

rmetzger added the review=description? label Oct 15, 2020

rmetzger added the component=BuildSystem/AzurePipelines label Oct 15, 2020

zentol self-assigned this Oct 16, 2020

zentol reviewed Oct 16, 2020


zentol approved these changes Oct 19, 2020


rmetzger closed this in 4ca00ed Oct 20, 2020
