Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-19664][e2e] Upload logs before tests time out #13655

Closed
wants to merge 1 commit into from

Conversation

rmetzger
Copy link
Contributor

What is the purpose of the change

Due to a bug in azure pipelines, we can not see the e2e output when a run times out.
This pull is to add some tooling for rescuing the logs before it's too late

Verifying this change

I've tested it here: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8473&view=artifacts&type=publishedArtifacts

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 117ce2e (Thu Oct 15 18:09:52 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Oct 15, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@leonardBang
Copy link
Contributor

leonardBang commented Oct 16, 2020

The e2e test failed by timeout with 280 mins, maybe improving from 260 to 280 is still not enough

@zentol zentol self-assigned this Oct 16, 2020
echo "Running command '$COMMAND' with a timeout of $DEFINED_TIMEOUT_MINUTES minutes."

function warning_watchdog {
SLEEP_TIME=$(echo "scale=0; $DEFINED_TIMEOUT_SECONDS*0.8/1" | bc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain the various bits here? Why is the /1 necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bash doesn't support floating point math, that's why I use bc.
doing just "$number/0.8" will return a floating point. The division and scale is for rounding to an integer: https://stackoverflow.com/a/20562313/568695

SLEEP_TIME=$(($DEFINED_TIMEOUT_SECONDS-$START_LOG_UPLOAD_SECONDS_FROM_END))
sleep $SLEEP_TIME
echo "======================================================================================================="
echo "=== WARNING: This E2E Run will be killed in the next few minutes. Starting to upload the log output ==="
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with "killed" you mean azure timing out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I'll update the wording.

log_upload_watchdog &

# ts from moreutils prepends the time to each line
( $COMMAND & PID=$! ; wait $PID ) | ts | tee $OUTPUT_FILE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...could we not actually kill PID after the timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel uncomfortable killing here, because my whole accounting for when to start uploading the log / showing the warning is somewhat fragile: We start the timeout in the test stage, not in the compile stage, however, the overall timeout is for compile+test.
Putting a kill here could severely affect the user experience and requires constant adjustment.

What I'm proposing in this PR is a best effort logging solution that we can potentially even disable again once Azure is reliably presenting logs again even in failure cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense 👍

Can we not define per-stage timeouts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually can. Let me investigate. I still won't introduce a hard kill, but it would make the behavior of the upload more accurate.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@rmetzger rmetzger closed this in 4ca00ed Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants