[FLINK-19979][e2e] Add sanity check after bash e2e tests for no leftovers #14033

Merged
merged 2 commits into from Nov 16, 2020

Conversation

rmetzger
Contributor

What is the purpose of the change

The purpose of this change is to increase the robustness of the e2e tests overall.

Brief change log

  • Check for leftover JVMs after an e2e test, and fail if there is one
  • Check for "more ports open than before", and fail if there are more open
  • Increase robustness of process stopping: wait until a process has finished, sending it a SIGKILL if it does not stop in time
  • Increase robustness of run_with_timeout, which used to leave a dangling sleep in the system
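
The last item can be illustrated with a small sketch. This is not the PR's implementation: `run_with_timeout_sketch` is a hypothetical name, and the trap-based cleanup is only one assumed way to keep a watchdog's `sleep` from outliving the command it guards.

```shell
# Hypothetical helper (not from the PR): run a command with a timeout and
# make sure the watchdog's sleep does not survive the command.
run_with_timeout_sketch() {
    local timeout_in_seconds="$1"
    shift

    # Run the actual command in the background so it can be watched.
    "$@" &
    local command_pid=$!

    # Watchdog subshell: sleeps, then terminates the command if it still runs.
    (
        sleep_pid=""
        # On TERM, kill the sleep as well so nothing is left dangling.
        trap '[ -n "$sleep_pid" ] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        sleep "$timeout_in_seconds" &
        sleep_pid=$!
        wait "$sleep_pid"
        kill "$command_pid" 2>/dev/null
    ) &
    local watchdog_pid=$!

    local exit_code=0
    wait "$command_pid" || exit_code=$?

    # The command finished or was terminated: stop the watchdog and reap it.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true
    return "$exit_code"
}
```

Calling `run_with_timeout_sketch 5 some_command` returns the command's exit status if it finishes in time, and a non-zero status if the watchdog had to terminate it.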

Verifying this change

I ran this change through 3 CI runs, to make sure all the tests adhere to the stricter requirements of "no leftover stuff".

I will also manually test the watchdog again (proper killing and reporting).

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 43dee71 (Wed Nov 11 11:46:02 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 11, 2020

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@rmetzger
Contributor Author

I will fix the

2020-11-11T15:54:49.5642835Z Nov 11 15:54:49 cat: /proc/119970/cmdline: No such file or directory
2020-11-11T15:54:49.5647189Z Nov 11 15:54:49 Waiting till process is stopped: pid = 119970 pattern = 'TaskManagerRunner|TaskManager' cmdline = ''
2020-11-11T15:54:49.5797225Z Nov 11 15:54:49 cat: /proc/120435/cmdline: No such file or directory
2020-11-11T15:54:49.5800596Z Nov 11 15:54:49 Waiting till process is stopped: pid = 120435 pattern = 'TaskManagerRunner|TaskManager' cmdline = ''

output with the first round of reviews.

@rmetzger
Contributor Author

Otherwise, the e2e tests have passed in the first run

…ver processes

This commit also contains a number of robustness improvements.
@rmetzger
Contributor Author

Rebased to latest master to make sure e2e tests are still compliant.

Contributor

@XComp XComp left a comment

I went through the code. There are some discussion items - not really big code changes. But I'd like to get those clarified before finalizing the PR.

Comment on lines +596 to +608
echo "Waiting till process is stopped: pid = $pid pattern = '${1}'"
kill ${pid} 2> /dev/null || true
if [[ "$OS_TYPE" == "mac" ]]; then
    # works on mac, but does seem to return before the process has finished on Linux
    wait ${pid} 2> /dev/null || true
else
    # use tail to wait for a process to finish: https://stackoverflow.com/questions/1058047/wait-for-a-process-to-finish/11719943
    timeout 60 tail --pid=${pid} -f /dev/null
    if [ "$?" -eq 124 ]; then
        echo "Process (pid = $pid) didn't stop within 60 seconds. Killing it:"
        kill -9 $pid
    fi
fi
Contributor

Shouldn't we move that out into its own common utility function considering that we used almost the exact same code in PR #14062 (FLINK-17470)?

Contributor Author

The problem is that the scripts in #14062 are shipped to the users as part of the Flink distribution for starting Flink.

The code here is solely for the testing infrastructure. We could in theory source a script from the distribution here, but that would be a weird dependency.

Contributor

I see. Makes sense.
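
For illustration, the stop-then-escalate idea from the snippet under review can also be written as a polling loop. This is a sketch under assumptions, not the code from the PR: `stop_and_wait_sketch` is a hypothetical name, and it polls with `kill -0` instead of using `tail --pid`, so it needs neither GNU tail nor an OS switch. Polling also works for processes that are not children of the current shell, whereas the shell's `wait` builtin only waits for its own children.

```shell
# Hypothetical helper (not from the PR): send SIGTERM, poll until the process
# is gone, and fall back to SIGKILL after a deadline.
stop_and_wait_sketch() {
    local pid="$1"
    local timeout_in_seconds="${2:-60}"
    local waited=0

    kill "$pid" 2>/dev/null || true

    # `kill -0` sends no signal; it only checks whether the pid still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout_in_seconds" ]; then
            echo "Process (pid = $pid) didn't stop within $timeout_in_seconds seconds. Killing it:"
            kill -9 "$pid" 2>/dev/null || true
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

The one-second polling interval is an arbitrary choice for the sketch; a shorter interval reacts faster at the cost of more wake-ups.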

echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
eval "${on_failure}"
kill "$command_pid"
pkill -P "$command_pid"
Contributor

I couldn't verify it, which is why I'm asking: Are you sure that this is what you want to do? AFAIK, pkill would need some pattern and -P just restricts the parent PID?! 🤔

Contributor Author

My understanding is that this command kills all children of $command_pid, and if there's no pattern, it'll kill all of them.

Contributor

Ok, I did some more research on it. Looks like it works like that.
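
The behaviour discussed in this thread can be checked with a small experiment. It is a sketch that assumes the procps versions of `pgrep` and `pkill` found on typical Linux CI images; `parent_pid`, `children_before` and `children_after` are names made up for the example. Note that `-P` selects by parent PID, so only direct children are matched, not grandchildren.

```shell
# Start a parent shell that owns two background children.
sh -c 'sleep 300 & sleep 300 & wait' &
parent_pid=$!
sleep 1  # give the children time to start

children_before=$(pgrep -P "$parent_pid" | wc -l)

# No pattern given: every process whose parent is $parent_pid is signalled.
pkill -P "$parent_pid"
sleep 1

children_after=$(pgrep -P "$parent_pid" | wc -l)
echo "children before: $children_before, after: $children_after"

# Clean up the parent in case it is still around.
kill "$parent_pid" 2>/dev/null || true
```

Once both children are gone, the parent's `wait` returns and the parent shell exits on its own, which is why the final `kill` is allowed to fail.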

# "ps --ppid 2 -p 2 --deselect" shows all non-kernel processes
# "ps --ppid $$" shows all children of this bash process
# "ps -o pid= -o comm=" removes the header line
echo $(sudo netstat -tulpn | wc -l)
Contributor

Why do we have to run sudo here? Isn't the test executed by root? 🤔

Contributor Author

AFAIK the tests are run by the vsts user on Azure. I just want to be sure that we are catching all ports.
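
The "more ports open than before" check amounts to taking the socket count before and after a test and failing when it grows. A minimal sketch of that comparison follows; `assert_no_new_open_ports` is a hypothetical name rather than a function from the PR, and the `netstat` calls appear only in comments because they need `sudo` and a Linux host.

```shell
# Hypothetical helper (not from the PR): fail if the number of open ports grew.
# Intended use, following the snippet above:
#   ports_before=$(sudo netstat -tulpn | wc -l)
#   ... run the e2e test ...
#   ports_after=$(sudo netstat -tulpn | wc -l)
#   assert_no_new_open_ports "$ports_before" "$ports_after"
assert_no_new_open_ports() {
    local before="$1"
    local after="$2"
    if [ "$after" -gt "$before" ]; then
        echo "More ports open than before the test: $before -> $after"
        return 1
    fi
    return 0
}
```

Comparing raw line counts keeps the check simple. The header lines that `netstat` prints are counted in both snapshots, so they cancel out in the comparison.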

@rmetzger
Contributor Author

Thanks for the review.
This CI run reported a port held open by "dotnet": https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9552&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529
I pushed a version with more debugging info.

Contributor

@XComp XComp left a comment

LGTM

@rmetzger
Contributor Author

Thanks a lot. Merging ...

@rmetzger rmetzger merged commit 4cc0e72 into apache:master Nov 16, 2020
rmetzger added a commit that referenced this pull request Nov 17, 2020