New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-19979][e2e] Add sanity check after bash e2e tests for no leftovers #14033
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 43dee71 (Wed Nov 11 11:46:02 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
I will fix the
output with the first round of reviews. |
Otherwise, the e2e tests have passed in the first run |
…ver processes This commit also contains a number of robustness improvements.
43dee71
to
9e5179d
Compare
Rebased to latest master to make sure e2e tests are still compliant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the code. There are some discussion items - not really big code changes. But I'd like to get those clarified before finalizing the PR.
echo "Waiting till process is stopped: pid = $pid pattern = '${1}'" | ||
kill ${pid} 2> /dev/null || true | ||
if [[ "$OS_TYPE" == "mac" ]]; then | ||
# works on mac, but does seem to return before the process has finished on Linux | ||
wait ${pid} 2> /dev/null || true | ||
else | ||
# use tail to wait for a process to finish: https://stackoverflow.com/questions/1058047/wait-for-a-process-to-finish/11719943 | ||
timeout 60 tail --pid=${pid} -f /dev/null | ||
if [ "$?" -eq 124 ]; then | ||
echo "Process (pid = $pid) didn't stop within 60 seconds. Killing it:" | ||
kill -9 $pid | ||
fi | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we move that out into its own common utility function considering that we used almost the exact same code in PR #14062 (FLINK-17470)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the scripts in #14062 is shipped to the users as part of the Flink distribution for starting Flink.
The code here is solely for the testing infrastructure. We could in theory source a script from the distribution here, but that would be a weird dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Makes sense.
echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds." | ||
eval "${on_failure}" | ||
kill "$command_pid" | ||
pkill -P "$command_pid" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't verify it that's why I'm asking: Are you sure that this is what you want to do? AFAIK, pkill
would need some pattern
and -P
just restricts the parent PID?! 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that this command kills all children of $command_pid
, and if there's no pattern, it'll kill all of them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I did some more research on it. Looks like it works like that.
# "ps --ppid 2 -p 2 --deselect" shows all non-kernel processes | ||
# "ps --ppid $$" shows all children of this bash process | ||
# "ps -o pid= -o comm=" removes the header line | ||
echo $(sudo netstat -tulpn | wc -l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we have to run sudo here? Isn't the test executed by root
? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik the tests are run by vsts
user on Azure. I just want to be sure that we are catching all ports
Thanks for the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks a lot. Merging ... |
What is the purpose of the change
The purpose of this change is to increase the robustness of the e2e tests overall.
Brief change log
run_with_timeout
. It used to leave a danglingsleep
in the systemVerifying this change
I ran this change through 3 CI runs, to make sure all the tests adhere to the stricter requirements of "no leftover stuff".
I will also manually test the watchdog again (proper killing and reporting).