[FLINK-19979][e2e] Add sanity check after bash e2e tests for no leftovers #14033

rmetzger · 2020-11-11T11:43:44Z

What is the purpose of the change

The purpose of this change is to increase the robustness of the e2e tests overall.

Brief change log

Check for leftover JVMs after an e2e test, and fail if there is one
Check for "more ports open than before", and fail if there are more open
Increase robustness of process stopping: wait until a process has finished, potentially killing it with SIGKILL
Increase robustness of run_with_timeout, which used to leave a dangling sleep in the system
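
The leftover-JVM check in the list above could look roughly like the following sketch. This is not the code from the PR: the function name `check_no_leftover_jvms` and the use of `pgrep -af java` to find JVMs are assumptions made for illustration. The list of candidate processes is passed in as text so the decision logic can be tried without a running JVM.

```shell
# Hypothetical sketch, not the code from this PR.
# Takes a newline-separated list of "pid command" entries found after a test
# and returns non-zero if that list is not empty.
check_no_leftover_jvms() {
  local leftovers="$1"
  if [ -n "$leftovers" ]; then
    echo "Found leftover JVM processes after the test:"
    echo "$leftovers"
    return 1
  fi
  echo "No leftover JVM processes found."
  return 0
}

# Assumed usage (pgrep -f matches the full command line, -a prints it):
# check_no_leftover_jvms "$(pgrep -af java || true)"
```

Keeping the process lookup outside the function means the pattern used to identify a JVM can be swapped without touching the pass/fail logic.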

Verifying this change

I ran this change through 3 CI runs, to make sure all the tests adhere to the stricter requirements of "no leftover stuff".

I will also manually test the watchdog again (proper killing and reporting).

flinkbot · 2020-11-11T11:46:02Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 43dee71 (Wed Nov 11 11:46:02 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-11-11T12:10:02Z

CI report:

fc9f926 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

rmetzger · 2020-11-11T18:38:04Z

I will fix the

2020-11-11T15:54:49.5642835Z Nov 11 15:54:49 cat: /proc/119970/cmdline: No such file or directory
2020-11-11T15:54:49.5647189Z Nov 11 15:54:49 Waiting till process is stopped: pid = 119970 pattern = 'TaskManagerRunner|TaskManager' cmdline = ''
2020-11-11T15:54:49.5797225Z Nov 11 15:54:49 cat: /proc/120435/cmdline: No such file or directory
2020-11-11T15:54:49.5800596Z Nov 11 15:54:49 Waiting till process is stopped: pid = 120435 pattern = 'TaskManagerRunner|TaskManager' cmdline = ''

output with the first round of reviews.

rmetzger · 2020-11-11T18:38:27Z

Otherwise, the e2e tests passed in the first run.

rmetzger · 2020-11-13T07:52:36Z

Rebased to latest master to make sure e2e tests are still compliant.

XComp

I went through the code. There are some discussion items - not really big code changes. But I'd like to get those clarified before finalizing the PR.

XComp · 2020-11-13T17:48:15Z

flink-end-to-end-tests/test-scripts/common.sh

+      echo "Waiting till process is stopped: pid = $pid pattern = '${1}'"
+      kill ${pid} 2> /dev/null || true
+      if [[ "$OS_TYPE" == "mac" ]]; then
+          # works on mac, but does seem to return before the process has finished on Linux
+          wait ${pid} 2> /dev/null || true
+      else
+          # use tail to wait for a process to finish: https://stackoverflow.com/questions/1058047/wait-for-a-process-to-finish/11719943
+          timeout 60 tail --pid=${pid} -f /dev/null
+          if [ "$?" -eq 124 ]; then
+            echo "Process (pid = $pid) didn't stop within 60 seconds. Killing it:"
+            kill -9 $pid
+          fi
+      fi


Shouldn't we move that out into its own common utility function considering that we used almost the exact same code in PR #14062 (FLINK-17470)?

The problem is that the scripts in #14062 are shipped to users as part of the Flink distribution for starting Flink.

The code here is solely for the testing infrastructure. We could in theory source a script from the distribution here, but that would be a weird dependency.

I see. Makes sense.
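
The stopping logic in the diff above depends on GNU `timeout` and `tail --pid`. A rough, more portable way to express the same idea (SIGTERM first, wait, then SIGKILL) is to poll with `kill -0`. The sketch below is an illustration only: the function name `stop_process_with_timeout` and the one-second polling interval are assumptions, not part of the PR.

```shell
# Hypothetical helper, not the code from this PR.
# Sends SIGTERM to a process, polls until it is gone, and falls back to
# SIGKILL once the timeout (in seconds) has expired.
stop_process_with_timeout() {
  local pid="$1"
  local timeout_in_seconds="${2:-60}"
  local waited=0
  kill "$pid" 2>/dev/null || true
  # kill -0 sends no signal; it only reports whether the pid still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_in_seconds" ]; then
      echo "Process (pid = $pid) didn't stop within $timeout_in_seconds seconds. Killing it:"
      kill -9 "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

Polling trades precision for portability: the function notices the exit up to one second late, but it does not rely on the `--pid` option of GNU `tail`.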

XComp · 2020-11-13T17:56:50Z

flink-end-to-end-tests/test-scripts/common.sh

+            echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
+            eval "${on_failure}"
+            kill "$command_pid"
+            pkill -P "$command_pid"


I couldn't verify it, which is why I'm asking: are you sure that this is what you want to do? AFAIK, pkill needs a pattern, and -P only restricts the match to a parent PID?! 🤔

My understanding is that this command kills all children of $command_pid; if there is no pattern, it kills all of them.

Ok, I did some more research on it. Looks like it works like that.
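
The behaviour discussed in this thread can be checked with a small experiment. The script below is a hypothetical demonstration, not code from the PR. It assumes `pgrep` and `pkill` (procps or a compatible implementation) are installed, and the variable names are made up for illustration.

```shell
# Hypothetical demonstration, not code from this PR.
# Start a parent shell that spawns two children and waits for them.
sh -c 'sleep 30 & sleep 30 & wait' &
parent_pid=$!
sleep 1   # give the parent time to fork its children

children_before=$(pgrep -P "$parent_pid" | wc -l)

# No pattern is given: -P alone selects every direct child of parent_pid.
pkill -P "$parent_pid"
sleep 1   # give the children time to exit

children_after=$(pgrep -P "$parent_pid" | wc -l)
echo "children before: $children_before, after: $children_after"

# Clean up the parent in case it is still around.
kill "$parent_pid" 2>/dev/null || true
```

Note that `-P` only matches direct children, so grandchildren of the timed-out command would not be signalled by this call.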

XComp · 2020-11-13T18:04:12Z

flink-end-to-end-tests/test-scripts/test-runner-common.sh

+    # "ps --ppid 2 -p 2 --deselect" shows all non-kernel processes
+    # "ps --ppid $$" shows all children of this bash process
+    # "ps -o pid= -o comm=" removes the header line
+    echo $(sudo netstat -tulpn | wc -l)


Why do we have to run sudo here? Isn't the test executed by root? 🤔

AFAIK, the tests are run by the vsts user on Azure. I just want to be sure that we are catching all ports.
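
A before/after comparison built on the `netstat` line count from the diff above might be structured as in this sketch. The function names `count_open_ports` and `check_ports_not_increased` are assumptions for illustration and do not come from the PR.

```shell
# Hypothetical sketch, not the code from this PR.
# Counts the lines netstat prints for listening TCP/UDP sockets. Header lines
# are included, which is harmless because only the difference is compared.
count_open_ports() {
  sudo netstat -tulpn | wc -l
}

# Fails if the number of open ports grew between the two snapshots.
check_ports_not_increased() {
  local before="$1"
  local after="$2"
  if [ "$after" -gt "$before" ]; then
    echo "More ports open than before the test: before = $before, after = $after"
    return 1
  fi
  return 0
}

# Assumed usage around a test run:
# ports_before=$(count_open_ports)
# ... run the e2e test ...
# check_ports_not_increased "$ports_before" "$(count_open_ports)"
```

Comparing counts rather than full listings keeps the check simple, at the cost of missing the case where one port closes and another opens during the test.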

rmetzger · 2020-11-13T18:43:54Z

Thanks for the review.
This CI run contained a port opened by "dotnet": https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9552&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529
I pushed a version with more debugging info.

XComp

LGTM

rmetzger · 2020-11-16T20:08:02Z

Thanks a lot. Merging ...

rmetzger added the review=description? label Nov 11, 2020

rmetzger added the component=TestInfrastructure label Nov 11, 2020

[FLINK-19979][e2e] Add sanity check after bash e2e tests for no lefto…

9e5179d

…ver processes This commit also contains a number of robustness improvements.

rmetzger force-pushed the FLINK-19979-e2e-sanity branch from 43dee71 to 9e5179d Compare November 13, 2020 07:52

XComp requested changes Nov 13, 2020

print pstree for debugging dotnet

fc9f926

XComp approved these changes Nov 15, 2020

rmetzger merged commit 4cc0e72 into apache:master Nov 16, 2020

rmetzger added a commit that referenced this pull request Nov 17, 2020

Revert "[FLINK-19979][e2e] Add sanity check after bash e2e tests for …

0fa50a3

…no leftovers (#14033)" This reverts commit 4cc0e72.
