Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-16423][e2e] Introduce timeouts for HA tests #11831

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
7 changes: 4 additions & 3 deletions flink-end-to-end-tests/test-scripts/common_ha.sh
Expand Up @@ -23,10 +23,9 @@ source "${END_TO_END_DIR}"/test-scripts/common.sh
# flag indicating if we have already cleared up things after a test
CLEARED=0

JM_WATCHDOG_PID=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a curiosity, what's the reason of this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JM_WATCHDOG_PID is used in the stop_watchdogs function, which is called from different bash-processes.
Because of that, if the variable is set in one process, it's not available in the other one.

That's why I decided to store the PID in a file.

TM_WATCHDOG_PID=0

function stop_watchdogs() {
JM_WATCHDOG_PID=`cat $TEST_DATA_DIR/jm_watchdog.pid`
TM_WATCHDOG_PID=`cat $TEST_DATA_DIR/tm_watchdog.pid`
if [ ${CLEARED} -eq 0 ]; then

if ! [ ${JM_WATCHDOG_PID} -eq 0 ]; then
Expand Down Expand Up @@ -115,6 +114,7 @@ function start_jm_cmd {
function start_ha_jm_watchdog() {
jm_watchdog $1 $2 ${@:3} &
JM_WATCHDOG_PID=$!
echo $JM_WATCHDOG_PID > $TEST_DATA_DIR/jm_watchdog.pid
echo "Running JM watchdog @ ${JM_WATCHDOG_PID}"
}

Expand Down Expand Up @@ -179,6 +179,7 @@ function ha_tm_watchdog() {
function start_ha_tm_watchdog() {
ha_tm_watchdog $1 $2 &
TM_WATCHDOG_PID=$!
echo $TM_WATCHDOG_PID > $TEST_DATA_DIR/tm_watchdog.pid
echo "Running TM watchdog @ ${TM_WATCHDOG_PID}"
}

25 changes: 24 additions & 1 deletion flink-end-to-end-tests/test-scripts/test_ha_datastream.sh
Expand Up @@ -20,6 +20,12 @@
source "$(dirname "$0")"/common.sh
source "$(dirname "$0")"/common_ha.sh

#
# NOTE: This script requires at least Bash version >= 4. Mac OS in 2020 still ships 3.x
#

TEST_TIMEOUT_SECONDS=540

TEST_PROGRAM_JAR=${END_TO_END_DIR}/flink-datastream-allround-test/target/DataStreamAllroundTestProgram.jar

function ha_cleanup() {
Expand Down Expand Up @@ -100,4 +106,21 @@ STATE_BACKEND_FILE_ASYNC=${2:-true}
STATE_BACKEND_ROCKS_INCREMENTAL=${3:-false}
ZOOKEEPER_VERSION=${4:-3.4}

run_ha_test 4 ${STATE_BACKEND_TYPE} ${STATE_BACKEND_FILE_ASYNC} ${STATE_BACKEND_ROCKS_INCREMENTAL} ${ZOOKEEPER_VERSION}
function kill_test_watchdog() {
local watchdog_pid=`cat $TEST_DATA_DIR/job_watchdog.pid`
echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
kill $watchdog_pid
}
on_exit kill_test_watchdog

(
cmdpid=$BASHPID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BASHPID does not work on my Mac. It's introduced in Bash 4, however the version of Bash on my Mac is 3.2. Maybe we could use $$ when BASHPID is not supported? $$ is more general, however sometimes it does not work well as expected, like executing in background.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these changes require bash 4. Bash on mac is pretty outdated sadly.

I think $$ won't work, as it will return the PID of the bash-process, not of the subshell I'm launching there:

When you want to find out the PID of your current shell session, even from subshells, that’s when you use $$. Any subshell has its own PID as well, and you can access it via the $BASHPID variable.

https://bashwizard.com/shells-subshells-pids/

But I will add a warning to the script, that is requires at least bash 4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right. It's in another sub-process.

I come up with a work-around way. It probably works that executing the watchdog in original main process and executing the testing scripts in sub-process. We can get the sub-process pid in main process.

But I guess it's not a big deal, as MacOS is not a main scenario. I hope this PR could be merged asap. I would help a lot.

(sleep $TEST_TIMEOUT_SECONDS; # set a timeout of 10 minutes for this test
echo "Test (pid: $cmdpid) did not finish after $TEST_TIMEOUT_SECONDS seconds."
echo "Printing Flink logs and killing it:"
cat ${FLINK_DIR}/log/*
kill "$cmdpid") & watchdog_pid=$!
echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid
run_ha_test 4 ${STATE_BACKEND_TYPE} ${STATE_BACKEND_FILE_ASYNC} ${STATE_BACKEND_ROCKS_INCREMENTAL} ${ZOOKEEPER_VERSION}
)

Expand Up @@ -20,11 +20,16 @@
source "$(dirname "$0")"/common.sh
source "$(dirname "$0")"/common_ha.sh

TEST_TIMEOUT_SECONDS=540
TEST_PROGRAM_JAR_NAME=DataStreamAllroundTestProgram.jar
TEST_PROGRAM_JAR=${END_TO_END_DIR}/flink-datastream-allround-test/target/${TEST_PROGRAM_JAR_NAME}
FLINK_LIB_DIR=${FLINK_DIR}/lib
JOB_ID="00000000000000000000000000000000"

#
# NOTE: This script requires at least Bash version >= 4. Mac OS in 2020 still ships 3.x
#

function ha_cleanup() {
stop_watchdogs
kill_all 'StandaloneJobClusterEntryPoint'
Expand Down Expand Up @@ -148,4 +153,23 @@ STATE_BACKEND_TYPE=${1:-file}
STATE_BACKEND_FILE_ASYNC=${2:-true}
STATE_BACKEND_ROCKS_INCREMENTAL=${3:-false}

run_ha_test 4 ${STATE_BACKEND_TYPE} ${STATE_BACKEND_FILE_ASYNC} ${STATE_BACKEND_ROCKS_INCREMENTAL}
function kill_test_watchdog() {
local watchdog_pid=`cat $TEST_DATA_DIR/job_watchdog.pid`
echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
kill $watchdog_pid
}
on_exit kill_test_watchdog

(
cmdpid=$BASHPID;
(sleep $TEST_TIMEOUT_SECONDS; # set a timeout of 10 minutes for this test
echo "Test (pid: $cmdpid) did not finish after $TEST_TIMEOUT_SECONDS seconds."
echo "Printing Flink logs and killing it:"
cat ${FLINK_DIR}/log/*
kill "$cmdpid") & watchdog_pid=$!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another curiosity, is there any magic of inline watchdog_pid=$!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I don't know if this has to be in the same line or not. Probably not.
It is okay for you, I would not change this, because it would mean that I have to test the scripts again (I manually tested that they are behaving as expected in failure / timeout / success cases.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK for me :)

echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid

run_ha_test 4 ${STATE_BACKEND_TYPE} ${STATE_BACKEND_FILE_ASYNC} ${STATE_BACKEND_ROCKS_INCREMENTAL}
)