[FLINK-10632][e2e] Running general purpose testing job with failure in per-job mode #6943
Conversation
Thanks for your contribution @dawidwys. All in all it looks good. However, in its current form the test is not executable on a Mac. Moreover, we should add the test to run-nightly-tests.sh. After addressing these problems, I think we can merge this E2E test.
    # kill the cluster and zookeeper
    stop_watchdogs
    shutdown_all
I think shutdown_all will be called by the test runner in test-runner-common.sh.
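The point about the runner owning cleanup can be sketched with a trap handler. This is a hedged illustration, not the actual contents of test-runner-common.sh: `shutdown_all` is the real helper name from the scripts under review, while `run_test_body` is a hypothetical stand-in for the runner's entry point.

```shell
# Hypothetical sketch: the runner installs shutdown_all as an EXIT trap,
# so individual tests never need to call it themselves.

shutdown_all() {
    # In the real scripts this stops cluster processes and ZooKeeper.
    echo "shutting down all cluster components"
}

run_test_body() {
    # The trap fires on normal exit as well as on failure, guaranteeing
    # cleanup regardless of how the test body terminates.
    trap shutdown_all EXIT
    "$@"
}

run_test_body echo "test body ran"
```

Because the trap fires unconditionally, a test script that also calls shutdown_all explicitly would just shut the cluster down twice.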
    # kill the cluster and zookeeper
    stop_watchdogs
    shutdown_all
Same here.
    stop_watchdogs
    kill_all 'StandaloneJobClusterEntryPoint'
    shutdown_all
No need to call this, since the test runner will call it.
    # submit a job in detached mode and let it run
    run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

    start_taskmanagers 1
This only works because in HA mode we start a TM with 4 slots. But what happens if PARALLELISM is something different? Either we fix this or we don't allow the user to pass in a PARALLELISM different from 4.
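One way to enforce the second option is a fail-fast guard at the top of the test script. This is a minimal sketch: `check_parallelism` and `SLOTS_PER_TM` are hypothetical names, and the value 4 mirrors the single 4-slot TaskManager the HA setup starts.

```shell
# Hypothetical guard: reject any PARALLELISM the single 4-slot TM
# cannot satisfy, instead of letting the test hang later.
SLOTS_PER_TM=4
PARALLELISM=${PARALLELISM:-4}

check_parallelism() {
    if [ "${PARALLELISM}" -ne "${SLOTS_PER_TM}" ]; then
        echo "This test currently requires PARALLELISM=${SLOTS_PER_TM} (got ${PARALLELISM})." >&2
        return 1
    fi
}

check_parallelism && echo "parallelism check passed"
```

Failing immediately with a clear message is preferable to the job sitting in SCHEDULED until the test times out.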
    ${FLINK_DIR}/bin/standalone-job.sh start \
        --job-classname org.apache.flink.streaming.tests.DataStreamAllroundTestProgram \
        -p ${PARALLELISM} \
Is this really needed? I think that setting environment.parallelism should already be enough.
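The alternative the comment suggests is to drive parallelism through the configuration file rather than the `-p` flag. A sketch, with a temporary directory standing in for the real FLINK_DIR; note the reviewer names the option `environment.parallelism`, while `parallelism.default` is the standard Flink configuration key for the default job parallelism, so the exact key used here is an assumption.

```shell
# Sketch: configure default parallelism via flink-conf.yaml instead of
# passing -p to standalone-job.sh. A temp dir makes this self-contained.
FLINK_DIR=$(mktemp -d)
mkdir -p "${FLINK_DIR}/conf"
PARALLELISM=4

# parallelism.default sets the default parallelism for submitted jobs.
echo "parallelism.default: ${PARALLELISM}" >> "${FLINK_DIR}/conf/flink-conf.yaml"

grep 'parallelism.default' "${FLINK_DIR}/conf/flink-conf.yaml"
```

With the value in the configuration, every entry point the test starts picks it up and the `-p` argument becomes redundant.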
    # create a new one which will take over
    kill_single 'StandaloneJobClusterEntryPoint'
    # let the job start and take some checkpoints
    sleep 60
Ideally we would wait until the next checkpoint has been successfully taken.
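Waiting for a completed checkpoint instead of a fixed sleep can be done by polling the JobManager's monitoring REST API, whose `/jobs/<jobid>/checkpoints` endpoint reports checkpoint counts. A hedged sketch; the function name, the default port 8081, and the JSON parsing via grep are assumptions rather than existing helpers in the test scripts.

```shell
# Hypothetical helper: poll the REST API until at least one checkpoint
# has completed, or give up after a timeout (in seconds).
wait_for_completed_checkpoint() {
    local job_id=$1 rest_port=${2:-8081} timeout=${3:-120} waited=0
    local completed
    while [ "${waited}" -lt "${timeout}" ]; do
        # The checkpoints endpoint returns a JSON "counts" object that
        # includes a "completed" field; extract its numeric value.
        completed=$(curl -s "http://localhost:${rest_port}/jobs/${job_id}/checkpoints" \
            | grep -o '"completed":[0-9]*' | head -n 1 | cut -d ':' -f 2)
        if [ -n "${completed}" ] && [ "${completed}" -gt 0 ]; then
            return 0
        fi
        sleep 2
        waited=$((waited + 2))
    done
    return 1
}
```

Replacing `sleep 60` with such a poll makes the test both faster on quick runs and more robust on slow CI machines.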
    start_ha_tm_watchdog ${JOB_ID} 1

    # let the job run for a while to take some checkpoints
    sleep 20
Ideally, we would wait until the first checkpoint has been taken.
    # start the watchdog that keeps the number of JMs stable
    start_ha_jm_watchdog 1 "StandaloneJobClusterEntryPoint" run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

    sleep 5
Why this sleep?
It's a leftover from the datastream HA test. As I don't see a reason for it in this test, I have removed it.
        echo "One or more tests FAILED."
        exit $EXIT_CODE
    fi
}
This looks like a duplicate of common_ha.sh#verify_logs.
It is similar, but it performs different checks: it looks for different log messages and expects a different number of occurrences. I extracted the occurrence-counting logic, but I don't think there is much more that can be unified.
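The extracted piece could look roughly like the following. This is a hedged reconstruction, not the actual code of the PR: the helper name `verify_log_occurrences` and its signature are assumptions, and a temporary directory with fake log files keeps the sketch self-contained.

```shell
# Hypothetical shared helper: count how many distinct log files contain
# a pattern and compare against an expected number of occurrences.
verify_log_occurrences() {
    local log_dir=$1 pattern=$2 expected=$3
    local actual
    actual=$(grep -rl "${pattern}" "${log_dir}" | wc -l | tr -d ' ')
    [ "${actual}" -eq "${expected}" ]
}

# Self-contained demonstration with a temporary log directory.
LOG_DIR=$(mktemp -d)
echo "Found 5 checkpoints in ZooKeeper" > "${LOG_DIR}/standalonejob-0.log"
echo "Found 2 checkpoints in ZooKeeper" > "${LOG_DIR}/standalonejob-1.log"

verify_log_occurrences "${LOG_DIR}" "checkpoints in ZooKeeper" 2 && echo "log check passed"
```

Both verify_logs variants could then call this helper with their own patterns and expected counts, while keeping their test-specific checks separate.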
        EXIT_CODE=1
    fi

    if ! [ `grep -r --include '*standalonejob*.log' -P 'Found \d+ checkpoints in ZooKeeper' "${FLINK_DIR}/log/" | cut -d ":" -f 1 | uniq | wc -l` -eq $((JM_FAILURES + 1)) ]; then
The standard grep application on Mac OS does not support -P because it's FreeBSD's grep version. On Mac OS you can pass the -E option to enable extended regex support, which would support \d+. I think we should make it Unix compatible.
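The portability point can be shown concretely. `\d` is a PCRE extension enabled only by GNU grep's `-P` flag; the POSIX character class `[0-9]` works in both extended (`-E`) and basic regular expressions on GNU and BSD grep alike. A self-contained sketch using a temporary log file:

```shell
# Demonstrate portable replacements for: grep -P 'Found \d+ checkpoints ...'
LOG_FILE=$(mktemp)
printf 'Found 3 checkpoints in ZooKeeper\nunrelated line\n' > "${LOG_FILE}"

# Portable ERE: [0-9]+ works with -E on both GNU and BSD grep.
grep -E 'Found [0-9]+ checkpoints in ZooKeeper' "${LOG_FILE}"

# Portable BRE (what the author switched to): BREs have no +, so use
# [0-9][0-9]* to match one or more digits.
grep 'Found [0-9][0-9]* checkpoints in ZooKeeper' "${LOG_FILE}"
```

Either form keeps the check working on macOS CI machines as well as Linux.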
I've switched to the basic regular expressions.
I've tried to address your comments @tillrohrmann. I would appreciate it if you could have another look.
What is the purpose of the change

Add a test that runs the general purpose job with failures in per-job cluster mode.

Does this pull request potentially affect one of the following parts:

- @Public(Evolving): (yes / no)
- Documentation