Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-10632][e2e] Running general purpose testing job with failure in per-job mode #6943

Merged
merged 1 commit into from
Nov 2, 2018

Conversation

dawidwys
Copy link
Contributor

What is the purpose of the change

Add test that runs general purpose job with failure in per-job cluster mode.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@tillrohrmann tillrohrmann self-assigned this Oct 30, 2018
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @dawidwys. All in all it looks good. However, in its current form the test is not executable on a Mac. Moreover, we should add the test to run-nightly-tests.sh. After addressing these problems, I think we can merge this E2E test.


# kill the cluster and zookeeper
stop_watchdogs
shutdown_all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think shutdown_all will be called by the test runner in test-runner-common.sh


# kill the cluster and zookeeper
stop_watchdogs
shutdown_all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.


stop_watchdogs
kill_all 'StandaloneJobClusterEntryPoint'
shutdown_all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to call this, since the test runner will call it.

# submit a job in detached mode and let it run
run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

start_taskmanagers 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works because in ha mode we start a TM with 4 slots. But what happens if PARALLELISM is something different? Either we fix this or we don't allow the user to pass in a different PARALLELISM than 4.


${FLINK_DIR}/bin/standalone-job.sh start \
--job-classname org.apache.flink.streaming.tests.DataStreamAllroundTestProgram \
-p ${PARALLELISM} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? I think that setting environment.parallelism should already be enough.

# create a new one which will take over
kill_single 'StandaloneJobClusterEntryPoint'
# let the job start and take some checkpoints
sleep 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would wait until the next checkpoint has been successfully taken.

start_ha_tm_watchdog ${JOB_ID} 1

# let the job run for a while to take some checkpoints
sleep 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we would wait until the first checkpoint has been taken.

# start the watchdog that keeps the number of JMs stable
start_ha_jm_watchdog 1 "StandaloneJobClusterEntryPoint" run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

sleep 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's leftover from the datastream test ha. But as I don't see a reason for it in this test, I have removed it.

echo "One or more tests FAILED."
exit $EXIT_CODE
fi
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate of common_ha.sh#verify_logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is similar, but it performs different checks. Looks for different texts and number of them. I extracted the comparing number of occurrence logic but don't think there is much more that can be unified.

EXIT_CODE=1
fi

if ! [ `grep -r --include '*standalonejob*.log' -P 'Found \d+ checkpoints in ZooKeeper' "${FLINK_DIR}/log/" | cut -d ":" -f 1 | uniq | wc -l` -eq $((JM_FAILURES + 1)) ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard grep application on Mac OS does not support -P because it's FreeBSD's grep version. On Mac OS you can pass the -E option to enable extended regex support which would support \d+. I think we should make it Unix compatible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched to the basic regular expressions.

@dawidwys
Copy link
Contributor Author

I've tried to address your comments @tillrohrmann. I would appreciate if you have had another look.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments @dawidwys. I've tested the e2e test and it passes now on Mac OS. +1 for merging from my side after we have merged #6959 because without disabling the log verification the test fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants