[FLINK-10632][e2e] Running general purpose testing job with failure in per-job mode #6943
Conversation
Thanks for your contribution @dawidwys. All in all it looks good. However, in its current form the test is not executable on a Mac. Moreover, we should add the test to run-nightly-tests.sh. After addressing these problems, I think we can merge this E2E test.
    # kill the cluster and zookeeper
    stop_watchdogs
    shutdown_all
I think shutdown_all will be called by the test runner in test-runner-common.sh.
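The point about the runner owning cleanup can be sketched with a trap handler. This is a hedged illustration, not the actual contents of test-runner-common.sh: `shutdown_all` is the real helper name from the scripts under review, while `run_test_body` is a hypothetical stand-in for the runner's entry point.

```shell
# Hypothetical sketch: the runner installs shutdown_all as an EXIT trap,
# so individual tests never need to call it themselves.

shutdown_all() {
    # In the real scripts this stops cluster processes and ZooKeeper.
    echo "shutting down all cluster components"
}

run_test_body() {
    # The trap fires on normal exit as well as on failure, guaranteeing
    # cleanup regardless of how the test body terminates.
    trap shutdown_all EXIT
    "$@"
}

run_test_body echo "test body ran"
```

Because the trap fires unconditionally, a test script that also calls shutdown_all explicitly would just shut the cluster down twice.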
    # kill the cluster and zookeeper
    stop_watchdogs
    shutdown_all
Same here.
    stop_watchdogs
    kill_all 'StandaloneJobClusterEntryPoint'
    shutdown_all
No need to call this, since the test runner will call it.
    # submit a job in detached mode and let it run
    run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

    start_taskmanagers 1
This only works because in HA mode we start a TM with 4 slots. But what happens if PARALLELISM is something different? Either we fix this or we don't allow the user to pass in a PARALLELISM different from 4.
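One way to enforce the second option is a fail-fast guard at the top of the test script. This is a minimal sketch: `check_parallelism` and `SLOTS_PER_TM` are hypothetical names, and the value 4 mirrors the single 4-slot TaskManager the HA setup starts.

```shell
# Hypothetical guard: reject any PARALLELISM the single 4-slot TM
# cannot satisfy, instead of letting the test hang later.
SLOTS_PER_TM=4
PARALLELISM=${PARALLELISM:-4}

check_parallelism() {
    if [ "${PARALLELISM}" -ne "${SLOTS_PER_TM}" ]; then
        echo "This test currently requires PARALLELISM=${SLOTS_PER_TM} (got ${PARALLELISM})." >&2
        return 1
    fi
}

check_parallelism && echo "parallelism check passed"
```

Failing immediately with a clear message is preferable to the job sitting in SCHEDULED until the test times out.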
    ${FLINK_DIR}/bin/standalone-job.sh start \
        --job-classname org.apache.flink.streaming.tests.DataStreamAllroundTestProgram \
        -p ${PARALLELISM} \
Is this really needed? I think that setting environment.parallelism should already be enough.
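The alternative the comment suggests is to drive parallelism through the configuration file rather than the `-p` flag. A sketch, with a temporary directory standing in for the real FLINK_DIR; note the reviewer names the option `environment.parallelism`, while `parallelism.default` is the standard Flink configuration key for the default job parallelism, so the exact key used here is an assumption.

```shell
# Sketch: configure default parallelism via flink-conf.yaml instead of
# passing -p to standalone-job.sh. A temp dir makes this self-contained.
FLINK_DIR=$(mktemp -d)
mkdir -p "${FLINK_DIR}/conf"
PARALLELISM=4

# parallelism.default sets the default parallelism for submitted jobs.
echo "parallelism.default: ${PARALLELISM}" >> "${FLINK_DIR}/conf/flink-conf.yaml"

grep 'parallelism.default' "${FLINK_DIR}/conf/flink-conf.yaml"
```

With the value in the configuration, every entry point the test starts picks it up and the `-p` argument becomes redundant.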
    # create a new one which will take over
    kill_single 'StandaloneJobClusterEntryPoint'
    # let the job start and take some checkpoints
    sleep 60
Ideally we would wait until the next checkpoint has been successfully taken.
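Waiting for a completed checkpoint instead of a fixed sleep can be done by polling the JobManager's monitoring REST API, whose `/jobs/<jobid>/checkpoints` endpoint reports checkpoint counts. A hedged sketch; the function name, the default port 8081, and the JSON parsing via grep are assumptions rather than existing helpers in the test scripts.

```shell
# Hypothetical helper: poll the REST API until at least one checkpoint
# has completed, or give up after a timeout (in seconds).
wait_for_completed_checkpoint() {
    local job_id=$1 rest_port=${2:-8081} timeout=${3:-120} waited=0
    local completed
    while [ "${waited}" -lt "${timeout}" ]; do
        # The checkpoints endpoint returns a JSON "counts" object that
        # includes a "completed" field; extract its numeric value.
        completed=$(curl -s "http://localhost:${rest_port}/jobs/${job_id}/checkpoints" \
            | grep -o '"completed":[0-9]*' | head -n 1 | cut -d ':' -f 2)
        if [ -n "${completed}" ] && [ "${completed}" -gt 0 ]; then
            return 0
        fi
        sleep 2
        waited=$((waited + 2))
    done
    return 1
}
```

Replacing `sleep 60` with such a poll makes the test both faster on quick runs and more robust on slow CI machines.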
    start_ha_tm_watchdog ${JOB_ID} 1

    # let the job run for a while to take some checkpoints
    sleep 20
Ideally, we would wait until the first checkpoint has been taken.
    # start the watchdog that keeps the number of JMs stable
    start_ha_jm_watchdog 1 "StandaloneJobClusterEntryPoint" run_job ${PARALLELISM} ${BACKEND} ${ASYNC} ${INCREM}

    sleep 5
Why this sleep?
It's a leftover from the datastream HA test. As I don't see a reason for it in this test, I have removed it.
        echo "One or more tests FAILED."
        exit $EXIT_CODE
    fi
}
This looks like a duplicate of common_ha.sh#verify_logs.
It is similar, but it performs different checks: it looks for different log messages and expects a different number of occurrences. I extracted the occurrence-counting logic, but I don't think there is much more that can be unified.
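The extracted piece could look roughly like the following. This is a hedged reconstruction, not the actual code of the PR: the helper name `verify_log_occurrences` and its signature are assumptions, and a temporary directory with fake log files keeps the sketch self-contained.

```shell
# Hypothetical shared helper: count how many distinct log files contain
# a pattern and compare against an expected number of occurrences.
verify_log_occurrences() {
    local log_dir=$1 pattern=$2 expected=$3
    local actual
    actual=$(grep -rl "${pattern}" "${log_dir}" | wc -l | tr -d ' ')
    [ "${actual}" -eq "${expected}" ]
}

# Self-contained demonstration with a temporary log directory.
LOG_DIR=$(mktemp -d)
echo "Found 5 checkpoints in ZooKeeper" > "${LOG_DIR}/standalonejob-0.log"
echo "Found 2 checkpoints in ZooKeeper" > "${LOG_DIR}/standalonejob-1.log"

verify_log_occurrences "${LOG_DIR}" "checkpoints in ZooKeeper" 2 && echo "log check passed"
```

Both verify_logs variants could then call this helper with their own patterns and expected counts, while keeping their test-specific checks separate.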
        EXIT_CODE=1
    fi

    if ! [ `grep -r --include '*standalonejob*.log' -P 'Found \d+ checkpoints in ZooKeeper' "${FLINK_DIR}/log/" | cut -d ":" -f 1 | uniq | wc -l` -eq $((JM_FAILURES + 1)) ]; then
The standard grep application on Mac OS does not support -P because it's FreeBSD's grep version. On Mac OS you can pass the -E option to enable extended regex support, which would support \d+. I think we should make it Unix compatible.
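The portability point can be shown concretely. `\d` is a PCRE extension enabled only by GNU grep's `-P` flag; the POSIX character class `[0-9]` works in both extended (`-E`) and basic regular expressions on GNU and BSD grep alike. A self-contained sketch using a temporary log file:

```shell
# Demonstrate portable replacements for: grep -P 'Found \d+ checkpoints ...'
LOG_FILE=$(mktemp)
printf 'Found 3 checkpoints in ZooKeeper\nunrelated line\n' > "${LOG_FILE}"

# Portable ERE: [0-9]+ works with -E on both GNU and BSD grep.
grep -E 'Found [0-9]+ checkpoints in ZooKeeper' "${LOG_FILE}"

# Portable BRE (what the author switched to): BREs have no +, so use
# [0-9][0-9]* to match one or more digits.
grep 'Found [0-9][0-9]* checkpoints in ZooKeeper' "${LOG_FILE}"
```

Either form keeps the check working on macOS CI machines as well as Linux.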
I've switched to the basic regular expressions.
I've tried to address your comments @tillrohrmann. I would appreciate it if you could have another look.
What is the purpose of the change

Add a test that runs the general purpose job with failures in per-job cluster mode.

Does this pull request potentially affect one of the following parts:

- @Public(Evolving): (yes / no)
- Documentation