florianschmidt1994
Contributor

What is the purpose of the change

Fix the "All tests PASS" message being printed even when tests actually fail.

Previously the behaviour was like this:

The cleanup hook (trap cleanup EXIT in common.sh) checks whether there are non-empty .out files or log files containing certain exceptions. If a test fails with a non-zero exit code but produces no exceptions or non-empty .out files, "All tests PASS" is still printed to stdout, even though the test failed.

With this PR, the whole test-runner is restructured so that:

  1. The check for non-empty .out files, errors and exceptions in logs is triggered from the run_test method
  2. The message printed after each test depends on both the exit code of the test script and the result of checking the log files
  3. cleanup is now triggered by the test runner, not by the individual tests anymore
  4. tests that signaled their failure by modifying PASS now do so by exiting with a non-zero exit code
  5. check_result_hash exits with 1 instead of modifying PASS
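
The restructuring described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: check_logs, LOG_DIR, and the exact message wording are assumptions standing in for the real log checks and runner output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a runner that combines the test's exit code with a log check.
# check_logs and LOG_DIR are illustrative stand-ins, not Flink's implementation.

LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

# Returns non-zero if any .out file in LOG_DIR is non-empty.
check_logs() {
    local f
    for f in "$LOG_DIR"/*.out; do
        [[ -s "$f" ]] && return 1
    done
    return 0
}

# run_test <description> <command>
run_test() {
    local description="$1"
    local command="$2"
    local exit_code=0
    local log_result=0

    # Run the test in its own process so its exit code can be captured.
    bash -c "$command" || exit_code=$?
    # Check the logs regardless of how the test script exited.
    check_logs || log_result=$?

    if [[ $exit_code -eq 0 && $log_result -eq 0 ]]; then
        echo "[PASS] '$description' passed. Test exited with exit code $exit_code."
        return 0
    fi
    echo "[FAIL] '$description' failed. Test exited with exit code $exit_code."
    return 1
}
```

With this shape, a test that exits non-zero but leaves clean logs is reported as failed, which is the case the old cleanup hook missed.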

Additionally this PR

  1. Reformats the output a little compared to the previous test runs:
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/temp-test-directory-00N
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/temp-test-directory-00N
flink-end-to-end-test directory: /Users/florianschmidt/dev/flink/flink-end-to-end-tests
Flink distribution directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT

==============================================================================
Running 'Streaming Python Wordcount end-to-end test'
==============================================================================
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-00N
Starting cluster.
Starting standalonesession daemon on host Florians-MBP.fritz.box.
Starting taskexecutor daemon on host Florians-MBP.fritz.box.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 436dfd1f2a81ab4f818fc7fb9c395f0c has finished.
Job Runtime: 7512 ms
pass StreamingPythonWordCount
Stopping taskexecutor daemon (pid: 9877) on host Florians-MBP.fritz.box.
Stopping standalonesession daemon (pid: 9585) on host Florians-MBP.fritz.box.
No zookeeper daemon to stop on host Florians-MBP.fritz.box.

[PASS] 'Streaming Python Wordcount end-to-end test' passed after 0 minutes and 22 seconds! Test exited with exit code 0.


==============================================================================
Running 'Wordcount end-to-end test'
==============================================================================
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-24N

Verifying this change

  • I ran the test scripts manually and checked that they still behave as expected
  • I used the following script as a sample e2e-test to trigger different failure / success behaviours
#!/usr/bin/env bash
source "$(dirname "$0")"/common.sh

# each of those can be used to cause a test to fail

# echo "This should cause the test to fail" > $FLINK_DIR/log/test.out
# check_result_hash "asf" "$FLINK_DIR/log/"
# exit 1

function test_cleanup {
    echo "Something"

    # Uncomment to see test fail in cleanup
    # exit 2
}

trap test_cleanup EXIT
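
As a side note on the "exit 2" line in the sample script: in bash, calling exit inside an EXIT trap replaces the script's exit status, which is why uncommenting it makes the test fail during cleanup. A small standalone sketch (the variable names are made up for illustration):

```shell
#!/usr/bin/env bash
# Sketch: how an `exit` inside an EXIT trap changes a script's exit status.

# The body exits with 0, but the trap calls `exit 2`, so the status becomes 2.
status_with_exit_in_trap=0
bash -c 'trap "exit 2" EXIT; exit 0' || status_with_exit_in_trap=$?

# The trap only prints, so the original status 1 is preserved.
status_without_exit_in_trap=0
bash -c 'trap "echo cleanup > /dev/null" EXIT; exit 1' || status_without_exit_in_trap=$?

echo "exit in trap:    ${status_with_exit_in_trap}"
echo "no exit in trap: ${status_without_exit_in_trap}"
```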

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@florianschmidt1994 florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from 8f404f9 to 5e2b8f5 Compare May 22, 2018 12:51
@florianschmidt1994
Contributor Author

@zentol Maybe you could have a look at this?

@zentol
Contributor

zentol commented May 22, 2018

I'll take a look tomorrow.

Contributor

@tzulitai tzulitai left a comment

Thanks for working on this @florianschmidt1994.
I've left some comments. Would be great if @zentol also takes a look.

exit 1
fi

source "$(dirname "$0")"/test-scripts/common.sh
Contributor

I think this can now be removed?


check_logs_for_errors
check_logs_for_exceptions
check_logs_for_non_empty_out_files
Contributor

Maybe we should move all these methods:

start_timer
end_timer
check_logs_for_errors
check_logs_for_exceptions
check_logs_for_non_empty_out_files

to test-runner-common.sh since that's the only place they are used anyways

Contributor Author

We have been discussing changing the semantics at some point so that each individual test case checks the logs for errors itself, dropping that from the test runner, maybe even with a whitelist / blacklist approach for expected exceptions. If we want to go that way, I'd say leave them in common.sh.
We could also say we're probably gonna stick with the current approach for a while; then I'd say let's move them to test-runner-common.sh.
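
For illustration only, the whitelist idea mentioned above could look something like the sketch below. Nothing here is from the PR: the function name, its arguments, and the "Exception" search string are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: fail on "Exception" lines unless they match an expected pattern.
# The function name, arguments, and patterns are illustrative only.

# check_logs_with_allowlist <log_dir> [allowed_pattern...]
check_logs_with_allowlist() {
    local log_dir="$1"
    shift
    local matches pattern
    # Collect all lines mentioning an exception; an empty result means clean logs.
    matches="$(grep -rh "Exception" "$log_dir" 2> /dev/null || true)"
    # Drop every line that matches one of the expected patterns.
    for pattern in "$@"; do
        matches="$(grep -v -e "$pattern" <<< "$matches" || true)"
    done
    if [[ -n "$matches" ]]; then
        echo "Found unexpected exceptions:"
        echo "$matches"
        return 1
    fi
    return 0
}
```

A test case could then call it with the patterns it expects, for example `check_logs_with_allowlist "$FLINK_DIR/log" "SomeExpectedException"`, where the pattern is a placeholder.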

kill ${watchdog_pid} 2> /dev/null
wait ${watchdog_pid} 2> /dev/null
#
cleanup
Contributor

The test_local_recovery_and_scheduling test currently bundles several executions of the test (e.g. with different state backend configurations) in a single run of the test script. That's why it required this cleanup within the test itself.

How would the change of this PR affect this?
In general, should we also restructure the e2e tests so that each execution configuration variant is executed with the test-runner-cleanup#run_test method (instead of cleaning up itself in-between executions)?

AFAIK, only the test_local_recovery_and_scheduling does this at the moment.

Contributor Author

@florianschmidt1994 florianschmidt1994 Jun 13, 2018

This should not be a concern anymore with the new changes, where each configuration of test_local_recovery_and_scheduling is its own test case?

run_test "Streaming SQL end-to-end test" "$END_TO_END_DIR/test-scripts/test_streaming_sql.sh"
run_test "Streaming bucketing end-to-end test" "$END_TO_END_DIR/test-scripts/test_streaming_bucketing.sh"
run_test "Stateful stream job upgrade end-to-end test" "$END_TO_END_DIR/test-scripts/test_stateful_stream_job_upgrade.sh 2 4"
run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh"
Contributor

This test currently performs log verifications and cleanups within a single execution of the test script, since it specifies multiple executions with different state backend configurations.

Should we break this up, so that each configuration variant is explicitly executed by the run_test method (like what we currently do with the savepoint / externalized checkpoint tests)?

Contributor

+1 to having each execution as a separate test

Contributor

@zentol zentol left a comment

The PR introduces a new way to track failure states (test_has_errors variable initialized as 1, check for empty string). I'd prefer sticking to the existing pattern (EXIT_CODE variable initialized as 0, check for inequality to 0) for consistency.
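
A minimal sketch of the EXIT_CODE pattern referred to here, assuming a global variable that starts at 0 and is set non-zero by any failing check. The two helper functions are hypothetical and only serve to show the convention.

```shell
#!/usr/bin/env bash
# Sketch of the EXIT_CODE convention: start at 0, set non-zero on any failure,
# compare against 0 at the end. Both helpers are hypothetical stand-ins.

EXIT_CODE=0

# Marks the run as failed if the given file exists and is non-empty.
check_out_file_is_empty() {
    if [[ -s "$1" ]]; then
        echo "Non-empty .out file: $1"
        EXIT_CODE=1
    fi
}

# Prints the overall result based on EXIT_CODE.
report_result() {
    if [[ ${EXIT_CODE} != 0 ]]; then
        echo "[FAIL]"
    else
        echo "[PASS]"
    fi
}
```

One caveat with this style: a check must run in the current shell for its assignment to EXIT_CODE to stick, so it should not be wrapped in a command substitution or a pipeline.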

@florianschmidt1994
Contributor Author

Thanks @zentol and @tzulitai for the review. I addressed your concerns in the latest couple of commits.

@florianschmidt1994 florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from f7f6471 to 4b19431 Compare June 13, 2018 12:33
Contributor

@zentol zentol left a comment

Found 1 small thing. Could you rebase the PR (feel free to squash everything beforehand)?

fi

source "$(dirname "$0")"/test-scripts/common.sh
source "$(dirname "$0")"/test-scripts/test-runner-common.sh
Contributor

use END_TO_END_DIR instead

@zentol
Contributor

zentol commented Jun 13, 2018

Oh, you already rebased it in the meantime. Neat.

run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh 4 10 rocks false true"
run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh 4 10 rocks true true"

run_test "Quickstarts nightly end-to-end test" "$END_TO_END_DIR/test-scripts/test_quickstarts.sh"
Contributor

@zentol zentol Jun 13, 2018

this shouldn't be here, subsumed by the java/scala quickstart calls

incremental checkpoints: ${incremental}
kill JVM: ${kill_jvm}"

TEST_PROGRAM_JAR=$TEST_INFRA_DIR/../../flink-end-to-end-tests/flink-local-recovery-and-allocation-test/target/StickyAllocationAndLocalRecoveryTestJob.jar
Contributor

use END_TO_END_DIR instead

Contributor Author

Only works if we export END_TO_END_DIR in run_nightly/precommit_tests.sh, but I'll just add that there as well

Contributor

well we already did export it there, but you removed it :P
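
The point about exporting can be shown with a small standalone sketch. A test script started as a separate process only sees END_TO_END_DIR if the runner exports it; the path used below is a placeholder, not the real repository layout.

```shell
#!/usr/bin/env bash
# Sketch: a variable is only visible to child scripts once it is exported.
# The path below is a placeholder, not the real repository layout.

unset END_TO_END_DIR  # start from a clean, non-exported state
END_TO_END_DIR="/tmp/flink-end-to-end-tests"

# Without export, a child process sees an empty value.
bash -c 'echo "child sees: [${END_TO_END_DIR:-}]"'

export END_TO_END_DIR

# After export, the same child process sees the value.
bash -c 'echo "child sees: [${END_TO_END_DIR:-}]"'
```

The first call prints an empty pair of brackets and the second prints the path.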

cleanup

if [[ ${exit_code} == 0 ]]; then
if [[ ! "$PASS" ]]; then
Contributor

I thought we aren't using PASS anymore?


cleanup

if [[ ${exit_code} == 0 ]]; then
Contributor

Am I mistaken, or didn't you already change this to EXIT_CODE? Did something maybe go wrong during the rebase?

Contributor Author

@florianschmidt1994 florianschmidt1994 Jun 13, 2018

Oh damn. Something might have gone wrong here...
I'll look into it

Contributor Author

Oh, that is still in there because I left check_logs_for_non_empty_out_files etc. untouched, and those still use PASS to signal whether or not a test case should fail.
I'll go ahead and change this to our new convention as well.
At least nothing went wrong during the rebase 😅

@florianschmidt1994 florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from b62e14c to a114667 Compare June 13, 2018 14:04
@zentol
Contributor

zentol commented Jun 13, 2018

Looks good to me, let's see what travis says.

@florianschmidt1994
Contributor Author

@zentol Looks like travis likes it 🙂

@tzulitai
Contributor

==============================================================================
Running 'Streaming Python Wordcount end-to-end test'
==============================================================================
Flink dist directory: /home/travis/build/apache/flink/build-target
TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-10412258255
Starting cluster.
Starting standalonesession daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Starting taskexecutor daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 06184a085272dd12b3573b1bcb96badc has finished.
Job Runtime: 6103 ms
pass StreamingPythonWordCount
Stopping taskexecutor daemon (pid: 31303) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Stopping standalonesession daemon (pid: 30988) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.

[PASS] 'Streaming Python Wordcount end-to-end test' passed after 0 minutes and 24 seconds! Test exited with exit code 0.


==============================================================================
Running 'Wordcount end-to-end test'
==============================================================================
Flink dist directory: /home/travis/build/apache/flink/build-target
TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-36174383269
Starting cluster.
Starting standalonesession daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Starting taskexecutor daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 30256ad7ff23ea8543ddca76bacaaee5 has finished.
Job Runtime: 1352 ms
pass WordCount
Stopping taskexecutor daemon (pid: 835) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Stopping standalonesession daemon (pid: 517) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.

[PASS] 'Wordcount end-to-end test' passed after 0 minutes and 11 seconds! Test exited with exit code 0.

The Travis log excerpt looks good. The follow-up commits look good.
+1, LGTM on my side.

@zentol
Contributor

zentol commented Jun 14, 2018

merging.

zentol pushed a commit to zentol/flink that referenced this pull request Jun 14, 2018
@asfgit asfgit closed this in 45ac85e Jun 14, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 15, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 15, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018
zentol added a commit to zentol/flink that referenced this pull request Jun 20, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018