[FLINK-9257][E2E Tests] Fix wrong "All tests pass" message #6053

florianschmidt1994 · 2018-05-22T12:28:39Z

What is the purpose of the change

Fix a wrongly printed "All tests PASS" message that appears even when tests actually fail.

Previously the behaviour was like this:

During the cleanup hook (trap cleanup EXIT in common.sh), the script checks whether there are non-empty .out files or log files containing certain exceptions. If a test fails with a non-zero exit code but produces no exceptions or non-empty .out files, "All tests PASS" is still printed to stdout, even though the tests did not pass.

With this PR the whole test-runner is restructured so that

The check for non-empty .out files, errors and exceptions in logs is triggered from the run_test method
The error message after each test is dependent on both the exit code of the test script and the result of checking the log files
cleanup is now triggered by the test runner, not by the individual tests anymore
tests that signaled their failure by modifying PASS now do so by exiting with non-zero exit code
check_result_hash exits with 1 instead of modifying PASS
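
The restructuring described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `run_test_sketch` and `logs_are_clean` are hypothetical names, and the log check is reduced to a single grep for "Exception". The point is only that the verdict depends on both the test's exit code and the log check.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the PR's log checks: succeeds only if no file
# under the given directory mentions an exception.
function logs_are_clean {
    local log_dir=$1
    ! grep -rq "Exception" "$log_dir" 2>/dev/null
}

# Runs one test command and prints a verdict based on BOTH the command's
# exit code and the log check. Returns 0 only if both are fine.
function run_test_sketch {
    local description=$1
    local command=$2
    local log_dir=$3

    # Unquoted on purpose so that "script.sh arg1 arg2" splits into words.
    # The subshell keeps an `exit` inside the test from ending the runner.
    ( ${command} )
    local exit_code=$?

    local logs_ok=true
    logs_are_clean "$log_dir" || logs_ok=false

    if [[ ${exit_code} == 0 && ${logs_ok} == true ]]; then
        echo "[PASS] '${description}' passed. Test exited with exit code 0."
        return 0
    elif [[ ${exit_code} == 0 ]]; then
        echo "[FAIL] '${description}' failed. Test exited with exit code 0 but the logs contained errors."
    else
        echo "[FAIL] '${description}' failed. Test exited with exit code ${exit_code}."
    fi
    return 1
}
```

With this shape, a test that exits non-zero without leaving anything in the logs is still reported as failed, which is the case the old cleanup hook missed.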

Additionally this PR

Reformats the output a little compared to previous tests

Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/temp-test-directory-00N
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/temp-test-directory-00N
flink-end-to-end-test directory: /Users/florianschmidt/dev/flink/flink-end-to-end-tests
Flink distribution directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT

==============================================================================
Running 'Streaming Python Wordcount end-to-end test'
==============================================================================
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-00N
Starting cluster.
Starting standalonesession daemon on host Florians-MBP.fritz.box.
Starting taskexecutor daemon on host Florians-MBP.fritz.box.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 436dfd1f2a81ab4f818fc7fb9c395f0c has finished.
Job Runtime: 7512 ms
pass StreamingPythonWordCount
Stopping taskexecutor daemon (pid: 9877) on host Florians-MBP.fritz.box.
Stopping standalonesession daemon (pid: 9585) on host Florians-MBP.fritz.box.
No zookeeper daemon to stop on host Florians-MBP.fritz.box.

[PASS] 'Streaming Python Wordcount end-to-end test' passed after 0 minutes and 22 seconds! Test exited with exit code 0.


==============================================================================
Running 'Wordcount end-to-end test'
==============================================================================
Flink dist directory: /Users/florianschmidt/dev/flink/flink-dist/target/flink-1.6-SNAPSHOT-bin/flink-1.6-SNAPSHOT
TEST_DATA_DIR: /Users/florianschmidt/dev/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-24N

Verifying this change

I ran the test scripts manually and checked that they still behave as expected
I used the following script as a sample e2e-test to trigger different failure / success behaviours

#!/usr/bin/env bash
source "$(dirname "$0")"/common.sh

# each of those can be used to cause a test to fail

# echo "This should cause the test to fail" > $FLINK_DIR/log/test.out
# check_result_hash "asf" "$FLINK_DIR/log/"
# exit 1

function test_cleanup {
    echo "Something"

    # Uncomment to see test fail in cleanup
    # exit 2
}

trap test_cleanup EXIT
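
The commented-out `exit 2` in the cleanup function relies on a bash behaviour: calling `exit` with an explicit status inside an EXIT trap replaces the script's exit status. A small self-contained demonstration (the function name `exit_code_with_failing_cleanup` is made up for this example):

```shell
#!/usr/bin/env bash

# Runs a tiny script in a child bash whose body exits with 0 but whose
# EXIT trap exits with the given status, then prints the child's exit code.
function exit_code_with_failing_cleanup {
    local cleanup_status=$1
    bash -c "trap 'exit ${cleanup_status}' EXIT; exit 0"
    echo $?
}

exit_code_with_failing_cleanup 2
```

A trap that does not call `exit` itself leaves the original exit status untouched, so a cleanup function only turns a passing test into a failing one when it exits explicitly.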

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

florianschmidt1994 · 2018-05-22T12:52:02Z

@zentol Maybe you could have a look at this?

zentol · 2018-05-22T13:03:00Z

I'll take a look tomorrow.

tzulitai

Thanks for working on this @florianschmidt1994.
I've left some comments. Would be great if @zentol also takes a look.

tzulitai · 2018-05-23T05:15:25Z

flink-end-to-end-tests/run-pre-commit-tests.sh

    exit 1
 fi

 source "$(dirname "$0")"/test-scripts/common.sh


I think this can now be removed?

tzulitai · 2018-05-23T05:19:34Z

flink-end-to-end-tests/test-scripts/test-runner-common.sh

+
+    check_logs_for_errors
+    check_logs_for_exceptions
+    check_logs_for_non_empty_out_files


Maybe we should move all these methods:

start_timer end_timer check_logs_for_errors check_logs_for_exceptions check_logs_for_non_empty_out_files

to test-runner-common.sh since that's the only place they are used anyways

We have been discussing changing the semantics at some point so that each individual test case checks the logs for errors itself and the check is dropped from the test runner, maybe even with a whitelist / blacklist approach for expected exceptions. If we want to go that way, I'd say leave it in common.sh.
We could also say we're probably gonna stick with the current approach for a while, then I'd say let's move them to test-runner-common.sh
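
The whitelist idea mentioned above is not part of this PR. Purely as an illustration of what a per-test check could look like, here is a sketch under assumptions: the function name `check_logs_for_unexpected_exceptions` and its argument layout (log directory first, then allowed patterns) are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of the PR: succeeds only if every log line
# that mentions "Exception" also matches one of the allowed patterns.
function check_logs_for_unexpected_exceptions {
    local log_dir=$1
    shift

    # Turn each allowed pattern into a "-e <pattern>" pair for grep.
    local grep_args=()
    local pattern
    for pattern in "$@"; do
        grep_args+=(-e "$pattern")
    done

    local findings
    findings=$(grep -rh "Exception" "$log_dir" 2>/dev/null)
    if [[ ${#grep_args[@]} -gt 0 ]]; then
        # Drop every line that matches at least one allowed pattern.
        findings=$(echo "$findings" | grep -v "${grep_args[@]}")
    fi

    if [[ -n "$findings" ]]; then
        echo "Found unexpected exceptions in the log files:"
        echo "$findings"
        return 1
    fi
    return 0
}
```

A test that expects a particular failure could then call something like `check_logs_for_unexpected_exceptions "$FLINK_DIR/log" "ExpectedException"` and exit non-zero if the check fails.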

tzulitai · 2018-05-23T05:26:11Z

flink-end-to-end-tests/test-scripts/test_local_recovery_and_scheduling.sh

    kill ${watchdog_pid} 2> /dev/null
    wait ${watchdog_pid} 2> /dev/null
-    #
-    cleanup


The test_local_recovery_and_scheduling test currently bundles several executions of the test (e.g. with different state backend configurations) in a single run of the test script. That's why it required this cleanup within the test itself.

How would the change of this PR affect this?
In general, should we also restructure e2e tests so that each execution configuration variant should be executed with the test-runner-cleanup#run_test method (instead of cleaning up itself in-between executions)?

AFAIK, only the test_local_recovery_and_scheduling does this at the moment.

This should not be a concern anymore with the new change, where each configuration of test_local_recovery_and_scheduling is its own test case?

tzulitai · 2018-05-23T05:27:55Z

flink-end-to-end-tests/run-nightly-tests.sh

+run_test "Streaming SQL end-to-end test" "$END_TO_END_DIR/test-scripts/test_streaming_sql.sh"
+run_test "Streaming bucketing end-to-end test" "$END_TO_END_DIR/test-scripts/test_streaming_bucketing.sh"
+run_test "Stateful stream job upgrade end-to-end test" "$END_TO_END_DIR/test-scripts/test_stateful_stream_job_upgrade.sh 2 4"
+run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh"


This test currently performs log verifications and cleanups within a single execution of the test script, since it specifies multiple executions with different state backend configurations.

Should we break this up, so that each configuration variant is explicitly executed by the run_test method (like what we currently do with the savepoint / externalized checkpoint tests)?

+1 to having each execution as a separate test

zentol

The PR introduces a new way to track failure states (test_has_errors variable initialized as 1, check for empty string), I'd prefer sticking to the existing pattern (EXIT_CODE variable initialized as 0, check for inequality to 0) for consistency.
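
A minimal sketch of the convention preferred here: a variable that starts at 0, is set to a non-zero value by any failing check, and is compared against 0 at the end. Only the `EXIT_CODE` name comes from the discussion; the helper names `check_for_non_empty_out_files` and `report` are assumptions made for this example.

```shell
#!/usr/bin/env bash

# Starts at 0; any failing check sets it to a non-zero value.
EXIT_CODE=0

# Hypothetical check: flags the run as failed if any .out file is non-empty.
function check_for_non_empty_out_files {
    local log_dir=$1
    if [[ -n $(find "$log_dir" -name '*.out' -type f -size +0c 2>/dev/null) ]]; then
        echo "Found non-empty .out files."
        EXIT_CODE=1
    fi
}

# Prints the verdict by testing EXIT_CODE for inequality to 0.
function report {
    if [[ ${EXIT_CODE} != 0 ]]; then
        echo "[FAIL]"
    else
        echo "[PASS]"
    fi
}
```

Because the flag mirrors a process exit code, the same `!= 0` comparison works for both the test script's own exit status and the result of the log checks.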

florianschmidt1994 · 2018-06-13T12:09:15Z

Thanks @zentol and @tzulitai for the review. I addressed your concerns in the latest couple of commits.

zentol

Found 1 small thing. Could you rebase the PR (feel free to squash everything beforehand)?

zentol · 2018-06-13T12:25:47Z

flink-end-to-end-tests/run-nightly-tests.sh

 fi

-source "$(dirname "$0")"/test-scripts/common.sh
+source "$(dirname "$0")"/test-scripts/test-runner-common.sh


use END_TO_END_DIR instead

zentol · 2018-06-13T13:00:10Z

Oh you already rebased it in the mean-time. neat.

zentol · 2018-06-13T13:00:51Z

flink-end-to-end-tests/run-nightly-tests.sh

+run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh 4 10 rocks false true"
+run_test "Local recovery and sticky scheduling end-to-end test" "$END_TO_END_DIR/test-scripts/test_local_recovery_and_scheduling.sh 4 10 rocks true true"
+
+run_test "Quickstarts nightly end-to-end test" "$END_TO_END_DIR/test-scripts/test_quickstarts.sh"


this shouldn't be here, subsumed by the java/scala quickstart calls

zentol · 2018-06-13T13:02:19Z

flink-end-to-end-tests/test-scripts/test_local_recovery_and_scheduling.sh

+        incremental checkpoints: ${incremental}
+        kill JVM: ${kill_jvm}"
+
+    TEST_PROGRAM_JAR=$TEST_INFRA_DIR/../../flink-end-to-end-tests/flink-local-recovery-and-allocation-test/target/StickyAllocationAndLocalRecoveryTestJob.jar


use END_TO_END_DIR instead

Only works if we export END_TO_END_DIR in run_nightly/precommit_tests.sh, but I'll just add that there as well

well we already did export it there, but you removed it :P

zentol · 2018-06-13T13:26:02Z

flink-end-to-end-tests/test-scripts/test-runner-common.sh

+    cleanup
+
+    if [[ ${exit_code} == 0 ]]; then
+        if [[ ! "$PASS" ]]; then


I thought we aren't using PASS anymore?

zentol · 2018-06-13T13:26:37Z

flink-end-to-end-tests/test-scripts/test-runner-common.sh

+
+    cleanup
+
+    if [[ ${exit_code} == 0 ]]; then


Am I mistaken or didn't you already change this to EXIT_CODE? Did maybe something go wrong during the rebase?

Oh damn. Something might have gone wrong here...
I'll look into it

Oh, that is still in there because I left check_logs_for_non_empty_out_files etc. untouched, and they still use PASS to signal whether or not a test case should fail.
I'll go ahead and change this to our new convention as well.
At least nothing went wrong during the rebase 😅

zentol · 2018-06-13T15:45:58Z

Looks good to me, let's see what travis says.

florianschmidt1994 · 2018-06-14T11:25:07Z

@zentol Looks like travis likes it 🙂

tzulitai · 2018-06-14T11:35:53Z

==============================================================================
Running 'Streaming Python Wordcount end-to-end test'
==============================================================================
Flink dist directory: /home/travis/build/apache/flink/build-target
TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-10412258255
Starting cluster.
Starting standalonesession daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Starting taskexecutor daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 06184a085272dd12b3573b1bcb96badc has finished.
Job Runtime: 6103 ms
pass StreamingPythonWordCount
Stopping taskexecutor daemon (pid: 31303) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Stopping standalonesession daemon (pid: 30988) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.

[PASS] 'Streaming Python Wordcount end-to-end test' passed after 0 minutes and 24 seconds! Test exited with exit code 0.


==============================================================================
Running 'Wordcount end-to-end test'
==============================================================================
Flink dist directory: /home/travis/build/apache/flink/build-target
TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-36174383269
Starting cluster.
Starting standalonesession daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Starting taskexecutor daemon on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program
Program execution finished
Job with JobID 30256ad7ff23ea8543ddca76bacaaee5 has finished.
Job Runtime: 1352 ms
pass WordCount
Stopping taskexecutor daemon (pid: 835) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.
Stopping standalonesession daemon (pid: 517) on host travis-job-363a754a-fe5f-4873-bbe8-fe4064b95bc8.

[PASS] 'Wordcount end-to-end test' passed after 0 minutes and 11 seconds! Test exited with exit code 0.

The Travis log excerpt looks good. The follow-up commits look good.
+1, LGTM on my side.

zentol · 2018-06-14T17:30:32Z

merging.

This closes apache#6053.

florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from 8f404f9 to 5e2b8f5 Compare May 22, 2018 12:51

tzulitai reviewed May 23, 2018


zentol requested changes May 23, 2018


florianschmidt1994 added 15 commits June 13, 2018 14:10

Move error/exception/out checking to functions

982ae34

Use changed test runner for nightly tests

8590f43

Use new test runner for pre-commit tests

b331925

Use script exit codes where PASS="" has been used

c5dee05

Exit immediately on check_result_hash failure

0c47694

Add missing initialization for test_has_errors

515eaa0

Restructure test runner for nicer output

95ce89a

Remove sample test and unnecessary EXIT_CODE

6578f19

Add all tests passed message

6c46080

Make on test upper-case again

28708f0

Split up 'Local recovery and sticky scheduling' test cases

b74d3f9

Remove unnecessary source of common.sh

ec8e728

Fix wrongly printed [PASS] message

c9badc5

Use EXIT_CODE variable instead of test_has_errors

043e0d8

Use absolute path for sourcing common.sh

4b19431

florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from f7f6471 to 4b19431 Compare June 13, 2018 12:33

zentol reviewed Jun 13, 2018


Address PR comments

ace36aa

zentol requested changes Jun 13, 2018


Move more methods to new EXIT_CODE convention

a114667

florianschmidt1994 force-pushed the flink-9257-fix-all-tests-pass-message branch from b62e14c to a114667 Compare June 13, 2018 14:04

zentol pushed a commit to zentol/flink that referenced this pull request Jun 14, 2018

[FLINK-9257][tests] Fix wrong "All tests pass" message

0796ac7

This closes apache#6053.

asfgit closed this in 45ac85e Jun 14, 2018

zentol added a commit to zentol/flink that referenced this pull request Jun 15, 2018

[hotfix] fix python failing on Windows due to long path

606046c

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 15, 2018

[hotfix] fix python failing on Windows due to long path

1544a50

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018

[hotfix] fix python failing on Windows due to long path

7857fa5

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018

[hotfix][py] Fix python failing due to long paths

b05fb42

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018

[hotfix][py] Fix python failing due to long paths

031fb4e

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018

[hotfix][py] Fix python failing due to long paths

fefb95d

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 19, 2018

[hotfix][py] Fix python failing due to long paths

347ad4c

This closes apache#6053.

zentol added a commit to zentol/flink that referenced this pull request Jun 20, 2018

[hotfix][py] Fix python failing due to long paths

1273700

This closes apache#6053.

sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018

[FLINK-9257][tests] Fix wrong "All tests pass" message

8c35961

This closes apache#6053.

rmetzger added the component=Tests label Mar 18, 2019
