
[Flink 34569][e2e] fail fast if AWS cli container fails to start #24491

Open: robobario wants to merge 4 commits into master from FLINK-34569-end-to-end-test-timeout

Conversation

robobario
What is the purpose of the change

This pull request aims to make end-to-end test scripts that source common_s3_operations.sh fail fast if the aws cli container fails to start. It also adds a single naive retry aiming to recover from a transient network failure.

FLINK-34569 describes an issue where an end-to-end test run took 15 minutes to fail after the aws cli container failed to start. From the test logs:

2024-03-02T04:10:55.5496990Z Unable to find image 'banst/awscli:latest' locally
2024-03-02T04:10:56.3857380Z docker: Error response from daemon: Head "https://registry-1.docker.io/v2/banst/awscli/manifests/latest": read tcp 10.1.0.97:33016->54.236.113.205:443: read: connection reset by peer.
2024-03-02T04:10:56.3857877Z See 'docker run --help'.
2024-03-02T04:10:56.4586492Z Error: No such object:

This failure wasn't handled, so the script later got stuck in a loop running docker exec against an empty container name, i.e. commands like docker exec -t "" command.
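The masking is easy to reproduce: in bash, `export VAR=$(cmd)` reports the exit status of `export` (which succeeds), not of `cmd`. A minimal sketch, with `false` standing in for the failed docker run:

```shell
#!/usr/bin/env bash
# `export VAR=$(cmd)` reports the status of `export`, not of `cmd`,
# so a failed `docker run` leaves the variable empty without raising any error.
export AWSCLI_CONTAINER_ID=$(false)   # `false` stands in for the failed docker run
status=$?
echo "exit status: $status"           # prints 0: the failure is masked
echo "container id: '$AWSCLI_CONTAINER_ID'"
```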

To test it locally I've been provoking docker run failures by changing the image name to something non-existent.

Brief change log

  • Fail fast if aws cli container fails to run
  • Add naive retry when creating aws cli container
  • Add --rm to jq docker run commands to remove them on exit
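The fail-fast-plus-one-retry shape can be sketched as below. This is not the exact PR diff; the hypothetical `fail_once` stub stands in for the real `docker run -d banst/awscli` so the snippet runs without Docker:

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast pattern: aws_cli_start returns (rather than exits)
# on failure so the caller can attempt one naive retry.

attempts=0
fail_once() {              # hypothetical stub for `docker run -d banst/awscli`
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]    # fails on the first call, succeeds on the second
}

aws_cli_start() {
  if ! fail_once; then
    echo "running aws cli container failed" >&2
    return 1               # return, not exit, so a retry is possible
  fi
}

# the single naive retry from the PR:
if ! { aws_cli_start || aws_cli_start; }; then
  echo "running the aws cli container failed" >&2
  exit 1
fi
echo "container started after $attempts attempt(s)"
```

With the stub failing exactly once, the retry succeeds and the script prints "container started after 2 attempt(s)".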

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

I verified that it fails fast by changing the awscli image name to a non-existent one to provoke a docker run failure, causing it to fail like:

==============================================================================
Running 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions'
==============================================================================
TEST_DATA_DIR: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53909550201
Flink dist directory: /home/roby/development/redhat-managed-kafka/upstream/flink/flink-dist/target/flink-1.20-SNAPSHOT-bin/flink-1.20-SNAPSHOT
Found AWS bucket robeyoun-testing-flink-13-03-2024, running the e2e test.
Found AWS access key, running the e2e test.
Found AWS secret key, running the e2e test.
Unable to find image 'banstz/awscli:latest' locally
docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
running aws cli container failed
Unable to find image 'banstz/awscli:latest' locally
docker: Error response from daemon: pull access denied for banstz/awscli, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
running aws cli container failed
running the aws cli container failed
[FAIL] Test script contains errors.
Checking for errors...
No errors in log files.
Checking for exceptions...
No exceptions in log files.
Checking for non-empty .out files...
grep: /home/roby/development/redhat-managed-kafka/upstream/flink/build-target/log/*.out: No such file or directory
No non-empty .out files.

[FAIL] 'test-scripts/test_file_sink.sh s3 StreamingFileSink skip_check_exceptions' failed after 0 minutes and 6 seconds! Test exited with exit code 1

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@@ -29,12 +29,18 @@
# AWSCLI_CONTAINER_ID
###################################
function aws_cli_start() {
export AWSCLI_CONTAINER_ID=$(docker run -d \

@flinkbot
Collaborator

flinkbot commented Mar 13, 2024

CI report:

Bot commands. The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build


@JingGe (Contributor) left a comment


Thanks for working on it! I just left one comment. PTAL

@@ -58,7 +64,11 @@ function aws_cli_stop() {
if [[ $AWSCLI_CONTAINER_ID ]]; then
aws_cli_stop
fi
aws_cli_start
aws_cli_start || aws_cli_start

@JingGe Mar 14, 2024


Will a simple retry like this let the script execute multiple times without proper cleanup of the previous container, given that aws_cli_start is not idempotent, afaiu?

@robobario (Author)


I'm assuming that docker run -d ... with detach does not create a container if the exit code is non-zero, so a failed aws_cli_start shouldn't leave anything to clean up. But I can't find docs stating that.

Happy to remove the retry; getting the test to fail faster with a more obvious cause is an improvement either way.

@robobario (Author)


I've added a failsafe to kill/remove the container if AWSCLI_CONTAINER_ID is non-empty after the docker run fails.
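A sketch of what that failsafe might look like. The `docker` function here is a stub (it prints an id but returns a failure code) so the snippet runs without Docker; the real script would of course call the actual CLI:

```shell
#!/usr/bin/env bash
# Failsafe sketch: if `docker run` fails but still printed a container id,
# remove that container so no orphan is left behind.

docker() {                 # stub: `run` emits an id yet fails; `rm` succeeds
  local cmd=$1; shift
  case "$cmd" in
    run) echo "deadbeef"; return 125 ;;
    rm)  return 0 ;;
  esac
}

AWSCLI_CONTAINER_ID=$(docker run -d banst/awscli) || {
  if [[ -n "$AWSCLI_CONTAINER_ID" ]]; then
    docker rm -f "$AWSCLI_CONTAINER_ID"   # failsafe removal of the orphan
  fi
  AWSCLI_CONTAINER_ID=""
}
echo "container id after failed start: '$AWSCLI_CONTAINER_ID'"
```

Note that a plain assignment (no `export`) propagates the substituted command's exit status, which is what lets the `||` branch fire here.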

Why:
An end-to-end test run failed, and the test logs showed that the
AWS cli container had failed to start. Because of the way the script is
organised, the failure in the subshell did not propagate, and
AWSCLI_CONTAINER_ID was left empty. This led to a loop trying to docker exec
a command in a container named "" and the test taking 15 minutes to time out.
This change speeds up the failure.

Note that we use 'return' to prevent an immediate failure of the script so
that we have the potential to implement a simple retry.

Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
An end-to-end test run failed with what looked like a transient network
exception when pulling the aws cli image. This retries once.

Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
A large pile of exited jq containers were left in docker after
an operation was retried repeatedly.

Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
If for some reason the command returns a non-zero exit code and still
creates a container, this will remove it so we don't leave an orphan
stranded.

Signed-off-by: Robert Young <robeyoun@redhat.com>
@robobario robobario force-pushed the FLINK-34569-end-to-end-test-timeout branch from 119cd86 to 4b233a7 Compare March 17, 2024 20:13
@robobario robobario requested a review from JingGe March 19, 2024 04:13