HDDS-8962. Ensure docker env is stopped #5011

Merged (7 commits) on Jul 2, 2023

@adoroszlai adoroszlai commented Jun 30, 2023

What changes were proposed in this pull request?

Currently, if an acceptance test script exits abruptly due to a failed command (e.g. a timeout in wait_for_port), the Docker environment is left running, logs are not saved, and subsequent test scripts may fail due to port conflicts. This has been happening recently in HA-secure:

Timed out waiting on datanode4 9856 to become available
ERROR: Test execution of ozonesecure-ha/test.sh is FAILED!!!!

This change adds a trap to ensure stop_docker_env is executed.
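
For readers unfamiliar with the mechanism, here is a minimal sketch of how an EXIT trap runs cleanup even when a script stops early. The names stop_env and failing_step are hypothetical stand-ins, not functions from the Ozone test library, and the failure is simulated with an explicit exit rather than a real timeout.

```shell
#!/usr/bin/env bash
# Sketch only: stop_env and failing_step are hypothetical stand-ins,
# not the real stop_docker_env / wait_for_port functions.

run_demo() (
  # The function body is a subshell, so the trap stays local to the demo.
  stop_env() { echo "stopping environment"; }
  failing_step() { echo "timed out waiting for port"; return 1; }

  trap stop_env EXIT          # register cleanup before anything can fail
  echo "starting environment"
  failing_step || exit 1      # abrupt exit; the EXIT trap still fires
  echo "running tests"        # never reached
)

status=0
output="$(run_demo)" || status=$?
printf '%s\n' "$output"
echo "exit status: $status"
```

The handler runs after the failing step and before the subshell terminates, and the non-zero exit status is still reported to the caller.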

https://issues.apache.org/jira/browse/HDDS-8962

How was this patch tested?

$ OZONE_ACCEPTANCE_SUITE=failing ./hadoop-ozone/dev-support/checks/acceptance.sh
...
Executing test ozone/test-failures1.sh
...
Safe mode is off
No OM HA service, no need to wait
Port 1234 is not available on scm yet
Port 1234 is not available on scm yet
Port 1234 is not available on scm yet
Port 1234 is not available on scm yet
Timed out waiting on scm 1234 to become available
...
Stopping ozone_s3g_1      ... done
Stopping ozone_httpfs_1   ... done
Stopping ozone_recon_1    ... done
Stopping ozone_datanode_3 ... done
Stopping ozone_datanode_2 ... done
Stopping ozone_datanode_1 ... done
Stopping ozone_om_1       ... done
Stopping ozone_scm_1      ... done
...

Verified that logs are saved for failures1, too:

$ find target/acceptance -type f | grep -Fv '.stack' | sort
target/acceptance/log.html
target/acceptance/output.log
target/acceptance/ozone/failures1/docker-ozone.log
target/acceptance/ozone/failures1/scm-audit-28ee34c8bf11.log
target/acceptance/ozone/failures2/docker-ozone-ozone-test1-scm.log
target/acceptance/ozone/failures2/scm-audit-9076f47d8095.log
target/acceptance/ozone-failures2.xml
target/acceptance/report.html
target/acceptance/summary.html

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5424893835

@adoroszlai adoroszlai self-assigned this Jun 30, 2023

@smengcl smengcl left a comment

Thanks @adoroszlai for the improvement! Change mostly lgtm. I have one question inline.

Comment on lines 160 to +164
docker-compose --ansi never down
if ! { docker-compose --ansi never up -d --scale datanode="${datanode_count}" \
    && wait_for_safemode_exit \
    && wait_for_om_leader ; }; then
  [[ -n "$OUTPUT_NAME" ]] || OUTPUT_NAME="$COMPOSE_ENV_NAME"
  stop_docker_env
  return 1
fi

trap stop_docker_env EXIT HUP INT TERM

docker-compose --ansi never up -d --scale datanode="${datanode_count}"

@smengcl smengcl Jul 2, 2023

IIUC, with the changes in this PR:

  1. copy_daemon_logs and stop_docker_env calls are removed from execute_robot_test()
  2. copy_daemon_logs is now called by stop_docker_env()
  3. stop_docker_env is now called from start_docker_env()

Thus whoever calls start_docker_env last would have to call stop_docker_env in order to have the logs collected, right?

@adoroszlai adoroszlai (author)

Thanks @smengcl for the review.

> whoever calls start_docker_env last would have to call stop_docker_env in order to have the logs collected

It's like a ShutdownHook in Java. The shell calls stop_docker_env when the test script exits. start_docker_env just sets it up.
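
A rough illustration of that point: a trap registered inside a function does not run when the function returns, only when the shell that called it exits. In this sketch start_env and stop_env are hypothetical stand-ins for start_docker_env and stop_docker_env, and no Docker commands are run.

```shell
#!/usr/bin/env bash
# Sketch only: start_env and stop_env are hypothetical stand-ins for
# start_docker_env and stop_docker_env; nothing here touches Docker.

stop_env() { echo "stop_env: copy logs, stop containers"; }

start_env() {
  # Only registers the handler; stop_env does not run at this point.
  trap stop_env EXIT
  echo "start_env: environment started"
}

# A pretend test script, run in a subshell so that its exit fires the trap.
test_script() (
  start_env
  echo "test script: still running after start_env returned"
  echo "test script: done"
)

output="$(test_script)"
printf '%s\n' "$output"
```

The stop_env line is printed last, after the pretend test script has finished, which is why the caller does not need to invoke the stop function itself.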

@smengcl smengcl

Ah, makes sense now. That is exactly what trap is for. Thanks @adoroszlai.

@adoroszlai adoroszlai merged commit dd25740 into apache:master Jul 2, 2023
@adoroszlai adoroszlai deleted the HDDS-8962 branch July 2, 2023 15:55
@adoroszlai

Thanks @smengcl for the review.

vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jul 10, 2023
* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...