[FLINK-16921][e2e] Describe all resources and show pods logs before cleanup when failed #11630
…leanup when failed
The pods may be pending because of insufficient resources, disk pressure, or other problems. In that case wait_rest_endpoint_up will time out. Describing all resources will help to debug these problems.
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot.
Automated checks: last check on commit 9a9073f (Sat Apr 04 04:38:33 UTC 2020).
Thanks a lot for debugging this issue.
I'll merge this PR (Azure has passed)
echo "Debugging failed Kubernetes test:"
echo "Currently existing Kubernetes resources"
kubectl get all
kubectl describe all
If this also doesn't help, these are the commands I used to debug k8s the last time it was unstable:
kubectl get pods -o json -n kube-system
kubectl get pods -o json
kubectl get events -o json
kubectl get deployments -o json
kubectl describe pods
kubectl describe nodes
kubectl get nodes -o json
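As a rough sketch, the commands above could be bundled into one helper so that a failing command does not stop the rest of the dump. The function name `dump_kubernetes_state` is hypothetical and not part of the PR; only the kubectl invocations come from the list above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name not from the PR): runs every debugging command
# listed above and keeps going even if one of them fails.
dump_kubernetes_state() {
  local cmd
  local -a commands=(
    "get pods -o json -n kube-system"
    "get pods -o json"
    "get events -o json"
    "get deployments -o json"
    "describe pods"
    "describe nodes"
    "get nodes -o json"
  )
  for cmd in "${commands[@]}"; do
    echo "=== kubectl ${cmd} ==="
    # Word splitting on ${cmd} is intentional: each entry is an argument list.
    kubectl ${cmd} || echo "kubectl ${cmd} failed with exit code $?"
  done
}
```

Calling `dump_kubernetes_state` from a failure path would print each command's output under its own header, which makes long CI logs easier to scan.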
if [ $? != 0 ];then
debug_copy_and_show_logs
fi
SUCCEEDED=$?
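One thing to watch if these lines run in this order: `$?` always reflects the most recent command, so after the `if` block it no longer holds the status of the command that was checked. A minimal sketch of capturing the status first, assuming a stubbed `debug_copy_and_show_logs` and a hypothetical wrapper name `run_and_debug_on_failure`:

```shell
#!/usr/bin/env bash

# Stub standing in for the PR's helper; the real one copies and prints logs.
debug_copy_and_show_logs() {
  echo "collecting logs"
}

# Hypothetical wrapper: runs a command, saves its exit status right away,
# and only then decides whether to collect debugging output.
run_and_debug_on_failure() {
  "$@"
  local exit_code=$?   # capture before any other command overwrites $?
  if [ "$exit_code" -ne 0 ]; then
    debug_copy_and_show_logs
  fi
  return "$exit_code"
}

SUCCEEDED=0
run_and_debug_on_failure false || SUCCEEDED=$?
echo "SUCCEEDED=$SUCCEEDED"   # prints SUCCEEDED=1
```

Here `SUCCEEDED` follows the snippet's naming and holds the exit status of the wrapped command, not of the `if` block.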
I recently got to know bash traps, which are basically signal handlers.
You can do trap debug_and_show_logs EXIT, which will run debug_and_show_logs whenever the script exits.
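A minimal sketch of the trap approach, with a stubbed `debug_and_show_logs` (the real helper runs kubectl) and a hypothetical `run_test` function that stands in for the test script. It runs in a subshell so the EXIT trap fires when that subshell ends.

```shell
#!/usr/bin/env bash

# Stub for the PR's helper. Inside an EXIT trap, $? is the status the
# shell is exiting with, so the handler can stay silent on success.
debug_and_show_logs() {
  local exit_code=$?
  if [ "$exit_code" -ne 0 ]; then
    echo "Debugging failed Kubernetes test (exit code $exit_code)"
  fi
}

# Hypothetical test body; the parentheses make the function run in a subshell.
run_test() (
  trap debug_and_show_logs EXIT
  exit "$1"
)

run_test 0                                  # prints nothing
run_test 3 || echo "test exited with $?"    # debug message, then "test exited with 3"
```

The handler runs on every exit path, including an early `exit` or a failure under `set -e`, so no explicit check is needed after each step.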
We already use on_exit cleanup in common_kubernetes.sh, so I put debug_and_show_logs in cleanup. It will always be called before the script exits.
True. You could register on_exit debug_and_show_logs before sourcing common_kubernetes.sh.
But I'm fine with your solution. I just wanted to mention it so that you know it exists :)
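To illustrate why registration order matters, here is a simplified, hypothetical `on_exit` that is not the project's actual implementation. It assumes callbacks run in the order they were registered, which is what the suggestion above implies. The two callbacks are stubs.

```shell
#!/usr/bin/env bash

# Simplified stand-in for an on_exit helper (an assumption, not Flink's code):
# it collects command names and installs one EXIT trap that runs them in order.
_on_exit_commands=()

_run_on_exit_commands() {
  local cmd
  for cmd in "${_on_exit_commands[@]}"; do
    "$cmd"
  done
}

on_exit() {
  _on_exit_commands+=("$1")
  trap _run_on_exit_commands EXIT
}

# Stubs for the two callbacks discussed in the thread.
debug_and_show_logs() { echo "describe resources and show pod logs"; }
cleanup() { echo "delete test resources"; }

# Hypothetical test script, run in a subshell so its EXIT trap fires here.
demo() (
  on_exit debug_and_show_logs   # registered first, so it runs first
  on_exit cleanup               # stands in for what common_kubernetes.sh registers
  exit 1
)

demo || true
```

With this ordering the debugging output is collected while the resources still exist. Calling `debug_and_show_logs` at the start of `cleanup`, as the PR does, gives the same effect.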
What is the purpose of the change
The pods may be pending because of insufficient resources, disk pressure, or other problems. In that case wait_rest_endpoint_up will time out. Describing all resources will help to debug these problems.
We still see some failed runs that we cannot reproduce in a local environment (Mac/Linux). This PR is opened to run the e2e tests more times and find the root cause.
Brief change log
debug_and_show_logs before cleanup
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation