backoffLimit for the test job set to 0. #434

Merged
merged 1 commit into from
Sep 25, 2019


@gusmith (Contributor) commented Sep 24, 2019

Otherwise, the default value is 6, which means that the job starts a new pod up to 6 times if the previous one failed.

@gusmith gusmith self-assigned this Sep 24, 2019
@gusmith gusmith marked this pull request as ready for review September 24, 2019 07:14
@hardbyte (Collaborator) left a comment

I guess we might see more failures by removing this retry mechanism... I'd have liked to have a rationale provided, but I'm approving assuming you have one.

@gusmith (Contributor Author) commented Sep 25, 2019

Currently, when the Python tests fail (e.g. on a timeout), the pod returns an error and the job starts another pod, which creates a number of new deployment problems. The pod mounts a volume that cannot be attached to multiple pods, and at the same time the job trying to publish the results is trying to access the same volume.
The problem was observed in the build https://dev.azure.com/data61/Anonlink/_build/results?buildId=553 (attempt 4 of the tests on Kubernetes): the pod trying to publish the results cannot access the volume where the test results are stored because another pod is trying to mount the same volume.

In any case, I do not think it makes sense to re-run the same tests when a failure occurs, as we may want to see all the failing tests.
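
For illustration, a minimal sketch of what a test Job with retries disabled could look like. The job name, image, command, mount path, and claim name are hypothetical placeholders and are not taken from this repository's manifests; only `backoffLimit: 0` reflects the change in this PR.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-tests            # hypothetical name
spec:
  backoffLimit: 0                    # no retries; Kubernetes defaults to 6 when unset
  template:
    spec:
      restartPolicy: Never           # a failed container is not restarted in place
      containers:
        - name: tests
          image: example/tests:latest            # hypothetical image
          command: ["python", "-m", "pytest"]    # hypothetical test command
          volumeMounts:
            - name: results
              mountPath: /results                # hypothetical path for test results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: test-results              # hypothetical claim for the results volume
```

With `backoffLimit: 0`, the first failed pod marks the Job as failed, so no second pod is created to compete for the results volume.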

@gusmith gusmith merged commit c68dcea into develop Sep 25, 2019
@gusmith gusmith deleted the fix-k8s-test-job-no-restart branch September 25, 2019 06:55