Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recreate dead solver pods during self-check #1388

Merged
merged 8 commits into from
Feb 20, 2019

Conversation

DanielMorsing
Copy link
Contributor

What this PR does / why we need it:
While solver pods are fairly simple and shouldn't die, they can be terminated for a variety of reasons, like draining nodes or overzealous sysadmins. Recreate the solver pod if it goes away while we're waiting for the self-check to pass.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged):
fixes #1311

Release note:

More robust handling of solver pods

Daniel Morsing added 3 commits February 19, 2019 15:22
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels Feb 20, 2019
@jetstack-bot jetstack-bot added area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/http01 Indicates a PR modifies ACME HTTP01 provider code area/testing Issues relating to testing approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 20, 2019
err := s.cleanupPods(ch)
if err != nil {
return nil, err
}
return nil, fmt.Errorf(errMsg)
return s.createPod(ch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't just change this to call createPod without also making the pod's name determinate else there's a risk we could enter a loop of:

  • create a pod/present successfully
  • resync occurs and we've not observed the new pod in the lister
  • a new pod will be created - because we use GenerateName, the call to the apiservers Create endpoint will succeed, next time we loop around here, we'll delete both pods because multiple exist, and the loop re-runs.

We can avoid this by not using GenerateName in buildPod, and instead using Name (where 'name' is predictable). That way, we'll get an AlreadyExists error if we hit this race, causing the controller to resync after a back-off (which allows for the cache to become consistent)

@@ -222,4 +226,49 @@ var _ = framework.CertManagerDescribe("ACME Certificate (HTTP01)", func() {
err = h.WaitCertificateIssuedValid(f.Namespace.Name, certificateName, time.Minute*5)
Expect(err).NotTo(HaveOccurred())
})

It("should obtain a signed certificate with a single CN from the ACME server after solver pod killed", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change to "should automatically recreate challenge pod and still obtain a certificate if it is manually deleted"? (just thinking about how we'll read these test failure messages in a few months time when we start breaking things 😅)

for _, p := range podlist.Items {
log.Logf("solver pod %s", p.Name)
// TODO(dmo): make this cleaner instead of just going by name
if strings.Contains(p.Name, "http-solver") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 not important for this PR really, but we can use a label selector in the List call above to ensure we get the correct pods 😄 (similar to how we do it in ensurePod)

By("killing the solver pod")
podClient := f.KubeClientSet.CoreV1().Pods(f.Namespace.Name)
var pod corev1.Pod
err = wait.PollImmediate(500*time.Millisecond, time.Minute,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe change Poll to 1s? 500ms might put systems under strain if we're running 20 tests concurrently

Daniel Morsing added 4 commits February 20, 2019 14:27
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
This reverts commit 6b81093.

Signed-off-by: Daniel Morsing <dmo@jetstack.io>
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
Signed-off-by: Daniel Morsing <dmo@jetstack.io>
@DanielMorsing
Copy link
Contributor Author

/retest

Signed-off-by: Daniel Morsing <dmo@jetstack.io>
@munnerz
Copy link
Member

munnerz commented Feb 20, 2019

😄

/lgtm
/approve

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2019
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DanielMorsing, munnerz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [DanielMorsing,munnerz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme/http01 Indicates a PR modifies ACME HTTP01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code area/testing Issues relating to testing dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP01 Pods not recreated after kubectl delete po
3 participants