Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Venafi issuer 10s timeout on certificate issuance. #5108

Closed
hawksight opened this issue May 9, 2022 · 3 comments · Fixed by #5247
Closed

Venafi issuer 10s timeout on certificate issuance. #5108

hawksight opened this issue May 9, 2022 · 3 comments · Fixed by #5247
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@hawksight
Copy link
Member

Is your feature request related to a problem? Please describe.

The problem is sporadic Timeout Errors with a Venafi TPP issuer. There is a hard coded timeout of 10s in the Venafi TPP isser when retrieving a certificate from TPP.

https://github.com/cert-manager/cert-manager/blob/master/pkg/issuer/venafi/client/request.go#L55-L62

Often certifcates can be ready only shortly after the 10s timeout.

A similar issue has been seen there with the Issuer setup:

Describe the solution you'd like

Allow a configurable timeout value for certificate issuance, particularly for the Venafi Issuer. As TPP it not a CA itself, there is an additional call to the CA to account for.

There can also be multiple custom workflows or adapatable scripts which esentially allow you to do anything in powershell.
These for example could include calls out to other services before or during certificate generation. A common example might be to approve or deny a certificate based on some metadata.

The more complex the adapatable workflow or slower the CA is to respond, the more likely it is that a given CertificateRequest will hit a timeout. The result it waiting until the next retry. The problem is worse when scaled as there will be many logs and redundant certificate requests.

Often teams using cert-manager have little or no control over the TPP service or the CAs used, so improving response times is not something they can achieve. This solution would give them a way to tweak cert-manager to their environment.

Describe alternatives you've considered

Perhaps up the default hard coded timeout from 10s to 20 or 30s as a simple solution.
Could be some consequences to doing this though?

Additional context

Example of Timeout errors seen:

E0409 13:32:12.027971       1 controller.go:158] cert-manager/controller/certificaterequests-issuer-venafi "msg"="re-queuing item due to error processing" "error"="Post \"https://<REDACTED>/vedsdk/certificates/request\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "key"="<NAMESPACE>/example-xxxxx" 
E0409 13:32:11.983977       1 sync.go:132] cert-manager/controller/certificaterequests-issuer-venafi "msg"="error issuing certificate request" "error"="Post \"https://<REDACTED>/vedsdk/certificates/request\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "related_resource_kind"="Issuer" "related_resource_name"="<TPP-ISSUER>" "related_resource_namespace"="<NAMEPACE>" "related_resource_version"="v1" "resource_kind"="CertificateRequest" "resource_name"="example-xxxxx" "resource_namespace"="<NAMESPACE>" "resource_version"="v1" 
E0409 13:32:11.983932       1 venafi.go:130] cert-manager/controller/certificaterequests-issuer-venafi/sign "msg"="Failed to request venafi certificate" "error"="Post \"https://<REDACTED>t/vedsdk/certificates/request\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "related_resource_kind"="Issuer" "related_resource_name"="<TPP_ISSUER>" "related_resource_namespace"="<NAMESPACE>" "related_resource_version"="v1" "resource_kind"="CertificateRequest" "resource_name"="example-xxxxx" "resource_namespace"="<NAMEAPACE>" "resource_version"="v1" 

There is also potential that the actual Timeout error often seen could be a misrepresentation and obstructing a useful error message. Referencing this issue on the vcert docs:

I've seen similar timeout errors on a variety of API endpoints as listed:

  • /vedsdk/metadata/get
  • /vedsdk/certificates/checkpolicy
  • /vedsdk/authorize
  • /vedsdk/certificates/request
  • /vedsdk/certificates/retrieve

But I can't find any hard coded timeouts specifically that would cause these, other than maybe the defaults for an api request to TPP via vcert:
https://github.com/Venafi/vcert/blob/master/pkg/venafi/tpp/tpp.go#L441-L455

So for this issue, mainly just to tackle the know 10s timeout.

/kind feature

@hawksight
Copy link
Member Author

From the cert-manager bi-weekly community meeting, @SgtCoDFish discussed his proposal around timeouts, which can be read here or in #5214

The summary / scope that I read, is that there is some impact / crossover around a certificate issuing timeout, but the Venafi timeout in particular will not be part of the scope of that change, due to back porting complexity. @SgtCoDFish is that summary correct?

If so I could make a separate PR / change around Venafi specifically.

In more detail, the general theme is to up any existing timeout to a suitably large value based on settings from other k8s controller research. This currently is preferably to adding new configuration options for timeout settings.

@SgtCoDFish
Copy link
Member

That's a fair summary @hawksight yeah - I'd appreciate + review a PR doing this for the Venafi issuer. I think we should be able to backport that change as well.

I'd rather keep both ACME and Venafi in separate PRs so that if there's a problem with backporting one it won't affect the other.

@hawksight
Copy link
Member Author

For anyone else coming across this, the tl;dr is:

Fixed by upping from 10s to 60s timeout on retrieving certificates.
Having a ClusterIssuer / Issuer config options adds unnecessary complexity.
This should be released in version 1.9.0 of cert-manager with approx date of July 6th (1 week today).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants