ENHANCE_YOUR_CALM and too_many_pings (again) #326

Closed
jcogilvie opened this issue Aug 9, 2023 · 20 comments
Labels
bug (Something isn't working), Stale, upstream-dependency (Issue depends on changes being made to upstream dependencies, e.g. `argoproj/argo-cd`)

Comments

@jcogilvie commented Aug 9, 2023

Terraform Version, ArgoCD Provider Version and ArgoCD Version

Terraform version: 1.4.6
ArgoCD provider version: 5.6.0
ArgoCD version: 2.6.7

Affected Resource(s)

  • argocd_application
  • probably others too

Terraform Configuration Files

A generic multi-source application with 6 sources, all of them Helm charts, all with values inline on their source objects.

Output

module.this_app[0].argocd_application.this: Modifying... [id=crm-pushback:argocd]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 10s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 20s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 30s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 40s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 50s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 1m0s elapsed]

│ Error: failed to update application crm-pushback
│ 
│   with module.this_app[0].argocd_application.this,
│   on .terraform/modules/this_app/main.tf line 117, in resource "argocd_application" "this":
│  117: resource "argocd_application" "this" {
│ 
│ rpc error: code = Unavailable desc = closing transport due to: connection
│ error: desc = "error reading from server: EOF", received prior goaway:
│ code: ENHANCE_YOUR_CALM, debug data: "too_many_pings"

Steps to Reproduce

  1. terraform apply

Expected Behavior

Update is applied

Actual Behavior

Failed with above error

Important Factoids

Public Argo CD endpoint to an EKS cluster.

References

When a client receives a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings", it should log the occurrence at a log level that is enabled by default and double the configured KEEPALIVE_TIME used for new connections on that channel.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jcogilvie added the `bug` label on Aug 9, 2023
@jcogilvie (Author)

For a sample size of one, I had some better luck with this after setting grpc_web = true on the provider config. I'll see if it recurs, but further validation would be helpful.
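
For reference, a minimal sketch of what that provider configuration looks like (the server address and credential below are placeholders, not values from this issue):

# grpc_web = true routes calls through the Argo CD API server's gRPC-Web handler
# instead of raw gRPC, which appears to avoid the HTTP/2 keepalive pings that
# trigger ENHANCE_YOUR_CALM / too_many_pings.
provider "argocd" {
  server_addr = "argocd.example.com:443" # placeholder endpoint
  auth_token  = var.argocd_auth_token    # placeholder credential
  grpc_web    = true
}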

@onematchfox (Collaborator)

@jcogilvie you mentioned here that the issue is repeatable. Any chance you can share that config?

@jcogilvie (Author)

Well, there's a lot of terraform machinery around how it's actually configured, but I can give you a generalized lay of the terraform land, plus the app manifest that ends up being applied. I hope that's close enough.

Here's the (minimized) manifest:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: crm-pushback
  namespace: argocd
spec:
  destination:
    namespace: crm-pushback
    server: https://kubernetes.default.svc
  project: crm-pushback
  revisionHistoryLimit: 10
  sources:
    - chart: mycompany-api-service
      helm:
        releaseName: api
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: first-query-complete-receiver
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: first-status-poller
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: second-query-complete-receiver
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: second-status-poller
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-cronjob
      helm:
        releaseName: syncqueries
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
  syncPolicy:
    automated: {}
    retry:
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 2m
      limit: 5

It's built through this tf module:

resource "argocd_repository" "this" {
  repo    = data.github_repository.this.http_clone_url
  project = argocd_project.this.metadata[0].name

  lifecycle {
    # these get populated upstream by argo
    ignore_changes = [githubapp_id, githubapp_installation_id]
  }
}

locals {
  helm_repo_url = "https://mycompany.helm.repo/artifactory/default-helm/"

  multiple_sources = [for source in var.services : {
    repo_url        = local.helm_repo_url
    chart           = source.source_chart
    path            = source.local_chart_path != null ? source.local_chart_path : ""
    target_revision = source.local_chart_path != null ? var.target_infra_revision : source.source_chart_version
    helm = {
      release_name = source.name
      values       = source.helm_values
    }
  }]

  sources         = local.multiple_sources
  sources_map     = { for source in local.sources : source.helm.release_name => source }
}

resource "argocd_project" "this" {
  metadata {
    name        = var.service_name
    namespace   = "argocd"
    labels      = {}
    annotations = {}
  }

  spec {
    description = var.description

    source_namespaces = [var.namespace]
    source_repos      = [data.github_repository.this.html_url, local.helm_repo_url]

    destination {
      server    = var.destination_cluster
      namespace = var.namespace
    }

    role {
      name        = "owner"
      description = "Owner access to ${var.service_name}.  Note most operations should be done through terraform."
      policies = [
	     ...
      ]
      groups = [
        ...
      ]
    }

  }
}

locals {
  sync_policy = var.automatic_sync_enabled ? {
    automated = {
      allowEmpty = false
      prune      = var.sync_policy_enable_prune
      selfHeal   = var.sync_policy_enable_self_heal
    }
  } : {}
}

resource "argocd_application" "this" {
  count = var.use_raw_manifest ? 0 : 1

  wait = var.wait_for_sync

  metadata {
    name      = var.service_name
    namespace = "argocd"
    labels    = {} # var.tags -- tags fail validation because they contain '/'
  }

  spec {
    project = argocd_project.this.metadata[0].name

    destination {
      server    = var.destination_cluster
      namespace = var.namespace
    }

    dynamic "source" {
      for_each = local.sources_map
      content {
        repo_url        = source.value.repo_url
        path            = source.value.path
        chart           = source.value.chart
        target_revision = source.value.target_revision
        helm {
          release_name = source.value.helm.release_name
          values       = source.value.helm.values
        }
      }
    }

    sync_policy {

      dynamic "automated" {
        for_each = var.automatic_sync_enabled ? {
          automated_sync_enabled = true
        } : {}

        content {
          allow_empty = false
          prune       = var.sync_policy_enable_prune
          self_heal   = var.sync_policy_enable_self_heal
        }
      }

      retry {
        limit = var.sync_retry_limit
        backoff {
          duration     = var.sync_retry_backoff_base_duration
          max_duration = var.sync_retry_backoff_max_duration
          factor       = var.sync_retry_backoff_factor
        }
      }
    }
  }
}

@jcogilvie (Author) commented Aug 15, 2023

Note that for this specific case, the creation doesn't get too_many_pings, but any kind of update does (e.g., updating the image in the sources).

Making the application sufficiently bigger can cause too_many_pings on create as well. One of my apps has something like 30 sources, which was just too much for the provider (maybe for the CLI too?), so I had to skip the provider and go straight to a kubernetes_manifest resource, which was somewhat disappointing (though quick).

@amedinagar

Bumping this; I'm experiencing the same problem when deploying, sometimes seemingly at random.
@jcogilvie how do you skip the provider using kubernetes_manifest?
Thanks!

@peturgq commented Sep 11, 2023

Bump, I'm also experiencing this on provider version 5.6.0.
This happens for me on creation of argocd_cluster.

edit: Upgrading the provider to 6.0.3 does not seem to resolve the issue.

@jcogilvie (Author)

@amedinagar I used a kubernetes_manifest resource with Argo's declarative configuration (see the sketch after the list below).

There are a few gotchas:

  1. make sure you add finalizers
  2. you'll probably want a wait statement similar to this:

     wait {
       fields = {
         "status.sync.status" = "Synced"
       }
     }

  3. the kubernetes_manifest provider has issues with the argo CRDs as of argo 2.8, when the schema changed to introduce a field with x-kubernetes-preserve-unknown-fields on it. So, my CRDs are presently stuck on argo 2.6.7.
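
To make that concrete, here is a minimal sketch of such a resource, assuming the hashicorp/kubernetes provider; the chart, repo, and values shown are cut-down placeholders based on the manifest earlier in this issue:

# Sketch: drive the Argo CD Application CRD directly via kubernetes_manifest,
# with the resources-finalizer so deletes cascade, and a wait block on sync status.
resource "kubernetes_manifest" "argocd_application" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name       = "crm-pushback"
      namespace  = "argocd"
      finalizers = ["resources-finalizer.argocd.argoproj.io"]
    }
    spec = {
      project = "crm-pushback"
      destination = {
        namespace = "crm-pushback"
        server    = "https://kubernetes.default.svc"
      }
      sources = [
        {
          chart          = "mycompany-api-service"
          repoURL        = "https://mycompany.helm.repo/artifactory/default-helm/"
          targetRevision = "2.2.0" # placeholder pin
          helm = {
            releaseName = "api"
            values      = "enabled: true" # placeholder values
          }
        },
      ]
      syncPolicy = {
        automated = {}
      }
    }
  }

  # Block until Argo CD reports the app as synced.
  wait {
    fields = {
      "status.sync.status" = "Synced"
    }
  }
}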

@onematchfox added the `upstream-dependency` label on Sep 28, 2023
@onematchfox (Collaborator)

Will revisit once GRPCKeepAliveEnforcementMinimum is made configurable in the underlying argocd module. Related to argoproj/argo-cd#15656

@renperez-cpi

This is also happening to me when I'm using the ArgoCD cli to do the app sync.

@jcogilvie (Author)

@onematchfox looks like the upstream PR has been merged making the keepalive time configurable.

@onematchfox (Collaborator)

> @onematchfox looks like the upstream PR has been merged making the keepalive time configurable.

Yeah, I see that. Although we will need to wait for this to actually be released (at a glance, the PR was merged into main, so it will only be in the 2.9 release - feel free to correct me if I'm wrong), and then it will take some consideration as to how we implement it here, given that we need to support older versions as well.

@jcogilvie (Author)

Looks like 2.9 is released. What kind of consideration are we talking about here? How tightly is the client library coupled to the API?

Perusing the upstream PR, it looks like the server and the API client both expect an environment variable to be set (via common).
So, if I'm understanding correctly, the new env var is something we can set in the client process, and it'll simply be ignored if we need to use an older client lib version.

Given the implementation, I actually wonder if setting it here in a new client would also fix the issue when running against an older server version.
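
For anyone who wants to experiment with the server-side half of this, a rough sketch of the idea as a Terraform-managed Helm release; the ARGOCD_GRPC_KEEP_ALIVE_MIN variable name and the server.env value path are assumptions based on the upstream PR and the argo-helm chart, so verify them against the versions you actually run:

# Assumption-heavy sketch: raise the API server's keepalive enforcement minimum
# via the argo-helm chart. Both the env var name (ARGOCD_GRPC_KEEP_ALIVE_MIN)
# and the server.env value path should be checked against your chart/Argo version.
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = "argocd"

  values = [yamlencode({
    server = {
      env = [
        { name = "ARGOCD_GRPC_KEEP_ALIVE_MIN", value = "30s" } # assumed name and duration format
      ]
    }
  })]
}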

@danielkza

I'm still seeing this frequently when connecting to Argo 2.9.2. What's the status on moving to the new library?

@jcogilvie (Author)

Any chance this gets looked at soon, @onematchfox? Is there anything we can do to help?

@donovanrost commented Feb 5, 2024

I am also experiencing this issue, but when adding a new cluster. I'm happy to provide any additional information to help resolve this.
Some additional details:
I'm using provider version 6.0.3.
ArgoCD information:
{
  "Version": "v2.9.6+ba62a0a",
  "BuildDate": "2024-02-05T11:24:01Z",
  "GitCommit": "ba62a0a86d19f71a65ec2b510a39ea55497e1580",
  "GitTreeState": "clean",
  "GoVersion": "go1.20.13",
  "Compiler": "gc",
  "Platform": "linux/amd64",
  "KustomizeVersion": "(devel) unknown",
  "HelmVersion": "v3.14.0+g3fc9f4b",
  "KubectlVersion": "v0.24.17",
  "JsonnetVersion": "v0.20.0"
}

@donovanrost

After switching to the official ArgoCD helm chart from the Bitnami chart and updating to version 2.10, this issue has gone away for me

@blakepettersson (Collaborator)

Actually it seems like this only got released with 2.10. I wonder if it is just enough to run a 2.10 server, as @donovanrost seems to have done.

@jcogilvie (Author)

I have a strong suspicion that my case was somehow related to me having an entirely-too-large helm repo index file (~80 megs).

@onematchfox (Collaborator)

Hey folks, sorry for the lack of response here. As of v6.1.0, the provider now imports v2.9.9 of argoproj/argo-cd. I do suspect that this issue is mostly server-side, so you may need to update your Argo instance to v2.10 as @donovanrost suggested. But if that doesn't work, then we're certainly open to PRs upgrading the deps in this provider to v2.10, since the changes to the client-side code didn't land in 2.9.
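
If you want to make sure you're picking up the v6.1.0+ client code, a minimal version pin might look like this (the registry source shown is an assumption; adjust it to whichever fork of the provider you actually use):

terraform {
  required_providers {
    argocd = {
      source  = "argoproj-labs/argocd" # assumed registry source
      version = ">= 6.1.0"             # first version importing argo-cd v2.9.9
    }
  }
}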

github-actions bot commented Oct 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Oct 4, 2024
github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Oct 14, 2024