Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App synch fails with ComparisonError rpc error: code = DeadlineExceeded desc = context deadline exceeded #3864

Open
3 tasks done
papanito opened this issue Jun 27, 2020 · 36 comments
Labels

Comments

@papanito
Copy link

papanito commented Jun 27, 2020

Checklist:

  • I've searched in the docs and FAQ for my answer: http://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I've added different apps to my argo-cd but all of the fail to synch with the same error

ComparisonError  rpc error: code = DeadlineExceeded desc = context deadline exceeded

To Reproduce

Here is the manifest from one of the demo projects which fails

project: default
source:
  repoURL: 'https://github.com/argoproj/argocd-example-apps.git'
  path: kustomize-guestbook
  targetRevision: HEAD
destination:
  server: 'https://kubernetes.default.svc'
  namespace: guestbook

Remarks

Not sure if this matters however...

  • Cluster is setup with rke and contains 3 nodes.
  • Ingress is an argo-tunnel
  • Access to cluster seems ok
     argocd cluster list
     SERVER                          NAME  VERSION  STATUS      MESSAGE
     https://kubernetes.default.svc        1.18     Successful 

Expected behavior

Synchronization works fine and without errors

Version

argocd: v1.6.1+159674e
  BuildDate: 2020-06-19T00:39:46Z
  GitCommit: 159674ee844a378fb98fe297006bf7b83a6e32d2
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
argocd-server: v1.6.1+159674e
  BuildDate: 2020-06-19T00:41:05Z
  GitCommit: 159674ee844a378fb98fe297006bf7b83a6e32d2
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: {Version:kustomize/v3.6.1 GitCommit:c97fa946d576eb6ed559f17f2ac43b3b5a8d5dbd BuildDate:2020-05-27T20:47:35Z GoOs:linux GoArch:amd64}
  Helm Version: version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}
  Kubectl Version: v1.14.0

Logs

argocd app get demo
Name:               demo
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          guestbook
URL:                https://argocd.mycompany.com/applications/demo
Repo:               https://github.com/argoproj/argocd-example-apps.git
Target:             HEAD
Path:               kustomize-guestbook
SyncWindow:         Sync Allowed
Sync Policy:        <none>
Sync Status:        Unknown
Health Status:      Healthy

CONDITION        MESSAGE                                                              LAST TRANSITION
ComparisonError  rpc error: code = DeadlineExceeded desc = context deadline exceeded  2020-06-27 19:06:51 +0200 CEST

argocd-server

time="2020-07-01T09:18:45Z" level=info msg="Requested app 'demo' refresh"
time="2020-07-01T09:18:45Z" level=warning msg="finished unary call with code DeadlineExceeded" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=DeadlineExceeded grpc.method=RevisionMetadata grpc.service=application.ApplicationService grpc.start_time="2020-07-01T09:15:41Z" grpc.time_ms=183821.61 span.kind=server system=grpc

argocd-application-controller*

I suspect the issue is cause if the i/o timeout

2020/07/01 09:18:05 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56566->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:05Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56566->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=1 git_ms=460 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:05Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:05Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=1 fields.level=2 git_ms=460 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12570
time="2020-07-01T09:18:07Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:07Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:08Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=542 unmarshal_ms=542 version_ms=0
2020/07/01 09:18:20 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56652->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:20Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56652->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=543 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:20Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:20Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=2 fields.level=2 git_ms=543 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12636
time="2020-07-01T09:18:22Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:22Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:22Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=633 unmarshal_ms=633 version_ms=0
2020/07/01 09:18:34 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56756->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:34Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56756->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=633 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:35Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:35Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=2 fields.level=2 git_ms=633 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12756
time="2020-07-01T09:18:38Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:38Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:38Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=384 unmarshal_ms=384 version_ms=0
2020/07/01 09:18:50 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56842->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:50Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56842->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=385 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0

For convenience, I've attached the logs from argocd-server and argocd-application-controller

@papanito papanito added the bug Something isn't working label Jun 27, 2020
@WaldoFR
Copy link

WaldoFR commented Jun 28, 2020

Have you tried to increase status-processors , operation-processors and repo-server-timeout-seconds in your application-controller deployment as it was done in this issue #1641 ?

@papanito
Copy link
Author

papanito commented Jun 29, 2020

@WaldoFR thanks for the hint. I've added the following

- --status-processors
- "20"
- --operation-processors
- "10"
- --repo-server-timeout-seconds
- "180"

I am surprised as I only have 3 apps so far and I only sync them manually for now, so I would expect syncing one app should work and no timeout occurs

@WaldoFR
Copy link

WaldoFR commented Jun 29, 2020

Do you still face the synchronization problems ?

@papanito
Copy link
Author

No seems better now, apps show OutOfSynch which is fine. What helped was the increase of the timeout to 180

@papanito
Copy link
Author

papanito commented Jun 29, 2020

So it only temporarily worked - I got at least a status OutOfSync which was correct and after a sync I got the message Synch OK. But eventually I get again the same problem. Status shows Unknown and I can see the error again. Also when I try to check the app-diff I get Unable to load data: ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Screenshot-20200629151348-1920x1080

@WaldoFR
Copy link

WaldoFR commented Jun 29, 2020

Ok, we face the same problem (in v1.5.3), and increasing number of argocd-repo-server replicas and changing --repo-server-timeout-seconds does not help neither.

time="2020-06-29T14:03:41Z" level=error msg="finished unary call with code Unknown" error="Get \"https://private-repo.com/projet.deploy/project/project.git/info/refs?service=git-upload-pack\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2020-06-29T14:04:26Z" grpc.service=repository.RepoServerService grpc.start_time="2020-06-29T14:03:26Z" grpc.time_ms=15002.009 span.kind=server system=grpc

Seems that some other people are facing issue ( i.e. #3547 (comment) )

Related to this issue (#2957) may I ask you if you are currently deploying helm3 charts on your cluster ? It seems that we face this problems only on clusters with helm3 charts but we can't ensure it's related.

Regards,

EDIT : Additionnal informations, sometimes repository that are not syncing are synchronized for a few minutes.

@papanito
Copy link
Author

I have a single cluster and yes I have some stuff like cert-manager deployed with helm 3

@papanito
Copy link
Author

papanito commented Jun 30, 2020

If I switch back to v.1.5.0 I don't have the issue and sync work fine - updated the description with the new info.

@papanito
Copy link
Author

papanito commented Jul 1, 2020

It's does not seem it's the version, I have now the same issue with v.1.5.0

@papanito
Copy link
Author

papanito commented Jul 2, 2020

argocd-redis.log

@papanito
Copy link
Author

papanito commented Jul 2, 2020

So in my case is that I get a 10.42.67.206:56842->10.43.73.252:6379: i/o timeout whereas the ip is the cluster ip of redis-server

kubectl -n argocd describe service argocd-redis
Name:              argocd-redis
Namespace:         argocd
Labels:            app.kubernetes.io/component=redis
                   app.kubernetes.io/instance=argocd
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=argocd-redis
                   app.kubernetes.io/part-of=argocd
                   helm.sh/chart=argo-cd-2.5.0
Annotations:       Selector:  app.kubernetes.io/name=argocd-redis
Type:              ClusterIP
IP:                10.43.73.252
Port:              <unset>  5379/TCP
TargetPort:        5379/TCP
Endpoints:         10.42.112.147:5379
Session Affinity:  None
Events:            <none>

@papanito
Copy link
Author

papanito commented Jul 3, 2020

I completely removed argocd and installed it in a new namespace. So far it seems good and I don't have the issue mentioned. Still not clear why I had it in the first place.

@WaldoFR
Copy link

WaldoFR commented Jul 8, 2020

Have you proceed to auto-sync same deployments of your previous ArgoCD setup ?

@papanito
Copy link
Author

papanito commented Jul 8, 2020

@WaldoFR no at the moment I don't use auto-sync

@WaldoFR
Copy link

WaldoFR commented Jul 20, 2020

In my case I found that all commands were stopped at 15s exactly, browsing the code I found this part here that would explain we stop synchronization after 15 seconds :

// Returns a HTTP client object suitable for go-git to use using the following
// pattern:
// - If insecure is true, always returns a client with certificate verification
//   turned off.
// - If one or more custom certificates are stored for the repository, returns
//   a client with those certificates in the list of root CAs used to verify
//   the server's certificate.
// - Otherwise (and on non-fatal errors), a default HTTP client is returned.
func GetRepoHTTPClient(repoURL string, insecure bool, creds Creds) *http.Client {
    // Default HTTP client
    var customHTTPClient = &http.Client{
        // 15 second timeout
        Timeout: 15 * time.Second,
        // don't follow redirect
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

Git client seems to not handle timeout specified in --repo-server-timeout-seconds so, even if the GRPC request is extended the git fetch is not. Should this option also be applied to git client ?

Regards,

References :

Timeout: 15 * time.Second,

@dmitrievav
Copy link

I have the same issue with helm3, argocd 1.5.3 and cert-manager 0.16.1

error="application refresh deadline exceeded"

@dmitrievav
Copy link

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
      - name: argocd-application-controller
        command:
        - argocd-application-controller
        - --status-processors
        - "20"
        - --operation-processors
        - "50"
        - --repo-server-timeout-seconds
        - "300"
argocd/argocd-server-5b9466d7f-x62db[argocd-server]: time="2020-08-25T16:04:33Z" level=error msg="finished unary call with code Unknown" error="application refresh deadline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-08-25T16:03:33Z" grpc.time_ms=60003.48 span.kind=server system=grpc

and I have only dozen applications in control repo ...

@cemsusal
Copy link

cemsusal commented Aug 27, 2020

I have the same issue while adding a new https repository from bitbucket and gitlab:

Having following error message:

Unable to connect HTTPS repository: Get "https://bitbucket.org/xxx/yyy/info/refs?service=git-upload-pack": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I enabled --insecure flag for argocd-server

These are the settings applied to the argocd-application-controller:

- command:
        - argocd-application-controller
        - --status-processors
        - "20"
        - --operation-processors
        - "50"
        - --repo-server-timeout-seconds
        - "420"
        - --app-resync
        - "600"

I also tried to scale up argocd-repo-server (5 replicas) and it did not work.

git clone inside argocd-repo-server is working fine.

Installed following ArgoCD versions and had the same issue for all of them:

  • 1.5.0
  • 1.6.1
  • 1.7.1

My cluster has 3 nodes and 1 master.

This is happening for manual and auto sync policies while creating a new app.

@cemsusal
Copy link

My argocd-repo-server logs:

time="2020-08-27T09:25:05Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"
time="2020-08-27T09:25:05Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --generate-key /tmp/gpg-key-recipe563578480" dir= execID=mX2zE
time="2020-08-27T09:25:06Z" level=info msg="Populating GnuPG keyring with keys from /app/config/gpg/source"
time="2020-08-27T09:25:06Z" level=info msg="gpg --no-permission-warning --list-public-keys" dir= execID=iYECJ
time="2020-08-27T09:25:06Z" level=info msg="gpg --no-permission-warning -a --export 22088F9D0E59A460" dir= execID=k0Ftj
time="2020-08-27T09:25:06Z" level=info msg="gpg-wrapper.sh --no-permission-warning --list-secret-keys 22088F9D0E59A460" dir= execID=grdLd
time="2020-08-27T09:25:06Z" level=info msg="Loaded 0 (and removed 0) keys from keyring"
time="2020-08-27T09:25:06Z" level=info msg="argocd-repo-server v1.7.1+da5fa74 serving on [::]:8081"

I have got a bunch of no permission warning.

@jannfis
Copy link
Member

jannfis commented Aug 27, 2020

I have got a bunch of no permission warning.

These are no warnings. It's just a log of a command that got executed, which happen to be executed using a parameter --no-permission-warning

@jannfis
Copy link
Member

jannfis commented Aug 27, 2020

This is happening for manual and auto sync policies while creating a new app.

Hi @cemsusal, is it happening when you create an app or when you add the repository? For the latter, an initial connection attempt is made by argocd-server (and not the repo server). Is the argocd-server pod allowed to speak to Internet hosts or are you behind a HTTP proxy maybe?

@cemsusal
Copy link

This is happening while creating and app or adding a repository from bitbucket or gitlab. (Did not try github)

ARGOCD_GPG_ENABLED -> Is now set the false and issue still persists.

We are not behind a proxy. argocd-server can access to the internet.

@cemsusal
Copy link

This is a log from our repo-server:

time="2020-08-27T12:10:54Z" level=error msg="finished unary call with code Unknown" error="Get \"https://zzz:***@bitbucket.org/xxx/yyy.git/info/r efs?service=git-upload-pack\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" grpc.code=Unknown grpc.method=GenerateManifest grpc.req uest.deadline="2020-08-27T12:17:19Z" grpc.service=repository.RepoServerService grpc.start_time="2020-08-27T12:10:39Z" grpc.time_ms=15001.042 span.kind=server syst em=grpc

This is confirming that repo server is failing to fetch.

@cemsusal
Copy link

cemsusal commented Aug 27, 2020

@jannfis this is the output of git commands:

argocd@argocd-repo-server-6ddb994d9-s49gf:/tmp/https:__xxx:yyy@bitbucket.org_project$ git remote -v
origin  https://ttt:zzz@bitbucket.org/xxx/yyy.git (fetch)
origin  https://ttt:zzz@bitbucket.org/xxx/yyy.git (push)
argocd@argocd-repo-server-6ddb994d9-s49gf:/tmp/https:__xxx:yyy@bitbucket.org_project$ git pull origin master
From https://bitbucket.org/xxx/yyy
 * branch            master     -> FETCH_HEAD
Already up to date.

@cemsusal
Copy link

I have only one application on top of a fresh install of v.1.7.1. The issue still persists.

@cemsusal
Copy link

Installing Argo CD in another namespace did not resolve the issue for me.

@dmitrievav
Copy link

it may depends on this #3998
As new version of cert-manager has a huge manifest file (>1MB)

@onelapahead
Copy link

onelapahead commented Dec 3, 2020

We're experiencing something similar in v1.7.6.

Our application controller started spewing this log this morning:

patch: {"status":{"conditions":[{"lastTransitionTime":"2020-12-03T12:33:40Z","message":"rpc error: code = Unknown desc = Get \"https://github.com/Bandwidth/some-repo/info/refs?service=git-upload-pack\": dial tcp 140.82.113.3:443: i/o timeout (Client.Timeout exceeded while awaiting headers)","type":"ComparisonError"}]}}
--

The reason being our network is saturated and so packets are being dropped, connections timing out, etc.

So while this kind of behavior is expected during a network issue, I'm concerned as to the impact its having on application syncs.

What bothers me, is that the repo server hasn't thrown a single error and everything it needs is still in its cache. So why is the app controller timing out on syncs and putting a bunch of apps in the Unknown state, when it has everything it needs to do its job?

What is the point of the app controller hitting the Git repo with git-upload-pack?

@kintarowins
Copy link

same problem with v1.8.3+0f9c684

@WaldoFR
Copy link

WaldoFR commented Feb 16, 2021

Again, git timeout might be caused by slowness not related to ArgoCD. For instance the version control server you are using itself or tiers services related for authentication on it.

To troubleshoot that, ensure that realizing a git clone from your node with sames credentials as you configured won't last over 15 seconds.

@willemm
Copy link

willemm commented Nov 24, 2021

What if the git timeout is definitely caused by slowness not (directly) related to ArgoCD ?

To be more precise, what if the slowness is caused by rate limiting? Then argocd will start retrying all the calls that are rate-limited for more than 15 seconds, thus hammering the git server with more and more requests, thereby indirectly causing the rate limiting to increase.

It would certainly be very useful if it were possible to change that timeout setting, in that situation.

(PS: Sorry if this comes across emotionally, I just had a bad day trying to figure out why our production applications suddenly weren't updating any more, and why azuregit was suddenly refusing to talk to argocd.)

@mnasruul
Copy link

thanks a lot man, i follow this and i fix my issue. before i googling 3 days to fixed it.

@christophercutajar
Copy link
Contributor

christophercutajar commented Jan 25, 2022

We're experiencing the same timeout issues after deploying an ApplicationSet which creates 9 applications.

level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2022-01-24T19:20:45Z\",\"message\":\"rpc error: code = Unknown desc = `helm dependency build` failed timeout after 1m30s\",\"type\":\"ComparisonError\"}]}}" application=app

ArgoCD version deployed: v2.2.2+03b17e0
ArgoCD ApplicationSet version: v0.3.0

Controller settings:

controller:
    args:
      # -- define the application controller `--status-processors`
      statusProcessors: "60"
      # -- define the application controller `--operation-processors`
      operationProcessors: "40"
      # -- define the application controller `--repo-server-timeout-seconds`
      repoServerTimeoutSeconds: "360"

GitHub API rate limits are 5000 API calls per hour, or ~83 calls per minute. I'm not knowledgeable enough about ArgoCD engine to say that we're hitting such but limits but I think it could be very plausible that such limits are being reached as I'm able to create the Applications one by one!

UPDATE: I've also added appResyncPeriod together with the above configs and re-deployed. Being patient was key now! 🔑 After about 10minutes, none of my 9 applications where in an Unkown state anymore

#  - define the application controller `--app-resync`
      appResyncPeriod: "300"

UPDATE 2: At the seconds refresh, the repo-server started throwing timeouts again!

level=error msg="finished unary call with code Unknown" error="Manifest generation error (cached): `helm dependency build` failed timeout after 1m30s" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2022-01-25T14:18:48Z" grpc.service=repository.RepoServerService grpc.start_time="2022-01-25T14:12:48Z" grpc.time_ms=330.102 span.kind=server system=grpc

@tooptoop4
Copy link

@mnasruul what was the fix?

@mnasruul
Copy link

@mnasruul what was the fix?

follow this #3864 (comment)

@pp185214
Copy link

pp185214 commented May 2, 2022

Maybe it could help to somebody as well. We have ArgoCD on AKS cluster and we faced similar issue. According https://argo-cd.readthedocs.io/en/release-1.8/operator-manual/high_availability/ we have tried to set workaround:
--repo-server-timeout-seconds for argocd-application-controller and issue disapiered for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

14 participants