Login attempt fails sometimes after 1.5.3 upgrade #3547

Closed
eroji opened this issue May 5, 2020 · 24 comments · Fixed by #3575
Labels:

  • bug/priority:high (Should be fixed in the next patch release)
  • bug/severity:criticial (A critical bug in ArgoCD, possibly resulting in data loss or severe degraded overall functionality)
  • bug (Something isn't working)
  • component:api (API bugs and enhancements)

@eroji

eroji commented May 5, 2020

After upgrading to 1.5.3, I'm getting the error below on the first attempt to authenticate via the /api/v1/session route after some idle period. It eventually works if I retry once or a few more times. I did not configure any rate limiting, and the credentials used to hit the route are those of the admin user. I'm not sure why this is happening.

time="2020-05-05T16:58:46Z" level=error msg="finished unary call with code Unknown" error="failed to enforce max concurrent logins limit: EOF" grpc.code=Unknown grpc.method=Create grpc.service=session.SessionService grpc.start_time="2020-05-05T16:58:46Z" grpc.time_ms=0.664 span.kind=server system=grpc
eroji added the bug (Something isn't working) label May 5, 2020
jannfis added the bug/in-triage (This issue needs further triage to be correctly classified) label May 5, 2020
@jannfis
Member

jannfis commented May 5, 2020

Hi @eroji, can you share a few more details about your environment, please? This error suggests that the Redis cache is not available (although only intermittently, according to your description).

It would be interesting to know:

  • Have you installed ArgoCD in an HA setup?
  • How did you upgrade, and from where (which version)?

@eroji
Author

eroji commented May 5, 2020

My apologies. I'm using the HA install. The only modification I made was adding the --insecure flag for argocd-server. I upgraded from 1.5.1.

@alexmt
Collaborator

alexmt commented May 5, 2020

I've encountered the same issue during an upgrade. The solution was to "restart" both the Redis StatefulSet and the Redis HA proxy. @eroji, can you give it a try, please?

I've seen the same issue with 1.5.1.

@eroji
Author

eroji commented May 5, 2020

Trying it now.

@eroji
Author

eroji commented May 5, 2020

It seems to be working? I'll check throughout the day to see if I hit this error again and report back.

@alexmt
Collaborator

alexmt commented May 5, 2020

We should look for, or file, an upstream bug in the Redis HA Helm chart. It does not seem to happen often: in my case, it happened for 1 out of ~40 argocd instances.

@eroji
Author

eroji commented May 6, 2020

Looks like it's still happening. I see that 1.5.4 has been released. I will try upgrading to that to see if it helps.

time="2020-05-06T07:54:14Z" level=error msg="finished unary call with code Unknown" error="failed to enforce max concurrent logins limit: EOF" grpc.code=Unknown grpc.method=Create grpc.service=session.SessionService grpc.start_time="2020-05-06T07:54:14Z" grpc.time_ms=0.886 span.kind=server system=grpc

@alexmt
Collaborator

alexmt commented May 6, 2020

Hello @eroji, 1.5.4 does not include any Redis-related changes, so I don't think it will help.

As a quick workaround, you can disable the concurrent login limit feature: set the env variable ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT=0 in the argocd-server deployment.

I'm going to enable retries in the Redis client and test them on local deployments.

@eroji
Author

eroji commented May 7, 2020

@alexmt Not sure why, but it seems like upgrading to 1.5.4 resolved the issue. I didn't have to add the env var at all...

@alexmt
Collaborator

alexmt commented May 12, 2020

Created a ticket in the redis-ha chart repository: DandyDeveloper/charts#26

alexmt reopened this May 12, 2020
alexmt added the bug/severity:criticial and bug/priority:high labels and removed the bug/in-triage label May 13, 2020
@alexmt
Collaborator

alexmt commented May 13, 2020

The PR that introduces Redis retries during the login flow has been merged: #3575

@alexmt
Collaborator

alexmt commented May 13, 2020

Adding a big WARNING about the possible Redis issue to the 1.4 -> 1.5 upgrade instructions as well: #3584. This is probably as much as we can do:

Once all three are done, I think the ticket can be closed. Does that look reasonable to you, @jannfis, @jessesuen?

jannfis added the component:api label May 14, 2020
@alexmt
Collaborator

alexmt commented May 18, 2020

v1.5.5 with the Redis retries has been released. Please give it a try. Closing the ticket until we hear about Redis issues again.

alexmt closed this as completed May 18, 2020
@asvasyanin

> v1.5.5 with the Redis retries has been released. Please give it a try. Closing the ticket until we hear about Redis issues again.

I still have this issue in 1.5.5. Like @eroji, the only modification I have is the --insecure flag.

@samhuss

samhuss commented May 29, 2020

Same issue with v1.5.5. It works only when setting this env variable on argocd-server:
ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT=0

In the logs I'm getting this after many timeouts; I thought it might help:

5/29/2020 7:13:54 PM 2020/05/29 17:13:54 cache: Get key="session|login.attempts|1.0.0" failed: dial tcp: i/o timeout
5/29/2020 7:13:54 PM time="2020-05-29T17:13:54Z" level=error msg="Could not retrieve login attempts: dial tcp: i/o timeout"
5/29/2020 7:14:14 PM 2020/05/29 17:14:14 cache: Get key="session|login.attempts|1.0.0" failed: dial tcp: i/o timeout
5/29/2020 7:14:14 PM time="2020-05-29T17:14:14Z" level=error msg="Could not retrieve login attempts: dial tcp: i/o timeout"
5/29/2020 7:14:34 PM 2020/05/29 17:14:34 cache: Set key="session|login.attempts|1.0.0" failed: dial tcp: i/o timeout
5/29/2020 7:14:34 PM time="2020-05-29T17:14:34Z" level=error msg="Could not update login attempts: dial tcp: i/o timeout"
5/29/2020 7:14:34 PM time="2020-05-29T17:14:34Z" level=info msg="Issuing claims: { 0 1590772474 argocd 1590772474 admin}"
5/29/2020 7:14:34 PM time="2020-05-29T17:14:34Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Create grpc.service=session.SessionService grpc.start_time="2020-05-29T17:13:34Z" grpc.time_ms=60206.44 span.kind=server system=grpc
5/29/2020 7:14:35 PM time="2020-05-29T17:14:35Z" level=info msg="received unary call /session.SessionService/GetUserInfo" grpc.method=GetUserInfo grpc.request.claims="{\"iat\":1590772474,\"iss\":\"argocd\",\"nbf\":1590772474,\"sub\":\"admin\"}" grpc.request.content= grpc.service=session.SessionService grpc.start_time="2020-05-29T17:14:35Z" span.kind=server system=grpc
5/29/2020 7:14:35 PM time="2020-05-29T17:14:35Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetUserInfo grpc.service=session.SessionService grpc.start_time="2020-05-29T17:14:35Z" grpc.time_ms=0.456 span.kind=server system=grpc
5/29/2020 7:14:35 PM time="2020-05-29T17:14:35Z" level=info msg="received unary call /cluster.ClusterService/List" grpc.method=List grpc.request.claims="{\"iat\":1590772474,\"iss\":\"argocd\",\"nbf\":1590772474,\"sub\":\"admin\"}" grpc.request.content= grpc.service=cluster.ClusterService grpc.start_time="2020-05-29T17:14:35Z" span.kind=server system=grpc 

alexmt reopened this May 29, 2020
alexmt added this to the v1.6 GitOps Engine milestone May 29, 2020
@onelapahead

Does this affect users signing in via an IDP such as Okta?

@ajayr5

ajayr5 commented Jun 22, 2020

Adding this to my argocd-server deployment resolved the issue:

env:
  - name: ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT
    value: "0"

@jannfis
Member

jannfis commented Jun 24, 2020

Often, when log entries like these

5/29/2020 7:14:34 PM 2020/05/29 17:14:34 cache: Set key="session|login.attempts|1.0.0" failed: dial tcp: i/o timeout

can be observed, there is either a problem with in-cluster DNS resolution, some other connectivity issue within the cluster, or the Redis pod is not running at all.

@ajayr5

ajayr5 commented Jun 25, 2020

> Often, when log entries like these
>
> 5/29/2020 7:14:34 PM 2020/05/29 17:14:34 cache: Set key="session|login.attempts|1.0.0" failed: dial tcp: i/o timeout
>
> can be observed, there is either a problem with in-cluster DNS resolution, some other connectivity issue within the cluster, or the Redis pod is not running at all.

I get this issue only when the cluster is created on a bare-metal Azure VM. It works perfectly fine with a cluster on an EC2 instance.
Now I'm getting an error while adding a git repo:
rpc error: code = Unknown desc = Get "https://gitlab.com/xxxxx/xxxxxxx.git/info/refs?service=git-upload-pack": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@creet0007

creet0007 commented Jun 30, 2020

env:
  - name: ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT
    value: "0"

Where exactly should I add these values? Can you show me a screenshot of this?

@ajayr5

ajayr5 commented Jul 1, 2020

> env:
>   - name: ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT
>     value: "0"
>
> Where exactly should I add these values? Can you show me a screenshot of this?

Add this to the argocd-server Deployment in install.yaml. You can try adding it at https://github.com/argoproj/argo-cd/blob/master/manifests/install.yaml#L2646

@creet0007

> > env:
> >   - name: ARGOCD_MAX_CONCURRENT_LOGIN_REQUESTS_COUNT
> >     value: "0"
> >
> > Where exactly should I add these values? Can you show me a screenshot of this?
>
> Add this to the argocd-server Deployment in install.yaml. You can try adding it at https://github.com/argoproj/argo-cd/blob/master/manifests/install.yaml#L2646

I got this:

error: error validating "install.yaml": error validating data: ValidationError(Deployment.spec.template.spec.containers[0]): unknown field "-env" in io.k8s.api.core.v1.Container; if you choose to ignore these errors, turn validation off with --validate=false

@creet0007

It works now. Thanks a lot :)

rachelwang20 self-assigned this Aug 3, 2020
@rachelwang20
Contributor

rachelwang20 commented Aug 7, 2020

Made the change in #4049


9 participants