Sidecar connection issues with placement service with on-disk raft logs #7749

Open
sicoyle opened this issue May 21, 2024 · 5 comments

@sicoyle
Contributor

sicoyle commented May 21, 2024

In what area(s)?

/area placement

What version of Dapr?

latest

Relates to: #4881 and #4882 and #5583

Expected Behavior

The sidecar connects to the placement service without issue and has built-in fault tolerance for connection issues with placement as they arise. A stable connection between the sidecar and the placement service is necessary for workflows to function properly.

Actual Behavior

The sidecar disconnects, has trouble reconnecting, and only later is able to connect to the placement service again. This is a single instance of placement with on-disk logs for raft.

Abbreviated logs from the sidecar:

{..."msg":"Disconnected from placement: rpc error: code = FailedPrecondition desc = only leader can serve the request",...}
{..."msg":"try to connect to placement service: ..."...}
{..."msg":"Error connecting to placement service (will retry to connect in background): rpc error: code = Unavailable desc = last resolver error: produced zero addresses",...}

I also saw logs from the sidecar showing "only leader can serve the request" even though I'm running a single instance. Maybe placement somehow lost its leadership lease, and because there is no other instance to pick up the lease, the sidecar gets stuck.

However, the interesting thing here is that to fix the problem I had to restart the sidecar, which recycles the TCP connection to placement. After that, the sidecar could connect properly.

Another interesting observation: on another sidecar I see this repeat every 2 hours. Something sends a termination signal to placement, placement shuts down, and that is when the sidecar has issues connecting. The sidecar's connection problems make sense while the placement service is down and spinning back up, but it is unclear to me what sends the termination signal to placement, or why. Just before the shutdown, I see the following in the placement logs:

... Start disseminating tables. memberUpdateCount: 1,...
... Completed dissemination. memberUpdateCount: 1,...
...Received signal 'terminated'; beginning shutdown\...
...Waiting until all connections are drained...
...Raft server is shutting down ...
...Healthz server is shutting down
...Raft sever shutdown
... Received signal 'terminated'; beginning shutdown...

Steps to Reproduce the Problem

I'm not even sure why it is occurring, to be honest, so this is going to be challenging to reproduce. More exploration is needed to give a better answer here.

Release Note

RELEASE NOTE:
FIX Bug in placement with in-mem Raft logs preventing proper connection with sidecar.

@sicoyle sicoyle added the kind/bug Something isn't working label May 21, 2024
@elena-kolevska
Contributor

/assign

@Cherrs

Cherrs commented May 22, 2024

I am preparing to use dapr_placement.cluster.forceInMemoryLog=true in a production environment. I haven't enabled high availability (HA) before. This makes me very concerned.

@sicoyle sicoyle changed the title Sidecar connection issues with placement service with in-memory raft logs Sidecar connection issues with placement service with on-disk raft logs May 28, 2024
@sicoyle
Contributor Author

sicoyle commented May 28, 2024

I updated the title and description to reflect that we actually use on-disk raft logs. I saw the dapr flag default being in memory so assumed that's what we were using (see here); however, we actually use the helm value default which sets placement to on-disk for raft logs. So, the issue in my case is specific to on-disk raft logs.

ping @Cherrs - just wanted to let you know the update :)

@jjcollinge
Contributor

jjcollinge commented May 29, 2024

> I updated the title and description to reflect that we actually use on-disk raft logs. I saw the dapr flag default being in memory so assumed that's what we were using (see here); however, we actually use the helm value default which sets placement to on-disk for raft logs. So, the issue in my case is specific to on-disk raft logs.
>
> ping @Cherrs - just wanted to let you know the update :)

Isn't this Helm default (forceInMemoryLog=false) only actually used when the HA value is also set to true? Check the statefulset manifest

@elena-kolevska
Contributor

Yeah, that's correct: we can only have an on-disk log when we have a placement cluster (HA enabled):

{{- if or (eq .Values.global.ha.enabled true) (eq .Values.ha true) }}
        - "--id"
        - "$(PLACEMENT_ID)"
        - "--initial-cluster"
        - {{ template "dapr_placement.initialcluster" . }}
  {{- if eq .Values.cluster.forceInMemoryLog false }}
        - "--raft-logstore-path"
    {{- if eq .Values.global.daprControlPlaneOs "windows" }}
        - "{{ .Values.cluster.logStoreWinPath }}\\cluster-v2-$(PLACEMENT_ID)"
    {{- else }}
        - "{{ .Values.cluster.logStorePath }}/cluster-v2-$(PLACEMENT_ID)"
    {{- end }}
  {{- end }}
{{- end }}
