Sidecar connection issues with placement service with on-disk raft logs #7749

sicoyle · 2024-05-21T13:43:05Z

In what area(s)?

/area placement

What version of Dapr?

latest

Expected Behavior

The sidecar can connect to the placement service without issue and has built-in fault tolerance for connection issues with placement as they arise. A stable connection between the sidecar and the placement service is necessary for workflows to function properly.

Actual Behavior

Sidecar disconnects, has issues reconnecting, and then is able to connect with placement service. This is a single instance of placement with on-disk logs for raft.

Abbreviated logs from the sidecar:

{..."msg":"Disconnected from placement: rpc error: code = FailedPrecondition desc = only leader can serve the request",...}
{..."msg":"try to connect to placement service: ..."...}
{..."msg":"Error connecting to placement service (will retry to connect in background): rpc error: code = Unavailable desc = last resolver error: produced zero addresses",...}
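
For triage, the gRPC status code and description can be pulled out of `msg` strings like the ones above. This is a minimal sketch, assuming the messages keep the `rpc error: code = ... desc = ...` shape shown in these logs; `parse_rpc_error` is a hypothetical helper, not part of Dapr.

```python
import re

# Matches the "rpc error: code = <Code> desc = <text>" fragment seen in the sidecar logs.
_RPC_ERROR = re.compile(r"rpc error: code = (\w+) desc = (.+)")


def parse_rpc_error(msg):
    """Return (code, description) from a log message, or None if no rpc error is present."""
    match = _RPC_ERROR.search(msg)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_rpc_error(
        "Disconnected from placement: rpc error: code = FailedPrecondition "
        "desc = only leader can serve the request"
    ))
```

Separating `FailedPrecondition` (placement reachable but not leader) from `Unavailable` (no address resolved) helps distinguish the two failure phases in the logs.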

I also saw sidecar logs showing "only leader can serve the request" even though I'm running a single instance. So maybe placement somehow lost its leadership lease, and since there is no other instance to pick up the lease, the sidecar gets stuck.

However, the interesting thing here is that to fix the problem, I had to restart the sidecar and recycle the TCP connection to placement. Then the sidecar could connect properly.
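
To illustrate the kind of fault tolerance described under Expected Behavior, here is a minimal sketch of a reconnect loop that builds a fresh connection on every attempt instead of reusing a stale one. It is not Dapr's actual reconnect logic; `connect_with_retry`, the `dial` callable, and the delay values are all assumptions made for illustration.

```python
import time


def connect_with_retry(dial, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call dial() until it succeeds, creating a new connection on each attempt.

    dial is a hypothetical zero-argument callable that returns a connection
    or raises ConnectionError. Delays double after each failure, capped at max_delay.
    """
    delay = base_delay
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return dial()  # fresh connection each time, nothing is reused
        except ConnectionError as err:
            last_error = err
            if attempt == max_attempts:
                break
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting in real time.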

Another interesting observation: on another sidecar I see this repeat every 2 hours. Something sends a termination signal to placement, placement shuts down, and that is when the sidecar has issues connecting. The sidecar's connection problems make sense, since the placement service is down and spinning back up, but it is unclear to me what sends the termination signals to placement, or why. This occurs after I see the following in the placement logs:

... Start disseminating tables. memberUpdateCount: 1,...
... Completed dissemination. memberUpdateCount: 1,...
...Received signal 'terminated'; beginning shutdown...
...Waiting until all connections are drained...
...Raft server is shutting down ...
...Healthz server is shutting down
...Raft sever shutdown
... Received signal 'terminated'; beginning shutdown...

Steps to Reproduce the Problem

I'm not even sure why it is occurring tbh, so this is going to be challenging to reproduce. More exploration is needed to give a better answer here.

Release Note

RELEASE NOTE:
FIX Bug in placement with in-mem Raft logs preventing proper connection with sidecar.


elena-kolevska · 2024-05-22T00:15:07Z

/assign

Cherrs · 2024-05-22T08:10:44Z

I am preparing to use dapr_placement.cluster.forceInMemoryLog=true in a production environment. I haven't enabled high availability (HA) before. This makes me very concerned.

sicoyle · 2024-05-28T22:18:17Z

I updated the title and description to reflect that we actually use on-disk raft logs. I saw the dapr flag default being in memory so assumed that's what we were using (see here); however, we actually use the helm value default which sets placement to on-disk for raft logs. So, the issue in my case is specific to on-disk raft logs.

ping @Cherrs - just wanted to let you know the update :)

jjcollinge · 2024-05-29T06:42:15Z

> I updated the title and description to reflect that we actually use on-disk raft logs. I saw the dapr flag default being in memory so assumed that's what we were using (see here); however, we actually use the helm value default which sets placement to on-disk for raft logs. So, the issue in my case is specific to on-disk raft logs.
>
> ping @Cherrs - just wanted to let you know the update :)

Isn't this Helm default (forceInMemoryLog=false) only actually used when the HA value is also set to true? Check the statefulset manifest

elena-kolevska · 2024-05-31T09:55:50Z

Yeah, that's correct, we can only have an on-disk log when we have a placement cluster:

{{- if or (eq .Values.global.ha.enabled true) (eq .Values.ha true) }}
        - "--id"
        - "$(PLACEMENT_ID)"
        - "--initial-cluster"
        - {{ template "dapr_placement.initialcluster" . }}
  {{- if eq .Values.cluster.forceInMemoryLog false }}
        - "--raft-logstore-path"
    {{- if eq .Values.global.daprControlPlaneOs "windows" }}
        - "{{ .Values.cluster.logStoreWinPath }}\\cluster-v2-$(PLACEMENT_ID)"
    {{- else }}
        - "{{ .Values.cluster.logStorePath }}/cluster-v2-$(PLACEMENT_ID)"
    {{- end }}
  {{- end }}
{{- end }}
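
The nesting in the template excerpt above can be restated as a small function to make the two guards explicit. This is a sketch of the conditional logic only, not of how Helm renders the chart; `placement_raft_args` and the placeholder strings in angle brackets are hypothetical stand-ins for the chart values and the `dapr_placement.initialcluster` template.

```python
def placement_raft_args(global_ha_enabled, ha, force_in_memory_log,
                        control_plane_os="linux",
                        log_store_path="<logStorePath>",
                        log_store_win_path="<logStoreWinPath>"):
    """Return the args the template excerpt would add, given the relevant values."""
    args = []
    # Outer guard: nothing in the excerpt is rendered unless HA is enabled.
    if global_ha_enabled or ha:
        args += ["--id", "$(PLACEMENT_ID)", "--initial-cluster", "<initialcluster>"]
        # Inner guard: the on-disk path is only added when in-memory logs are not forced.
        if not force_in_memory_log:
            if control_plane_os == "windows":
                path = log_store_win_path + "\\cluster-v2-$(PLACEMENT_ID)"
            else:
                path = log_store_path + "/cluster-v2-$(PLACEMENT_ID)"
            args += ["--raft-logstore-path", path]
    return args


if __name__ == "__main__":
    # Without HA, forceInMemoryLog=false has no effect: no log store path is passed.
    print(placement_raft_args(False, False, force_in_memory_log=False))
    # With HA and forceInMemoryLog=false, the on-disk path is passed.
    print(placement_raft_args(False, True, force_in_memory_log=False))
```

In other words, in this excerpt `forceInMemoryLog=false` only leads to `--raft-logstore-path` being set when one of the HA values is true.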

sicoyle added the kind/bug Something isn't working label May 21, 2024

dapr-bot assigned elena-kolevska May 22, 2024

sicoyle changed the title ~~Sidecar connection issues with placement service with in-memory raft logs~~ Sidecar connection issues with placement service with on-disk raft logs May 28, 2024
