Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout #253

zhp007 · 2024-05-20T01:19:22Z

A follow-up to #251.

We are using Olric embedded mode to build a cache service with the config:
Config env: "wan"
PartitionCount: 271
ReplicaCount: 2
ReplicationMode: AsyncReplicationMode
We create only one DMap.

We are testing with 3 pods. While traffic of 1000 QPS was ongoing, we killed one pod to test a failure scenario.

The new pod fails to create the DMap with NewDMap and returns the following error:

operation timeout

If DMap creation fails, we fail the pod creation, so new pod creation kept failing with the above error.

The other nodes see different kinds of errors. Here are some samples:

[ERROR] Failed to delete replica key/value on dmap.test_table: dial tcp 172.19.46.26:3320: connect: no route to host => delete.go:82"}

[INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 270 to 172.19.21.125:3320 => balancer.go:86"}
[ERROR] Failed to move DMap fragment: test_table on PartID: 270 to 172.19.21.125:3320: dial tcp 172.19.21.125:3320: connect: connection refused => balancer.go:91"}

Then the entire cluster of 3 nodes failed to serve incoming requests.

We also sometimes observed the operation timeout on DMap creation with the same setup even when there was no traffic.


buraksezer · 2024-05-20T14:20:04Z

That's interesting. I have never seen such a problem before. I'll try to reproduce it, but it is possibly related to your network environment. I assume you are deploying an Olric cluster on Kubernetes. How do the nodes discover each other? There is a plug-in for discovering nodes in a Kubernetes environment, but it is not properly maintained.

Config env: "wan"

People generally use "lan" as the network environment for memberlist configuration. memberlist configuration can be tricky.

Are you using the cluster client to connect to the cluster? I guess there is a subtle issue in your network setup.

zhp007 · 2024-05-20T21:29:48Z

We are deploying Olric in embedded mode and use the approach in #195 for service discovery.

We also tried https://github.com/buraksezer/olric-cloud-plugin as well as setting a static member list. With all of these approaches, we see the DMap creation operation timeout in two cases:

1. During cluster setup: it happens sometimes, but a later pod can come up and form the cluster.
2. After killing one pod (in a 3-pod cluster) while there is traffic: the new pod can never come up and fails with the DMap operation timeout.

If there were a network problem, I would expect the 1st case to fail as well, but it always succeeds eventually.

In the 2nd case, all traffic to the other running nodes failed with errors such as tcp 172.19.22.75:3320: connect: connection refused or deadline exceeded/canceled, and the entire cluster could no longer serve traffic.

But as soon as we stop the traffic, the restarted pods succeed and receive the routing table and DMap fragments:

[INFO] Routing table has been pushed by 172.18.140.44:3320 => operations.go:92"}
[INFO] Received DMap (kind: Primary): realtime-leaf on PartID: 7 => balance.go:128"}
[INFO] Received DMap (kind: Backup): realtime-leaf on PartID: 270 => balance.go:128"}

But when there is traffic, we cannot even see the 1st line of these logs; DMap creation just fails with operation timeout.

Also, markBootstrap seems to be the prerequisite for the DMap CheckBootstrap. We didn't see https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L92 in the logs, which means startup never reaches that point.

We are using EmbeddedClient on each of the nodes to connect to Olric.

buraksezer · 2024-05-25T10:21:26Z

This is too difficult for me to analyze because I cannot reproduce the problem here. Possibly, the peer discovery code fails to propagate or remove dead nodes from the system.

olric-cloud-plugin is abandonware. I last tested it on Kubernetes a long time ago.
A static peer list is just for playing with Olric on localhost.
Error logs are normal. Olric can be too chatty about network problems. You can try decreasing the verbosity level. Check out this:

olric/pkg/flog/flog.go

Line 27 in 81e1254

Derived from kubernetes/klog:

derekperkins · 2024-06-07T17:54:35Z

@zhp007 We are still using the same code from the gist in #195, and I haven't ever seen indications that there have been problems with it, with some very aggressive autoscaling set up. Here's the config we use:

// create a new Olric configuration
cfg := config.New("lan") // default configuration
cfg.ServiceDiscovery = map[string]any{
	"plugin": k8sDisc,
}
cfg.ReplicationMode = config.AsyncReplicationMode
cfg.LogLevel = "WARN"
cfg.LogVerbosity = 1

zhp007 changed the title ~~Entire cluster fail to serve requests when one pod restarts with DMap creation operation timeout~~ Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout May 20, 2024

buraksezer self-assigned this May 20, 2024

zhp007 mentioned this issue May 21, 2024

Potential lock contention: Coordinator is not able to publish routing table to new node which causes it fails to be bootstrapped #254

Open
