Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout #253

Open
zhp007 opened this issue May 20, 2024 · 4 comments
@zhp007

zhp007 commented May 20, 2024

A follow-up to #251.

We are using Olric embedded mode to build a cache service with the config:
Config env: "wan"
PartitionCount: 271
ReplicaCount: 2
ReplicationMode: AsyncReplicationMode
We create only one DMap.

We are testing with 3 pods. While traffic was running at 1000 QPS, we killed one pod to test a failure scenario.

The new pod failed to create the DMap with NewDMap, which returned the following error:

operation timeout 

If DMap creation fails, we fail the pod creation. As a result, every new pod kept failing with the above error.
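One option the report leaves open is to retry DMap creation with a bounded backoff before failing the pod. The following is a minimal sketch of that idea, not a confirmed fix: `retryWithBackoff`, `runDemo`, and `errOperationTimeout` are hypothetical names, and the `create` function is a stand-in for whatever wraps the real `NewDMap` call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errOperationTimeout is a stand-in for the "operation timeout" error in the
// report; it is not the error value exported by Olric.
var errOperationTimeout = errors.New("operation timeout")

// retryWithBackoff calls create until it succeeds, the attempts run out, or
// ctx is done. The delay doubles after each failed attempt, capped at maxDelay.
func retryWithBackoff(ctx context.Context, attempts int, initial, maxDelay time.Duration, create func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	delay := initial
	var err error
	for i := 1; i <= attempts; i++ {
		if err = create(); err == nil {
			return nil
		}
		if i == attempts {
			break // no point sleeping after the last attempt
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up after %d attempts: %w", i, ctx.Err())
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// runDemo wires retryWithBackoff to a fake create function that fails
// `failures` times before succeeding, and reports how often it was called.
func runDemo(failures, attempts int) (int, error) {
	calls := 0
	create := func() error {
		calls++
		if calls <= failures {
			return errOperationTimeout
		}
		return nil
	}
	err := retryWithBackoff(context.Background(), attempts, time.Millisecond, 4*time.Millisecond, create)
	return calls, err
}

func main() {
	calls, err := runDemo(2, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```

In a real service, `create` would close over the embedded client and the DMap name, and the attempt budget would need to fit within the pod's startup probe window. Retrying only helps if the timeout is transient; it would not help if the node never receives a routing table.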

The other nodes logged different kinds of errors. Here are some samples:

[ERROR] Failed to delete replica key/value on dmap.test_table: dial tcp 172.19.46.26:3320: connect: no route to host => delete.go:82"}
[INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 270 to 172.19.21.125:3320 => balancer.go:86"}
[ERROR] Failed to move DMap fragment: test_table on PartID: 270 to 172.19.21.125:3320: dial tcp 172.19.21.125:3320: connect: connection refused => balancer.go:91"}

After that, the entire 3-node cluster failed to serve incoming requests.

In addition, we sometimes observed the operation timeout on DMap creation with the same setup even when there was no traffic.

@zhp007 zhp007 changed the title Entire cluster fail to serve requests when one pod restarts with DMap creation operation timeout Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout May 20, 2024
@buraksezer buraksezer self-assigned this May 20, 2024
@buraksezer
Owner

buraksezer commented May 20, 2024

That's interesting. I have never seen such a problem before. I'll try to reproduce it, but it is possibly related to your network environment. I assume you are trying to deploy an Olric cluster on Kubernetes. How do the nodes discover each other? There is a plug-in to discover nodes in a Kubernetes environment, but it is not properly maintained.

Config env: "wan"

People generally use "lan" as the network environment for memberlist configuration. memberlist configuration can be tricky.

Are you using the cluster client to connect to the cluster? I guess there is a subtle issue in your network setup.

@zhp007
Author

zhp007 commented May 20, 2024

We are deploying Olric in embedded mode and using the approach in #195 for service discovery.

We also tried https://github.com/buraksezer/olric-cloud-plugin as well as setting a static member list. With all of these approaches, we see the DMap creation operation timeout:

  • During cluster setup, it happens sometimes, but the pod can come up later and form the cluster.
  • After killing one pod (in a 3-pod cluster) while there is traffic, the new pod can never come up; DMap creation fails with operation timeout.

If there were a network problem, I would expect the 1st case to fail as well, but it always succeeds.

For the 2nd case, traffic to the other running nodes all failed, either with errors like tcp 172.19.22.75:3320: connect: connection refused or with deadline exceeded/canceled, and the entire cluster could no longer serve traffic.

But as soon as we stop the traffic, the restarted pod succeeds and receives the routing table and DMap fragments:

[INFO] Routing table has been pushed by 172.18.140.44:3320 => operations.go:92"}
[INFO] Received DMap (kind: Primary): realtime-leaf on PartID: 7 => balance.go:128"}
[INFO] Received DMap (kind: Backup): realtime-leaf on PartID: 270 => balance.go:128"}

But when there is traffic, we do not see even the 1st of these log lines; DMap creation just fails with operation timeout.

Also, markBootstrap seems to be the prerequisite for the DMap CheckBootstrap. We didn't see https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L92 in the logs, which means startup never reaches that point.
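To make the suspected dependency concrete, here is a minimal sketch of a bootstrap gate: operations wait until the node has been marked as bootstrapped, and return a timeout if that never happens within the deadline. This illustrates the general pattern only and is not Olric's actual implementation; `bootstrapGate`, `markBootstrapped`, `checkBootstrap`, and `waitMillis` are hypothetical names, and the assumption is that the gate opens when a routing table is received.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// errOperationTimeout is a stand-in for the "operation timeout" error in the
// report; it is not the error value exported by Olric.
var errOperationTimeout = errors.New("operation timeout")

// bootstrapGate blocks callers until the node has been marked as bootstrapped.
type bootstrapGate struct {
	once sync.Once
	done chan struct{}
}

func newBootstrapGate() *bootstrapGate {
	return &bootstrapGate{done: make(chan struct{})}
}

// markBootstrapped opens the gate. In this sketch it is assumed to be called
// once the first routing table arrives. Calling it more than once is safe.
func (g *bootstrapGate) markBootstrapped() {
	g.once.Do(func() { close(g.done) })
}

// checkBootstrap waits until the gate opens or ctx expires.
func (g *bootstrapGate) checkBootstrap(ctx context.Context) error {
	select {
	case <-g.done:
		return nil
	case <-ctx.Done():
		return errOperationTimeout
	}
}

// waitMillis is a small helper that waits on the gate for at most ms milliseconds.
func waitMillis(g *bootstrapGate, ms int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return g.checkBootstrap(ctx)
}

func main() {
	g := newBootstrapGate()
	// No routing table yet: the check runs into its deadline.
	fmt.Println("before routing table:", waitMillis(g, 20))
	// Routing table received: the gate opens and the check passes.
	g.markBootstrapped()
	fmt.Println("after routing table:", waitMillis(g, 20))
}
```

Under this model, a node that never receives a routing table would fail every gated operation with a timeout, which would be consistent with the missing "Routing table has been pushed" log line described above.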

We are using EmbeddedClient on each of the nodes to connect to Olric.

@buraksezer
Owner

This is too difficult for me to analyze because I cannot reproduce the problem here. Possibly, the peer discovery code fails to propagate or remove dead nodes from the system.

  • olric-cloud-plugin is abandonware. I last tested it on Kubernetes a long time ago.
  • Static peer list is just for playing with Olric on localhost.
  • Error logs are normal. Olric can be too chatty about network problems. You can try to decrease the verbosity level. Checkout this
    Derived from kubernetes/klog:

@derekperkins
Contributor

@zhp007 We are still using the same code from the gist in #195, and I have never seen any indication of problems with it, even with some very aggressive autoscaling set up. Here's the config we use:

// create a new Olric configuration
cfg := config.New("lan") // default configuration
cfg.ServiceDiscovery = map[string]any{
	"plugin": k8sDisc,
}
cfg.ReplicationMode = config.AsyncReplicationMode
cfg.LogLevel = "WARN"
cfg.LogVerbosity = 1
