Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout #253
That's interesting. I have never seen such a problem before. I'll try to reproduce it, but it is possibly related to your network environment. I predict that you are trying to deploy an Olric cluster on Kubernetes. How do the nodes discover each other? There is a plugin to discover nodes in a Kubernetes environment, but it is not properly maintained.
People generally use "lan" as the network environment for the memberlist configuration, and memberlist configuration can be tricky. Are you using the cluster client to connect to the cluster? I suspect there is a subtle issue in your network setup.
We are deploying Olric in embedded mode and use the approach in #195 for service discovery. We also tried https://github.com/buraksezer/olric-cloud-plugin as well as setting a static member list. For all of these approaches, we see a DMap creation operation timeout:
If there were a network problem, I would expect the 1st case to fail as well, but it always succeeds. In the 2nd case, traffic to the other running nodes all failed with errors like the following. But as soon as we stop the traffic, the restarted pods succeed and receive the routing table and DMap:
But when there is traffic, we cannot even see the first line of those logs; DMap creation simply fails. Also, we are using EmbeddedClient on each of the nodes to connect to Olric.
This is too difficult for me to analyze because I cannot reproduce the problem here. Possibly, the peer discovery code fails to propagate membership changes or to remove dead nodes from the system.
@zhp007 We are still using the same code from the gist in #195, and I haven't seen any indication of problems with it, even with some very aggressive autoscaling set up. Here's the config we use:

```go
// create a new Olric configuration
cfg := config.New("lan") // default configuration
cfg.ServiceDiscovery = map[string]any{
	"plugin": k8sDisc,
}
cfg.ReplicationMode = config.AsyncReplicationMode
cfg.LogLevel = "WARN"
cfg.LogVerbosity = 1
```
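For context, a minimal sketch of how a config like the one above is typically used to start an embedded Olric node and obtain an EmbeddedClient. This is not the poster's exact code: the `Started` callback wiring, the DMap name, and the omission of the service-discovery plugin are assumptions for illustration.

```go
// Sketch: start an embedded Olric node and create a DMap via the
// embedded client. Error handling and names are illustrative only.
package main

import (
	"context"
	"log"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
)

func main() {
	cfg := config.New("lan") // default "lan" memberlist settings
	cfg.ReplicationMode = config.AsyncReplicationMode

	// Started is invoked once the node is ready to serve requests.
	ctx, cancel := context.WithCancel(context.Background())
	cfg.Started = func() {
		defer cancel()
		log.Println("Olric node is ready")
	}

	db, err := olric.New(cfg)
	if err != nil {
		log.Fatalf("failed to create Olric instance: %v", err)
	}

	go func() {
		// Start blocks until Shutdown is called.
		if err := db.Start(); err != nil {
			log.Fatalf("olric.Start returned an error: %v", err)
		}
	}()
	<-ctx.Done() // wait until the node is operational

	// Use the embedded client to create/access a DMap.
	client := db.NewEmbeddedClient()
	dm, err := client.NewDMap("example-dmap") // hypothetical DMap name
	if err != nil {
		log.Fatalf("failed to create DMap: %v", err)
	}
	_ = dm

	_ = db.Shutdown(context.Background())
}
```

The point of waiting on the `Started` callback is that `NewDMap` issues cluster operations; calling it before the node has joined and received a routing table is one common source of operation timeouts.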
One follow-up regarding #251.
We are using Olric embedded mode to build a cache service with the config:
Config env: "wan"
PartitionCount: 271
ReplicaCount: 2
ReplicationMode: AsyncReplicationMode
We create only one DMap.
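Expressed as code, the configuration described above might look like the following sketch (field names are from the olric `config` package; everything else is assumed):

```go
// Sketch of the reported configuration; not the poster's actual code.
cfg := config.New("wan")                       // "wan" network environment
cfg.PartitionCount = 271                       // default partition count
cfg.ReplicaCount = 2                           // one replica per partition
cfg.ReplicationMode = config.AsyncReplicationMode
```

With `ReplicaCount: 2` on a 3-pod cluster, killing one pod leaves only one spare owner per partition, so the cluster is sensitive to how quickly the dead member is detected and removed.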
We are testing with 3 pods. While 1000 QPS of traffic was ongoing, we killed one pod to test a failure scenario.
The new pod fails to create the DMap with
NewDMap
with the following error:
If DMap creation fails, we fail the pod creation. Thus, new pod creation kept failing with the above error.
And other nodes see different kinds of errors. Here are some samples:
Then the entire cluster of 3 nodes failed to serve the incoming requests.
Besides, with the same setup we sometimes observed an operation timeout on DMap creation even when there was no traffic.