bug: Unable to reconnect to apisix, when all ep are deleted under svc of apisix #769

Closed
han6565 opened this issue Nov 24, 2021 · 8 comments · Fixed by #774

Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.

han6565 commented Nov 24, 2021

Issue description

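I deployed one apisix and one apisix-controller on k8s. Once both were configured, the controller's health check reported no problems.
After the apisix pod is killed, the controller still cannot reconnect to apisix, even once the pod has been brought back up.
However, when I then access the same pod address manually through the admin API, it works fine.
With two or more apisix pods there is no problem as long as at least one pod is alive, but if no pod is alive for a short time, the controller can never connect again.
Is this a problem with my configuration, or is the controller expected to never reconnect once no pod is reachable?

After the controller has started and synced, deleting the service and then restoring it reproduces the problem 100% of the time.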

2021-11-23T20:51:39+08:00	warn	apisix/cluster.go:452	failed to check health for cluster default: dial tcp 172.24.150.14:9180: connect: connection refused, will retry
2021-11-23T20:51:39+08:00	warn	ingress/controller.go:660	failed to check health for default cluster: timed out waiting for the condition, give up leader
2021-11-23T20:51:39+08:00	info	ingress/endpoint.go:83	endpoints controller exited
2021-11-23T20:51:39+08:00	info	ingress/apisix_tls.go:71	ApisixTls controller exited
2021-11-23T20:51:39+08:00	error	ingress/ingress.go:63	cache sync failed
2021-11-23T20:51:39+08:00	info	ingress/ingress.go:64	ingress controller exited
2021-11-23T20:51:39+08:00	info	ingress/apisix_upstream.go:71	ApisixUpstream controller exited
2021-11-23T20:51:39+08:00	info	ingress/secret.go:76	secret controller exited
2021-11-23T20:51:39+08:00	info	ingress/service.go:61	svc controller exited
2021-11-23T20:51:39+08:00	info	ingress/namespace.go:82	namespace controller exited
2021-11-23T20:51:39+08:00	info	ingress/pod.go:56	pod controller exited
2021-11-23T20:51:39+08:00	info	ingress/apisix_route.go:71	ApisixRoute controller exited
2021-11-23T20:51:39+08:00	info	ingress/apisix_consumer.go:67	ApisixConsumer controller exited
2021-11-23T20:51:39+08:00	info	ingress/apisix_cluster_config.go:69	ApisixClusterConfig controller exited
2021-11-23T20:51:39+08:00	info	ingress/controller.go:354	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
I1123 20:51:39.105127       1 leaderelection.go:243] attempting to acquire leader lease apisix/ingress-apisix-leader...
2021-11-23T20:51:39+08:00	info	ingress/controller.go:307	LeaderElection	{"message": "apisix-ingress-controller-7994d7bb49-z5hms became leader", "event_type": "Normal"}
I1123 20:51:39.111962       1 leaderelection.go:253] successfully acquired lease apisix/ingress-apisix-leader
2021-11-23T20:51:39+08:00	info	ingress/controller.go:387	controller tries to leading ...	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
2021-11-23T20:51:39+08:00	error	ingress/controller.go:414	failed to wait the default cluster to be ready: dial tcp 172.24.150.14:9180: connect: connection refused
E1123 20:51:39.112199       1 leaderelection.go:325] error retrieving resource lock apisix/ingress-apisix-leader: Get https://172.24.144.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-leader: context canceled
2021-11-23T20:51:39+08:00	info	ingress/controller.go:307	LeaderElection	{"message": "apisix-ingress-controller-7994d7bb49-z5hms stopped leading", "event_type": "Normal"}
I1123 20:51:39.112223       1 leaderelection.go:278] failed to renew lease apisix/ingress-apisix-leader: timed out waiting for the condition
2021-11-23T20:51:39+08:00	info	ingress/controller.go:354	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
I1123 20:51:39.112241       1 leaderelection.go:243] attempting to acquire leader lease apisix/ingress-apisix-leader...
2021-11-23T20:51:39+08:00	info	apisix/cluster.go:156	syncing cache	{"cluster": "default"}
2021-11-23T20:51:39+08:00	info	apisix/cluster.go:347	syncing schema	{"cluster": "default"}
2021-11-23T20:51:39+08:00	error	apisix/plugin.go:46	failed to list plugins' names: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:39+08:00	error	apisix/cluster.go:367	failed to list plugin names in APISIX: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:39+08:00	warn	apisix/cluster.go:330	failed to sync schema: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:39+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:39+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:39+08:00	info	ingress/controller.go:307	LeaderElection	{"message": "apisix-ingress-controller-7994d7bb49-z5hms became leader", "event_type": "Normal"}
I1123 20:51:39.118355       1 leaderelection.go:253] successfully acquired lease apisix/ingress-apisix-leader
2021-11-23T20:51:39+08:00	info	ingress/controller.go:387	controller tries to leading ...	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
2021-11-23T20:51:39+08:00	warn	apisix/cluster.go:307	waiting cluster default to ready, it may takes a while
2021-11-23T20:51:41+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:41+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
[GIN] 2021/11/23 - 20:51:42 | 200 |      42.863µs |    172.24.248.6 | GET      "/healthz"
2021-11-23T20:51:43+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:43+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:45+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:45+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
[GIN] 2021/11/23 - 20:51:46 | 200 |      41.607µs |    172.24.248.6 | GET      "/healthz"
2021-11-23T20:51:47+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:47+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:47+08:00	error	apisix/cluster.go:166	failed to sync cache	{"cost_time": "8.001080415s", "cluster": "default"}
2021-11-23T20:51:47+08:00	error	ingress/controller.go:414	failed to wait the default cluster to be ready: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:47+08:00	info	ingress/controller.go:354	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
I1123 20:51:47.113474       1 leaderelection.go:243] attempting to acquire leader lease apisix/ingress-apisix-leader...
2021-11-23T20:51:47+08:00	info	apisix/cluster.go:347	syncing schema	{"cluster": "default"}
2021-11-23T20:51:47+08:00	error	apisix/plugin.go:46	failed to list plugins' names: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:47+08:00	error	apisix/cluster.go:367	failed to list plugin names in APISIX: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:47+08:00	info	apisix/cluster.go:156	syncing cache	{"cluster": "default"}
2021-11-23T20:51:47+08:00	error	apisix/route.go:119	failed to list routes: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:47+08:00	error	apisix/cluster.go:200	failed to list route in APISIX: Get http://172.24.150.14:9180/apisix/admin/routes: context canceled
2021-11-23T20:51:47+08:00	warn	apisix/cluster.go:330	failed to sync schema: Get http://172.24.150.14:9180/apisix/admin/plugins/list: context canceled
2021-11-23T20:51:47+08:00	info	ingress/controller.go:307	LeaderElection	{"message": "apisix-ingress-controller-7994d7bb49-z5hms became leader", "event_type": "Normal"}
I1123 20:51:47.119539       1 leaderelection.go:253] successfully acquired lease apisix/ingress-apisix-leader
2021-11-23T20:51:47+08:00	info	ingress/controller.go:387	controller tries to leading ...	{"namespace": "apisix", "pod": "apisix-ingress-controller-7994d7bb49-z5hms"}
2021-11-23T20:51:47+08:00	warn	apisix/cluster.go:307	waiting cluster default to ready, it may takes a while

config-map

data:
  config.yaml: |
    # log options
    log_level: "info"
    log_output: "stderr"
    http_listen: ":8080"
    enable_profiling: true
    kubernetes:
      kubeconfig: ""
      resync_interval: "6h"
      app_namespaces:
      - "*"
      ingress_class: "apisix"
      ingress_version: "networking/v1"
      apisix_route_version: "apisix.apache.org/v2beta1"
    apisix:
      default_cluster_base_url: "http://172.24.150.14:9180/apisix/admin"
      default_cluster_admin_key: "edd1c9f034335f136f87ad84b625c8f1"
      default_cluster_name: ""

Environment

  • your apisix-ingress-controller version (output of apisix-ingress-controller version --long): 1.3.0
  • your Kubernetes cluster version (output of kubectl version): Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"89182bdd065fbcaffefec691908a739d161efc03", GitTreeState:"clean", BuildDate:"2020-12-18T12:02:35Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • if you run apisix-ingress-controller in Bare-metal environment, also show your OS version (uname -a).

@tao12345666333
Member

I will try to reproduce.

@tao12345666333
Member

Thanks!

@tao12345666333 tao12345666333 added bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Nov 24, 2021
@tao12345666333 tao12345666333 changed the title from "request help: after the controller starts, once the single apisix pod is deleted the controller can no longer connect, even after the pod recovers" to "bug: Unable to reconnect to apisix, when all ep are deleted under svc of apisix" Nov 24, 2021
@tao12345666333
Member

cc @gxthrj PTAL

@gxthrj gxthrj self-assigned this Nov 24, 2021
@tokers
Contributor

tokers commented Nov 25, 2021

I can give you some clues about this issue.

When the controller gets the chance to become the leader again, it tries to add the cluster (named default) once more. Because that cluster already exists (it was added during the controller's previous term), the new one is not added and the controller keeps using the old cluster. In the old cluster, cacheSyncErr was cached and is returned directly when HasSynced is called, so the controller never enters the state in which it watches Kubernetes resources.

A simple solution is to destroy the old cluster when the controller gives up the leader role.

@chzhuo
Contributor

chzhuo commented Nov 30, 2021

I ran into the same situation, and I also hit the problem that the leader never switches, because the failed leader gives up the lock and then quickly re-acquires it.
The leader has not switched over in the last two days.

@Zhang21

Zhang21 commented Dec 9, 2021

I get the same error when the APISIX ingress controller accesses the APISIX Admin API.

2021-12-09T16:37:00+08:00	info	ingress/controller.go:290	LeaderElection	{"message": "apisix-ingress-controller-769ddc5457-68d2z became leader", "event_type": "Normal"}
I1209 16:37:00.635044       1 leaderelection.go:253] successfully acquired lease ingress-apisix/ingress-apisix-leader
2021-12-09T16:37:00+08:00	info	ingress/controller.go:370	controller tries to leading ...	{"namespace": "ingress-apisix", "pod": "apisix-ingress-controller-769ddc5457-68d2z"}
2021-12-09T16:37:00+08:00	warn	apisix/cluster.go:304	waiting cluster default to ready, it may takes a while
2021-12-09T16:37:02+08:00	error	apisix/route.go:117	failed to list routes: Get http://apisix-admin.ingress-apisix.svc.cluster.local:9180/apisix/admin/routes: context canceled
2021-12-09T16:37:02+08:00	error	apisix/cluster.go:197	failed to list route in APISIX: Get http://apisix-admin.ingress-apisix.svc.cluster.local:9180/apisix/admin/routes: context canceled
[GIN] 2021/12/09 - 16:37:03 | 200 |     162.264µs |    172.16.2.248 | GET      "/healthz"

I can access apisix admin api by curl:

curl http://apisix-admin.ingress-apisix.svc.cluster.local:9180/apisix/admin/routes -H 'X-API-KEY: xxxxx'
{ result xxxxxx }

@stone2world

[3 screenshots]
i get it

@tao12345666333
Member

#774 has been merged. It will be released in v1.4 (next week)

If you want to try it now, you can build the Docker image yourself and use that.

I will close this one.


7 participants