
Bug: e2e autoscaler error #1065
Open
cloudziu opened this issue Oct 17, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@cloudziu
Contributor

cloudziu commented Oct 17, 2023

While running the e2e tests multiple times, I've encountered an error in the autoscaler cluster scenario. The cluster didn't trigger the scale down.

"error while performing additional test for manifest 1.yaml from autoscaling-1 : some cluster/s in config claudie-c6a1529-2396-autoscaling-1 have not been scaled down, when they should have"
cloudziu added the bug (Something isn't working) label on Oct 17, 2023
@Despire
Contributor

Despire commented Dec 6, 2023

I'm not sure how exactly to reproduce this. I've seen this error a few times in the e2e pipeline, but a rerun would fix it. Now I'm not sure if the timeout for the scale-down is too low for the testing framework, or if it takes k8s longer to emit the scale-down event.
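
For context, a minimal sketch of how such a check can fail spuriously (this is not the actual testing-framework code; `pollScaleDown` and its parameters are hypothetical names): if the polling deadline is shorter than the autoscaler's own scale-down delays (`--scale-down-unneeded-time` and `--scale-down-delay-after-add` both default to 10 minutes), the check reports "not scaled down" even though the autoscaler is healthy and would scale down eventually.

```go
// Minimal sketch, not the actual testing-framework code; names are hypothetical.
package e2e

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pollScaleDown waits until the cluster has at most `want` nodes or the timeout
// expires. If `timeout` is shorter than the autoscaler's --scale-down-unneeded-time
// plus --scale-down-delay-after-add, this returns an error even though the
// autoscaler is working correctly and just hasn't fired the scale-down yet.
func pollScaleDown(ctx context.Context, c kubernetes.Interface, want int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil && len(nodes.Items) <= want {
			return nil // scale-down observed within the deadline
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("cluster has not been scaled down to %d nodes within %s", want, timeout)
		case <-time.After(30 * time.Second): // poll interval
		}
	}
}
```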

@cloudziu
Contributor Author

cloudziu commented Dec 6, 2023

I haven't encountered this since then. Maybe we should close this and reopen it when it happens again?

@Despire
Contributor

Despire commented Dec 6, 2023

> I haven't encountered this since then. Maybe we should close this and reopen it when it happens again?

We can leave this open; I know this happens from time to time.

@JKBGIT1
Contributor

JKBGIT1 commented Apr 17, 2024

I encountered this while running the testing-framework locally on a kind cluster. In the run where this error occurred, I experienced an unexpected disconnect from the internet for ~3-5 minutes. At first I thought that was the reason behind this error, but I couldn't replicate it when I disconnected and reconnected to the internet on purpose...

Anyway, here is some info and the logs from the cluster-autoscaler container. Hope it helps to move this further.
The cluster-autoscaler container restarted 8 times with exit status code 255. Below are the logs from the container. The reason behind the restarts was lost master.

I0416 06:53:01.638646   	1 leaderelection.go:248] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0416 06:53:02.101492   	1 leaderelection.go:258] successfully acquired lease kube-system/cluster-autoscaler
I0416 06:53:03.666013   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:53:03.668204   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 2.170829ms
W0416 06:53:13.670587   	1 clusterstate.go:428] AcceptableRanges have not been populated yet. Skip checking
I0416 06:55:03.668908   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:55:03.669375   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 449.865µs
I0416 06:57:03.670147   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:57:03.670562   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 398.98µs
I0416 06:59:03.671334   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:59:03.671891   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 536.648µs
E0416 07:00:54.703366   	1 leaderelection.go:330] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://34.22.242.66:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0416 07:00:54.703394   	1 leaderelection.go:283] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0416 07:01:01.800514   	1 main.go:578] lost master

The lost master problem is discussed in kubernetes/autoscaler#1653. Some people fix it by increasing the resource limits, and some use --leader-elect=false...

Next time this error occurs in the e2e pipeline, let's check whether the autoscaler pod got restarted a couple of times and what the reason behind those restarts was.
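
Something along these lines could be run against the target cluster before the infra is torn down, to dump restart counts and last-termination reasons for the autoscaler pods (just a hedged client-go sketch; the kubeconfig path and the `app=cluster-autoscaler` label selector are assumptions, not necessarily what the deployed manifest uses):

```go
// Hedged sketch: kubeconfig path and label selector are assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig of the cluster running the autoscaler.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/cluster-kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Assumption: the autoscaler pods carry this label.
	pods, err := client.CoreV1().Pods("kube-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app=cluster-autoscaler"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, c := range p.Status.ContainerStatuses {
			fmt.Printf("%s/%s restarts=%d", p.Name, c.Name, c.RestartCount)
			// Print the exit code and reason of the last termination, if any
			// (e.g. exit code 255 after "lost master").
			if t := c.LastTerminationState.Terminated; t != nil {
				fmt.Printf(" lastExit=%d reason=%q", t.ExitCode, t.Reason)
			}
			fmt.Println()
		}
	}
}
```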

EDIT: I just realized that we can't get any info from the autoscaler pod, because the infra (with the autoscaler pod) is deleted. In my case it was still there because I forgot to specify AUTO_CLEAN_UP.
