
Bug: e2e autoscaler error #1065
Open
cloudziu opened this issue Oct 17, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@cloudziu
Contributor

cloudziu commented Oct 17, 2023

While running the e2e tests multiple times, I've encountered an error in the autoscaler cluster scenario. The cluster didn't trigger the scale down.

"error while performing additional test for manifest 1.yaml from autoscaling-1 : some cluster/s in config claudie-c6a1529-2396-autoscaling-1 have not been scaled down, when they should have"
cloudziu added the bug (Something isn't working) label on Oct 17, 2023
@Despire
Contributor

Despire commented Dec 6, 2023

I'm not sure how exactly to reproduce this. I've seen this error a few times in the e2e pipeline, but a rerun would fix it. Now I'm not sure if the timeout for the scale-down is too low for the testing framework, or if it takes k8s longer to emit the scale-down event.
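
For context, a minimal sketch of how such a check can fail spuriously (this is not the actual testing-framework code; `pollScaleDown` and its parameters are hypothetical names): if the polling deadline is shorter than the autoscaler's own scale-down delays (`--scale-down-unneeded-time` and `--scale-down-delay-after-add` both default to 10 minutes), the check reports "not scaled down" even though the autoscaler is healthy and would scale down eventually.

```go
// Minimal sketch, not the actual testing-framework code; names are hypothetical.
package e2e

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pollScaleDown waits until the cluster has at most `want` nodes or the timeout
// expires. If `timeout` is shorter than the autoscaler's --scale-down-unneeded-time
// plus --scale-down-delay-after-add, this returns an error even though the
// autoscaler is working correctly and just hasn't fired the scale-down yet.
func pollScaleDown(ctx context.Context, c kubernetes.Interface, want int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil && len(nodes.Items) <= want {
			return nil // scale-down observed within the deadline
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("cluster has not been scaled down to %d nodes within %s", want, timeout)
		case <-time.After(30 * time.Second): // poll interval
		}
	}
}
```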

@cloudziu
Contributor Author

cloudziu commented Dec 6, 2023

I haven't encountered this since then. Maybe we should close this and reopen it when it happens again?

@Despire
Contributor

Despire commented Dec 6, 2023

> I haven't encountered this since then. Maybe we should close this and reopen it when it happens again?

We can leave this open; I know this happens from time to time.

@JKBGIT1
Contributor

JKBGIT1 commented Apr 17, 2024

I encountered this while running the testing-framework locally on a kind cluster. In the run where this error occurred, I experienced an unexpected disconnect from the internet for ~3-5 minutes. At first I thought that was the reason behind this error, but I couldn't replicate it when I disconnected and reconnected to the internet on purpose...

Anyway, here is some info and the logs from the cluster-autoscaler container. Hope it helps to move this further.
The cluster-autoscaler container restarted 8 times with exit status code 255. Below are the logs from the container. The reason behind the restarts was lost master.

I0416 06:53:01.638646   	1 leaderelection.go:248] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0416 06:53:02.101492   	1 leaderelection.go:258] successfully acquired lease kube-system/cluster-autoscaler
I0416 06:53:03.666013   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:53:03.668204   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 2.170829ms
W0416 06:53:13.670587   	1 clusterstate.go:428] AcceptableRanges have not been populated yet. Skip checking
I0416 06:55:03.668908   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:55:03.669375   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 449.865µs
I0416 06:57:03.670147   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:57:03.670562   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 398.98µs
I0416 06:59:03.671334   	1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0416 06:59:03.671891   	1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 536.648µs
E0416 07:00:54.703366   	1 leaderelection.go:330] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://34.22.242.66:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0416 07:00:54.703394   	1 leaderelection.go:283] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0416 07:01:01.800514   	1 main.go:578] lost master

The lost master problem is discussed in kubernetes/autoscaler#1653. Some people fix it by increasing the resource limits, and some use --leader-elect=false...

Next time this error occurs in the e2e pipeline, let's check whether the autoscaler pod got restarted a couple of times and what the reason behind those restarts was.
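
Something along these lines could be run against the target cluster before the infra is torn down, to dump restart counts and last-termination reasons for the autoscaler pods (just a hedged client-go sketch; the kubeconfig path and the `app=cluster-autoscaler` label selector are assumptions, not necessarily what the deployed manifest uses):

```go
// Hedged sketch: kubeconfig path and label selector are assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig of the cluster running the autoscaler.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/cluster-kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Assumption: the autoscaler pods carry this label.
	pods, err := client.CoreV1().Pods("kube-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app=cluster-autoscaler"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, c := range p.Status.ContainerStatuses {
			fmt.Printf("%s/%s restarts=%d", p.Name, c.Name, c.RestartCount)
			// Print the exit code and reason of the last termination, if any
			// (e.g. exit code 255 after "lost master").
			if t := c.LastTerminationState.Terminated; t != nil {
				fmt.Printf(" lastExit=%d reason=%q", t.ExitCode, t.Reason)
			}
			fmt.Println()
		}
	}
}
```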

EDIT: I just realized that we can't get any info from the autoscaler pod, because the infra (with the autoscaler pod) is deleted. In my case it was still there because I forgot to specify AUTO_CLEAN_UP.
