Installation of node-exporter often fails #483

Closed
torumakabe opened this issue Apr 28, 2023 · 15 comments

Comments

torumakabe commented Apr 28, 2023

Installing node-exporter under the following conditions often fails:

  • enabled with Terraform
    • azurerm provider 3.53.0
    • using the monitor_metrics argument of the azurerm_kubernetes_cluster resource
  • Kubernetes 1.26.3
  • Azure CNI Overlay network plugin
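
For reference, a CLI-level sketch of the same enablement (the resource group and cluster names are placeholders; the actual setup used the Terraform monitor_metrics argument):

# sketch: enable managed Prometheus metrics, which deploys the ama-metrics agents, on an existing AKS cluster
az aks update --enable-azure-monitor-metrics -g my-group -n my-cluster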

Success and failure are roughly 50-50. When it fails, node-exporter is not installed on the nodes, as shown below.

root [ / ]# ps -ef | grep exp
root        8433    5186  0 05:55 ?        00:00:00 grep --color=auto exp

root [ / ]# ls -asl /usr/local/bin/
total 344336
     4 drwxr-xr-x 2 root root        4096 Apr 28 05:48 .
     4 drwxr-xr-x 7 root root        4096 Mar 21 10:44 ..
 34556 -rwxr-x--- 1 root root    35384960 Apr 10 16:04 bpftrace
     4 -rwxr-xr-x 1 root root         705 Apr 10 16:02 ci-syslog-watcher.sh
 51008 -rw-r----- 1 root root    52232184 Apr 10 16:03 containerd-shim-slight-v1
 44276 -rw-r----- 1 root root    45334640 Apr 10 16:03 containerd-shim-spin-v1
 49136 -rwxr-xr-x 1 1001 docker  50311268 Aug 26  2022 crictl
     4 -r-xr--r-- 1 root root        2462 Apr 10 16:02 health-monitor.sh
 46912 -rwxr-xr-x 1 root root    48037888 Mar 21 00:57 kubectl
118432 -rwxr-xr-x 1 root root   121272408 Mar 21 00:57 kubelet

Additionally, when Cilium was enabled as well, every installation failed. I mention the network plugin only for reference; it is unclear whether it has any impact.

Note that all ama-metrics-node-* DaemonSet pods are running, and metrics can be collected from kubelet and cAdvisor.
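
A quick way to double-check that (a sketch; ama-metrics-node is the managed add-on's collector DaemonSet in kube-system):

# sketch: confirm the collector DaemonSet has a pod running on every node
kubectl get daemonset ama-metrics-node -n kube-system
kubectl get pods -n kube-system -o wide | grep ama-metrics-node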

Can you think of any probable causes? Any advice would be appreciated.

vishiy commented Apr 28, 2023

@torumakabe - thanks for filing the issue. node-exporter is installed through the VM images on AKS nodes (not through the Prometheus collector); the collector just scrapes it. Can you please tell us the VM image version? I am also assuming these are AKS nodes?

torumakabe commented

@vishiy Thank you for your comment. All nodes are in AKS. The node image is "AKSCBLMariner-V2gen2-202304.10.0".
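
The image version can be read per node pool with something like the sketch below (group, cluster, and pool names are placeholders):

# sketch: show the node image version for a given node pool
az aks nodepool show -g my-group --cluster-name my-cluster -n nodepool1 --query nodeImageVersion -o tsv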

github-actions bot commented May 5, 2023

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

vishiy commented May 5, 2023

@torumakabe apologies for the delay. I am following up with the AKS folks on this.

vishiy commented May 5, 2023

@torumakabe - can you please confirm that the corresponding node pools in the cluster are not in a failed provisioning state? Also, is this the case on all nodes or just a few? Thanks for your help.

torumakabe commented

@vishiy I have tried with the latest image (AKSCBLMariner-V2gen2-202304.20.0) several times; it failed on three of five attempts. Still flaky.

Despite the node-exporter deployment failures, provisioning succeeded for all node pools.

az aks show -g my-group -n my-cluster -o json | grep provisioningState
WARNING: The behavior of this command has been altered by the following extension: aks-preview
      "provisioningState": "Succeeded",
      "provisioningState": "Succeeded",
      "provisioningState": "Succeeded",
      "provisioningState": "Succeeded",
  "provisioningState": "Succeeded",

When the node-exporter deployment fails, it fails on all nodes; there is no partial success.

Below are the /usr/local/bin listings for a node where installation succeeded and one where it failed.

Successful:

root [ / ]# ls -asl /usr/local/bin/
total 508780
     4 drwxr-xr-x 2 root root        4096 May  8 04:59 .
     4 drwxr-xr-x 7 root root        4096 Apr  7 16:30 ..
 34556 -rwxr-x--- 1 root root    35384960 Apr 20 15:21 bpftrace
     4 -rwxr-xr-x 1 root root         705 Apr 20 15:19 ci-syslog-watcher.sh
 46508 -rwxr-xr-x 1 root root    47622592 Apr 20 15:20 containerd-shim-slight-v0-3-0-v1
 51008 -rwxr-xr-x 1 root root    52232184 Apr 20 15:20 containerd-shim-slight-v0-5-1-v1
 35172 -rwxr-xr-x 1 root root    36014944 Apr 20 15:20 containerd-shim-spin-v0-3-0-v1
 44276 -rwxr-xr-x 1 root root    45334640 Apr 20 15:20 containerd-shim-spin-v0-5-1-v1
 49136 -rwxr-xr-x 1 1001 docker  50311268 Aug 26  2022 crictl
     4 -r-xr--r-- 1 root root        2462 Apr 20 15:19 health-monitor.sh
 46912 -rwxr-xr-x 1 root root    48037888 Mar 21 00:57 kubectl
118432 -rwxr-xr-x 1 root root   121272408 Mar 21 00:57 kubelet
 64948 -rwxr-xr-x 1 root root    66504080 Nov  9 11:05 local-gadget
     0 lrwxrwxrwx 1 root root          20 Mar 29 23:32 log-counter -> /usr/bin/log-counter
 17804 -rwxr-xr-x 1 root root    18231039 Feb  8  2022 node-exporter
     4 -rwxr-xr-x 1 root root         834 Mar 29 23:31 node-exporter-startup.sh
     0 lrwxrwxrwx 1 root root          30 Mar 29 23:32 node-problem-detector -> /usr/bin/node-problem-detector
     8 -rwxr-xr-x 1 root root        4601 Mar 29 23:32 node-problem-detector-startup.sh

Failed:

root [ / ]# ls -asl /usr/local/bin/
total 426016
     4 drwxr-xr-x 2 root root        4096 May  8 06:06 .
     4 drwxr-xr-x 7 root root        4096 Apr  7 16:30 ..
 34556 -rwxr-x--- 1 root root    35384960 Apr 20 15:21 bpftrace
     4 -rwxr-xr-x 1 root root         705 Apr 20 15:19 ci-syslog-watcher.sh
 46508 -rwxr-xr-x 1 root root    47622592 Apr 20 15:20 containerd-shim-slight-v0-3-0-v1
 51008 -rwxr-xr-x 1 root root    52232184 Apr 20 15:20 containerd-shim-slight-v0-5-1-v1
 35172 -rwxr-xr-x 1 root root    36014944 Apr 20 15:20 containerd-shim-spin-v0-3-0-v1
 44276 -rwxr-xr-x 1 root root    45334640 Apr 20 15:20 containerd-shim-spin-v0-5-1-v1
 49136 -rwxr-xr-x 1 1001 docker  50311268 Aug 26  2022 crictl
     4 -r-xr--r-- 1 root root        2462 Apr 20 15:19 health-monitor.sh
 46912 -rwxr-xr-x 1 root root    48037888 Mar 21 00:57 kubectl
118432 -rwxr-xr-x 1 root root   121272408 Mar 21 00:57 kubelet

Can you think of any possible cause? Thanks.

github-actions bot commented

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

vishiy commented May 15, 2023

@torumakabe - Could you please send the log files below and also share the AKS cluster ID?

github-actions bot commented

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot commented

This issue was closed because it has been stalled for 12 days with no activity.

github-actions bot closed this as not planned (stale) May 29, 2023
vishiy reopened this May 29, 2023
github-actions bot commented Jun 7, 2023

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot commented

This issue was closed because it has been stalled for 12 days with no activity.

github-actions bot closed this as not planned (stale) Jun 12, 2023
vishiy reopened this Sep 21, 2023
vishiy commented Sep 21, 2023

@torumakabe - is this issue resolved now? I remember you were following up with AKS.

torumakabe commented Sep 24, 2023

@vishiy Thanks for checking in.

I talked to the AKS team and found out why node-exporter is not installed: node-exporter is installed by AKS-Operator, which attempts the installation several times during cluster creation, but at low priority. If a higher-priority task is taking a long time, the installation is retried only after a substantial waiting period.

That wait can be as long as 24 hours, and I have confirmed that node-exporter does get installed if I wait. I would prefer a shorter retry interval, but for now I am satisfied to have found the cause of the problem.
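
For anyone hitting the same symptom, a crude one-off check for whether the binary has landed yet (a sketch; the node name is a placeholder, and kubectl debug mounts the node's root filesystem under /host - each run leaves a node-debugger pod behind that should be deleted afterwards):

# sketch: check whether AKS-Operator has installed node-exporter on a node yet
kubectl debug node/aks-nodepool1-00000000-vmss000000 -it --image=busybox -- ls -l /host/usr/local/bin/node-exporter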

vishiy commented Sep 25, 2023

OK, thank you. I will close this issue.

vishiy closed this as completed Sep 25, 2023