
provider status endpoint: hurricane provider reports excessively large amount of available CPUs #232

Open
andy108369 opened this issue Jun 27, 2024 · 1 comment
Labels: awaiting-triage, repo/provider (Akash provider-services repo issues)

Comments

andy108369 (Contributor) commented:

The hurricane provider reports an excessively large number of available CPUs:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 405.635368
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709504/-18446744073709404"  "1/1/0"       "196.45/57.48/138.97"  "1808.76/1443.1/365.67"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

The provider_info2.sh script: https://github.com/arno01/akash-tools/blob/main/provider_info2.sh
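
The magnitude of the bogus figure is itself diagnostic: 18446744073709504 reported cores is exactly (2^64 − 47,616) millicores divided by 1000, i.e. the "allocated" sum appears to exceed the node's allocatable CPU by about 47.6 cores, and the subtraction wraps around in unsigned 64-bit arithmetic instead of going negative. A minimal Go sketch of that failure mode, with illustrative values chosen to reproduce the figure above (this is not the operator's actual code, and the source of the over-count is an assumption):

package main

import "fmt"

func main() {
	// Node CPU figures in millicores, as Kubernetes reports them.
	allocatable := uint64(102_000) // 102 cores on worker-01.hurricane2
	// Requests summed across pods; stuck/failed pods could plausibly
	// push this past allocatable (hypothetical over-count of ~47.6 cores).
	allocated := uint64(102_000 + 47_616)

	// Unsigned subtraction wraps around instead of going negative.
	available := allocatable - allocated
	fmt.Println(available)        // 18446744073709504000 millicores (just under 2^64)
	fmt.Println(available / 1000) // 18446744073709504 "cores", matching the report
}

The second snapshot below (18446744073709490 cores, i.e. ~61.6 cores over-counted) is consistent with the same wraparound, with the over-count drifting as workload requests change.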

Versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                          IMAGE
akash-node-1-0                                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-hmmxd                            ghcr.io/akash-network/provider:0.6.2
operator-inventory-6fdf575d44-rnfj4                           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-control-01.hurricane2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-worker-01.hurricane2    ghcr.io/akash-network/provider:0.6.2
operator-ip-d9d6df8cd-t9zw9                                   ghcr.io/akash-network/provider:0.6.2

Logs

I've tried restarting the operator-inventory, which previously used to "fix" this issue, but to no avail this time:

$ kubectl -n akash-services rollout restart deployment/operator-inventory
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-06-27T15:25:29.979755238Z I[2024-06-27|15:25:29.979] using in cluster kube config                 cmp=provider
2024-06-27T15:25:30.993714193Z INFO	rook-ceph	   ADDED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:31.022718552Z INFO	rest listening on ":8080"
2024-06-27T15:25:31.022730122Z INFO	nodes.nodes	waiting for nodes to finish
2024-06-27T15:25:31.022777911Z INFO	grpc listening on ":8081"
2024-06-27T15:25:31.022824901Z INFO	watcher.storageclasses	started
2024-06-27T15:25:31.022976338Z INFO	watcher.config	started
2024-06-27T15:25:31.027880682Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-06-27T15:25:31.029378292Z INFO	nodes.node.monitor	starting	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.029383612Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029386222Z INFO	nodes.node.monitor	starting	{"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029390481Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.063512161Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-06-27T15:25:31.066705538Z W0627 15:25:31.066598       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:31.066875795Z W0627 15:25:31.066601       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087372741Z W0627 15:25:32.087218       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087522389Z W0627 15:25:32.087456       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.093759624Z W0627 15:25:33.093649       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.096327860Z W0627 15:25:33.096250       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:35.614448848Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:35.664476772Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "control-01.hurricane2"}
2024-06-27T15:25:35.780999348Z INFO	nodes.node.monitor	started	{"node": "control-01.hurricane2"}
2024-06-27T15:25:36.239976215Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "worker-01.hurricane2"}
2024-06-27T15:25:36.454713184Z INFO	nodes.node.monitor	started	{"node": "worker-01.hurricane2"}
2024-06-27T15:26:36.900875467Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:27:38.206330676Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:28:39.486188220Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:29:40.787165193Z INFO	rook-ceph	MODIFIED monitoring CephCluster	{"ns": "rook-ceph", "name": "rook-ceph"}
andy108369 added the repo/provider and awaiting-triage labels on Jun 27, 2024
andy108369 (Contributor, Author) commented:

The CPU value returned to normal, without even needing to restart the operator-inventory again, after deleting the pods stuck in the "ContainerStatusUnknown" state:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709490/-18446744073709384"  "1/1/0"       "196.45/49.67/146.78"  "1808.76/1435.28/373.48"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-4cq86   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-5ddrg   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-7nl6p   0/1     ContainerStatusUnknown   1          5d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-9jsn7   0/1     ContainerStatusUnknown   1          19d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-bnjfh   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-d2nfr   0/1     ContainerStatusUnknown   1          7d12h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-dk95v   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-fgfl4   0/1     ContainerStatusUnknown   1          7d19h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gh9bb   0/1     ContainerStatusUnknown   1          16d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gltgh   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-j9tnr   0/1     ContainerStatusUnknown   1          15d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-mmqfk   0/1     ContainerStatusUnknown   1          6d5h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-ph89h   0/1     ContainerStatusUnknown   1          11d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pjrg4   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pwbzv   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-rd7z5   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-t6vt9   0/1     ContainerStatusUnknown   1          6d15h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-vht5l   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-wd8w4   0/1     ContainerStatusUnknown   1          7d23h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-xnsvt   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zmzbf   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zw2st   0/1     ContainerStatusUnknown   1          10d

arno@x1:~$ ns=2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu
arno@x1:~$ kubectl -n $ns get deployment
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web    1/1     1            1           116d
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-57478ff56c   0         0         0       4d17h
web-5df9f7c798   1         1         1       4d17h
web-7f5fdfd87c   0         0         0       53d
web-85fc6b7694   0         0         0       54d
web-85ff75fdc5   0         0         0       70d
arno@x1:~$ kubectl -n $ns delete rs web-85ff75fdc5
replicaset.apps "web-85ff75fdc5" deleted
arno@x1:~$ kubectl -n $ns delete rs web-85fc6b7694
replicaset.apps "web-85fc6b7694" deleted
arno@x1:~$ kubectl -n $ns delete rs web-7f5fdfd87c
replicaset.apps "web-7f5fdfd87c" deleted
arno@x1:~$ kubectl -n $ns delete rs web-57478ff56c
replicaset.apps "web-57478ff56c" deleted
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-5df9f7c798   1         1         1       4d17h
arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-5df9f7c798-bbgqz" deleted
pod "web-5df9f7c798-f2fpj" deleted
pod "web-5df9f7c798-g7xbd" deleted
pod "web-5df9f7c798-hv4qs" deleted
pod "web-5df9f7c798-p4h7j" deleted
pod "web-5df9f7c798-rcr45" deleted
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"          "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/47.995/54.005"  "1/1/0"       "196.45/104.36/92.09"  "1808.76/1489.97/318.79"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

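Presumably the durable fix is on the operator side: skip pods in a terminal phase when summing requests into a node's allocated total, since Kubernetes itself no longer charges Failed or Succeeded pods against the node. A hedged sketch of that accounting rule (illustrative only; not the actual operator-inventory code, and the function name is made up):

package inventory

import (
	corev1 "k8s.io/api/core/v1"
)

// allocatedCPUMillis sums CPU requests across pods, skipping pods in a
// terminal phase. Counting Failed/Succeeded pods over-counts "allocated"
// and can push it past "allocatable" — which, in unsigned arithmetic,
// wraps around to the huge "available" figure seen above.
func allocatedCPUMillis(pods []corev1.Pod) int64 {
	var total int64
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded {
			continue // e.g. the ContainerStatusUnknown pods above have status.phase=Failed
		}
		for _, c := range pod.Spec.Containers {
			if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				total += cpu.MilliValue()
			}
		}
	}
	return total
}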