A pure GitOps observability platform for NVIDIA GPU workloads on MicroK8s.
- Deploys and configures GPU telemetry (DCGM exporter).
- Wires metrics into Prometheus and dashboards into Grafana.
- Adds log aggregation via Loki/Promtail for correlated troubleshooting.
It provides production-style visibility for AI/GPU infrastructure, enabling faster diagnosis, capacity planning, and operational reliability.
Pure GitOps GPU observability stack for single-node MicroK8s using Argo CD app-of-apps.
```
NVIDIA GPU Operator
└─ DCGM Exporter --metrics--> Prometheus (kube-prometheus-stack) --> Grafana dashboards

Kubernetes Pod Logs --> Promtail --> Loki --> Grafana Explore / Logs panels
```
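As a concrete example of the metrics path above, once the dcgm-exporter target is being scraped you can chart GPU telemetry with PromQL queries like the following (these are the standard DCGM exporter metric names; exact labels depend on your scrape config):

```promql
# Per-GPU utilization (%), as exported by dcgm-exporter
DCGM_FI_DEV_GPU_UTIL

# Average utilization across all GPUs
avg(DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory used (MiB)
DCGM_FI_DEV_FB_USED
```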
```
kustomization.yaml   # Argo CD repo entrypoint (root kustomize)
apps.yaml            # Argo CD root Application (points at repo root)
apps/
  gpu-operator/
    application.yaml
  monitoring/
    application.yaml
    extras/
      kustomization.yaml
      dashboards/
  logging/
    application.yaml
Makefile
```
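The app-of-apps entrypoint in `apps.yaml` looks roughly like the sketch below (the repo URL is a placeholder; the application name matches the one used in the sync commands later in this README):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-observability-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gpu-observability.git  # placeholder
    targetRevision: main
    path: .   # root kustomization.yaml renders the child Applications in apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```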
- NVIDIA driver installed on the host and `nvidia-smi` works.
- MicroK8s ingress controller enabled and using ingress class `public`.
- A default storage class is available for Prometheus/Grafana/Loki PVCs.
- Argo CD installed in the cluster.
- Apply the Argo CD root application:

  ```
  kubectl apply -f apps.yaml
  ```

- Sync apps:

  ```
  kubectl -n argocd get applications
  kubectl -n argocd annotate application gpu-observability-root argocd.argoproj.io/refresh=hard --overwrite
  ```

- Check the expected URLs:
  - https://grafana.172.17.93.185.nip.io
  - https://prometheus.172.17.93.185.nip.io
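The nip.io hostnames resolve to the node IP, and each UI is exposed through the MicroK8s ingress controller. A minimal Ingress for Grafana would look roughly like this (illustrative only; the actual objects are generated from the Helm charts' ingress values, and the backend service name/port may differ):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  ingressClassName: public   # matches the MicroK8s ingress class prerequisite
  rules:
    - host: grafana.172.17.93.185.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana   # illustrative; chart-generated name may differ
                port:
                  number: 80
```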
```
kubectl get pods -n gpu-operator
kubectl get pods -n monitoring
kubectl get pods -n logging
```

- Open the Prometheus Targets page at https://prometheus.172.17.93.185.nip.io/targets.
- Verify the `dcgm-exporter` target is `UP`.
- Open Grafana at https://grafana.172.17.93.185.nip.io.
- Confirm the GPU DCGM Overview dashboard shows utilization, memory, temperature, and power panels.
- Confirm the Node Basic dashboard shows CPU/memory/disk panels.
Run a test pod:

```
kubectl run log-demo --image=busybox -n default -- /bin/sh -c 'while true; do echo hello-loki; sleep 5; done'
```

Then in Grafana Explore:

- Select the Loki datasource.
- Query `{namespace="default", pod="log-demo"}` and verify logs appear.
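Beyond the basic selector, a couple of LogQL variations that are useful for this check (standard Loki query syntax; the `namespace` and `pod` labels are those Promtail attaches by default for Kubernetes pods):

```logql
# Only lines containing the test string
{namespace="default", pod="log-demo"} |= "hello-loki"

# Log throughput per pod over the last 5 minutes
sum by (pod) (rate({namespace="default"}[5m]))
```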
If `spec.sources[].helm.values` in Git is correct but Argo CD keeps retrying a stale operation from `operation.sync.sources[].helm.values`, run:

```
# 1) Confirm the app points at the expected revision and inspect the current operation block
kubectl -n argocd get application monitoring -o yaml | sed -n '/^spec:/,/^status:/p'
kubectl -n argocd get application monitoring -o yaml | sed -n '/^operation:/,/^status:/p'

# 2) Terminate the currently running operation (if present)
argocd app terminate-op monitoring

# 3) Force a hard refresh from Git and clear cached manifests
argocd app get monitoring --hard-refresh
kubectl -n argocd annotate application monitoring argocd.argoproj.io/refresh=hard --overwrite

# 4) Start a new sync from the latest Git revision
argocd app sync monitoring --prune --retry-limit 1

# 5) Verify operation values no longer contain placeholders and hostnames are current
kubectl -n argocd get application monitoring -o jsonpath='{.operation.sync.sources[*].helm.values}'
echo
```

If the operation block still references stale values, clear it by deleting and recreating the Application from Git:

```
kubectl -n argocd delete application monitoring
kubectl apply -f apps/monitoring/application.yaml
argocd app sync monitoring --prune
```

On WSL2 MicroK8s, the NVIDIA GPU Operator may skip dcgm-exporter when its GPU-node auto-detection gates are not satisfied.
This repository uses the MicroK8s `nvidia` addon for the runtime, toolkit, and device plugin on the node, and deploys only `nvidia-dcgm-exporter` from GitOps (`apps/gpu-operator/extras`) with `runtimeClassName: nvidia`, a dedicated Service, and a ServiceMonitor scraped every 15s.
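A condensed sketch of those directly managed exporter manifests (image tag and labels are illustrative; the dedicated Service that the ServiceMonitor selects is omitted for brevity, and the real manifests live in `apps/gpu-operator/extras`):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app: nvidia-dcgm-exporter
    spec:
      runtimeClassName: nvidia   # use the MicroK8s-provided NVIDIA runtime
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # illustrative tag
          ports:
            - name: metrics
              containerPort: 9400   # dcgm-exporter's default metrics port
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # must match the dedicated Service's labels
  endpoints:
    - port: metrics
      interval: 15s               # matches the 15s scrape described above
```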
The GPU Operator Helm values disable `devicePlugin`, `toolkit`, `gfd`, `migManager`, `validator`, and `dcgmExporter` to avoid conflicts with MicroK8s-managed components and with our directly managed exporter manifests.
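Those toggles correspond to GPU Operator chart values along these lines (a sketch, not the repo's exact values file; consult the gpu-operator chart for the full key set):

```yaml
devicePlugin:
  enabled: false   # supplied by the MicroK8s nvidia addon
toolkit:
  enabled: false   # supplied by the MicroK8s nvidia addon
gfd:
  enabled: false
migManager:
  enabled: false
dcgmExporter:
  enabled: false   # replaced by the directly managed exporter manifests
# (the validator is disabled similarly; see the chart's values for its exact key)
```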