
Pure GitOps (Path A) GPU observability stack for single-node MicroK8s on a laptop (RTX 5000). Everything MUST be deployed from this repo via Argo CD (no microk8s enable observability, no microk8s enable gpu).

dwetmore/gpu-observability-gitops


gpu-observability-gitops

What it is

A pure GitOps observability platform for NVIDIA GPU workloads on MicroK8s.

What it does

  • Deploys and configures GPU telemetry (DCGM exporter).
  • Wires metrics into Prometheus and dashboards into Grafana.
  • Adds log aggregation via Loki/Promtail for correlated troubleshooting.

Why it matters

It provides production-style visibility for AI/GPU infrastructure, enabling faster diagnosis, capacity planning, and operational reliability.

Pure GitOps GPU observability stack for single-node MicroK8s using Argo CD app-of-apps.

Architecture

NVIDIA GPU Operator
  └─ DCGM Exporter --metrics--> Prometheus (kube-prometheus-stack) --> Grafana dashboards

Kubernetes Pod Logs --> Promtail --> Loki --> Grafana Explore / Logs panels
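To make the metrics path concrete, here are a few example PromQL expressions for the GPU panels. These are a sketch using the standard DCGM exporter metric names, not expressions taken from this repo's dashboards; verify the names against your exporter version on the Prometheus targets page.

```shell
# Example PromQL for the GPU dashboard panels (standard DCGM exporter fields;
# assumptions, not copied from this repo's dashboard JSON).
gpu_util='avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)'
fb_used_pct='100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)'
gpu_temp='DCGM_FI_DEV_GPU_TEMP'
gpu_power='DCGM_FI_DEV_POWER_USAGE'
printf '%s\n' "$gpu_util" "$fb_used_pct" "$gpu_temp" "$gpu_power"
```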

Repository layout

kustomization.yaml         # Argo CD repo entrypoint (root kustomize)
apps.yaml                  # Argo CD root Application (points at repo root)
apps/
  gpu-operator/
    application.yaml
  monitoring/
    application.yaml
    extras/
      kustomization.yaml
      dashboards/
  logging/
    application.yaml
Makefile

Prerequisites (MicroK8s host)

  • NVIDIA driver installed on host and nvidia-smi works.
  • MicroK8s ingress controller enabled and using ingress class public.
  • A default storage class is available for Prometheus/Grafana/Loki PVCs.
  • Argo CD installed in the cluster.
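The prerequisites above can be sanity-checked with a small pre-flight script. This is a sketch: it only reports what is missing and changes nothing; run it on the MicroK8s host.

```shell
# Pre-flight sketch for the prerequisites listed above (report-only).
missing=0
for tool in nvidia-smi kubectl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "MISSING: $tool"
    missing=$((missing + 1))
  fi
done
# If kubectl is available, also check for a default storage class and Argo CD.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get storageclass 2>/dev/null | grep -q '(default)' \
    || echo "WARNING: no default storage class"
  kubectl get ns argocd >/dev/null 2>&1 \
    || echo "WARNING: argocd namespace not found"
fi
echo "missing tools: $missing"
```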

Install

  1. Apply the Argo CD root application:

kubectl apply -f apps.yaml

  2. Sync apps:

kubectl -n argocd get applications
kubectl -n argocd annotate application gpu-observability-root argocd.argoproj.io/refresh=hard --overwrite

  3. Check the expected URLs:

  • https://grafana.172.17.93.185.nip.io
  • https://prometheus.172.17.93.185.nip.io
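The nip.io hostnames embed the node's IP address (nip.io resolves anything.IP.nip.io to that IP), so the URLs change whenever the host IP changes. A small sketch of how they are derived, using the IP from this README:

```shell
# nip.io resolves <name>.<ip>.nip.io to <ip>, so the ingress hostnames encode
# the MicroK8s node IP. Replace node_ip with your host's current address.
node_ip="172.17.93.185"
grafana_url="https://grafana.${node_ip}.nip.io"
prometheus_url="https://prometheus.${node_ip}.nip.io"
printf '%s\n' "$grafana_url" "$prometheus_url"
```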

Verification

1) Pods are running

kubectl get pods -n gpu-operator
kubectl get pods -n monitoring
kubectl get pods -n logging

2) Prometheus sees DCGM exporter target

  • Open Prometheus Targets page at https://prometheus.172.17.93.185.nip.io/targets.
  • Verify dcgm-exporter target is UP.

3) Grafana dashboards show GPU metrics

  • Open Grafana at https://grafana.172.17.93.185.nip.io.
  • Confirm GPU DCGM Overview dashboard shows utilization, memory, temperature, and power panels.
  • Confirm Node Basic dashboard shows CPU/memory/disk panels.

4) Loki datasource and logs

Run a test pod:

kubectl run log-demo --image=busybox -n default -- /bin/sh -c 'while true; do echo hello-loki; sleep 5; done'

Then in Grafana Explore:

  • Select Loki datasource.
  • Query {namespace="default", pod="log-demo"} and verify logs appear.

Argo CD force-refresh runbook (stale operation values)

If spec.sources[].helm.values in Git is correct but Argo CD keeps retrying a stale operation from operation.sync.sources[].helm.values, run:

# 1) Confirm the app points at the expected revision and inspect current operation block
kubectl -n argocd get application monitoring -o yaml | sed -n '/^spec:/,/^status:/p'
kubectl -n argocd get application monitoring -o yaml | sed -n '/^operation:/,/^spec:/p'

# 2) Terminate the currently running operation (if present)
argocd app terminate-op monitoring

# 3) Force a hard refresh from Git and clear cached manifests
argocd app get monitoring --hard-refresh
kubectl -n argocd annotate application monitoring argocd.argoproj.io/refresh=hard --overwrite

# 4) Start a new sync from the latest Git revision
argocd app sync monitoring --prune --retry-limit 1

# 5) Verify operation values no longer contain placeholders and hostnames are current
kubectl -n argocd get application monitoring -o jsonpath='{.operation.sync.sources[*].helm.values}'
echo

If the operation block still references stale values, clear it by deleting/recreating the Application from Git:

kubectl -n argocd delete application monitoring
kubectl apply -f apps/monitoring/application.yaml
argocd app sync monitoring --prune

WSL2 GPU metrics note

On WSL2 MicroK8s, NVIDIA GPU Operator may skip dcgm-exporter when its GPU-node auto-detection gates are not satisfied.

This repository uses the MicroK8s nvidia addon for runtime/toolkit/device plugin on the node, and deploys only nvidia-dcgm-exporter from GitOps (apps/gpu-operator/extras) with runtimeClassName: nvidia, a dedicated Service, and a ServiceMonitor scraped every 15s.
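The ServiceMonitor described above might look roughly like the following. This is a hypothetical sketch; the actual names, labels, namespace, and port in apps/gpu-operator/extras may differ.

```yaml
# Hypothetical sketch of the dcgm-exporter ServiceMonitor (15s scrape);
# check apps/gpu-operator/extras for the real manifest.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
```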

The GPU Operator Helm values disable devicePlugin, toolkit, gfd, migManager, validator, and dcgmExporter to avoid conflicts with MicroK8s-managed components and with our directly managed exporter manifests.
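The disabled components could be expressed in Helm values along these lines. This is a hedged sketch following the gpu-operator chart's usual key names, not the repo's actual values file; verify against your chart version.

```yaml
# Sketch of GPU Operator values disabling components managed elsewhere
# (MicroK8s nvidia addon and the directly managed exporter manifests).
devicePlugin:
  enabled: false
toolkit:
  enabled: false
gfd:
  enabled: false
migManager:
  enabled: false
dcgmExporter:
  enabled: false
# The validator is also disabled; its exact key varies by chart version,
# so check: helm show values nvidia/gpu-operator
```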
