A proof of concept implementation for external plugin support in Kubernetes Node Problem Detector (NPD), inspired by containerd's external snapshotter architecture.
This repository demonstrates how to extend Node Problem Detector with external plugins that run as separate processes and communicate via gRPC over Unix sockets. Instead of embedding monitoring logic directly into NPD, you can now create standalone monitor programs that NPD can discover and communicate with dynamically.
Traditional NPD: All monitoring logic is compiled into the main NPD binary, requiring rebuilds for new monitors.
External Plugin NPD: Monitoring logic runs in separate processes that NPD communicates with via gRPC, enabling:
- ✅ Runtime extensibility - Add new monitors without rebuilding NPD
- ✅ Language flexibility - Write monitors in any language that supports gRPC
- ✅ Isolation - Monitor crashes don't affect NPD core
- ✅ Independent deployment - Update monitors without touching NPD
- ✅ Resource control - Fine-grained resource limits per monitor
- Reduced maintenance burden: External plugins don't require NPD releases
- Cleaner codebase: 93% reduction in source files (139 → 9 files)
- Better testing: Plugins can be tested independently
- Security isolation: Plugins run with minimal privileges
- Easy GPU monitoring: Deploy GPU monitoring without custom NPD builds
- Kubernetes integration: Node conditions automatically reflect GPU health
- Operational visibility: Clear separation between NPD core and plugin issues
- Simple troubleshooting: Container logs clearly show plugin communication status
┌─────────────────┐ Unix Socket ┌─────────────────┐
│ │ ◄────────────────► │ │
│ Node Problem │ │ External GPU │
│ Detector │ gRPC Protocol │ Monitor │
│ │ │ │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Kubernetes │ │ nvidia-smi │
│ API (Events, │ │ Hardware │
│ Conditions) │ │ Access │
└─────────────────┘ └─────────────────┘
- Kubernetes cluster (1.20+)
- Docker or compatible container runtime
- kubectl configured
- For GPU monitoring: NVIDIA GPU nodes with device plugin
# Apply configurations and DaemonSet
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml
# Wait for deployment
kubectl rollout status daemonset/npd-ext -n kube-systemOr build from source:
# Clone this repository
git clone <repository-url>
cd npd-ext
# Build binaries
make build-all-binaries
# Build and push container images
make docker-build-all
make docker-push-all
# Deploy to Kubernetes
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml# Check pods are running (should show 2/2 Running)
kubectl get pods -n kube-system -l app=npd-ext
# Check NPD logs for external monitor connection
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector --tail=20
# Check GPU monitor logs for actual GPU stats
kubectl logs -n kube-system -l app=npd-ext -c gpu-monitor --tail=20
# Verify GPU conditions are added to node status
kubectl describe node <gpu-node-name> | grep -A5 -B5 GPUSuccessful Deployment Logs:
# NPD Container should show:
I1031 18:57:03.909105 81742 external_monitor.go:48] Creating external monitor from config: /config/external-gpu-monitor.json
I1031 18:57:05.608260 81742 external_monitor_proxy.go:159] Connected to external monitor: gpu-monitor
I1031 18:57:06.807778 81742 external_monitor_proxy.go:193] External monitor gpu-monitor metadata: version=1.0.0, api_version=v1
# GPU Monitor Container should show:
2025/10/31 18:57:03 Starting GPU Monitor v1.0.0
2025/10/31 18:57:03 GPU Monitor listening on /var/run/npd/npd-gpu-monitor.sock
2025/10/31 18:57:36 CheckHealth called (sequence: 1)
2025/10/31 18:57:37 GPU stats: temp=25°C, memory=0/46068MB (0.0%), power=21WNode Conditions Added:
kubectl describe node <gpu-node> | grep GPU
GPUHealthy False Fri, 31 Oct 2025 14:58:37 -0400 GPUIsHealthy GPU is healthy: temp=25°C, memory=0.0%, power=21W
GPUHung False Fri, 31 Oct 2025 14:58:37 -0400 GPUHung GPU is hung and not responding
GPUMemoryPressure False Fri, 31 Oct 2025 14:58:37 -0400 GPUMemoryPressure GPU memory usage is too high
GPUTemperatureHigh False Fri, 31 Oct 2025 14:58:37 -0400 GPUTemperatureHigh GPU temperature is too high- External Monitor Proxy (
pkg/externalmonitor/) - Bridges gRPC to NPD's Monitor interface - gRPC Protocol (
api/services/external/v1/) - Protobuf definitions for external plugins - GPU Monitor Example (
examples/external-plugins/gpu-monitor/) - Complete NVIDIA GPU monitor implementation
- gRPC Communication - Unix socket based IPC between NPD and external monitors
- Health Checking - Automatic reconnection and circuit breaking
- Configuration - JSON-based external monitor configuration
- Error Handling - Comprehensive error recovery and exponential backoff
- Resource Management - Independent resource limits for each component
The included GPU monitor demonstrates external plugin capabilities:
# Monitor NVIDIA GPU health
# Reports conditions: GPUHung, GPUMemoryPressure, GPUTemperatureHigh
# Connects via: /var/run/npd/npd-gpu-monitor.sockSample GPU Events:
# High GPU memory usage
reason: GPUMemoryPressure
message: "GPU memory usage: 96% (threshold: 95%)"
# GPU temperature warning
reason: GPUTemperatureHigh
message: "GPU temperature: 87°C (threshold: 85°C)"External monitors implement the ExternalMonitor gRPC service:
service ExternalMonitor {
rpc CheckHealth(HealthCheckRequest) returns (Status);
rpc GetMetadata(google.protobuf.Empty) returns (MonitorMetadata);
rpc Stop(google.protobuf.Empty) returns (google.protobuf.Empty);
}NPD discovers external monitors via configuration:
{
"plugin": "external",
"pluginConfig": {
"socketPath": "/var/run/npd/my-monitor.sock",
"grpcTimeout": "10s",
"reconnectInterval": "30s"
}
}- NPD starts and reads external monitor configs
- NPD creates proxy instances for each external monitor
- Proxies attempt gRPC connections to Unix sockets
- External monitors register and begin health reporting
- NPD receives status updates and reports to Kubernetes
type MyMonitorServer struct {
pb.UnimplementedExternalMonitorServer
}
func (s *MyMonitorServer) CheckHealth(ctx context.Context, req *pb.HealthCheckRequest) (*pb.Status, error) {
// Your monitoring logic here
return &pb.Status{
Conditions: []*pb.Condition{{
Type: "MyCustomCondition",
Status: pb.ConditionStatus_True,
Reason: "MyReason",
Message: "Custom monitoring detected an issue",
}},
}, nil
}{
"plugin": "external",
"pluginConfig": {
"socketPath": "/var/run/npd/my-monitor.sock"
},
"conditions": [
{
"type": "MyCustomCondition",
"reason": "MyReason",
"message": "Custom monitoring condition"
}
]
}containers:
- name: my-monitor
image: my-registry/my-monitor:v1.0.0
command: ["/my-monitor", "-socket", "/var/run/npd/my-monitor.sock"]
volumeMounts:
- name: npd-socket
mountPath: /var/run/npdnpd-ext/
├── api/services/external/v1/ # gRPC protobuf definitions
├── pkg/externalmonitor/ # External monitor proxy implementation
├── cmd/nodeproblemdetector/ # NPD binary with external support
├── examples/external-plugins/ # Example external monitors
│ └── gpu-monitor/ # NVIDIA GPU monitor example
├── deployment/ # Kubernetes deployment manifests
├── Dockerfile # NPD container image
├── Dockerfile.gpu-monitor # GPU monitor container image
└── Makefile # Build automation
NPD and external monitors in the same pod:
- Shared Unix socket via emptyDir volume
- Simplified networking and RBAC
- Single resource pool
kubectl apply -f deployment/npd-ext-daemonset.yamlIndependent deployments for NPD and monitors:
- Fine-grained resource control
- Independent scaling
- GPU monitors only on GPU nodes
kubectl apply -f deployment/npd-ext-separate.yaml{
"plugin": "external",
"pluginConfig": {
"socketPath": "/var/run/npd/my-monitor.sock"
}
}{
"plugin": "external",
"pluginConfig": {
"socketPath": "/var/run/npd/my-monitor.sock",
"grpcTimeout": "15s",
"reconnectInterval": "60s",
"maxReconnectAttempts": 10,
"healthCheckInterval": "30s"
},
"conditions": [...],
"rules": [...]
}| Component | CPU (Request) | CPU (Limit) | Memory (Request) | Memory (Limit) |
|---|---|---|---|---|
| NPD Core | 10m | 10m | 80Mi | 80Mi |
| GPU Monitor | 10m | 50m | 20Mi | 50Mi |
- Protocol: gRPC over Unix sockets
- Serialization: Protocol Buffers (~100-500 bytes per status)
- Frequency: Configurable (default: 30s health checks)
- Latency: <1ms for local Unix socket communication
Pod CrashLoopBackOff
# Check specific container logs
kubectl logs <pod-name> -n kube-system -c node-problem-detector --tail=50
kubectl logs <pod-name> -n kube-system -c gpu-monitor --tail=50
# Common causes:
# 1. Missing log paths in custom plugin configs (add "path": "/var/log/messages")
# 2. Unsupported plugin types (remove journald-based monitors)
# 3. GPU access issues (verify nvidia.com/gpu resource allocation)External Monitor Not Connecting
# Check if external monitor is starting
kubectl logs <pod-name> -n kube-system -c gpu-monitor | grep "GPU Monitor listening"
# Verify NPD can reach external monitor
kubectl logs <pod-name> -n kube-system -c node-problem-detector | grep "Connected to external monitor"
# Check socket permissions
kubectl exec <pod-name> -n kube-system -c node-problem-detector -- ls -la /var/run/npd/GPU Monitor Issues
# Test nvidia-smi access
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- nvidia-smi
# Check GPU resource allocation
kubectl describe node <node-name> | grep nvidia.com/gpu
# Verify CUDA base image
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- which nvidia-smi# Show external monitor configuration
kubectl get configmap -n kube-system npd-ext-config -o yaml
# Check NPD external monitor status
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector | grep external
# Monitor gRPC communication
kubectl exec -n kube-system <pod> -c node-problem-detector -- netstat -ln | grep /var/run/npd- Create monitor binary implementing the gRPC service
- Add configuration to the ConfigMap
- Deploy as sidecar or separate DaemonSet
- Update NPD command with new
--config.external-monitorflag
Python Monitor
import grpc
from concurrent import futures
import external_monitor_pb2_grpc as pb2_grpc
class MyMonitor(pb2_grpc.ExternalMonitorServicer):
def CheckHealth(self, request, context):
# Your monitoring logic
passRust Monitor
use tonic::{transport::Server, Request, Response, Status};
use external_monitor::external_monitor_server::{ExternalMonitor, ExternalMonitorServer};
#[tonic::async_trait]
impl ExternalMonitor for MyMonitor {
async fn check_health(&self, request: Request<HealthCheckRequest>) -> Result<Response<Status>, Status> {
// Your monitoring logic
}
}This is a proof of concept. To contribute:
- Fork the repository
- Create feature branch
- Add tests for new functionality
- Submit pull request with clear description
Copyright 2024 The Kubernetes Authors All rights reserved.
Licensed under the Apache License, Version 2.0.