NPD External Plugin Proof of Concept

A proof of concept implementation for external plugin support in Kubernetes Node Problem Detector (NPD), inspired by containerd's external snapshotter architecture.

What is this?

This repository demonstrates how to extend Node Problem Detector with external plugins that run as separate processes and communicate via gRPC over Unix sockets. Instead of embedding monitoring logic directly into NPD, you can now create standalone monitor programs that NPD can discover and communicate with dynamically.

Why External Plugins?

Traditional NPD: All monitoring logic is compiled into the main NPD binary, requiring rebuilds for new monitors.

External Plugin NPD: Monitoring logic runs in separate processes that NPD communicates with via gRPC, enabling:

✅ Runtime extensibility - Add new monitors without rebuilding NPD
✅ Language flexibility - Write monitors in any language that supports gRPC
✅ Isolation - Monitor crashes don't affect NPD core
✅ Independent deployment - Update monitors without touching NPD
✅ Resource control - Fine-grained resource limits per monitor

Real-World Benefits Demonstrated

For NPD Maintainers

Reduced maintenance burden: External plugins don't require NPD releases
Cleaner codebase: 93% reduction in source files (139 → 9 files)
Better testing: Plugins can be tested independently
Security isolation: Plugins run with minimal privileges

For End Users

Easy GPU monitoring: Deploy GPU monitoring without custom NPD builds
Kubernetes integration: Node conditions automatically reflect GPU health
Operational visibility: Clear separation between NPD core and plugin issues
Simple troubleshooting: Container logs clearly show plugin communication status

Architecture Overview

┌─────────────────┐    Unix Socket     ┌─────────────────┐
│                 │ ◄────────────────► │                 │
│  Node Problem   │                    │  External GPU   │
│   Detector      │  gRPC Protocol     │    Monitor      │
│                 │                    │                 │
└─────────────────┘                    └─────────────────┘
        │                                       │
        ▼                                       ▼
┌─────────────────┐                    ┌─────────────────┐
│   Kubernetes    │                    │  nvidia-smi     │
│  API (Events,   │                    │   Hardware      │
│  Conditions)    │                    │    Access       │
└─────────────────┘                    └─────────────────┘

Quick Start

Prerequisites

Kubernetes cluster (1.20+)
Docker or compatible container runtime
kubectl configured
For GPU monitoring: NVIDIA GPU nodes with the NVIDIA device plugin installed

1. Deploy Pre-built Images

# Apply configurations and DaemonSet
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml

# Wait for deployment
kubectl rollout status daemonset/npd-ext -n kube-system

Or build from source:

# Clone this repository
git clone <repository-url>
cd npd-ext

# Build binaries
make build-all-binaries

# Build and push container images
make docker-build-all
make docker-push-all

# Deploy to Kubernetes
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml

2. Verify Deployment

# Check pods are running (should show 2/2 Running)
kubectl get pods -n kube-system -l app=npd-ext

# Check NPD logs for external monitor connection
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector --tail=20

# Check GPU monitor logs for actual GPU stats
kubectl logs -n kube-system -l app=npd-ext -c gpu-monitor --tail=20

# Verify GPU conditions are added to node status
kubectl describe node <gpu-node-name> | grep -A5 -B5 GPU

3. Expected Output

Successful Deployment Logs:

# NPD Container should show:
I1031 18:57:03.909105 81742 external_monitor.go:48] Creating external monitor from config: /config/external-gpu-monitor.json
I1031 18:57:05.608260 81742 external_monitor_proxy.go:159] Connected to external monitor: gpu-monitor
I1031 18:57:06.807778 81742 external_monitor_proxy.go:193] External monitor gpu-monitor metadata: version=1.0.0, api_version=v1

# GPU Monitor Container should show:
2025/10/31 18:57:03 Starting GPU Monitor v1.0.0
2025/10/31 18:57:03 GPU Monitor listening on /var/run/npd/npd-gpu-monitor.sock
2025/10/31 18:57:36 CheckHealth called (sequence: 1)
2025/10/31 18:57:37 GPU stats: temp=25°C, memory=0/46068MB (0.0%), power=21W

Node Conditions Added:

kubectl describe node <gpu-node> | grep GPU
GPUHealthy           False   Fri, 31 Oct 2025 14:58:37 -0400   GPUIsHealthy            GPU is healthy: temp=25°C, memory=0.0%, power=21W
GPUHung              False   Fri, 31 Oct 2025 14:58:37 -0400   GPUHung                 GPU is hung and not responding
GPUMemoryPressure    False   Fri, 31 Oct 2025 14:58:37 -0400   GPUMemoryPressure       GPU memory usage is too high
GPUTemperatureHigh   False   Fri, 31 Oct 2025 14:58:37 -0400   GPUTemperatureHigh      GPU temperature is too high

What's Included

Core Components

External Monitor Proxy (pkg/externalmonitor/) - Bridges gRPC to NPD's Monitor interface
gRPC Protocol (api/services/external/v1/) - Protobuf definitions for external plugins
GPU Monitor Example (examples/external-plugins/gpu-monitor/) - Complete NVIDIA GPU monitor implementation

Key Features Demonstrated

gRPC Communication - Unix socket based IPC between NPD and external monitors
Health Checking - Automatic reconnection and circuit breaking
Configuration - JSON-based external monitor configuration
Error Handling - Comprehensive error recovery and exponential backoff
Resource Management - Independent resource limits for each component

Example: GPU Monitor

The included GPU monitor demonstrates external plugin capabilities:

# Monitor NVIDIA GPU health
# Reports conditions: GPUHealthy, GPUHung, GPUMemoryPressure, GPUTemperatureHigh
# Connects via: /var/run/npd/npd-gpu-monitor.sock

Sample GPU Events:

# High GPU memory usage
reason: GPUMemoryPressure
message: "GPU memory usage: 96% (threshold: 95%)"

# GPU temperature warning
reason: GPUTemperatureHigh
message: "GPU temperature: 87°C (threshold: 85°C)"

How External Plugins Work

1. Plugin Implementation

External monitors implement the ExternalMonitor gRPC service:

service ExternalMonitor {
    rpc CheckHealth(HealthCheckRequest) returns (Status);
    rpc GetMetadata(google.protobuf.Empty) returns (MonitorMetadata);
    rpc Stop(google.protobuf.Empty) returns (google.protobuf.Empty);
}

2. NPD Configuration

NPD discovers external monitors via configuration:

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock",
    "grpcTimeout": "10s",
    "reconnectInterval": "30s"
  }
}

3. Runtime Communication

NPD starts and reads external monitor configs
NPD creates proxy instances for each external monitor
Proxies attempt gRPC connections to Unix sockets
Once connected, each proxy fetches the monitor's metadata (GetMetadata) and begins calling CheckHealth
NPD receives status updates and reports to Kubernetes

Creating Your Own External Plugin

1. Implement the gRPC Service

type MyMonitorServer struct {
    pb.UnimplementedExternalMonitorServer
}

func (s *MyMonitorServer) CheckHealth(ctx context.Context, req *pb.HealthCheckRequest) (*pb.Status, error) {
    // Your monitoring logic here
    return &pb.Status{
        Conditions: []*pb.Condition{{
            Type:    "MyCustomCondition",
            Status:  pb.ConditionStatus_True,
            Reason:  "MyReason",
            Message: "Custom monitoring detected an issue",
        }},
    }, nil
}

2. Create Configuration

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock"
  },
  "conditions": [
    {
      "type": "MyCustomCondition",
      "reason": "MyReason",
      "message": "Custom monitoring condition"
    }
  ]
}

3. Deploy as Sidecar

containers:
- name: my-monitor
  image: my-registry/my-monitor:v1.0.0
  command: ["/my-monitor", "-socket", "/var/run/npd/my-monitor.sock"]
  volumeMounts:
  - name: npd-socket
    mountPath: /var/run/npd

Repository Structure

npd-ext/
├── api/services/external/v1/       # gRPC protobuf definitions
├── pkg/externalmonitor/            # External monitor proxy implementation
├── cmd/nodeproblemdetector/        # NPD binary with external support
├── examples/external-plugins/      # Example external monitors
│   └── gpu-monitor/               # NVIDIA GPU monitor example
├── deployment/                    # Kubernetes deployment manifests
├── Dockerfile                     # NPD container image
├── Dockerfile.gpu-monitor         # GPU monitor container image
└── Makefile                       # Build automation

Deployment Patterns

Pattern 1: Sidecar (Recommended)

NPD and external monitors in the same pod:

Shared Unix socket via emptyDir volume
Simplified networking and RBAC
Single resource pool

kubectl apply -f deployment/npd-ext-daemonset.yaml

Pattern 2: Separate DaemonSets

Independent deployments for NPD and monitors:

Fine-grained resource control
Independent scaling
GPU monitors only on GPU nodes

kubectl apply -f deployment/npd-ext-separate.yaml

Configuration Examples

Minimal External Monitor Config

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock"
  }
}

Advanced External Monitor Config

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock",
    "grpcTimeout": "15s",
    "reconnectInterval": "60s",
    "maxReconnectAttempts": 10,
    "healthCheckInterval": "30s"
  },
  "conditions": [...],
  "rules": [...]
}

Performance Characteristics

Resource Usage

Component     CPU (Request)   CPU (Limit)   Memory (Request)   Memory (Limit)
NPD Core      10m             10m           80Mi               80Mi
GPU Monitor   10m             50m           20Mi               50Mi

Communication Overhead

Protocol: gRPC over Unix sockets
Serialization: Protocol Buffers (~100-500 bytes per status)
Frequency: Configurable (default: 30s health checks)
Latency: <1ms for local Unix socket communication

Troubleshooting

Common Issues

Pod CrashLoopBackOff

# Check specific container logs
kubectl logs <pod-name> -n kube-system -c node-problem-detector --tail=50
kubectl logs <pod-name> -n kube-system -c gpu-monitor --tail=50

# Common causes:
# 1. Missing log paths in custom plugin configs (add "path": "/var/log/messages")
# 2. Unsupported plugin types (remove journald-based monitors)
# 3. GPU access issues (verify nvidia.com/gpu resource allocation)

External Monitor Not Connecting

# Check if external monitor is starting
kubectl logs <pod-name> -n kube-system -c gpu-monitor | grep "GPU Monitor listening"

# Verify NPD can reach external monitor
kubectl logs <pod-name> -n kube-system -c node-problem-detector | grep "Connected to external monitor"

# Check socket permissions
kubectl exec <pod-name> -n kube-system -c node-problem-detector -- ls -la /var/run/npd/

GPU Monitor Issues

# Test nvidia-smi access
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- nvidia-smi

# Check GPU resource allocation
kubectl describe node <node-name> | grep nvidia.com/gpu

# Verify nvidia-smi is on the PATH (it should be provided by the CUDA base image)
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- which nvidia-smi

Debug Commands

# Show external monitor configuration
kubectl get configmap -n kube-system npd-ext-config -o yaml

# Check NPD external monitor status
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector | grep external

# List listening sockets under /var/run/npd (requires netstat in the container image)
kubectl exec -n kube-system <pod> -c node-problem-detector -- netstat -ln | grep /var/run/npd

Extending the System

Adding New Monitor Types

Create monitor binary implementing the gRPC service
Add configuration to the ConfigMap
Deploy as sidecar or separate DaemonSet
Update NPD command with new --config.external-monitor flag

Language Examples

Python Monitor

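import grpc
from concurrent import futures
# Assumes the generated message module is named external_monitor_pb2
import external_monitor_pb2 as pb2
import external_monitor_pb2_grpc as pb2_grpc

class MyMonitor(pb2_grpc.ExternalMonitorServicer):
    def CheckHealth(self, request, context):
        # Your monitoring logic; an empty Status reports no conditions
        return pb2.Status()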

Rust Monitor

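use tonic::{transport::Server, Request, Response};
use external_monitor::external_monitor_server::{ExternalMonitor, ExternalMonitorServer};
// Assumes the generated messages live in the same external_monitor module
use external_monitor::{HealthCheckRequest, Status};

#[tonic::async_trait]
impl ExternalMonitor for MyMonitor {
    async fn check_health(&self, request: Request<HealthCheckRequest>) -> Result<Response<Status>, tonic::Status> {
        // Your monitoring logic; the default Status reports no conditions
        Ok(Response::new(Status::default()))
    }
}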

Contributing

This is a proof of concept. To contribute:

Fork the repository
Create feature branch
Add tests for new functionality
Submit pull request with clear description

License

Licensed under the Apache License, Version 2.0.
