Skip to content

dims/npd-ext

Repository files navigation

NPD External Plugin Proof of Concept

A proof of concept implementation for external plugin support in Kubernetes Node Problem Detector (NPD), inspired by containerd's external snapshotter architecture.

What is this?

This repository demonstrates how to extend Node Problem Detector with external plugins that run as separate processes and communicate via gRPC over Unix sockets. Instead of embedding monitoring logic directly into NPD, you can now create standalone monitor programs that NPD can discover and communicate with dynamically.

Why External Plugins?

Traditional NPD: All monitoring logic is compiled into the main NPD binary, requiring rebuilds for new monitors.

External Plugin NPD: Monitoring logic runs in separate processes that NPD communicates with via gRPC, enabling:

  • Runtime extensibility - Add new monitors without rebuilding NPD
  • Language flexibility - Write monitors in any language that supports gRPC
  • Isolation - Monitor crashes don't affect NPD core
  • Independent deployment - Update monitors without touching NPD
  • Resource control - Fine-grained resource limits per monitor

Real-World Benefits Demonstrated

For NPD Maintainers

  • Reduced maintenance burden: External plugins don't require NPD releases
  • Cleaner codebase: 93% reduction in source files (139 → 9 files)
  • Better testing: Plugins can be tested independently
  • Security isolation: Plugins run with minimal privileges

For End Users

  • Easy GPU monitoring: Deploy GPU monitoring without custom NPD builds
  • Kubernetes integration: Node conditions automatically reflect GPU health
  • Operational visibility: Clear separation between NPD core and plugin issues
  • Simple troubleshooting: Container logs clearly show plugin communication status

Architecture Overview

┌─────────────────┐    Unix Socket     ┌─────────────────┐
│                 │ ◄────────────────► │                 │
│  Node Problem   │                    │  External GPU   │
│   Detector      │  gRPC Protocol     │    Monitor      │
│                 │                    │                 │
└─────────────────┘                    └─────────────────┘
        │                                       │
        ▼                                       ▼
┌─────────────────┐                    ┌─────────────────┐
│   Kubernetes    │                    │  nvidia-smi     │
│  API (Events,   │                    │   Hardware      │
│  Conditions)    │                    │    Access       │
└─────────────────┘                    └─────────────────┘

Quick Start

Prerequisites

  • Kubernetes cluster (1.20+)
  • Docker or compatible container runtime
  • kubectl configured
  • For GPU monitoring: NVIDIA GPU nodes with device plugin

1. Deploy Pre-built Images

# Apply configurations and DaemonSet
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml

# Wait for deployment
kubectl rollout status daemonset/npd-ext -n kube-system

Or build from source:

# Clone this repository
git clone <repository-url>
cd npd-ext

# Build binaries
make build-all-binaries

# Build and push container images
make docker-build-all
make docker-push-all

# Deploy to Kubernetes
kubectl apply -f deployment/npd-ext-config.yaml
kubectl apply -f deployment/npd-ext-daemonset.yaml

2. Verify Deployment

# Check pods are running (should show 2/2 Running)
kubectl get pods -n kube-system -l app=npd-ext

# Check NPD logs for external monitor connection
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector --tail=20

# Check GPU monitor logs for actual GPU stats
kubectl logs -n kube-system -l app=npd-ext -c gpu-monitor --tail=20

# Verify GPU conditions are added to node status
kubectl describe node <gpu-node-name> | grep -A5 -B5 GPU

3. Expected Output

Successful Deployment Logs:

# NPD Container should show:
I1031 18:57:03.909105 81742 external_monitor.go:48] Creating external monitor from config: /config/external-gpu-monitor.json
I1031 18:57:05.608260 81742 external_monitor_proxy.go:159] Connected to external monitor: gpu-monitor
I1031 18:57:06.807778 81742 external_monitor_proxy.go:193] External monitor gpu-monitor metadata: version=1.0.0, api_version=v1

# GPU Monitor Container should show:
2025/10/31 18:57:03 Starting GPU Monitor v1.0.0
2025/10/31 18:57:03 GPU Monitor listening on /var/run/npd/npd-gpu-monitor.sock
2025/10/31 18:57:36 CheckHealth called (sequence: 1)
2025/10/31 18:57:37 GPU stats: temp=25°C, memory=0/46068MB (0.0%), power=21W

Node Conditions Added:

kubectl describe node <gpu-node> | grep GPU
GPUHealthy           False   Fri, 31 Oct 2025 14:58:37 -0400   GPUIsHealthy            GPU is healthy: temp=25°C, memory=0.0%, power=21W
GPUHung              False   Fri, 31 Oct 2025 14:58:37 -0400   GPUHung                 GPU is hung and not responding
GPUMemoryPressure    False   Fri, 31 Oct 2025 14:58:37 -0400   GPUMemoryPressure       GPU memory usage is too high
GPUTemperatureHigh   False   Fri, 31 Oct 2025 14:58:37 -0400   GPUTemperatureHigh      GPU temperature is too high

What's Included

Core Components

  • External Monitor Proxy (pkg/externalmonitor/) - Bridges gRPC to NPD's Monitor interface
  • gRPC Protocol (api/services/external/v1/) - Protobuf definitions for external plugins
  • GPU Monitor Example (examples/external-plugins/gpu-monitor/) - Complete NVIDIA GPU monitor implementation

Key Features Demonstrated

  • gRPC Communication - Unix socket based IPC between NPD and external monitors
  • Health Checking - Automatic reconnection and circuit breaking
  • Configuration - JSON-based external monitor configuration
  • Error Handling - Comprehensive error recovery and exponential backoff
  • Resource Management - Independent resource limits for each component

Example: GPU Monitor

The included GPU monitor demonstrates external plugin capabilities:

# Monitor NVIDIA GPU health
# Reports conditions: GPUHung, GPUMemoryPressure, GPUTemperatureHigh
# Connects via: /var/run/npd/npd-gpu-monitor.sock

Sample GPU Events:

# High GPU memory usage
reason: GPUMemoryPressure
message: "GPU memory usage: 96% (threshold: 95%)"

# GPU temperature warning
reason: GPUTemperatureHigh
message: "GPU temperature: 87°C (threshold: 85°C)"

How External Plugins Work

1. Plugin Implementation

External monitors implement the ExternalMonitor gRPC service:

service ExternalMonitor {
    rpc CheckHealth(HealthCheckRequest) returns (Status);
    rpc GetMetadata(google.protobuf.Empty) returns (MonitorMetadata);
    rpc Stop(google.protobuf.Empty) returns (google.protobuf.Empty);
}

2. NPD Configuration

NPD discovers external monitors via configuration:

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock",
    "grpcTimeout": "10s",
    "reconnectInterval": "30s"
  }
}

3. Runtime Communication

  1. NPD starts and reads external monitor configs
  2. NPD creates proxy instances for each external monitor
  3. Proxies attempt gRPC connections to Unix sockets
  4. External monitors register and begin health reporting
  5. NPD receives status updates and reports to Kubernetes

Creating Your Own External Plugin

1. Implement the gRPC Service

type MyMonitorServer struct {
    pb.UnimplementedExternalMonitorServer
}

func (s *MyMonitorServer) CheckHealth(ctx context.Context, req *pb.HealthCheckRequest) (*pb.Status, error) {
    // Your monitoring logic here
    return &pb.Status{
        Conditions: []*pb.Condition{{
            Type:    "MyCustomCondition",
            Status:  pb.ConditionStatus_True,
            Reason:  "MyReason",
            Message: "Custom monitoring detected an issue",
        }},
    }, nil
}

2. Create Configuration

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock"
  },
  "conditions": [
    {
      "type": "MyCustomCondition",
      "reason": "MyReason",
      "message": "Custom monitoring condition"
    }
  ]
}

3. Deploy as Sidecar

containers:
- name: my-monitor
  image: my-registry/my-monitor:v1.0.0
  command: ["/my-monitor", "-socket", "/var/run/npd/my-monitor.sock"]
  volumeMounts:
  - name: npd-socket
    mountPath: /var/run/npd

Repository Structure

npd-ext/
├── api/services/external/v1/       # gRPC protobuf definitions
├── pkg/externalmonitor/            # External monitor proxy implementation
├── cmd/nodeproblemdetector/        # NPD binary with external support
├── examples/external-plugins/      # Example external monitors
│   └── gpu-monitor/               # NVIDIA GPU monitor example
├── deployment/                    # Kubernetes deployment manifests
├── Dockerfile                     # NPD container image
├── Dockerfile.gpu-monitor         # GPU monitor container image
└── Makefile                       # Build automation

Deployment Patterns

Pattern 1: Sidecar (Recommended)

NPD and external monitors in the same pod:

  • Shared Unix socket via emptyDir volume
  • Simplified networking and RBAC
  • Single resource pool
kubectl apply -f deployment/npd-ext-daemonset.yaml

Pattern 2: Separate DaemonSets

Independent deployments for NPD and monitors:

  • Fine-grained resource control
  • Independent scaling
  • GPU monitors only on GPU nodes
kubectl apply -f deployment/npd-ext-separate.yaml

Configuration Examples

Minimal External Monitor Config

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock"
  }
}

Advanced External Monitor Config

{
  "plugin": "external",
  "pluginConfig": {
    "socketPath": "/var/run/npd/my-monitor.sock",
    "grpcTimeout": "15s",
    "reconnectInterval": "60s",
    "maxReconnectAttempts": 10,
    "healthCheckInterval": "30s"
  },
  "conditions": [...],
  "rules": [...]
}

Performance Characteristics

Resource Usage

Component CPU (Request) CPU (Limit) Memory (Request) Memory (Limit)
NPD Core 10m 10m 80Mi 80Mi
GPU Monitor 10m 50m 20Mi 50Mi

Communication Overhead

  • Protocol: gRPC over Unix sockets
  • Serialization: Protocol Buffers (~100-500 bytes per status)
  • Frequency: Configurable (default: 30s health checks)
  • Latency: <1ms for local Unix socket communication

Troubleshooting

Common Issues

Pod CrashLoopBackOff

# Check specific container logs
kubectl logs <pod-name> -n kube-system -c node-problem-detector --tail=50
kubectl logs <pod-name> -n kube-system -c gpu-monitor --tail=50

# Common causes:
# 1. Missing log paths in custom plugin configs (add "path": "/var/log/messages")
# 2. Unsupported plugin types (remove journald-based monitors)
# 3. GPU access issues (verify nvidia.com/gpu resource allocation)

External Monitor Not Connecting

# Check if external monitor is starting
kubectl logs <pod-name> -n kube-system -c gpu-monitor | grep "GPU Monitor listening"

# Verify NPD can reach external monitor
kubectl logs <pod-name> -n kube-system -c node-problem-detector | grep "Connected to external monitor"

# Check socket permissions
kubectl exec <pod-name> -n kube-system -c node-problem-detector -- ls -la /var/run/npd/

GPU Monitor Issues

# Test nvidia-smi access
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- nvidia-smi

# Check GPU resource allocation
kubectl describe node <node-name> | grep nvidia.com/gpu

# Verify CUDA base image
kubectl exec <pod-name> -n kube-system -c gpu-monitor -- which nvidia-smi

Debug Commands

# Show external monitor configuration
kubectl get configmap -n kube-system npd-ext-config -o yaml

# Check NPD external monitor status
kubectl logs -n kube-system -l app=npd-ext -c node-problem-detector | grep external

# Monitor gRPC communication
kubectl exec -n kube-system <pod> -c node-problem-detector -- netstat -ln | grep /var/run/npd

Extending the System

Adding New Monitor Types

  1. Create monitor binary implementing the gRPC service
  2. Add configuration to the ConfigMap
  3. Deploy as sidecar or separate DaemonSet
  4. Update NPD command with new --config.external-monitor flag

Language Examples

Python Monitor

import grpc
from concurrent import futures
import external_monitor_pb2_grpc as pb2_grpc

class MyMonitor(pb2_grpc.ExternalMonitorServicer):
    def CheckHealth(self, request, context):
        # Your monitoring logic
        pass

Rust Monitor

use tonic::{transport::Server, Request, Response, Status};
use external_monitor::external_monitor_server::{ExternalMonitor, ExternalMonitorServer};

#[tonic::async_trait]
impl ExternalMonitor for MyMonitor {
    async fn check_health(&self, request: Request<HealthCheckRequest>) -> Result<Response<Status>, Status> {
        // Your monitoring logic
    }
}

Contributing

This is a proof of concept. To contribute:

  1. Fork the repository
  2. Create feature branch
  3. Add tests for new functionality
  4. Submit pull request with clear description

License

Copyright 2024 The Kubernetes Authors All rights reserved.

Licensed under the Apache License, Version 2.0.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors