# PKI Health Monitor

This notebook queries the **PKI Exporter** Prometheus-format metrics endpoint and displays CA health status across all three PKI hierarchies (RSA-4096, ECC P-384, ML-DSA-87).

## Monitoring Stack Architecture

```
Dogtag CAs (9 targets) → PKI Exporter (:9091/metrics) → Prometheus (:9090) → Grafana (:3000)
```

The PKI Exporter scrapes all Dogtag CAs every 15 seconds and exposes Prometheus-format metrics. This notebook reads those metrics directly from the exporter.
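
For orientation, the exporter's output is plain text with one sample per line. The sketch below shows roughly what a few lines could look like and how one line breaks down into name, labels, and value. The sample values and the `gauge` type are invented for illustration; the label names (`pki_type`, `ca_level`) are the ones this notebook filters on later. The hypothetical `split_sample` helper is deliberately naive and would not handle label values that contain commas or spaces.

```python
# Illustrative sample only: values and the TYPE line are assumptions,
# not captured exporter output.
SAMPLE = """\
# HELP pki_ca_up CA reachability (1=up, 0=down)
# TYPE pki_ca_up gauge
pki_ca_up{pki_type="rsa",ca_level="root"} 1
pki_ca_up{pki_type="ecc",ca_level="iot"} 0
"""


def split_sample(line):
    """Split one sample line into (name, labels, value)."""
    head, value = line.rsplit(" ", 1)
    name, _, label_part = head.partition("{")
    labels = {}
    for pair in label_part.rstrip("}").split(","):
        if pair:
            key, _, val = pair.partition("=")
            labels[key] = val.strip('"')
    return name, labels, float(value)


# Comment lines (# HELP / # TYPE) carry metadata and are skipped
samples = [
    split_sample(line)
    for line in SAMPLE.splitlines()
    if line and not line.startswith("#")
]
for sample in samples:
    print(sample)
```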

### CAs Scraped

| PKI Type | Root CA | Intermediate CA | IoT CA | EST CA |
|----------|---------|-----------------|--------|--------|
| **RSA-4096** | 8443 | 8444 | 8445 | 8447 |
| **ECC P-384** | 8463 | 8464 | 8465 | 8466 |
| **ML-DSA-87** | 8453 | 8454 | 8455 | 8456 |

### Metrics Available

| Metric | Description |
|--------|-------------|
| `pki_ca_up` | CA reachability (1=up, 0=down) |
| `pki_certificates_total` | Certificate count by status (VALID/REVOKED) |
| `pki_crl_last_update_timestamp` | Last CRL generation time |
| `pki_crl_next_update_timestamp` | Next CRL generation time |
| `pki_crl_entries_total` | Number of entries in CRL |
| `pki_ocsp_response_seconds` | OCSP response latency |
| `pki_issuance_rate` | Certificate issuance rate (from perf-test) |
| `pki_revocation_rate` | Certificate revocation rate (from perf-test) |
| `pki_issuance_total` | Certificates issued in the last perf-test run |
| `pki_revocation_total` | Certificates revoked in the last perf-test run |
| `pki_issuance_duration_seconds` | Issuance latency percentiles |

## Configuration

The Jupyter container is on the same Docker network as the PKI Exporter, so we can reach it by hostname.

| Variable | Default | Description |
|----------|---------|-------------|
| `PKI_EXPORTER_URL` | `http://pki-exporter.cert-lab.local:9091` | PKI Exporter base URL |
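
The notebook sets this value as a plain Python constant. If you would rather override it without editing the cell, one option is to read an environment variable of the same name. This is a minimal sketch under that assumption; the notebook itself does not read the environment, and `resolve_exporter_url` is a hypothetical helper.

```python
import os

DEFAULT_URL = "http://pki-exporter.cert-lab.local:9091"


def resolve_exporter_url(env=os.environ):
    """Return the exporter base URL, preferring an environment override."""
    url = env.get("PKI_EXPORTER_URL", DEFAULT_URL)
    # Drop a trailing slash so appending "/metrics" does not double it
    return url.rstrip("/")


print(resolve_exporter_url())
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dict instead of the real process environment.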

In [None]:
import re
import time
from datetime import datetime

import httpx
import pandas as pd
from IPython.display import display, clear_output

PKI_EXPORTER_URL = "http://pki-exporter.cert-lab.local:9091"

print(f"PKI Exporter URL: {PKI_EXPORTER_URL}")

In [None]:
import math


def fetch_metrics():
    """Fetch and parse Prometheus text format metrics into a list of dicts."""
    resp = httpx.get(f"{PKI_EXPORTER_URL}/metrics", timeout=10)
    resp.raise_for_status()
    return parse_prometheus(resp.text)


def parse_prometheus(text):
    """Parse Prometheus exposition format into structured records."""
    records = []
    # Matches: metric_name{label="val",...} value [timestamp]
    # The value token is validated by float() below rather than by the regex.
    pattern = re.compile(
        r'^([a-zA-Z_:][a-zA-Z0-9_:]*)'
        r'(?:\{(.*)\})?'
        r'\s+(\S+)'
        r'(?:\s+-?\d+)?$'
    )
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = pattern.match(line)
        if not m:
            continue
        name, labels_str, raw_value = m.group(1), m.group(2) or "", m.group(3)
        try:
            value = float(raw_value)
        except ValueError:
            continue  # not a numeric sample
        # NaN and +/-Inf are skipped so later int() conversions stay safe
        if not math.isfinite(value):
            continue
        # Label values are assumed not to contain escaped quotes
        labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_str))
        records.append({"metric": name, "value": value, **labels})
    return records


def get_metric(records, name, **label_filters):
    """Filter parsed metrics by name and optional label values."""
    results = [r for r in records if r["metric"] == name]
    for k, v in label_filters.items():
        results = [r for r in results if r.get(k) == v]
    return results


# Test connection
try:
    metrics = fetch_metrics()
    print(f"Connected to PKI Exporter. Parsed {len(metrics)} metric samples.")
except Exception as e:
    print(f"Failed to connect to PKI Exporter: {e}")
    print("Make sure the monitoring stack is running (Phase 10 of start-lab.sh).")
    metrics = []

## CA Health Status

Shows whether each CA is reachable. A value of `1` means the CA responded to a health check; `0` means it is down or unreachable.
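
When only the failures matter, the same records can be reduced to a list of unreachable CAs. The sketch below assumes records shaped like the output of `parse_prometheus()`; `down_cas` and the example data are hypothetical.

```python
def down_cas(records):
    """Return sorted (pki_type, ca_level) pairs whose pki_ca_up value is not 1."""
    return sorted(
        (r.get("pki_type", "?"), r.get("ca_level", "?"))
        for r in records
        if r["metric"] == "pki_ca_up" and r["value"] != 1
    )


# Hypothetical records in the shape produced by parse_prometheus()
example = [
    {"metric": "pki_ca_up", "value": 1.0, "pki_type": "rsa", "ca_level": "root"},
    {"metric": "pki_ca_up", "value": 0.0, "pki_type": "pq", "ca_level": "est"},
    {"metric": "pki_ca_up", "value": 0.0, "pki_type": "ecc", "ca_level": "iot"},
]
print(down_cas(example))
```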

In [None]:
if metrics:
    health = get_metric(metrics, "pki_ca_up")
    if health:
        df = pd.DataFrame(health)
        df["status"] = df["value"].apply(lambda v: "UP" if v == 1 else "DOWN")
        if {"pki_type", "ca_level"} <= set(df.columns):
            pivot = df.pivot_table(
                index="ca_level", columns="pki_type",
                values="status", aggfunc="first"
            )
            # Reorder columns and rows for readability; anything outside the
            # preferred order is kept and appended at the end
            col_order = ["rsa", "ecc", "pq"]
            row_order = ["root", "intermediate", "iot", "est", "acme"]
            cols = [c for c in col_order if c in pivot.columns]
            cols += [c for c in pivot.columns if c not in col_order]
            rows = [r for r in row_order if r in pivot.index]
            rows += [r for r in pivot.index if r not in row_order]
            pivot = pivot.loc[rows, cols]
            print("CA Health Status (ca_level x pki_type):")
            display(pivot)
        else:
            # Labels needed for the pivot are missing; show the raw samples
            display(df)
    else:
        print("No pki_ca_up metrics found.")
else:
    print("No metrics loaded. Run the cell above first.")

## Certificate Inventory

Number of **VALID** and **REVOKED** certificates per CA, as reported by the Dogtag REST API.
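
A derived figure that is sometimes useful alongside the raw counts is the share of revoked certificates per CA. The following is a rough sketch, assuming records shaped like the output of `parse_prometheus()` with a `status` label of `VALID` or `REVOKED`; `revoked_share` and the example counts are hypothetical.

```python
from collections import defaultdict


def revoked_share(records):
    """Map (pki_type, ca_level) to the fraction of counted certs that are REVOKED."""
    counts = defaultdict(lambda: {"VALID": 0, "REVOKED": 0})
    for r in records:
        if r["metric"] != "pki_certificates_total":
            continue
        if r.get("status") not in ("VALID", "REVOKED"):
            continue
        key = (r.get("pki_type"), r.get("ca_level"))
        counts[key][r["status"]] += int(r["value"])

    shares = {}
    for key, c in counts.items():
        total = c["VALID"] + c["REVOKED"]
        shares[key] = c["REVOKED"] / total if total else 0.0
    return shares


# Hypothetical records in the shape produced by parse_prometheus()
example = [
    {"metric": "pki_certificates_total", "value": 90.0,
     "pki_type": "rsa", "ca_level": "iot", "status": "VALID"},
    {"metric": "pki_certificates_total", "value": 10.0,
     "pki_type": "rsa", "ca_level": "iot", "status": "REVOKED"},
]
shares = revoked_share(example)
print(shares)
```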

In [None]:
if metrics:
    certs = get_metric(metrics, "pki_certificates_total")
    if certs:
        df = pd.DataFrame(certs)
        df["count"] = df["value"].astype(int)
        cols = [c for c in ["pki_type", "ca_level", "status", "count"] if c in df.columns]
        if "status" in df.columns:
            pivot = df.pivot_table(
                index=["pki_type", "ca_level"], columns="status",
                values="count", aggfunc="sum", fill_value=0
            )
            print("Certificate Inventory:")
            display(pivot)
        else:
            display(df[cols])
    else:
        print("No pki_certificates_total metrics found.")
else:
    print("No metrics loaded.")

## CRL Status

Certificate Revocation List timing and entry count for each CA.
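
One check that follows naturally from these timestamps is whether a CRL is overdue, meaning its next-update time has already passed. The sketch below assumes the timestamps are Unix epoch seconds (consistent with how this notebook formats them) and that records are shaped like the output of `parse_prometheus()`; `overdue_crls` and the example data are hypothetical.

```python
import time


def overdue_crls(records, now=None):
    """Return sorted (pki_type, ca_level) pairs whose next-update time has passed."""
    now = time.time() if now is None else now
    return sorted(
        (r.get("pki_type", "?"), r.get("ca_level", "?"))
        for r in records
        # A value of 0 or less is treated as "not reported", not as overdue
        if r["metric"] == "pki_crl_next_update_timestamp" and 0 < r["value"] < now
    )


# Hypothetical records, evaluated against a fixed reference time
NOW = 1_700_000_000
example = [
    {"metric": "pki_crl_next_update_timestamp", "value": NOW - 1000,
     "pki_type": "ecc", "ca_level": "root"},
    {"metric": "pki_crl_next_update_timestamp", "value": NOW + 3600,
     "pki_type": "rsa", "ca_level": "root"},
    {"metric": "pki_crl_next_update_timestamp", "value": 0,
     "pki_type": "pq", "ca_level": "root"},
]
print(overdue_crls(example, now=NOW))
```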

In [None]:
if metrics:
    crl_last = get_metric(metrics, "pki_crl_last_update_timestamp")
    crl_next = get_metric(metrics, "pki_crl_next_update_timestamp")
    crl_entries = get_metric(metrics, "pki_crl_entries_total")

    if crl_last or crl_next or crl_entries:
        def fmt_ts(ts):
            """Format a Unix timestamp as local time, or "N/A" if missing."""
            if not ts or ts <= 0:
                return "N/A"
            return datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")

        # Index each metric by (pki_type, ca_level)
        def by_ca(samples):
            return {(r.get("pki_type"), r.get("ca_level")): r["value"] for r in samples}

        lookup_last = by_ca(crl_last)
        lookup_next = by_ca(crl_next)
        lookup_entries = by_ca(crl_entries)

        # Cover every CA that reported at least one CRL metric
        keys = dict.fromkeys([*lookup_last, *lookup_next, *lookup_entries])
        rows = []
        for key in keys:
            entries = lookup_entries.get(key)
            rows.append({
                "pki_type": key[0],
                "ca_level": key[1],
                "last_update": fmt_ts(lookup_last.get(key)),
                "next_update": fmt_ts(lookup_next.get(key)),
                "entries": int(entries) if entries is not None else "N/A",
            })

        if rows:
            df = pd.DataFrame(rows)
            print("CRL Status:")
            display(df.set_index(["pki_type", "ca_level"]))
        else:
            print("No CRL data parsed.")
    else:
        print("No CRL metrics found.")
else:
    print("No metrics loaded.")

## OCSP Latency

OCSP response time per CA. Status thresholds: **[OK]** < 200 ms, **[WARN]** < 500 ms, **[SLOW]** ≥ 500 ms.

In [None]:
if metrics:
    ocsp = get_metric(metrics, "pki_ocsp_response_seconds")
    if ocsp:
        rows = []
        for r in ocsp:
            latency_ms = r["value"] * 1000
            if latency_ms < 200:
                indicator = "[OK]"
            elif latency_ms < 500:
                indicator = "[WARN]"
            else:
                indicator = "[SLOW]"
            rows.append({
                "pki_type": r.get("pki_type", "?"),
                "ca_level": r.get("ca_level", "?"),
                "latency_ms": f"{latency_ms:.1f}",
                "status": indicator,
            })
        df = pd.DataFrame(rows)
        print("OCSP Response Latency:")
        display(df.set_index(["pki_type", "ca_level"]))
    else:
        print("No pki_ocsp_response_seconds metrics found.")
else:
    print("No metrics loaded.")

## Performance Metrics

Issuance and revocation throughput, plus issuance latency percentiles, from the most recent performance test run. These metrics are generated by `./lab perf-test` and exposed via the PKI Exporter.
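
Latency percentiles arrive as separate samples distinguished by a `quantile` label. As a small sketch, the hypothetical helper below regroups them per PKI type under names such as `p50` and `p95`. The quantile values shown (`0.5`, `0.95`) are assumptions about what the exporter emits, and the latencies are invented.

```python
def percentile_table(records):
    """Map pki_type to {"p50": ms, ...} built from quantile-labelled samples."""
    table = {}
    for r in records:
        if r["metric"] != "pki_issuance_duration_seconds" or "quantile" not in r:
            continue
        # "0.95" -> "p95"; the :g format drops trailing zeros
        label = f"p{float(r['quantile']) * 100:g}"
        pki = r.get("pki_type", "?")
        # Convert seconds to milliseconds, rounded to one decimal place
        table.setdefault(pki, {})[label] = round(r["value"] * 1000, 1)
    return table


# Hypothetical records in the shape produced by parse_prometheus()
example = [
    {"metric": "pki_issuance_duration_seconds", "value": 0.120,
     "pki_type": "rsa", "quantile": "0.5"},
    {"metric": "pki_issuance_duration_seconds", "value": 0.480,
     "pki_type": "rsa", "quantile": "0.95"},
]
table = percentile_table(example)
print(table)
```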

In [None]:
if metrics:
    # Throughput rates
    issue_rate = get_metric(metrics, "pki_issuance_rate")
    revoke_rate = get_metric(metrics, "pki_revocation_rate")
    issue_total = get_metric(metrics, "pki_issuance_total")
    revoke_total = get_metric(metrics, "pki_revocation_total")

    if issue_rate or revoke_rate:
        rows = []
        rate_lookup = {r.get("pki_type"): r["value"] for r in revoke_rate}
        total_issue = {r.get("pki_type"): int(r["value"]) for r in issue_total}
        total_revoke = {r.get("pki_type"): int(r["value"]) for r in revoke_total}
        for r in issue_rate:
            pki = r.get("pki_type", "?")
            rows.append({
                "pki_type": pki,
                "issued": total_issue.get(pki, "N/A"),
                "revoked": total_revoke.get(pki, "N/A"),
                "issue_rate_per_sec": f"{r['value']:.2f}",
                "revoke_rate_per_sec": f"{rate_lookup.get(pki, 0):.2f}",
            })
        if rows:
            print("Performance Throughput:")
            display(pd.DataFrame(rows).set_index("pki_type"))
    else:
        print("No throughput metrics found. Run ./lab perf-test to generate data.")

    # Latency percentiles
    latency = get_metric(metrics, "pki_issuance_duration_seconds")
    if latency:
        rows = []
        for r in latency:
            rows.append({
                "pki_type": r.get("pki_type", "?"),
                "quantile": r.get("quantile", "?"),
                # Keep the value numeric so the pivot sorts and aligns correctly
                "latency_ms": round(r["value"] * 1000, 1),
            })
        if rows:
            df = pd.DataFrame(rows)
            pivot = df.pivot_table(
                index="pki_type", columns="quantile",
                values="latency_ms", aggfunc="first"
            )
            print("\nIssuance Latency Percentiles (ms):")
            display(pivot)
    else:
        print("No latency percentile metrics found.")
else:
    print("No metrics loaded.")

## Auto-Refresh Dashboard

Polls the PKI Exporter every 15 seconds and displays a summary. **Interrupt the kernel** (stop button) to halt.
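
If an unbounded loop is inconvenient, for example when running the notebook non-interactively, a bounded variant is one alternative. This is a sketch only: `poll` is a hypothetical helper that takes the fetch and render callables as parameters and stops after `max_polls` iterations. The stub lambda stands in for a real fetch so the example runs without the exporter.

```python
import time


def poll(fetch, render, interval=15, max_polls=None, sleep=time.sleep):
    """Call render(fetch()) every `interval` seconds.

    Stops after `max_polls` iterations (None means run until interrupted).
    Returns the number of polls that completed without raising.
    """
    succeeded = 0
    done = 0
    while max_polls is None or done < max_polls:
        try:
            render(fetch())
            succeeded += 1
        except Exception as exc:  # KeyboardInterrupt still propagates
            print(f"Error fetching metrics: {exc}")
        done += 1
        # Skip the final sleep once the poll budget is used up
        if max_polls is None or done < max_polls:
            sleep(interval)
    return succeeded


# Stub fetch/render so the sketch runs without a live exporter
seen = []
completed = poll(
    lambda: [{"metric": "pki_ca_up", "value": 1.0}],
    seen.append,
    interval=0,
    max_polls=3,
)
print(completed, len(seen))
```

With this notebook's own functions, the call would look like `poll(fetch_metrics, dashboard_summary, max_polls=20)`.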

In [None]:
def dashboard_summary(metrics):
    """Print a compact health summary."""
    # CA health
    health = get_metric(metrics, "pki_ca_up")
    up = sum(1 for h in health if h["value"] == 1)
    total = len(health)
    print(f"CAs: {up}/{total} UP")

    # Certificate counts
    certs = get_metric(metrics, "pki_certificates_total")
    valid = sum(int(c["value"]) for c in certs if c.get("status") == "VALID")
    revoked = sum(int(c["value"]) for c in certs if c.get("status") == "REVOKED")
    print(f"Certificates: {valid} valid, {revoked} revoked")

    # OCSP average
    ocsp = get_metric(metrics, "pki_ocsp_response_seconds")
    if ocsp:
        avg_ms = sum(o["value"] for o in ocsp) / len(ocsp) * 1000
        print(f"OCSP avg latency: {avg_ms:.1f} ms")

    # Per-PKI breakdown
    for pki in ["rsa", "ecc", "pq"]:
        pki_health = [h for h in health if h.get("pki_type") == pki]
        if pki_health:
            pki_up = sum(1 for h in pki_health if h["value"] == 1)
            print(f"  {pki.upper()}: {pki_up}/{len(pki_health)} CAs up")


POLL_INTERVAL = 15  # seconds
print(f"Auto-refreshing every {POLL_INTERVAL}s. Interrupt kernel to stop.\n")

try:
    while True:
        clear_output(wait=True)
        ts = datetime.now().strftime("%H:%M:%S")
        print(f"PKI Health Dashboard  [{ts}]  (refreshes every {POLL_INTERVAL}s)")
        print("=" * 50)
        try:
            m = fetch_metrics()
            dashboard_summary(m)
        except Exception as e:
            print(f"Error fetching metrics: {e}")
        time.sleep(POLL_INTERVAL)
except KeyboardInterrupt:
    print("\nDashboard stopped.")