
feat: add aggregated API server implementation #2

Merged: scotwells merged 1 commit into `feat/clickhouse-setup` from `feat/introduce-aggregated-apiserver` on Dec 17, 2025

Conversation


@scotwells commented Dec 17, 2025

Summary

This PR implements a Kubernetes aggregated API server for querying audit logs from ClickHouse. It includes API types, a REST storage implementation with cursor-based pagination, CEL expression filtering, and OpenTelemetry instrumentation for metrics and tracing.

Refer to the API documentation for details on how to interact with the API, and to the apiserver architecture documentation for details on its implementation.

Important

This PR shows a very large diff because of the auto-generated OpenAPI specification (~19k lines).

Details

The AuditLogQuery type can be created to query the system for audit logs. The type is ephemeral: it is never actually persisted in the system. Instead, the audit log results are added to the status of the resource and returned directly to the client.

Here's an example showing how to use relative timestamps to query audit logs over the last 7 days for objects in the production or staging namespaces.

```yaml
apiVersion: activity.miloapis.com/v1alpha1
kind: AuditLogQuery
metadata:
  name: production-staging-events
spec:
  # Query the last 7 days
  startTime: "now-7d"
  endTime: "now"
  # Query events from multiple namespaces
  filter: "objectRef.namespace in ['production', 'staging']"
  limit: 10
```

The status of the AuditLogQuery will contain the results of the query, pagination information, and the effective start / end times used in the request.

```yaml
apiVersion: activity.miloapis.com/v1alpha1
kind: AuditLogQuery
metadata:
  name: last-7-days
spec:
  # Query the last 7 days
  startTime: "now-7d"
  endTime: "now"
  # Query events from multiple namespaces
  filter: "objectRef.namespace in ['production', 'staging']"
  limit: 10
status:
  continue: eyJ0IjoiMjAyNS0xMi0xN1QxODozOToyOC43OTMyMTdaIiwiYSI6ImVhNDQ3NzAxLTE1ZTAtNDVkZC1hNDg3LTYxODY4ZTgzNWFlYiIsImgiOiJ4Y0dFYlZpMGpINk1QdTlpYTh1cTd3PT0iLCJpIjoiMjAyNS0xMi0xN1QxODozOTozMy4wNjQ1MzI4MDRaIn0=
  effectiveEndTime: "2025-12-17T18:39:32Z"
  effectiveStartTime: "2025-12-10T18:39:32Z"
  results:
  - annotations:
      authorization.k8s.io/decision: allow
      authorization.k8s.io/reason: 'RBAC: allowed by ClusterRoleBinding "kubeadm:cluster-admins"
        of ClusterRole "cluster-admin" to Group "kubeadm:cluster-admins"'
    apiVersion: audit.k8s.io/v1
    auditID: 983cf1c7-fc4c-40e1-8aaf-fa3a781ad04f
    kind: Event
    level: RequestResponse
  ...
```
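
The PR doesn't spell out the server's parsing rules for relative timestamps, but the status above shows `now-7d` resolving to exactly seven days before the effective end time. As a rough illustration only, assuming a `now-<N><unit>` grammar with day/hour/minute units (the grammar and unit set are assumptions, not documented behavior), resolution could look like:

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical resolver for relative timestamps like "now" and "now-7d".
# The supported units and grammar are assumptions for illustration.
UNITS = {"d": "days", "h": "hours", "m": "minutes"}

def resolve(ts: str, now: datetime) -> datetime:
    if ts == "now":
        return now
    m = re.fullmatch(r"now-(\d+)([dhm])", ts)
    if m is None:
        # Fall back to treating the value as an absolute RFC3339 timestamp
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return now - timedelta(**{UNITS[m.group(2)]: int(m.group(1))})

now = datetime(2025, 12, 17, 18, 39, 32, tzinfo=timezone.utc)
print(resolve("now-7d", now).isoformat())  # 2025-12-10T18:39:32+00:00
```

This matches the `effectiveStartTime` / `effectiveEndTime` pair echoed back in the status example above.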

Pagination is done by passing the cursor from the previous response in the spec.continue field of the next request. Cursor tokens are only valid with the same page size, filter, and time range, and expire after 1 hour.

```yaml
apiVersion: activity.miloapis.com/v1alpha1
kind: AuditLogQuery
metadata:
  name: production-staging-events
spec:
  # Fields from the previous page request
  ...
  continue: eyJ0IjoiMjAyNS0xMi0xN1QxODozOToyOC43OTM...
```
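
The continue token should be treated as opaque, but it is visibly base64-encoded JSON. Decoding the full token from the status example above shows its shape; the field meanings are my guess (a timestamp, the last audit ID, an integrity hash, and an issued-at time backing the 1-hour expiry), not documented structure:

```python
import base64
import json

# Full continue token from the status example above
token = ("eyJ0IjoiMjAyNS0xMi0xN1QxODozOToyOC43OTMyMTdaIiwiYSI6ImVhNDQ3NzAxLTE1ZTAt"
         "NDVkZC1hNDg3LTYxODY4ZTgzNWFlYiIsImgiOiJ4Y0dFYlZpMGpINk1QdTlpYTh1cTd3PT0i"
         "LCJpIjoiMjAyNS0xMi0xN1QxODozOTozMy4wNjQ1MzI4MDRaIn0=")

payload = json.loads(base64.b64decode(token))
print(sorted(payload))  # ['a', 'h', 'i', 't']
print(payload["t"])     # 2025-12-17T18:39:28.793217Z
```

The hash field is presumably what ties a cursor to its original page size, filter, and time range, which would explain why tokens are invalid under any other combination.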

CEL expressions provide a powerful, type-safe query language that can be used to filter the audit logs returned in the results. Users can set their CEL expression in spec.filter, with the following fields available to them:

Available Fields:

  • verb - API action: get, list, create, update, patch, delete, watch
  • auditID - unique event identifier
  • stage - request phase: RequestReceived, ResponseStarted, ResponseComplete, Panic
  • stageTimestamp - when this stage occurred (RFC3339 timestamp)
  • user.username - who made the request (user or service account)
  • responseStatus.code - HTTP response code (200, 201, 404, 500, etc.)
  • objectRef.namespace - target resource namespace
  • objectRef.resource - resource type (pods, deployments, secrets, configmaps, etc.)
  • objectRef.name - specific resource name
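
These fields compose with the standard CEL operators (`==`, `!=`, `&&`, `||`, `in`, numeric comparisons). A few illustrative filters, assuming the usual CEL operator set is available (the service account name below is hypothetical):

```
# Failed requests against secrets
objectRef.resource == 'secrets' && responseStatus.code >= 400

# Everything a specific (hypothetical) service account deleted
user.username == 'system:serviceaccount:kube-system:example-sa' && verb == 'delete'

# Writes in either namespace, extending the example query above
verb in ['create', 'update', 'patch', 'delete'] && objectRef.namespace in ['production', 'staging']
```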

Up Next

  • Deployment configurations for the apiserver
  • Vector / NATS pipeline configuration w/ e2e testing environment
  • Automated end-to-end testing with CI pipeline
  • CLI package for interacting with the new activity audit log querying API w/ kubectl plugin
  • Operational dashboards & performance testing

Relates to datum-cloud/enhancements#536

Implement a Kubernetes aggregated API server for querying audit logs
from ClickHouse storage. Includes API types, REST storage implementation
with cursor-based pagination, CEL expression filtering, and
OpenTelemetry instrumentation for metrics and tracing.

@ecv left a comment


Well damn.

@scotwells scotwells merged commit 3cc9799 into feat/clickhouse-setup Dec 17, 2025
@scotwells scotwells deleted the feat/introduce-aggregated-apiserver branch December 17, 2025 19:50
scotwells added a commit that referenced this pull request Dec 17, 2025
## Summary

This PR builds on #1 and #2 to introduce a complete and functional
end-to-end testing environment for testing the new activity service in a
[local kind cluster](https://kind.sigs.k8s.io). The test environment is
built on top of a [test-infra] cluster that includes base services like
flux, envoy gateway, and a telemetry stack.

The test environment includes a [Vector](https://vector.dev),
[NATS](https://nats.io), [Clickhouse](https://clickhouse.com) pipeline
that automatically collects audit logs emitted from the test-infra kind
cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```

## Details

The apiserver deployment manifests are structured as a standard `base`
kustomize deployment that includes the kubernetes Deployment, Service,
and RBAC resources.

The following kustomize components have been introduced to provide
optional functionality that can be enabled in environments when
necessary.

- **api-registration**: Configures the APIService registration with the
k8s apiserver to proxy requests to the activity apiserver
- **cert-manager-ca**: Configures a namespaced cert issuer to use with
the activity apiserver
- **grafana-clickhouse**: Configures a new Grafana datasource to connect
to the deployed clickhouse instance
- **namespace**: Creates a namespace to use for the system's deployment
- **nats-stream**: Creates a new nats JetStream to use for the audit log
pipeline
- **tracing**: Configures the apiserver with a tracing configuration
- **vector-aggregator**: Deploys an aggregated version of Vector that
ingests audit logs from NATS and writes them to Clickhouse
- **vector-sidecar**: Deploys Vector on every node in the cluster to
collect audit logs from apiservers and write them to NATS
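
These map onto kustomize's `components` feature, so an environment overlay can opt into just the pieces it needs. A minimal sketch (the relative paths are illustrative, not the repo's actual layout):

```yaml
# kustomization.yaml for a hypothetical environment overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/api-registration
  - ../../components/cert-manager-ca
  - ../../components/tracing
```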

Also included is deployment automation for deploying the system's
dependencies:

- **[Clickhouse
operator](https://github.com/Altinity/clickhouse-operator)**: Manages
deployments of Clickhouse through CRDs
- **NATS**: Deploys an instance of NATS and
[NACK](https://github.com/nats-io/nack) to configure NATS through CRDs

After deploying a fresh test environment, I used a
[kubectl-plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
to query the activity API and retrieve audit logs that have been
collected through the pipeline.

```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP             VERB    USER                                                                                   NAMESPACE          RESOURCE             NAME                                     STATUS
2025-12-17 17:07:12   get     system:serviceaccount:kyverno:kyverno-background-controller                                                                                                             200
2025-12-17 17:07:11   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:serviceaccount:kyverno:kyverno-reports-controller                                                                                                                200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              tls-assets-vmalert-telemetry-system-vm   200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              vmalert-telemetry-system-vm              200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   configmaps           vm-telemetry-system-vm-rulefiles-0       200
2025-12-17 17:07:10   watch   system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator                      daemonsets                                                    200
2025-12-17 17:07:10   watch   system:serviceaccount:kyverno:kyverno-admission-controller                                                generatingpolicies                                            200
2025-12-17 17:07:10   watch   system:serviceaccount:flux-system:image-reflector-controller                                              imagepolicies                                                 200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
``` 

Here's a dashboard I'm working on that was loaded into the local
environment to visualize the health and performance of the pipeline. The
source for this dashboard will be included in a follow up PR.

<img width="1385" height="1221" alt="image"
src="https://github.com/user-attachments/assets/4328f182-d167-4565-a7e7-9703419c2b48"
/>


[test-infra]: https://github.com/datum-cloud/test-infra

## Up Next

- CI pipeline integration w/ end-to-end testing
- CLI package for interacting with the new activity audit log querying
API w/ [kubectl
plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
- Operational dashboards & performance testing

--- 

Relates to https://github.com/datum-cloud/engineering/issues/90
scotwells added a commit that referenced this pull request Jan 21, 2026
This PR introduces a new `history` command that can be used to diff a
resource over time. This is helpful for analyzing what changes were made
to a resource and who made the change.

What's shown here is an example of running the command against a
resource in Datum Cloud so we could analyze the changes one of our
system components was making to a resource.

```shell
▶ ./datumctl activity --project another-project-w24uyl history dnsrecordsets dns-record-set-www-ibm-com-5npagr-qp25o9 --diff

╭─────────────────────────────────────────────────────────────╮
│ Change #1   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📸 Initial state (oldest available change)

{
  "apiVersion": "dns.networking.miloapis.com/v1alpha1",
  "kind": "DNSRecordSet",
  "metadata": {
    "name": "dns-record-set-www-ibm-com-5npagr-qp25o9",
    "namespace": "default"
  },
  "spec": {
    ...
  },
  "status": {
    ...
  }
}

╭─────────────────────────────────────────────────────────────╮
│ Change #2   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",

╭─────────────────────────────────────────────────────────────╮
│ Change #3   📝 patch    [200]
│ 🕐 2026-01-21 09:31:04
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",
```