feat: add aggregated API server implementation #2
Merged
scotwells merged 1 commit into feat/clickhouse-setup on Dec 17, 2025
Implement a Kubernetes aggregated API server for querying audit logs from ClickHouse storage. Includes API types, REST storage implementation with cursor-based pagination, CEL expression filtering, and OpenTelemetry instrumentation for metrics and tracing.
scotwells added a commit that referenced this pull request on Dec 17, 2025
## Summary

This PR builds on #1 and #2 to introduce a complete, functional end-to-end testing environment for the new activity service in a [local kind cluster](https://kind.sigs.k8s.io).

The test environment is built on top of a [test-infra] cluster that includes base services like flux, envoy gateway, and a telemetry stack.

The test environment includes a [Vector](https://vector.dev), [NATS](https://nats.io), [Clickhouse](https://clickhouse.com) pipeline that automatically collects audit logs emitted from the test-infra kind cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```

## Details

The apiserver deployment manifests are structured as a standard `base` kustomize deployment that includes the kubernetes Deployment, Service, and RBAC resources.

The following kustomize components have been introduced to provide optional functionality that can be enabled in environments when necessary.

- **api-registration**: Configures the APIService registration with the k8s apiserver to proxy requests to the activity apiserver
- **cert-manager-ca**: Configures a namespaced cert issuer to use with the activity apiserver
- **grafana-clickhouse**: Configures a new Grafana datasource to connect to the deployed clickhouse instance
- **namespace**: Creates a namespace to use for the system's deployment
- **nats-stream**: Creates a new nats JetStream to use for the audit log pipeline
- **tracing**: Configures the APIserver with a tracing configuration
- **vector-aggregator**: Deploys an aggregated version of Vector that ingests audit logs from NATS and writes them to Clickhouse
- **vector-sidecar**: Deployment of Vector that runs on every node in the cluster that's responsible for collecting audit logs from apiservers and writing them to NATS

Also included is deployment automation for deploying the system's dependencies:

- **[Clickhouse operator](https://github.com/Altinity/clickhouse-operator)**: Manages deployments of Clickhouse through CRDs
- **NATS**: Deploys an instance of NATS and [NACK](https://github.com/nats-io/nack) to configure NATS through CRDs

After deploying a fresh test environment, I used a [kubectl-plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/) to query the activity API and retrieve audit logs that have been collected through the pipeline.

```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP VERB USER NAMESPACE RESOURCE NAME STATUS
2025-12-17 17:07:12 get system:serviceaccount:kyverno:kyverno-background-controller 200
2025-12-17 17:07:11 get system:anonymous 200
2025-12-17 17:07:10 get system:anonymous 200
2025-12-17 17:07:10 get system:serviceaccount:kyverno:kyverno-reports-controller 200
2025-12-17 17:07:10 get system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator telemetry-system secrets tls-assets-vmalert-telemetry-system-vm 200
2025-12-17 17:07:10 get system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator telemetry-system secrets vmalert-telemetry-system-vm 200
2025-12-17 17:07:10 get system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator telemetry-system configmaps vm-telemetry-system-vm-rulefiles-0 200
2025-12-17 17:07:10 watch system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator daemonsets 200
2025-12-17 17:07:10 watch system:serviceaccount:kyverno:kyverno-admission-controller generatingpolicies 200
2025-12-17 17:07:10 watch system:serviceaccount:flux-system:image-reflector-controller imagepolicies 200
More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
```

Here's a dashboard I'm working on that was loaded into the local environment to visualize the health and performance of the pipeline. The source for this dashboard will be included in a follow-up PR.

<img width="1385" height="1221" alt="image" src="https://github.com/user-attachments/assets/4328f182-d167-4565-a7e7-9703419c2b48" />

[test-infra]: https://github.com/datum-cloud/test-infra

## Up Next

- CI pipeline integration w/ end-to-end testing
- CLI package for interacting with the new activity audit log querying API w/ [kubectl plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
- Operational dashboards & performance testing

---

Relates to https://github.com/datum-cloud/engineering/issues/90
scotwells added a commit that referenced this pull request on Jan 21, 2026
This PR introduces a new `history` command that can be used to diff a resource over time. This is helpful for analyzing what changes were made to a resource and who made the change.

What's shown here is an example of running the command against a resource in Datum Cloud so we could analyze the changes one of our system components was making to a resource.

```shell
▶ ./datumctl activity --project another-project-w24uyl history dnsrecordsets dns-record-set-www-ibm-com-5npagr-qp25o9 --diff

╭─────────────────────────────────────────────────────────────╮
│ Change #1 📝 patch [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯

📸 Initial state (oldest available change)
{
  "apiVersion": "dns.networking.miloapis.com/v1alpha1",
  "kind": "DNSRecordSet",
  "metadata": {
    "name": "dns-record-set-www-ibm-com-5npagr-qp25o9",
    "namespace": "default"
  },
  "spec": { ... },
  "status": { ... }
}

╭─────────────────────────────────────────────────────────────╮
│ Change #2 📝 patch [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯

📝 Changes: metadata only
--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",

╭─────────────────────────────────────────────────────────────╮
│ Change #3 📝 patch [200]
│ 🕐 2026-01-21 09:31:04
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯

📝 Changes: metadata only
--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",
```
## Summary
This PR implements a Kubernetes aggregated API server for querying audit logs from ClickHouse. Includes API types, REST storage implementation with cursor-based pagination, CEL expression filtering, and OpenTelemetry instrumentation for metrics and tracing.
Refer to the API documentation for more details on how to interact with the API. Refer to the apiserver architecture documentation for more information on the implementation of the apiserver.
> [!IMPORTANT]
> This PR is showing a very large diff because of the auto-generated openapi specification (~19k lines).
## Details
The `AuditLogQuery` type can be created to query the system for audit logs. The type is ephemeral: it is never actually persisted in the system. Instead, the audit log results are added to the status of the resource and returned to the client.
For example, a query can use relative timestamps to fetch audit logs from the last 7 days for objects in the `production` or `staging` namespace.

The `status` of the AuditLogQuery contains the results of the query, pagination information, and the effective start and end times used in the request.

To paginate, pass the cursor returned by the previous request in the `spec.continue` field of the next request. Cursor tokens are only valid with the same page size, filter, and time range, and they expire after 1 hour.

CEL expressions provide a powerful, type-safe query language that can be used to filter the audit logs returned in the results. Users put their CEL expression in `spec.filter`, with the following fields available to them:

Available Fields:
## Up Next
Relates to datum-cloud/enhancements#536