Add ClickHouse database for audit log storage #1
Merged
Set up a ClickHouse database to store Kubernetes audit events from the Activity API. This provides fast queries over large volumes of audit data.

What this adds:
- Database schema for storing audit events with multi-tenant scoping
- Automated schema migrations that run on deployment
- Hot/cold storage tiering (recent data on SSD, old data on S3)
- Task commands for managing migrations (`task migrations:local`, etc.)

How it works:
- Migrations run automatically via a Kubernetes Job on each deployment
- The Job checks which migrations have been applied and runs new ones
- After 5 minutes, the Job deletes itself and gets recreated by GitOps
- All migration SQL lives in the `migrations/` directory

Storage setup:
- ClickHouse stores recent audit events on local disks for fast queries
- After 90 days, events move to S3-compatible storage (RustFS in test)
- This keeps costs low while maintaining query performance

Try it:
```shell
task migrations:local        # Run migrations against local ClickHouse
task migrations:new NAME=foo # Create a new migration
task migrations:generate     # Update Kubernetes ConfigMap
```
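As a sketch of what the tiering setup might look like inside one of these migrations (the table, columns, and `tiered` storage policy names are illustrative assumptions, not the actual schema in `migrations/`):

```sql
-- Illustrative sketch only: table, columns, and the 'tiered' storage
-- policy are assumptions, not the actual migration SQL.
CREATE TABLE IF NOT EXISTS audit.events
(
    tenant_id   String,                 -- multi-tenant scoping key
    event_time  DateTime64(6),
    verb        LowCardinality(String),
    user        String,
    namespace   String,
    resource    LowCardinality(String),
    name        String,
    status_code UInt16,
    raw         String                  -- full audit event JSON
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time)
-- Hot/cold tiering: parts older than 90 days move to the S3-backed volume.
TTL toDateTime(event_time) + INTERVAL 90 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';
```

A `TTL ... TO VOLUME` clause like this is what lets ClickHouse migrate whole data parts from local disk to the S3-backed volume in the background, rather than requiring an external job to copy rows.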
scotwells added a commit that referenced this pull request on Dec 17, 2025:
## Summary

This PR builds on #1 and #2 to introduce a complete, functional end-to-end testing environment for the new activity service in a [local kind cluster](https://kind.sigs.k8s.io). The test environment is built on top of a [test-infra] cluster that includes base services like Flux, Envoy Gateway, and a telemetry stack. It includes a [Vector](https://vector.dev) / [NATS](https://nats.io) / [ClickHouse](https://clickhouse.com) pipeline that automatically collects audit logs emitted from the test-infra kind cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```

## Details

The apiserver deployment manifests are structured as a standard `base` kustomize deployment that includes the Kubernetes Deployment, Service, and RBAC resources. The following kustomize components have been introduced to provide optional functionality that can be enabled in environments where necessary.
- **api-registration**: Configures the APIService registration with the Kubernetes apiserver to proxy requests to the activity apiserver
- **cert-manager-ca**: Configures a namespaced cert issuer for use with the activity apiserver
- **grafana-clickhouse**: Configures a new Grafana datasource that connects to the deployed ClickHouse instance
- **namespace**: Creates a namespace for the system's deployment
- **nats-stream**: Creates a new NATS JetStream stream for the audit log pipeline
- **tracing**: Configures the apiserver with a tracing configuration
- **vector-aggregator**: Deploys an aggregator instance of Vector that ingests audit logs from NATS and writes them to ClickHouse
- **vector-sidecar**: Deploys Vector on every node in the cluster, responsible for collecting audit logs from apiservers and writing them to NATS

Also included is deployment automation for the system's dependencies:

- **[ClickHouse operator](https://github.com/Altinity/clickhouse-operator)**: Manages deployments of ClickHouse through CRDs
- **NATS**: Deploys an instance of NATS and [NACK](https://github.com/nats-io/nack) to configure NATS through CRDs

After deploying a fresh test environment, I used a [kubectl plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/) to query the activity API and retrieve audit logs that had been collected through the pipeline.
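As a rough sketch of how the **vector-aggregator** component wires NATS to ClickHouse (the service addresses, subject, and table names below are assumptions for illustration; the real configuration lives in the component's manifests):

```yaml
# Illustrative Vector aggregator config; addresses, subject, and table
# names are assumptions, not the actual component configuration.
sources:
  audit_events:
    type: nats
    url: nats://nats.nats-system.svc:4222
    subject: "audit.events.>"

sinks:
  audit_clickhouse:
    type: clickhouse
    inputs: ["audit_events"]
    endpoint: http://clickhouse.activity-system.svc:8123
    database: audit
    table: events
    batch:
      max_events: 1000
      timeout_secs: 5
```

Batching at the aggregator is what keeps ClickHouse happy: many small inserts from per-node sidecars would create excessive merge pressure, so the sidecars fan in through NATS and the aggregator performs fewer, larger inserts.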
```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP            VERB   USER                                                                                  NAMESPACE         RESOURCE            NAME                                    STATUS
2025-12-17 17:07:12  get    system:serviceaccount:kyverno:kyverno-background-controller                                                                                                        200
2025-12-17 17:07:11  get    system:anonymous                                                                                                                                                   200
2025-12-17 17:07:10  get    system:anonymous                                                                                                                                                   200
2025-12-17 17:07:10  get    system:serviceaccount:kyverno:kyverno-reports-controller                                                                                                           200
2025-12-17 17:07:10  get    system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator  telemetry-system  secrets             tls-assets-vmalert-telemetry-system-vm  200
2025-12-17 17:07:10  get    system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator  telemetry-system  secrets             vmalert-telemetry-system-vm             200
2025-12-17 17:07:10  get    system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator  telemetry-system  configmaps          vm-telemetry-system-vm-rulefiles-0      200
2025-12-17 17:07:10  watch  system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator                    daemonsets                                                  200
2025-12-17 17:07:10  watch  system:serviceaccount:kyverno:kyverno-admission-controller                                              generatingpolicies                                          200
2025-12-17 17:07:10  watch  system:serviceaccount:flux-system:image-reflector-controller                                            imagepolicies                                               200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
```

Here's a dashboard I'm working on that was loaded into the local environment to visualize the health and performance of the pipeline. The source for this dashboard will be included in a follow-up PR.
<img width="1385" height="1221" alt="image" src="https://github.com/user-attachments/assets/4328f182-d167-4565-a7e7-9703419c2b48" />

[test-infra]: https://github.com/datum-cloud/test-infra

## Up Next

- CI pipeline integration w/ end-to-end testing
- CLI package for interacting with the new activity audit log querying API w/ [kubectl plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
- Operational dashboards & performance testing

---

Relates to https://github.com/datum-cloud/engineering/issues/90
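One note on pagination: the `--continue-after` token in the query output above is an opaque, base64-encoded cursor, and clients should pass it back verbatim rather than parse it. Purely for illustration, a token of the same general shape can be inspected like this (the payload below is made up, not the server's actual cursor format):

```shell
# Decode a made-up cursor of the same shape; real tokens should be
# treated as opaque and passed back to the server unmodified.
echo 'eyJ0IjoiMjAyNSJ9' | base64 -d
# {"t":"2025"}
```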
scotwells added a commit that referenced this pull request on Jan 21, 2026:
This PR introduces a new `history` command that can be used to diff a resource over time. This is helpful for analyzing what changes were made to a resource and who made the change. Shown here is an example of running the command against a resource in Datum Cloud so we could analyze the changes one of our system components was making to a resource.

```shell
▶ ./datumctl activity --project another-project-w24uyl history dnsrecordsets dns-record-set-www-ibm-com-5npagr-qp25o9 --diff
╭─────────────────────────────────────────────────────────────╮
│ Change #1 📝 patch [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📸 Initial state (oldest available change)
{
  "apiVersion": "dns.networking.miloapis.com/v1alpha1",
  "kind": "DNSRecordSet",
  "metadata": {
    "name": "dns-record-set-www-ibm-com-5npagr-qp25o9",
    "namespace": "default"
  },
  "spec": { ... },
  "status": { ... }
}
╭─────────────────────────────────────────────────────────────╮
│ Change #2 📝 patch [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only
--- Previous
+++ Current
@@ -143,7 +143,7 @@
       "conditions": [
         {
           "lastTransitionTime": "2025-12-03T10:10:17Z",
-          "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
+          "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
           "observedGeneration": 1,
           "reason": "PDNSError",
           "status": "False",
╭─────────────────────────────────────────────────────────────╮
│ Change #3 📝 patch [200]
│ 🕐 2026-01-21 09:31:04
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only
--- Previous
+++ Current
@@ -143,7 +143,7 @@
       "conditions": [
         {
           "lastTransitionTime": "2025-12-03T10:10:17Z",
-          "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
+          "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
           "observedGeneration": 1,
           "reason": "PDNSError",
           "status": "False",
```