
Add ClickHouse database for audit log storage#1

Merged
scotwells merged 3 commits into main from feat/clickhouse-setup
Dec 15, 2025

Conversation

@scotwells
Contributor

Summary

Set up a ClickHouse database to store Kubernetes audit events from the Activity API, providing fast queries over large volumes of audit data.

What this adds:

  • Database schema for storing audit events with multi-tenant scoping
  • Automated schema migrations that run on deployment
  • Hot/cold storage tiering (recent data on SSD, old data on S3)
  • Task commands for managing migrations (task migrations:local, etc.)
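The schema bullet above can be sketched roughly as follows; the actual column names, types, and sort key live in the migrations/ directory, so everything below is illustrative:

```sql
-- Hypothetical shape of the audit events table (names are assumptions).
CREATE TABLE IF NOT EXISTS audit_events
(
    tenant_id   String,                  -- multi-tenant scoping key
    audit_id    UUID,
    event_time  DateTime64(9),
    verb        LowCardinality(String),
    user        String,
    namespace   String,
    resource    LowCardinality(String),
    name        String,
    status_code UInt16,
    raw_event   String                   -- full audit event as JSON
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time, audit_id);
```

Sorting by `(tenant_id, event_time)` keeps each tenant's events contiguous on disk, which is what makes multi-tenant scoping cheap to enforce at query time.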

How it works:

  • Migrations run automatically via Kubernetes Job on each deployment
  • The Job checks which migrations have been applied and runs new ones
  • After 5 minutes, the Job deletes itself and gets recreated by GitOps
  • All migration SQL lives in migrations/ directory
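A migration Job like the one described typically tracks applied migrations in a small bookkeeping table so reruns are idempotent. The table and column names below are assumptions, not the repo's actual schema:

```sql
-- Hypothetical bookkeeping table the migration Job could consult.
CREATE TABLE IF NOT EXISTS schema_migrations
(
    name       String,                   -- migration file name, e.g. 0001_init.sql
    applied_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY name;

-- The Job compares this list against migrations/ and runs only new files.
SELECT name FROM schema_migrations ORDER BY name;
```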

Storage setup:

  • ClickHouse stores recent audit events on local disks for fast queries
  • After 90 days, events move to S3-compatible storage (RustFS in test environment)
  • This keeps costs low while maintaining query performance
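In ClickHouse terms, this kind of tiering is usually expressed as a table `TTL` that moves old parts to a cold volume backed by S3. The table, column, and volume names here are assumptions about how the storage policy is wired up:

```sql
-- Assumed names: table 'audit_events', column 'event_time', and a 'cold'
-- S3-backed volume defined in the deployed storage policy.
ALTER TABLE audit_events
    MODIFY TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold';
```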

Try it:

  task migrations:local          # Run migrations against local ClickHouse
  task migrations:new NAME=foo   # Create a new migration
  task migrations:generate       # Update Kubernetes ConfigMap


@jszychowski-datum jszychowski-datum left a comment


LGTM

Contributor

@JoseSzycho JoseSzycho left a comment


LGTM x 2

@scotwells scotwells merged commit 6871a8f into main Dec 15, 2025
scotwells added a commit that referenced this pull request Dec 17, 2025
## Summary

This PR builds on #1 and #2 to introduce a complete and functional
end-to-end testing environment for testing the new activity service in a
[local kind cluster](https://kind.sigs.k8s.io). The test environment is
built on top of a [test-infra] cluster that includes base services like
flux, envoy gateway, and a telemetry stack.

The test environment includes a [Vector](https://vector.dev),
[NATS](https://nats.io), and [ClickHouse](https://clickhouse.com) pipeline
that automatically collects audit logs emitted from the test-infra kind
cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```

## Details

The apiserver deployment manifests are structured as a standard `base`
kustomize deployment that includes the kubernetes Deployment, Service,
and RBAC resources.

The following kustomize components provide optional functionality that can
be enabled per environment as needed.

- **api-registration**: Configures the APIService registration with the
k8s apiserver to proxy requests to the activity apiserver
- **cert-manager-ca**: Configures a namespaced cert issuer to use with
the activity apiserver
- **grafana-clickhouse**: Configures a new Grafana datasource to connect
to the deployed clickhouse instance
- **namespace**: Creates a namespace to use for the system's deployment
- **nats-stream**: Creates a new nats JetStream to use for the audit log
pipeline
- **tracing**: Configures tracing for the API server
- **vector-aggregator**: Deploys Vector in aggregator mode to ingest audit
logs from NATS and write them to ClickHouse
- **vector-sidecar**: Runs Vector on every node in the cluster to collect
audit logs from API servers and write them to NATS

Also included is deployment automation for deploying the system's
dependencies:

- **[ClickHouse
operator](https://github.com/Altinity/clickhouse-operator)**: Manages
ClickHouse deployments through CRDs
- **NATS**: Deploys an instance of NATS and
[NACK](https://github.com/nats-io/nack) to configure NATS through CRDs

After deploying a fresh test environment, I used a
[kubectl-plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
to query the activity API and retrieve audit logs that have been
collected through the pipeline.

```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP             VERB    USER                                                                                   NAMESPACE          RESOURCE             NAME                                     STATUS
2025-12-17 17:07:12   get     system:serviceaccount:kyverno:kyverno-background-controller                                                                                                             200
2025-12-17 17:07:11   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:serviceaccount:kyverno:kyverno-reports-controller                                                                                                                200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              tls-assets-vmalert-telemetry-system-vm   200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              vmalert-telemetry-system-vm              200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   configmaps           vm-telemetry-system-vm-rulefiles-0       200
2025-12-17 17:07:10   watch   system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator                      daemonsets                                                    200
2025-12-17 17:07:10   watch   system:serviceaccount:kyverno:kyverno-admission-controller                                                generatingpolicies                                            200
2025-12-17 17:07:10   watch   system:serviceaccount:flux-system:image-reflector-controller                                              imagepolicies                                                 200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
``` 
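For reference, the `CEL → SQL` step in the diagram means the filter in the command above gets compiled into a ClickHouse query along these lines (table and column names are assumptions, not the actual generated SQL):

```sql
-- Hypothetical translation of: --filter='objectRef.resource != "leases"'
SELECT event_time, verb, user, namespace, resource, name, status_code
FROM audit_events
WHERE resource != 'leases'            -- from the CEL expression
  AND tenant_id = {tenant:String}     -- tenant scoping injected server-side
ORDER BY event_time DESC
LIMIT 10;
```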

Here's a dashboard I'm working on that was loaded into the local
environment to visualize the health and performance of the pipeline. The
source for this dashboard will be included in a follow up PR.

<img width="1385" height="1221" alt="image"
src="https://github.com/user-attachments/assets/4328f182-d167-4565-a7e7-9703419c2b48"
/>


[test-infra]: https://github.com/datum-cloud/test-infra

## Up Next

- CI pipeline integration w/ end-to-end testing
- CLI package for interacting with the new activity audit log querying
API w/ [kubectl
plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
- Operational dashboards & performance testing

--- 

Relates to https://github.com/datum-cloud/engineering/issues/90
scotwells added a commit that referenced this pull request Jan 21, 2026
This PR introduces a new `history` command that can be used to diff a
resource over time. This is helpful for analyzing what changes were made
to a resource and who made the change.

What's shown here is an example of running the command against a
resource in Datum Cloud so we could analyze the changes one of our
system components was making to a resource.

```shell
▶ ./datumctl activity --project another-project-w24uyl history dnsrecordsets dns-record-set-www-ibm-com-5npagr-qp25o9 --diff

╭─────────────────────────────────────────────────────────────╮
│ Change #1   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📸 Initial state (oldest available change)

{
  "apiVersion": "dns.networking.miloapis.com/v1alpha1",
  "kind": "DNSRecordSet",
  "metadata": {
    "name": "dns-record-set-www-ibm-com-5npagr-qp25o9",
    "namespace": "default"
  },
  "spec": {
    ...
  },
  "status": {
    ...
  }
}

╭─────────────────────────────────────────────────────────────╮
│ Change #2   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",

╭─────────────────────────────────────────────────────────────╮
│ Change #3   📝 patch    [200]
│ 🕐 2026-01-21 09:31:04
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",
```