
Add ClickHouse database for audit log storage#1

Merged
scotwells merged 3 commits into main from feat/clickhouse-setup
Dec 15, 2025

Conversation

@scotwells
Contributor

Summary

Set up a ClickHouse database to store Kubernetes audit events from the Activity API, providing fast queries over large volumes of audit data.

What this adds:

  • Database schema for storing audit events with multi-tenant scoping
  • Automated schema migrations that run on deployment
  • Hot/cold storage tiering (recent data on SSD, old data on S3)
  • Task commands for managing migrations (task migrations:local, etc.)
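The schema bullet above can be sketched roughly as follows; the actual column names, types, and sort key live in the migrations/ directory, so everything below is illustrative:

```sql
-- Hypothetical shape of the audit events table (names are assumptions).
CREATE TABLE IF NOT EXISTS audit_events
(
    tenant_id   String,                  -- multi-tenant scoping key
    audit_id    UUID,
    event_time  DateTime64(9),
    verb        LowCardinality(String),
    user        String,
    namespace   String,
    resource    LowCardinality(String),
    name        String,
    status_code UInt16,
    raw_event   String                   -- full audit event as JSON
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time, audit_id);
```

Sorting by `(tenant_id, event_time)` keeps each tenant's events contiguous on disk, which is what makes multi-tenant scoping cheap to enforce at query time.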

How it works:

  • Migrations run automatically via Kubernetes Job on each deployment
  • The Job checks which migrations have been applied and runs new ones
  • After 5 minutes, the Job deletes itself and gets recreated by GitOps
  • All migration SQL lives in migrations/ directory
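A migration Job like the one described typically tracks applied migrations in a small bookkeeping table so reruns are idempotent. The table and column names below are assumptions, not the repo's actual schema:

```sql
-- Hypothetical bookkeeping table the migration Job could consult.
CREATE TABLE IF NOT EXISTS schema_migrations
(
    name       String,                   -- migration file name, e.g. 0001_init.sql
    applied_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY name;

-- The Job compares this list against migrations/ and runs only new files.
SELECT name FROM schema_migrations ORDER BY name;
```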

Storage setup:

  • ClickHouse stores recent audit events on local disks for fast queries
  • After 90 days, events move to S3-compatible storage (RustFS in test environment)
  • This keeps costs low while maintaining query performance
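In ClickHouse terms, this kind of tiering is usually expressed as a table `TTL` that moves old parts to a cold volume backed by S3. The table, column, and volume names here are assumptions about how the storage policy is wired up:

```sql
-- Assumed names: table 'audit_events', column 'event_time', and a 'cold'
-- S3-backed volume defined in the deployed storage policy.
ALTER TABLE audit_events
    MODIFY TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold';
```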

Try it:

  task migrations:local          # Run migrations against local ClickHouse
  task migrations:new NAME=foo   # Create a new migration
  task migrations:generate       # Update Kubernetes ConfigMap


@jszychowski-datum jszychowski-datum left a comment


LGTM

Contributor

@JoseSzycho JoseSzycho left a comment


LGTM x 2

@scotwells scotwells merged commit 6871a8f into main Dec 15, 2025
scotwells added a commit that referenced this pull request Dec 17, 2025
## Summary

This PR builds on #1 and #2 to introduce a complete and functional
end-to-end testing environment for testing the new activity service in a
[local kind cluster](https://kind.sigs.k8s.io). The test environment is
built on top of a [test-infra] cluster that includes base services like
flux, envoy gateway, and a telemetry stack.

The test environment includes a [Vector](https://vector.dev),
[NATS](https://nats.io), and [ClickHouse](https://clickhouse.com) pipeline
that automatically collects audit logs emitted from the test-infra kind
cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```

## Details

The apiserver deployment manifests are structured as a standard `base`
kustomize deployment that includes the kubernetes Deployment, Service,
and RBAC resources.

The following kustomize components provide optional functionality that can
be enabled per environment as needed.

- **api-registration**: Configures the APIService registration with the
k8s apiserver to proxy requests to the activity apiserver
- **cert-manager-ca**: Configures a namespaced cert issuer to use with
the activity apiserver
- **grafana-clickhouse**: Configures a new Grafana datasource to connect
to the deployed clickhouse instance
- **namespace**: Creates a namespace to use for the system's deployment
- **nats-stream**: Creates a new nats JetStream to use for the audit log
pipeline
- **tracing**: Configures tracing for the API server
- **vector-aggregator**: Deploys Vector in aggregator mode to ingest audit
logs from NATS and write them to ClickHouse
- **vector-sidecar**: Runs Vector on every node in the cluster to collect
audit logs from API servers and write them to NATS

Also included is deployment automation for deploying the system's
dependencies:

- **[ClickHouse
operator](https://github.com/Altinity/clickhouse-operator)**: Manages
ClickHouse deployments through CRDs
- **NATS**: Deploys an instance of NATS and
[NACK](https://github.com/nats-io/nack) to configure NATS through CRDs

After deploying a fresh test environment, I used a
[kubectl-plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
to query the activity API and retrieve audit logs that have been
collected through the pipeline.

```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP             VERB    USER                                                                                   NAMESPACE          RESOURCE             NAME                                     STATUS
2025-12-17 17:07:12   get     system:serviceaccount:kyverno:kyverno-background-controller                                                                                                             200
2025-12-17 17:07:11   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:serviceaccount:kyverno:kyverno-reports-controller                                                                                                                200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              tls-assets-vmalert-telemetry-system-vm   200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              vmalert-telemetry-system-vm              200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   configmaps           vm-telemetry-system-vm-rulefiles-0       200
2025-12-17 17:07:10   watch   system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator                      daemonsets                                                    200
2025-12-17 17:07:10   watch   system:serviceaccount:kyverno:kyverno-admission-controller                                                generatingpolicies                                            200
2025-12-17 17:07:10   watch   system:serviceaccount:flux-system:image-reflector-controller                                              imagepolicies                                                 200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
``` 
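For reference, the `CEL → SQL` step in the diagram means the filter in the command above gets compiled into a ClickHouse query along these lines (table and column names are assumptions, not the actual generated SQL):

```sql
-- Hypothetical translation of: --filter='objectRef.resource != "leases"'
SELECT event_time, verb, user, namespace, resource, name, status_code
FROM audit_events
WHERE resource != 'leases'            -- from the CEL expression
  AND tenant_id = {tenant:String}     -- tenant scoping injected server-side
ORDER BY event_time DESC
LIMIT 10;
```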

Here's a dashboard I'm working on that was loaded into the local
environment to visualize the health and performance of the pipeline. The
source for this dashboard will be included in a follow up PR.

<img width="1385" height="1221" alt="image"
src="https://github.com/user-attachments/assets/4328f182-d167-4565-a7e7-9703419c2b48"
/>


[test-infra]: https://github.com/datum-cloud/test-infra

## Up Next

- CI pipeline integration w/ end-to-end testing
- CLI package for interacting with the new activity audit log querying
API w/ [kubectl
plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)
- Operational dashboards & performance testing

--- 

Relates to https://github.com/datum-cloud/engineering/issues/90
scotwells added a commit that referenced this pull request Jan 21, 2026
This PR introduces a new `history` command that can be used to diff a
resource over time. This is helpful for analyzing what changes were made
to a resource and who made the change.

What's shown here is an example of running the command against a
resource in Datum Cloud so we could analyze the changes one of our
system components was making to a resource.

```shell
▶ ./datumctl activity --project another-project-w24uyl history dnsrecordsets dns-record-set-www-ibm-com-5npagr-qp25o9 --diff

╭─────────────────────────────────────────────────────────────╮
│ Change #1   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📸 Initial state (oldest available change)

{
  "apiVersion": "dns.networking.miloapis.com/v1alpha1",
  "kind": "DNSRecordSet",
  "metadata": {
    "name": "dns-record-set-www-ibm-com-5npagr-qp25o9",
    "namespace": "default"
  },
  "spec": {
    ...
  },
  "status": {
    ...
  }
}

╭─────────────────────────────────────────────────────────────╮
│ Change #2   📝 patch    [200]
│ 🕐 2026-01-21 09:29:35
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",

╭─────────────────────────────────────────────────────────────╮
│ Change #3   📝 patch    [200]
│ 🕐 2026-01-21 09:31:04
│ 👤 dns-operator
╰─────────────────────────────────────────────────────────────╯
📝 Changes: metadata only

--- Previous
+++ Current
@@ -143,7 +143,7 @@
     "conditions": [
       {
         "lastTransitionTime": "2025-12-03T10:10:17Z",
-        "message": "Record \"@\": status 422: {\"error\": \"RRset www.ibm.com. IN CNAME has more than one record\"}",
+        "message": "Record \"outer-global-dual.ibmcom-tls12.edgekey.net\": status 422: {\"error\": \"RRset outer-global-dual.ibmcom-tls12.edgekey.net.www.ibm.com. IN CNAME has more than one record\"}",
         "observedGeneration": 1,
         "reason": "PDNSError",
         "status": "False",
```