feat: functional end-to-end testing environment #4

Merged
scotwells merged 1 commit into `main` from `feat/deployment-automation`
Dec 17, 2025
Conversation


scotwells (Contributor) commented Dec 17, 2025

Summary

This PR builds on datum-cloud/engineering#1 and datum-cloud/engineering#2 to introduce a complete, functional end-to-end testing environment for the new activity service in a local kind cluster. The environment is built on top of a [test-infra](https://github.com/datum-cloud/test-infra) cluster that includes base services such as Flux, Envoy Gateway, and a telemetry stack.

The test environment includes a Vector → NATS → ClickHouse pipeline that automatically collects audit logs emitted from the test-infra kind cluster.

```mermaid
graph LR
    APIServer[Activity API Server<br/>Generates audit logs]
    Vector1[Vector Sidecar<br/>Publishes events]
    NATS[NATS JetStream<br/>Event storage & routing]
    Vector2[Vector Aggregator<br/>Batching & persistence]
    CH[ClickHouse<br/>Long-term storage]
    QueryAPI[Activity API Server<br/>Query interface]
    Client[Clients<br/>kubectl/API]

    APIServer -->|writes| Vector1
    Vector1 -->|publish| NATS
    NATS -->|push| Vector2
    Vector2 -->|insert| CH
    Client -->|query| QueryAPI
    QueryAPI -->|CEL → SQL| CH
    CH -->|results| QueryAPI

    style APIServer fill:#e1f5ff
    style NATS fill:#fff3e0
    style CH fill:#f3e5f5
    style QueryAPI fill:#e8f5e9
    style Vector1 fill:#fff9c4
    style Vector2 fill:#fff9c4
```
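As a rough sketch of the two Vector stages in this pipeline: the sidecar publishes audit events to a NATS subject, and the aggregator consumes that subject and batches inserts into ClickHouse. The file paths, service addresses, subject name, and table name below are illustrative assumptions, not taken from the actual manifests in this PR:

```yaml
# --- Sidecar stage (illustrative): tail audit logs, publish to NATS ---
sources:
  audit_logs:
    type: file
    include:
      - /var/log/kubernetes/audit/*.log   # assumed audit log path
sinks:
  nats_out:
    type: nats
    inputs: [audit_logs]
    url: nats://nats.nats-system.svc:4222  # assumed NATS address
    subject: audit.logs                    # assumed subject name
    encoding:
      codec: json
---
# --- Aggregator stage (illustrative): consume NATS, batch into ClickHouse ---
sources:
  nats_in:
    type: nats
    url: nats://nats.nats-system.svc:4222
    subject: audit.logs
sinks:
  clickhouse_out:
    type: clickhouse
    inputs: [nats_in]
    endpoint: http://clickhouse.activity-system.svc:8123  # assumed endpoint
    table: audit_logs                                     # assumed table
    batch:
      max_events: 1000
```

The two documents above would run as separate Vector instances (the per-node sidecar and the aggregator deployment); the aggregator is where batching policy lives so ClickHouse receives large, infrequent inserts.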

Details

The apiserver deployment manifests are structured as a standard `base` kustomize deployment that includes the Kubernetes Deployment, Service, and RBAC resources.

The following kustomize components have been introduced to provide optional functionality that can be enabled per environment as needed:

  • api-registration: Configures the APIService registration with the k8s apiserver to proxy requests to the activity apiserver
  • cert-manager-ca: Configures a namespaced cert-manager issuer to use with the activity apiserver
  • grafana-clickhouse: Configures a new Grafana datasource that connects to the deployed ClickHouse instance
  • namespace: Creates a namespace for the system's deployment
  • nats-stream: Creates a new NATS JetStream stream for the audit log pipeline
  • tracing: Configures the apiserver with a tracing configuration
  • vector-aggregator: Deploys a Vector aggregator that ingests audit logs from NATS and writes them to ClickHouse
  • vector-sidecar: Deploys a per-node Vector instance responsible for collecting audit logs from apiservers and writing them to NATS
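For illustration, an environment overlay could opt into a subset of these components with a standard kustomize `components` list. The `../../` directory layout below is an assumption about the repo structure, not taken from this PR:

```yaml
# Illustrative overlay; paths are assumed, only component names come from the PR.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

components:
  - ../../components/namespace
  - ../../components/api-registration
  - ../../components/cert-manager-ca
  - ../../components/nats-stream
```

Because components are additive patches, each environment enables only what it needs on top of the shared `base`.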

Also included is deployment automation for the system's dependencies:

  • ClickHouse operator: Manages ClickHouse deployments through CRDs
  • NATS: Deploys an instance of NATS plus NACK so NATS can be configured through CRDs
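For a sense of what configuring NATS through NACK's CRDs looks like, a JetStream stream for this pipeline might be declared like the sketch below. The stream name, subjects, namespace, and limits are assumptions for illustration:

```yaml
# Illustrative NACK Stream resource; names and limits are assumptions.
apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: audit-logs
  namespace: nats-system
spec:
  name: audit-logs
  subjects:
    - "audit.logs.>"     # assumed subject hierarchy
  storage: file          # persist events to disk
  retention: limits
  maxAge: "24h"          # assumed retention window
```

Declaring the stream as a CRD lets the nats-stream component own the stream's lifecycle alongside the rest of the kustomize deployment.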

After deploying a fresh test environment, I used a kubectl plugin to query the activity API and retrieve audit logs collected through the pipeline.

```shell
▶ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP             VERB    USER                                                                                   NAMESPACE          RESOURCE             NAME                                     STATUS
2025-12-17 17:07:12   get     system:serviceaccount:kyverno:kyverno-background-controller                                                                                                             200
2025-12-17 17:07:11   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:anonymous                                                                                                                                                        200
2025-12-17 17:07:10   get     system:serviceaccount:kyverno:kyverno-reports-controller                                                                                                                200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              tls-assets-vmalert-telemetry-system-vm   200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   secrets              vmalert-telemetry-system-vm              200
2025-12-17 17:07:10   get     system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator   telemetry-system   configmaps           vm-telemetry-system-vm-rulefiles-0       200
2025-12-17 17:07:10   watch   system:serviceaccount:telemetry-system:telemetry-system-vm-victoria-metrics-operator                      daemonsets                                                    200
2025-12-17 17:07:10   watch   system:serviceaccount:kyverno:kyverno-admission-controller                                                generatingpolicies                                            200
2025-12-17 17:07:10   watch   system:serviceaccount:flux-system:image-reflector-controller                                              imagepolicies                                                 200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzowNzoxMC4yNTA5NjhaIiwiYSI6IjJlZTJhMDgwLTAzYmMtNDk3Yi1hYjliLWU4ODQyZjBkMzY2NyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzowNzoxNC4wNzYwOTU1NDRaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
```

Here's a dashboard I'm working on that was loaded into the local environment to visualize the health and performance of the pipeline. The source for this dashboard will be included in a follow-up PR.

[image: pipeline health and performance dashboard]

Up Next

  • CI pipeline integration w/ end-to-end testing
  • CLI package for interacting with the new activity audit log querying API w/ kubectl plugin
  • Operational dashboards & performance testing

Relates to datum-cloud/enhancements#536

@scotwells scotwells merged commit 8e64528 into main Dec 17, 2025
@scotwells scotwells deleted the feat/deployment-automation branch December 17, 2025 23:14
scotwells added a commit that referenced this pull request Dec 17, 2025
## Summary

The Activity CLI package lets consumers easily build CLIs that interact with the Activity API. The activity CLI includes the following commands:

- `activity` - Entrypoint to all commands available with the activity
CLI
- `query` - Command to execute queries against the API

This approach lets consumers choose how to integrate the CLI and provide a native experience to their users. CLIs (e.g. [datumctl](https://github.com/datum-cloud/datumctl)) can choose to import just the `query` command if they wish.

This also includes a [kubectl-plugin] named `activity` so that kubectl
users can use `kubectl activity` to interact with the activity API.

I've also included an auto-generated Golang client that can be used to
interact with the activity API.

[kubectl-plugin]:
    https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/

## Details

I was able to test this by installing the `kubectl-activity` command to my `$PATH` and using kubectl to query the test environment created in #4.

```shell
$ go build -o ~/go/bin/kubectl-activity ./cmd/kubectl-activity
 
$ kubectl activity query --filter='objectRef.resource != "leases"' --limit 10
TIMESTAMP             VERB     USER                                                                            NAMESPACE   RESOURCE               NAME                                       STATUS
2025-12-17 17:22:56   watch    system:kube-controller-manager                                                              updaterequests                                                    200
2025-12-17 17:22:56   create   system:serviceaccount:activity-system:activity-apiserver                                    subjectaccessreviews                                              201
2025-12-17 17:22:55   get      system:anonymous                                                                                                                                              200
2025-12-17 17:22:55   get      system:serviceaccount:kyverno:kyverno-admission-controller                                                                                                    200
2025-12-17 17:22:54   watch    system:kube-controller-manager                                                              volumeattachments                                                 200
2025-12-17 17:22:54   watch    system:apiserver                                                                            persistentvolumes                                                 200
2025-12-17 17:22:53   get      system:anonymous                                                                                                                                              200
2025-12-17 17:22:53   get      system:serviceaccount:kyverno:kyverno-admission-controller                      kyverno     secrets                kyverno-svc.kyverno.svc.kyverno-tls-pair   200
2025-12-17 17:22:53   get      system:serviceaccount:kyverno:kyverno-admission-controller                      kyverno     secrets                kyverno-svc.kyverno.svc.kyverno-tls-ca     200
2025-12-17 17:22:53   watch    system:serviceaccount:telemetry-system:telemetry-system-vm-kube-state-metrics               services                                                          200

More results available. Use --continue-after 'eyJ0IjoiMjAyNS0xMi0xN1QyMzoyMjo1My4xOTM3MjhaIiwiYSI6Ijc1ZTc4ZGVjLTA3ODQtNGY2OS1hY2NlLWM4OGFhOTQ0ZDUzNyIsImgiOiJDZ1FrUHh5S2NCT2NkTEUyNm9meDhBPT0iLCJpIjoiMjAyNS0xMi0xN1QyMzoyMzowMi4xODM3NjU5MjhaIn0=' to get the next page.
Or use --all-pages to fetch all results automatically.
```

## Up Next

- CI / build pipeline
- Operational dashboards & performance testing

---

Relates to https://github.com/datum-cloud/engineering/issues/90

drewr commented Dec 19, 2025

After a while my VM finally seemed to get through the setup, BUT it doesn't seem to think it was successful. Thoughts?

```
[.....lots of stuff that doesn't show any obvious errors.....]
helmrepository.source.toolkit.fluxcd.io/vector created
helmrepository.source.toolkit.fluxcd.io/vector-sidecar created

⏳ Waiting for RustFS bucket initialization...
job.batch/rustfs-bucket-init condition met
⏳ Waiting for ClickHouse to be ready...
clickhouseinstallation.clickhouse.altinity.com/activity-clickhouse condition met
pod/chi-activity-clickhouse-activity-0-0-0 condition met
⏳ Waiting for ClickHouse migrations to complete...
job.batch/clickhouse-migrate condition met
⏳ Waiting for Activity server to be ready...
pod/activity-apiserver-6f7667949d-28zpz condition met
⏳ Waiting for Vector aggregator to be ready...
pod/vector-aggregator-cfcf75c9b-27j8k condition met
pod/vector-aggregator-cfcf75c9b-2vc77 condition met

⏳ Waiting for Grafana ClickHouse datasource to be synced...
grafanadatasource.grafana.integreatly.org/clickhouse-datasource condition met

✅ Activity server and all dependencies deployed successfully!

📊 Check status:
  All resources:     task test-infra:kubectl -- get all -n activity-system
  API server pods:   task test-infra:kubectl -- get pods -l app=activity-apiserver -n activity-system
  ClickHouse pods:   task test-infra:kubectl -- get pods -l clickhouse.altinity.com/chi=activity-clickhouse -n activity-system
  Vector pods:       task test-infra:kubectl -- get pods -l app.kubernetes.io/instance=vector-aggregator -n activity-system
  NATS pods:         task test-infra:kubectl -- get pods -n nats-system
  NATS streams:      task test-infra:kubectl -- get streams -n nats-system
  S3 bucket:         task test-infra:kubectl -- get objectbucketclaim -n activity-system
  API service:       kubectl get apiservice v1alpha1.activity.datum.net
  Grafana datasrc:   task test-infra:kubectl -- get grafanadatasource clickhouse-datasource -n activity-system

📋 View logs:
  API server:        task test-infra:kubectl -- logs -l app=activity-apiserver -n activity-system -f
  ClickHouse:        task test-infra:kubectl -- logs -l clickhouse.altinity.com/chi=activity-clickhouse -n activity-system -f
  Vector:            task test-infra:kubectl -- logs -l app.kubernetes.io/instance=vector-aggregator -n activity-system -f
  NATS:              task test-infra:kubectl -- logs -l app.kubernetes.io/name=nats -n nats-system -f

📊 Observability:
  Access Grafana:    task test-infra:kubectl -- port-forward -n telemetry-system svc/grafana-service 3000:3000
  Grafana URL:       http://localhost:3000 (admin / datum123)
  Verify datasource: task test-infra:kubectl -- get grafanadatasource -n activity-system

🔗 Milo Integration:
  If you have Milo deployed, integrate its audit logs:
    task dev:integrate-milo

task: Failed to run task "dev:setup": task: Task "observability:deploy" does not exist
```

scotwells (Author) replied:

That error is because I removed the observability stack before committing the changes; I want the observability stack to be its own PR. Seems like everything else deployed fine.


drewr commented Dec 19, 2025

Copy that!


drewr commented Dec 19, 2025

None of the commands seem to work yet:

```shell
$ TASK_X_REMOTE_TASKFILES=1 task test-infra:kubectl -- logs -l app.kubernetes.io/instance=vector-aggregator -n activity-system -f
task: No Taskfile found at "https://raw.githubusercontent.com/datum-cloud/test-infra/feat/global-grafana-watch/Taskfile.yml"
```

Poked around the test-infra repo to see if I could reverse-engineer the intent, but there aren't any matching PRs or branches there. Is it supposed to just reference the root Taskfile?

scotwells (Author) replied:

That's from me merging datum-cloud/test-infra#18 earlier today. I just need to cut a new release and adjust the Taskfile in the activity repo to use the tagged version.

scotwells (Author) replied:

Fixed in #7
