feat: long-haul (canary) test driver for the operator (#220)#348

Open
WentingWu666666 wants to merge 35 commits into documentdb:main from WentingWu666666:developer/wentingwu/long-haul-tests

Conversation

@WentingWu666666
Collaborator

@WentingWu666666 WentingWu666666 commented Apr 20, 2026

Closes #220.

What this PR adds

A standalone long-haul (canary) test driver for the DocumentDB Kubernetes Operator, plus the GitHub Actions plumbing to keep it running on a long-lived AKS cluster.

The driver lives under test/longhaul/ and is its own Go module. It is not a Ginkgo suite — it is a long-running binary (cmd/longhaul) deployed as a Deployment into a namespace alongside a real DocumentDB CR. It exercises the operator with a continuous, realistic workload while a weighted-random scheduler injects disruptive operations, and it journals every event so failures can be reproduced.

Pieces that ship in this PR

  • Continuous data plane (workload/): writer + verifier goroutines, sequence / checksum gap detection, lag-aware reads — the durability oracle for FM1/FM3/FM7.
  • Operation scheduler (operations/): weighted random selection, per-category cooldowns, steady-state gating, and an outage-policy verdict per operation. Operations implemented today: ScaleUp, ScaleDown, UpgradeDocumentDB (auto-skips when instancesPerNode < 2 because rolling restart on a single instance is unavoidable downtime).
  • Real cluster client (monitor/k8sclient.go): reads the DocumentDB CR, watches pods, pulls metrics-server samples — no fakes.
  • Health monitor + leak detector (monitor/health.go, monitor/leakdetect.go): steady-state detection plus linear-regression slope on memory/CPU samples.
  • Append-only journal (journal/): every disruption window, every workload event, every verdict. The disruption-window oracle is what makes verdicts reproducible.
  • Reporter + alerter (report/): writes the longhaul-report ConfigMap so a human can kubectl get cm longhaul-report -o yaml for a single source of truth on pass/fail. Alerts surface in the monitor workflow.
  • In-cluster packaging (Dockerfile, deploy/longhaul.yaml): runs as a Deployment (not a Job) — if the process exits, it restarts; the ConfigMap is the durable verdict, not the pod's exit code.
  • Two GHA workflows chained via workflow_run:
    • LONGHAUL - Build Test Driver Image — builds and pushes to GHCR.
    • LONGHAUL - Deploy Test Driver to AKS — uses a long-lived ServiceAccount-token kubeconfig (LONGHAUL_KUBECONFIG repo secret) to roll the Deployment.
  • LONGHAUL - Monitor — hourly cron that pulls the report ConfigMap and posts a workflow summary; alerts on stale or failing reports.
  • Tier-1 unit tests across config, journal, monitor, report, operations — no cluster needed, run with go test ./... from test/longhaul/.
  • Docs: docs/designs/long-haul-test-design.md (problem statement, eight failure modes, design principles, architecture, operations catalog with status, env-var table, phase tracker) and test/longhaul/README.md (how to run, what every env var does, why Deployment over Job).

Why this is structured the way it is

  • Long-haul is what happens when the failure modes you need to detect require more time and operations than fit in any bounded test run. Memory leaks, CR-history drift, upgrade-under-state failures — these don't speed up. The eight failure modes in the design doc are why every component exists.
  • No drain before disruption. Workload runs through scale, upgrade, and (future) failover ops — that's exactly the bug class we're trying to find.
  • Deployment, not Job. A Job that finishes is a Job no one looks at. The longhaul-report ConfigMap is the durable verdict, and the monitor cron makes sure a human sees stale runs.
  • Single cluster today, two clusters later. A baseline DocumentDB cluster (control group) is in the design but not in this PR — the single-cluster driver already exercises the primary failure modes.
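The leak detector's core signal is just a least-squares slope over resource samples: a persistently positive slope across a long window suggests a leak. A minimal standalone sketch, not the monitor package's actual code:

```go
package main

import "fmt"

// slope returns the least-squares regression slope of y over x. Applied to
// memory/CPU samples over time, a sustained positive slope is the leak
// signal. Illustrative sketch only.
func slope(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy, sxy, sxx float64
	for i := range x {
		sx += x[i]
		sy += y[i]
		sxy += x[i] * y[i]
		sxx += x[i] * x[i]
	}
	return (n*sxy - sx*sy) / (n*sxx - sx*sx)
}

func main() {
	// Hypothetical memory samples (MiB) at t = 0..4 hours: steady growth.
	t := []float64{0, 1, 2, 3, 4}
	mem := []float64{100, 103, 106, 109, 112}
	fmt.Printf("%.1f MiB/hour\n", slope(t, mem)) // prints "3.0 MiB/hour"
}
```

This is exactly the kind of signal that "doesn't speed up": a 3 MiB/hour drift is invisible in a bounded e2e run but unambiguous after days.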

Live canary

I've stood up a long-running cluster — the driver Deployment lives there and gets re-rolled by the deploy workflow on every successful image build:

  • AKS cluster: longhaul-aks (longhaul-rg, sub 81901d5e…)
  • Everything (DocumentDB CR + driver Deployment + longhaul-report ConfigMap) runs in documentdb-test-ns.
  • Useful pokes:
    • kubectl logs deploy/longhaul-test -n documentdb-test-ns
    • kubectl get cm longhaul-report -n documentdb-test-ns -o yaml

What's intentionally out of scope

Tracked in the design doc's phase table:

  • Two-cluster topology (Phase 3): primary + baseline, so signals become attributable.
  • Backup / restore operations (Phase 2a) and chaos primitives beyond scale (Phase 2d): drain node, kill primary pod, operator restart.
  • Operator version upgrade operation: we exercise DocumentDB version upgrades today; operator upgrades will land once the release workflow is wired to bump the canary's target.
  • OutagePolicy.AllowedDowntime enforcement: the field is reserved on the struct; current verdicts use AllowedWriteFailures + MustRecoverWithin. Called out explicitly in the design doc.

How to review

If you only want to skim, three files give you the shape of the PR:

  1. docs/designs/long-haul-test-design.md — the why.
  2. test/longhaul/cmd/longhaul/main.go — the wiring.
  3. test/longhaul/operations/scheduler.go — the heart of the disruption loop.

If you want to run it locally:

cd test/longhaul
LONGHAUL_CLUSTER_NAME=documentdb-sample LONGHAUL_NAMESPACE=documentdb-test-ns \
  LONGHAUL_MAX_DURATION=10m \
  go run ./cmd/longhaul

Happy to walk through any area on a call.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 8 times, most recently from ca9522c to d1b3462 Compare April 22, 2026 16:38
@WentingWu666666 WentingWu666666 changed the title from "docs: add long haul test design document" to "feat: add long-haul test design and Phase 1a skeleton (#220)" Apr 22, 2026
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from d1b3462 to 6bd4274 Compare April 22, 2026 16:41
@WentingWu666666 WentingWu666666 marked this pull request as ready for review April 22, 2026 16:50
Copilot AI review requested due to automatic review settings April 22, 2026 16:50
Contributor

Copilot AI left a comment


Pull request overview

Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.

Changes:

  • Introduces a long-haul Ginkgo suite with a BeforeSuite enablement gate (LONGHAUL_ENABLED) and a placeholder canary spec.
  • Adds a test/longhaul/config package for env-based configuration loading/validation with unit tests.
  • Documents the long-haul test architecture and usage via a new design doc and package README.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

File Description
operator/src/test/longhaul/suite_test.go Adds the Ginkgo suite entry point for long-haul tests.
operator/src/test/longhaul/longhaul_test.go Adds gated BeforeSuite config loading/validation and a placeholder canary spec.
operator/src/test/longhaul/config/suite_test.go Adds the Ginkgo suite entry point for config unit tests.
operator/src/test/longhaul/config/config_test.go Adds unit tests for config defaults, env parsing, validation, and enablement gating.
operator/src/test/longhaul/config/config.go Implements long-haul test configuration (env vars, defaults, validation, enablement).
operator/src/test/longhaul/README.md Adds usage docs, configuration reference, and CI-safety notes for long-haul tests.
docs/designs/long-haul-test-design.md Adds the long-haul canary design document and phased implementation plan.

// Config holds all configuration for a long haul test run.
type Config struct {
// MaxDuration is the maximum test duration. Zero means run until failure.
// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
Copy link

Copilot AI Apr 22, 2026


This comment says infinite runs require LONGHAUL_MAX_DURATION=0, but time.ParseDuration requires a unit (and the README/tests use 0s). Update the comment to match the actual accepted value to avoid confusing users.

Suggested change
// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
// Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs.

Collaborator

@xgerman xgerman left a comment


Review

Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.

Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅


🟠 Major

M1. Module placement will pollute operator/src/go.mod
operator/src/test/longhaul/ (entire tree)

The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.

For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:

  • Phase 1b adds go.mongodb.org/mongo-driver (design line 75)
  • Phase 1c adds a file-backed journal
  • Phase 1d adds OTel / Prometheus client libs
  • Phase 2f adds k8s Job manifests + likely controller-runtime client

…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.

M2. Design does not engage with the existing e2e suite
docs/designs/long-haul-test-design.md (throughout)

Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:

  • Ginkgo v2 harness with SynchronizedBeforeSuite for cluster bootstrapping (test/e2e/pkg/testenv)
  • Label taxonomy (test/e2e/labels.go)
  • Run-ID tagging for artifact separation (test/e2e/runid.go)
  • envsubst-based manifest rendering
  • A mongo-driver v2 wire-protocol client (same driver the design proposes)
  • .github/actions/setup-test-environment composite action for cluster provisioning

The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.

M3. Four design gaps that will cost rework, not captured in §Open Questions
long-haul-test-design.md:443-445

  1. Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
  2. Workload resumption idempotency. Writers use a unique index on (writer_id, seq) (line 80). If a writer or the Job pod restarts, where does seq restart from? Reset to 0 → writes collide. Bootstrap from max(seq) per writer_id → the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described.
  3. Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
  4. Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.

These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.
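To make item 2 concrete, a hypothetical sketch of a gap-tolerant oracle: gaps inside a known crash-recovery window are logged, everything else is flagged as possible data loss. All names here are invented for illustration.

```go
package main

import "fmt"

// verify scans one writer's observed sequence numbers and classifies gaps.
// Gaps inside the [crashLow, crashHigh] window (known from the journal's
// crash/resume records) are tolerated; any other missing seq is returned
// as suspected data loss. Hypothetical sketch, not the PR's oracle.
func verify(seqs []int64, crashLow, crashHigh int64) (lost []int64) {
	for i := 1; i < len(seqs); i++ {
		for s := seqs[i-1] + 1; s < seqs[i]; s++ {
			if s >= crashLow && s <= crashHigh {
				continue // gap attributable to crash/resume: logged, not flagged
			}
			lost = append(lost, s)
		}
	}
	return lost
}

func main() {
	// Writer crashed after acking seq 4 with seqs 5-7 in flight, then
	// resumed from max(seq)+1 = 8 after bootstrap.
	observed := []int64{1, 2, 3, 4, 8, 9}
	fmt.Println(verify(observed, 5, 7)) // prints "[]": all gaps inside the crash window
	fmt.Println(verify(observed, 6, 7)) // prints "[5]": seq 5 is outside the window
}
```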

M4. MaxDuration=30m default contradicts "run until failure"
config.go:594, long-haul-test-design.md:37,188

The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.


🟡 Minor (cheap in-PR fixes)

  • m1 IsEnabled() — config.go:637-640 — doesn't trim whitespace. LONGHAUL_ENABLED=" true " → false (silent skip). Add strings.TrimSpace. Silent-failure mode; worth fixing in-PR.
  • m2 Validate() — config.go:625-633 — accepts negative MaxDuration. time.ParseDuration("-5m") succeeds. Add if c.MaxDuration < 0 { return fmt.Errorf(...) }.
  • m3 Error strings capitalized — config.go:627,630 — ST1005 / Go convention: lowercase. Matching tests use ContainSubstring("Namespace") / "ClusterName" and will need adjustment.
  • m4 Path inconsistency — long-haul-test-design.md:340 says kubectl apply -f test/longhaul/deploy/job.yaml but the directory structure roots at operator/src/test/longhaul/. Fix path, or resolve M1 first.
  • m5 README overstates CI safety — README.md:543-547 — phrase as "no CI workflow currently sets LONGHAUL_ENABLED; do not set it in any PR-gated workflow" rather than as a property of the code.
  • m6 config is a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Consider longhaulcfg or cfg. Not blocking.
  • m7 Missing IsEnabled test cases — config_test.go:727-757 — given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case ("True", "YES"), "0" / "no" / "false" → false.

🟢 Nit

  • longhaul_test.go:799 — package-level mutable testConfig. Fine for a single-file suite; revisit if Phase 1b splits files.
  • DefaultConfig() by value vs. Validate() pointer receiver — slight asymmetry. Leave it.
  • README line 509's -timeout 0 guidance is correct and rarely documented — nice touch.

Positive feedback

  • Skip gate is correctly placed in BeforeSuite, not BeforeEach — single skipped line instead of N per spec. Exactly right.
  • LoadFromEnv cleanly separates parsing from validation, so Validate() can be called on a programmatically-built Config in Phase 1b unit tests without fighting env vars.
  • The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
  • Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
  • Test coverage on the config package is thorough for the surface area.

Questions for the author

  1. (M1) Any reason longhaul can't live at test/longhaul/ as a sibling module to test/e2e/? If there is one, please capture it in the design.
  2. (M2) Have you looked at test/e2e/pkg/ for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation?
  3. (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
  4. Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
  5. Should CONTRIBUTING.md or AGENTS.md get a one-liner pointing at operator/src/test/longhaul/README.md so contributors aren't left wondering "what is this tree?"

Reviewed with the GitHub Copilot code-review agent.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 3 times, most recently from f3ffbef to 8459cae Compare April 22, 2026 18:47
@WentingWu666666
Collaborator Author

WentingWu666666 commented Apr 22, 2026

Response to @xgerman's Review

Thank you for the thorough review! I've addressed all items. Here's a summary:


Major Items

M1 (Module placement): Agreed; extracted long-haul tests to test/longhaul/ at the repo root as a separate Go module (github.com/documentdb/documentdb-operator/test/longhaul). This keeps Ginkgo/Gomega and any future test-only deps completely out of the operator's runtime go.mod. Follows the three-directory layout described in the updated design doc.

M2 (E2E engagement): Designed a three-directory layout for test code reuse.

Each directory has its own go.mod with replace directives. The design doc now includes a package-to-use mapping table showing which shared utils each long-haul phase will reuse (mongo client, assertions, DocumentDB CR helpers, etc.). Extraction of shared packages from test/e2e/pkg/e2eutils/ to test/utils/ will be coordinated with @xgerman in a follow-up.

M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc:

  1. Journal durability: PVC-backed file at /data/journal/, structured JSON lines, on-startup reconstruction from existing file.
  2. Writer seq resumption: Bootstrap from max(seq) per writer_id in DB. Oracle tolerates crash-recovery gaps (logged, not flagged as data loss).
  3. Teardown on abort: Signal handler for SIGTERM/SIGINT: cancel contexts -> flush journal -> write final status to ConfigMap -> exit. Leftover-run detector on startup.
  4. Latency baseline: Kept as Open Question; establish P99 baseline during first 30min warmup, alert on >2× sustained regression.

These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs.

M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets LONGHAUL_MAX_DURATION=0s. The 30m default remains as a safety net for local development.


Minor Items All Fixed

  • m1: Added strings.TrimSpace to IsEnabled()
  • m2: Added negative MaxDuration validation in Validate()
  • m3: Lowercased error strings (Go convention ST1005)
  • m4: Fixed path: test/longhaul/deploy/job.yaml
  • m5: Rephrased CI safety: "No CI workflow currently sets this variable; do not add it to any PR-gated workflow"
  • m6: Valid concern. Keeping config for Phase 1a (no shadowing risk yet). Will revisit if the operator's own config needs importing in later phases.
  • m7: Added 8 new test cases: whitespace (" true ", " yes "), whitespace-only (" "), mixed case ("True", "YES"), explicit false values ("0", "no", "false"). Total: 23 specs (was 15).

Questions

  1. (M1) Resolved; extracted to a separate Go module at test/longhaul/.
  2. (M2) Resolved; three-directory layout with shared utils plan documented.
  3. (M3.1) PVC-backed file journal documented as provisional design decision.
  4. (Issue #220) The design's Problem Statement already quotes the three requirements from #220. SLO targets remain as Open Question #2; they need team input.
  5. (CONTRIBUTING.md/AGENTS.md) Good idea. Will add a one-liner in a follow-up PR to avoid scope creep here.

Test results: 23 config specs + 1 canary spec (skipped) = all passing. go vet clean.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 3 times, most recently from 9676d24 to 158a393 Compare April 23, 2026 17:31
@hossain-rayhan
Collaborator

The plan looks good.
I have one concern about the test restart policy.

We should keep running parallel tests for different combinations. For example, operator version n with gateway/extension version n-1, and operator version n-2 continuing to work with the latest non-schema changes.

At a minimum, run tests against different operator versions, rather than only restarting the tests on the latest release.

@WentingWu666666
Collaborator Author

Thanks for the feedback @hossain-rayhan! Great point about version compatibility.

Since the operator and database images follow independent version tracks, we agree cross-version testing matters: e.g., upgrading the operator while the database is still on the previous version, and vice versa.

That said, we think this is better suited for the E2E test suite rather than long-haul. Cross-version compat bugs surface immediately on upgrade (deterministic, finite), while long-haul focuses on time-dependent issues (memory leaks, connection pool exhaustion, replication drift) that only appear after hours/days of sustained operation.

We'll make sure the E2E suite covers both operator-first and database-first upgrade paths with version N/N-1 combinations.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from 158a393 to 1d63aea Compare April 24, 2026 15:35
wentingwu000 and others added 28 commits May 7, 2026 13:08
Previous run data caused unique index conflicts on (writer_id, seq).
Now the writes collection is dropped at test start for a clean slate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
… binary)

Adds the infrastructure to run the long-haul test as a Kubernetes Job:
- cmd/longhaul/main.go: standalone binary (no Ginkgo dependency)
- Dockerfile: multi-stage distroless image
- deploy/job.yaml: Job + ConfigMap for environment configuration
- deploy/rbac.yaml: ServiceAccount (Phase 2 adds RBAC roles)
- Updated README with deployment instructions

The standalone binary is validated against AKS (1490 writes, 100% success).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…rts, alerting

- Replace placeholder ClusterClient with real k8s implementation using
  unstructured dynamic client (no operator module dependency)
- Add metrics-server integration for memory/CPU sampling with graceful
  fallback when metrics API is unavailable
- Add periodic checkpoint reporter (configurable interval, default 1h)
  that writes to stdout and persists to ConfigMap
- Add GitHub Actions workflow annotations (error/warning/notice)
- Update RBAC with pod/CR/configmap access and ClusterRole for metrics
- Update Dockerfile to golang:1.26-alpine for k8s client-go compatibility

Validated on AKS: pod running healthy, all components initialized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Polls AKS every 15 minutes. If the Job fails or ConfigMap reports FAIL,
the workflow exits non-zero and GitHub sends email to repo watchers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add deploy/setup.yaml with namespace, credentials secret, and DocumentDB CR
  spec for reproducible long-haul test environments
- Update job.yaml image to documentdblonghaul.azurecr.io registry
- Update monitor workflow to target longhaul-aks cluster
- Set max replicas to 1 (current webhook validation limit)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The Microsoft tenant blocks new service principal creation without a
ServiceManagementReference. Switch to a base64-encoded kubeconfig stored
as LONGHAUL_KUBECONFIG repo secret; simpler and avoids the SP policy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Change cron to hourly (matches LONGHAUL_REPORT_INTERVAL default 1h)
- Add staleness check: fail if ConfigMap last-updated > 2h ago
- Detects hung tests or bugs that stop updating the report

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Move staleness check before result check (stale data is untrustworthy)
- Change 'Job not found' from notice to error (triggers notification)
- Add instructions to disable workflow during maintenance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The standalone binary (cmd/longhaul/main.go) is the primary entry point,
deployed as a Kubernetes Job. The Ginkgo wrapper added no unique value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Remove suite_test.go and longhaul_test.go from structure
- Replace Ginkgo run instructions with 'go run ./cmd/longhaul/'
- Remove LONGHAUL_ENABLED from config table
- Add LONGHAUL_REPORT_INTERVAL to config table
- Update CI safety section (no longer gated by LONGHAUL_ENABLED)
- Update deploy file listing to match current state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add upgrade-if-needed job (runs only after health check passes)
- Compare running versions against repo's Chart.yaml/values.yaml
- Helm upgrade operator if appVersion drifts
- Patch DocumentDB CR if documentDbVersion drifts
- Ensures long-haul cluster always runs latest merged versions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Query ghcr.io for latest published operator and documentdb image tags
- Fallback to Chart.yaml / values.yaml if GHCR API fails
- Ensures cluster upgrades to actual released versions, not just repo HEAD

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Add schemaVersion: auto to setup.yaml so schema follows binary version
- Upgrade step always sets schemaVersion=auto alongside documentDbVersion
- Handle case where documentDbVersion is not set in CR (uses operator default)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ction

- Operator: /orgs/documentdb/packages/container/documentdb-operator/versions
- DocumentDB: /orgs/documentdb/packages/container/documentdb-kubernetes-operator%2Fdocumentdb/versions
- jq filter only matches strict X.Y.Z tags (rejects rc, latest, sha prefixes)
- Keeps Chart.yaml/values.yaml as fallback if GHCR query fails

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Operator: rollout status + image tag re-check after helm upgrade
- DocumentDB: poll status.documentDBImage every 20s (10m timeout)
- Use status.documentDBImage instead of spec for running version detection
- Fail workflow with notification if verification times out

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…to operator upgrade

The DocumentDB version upgrade is structurally identical to a scale operation
(spec patch on the DocumentDB CR -> operator reconciles -> rolling restart),
so it belongs in the test binary alongside scale/failover. This lets the
continuous data-integrity verifiers detect any corruption introduced by the
upgrade, and treats upgrade as a real load-bearing event rather than a
maintenance window where load is paused.

Workflow now:
  - Helm-upgrades the operator (still infra-level work) and applies CRDs
    first, since 'helm upgrade' does not touch CRDs by default.
  - Publishes the desired DocumentDB version to a 'longhaul-versions'
    ConfigMap instead of patching the CR directly.

Test now has an UpgradeDocumentDB operation:
  - Precondition: ConfigMap version differs from status.documentDBImage.
  - Execute: patch spec.documentDbVersion + spec.schemaVersion='auto', poll
    status until the new image tag is observed, then WaitForSteadyState.
  - Low weight so it doesn't crowd out scale/failover.
  - Generous OutagePolicy because rolling restarts touch every pod.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The driver is restart-safe (resumes from PVC journal/checkpoint), so a
Deployment(replicas=1) lets us roll new images with kubectl set image
instead of delete-and-recreate.

- deploy/deployment.yaml replaces deploy/job.yaml
  - kind: Deployment, replicas: 1, strategy: Recreate
  - image is templated: __OWNER__ and __IMAGE_TAG__ are substituted by
    the deploy workflow (or sed manually)
- .github/workflows/longhaul-deploy.yml (manual workflow_dispatch)
  - Auths via LONGHAUL_KUBECONFIG secret (SA-token kubeconfig)
  - Inputs: image_tag (required), apply_manifests (optional)
  - kubectl apply (optional) -> set image -> rollout restart -> rollout status
- README runbook updated for Deployment workflow

On critical failure the driver still exits non-zero; the Deployment then
auto-restarts. Pod restart count and the report ConfigMap are the
sources of truth for "did the test fail", not pod status alone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Bootstrap (namespace, DocumentDB CR, credentials secret, driver SA +
ClusterRole for metrics) is one-time admin work and lives in setup.yaml
+ rbac.yaml. The deployer SA used by the workflow is intentionally
namespace-scoped and cannot create those, so trying to apply them in
every deploy run failed with 'Forbidden'.

Drop the apply_manifests input and unconditionally apply only
deployment.yaml.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
* longhaul-deploy.yml: add workflow_run trigger so a successful build on
  main auto-deploys the matching :sha-<short> image. workflow_dispatch
  preserved for manual rolls/rollbacks. Resolve image tag from either
  trigger via a new steps.img.tag output.
* longhaul-monitor.yaml: clarify in the header that the monitor only
  upgrades the operator directly; for DocumentDB it merely publishes the
  desired version to the longhaul-versions ConfigMap, and the long-haul
  driver performs the in-band upgrade as a load-aware operation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The monitor workflow had two pre-existing bugs that prevented it from
ever passing after the long-haul auth + workload-shape changes landed:

1. Kubeconfig decode mismatch. The LONGHAUL_KUBECONFIG secret is stored
   as raw YAML (the deploy workflow writes it with printf), but the
   monitor decoded it with 'base64 -d', producing garbage and failing
   the very first kubectl call.

2. Job vs Deployment drift. The long-haul test was converted from a
   Job to a Deployment(replicas=1) so 'kubectl set image' can roll
   it. The monitor still ran 'kubectl get job longhaul-test' and
   tripped the 'No long-haul Job found' guard on every run.

Switch the monitor to:
- write the kubeconfig with printf (matches longhaul-deploy.yml),
- export KUBECONFIG via GITHUB_ENV so subsequent steps find it,
- check the Deployment's readyReplicas / Progressing condition
  instead of Job .status.active / Failed condition.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The golang:*-alpine base image already ships ca-certificates (it has to,
otherwise 'go mod download' against proxy.golang.org would not work out
of the box), and every dep in test/longhaul/go.mod resolves through the
module proxy with no VCS fetches, so 'git' is unused.

Removing the install shaves a layer and ~5-8s off cold builds. If we
later add a 'replace' pointing at a Git URL or import a path that needs
go-get HTML discovery, we'll need to add 'git' back.

Verified locally with 'docker build .': the build succeeds and produces a
working binary in the distroless runtime stage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…de (documentdb#220)

The README, design doc, and a couple of header comments in config.go were lagging behind the actual driver state shipped in this PR. This refresh aligns all three with what the code actually does today, so reviewers and future contributors do not have to cross-reference stale phase markers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…b#220)

Address Copilot review feedback on PR documentdb#348: the "uses defaults when no env vars set" spec previously only cleared three variables, so a developer with any other LONGHAUL_* var exported in their shell could see false failures. Move the clearing into a BeforeEach that wipes every recognized LONGHAUL_* env var.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…rade on HA (documentdb#220)

Address xgerman's blockers on PR documentdb#348:

[#1] ScaleUp/ScaleDown were patching spec.nodeCount, which the CRD pins to 1 (minimum=maximum=1). The API server rejected every scale call -- no real scale event was ever exercised and OpsExecuted over-counted. Switch all scale paths to spec.instancesPerNode (CRD range 1-3), the actual HA scale dimension. Add GetInstancesPerNode reader; remove the now-unused GetCurrentReplicas. Clamp Max to 3 / Min to 1 in the operation constructors.

[#2] deploy/deployment.yaml exported LONGHAUL_MIN_INSTANCES / LONGHAUL_MAX_INSTANCES but config.LoadFromEnv only read LONGHAUL_MIN_REPLICAS / LONGHAUL_MAX_REPLICAS, so the deployment env vars were silently dropped and the driver fell back to defaults (which were also wrong: 2..5 vs the CRD's 1..3). Rename Env constants, struct fields, and config defaults (1..3) to match the deployment manifest. Add a Validate() check rejecting MaxInstances > 3.

[#4] UpgradeDocumentDB.Precondition now skips when spec.instancesPerNode<2 with a clear message. A single-instance cluster has no standby, so a rolling upgrade WILL produce real (unavoidable) downtime; reporting that as a policy violation is a true positive but not actionable. The 10s scheduler tick re-evaluates as soon as the cluster scales up. Skipped operations do not consume the global cooldown (scheduler.tryExecute only updates lastOpTime after successful executeOp), so this guard is free.

Bonus fixes found while refactoring:

* spec.documentDbVersion -> spec.documentDBVersion (capital DB) -- the CRD's actual JSON tag. The old field was being silently ignored by the API server, so the upgrade operation never actually patched the version. (Caught by writing the GetInstancesPerNode unit-style read against the same CR object.)

* deploy/setup.yaml: bump cluster to instancesPerNode: 2 (HA topology) so the upgrade operation has a standby to fail over to and actually fires on the canary. Fix the stale 'kubectl apply -f deploy/job.yaml' header comment to point at deployment.yaml. (xgerman nits documentdb#13 + documentdb#14.)

* docs/designs/long-haul-test-design.md: rename MinReplicas/MaxReplicas references to MinInstances/MaxInstances and add the CRD bounds context so future readers understand why the range is 1-3, not arbitrary.

* test/longhaul/config/config_test.go: update the BeforeEach env-clear list to the new constant names.

Verified: go vet ./... clean; go test ./... -> ok config (19 specs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…dDowntime TODO (documentdb#220)

Sweep of xgerman's lower-severity findings on PR documentdb#348:

[documentdb#8] journal.DisruptionWindow.ExceededPolicy() ignored Policy.AllowedDowntime even though every operation's OutagePolicy() sets it. Document the gap with a TODO(longhaul, documentdb#220) explaining what enforcement would require (per-write timestamps fed into the active window from writer.go) so the dead field is not mistaken for actual policy enforcement.

[documentdb#9] UpgradeDocumentDB.waitForImage was 'continue'-ing on every status read error during the rolling restart, which made a sustained apiserver outage indistinguishable from a slow upgrade until the recovery timeout fired. Plumb the journal into the upgrade operation and emit j.Warn on each transient read error so a real outage shows up in the report immediately while one-off blips still retry on the next 10s tick. Constructor signature changes; main.go updated.

[nit] report.EmitAnnotation called Duration.Round(1) which rounds to 1ns -- a no-op. Use Round(time.Second) to match GenerateMarkdown so annotations read '2h17m' instead of '2h16m59.834561287s'.

[nit] monitor.ClusterHealth.WritesHealthy declared but never set or read. Removed; can be added back alongside its populator if/when the writer's healthy/unhealthy signal is wired in.

[documentdb#7 deferred] Pinning go.mod to 1.25.x is non-trivial: indirect k8s.io/* deps were tidied against pre-release versions (dated 20260120) that require go 1.26 features. Downgrading them needs its own change (release-pinned k8s.io/* + transitive resolution) rather than just editing the directive. Tracking separately.

Verified: go vet ./..., go test ./... (config: 19 specs) clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220)

[documentdb#6] writer.writeOne now treats mongo.IsDuplicateKeyError as a successful, idempotent ACK. With retryable writes on by default in the v2 driver, a network blip during a disruption window can produce: server commits insert, ACK is dropped, driver auto-retries the same _id, server returns code 11000. The data is durably committed in that case, so counting it as a write failure (which feeds AllowedWriteFailures and the policy gate) was turning successful writes into spurious FAIL verdicts during the very disruption windows the harness exists to measure.

[documentdb#5] verifier.verifyWriter now resumes from a per-writer nextSeq map instead of re-scanning the entire writer history every 10s. A 100ms-per-write writer accumulates ~864k docs/day, so the original full scan grew to ~75M doc-reads/hour per writer over a 3-day run, both saturating the cluster and turning the verifier's own load into a confounding signal in the report. Filter is now {writer_id, seq: {$gte: nextSeq}}; nextSeq is advanced to lastSeen+1 each cycle. Gaps are still detected because we always start the scan at the first unverified seq -- a future cycle that finds a gap-fill below nextSeq is intentionally not re-checked (an in-flight delayed write that lands below a previously-seen seq would already have produced a gap on the cycle that observed the higher seq).

Verified: go vet ./..., go test ./... clean. Mongo-dependent unit tests for both behaviors land in the next commit alongside the rest of the unit-test sweep.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220)

Adds the per-package unit-test sweep xgerman flagged in PR documentdb#348 [#3]. Workload writer/verifier remain integration-tested only because their meaningful behavior (mongo retryable writes, dup-key handling, incremental scan filters) requires a real or mocked mongo collection that is out of scope for this PR.

journal/policy_test.go covers DisruptionWindow.{IsActive, Duration, ExceededPolicy} including boundary cases (write failures equal to budget allowed; active window evaluated against MustRecoverWithin).

journal/journal_test.go covers Record/Info/Warn/Error append semantics, EventsSince cutoff, the OpenDisruptionWindow auto-close-previous behavior, RecordWriteFailure no-op when no active window, HasPolicyViolation across active and closed windows, and a concurrent-append correctness check.

monitor/health_test.go uses a fakeClusterClient stub to exercise IsSteadyState transitions: not-yet-healthy, becoming-healthy after the configured wait, loss-of-health resets steadySince, poll-error resets steadySince, and WaitForSteadyState's reached/cancelled paths.

monitor/leakdetect_test.go covers the linear-regression slope: below min-samples returns zeroed result, flat memory produces ~0 slope, growing memory above threshold flags HasLeak with the expected slope (~3600 MB/h for 1 MB/s growth), sub-threshold growth does not flag, and the minSamples<3 floor is enforced.

operations/scheduler_test.go uses a fakeOp stub to validate selectOperation: nil when no precondition passes, nil when total weight is zero, only-eligible filtering, weighted-distribution within tolerance over 2000 trials, and executeOp's window-open/close + ERROR-event-on-failure behavior.

report/report_test.go snapshots GenerateMarkdown invariants: PASS vs FAIL header, duration rounded to seconds, data-plane metrics table always present, disruption-windows and leak-analysis sections gated on data presence, and the recent-events tail truncated to 20.

Verified: go test ./... clean. -race not run locally (CGO unavailable on this Windows env); the concurrent-append test still validates final-count correctness without it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from e3a5846 to d6f9adc on May 7, 2026.
Adds a section to test/longhaul/README.md explaining how the long haul
harness relates to the Ginkgo-based e2e suite landed in PR documentdb#346:

- Different shapes (daemon vs test binary) and lifetimes (days vs minutes)
- Different operator-API access patterns (dynamic vs typed client)
- Concrete e2e helpers (mongo, operatorhealth, clusterprobe) that long
  haul will likely consume once it grows beyond raw URI / no-TLS

Explains why a shared test/shared/ module is deliberately not introduced
yet: today's only duplicated surface (raw mongo connect + ping) is too
small to justify a third Go module given the version drift between the
two test modules. Revisit when the driver adopts e2e's connection model.

No code changes.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)


Development

Successfully merging this pull request may close these issues.

Set up long haul test cluster

5 participants