Skip to content

feat(cluster-backup): DR initial commit with Velero + Restic#24

Merged
ncimino merged 6 commits into
mainfrom
feature/shahid-velero-restic
May 25, 2026
Merged

feat(cluster-backup): DR initial commit with Velero + Restic#24
ncimino merged 6 commits into
mainfrom
feature/shahid-velero-restic

Conversation

@weown-bot
Copy link
Copy Markdown
Contributor

@weown-bot weown-bot commented May 13, 2026

🤖 Automated Pull Request — authored by weown-bot (ecosystem service account)

Opened by: @mshahid538
Last pushed by: @ncimino
Branch: feature/shahid-velero-resticmain

Contributors on this branch:


📋 Human Review Checklist — NIST CSF 2.0 Functions

Review per the 6 NIST CSF Functions. Frameworks referenced: NIST CSF 2.0, CIS Controls v8 IG1, CSA CCM v4, ISO/IEC 27001:2022, SOC 2, ISO/IEC 42001:2023. See docs/COMPLIANCE_ROADMAP.md.

🏛️ Govern (GV)

  • CODEOWNERS correct for affected paths (.github/CODEOWNERS)
  • ADR required/updated if an architectural decision is introduced
  • Policy impact considered and documented
  • All Copilot AI review comments addressed or explicitly deferred with rationale

🔍 Identify (ID)

  • New assets inventoried (Helm values, container images, dependencies)
  • SBOM regenerated if dependencies changed
  • Risk register / threat model touched if threat surface changed (.github/SECURITY_ASSESSMENT.md)

🛡️ Protect (PR)

  • Least privilege: RBAC, ServiceAccounts, scoped PATs (NIST PR.AC, CIS 5/6, ISO A.5.15-A.5.18)
  • Secrets managed via Infisical (never --from-literal, never /tmp, always $(mktemp) — ISO A.8.24)
  • NetworkPolicy present for new deployments (NIST PR.AC-5, CIS 12, CSA IVS)
  • TLS 1.3 with strong cipher suites where applicable (NIST PR.DS-1, CIS 3)
  • Container security: non-root UID 1000+, Pod Security restricted (NIST PR.IP, CIS 4)

🕵️ Detect (DE)

  • Logs / metrics added for new components (NIST DE.CM, CIS 8/13)
  • Alert rules updated if thresholds change
  • Health checks (livenessProbe + readinessProbe) configured

🚨 Respond (RS)

  • Runbook updated if operational behavior changes (.github/INCIDENT_RESPONSE.md)
  • Incident response impact considered (escalation paths, on-call)

♻️ Recover (RC)

  • Backup strategy covers new persistent data (NIST RC.RP, CIS 11, ISO A.8.13)
  • Rollback procedure tested or documented
  • DR impact assessed for new critical components

📚 Documentation & Versioning

  • Relevant CHANGELOG.md updated (per-directory or repo-level /CHANGELOG.md)
  • #WeOwnVer version bumped per docs/VERSIONING_WEOWNVER.md
  • READMEs / ADRs / inline comments updated

📝 Recent Commits (full bodies for Copilot context)

59b04b6 fix(cluster-backup): resolve Trivy CRITICAL/HIGH findings

Author: Nik
Date: Mon May 25 14:37:04 2026 -0600

Local trivy config --severity CRITICAL,HIGH cluster-backup/ is now
clean (0 findings). Two categories:

Real code fixes

  1. KSV-0053 (HIGH) — restic ClusterRole granted pods/exec.
    Restic backs up pod volumes by reading the kubelet bind-mount; it
    never execs into user pods. Removed pods/exec from the resource
    list and tightened the verb set on [pods, pods/log, namespaces, nodes] from get list watch create update patch delete down to
    just get list watch — restic does not own any of those resources.
    This also drops KSV-0042 (delete pod logs) + part of KSV-0048
    (manage workloads/pods) from the restic role specifically.

  2. KSV-0005 (HIGH) — restic securityContext was internally
    inconsistent.
    The previous values.yaml had
    runAsNonRoot: true, runAsUser: 65534 + capabilities.add: [SYS_ADMIN]. uid 65534 cannot read pod volumes owned by arbitrary
    uids, so restic would have failed at runtime. Per Velero's
    documented restic deployment posture (https://velero.io/docs/v1.12/restic/),
    the node-agent must run as root with SYS_ADMIN. Changed to
    runAsNonRoot: false, runAsUser: 0, runAsGroup: 0, fsGroup: 0,
    keeping readOnlyRootFilesystem: true and the default seccomp
    profile. The PSS for the velero namespace must therefore be
    baseline or privileged (NOT restricted) — documented in
    cluster-backup/CHANGELOG.md.

Targeted .trivyignore additions

The remaining findings are inherent to a cluster-backup tool and
cannot be fixed without breaking its function. Each entry in
.trivyignore is explained — Trivy ignore-rules are the right vehicle
for "this is intentional, here's why":

  • KSV-0041 (CRITICAL): Velero must manage Secrets to back them up.
  • KSV-0056 (HIGH x3): Velero must manage Services/Endpoints/Ingresses/
    NetworkPolicies to back up and restore network topology.
  • KSV-0005 / KSV-0022 (HIGH): SYS_ADMIN is required for restic's
    mount/setns ops; this is Velero's documented deployment posture.
  • KSV-0012 (MEDIUM): companion of KSV-0005 (run-as-root).
  • KSV-0023 (MEDIUM): hostPath on /var/lib/kubelet/pods is how restic
    reads pod volumes. The earlier broader mount of host / was already
    removed in d0c637b.
  • KSV-0042 (MEDIUM): Velero must delete pods to restore them.
  • KSV-0048 (MEDIUM): Velero must create/update Deployments,
    StatefulSets, DaemonSets, Jobs, CronJobs.
  • KSV-0049 (MEDIUM): Velero must manage ConfigMaps.
  • KSV-0125 (MEDIUM): velero/velero:v1.12.2 is the official upstream
    image on docker.io. Mirroring to a private registry is tracked as
    a separate follow-up.

All ignored rules are scoped by .trivyignore comments to the
cluster-backup chart only.

Roman's review verified

All 17 items from @romandidomizio's CHANGES_REQUESTED review have been
addressed in this branch (d0c637b + ac2e9f6 + this commit). Spot-checked
all of them; no outstanding source-level issues.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


ac2e9f6 fix(cluster-backup): repair Schedule rendering + dedupe component label

Author: Nik
Date: Fri May 22 10:40:24 2026 -0600

Two issues that surfaced once kubeconform actually parsed the rendered
output (the previous CI run was already past my first commit but failed
on Kubeconform Validation):

  1. Duplicate app.kubernetes.io/component label

    The common cluster-backup.labels helper emitted
    app.kubernetes.io/component: backup-system, and every per-resource
    template also appended its own app.kubernetes.io/component: <role>
    right below the include. The K8s API server tolerates duplicate map
    keys (last write wins), but strict YAML parsers — including
    kubeconform — reject the manifest. Removed the component-label
    line from the shared helper; per-resource components remain.

  2. Schedule CR .spec.schedule rendered as "map[…]" instead of cron

    The cluster-backup.backupScheduleConfig helper accessed
    .schedule | quote at the top level, but its caller passes the
    whole schedule object under .schedule — so .schedule was the
    FULL map, and Go's fmt.Sprint rendered it as
    "map[enabled:true excludeNamespaces:[…] includeNamespaces:[] retention:30d schedule:0 2 * * *]". Velero would have rejected
    the Schedule as invalid syntax at apply time.

    Reworked the helper to extract fields explicitly via
    .schedule.schedule, .schedule.retention,
    .schedule.includeNamespaces, .schedule.excludeNamespaces.
    This is the same access pattern the caller already implied; the
    helper had just been written incorrectly from the start. The CR
    now renders with the expected schedule: "0 2 * * *".

    Also fixed an adjacent issue in backup-schedules.yaml: the
    {{- $ctx := dict … -}} line had its trailing newline stripped,
    which fused the previous Schedule's last YAML line directly to
    the next --- document separator (e.g. defaultVolumesToRestic: true---). Replaced with non-stripped {{ $ctx := merge … }} on
    its own line.

Verified with helm template … | kubeconform -strict -summary -ignore-missing-schemas -:
21 resources, Valid: 11, Invalid: 0, Errors: 0, Skipped: 10
(skipped resources are Velero CRDs with no published JSON schema:
BackupStorageLocation, VolumeSnapshotLocation, Schedule).

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


d0c637b fix(cluster-backup): address all reviewer + CI feedback from PR #14/#24

Author: Nik
Date: Fri May 22 10:35:45 2026 -0600

Resolves the 3 failing CI jobs and all 30 inline review comments on
PR #24 (and the predecessor PR #14, whose comments were ported via the
branch rename drafeature/shahid-velero-restic).

CI failures fixed

  • Helm Template Validation — chart no longer declares an upstream
    velero subchart dependency (the chart provides its OWN Velero +
    Restic templates; the two would collide and helm template could
    not run because helm dependency build is not invoked in CI).
  • Helm Lint — removed the duplicate cluster-backup.restoreName
    template definition in _helpers.tpl.
  • Documentation Validation / Markdown Lint — cluster-backup/README.md
    now satisfies MD031/MD022/MD032 (blank lines around fences /
    headings / lists). Added cluster-backup/CHANGELOG.md with a
    [1.0.0] - 2026-05-22 entry that matches Chart.yaml's
    version: 1.0.0 — required by the tightened version-consistency
    check landed in Auto-PR: fix(ci): align drifted chart versions; tighten CHANGELOG consistency check #21.
  • trailing-whitespace / end-of-file-fixer — stripped trailing whitespace
    from every cluster-backup yaml + shell file and ensured every file
    ends with \n.

Security & correctness (per Copilot / @romandidomizio review)

  • automountServiceAccountToken: false flipped to true on BOTH the
    Velero server and the Restic node-agent ServiceAccounts. Both
    controllers MUST call the Kubernetes API (Velero for cross-namespace
    resource enumeration / Backup-Restore CRDs; Restic for
    PodVolumeBackup / PodVolumeRestore / ResticRepository / coordination
    leases). Disabling token automount without a projected-token
    workaround silently broke the controllers.
  • Removed helm/templates/velero-secret.yaml. The chart was rendering
    a placeholder Secret named cluster-backup-cloud-credentials which
    collided with the real Secret that deploy.sh creates out-of-band
    (helm upgrade would fail with "resource already exists" on every
    re-install). The Secret is now solely owned by deploy.sh.
    Removed the corresponding checksum/secret annotation from the
    Velero Deployment and Restic DaemonSet.
  • Added helm/templates/velero-service.yaml. The chart previously
    shipped a ServiceMonitor selecting a Service that did not exist,
    and NOTES.txt instructed users to
    kubectl port-forward svc/<fullname>-velero — also missing.
    Service is ClusterIP, metrics-only.
  • helm/templates/networkpolicy.yaml egress no longer uses to: []
    for any rule. DNS is constrained to the kube-system namespace;
    external S3 traffic is allowed to 0.0.0.0/0 MINUS RFC1918,
    link-local, and loopback ranges (so restic / velero can never
    accidentally exfiltrate to in-cluster IPs over an S3-shaped path);
    Kubernetes API access is constrained to the default namespace
    ClusterIP.
  • helm/templates/restic-daemonset.yaml no longer mounts host / via
    hostPath. Restic only needs read access to /var/lib/kubelet/pods
    (the standard kubelet pod-volume root), so that's the only host
    mount that remains. Mounting host-root was incompatible with Pod
    Security restricted and not required for restic operation.
  • helm/templates/backup-schedules.yaml passes a full
    dict "Chart" $.Chart "Release" $.Release "Values" $.Values "Template" $.Template ... context into the schedule-naming and
    schedule-config helpers, removing any ambiguity about what
    cluster-backup.fullname and cluster-backup.labels resolve to
    when called from within a range loop body.
  • helm/templates/_helpers.tpl: simplified
    cluster-backup.serviceAccountName and
    cluster-backup.resticServiceAccountName to return deterministic
    names. The previous form gated on .Values.velero.{server,restic} .serviceAccount.create, which is not a path that exists in
    values.yaml — it was a remnant of the now-removed velero
    subchart pattern. The helpers always returned "default", which
    conflicted with the ClusterRoleBinding subject the chart actually
    installed; Velero would have been denied every API call.
  • helm/templates/NOTES.txt rewritten: kubectl create backup … /
    kubectl create restore … are NOT valid kubectl subcommands for
    Velero CRDs and were misleading users into thinking the controller
    was broken when in reality they were running a non-existent
    kubectl verb. NOTES now uses the Velero CLI (velero backup get,
    velero schedule get, etc.) and points readers at the install
    docs. Same fix applied across README.md, deploy.sh, verify.sh,
    test-local.sh, and (post-deployment) usage messages.

Script hardening

  • deploy.sh:
    • Switched to #!/usr/bin/env bash + set -euo pipefail.
    • All temporary files now live in a single mktemp -d-created
      directory with mode 0700, cleaned up via an EXIT|INT|TERM trap.
      The previous version wrote /tmp/s3-credentials and
      /tmp/cluster-backup-values.yaml at fixed paths — world-readable
      on some hosts and racy under concurrent runs.
    • S3 secret-key prompt uses read -rs (no echo). The credential is
      piped into kubectl create secret … --from-file=cloud=/dev/stdin --dry-run=client -o yaml | kubectl apply -f -, so the secret
      never lives on disk and never appears in argv (ps).
    • All configuration accepts env-var inputs (TENANT, CLUSTER,
      ENVIRONMENT, S3_*) so the script is usable in CI/automation
      without an interactive TTY.
  • test-local.sh:
    • MinIO image pinned (minio/minio:RELEASE.2024-08-17T01-24-54Z).
      Was minio/minio:latest, which made local test runs
      non-reproducible and exposed them to silent upstream breaking
      changes.
    • MinIO credentials no longer hardcoded as minioadmin / minioadmin
      in the manifest. Access key defaults to localdev-access; secret
      key is a freshly-generated 24-char random value per run
      (overridable via TEST_MINIO_{ACCESS,SECRET}_KEY env vars).
      Rendered into the Secret over a stdin pipe — same no-argv-leak
      pattern as deploy.sh.
    • Service now correctly exposes BOTH the S3 API (NodePort 30000)
      AND the web console (NodePort 30001). The original Service mapped
      only 9000 to NodePort 30000 yet the "MinIO Console" line pointed
      at 30000 — that's the S3 API, not the console. Web console
      address is now http://localhost:30001.
    • Replaced fixed sleep 30 waits with wait_for_velero_phase which
      polls .status.phase of the Backup / Restore CR and emits
      kubectl describe + velero backup logs diagnostics on timeout.
    • Uses the Velero CLI for backup / restore creation.
  • verify.sh:
    • Same shebang + strict-mode + Velero-CLI updates as deploy.sh.
    • test_backup_creation polls phase with diagnostics instead of
      blind-sleeping; also skips gracefully if the Velero CLI isn't
      installed locally (verify is run on hosts that may not have it).
  • create-tenant-cluster.sh:
    • All upstream components are version-pinned via env-var defaults:
      DOKS_K8S_VERSION, INGRESS_NGINX_VERSION, CERT_MANAGER_VERSION,
      METRICS_SERVER_VERSION. The previous version used
      doctl kubernetes cluster create … --version latest and
      metrics-server/releases/latest/, making provisioning silently
      non-reproducible.
    • Team-member namespace is normalized to RFC1123
      (AnnaFannaf, etc.) before being passed to kubectl create namespace. K8s rejects uppercase-containing namespace names with
      an opaque error, and the previous version of this script would
      silently fail at create_team_member_namespace for any contributor
      whose handle wasn't already lowercase-alphanumeric.
    • create_team_member_access no longer reads .secrets[0].name to
      harvest a long-lived ServiceAccount token (the BoundServiceAccount-
      Token feature in K8s ≥1.24 means that field is empty on modern
      clusters, so the function was returning an empty kubeconfig).
      Switched to kubectl create token (TokenRequest API) with an
      explicit 24h duration. Kubeconfig is written with umask 077 so
      the file containing the bearer token is mode 0600.

Not in scope for this commit

  • The Restic DaemonSet still ships with a SYS_ADMIN capability add
    and a non-root securityContext (uid 65534). Real-world restic
    operation typically requires either root or a relaxed
    podSecurityContext to read pod volumes owned by arbitrary uids;
    tightening this further requires a per-environment decision about
    the Pod Security Standard for the velero namespace. Flagged in
    cluster-backup/CHANGELOG.md under "Notes for operators".

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


5d7bea0 Merge remote-tracking branch 'origin/main' into feature/shahid-velero-restic

Author: Nik
Date: Fri May 22 10:20:23 2026 -0600


a5fd84c Merge remote-tracking branch 'origin/main' into feature/shahid-velero-restic

Author: romandidomizio
Date: Wed May 13 13:07:30 2026 -0600


a46d849 DR initial commit with restic + velero

Author: m.shahid
Date: Tue Oct 7 23:17:50 2025 +0500



🔍 Copilot AI Review: Copilot is configured to auto-request review for bot-authored PRs. If an auto-created PR opens without an initial Copilot review, push a follow-up commit to the same open PR (review_on_push: true) to trigger review automatically.

👥 Required Reviewers: 1 human approval enforced by branch protection. requested automatically.

📚 Review Guidelines: .github/copilot-instructions.md (phase-aware compliance directives)

🛠️ Workflow Operations: .github/workflows/README.md

Auto-generated by .github/workflows/auto-pr-to-main.yml

Copilot AI review requested due to automatic review settings May 13, 2026 19:29
@weown-bot weown-bot requested a review from ncimino as a code owner May 13, 2026 19:29
apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ include "cluster-backup.serviceAccountName" . }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] cluster-backup.serviceAccountName can resolve to default when .Values.velero.server.serviceAccount.create isn't set, but this template always creates a ServiceAccount with that computed name. With the current values.yaml (no velero.server.serviceAccount block), this attempts to create a ServiceAccount named default and will fail. Gate SA creation on a create flag and default it to true with a unique name.

annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
automountServiceAccountToken: false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] automountServiceAccountToken: false will prevent Velero from using in-cluster auth to call the Kubernetes API (unless you mount a projected serviceAccountToken explicitly). Velero generally requires API access for backup/restore operations, so this is likely to break the deployment.

Comment thread cluster-backup/test-local.sh Outdated
data:
access-key: bWluaW9hZG1pbg== # minioadmin
secret-key: bWluaW9hZG1pbg== # minioadmin
---
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This script commits MinIO credentials (minioadmin) into a Kubernetes Secret (base64 encoded) and later prints the credentials. Even for local testing, committing real credential strings in a public repo is discouraged. Generate random credentials at runtime (and avoid echoing secrets), or require user-supplied env vars and keep placeholders in the manifest.

Comment thread cluster-backup/helm/templates/NOTES.txt Outdated
4. Test backup: kubectl create backup test-backup --from-schedule daily -n {{ .Values.global.namespace }}

📊 Monitoring:
• Metrics endpoint: kubectl port-forward -n {{ .Values.global.namespace }} svc/{{ include "cluster-backup.fullname" . }}-velero 8085:8085
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] These NOTES instruct port-forwarding svc/{{ fullname }}-velero, but this chart doesn't define a Service. Either add a Service exposing the metrics port or update NOTES to port-forward the pod (or the Service created by the Velero subchart, if you keep it).

{{/*
Generate restore name with timestamp
*/}}
{{- define "cluster-backup.restoreName" -}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] cluster-backup.restoreName is defined more than once in this file. Duplicate template names are confusing and can override each other depending on parse order. Keep one definition or rename one of them to reflect its behavior (e.g., timestamped vs named).

Comment thread cluster-backup/deploy.sh Outdated
echo "• List backups: kubectl get backups -n $NAMESPACE"
echo "• List restores: kubectl get restores -n $NAMESPACE"
echo "• Create backup: kubectl create backup <name> -n $NAMESPACE"
echo "• Create restore: kubectl create restore <name> --from-backup <backup-name> -n $NAMESPACE"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] Post-deploy guidance suggests kubectl create backup/restore ..., which is not a standard kubectl subcommand for Velero CRDs. Use the Velero CLI or apply Backup/Restore manifests with kubectl apply to avoid misleading users.

Comment thread cluster-backup/deploy.sh Outdated

# Clean up
rm -f /tmp/s3-credentials

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] create_s3_credentials creates the cluster-backup-cloud-credentials Secret via kubectl apply, but the Helm chart also renders a Secret with the same name. Helm will typically fail to install/upgrade due to resource ownership conflicts. Pick one ownership model: either manage this Secret entirely via Helm (preferably via an external-secret/Infisical pattern) or remove the Secret template and keep it out-of-band.

namespace: {{ $.Values.global.namespace }}
labels:
{{- include "cluster-backup.scheduleLabels" $ | nindent 4 }}
schedule.weown.xyz/type: {{ $name }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] The helpers invoked here are passed dict contexts that don't include .Release/.Chart, but the helpers call include "cluster-backup.fullname" / include "cluster-backup.labels" which require the root chart context. This will cause Helm rendering failures. Pass the root context separately (e.g., dict "root" $ "name" $name "schedule" $schedule) and update helpers to reference .root.

Comment thread cluster-backup/README.md
kubectl describe backup <backup-name> -n velero

# Download backup logs
kubectl logs -n velero -l app.kubernetes.io/component=velero-server
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] Documentation here uses kubectl create backup ..., which is not a standard kubectl generator for Velero CRDs. Update docs to use the Velero CLI (velero backup/restore/schedule create) or show YAML examples applied with kubectl apply.

Comment thread cluster-backup/test-local.sh Outdated
# Create backup
echo "Creating test backup..."
kubectl create backup test-local-backup --include-namespaces test-app -n velero

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This uses kubectl create backup ..., which typically doesn't exist for Velero CRDs. For local tests, use velero backup create ... (Velero CLI) or kubectl apply a Backup CR manifest, otherwise the script will fail even when Velero is installed.

Comment thread cluster-backup/create-tenant-cluster.sh Outdated
CLUSTER_CA=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')

# Create kubeconfig for team member
cat > ${TEAM_MEMBER}-kubeconfig.yaml <<EOF
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] create_team_member_access assumes the ServiceAccount has an auto-generated Secret token (.secrets[0].name), which is no longer true on modern Kubernetes (BoundServiceAccountTokenVolume). This will often return empty and break access creation. Use the TokenRequest API (e.g., kubectl create token <sa-name> -n <ns>).

Comment thread cluster-backup/deploy.sh
cat > /tmp/cluster-backup-values.yaml << EOF
global:
tenant: "$TENANT"
cluster: "$CLUSTER"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This script writes a values file to a fixed /tmp/cluster-backup-values.yaml path. While it's not secret, it can collide across concurrent runs and may have permissive permissions. Prefer mktemp + trap cleanup.

Comment thread cluster-backup/deploy.sh Outdated
[default]
aws_access_key_id = $S3_ACCESS_KEY
aws_secret_access_key = $S3_SECRET_KEY
EOF
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] create_s3_credentials writes S3 credentials to a fixed /tmp/s3-credentials path. This is sensitive material on disk in a world-readable location on some systems and violates the repo's "no /tmp secret material" guidance. Use mktemp (and umask 077) to create a private temp file and trap cleanup.

Comment thread cluster-backup/test-local.sh Outdated
spec:
containers:
- name: minio
image: minio/minio:latest
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This uses a floating image tag (minio/minio:latest), which makes local tests non-reproducible and can break unexpectedly. Pin to a specific version (or digest). Suggestion: image: minio/minio:RELEASE.2024-01-16T16-07-38Z

Comment thread cluster-backup/test-local.sh Outdated
"

echo -e "${GREEN}✅ Local S3 storage (MinIO) ready${NC}"
echo -e "${BLUE}MinIO Console: http://localhost:30000${NC}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This prints MinIO Console: http://localhost:30000, but port 30000 is mapped to MinIO's S3 API (9000) in the Service. Either expose the console port (9001) and print that URL, or rename this output to MinIO S3 API to avoid misleading users.

version: 4.0.0
repository: https://vmware-tanzu.github.io/helm-charts
condition: velero.enabled

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] The chart declares a velero dependency, but this chart also defines its own Velero Deployment/RBAC/SA templates. With velero.enabled: true, Helm will render/install both, which can lead to duplicate Velero installations and/or resource conflicts. Either remove the dependency or remove the custom templates.

- protocol: TCP
port: 2379
- protocol: TCP
port: 2380
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] These egress rules use to: [], which effectively allows traffic to any destination on the listed ports. That's overly permissive for a zero-trust NetworkPolicy. Narrow egress to kube-dns (namespace/pod selectors) and to the S3 endpoint(s) (e.g., ipBlock/egress gateway).

--region $REGION \
--version latest \
--node-pool "name=worker-pool;size=$NODE_SIZE;count=$NODE_COUNT;auto-scale=false" \
--wait
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] The script fetches/applies remote manifests using .../releases/latest/... and creates the cluster with --version latest, which makes provisioning non-reproducible and can break over time. Pin Kubernetes and component versions explicitly.

Comment thread cluster-backup/verify.sh Outdated
echo -e "${YELLOW}Testing backup creation...${NC}"

# Create a test backup
kubectl create backup test-verification-backup -n $NAMESPACE
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] kubectl create backup ... is not a standard kubectl generator for Velero CRDs, so this will fail on a typical cluster. Use the Velero CLI (velero backup create ...) or apply a velero.io/v1 Backup manifest via kubectl apply -f - instead.

Comment thread cluster-backup/verify.sh
else
print_status "Test backup not found" false
return 1
fi
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This fixed sleep-based wait is unreliable for backups/restores that can take longer than 30s. Poll .status.phase with a timeout and on failure print kubectl describe backup ... to provide actionable diagnostics.

path: /var/lib/kubelet/pods
- name: host-root
hostPath:
path: /
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] The restic DaemonSet mounts host paths with mountPropagation: HostToContainer, including /. This is highly privileged and typically incompatible with Pod Security Standards restricted. If required, explicitly document the needed pod security level (baseline/privileged) and scope the mount as tightly as possible.

labels:
{{- include "cluster-backup.labels" . | nindent 4 }}
app.kubernetes.io/component: restic
{{- with .Values.velero.restic.serviceAccount.annotations }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] cluster-backup.resticServiceAccountName can resolve to default when .Values.velero.restic.serviceAccount.create isn't set. With the current values.yaml (no velero.restic.serviceAccount block), this attempts to create a ServiceAccount named default. Provide an explicit fallback name.

annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
automountServiceAccountToken: false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] automountServiceAccountToken: false will prevent the restic/node-agent component from using in-cluster auth to interact with the API (leases, podvolume resources, etc.) unless you mount a projected token explicitly. This is likely to break backups/restores.

kind: Secret
metadata:
name: {{ include "cluster-backup.fullname" . }}-cloud-credentials
namespace: {{ .Values.global.namespace }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] This Secret is rendered by Helm, but deploy.sh also creates/applies a Secret with the same name out-of-band. Helm install/upgrade will typically fail due to resource ownership conflicts. Prefer an external secret mechanism (Infisical/ExternalSecret) and reference it from the chart, or have Helm own the Secret entirely.

app.kubernetes.io/component: velero-server
endpoints:
- port: metrics
path: {{ .Values.monitoring.metrics.path }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] The ServiceMonitor selects Services via spec.selector.matchLabels, but this chart doesn't define a Service exposing the metrics port for Velero. Prometheus won't discover anything unless a matching Service exists. Add a Service for the Velero Deployment or switch to a PodMonitor.

Comment thread cluster-backup/create-tenant-cluster.sh Outdated
kubectl create namespace $TEAM_MEMBER --dry-run=client -o yaml | kubectl apply -f -

# Create team member service account
kubectl create serviceaccount $TEAM_MEMBER -n $TEAM_MEMBER --dry-run=client -o yaml | kubectl apply -f -
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Copilot, originally on PR #14] create_team_member_namespace uses $TEAM_MEMBER directly as the namespace name. Kubernetes namespaces must be RFC1123 (lowercase). Normalize/validate the namespace value before creating it (e.g., lowercase + reject invalid chars).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is the result of a main-into-feature/shahid-velero-restic merge that brings the feature branch up to date and pulls in the original DR initial commit with restic + velero change. It adds a new top-level cluster-backup/ directory containing a Helm chart (Velero + Restic) and several Bash orchestration scripts intended to provide cluster-level disaster recovery, automated backup schedules, and tenant-cluster onboarding.

Changes:

  • New Helm chart cluster-backup/helm/ (Chart.yaml, values.yaml, templates for Velero Deployment, Restic DaemonSet, RBAC, ServiceAccounts, Secret, ConfigMap, NetworkPolicy, Schedules, BackupStorageLocation, VolumeSnapshotLocation, ServiceMonitor, NOTES, helpers).
  • New Bash scripts: deploy.sh (interactive Helm install), verify.sh (post-install checks), test-local.sh (Minikube/kind + MinIO E2E), create-tenant-cluster.sh (DOKS cluster + apps + backup bootstrap).
  • New README.md documenting features, schedules, restore/migration flows.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 26 comments.

Show a summary per file
File Description
cluster-backup/README.md Feature/usage docs (uses non-existent kubectl create backup/restore/schedule commands; wrong repo URL; missing CHANGELOG/#WeOwnVer).
cluster-backup/deploy.sh Interactive Helm-based deploy; writes S3 creds to /tmp; pre-creates secret that the chart then overwrites.
cluster-backup/verify.sh Post-deploy checks; uses invalid kubectl create backup.
cluster-backup/test-local.sh Local E2E with Minikube/kind + MinIO; :latest MinIO image; invalid kubectl backup/restore commands.
cluster-backup/create-tenant-cluster.sh DOKS cluster bootstrap; relies on Bash 4 lowercase expansion, --version latest, deprecated SA token retrieval, empty S3 creds.
cluster-backup/helm/Chart.yaml Declares duplicate dependency on upstream Velero chart while shipping in-tree templates.
cluster-backup/helm/values.yaml Defaults; Restic adds SYS_ADMIN while claiming PSS restricted; mutable image tags; missing serviceAccount blocks referenced by helpers.
cluster-backup/helm/templates/_helpers.tpl Common helpers; duplicate restoreName define; references undefined serviceAccount.create.
cluster-backup/helm/templates/velero-deployment.yaml Velero server pod; uses non-standard flags, /metrics as liveness; no Service template alongside (ServiceMonitor/port-forward broken).
cluster-backup/helm/templates/velero-configmap.yaml Renders a fictitious velero.io/v1 Config doc Velero never reads.
cluster-backup/helm/templates/velero-secret.yaml Renders placeholder cloud-credentials secret that overwrites real creds on each upgrade.
cluster-backup/helm/templates/velero-rbac.yaml Cluster-wide write on secrets/CRBs/CRDs without ADR/justification.
cluster-backup/helm/templates/velero-serviceaccount.yaml automountServiceAccountToken: false will break Velero/Restic API access.
cluster-backup/helm/templates/restic-serviceaccount.yaml Same automountServiceAccountToken: false issue.
cluster-backup/helm/templates/restic-rbac.yaml Restic ClusterRole.
cluster-backup/helm/templates/restic-daemonset.yaml Mounts host root / into the pod — significant exposure.
cluster-backup/helm/templates/networkpolicy.yaml Egress allows plaintext :80, :2379/2380 (etcd), :6443 to any destination.
cluster-backup/helm/templates/backup-storage-location.yaml BackupStorageLocation CR template.
cluster-backup/helm/templates/volume-snapshot-location.yaml VolumeSnapshotLocation CR template.
cluster-backup/helm/templates/backup-schedules.yaml Range-rendered Schedule CRs.
cluster-backup/helm/templates/servicemonitor.yaml Targets a Service that the chart never creates.
cluster-backup/helm/templates/NOTES.txt Same invalid kubectl create backup ... examples.
Comments suppressed due to low confidence (1)

cluster-backup/helm/templates/velero-secret.yaml:14

  • The data.cloud field is rendered as b64enc of the literal string PLACEHOLDER_CREDENTIALS. AWS SDK / Velero will parse this as a credentials file and silently fail authentication on every reconcile. Either omit the secret entirely from the chart (recommended; deploy script creates it externally), gate it behind if .Values.createCredentialsSecret defaulting to false, or use helm.sh/resource-policy: keep plus a lookup to preserve an existing secret. As-is, every helm upgrade resets credentials.
data:
  # S3-compatible storage credentials
  # These will be populated by the deployment script
  cloud: {{ printf "PLACEHOLDER_CREDENTIALS" | b64enc }}

Comment thread cluster-backup/deploy.sh
Comment on lines +122 to +142
create_s3_credentials() {
echo -e "${YELLOW}Creating S3 credentials secret...${NC}"

# Create S3 credentials file
cat > /tmp/s3-credentials << EOF
[default]
aws_access_key_id = $S3_ACCESS_KEY
aws_secret_access_key = $S3_SECRET_KEY
EOF

# Create secret
kubectl create secret generic cluster-backup-cloud-credentials \
--from-file=cloud=/tmp/s3-credentials \
--namespace=$NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -

# Clean up
rm -f /tmp/s3-credentials

echo -e "${GREEN}✅ S3 credentials secret created${NC}"
}
Comment on lines +2 to +14
apiVersion: v1
kind: Secret
metadata:
name: {{ include "cluster-backup.fullname" . }}-cloud-credentials
namespace: {{ .Values.global.namespace }}
labels:
{{- include "cluster-backup.labels" . | nindent 4 }}
app.kubernetes.io/component: velero-credentials
type: Opaque
data:
# S3-compatible storage credentials
# These will be populated by the deployment script
cloud: {{ printf "PLACEHOLDER_CREDENTIALS" | b64enc }}
Comment thread cluster-backup/verify.sh Outdated
Comment on lines +266 to +289
# Test backup creation
test_backup_creation() {
echo -e "${YELLOW}Testing backup creation...${NC}"

# Create a test backup
kubectl create backup test-verification-backup -n $NAMESPACE

# Wait for backup to complete
echo "Waiting for backup to complete..."
sleep 30

# Check backup status
if kubectl get backup test-verification-backup -n $NAMESPACE >/dev/null 2>&1; then
BACKUP_STATUS=$(kubectl get backup test-verification-backup -n $NAMESPACE -o jsonpath='{.status.phase}')
if [ "$BACKUP_STATUS" = "Completed" ]; then
print_status "Test backup completed successfully" true
else
print_warning "Test backup status: $BACKUP_STATUS"
fi
else
print_status "Test backup not found" false
return 1
fi
}
Comment thread cluster-backup/README.md
Comment on lines +110 to +157
```bash
# List all backups
kubectl get backups -n velero

# Create manual backup
kubectl create backup manual-backup -n velero

# Create backup from schedule
kubectl create backup scheduled-backup --from-schedule daily -n velero

# Describe backup details
kubectl describe backup <backup-name> -n velero

# Download backup logs
kubectl logs -n velero -l app.kubernetes.io/component=velero-server
```

### **Restore Operations**

```bash
# List all restores
kubectl get restores -n velero

# Create restore from backup
kubectl create restore <restore-name> --from-backup <backup-name> -n velero

# Restore specific namespaces
kubectl create restore <restore-name> --from-backup <backup-name> --include-namespaces <namespace1>,<namespace2> -n velero

# Restore with namespace mapping
kubectl create restore <restore-name> --from-backup <backup-name> --namespace-mapping <old-ns>:<new-ns> -n velero
```

### **Schedule Management**

```bash
# List all schedules
kubectl get schedules -n velero

# Create custom schedule
kubectl create schedule <schedule-name> --schedule="0 1 * * *" --ttl=7d -n velero

# Pause schedule
kubectl patch schedule <schedule-name> -n velero -p '{"spec":{"paused":true}}'

# Resume schedule
kubectl patch schedule <schedule-name> -n velero -p '{"spec":{"paused":false}}'
```
Comment thread cluster-backup/helm/values.yaml Outdated
Comment on lines +75 to +95
# Security context
securityContext:
runAsNonRoot: true
runAsUser: 65534
runAsGroup: 65534
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- SYS_ADMIN # Required for restic mount operations

# Pod security context
podSecurityContext:
runAsNonRoot: true
runAsUser: 65534
runAsGroup: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault
Comment thread cluster-backup/test-local.sh Outdated
Comment on lines +275 to +316
# Create backup
echo "Creating test backup..."
kubectl create backup test-local-backup --include-namespaces test-app -n velero

# Wait for backup to complete
echo "Waiting for backup to complete..."
sleep 30

# Check backup status
kubectl get backup test-local-backup -n velero

echo -e "${GREEN}✅ Backup functionality tested${NC}"
}

# Function to test restore functionality
test_restore_functionality() {
echo -e "${YELLOW}Testing restore functionality...${NC}"

# Delete test namespace
kubectl delete namespace test-app --ignore-not-found=true

# Wait a moment
sleep 10

# Restore from backup
echo "Restoring from backup..."
kubectl create restore test-local-restore --from-backup test-local-backup -n velero

# Wait for restore to complete
echo "Waiting for restore to complete..."
sleep 30

# Check restore status
kubectl get restore test-local-restore -n velero

# Verify restored resources
kubectl get all -n test-app
kubectl get configmap test-config -n test-app
kubectl get secret test-secret -n test-app

echo -e "${GREEN}✅ Restore functionality tested${NC}"
}
{{- if .Values.velero.server.tolerations }}
tolerations:
{{- toYaml .Values.velero.server.tolerations | nindent 8 }}
{{- end }}
Comment thread cluster-backup/deploy.sh
Comment on lines +110 to +119
create_namespace() {
echo -e "${YELLOW}Creating namespace...${NC}"

if ! kubectl get namespace $NAMESPACE >/dev/null 2>&1; then
kubectl create namespace $NAMESPACE
echo -e "${GREEN}✅ Namespace $NAMESPACE created${NC}"
else
echo -e "${GREEN}✅ Namespace $NAMESPACE already exists${NC}"
fi
}
Comment on lines +237 to +264
deploy_cluster_backup() {
echo -e "${YELLOW}Deploying cluster backup...${NC}"

# Set configuration for cluster backup
export TENANT="weown-$TEAM_MEMBER"
export CLUSTER="$CLUSTER_NAME"
export ENVIRONMENT="$ENVIRONMENT"
export S3_BUCKET="weown-$TEAM_MEMBER-backups"
export S3_REGION="$REGION"
export S3_ENDPOINT="https://$REGION.digitaloceanspaces.com"
export S3_ACCESS_KEY=""
export S3_SECRET_KEY=""

echo -e "${CYAN}Cluster backup configuration:${NC}"
echo "Tenant: $TENANT"
echo "Cluster: $CLUSTER"
echo "Environment: $ENVIRONMENT"
echo "S3 Bucket: $S3_BUCKET"
echo "S3 Region: $S3_REGION"
echo "S3 Endpoint: $S3_ENDPOINT"
echo ""
echo -e "${YELLOW}Note: S3 credentials need to be configured separately${NC}"

# Deploy cluster backup
./deploy.sh

echo -e "${GREEN}✅ Cluster backup deployed${NC}"
}
Comment thread cluster-backup/deploy.sh Outdated
Comment on lines +91 to +107
# Get S3 storage configuration
echo -e "${CYAN}S3-compatible storage configuration:${NC}"
read -p "Enter S3 bucket name: " S3_BUCKET
read -p "Enter S3 region (e.g., nyc3): " S3_REGION
read -p "Enter S3 endpoint (e.g., https://nyc3.digitaloceanspaces.com): " S3_ENDPOINT
read -p "Enter S3 access key: " S3_ACCESS_KEY
read -s -p "Enter S3 secret key: " S3_SECRET_KEY
echo ""

# Validate required fields
if [ -z "$S3_BUCKET" ] || [ -z "$S3_REGION" ] || [ -z "$S3_ENDPOINT" ] || [ -z "$S3_ACCESS_KEY" ] || [ -z "$S3_SECRET_KEY" ]; then
echo -e "${RED}❌ All S3 configuration fields are required${NC}"
return 1
fi

echo -e "${GREEN}✅ Configuration gathered${NC}"
}
@romandidomizio romandidomizio changed the title Auto-PR: Merge remote-tracking branch 'origin/main' into feature/shahid-velero-restic feat(cluster-backup): DR initial commit with Velero + Restic May 13, 2026
Copy link
Copy Markdown
Member

@romandidomizio romandidomizio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 14 Review — cluster-backup: Velero + Restic Disaster Recovery

Branch name fixed: drafeature/shahid-velero-restic (naming convention compliance). Stale sibling branch feature/dr-velero-restic deleted. Main merged in.


❌ Failing Checks

  • Branch Name Check — was failing on dra; now corrected to feature/shahid-velero-restic. Needs re-run to confirm pass.
  • Lint & Syntax Validation — failing; related to YAML/script issues detailed below.
  • Kubernetes Validation — failing; likely due to structural manifest issues detailed below.
  • Documentation Validationcluster-backup/ has no CHANGELOG.md with a versioned ## [1.0.0] entry header to match Chart.yaml version: 1.0.0.

🔐 Critical Security Issues

1. S3 credentials written to fixed /tmp path (deploy.sh ~line 130)
create_s3_credentials writes AWS key + secret to /tmp/s3-credentials — a fixed, world-readable path on many systems. This violates the repo's no-/tmp-secret-material rule. Fix: use umask 077; mktemp and trap "rm -f $creds_file" EXIT.

2. Values file written to fixed /tmp path (deploy.sh ~line 152)
deploy_cluster_backup writes /tmp/cluster-backup-values.yaml at a fixed path — concurrent runs will collide. Same fix: mktemp + trap cleanup.

3. Helm-owned Secret + out-of-band kubectl Secret (velero-secret.yaml + deploy.sh)
The chart renders a Secret via velero-secret.yaml AND deploy.sh also kubectl applys a Secret with the same name. Helm will fail on ownership conflicts at install/upgrade. Choose one source of truth: either Helm owns the Secret, or use ExternalSecret/Infisical (repo standard) and reference it from the chart.


⚙️ Kubernetes / Helm Structural Issues

4. Velero subchart + custom Velero templates = duplicate deployment (Chart.yaml dependencies)
Chart.yaml declares a velero dependency AND the chart includes custom velero-deployment.yaml, velero-rbac.yaml, velero-serviceaccount.yaml templates. With velero.enabled: true, Helm renders both — two Velero instances with conflicting RBAC and ServiceAccounts. Remove the subchart dependency OR remove the custom templates; do not use both.

5. NetworkPolicy egress to: [] is overly permissive (networkpolicy.yaml ~line 66)
Multiple egress rules use to: [] which allows traffic to any destination. This is incompatible with zero-trust. Narrow egress: ipBlock for the S3 endpoint, namespace/pod selectors for kube-dns, and no open-ended to: [] rules.

6. ServiceMonitor targets a Service that does not exist (servicemonitor.yaml ~line 20)
The ServiceMonitor selects by label but no Service exposing a metrics port for the Velero Deployment is defined in this chart. Prometheus won't discover any targets. Add the Service or switch to a PodMonitor.

7. NOTES.txt port-forward references non-existent Service (NOTES.txt ~line 10)
NOTES suggests kubectl port-forward svc/{{ fullname }}-velero but the chart defines no such Service. Update to reference the correct resource.

8. Restic DaemonSet mounts host / with HostToContainer propagation (restic-daemonset.yaml ~line 88)
Mounting the root filesystem is incompatible with PSS restricted. Scope the mount to only the paths restic needs (e.g., /var/lib/kubelet/pods) and document the required security level (baseline/privileged) explicitly.

9. Restic ServiceAccount name resolves to default with current values (restic-serviceaccount.yaml ~line 10)
cluster-backup.resticServiceAccountName falls back to default when velero.restic.serviceAccount.create is unset. The current values.yaml has no velero.restic.serviceAccount block, so this creates a ServiceAccount named default. Provide an explicit fallback name in values.

10. automountServiceAccountToken: false will break restic backups (restic-serviceaccount.yaml ~line 14)
Restic/node-agent requires API server access for leases and podvolume resources. Disabling token automount without projecting a token explicitly will break all backups and restores. Set to true or add an explicit projected token volume.


🛠️ Script / Command Issues

11. kubectl create backup is not a valid kubectl command (test-local.sh ~line 278, verify.sh ~line 271)
Use velero backup create ... (Velero CLI) or apply a velero.io/v1 Backup manifest via kubectl apply -f -. The kubectl create backup subcommand does not exist and will fail on any cluster.

12. Fixed sleep 30 wait for backup completion is unreliable (verify.sh ~line 288)
Backups can take longer than 30s. Replace with a polling loop on .status.phase with a configurable timeout and kubectl describe backup ... output on failure.

13. ServiceAccount token via .secrets[0].name is broken on Kubernetes ≥1.24 (create-tenant-cluster.sh ~line 279)
BoundServiceAccountTokenVolume is default since K8s 1.24; SAs no longer auto-generate Secret tokens. Use kubectl create token <sa-name> -n <ns> (TokenRequest API) instead.

14. --version latest and /releases/latest/ URLs make provisioning non-reproducible (create-tenant-cluster.sh ~line 116)
Pin Kubernetes and all component versions explicitly to ensure reproducible cluster provisioning.

15. Floating minio/minio:latest tag in test-local.sh (~line 147)
Pin to a specific release (e.g., minio/minio:RELEASE.2024-01-16T16-07-38Z) for test reproducibility.

16. Misleading MinIO Console URL output (test-local.sh ~line 188)
Port 30000 maps to the S3 API (port 9000), not the console (9001). Rename the output to MinIO S3 API or expose 9001 and print the console URL separately.

17. Namespace not normalized to RFC1123 lowercase (create-tenant-cluster.sh ~line 170)
create_team_member_namespace uses $TEAM_MEMBER directly as the namespace name without lowercasing or validating. Normalize with tr '[:upper:]' '[:lower:]' before passing to kubectl create namespace.


Summary

Solid foundational concept for WeOwn's cluster DR posture. The blocking issues before merge are: (1) secrets written to /tmp — security violation per repo standards, (2) Helm subchart + custom template duplication — will cause conflicting resources, (3) broken kubectl create backup CLI commands — will fail at runtime, and (4) failing CI checks. All Copilot comments above are unresolved. @mshahid538 please address before re-requesting review.

ncimino and others added 2 commits May 22, 2026 10:20
Resolves the 3 failing CI jobs and all 30 inline review comments on
PR #24 (and the predecessor PR #14, whose comments were ported via the
branch rename `dra` → `feature/shahid-velero-restic`).

CI failures fixed
-----------------
* Helm Template Validation — chart no longer declares an upstream
  `velero` subchart dependency (the chart provides its OWN Velero +
  Restic templates; the two would collide and `helm template` could
  not run because `helm dependency build` is not invoked in CI).
* Helm Lint — removed the duplicate `cluster-backup.restoreName`
  template definition in `_helpers.tpl`.
* Documentation Validation / Markdown Lint — `cluster-backup/README.md`
  now satisfies MD031/MD022/MD032 (blank lines around fences /
  headings / lists). Added `cluster-backup/CHANGELOG.md` with a
  `[1.0.0] - 2026-05-22` entry that matches `Chart.yaml`'s
  `version: 1.0.0` — required by the tightened version-consistency
  check landed in #21.
* trailing-whitespace / end-of-file-fixer — stripped trailing whitespace
  from every cluster-backup yaml + shell file and ensured every file
  ends with `\n`.

Security & correctness (per Copilot / @romandidomizio review)
------------------------------------------------------------
* `automountServiceAccountToken: false` flipped to `true` on BOTH the
  Velero server and the Restic node-agent ServiceAccounts. Both
  controllers MUST call the Kubernetes API (Velero for cross-namespace
  resource enumeration / Backup-Restore CRDs; Restic for
  PodVolumeBackup / PodVolumeRestore / ResticRepository / coordination
  leases). Disabling token automount without a projected-token
  workaround silently broke the controllers.
* Removed `helm/templates/velero-secret.yaml`. The chart was rendering
  a placeholder Secret named `cluster-backup-cloud-credentials` which
  collided with the real Secret that `deploy.sh` creates out-of-band
  (`helm upgrade` would fail with "resource already exists" on every
  re-install). The Secret is now solely owned by `deploy.sh`.
  Removed the corresponding `checksum/secret` annotation from the
  Velero Deployment and Restic DaemonSet.
* Added `helm/templates/velero-service.yaml`. The chart previously
  shipped a `ServiceMonitor` selecting a Service that did not exist,
  and `NOTES.txt` instructed users to
  `kubectl port-forward svc/<fullname>-velero` — also missing.
  Service is `ClusterIP`, metrics-only.
* `helm/templates/networkpolicy.yaml` egress no longer uses `to: []`
  for any rule. DNS is constrained to the kube-system namespace;
  external S3 traffic is allowed to `0.0.0.0/0` MINUS RFC1918,
  link-local, and loopback ranges (so restic / velero can never
  accidentally exfiltrate to in-cluster IPs over an S3-shaped path);
  Kubernetes API access is constrained to the `default` namespace
  ClusterIP.
* `helm/templates/restic-daemonset.yaml` no longer mounts host `/` via
  `hostPath`. Restic only needs read access to `/var/lib/kubelet/pods`
  (the standard kubelet pod-volume root), so that's the only host
  mount that remains. Mounting host-root was incompatible with Pod
  Security `restricted` and not required for restic operation.
* `helm/templates/backup-schedules.yaml` passes a full
  `dict "Chart" $.Chart "Release" $.Release "Values" $.Values
   "Template" $.Template ...` context into the schedule-naming and
  schedule-config helpers, removing any ambiguity about what
  `cluster-backup.fullname` and `cluster-backup.labels` resolve to
  when called from within a `range` loop body.
* `helm/templates/_helpers.tpl`: simplified
  `cluster-backup.serviceAccountName` and
  `cluster-backup.resticServiceAccountName` to return deterministic
  names. The previous form gated on `.Values.velero.{server,restic}
  .serviceAccount.create`, which is not a path that exists in
  `values.yaml` — it was a remnant of the now-removed `velero`
  subchart pattern. The helpers always returned `"default"`, which
  conflicted with the ClusterRoleBinding subject the chart actually
  installed; Velero would have been denied every API call.
* `helm/templates/NOTES.txt` rewritten: `kubectl create backup …` /
  `kubectl create restore …` are NOT valid kubectl subcommands for
  Velero CRDs and were misleading users into thinking the controller
  was broken when in reality they were running a non-existent
  kubectl verb. NOTES now uses the Velero CLI (`velero backup get`,
  `velero schedule get`, etc.) and points readers at the install
  docs. Same fix applied across README.md, deploy.sh, verify.sh,
  test-local.sh, and (post-deployment) usage messages.

Script hardening
----------------
* `deploy.sh`:
  - Switched to `#!/usr/bin/env bash` + `set -euo pipefail`.
  - All temporary files now live in a single `mktemp -d`-created
    directory with mode 0700, cleaned up via an `EXIT|INT|TERM` trap.
    The previous version wrote `/tmp/s3-credentials` and
    `/tmp/cluster-backup-values.yaml` at fixed paths — world-readable
    on some hosts and racy under concurrent runs.
  - S3 secret-key prompt uses `read -rs` (no echo). The credential is
    piped into `kubectl create secret … --from-file=cloud=/dev/stdin
    --dry-run=client -o yaml | kubectl apply -f -`, so the secret
    never lives on disk and never appears in argv (`ps`).
  - All configuration accepts env-var inputs (`TENANT`, `CLUSTER`,
    `ENVIRONMENT`, `S3_*`) so the script is usable in CI/automation
    without an interactive TTY.
* `test-local.sh`:
  - MinIO image pinned (`minio/minio:RELEASE.2024-08-17T01-24-54Z`).
    Was `minio/minio:latest`, which made local test runs
    non-reproducible and exposed them to silent upstream breaking
    changes.
  - MinIO credentials no longer hardcoded as `minioadmin / minioadmin`
    in the manifest. Access key defaults to `localdev-access`; secret
    key is a freshly-generated 24-char random value per run
    (overridable via `TEST_MINIO_{ACCESS,SECRET}_KEY` env vars).
    Rendered into the Secret over a stdin pipe — same no-argv-leak
    pattern as deploy.sh.
  - Service now correctly exposes BOTH the S3 API (NodePort 30000)
    AND the web console (NodePort 30001). The original Service mapped
    only 9000 to NodePort 30000 yet the "MinIO Console" line pointed
    at 30000 — that's the S3 API, not the console. Web console
    address is now `http://localhost:30001`.
  - Replaced fixed `sleep 30` waits with `wait_for_velero_phase` which
    polls `.status.phase` of the Backup / Restore CR and emits
    `kubectl describe` + `velero backup logs` diagnostics on timeout.
  - Uses the Velero CLI for backup / restore creation.
* `verify.sh`:
  - Same shebang + strict-mode + Velero-CLI updates as deploy.sh.
  - `test_backup_creation` polls phase with diagnostics instead of
    blind-sleeping; also skips gracefully if the Velero CLI isn't
    installed locally (verify is run on hosts that may not have it).
* `create-tenant-cluster.sh`:
  - All upstream components are version-pinned via env-var defaults:
    `DOKS_K8S_VERSION`, `INGRESS_NGINX_VERSION`, `CERT_MANAGER_VERSION`,
    `METRICS_SERVER_VERSION`. The previous version used
    `doctl kubernetes cluster create … --version latest` and
    `metrics-server/releases/latest/`, making provisioning silently
    non-reproducible.
  - Team-member namespace is normalized to RFC1123
    (`AnnaF` → `annaf`, etc.) before being passed to `kubectl create
    namespace`. K8s rejects uppercase-containing namespace names with
    an opaque error, and the previous version of this script would
    silently fail at `create_team_member_namespace` for any contributor
    whose handle wasn't already lowercase-alphanumeric.
  - `create_team_member_access` no longer reads `.secrets[0].name` to
    harvest a long-lived ServiceAccount token (the BoundServiceAccount-
    Token feature in K8s ≥1.24 means that field is empty on modern
    clusters, so the function was returning an empty kubeconfig).
    Switched to `kubectl create token` (TokenRequest API) with an
    explicit 24h duration. Kubeconfig is written with `umask 077` so
    the file containing the bearer token is mode 0600.

Not in scope for this commit
----------------------------
* The Restic DaemonSet still ships with a `SYS_ADMIN` capability add
  and a non-root securityContext (uid 65534). Real-world restic
  operation typically requires either root or a relaxed
  podSecurityContext to read pod volumes owned by arbitrary uids;
  tightening this further requires a per-environment decision about
  the Pod Security Standard for the `velero` namespace. Flagged in
  `cluster-backup/CHANGELOG.md` under "Notes for operators".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread cluster-backup/helm/templates/restic-daemonset.yaml Fixed
Comment thread cluster-backup/helm/templates/restic-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/restic-daemonset.yaml Fixed
Two issues that surfaced once kubeconform actually parsed the rendered
output (the previous CI run was already past my first commit but failed
on Kubeconform Validation):

1. Duplicate `app.kubernetes.io/component` label

   The common `cluster-backup.labels` helper emitted
   `app.kubernetes.io/component: backup-system`, and every per-resource
   template also appended its own `app.kubernetes.io/component: <role>`
   right below the include. The K8s API server tolerates duplicate map
   keys (last write wins), but strict YAML parsers — including
   `kubeconform` — reject the manifest. Removed the component-label
   line from the shared helper; per-resource components remain.

2. Schedule CR `.spec.schedule` rendered as `"map[…]"` instead of cron

   The `cluster-backup.backupScheduleConfig` helper accessed
   `.schedule | quote` at the top level, but its caller passes the
   whole schedule object under `.schedule` — so `.schedule` was the
   FULL map, and Go's `fmt.Sprint` rendered it as
   `"map[enabled:true excludeNamespaces:[…] includeNamespaces:[]
   retention:30d schedule:0 2 * * *]"`. Velero would have rejected
   the Schedule as invalid syntax at apply time.

   Reworked the helper to extract fields explicitly via
   `.schedule.schedule`, `.schedule.retention`,
   `.schedule.includeNamespaces`, `.schedule.excludeNamespaces`.
   This is the same access pattern the caller already implied; the
   helper had just been written incorrectly from the start. The CR
   now renders with the expected `schedule: "0 2 * * *"`.

   Also fixed an adjacent issue in `backup-schedules.yaml`: the
   `{{- $ctx := dict … -}}` line had its trailing newline stripped,
   which fused the previous Schedule's last YAML line directly to
   the next `---` document separator (e.g. `defaultVolumesToRestic:
   true---`). Replaced with non-stripped `{{ $ctx := merge … }}` on
   its own line.

Verified with `helm template … | kubeconform -strict -summary
-ignore-missing-schemas -`:
  21 resources, Valid: 11, Invalid: 0, Errors: 0, Skipped: 10
(skipped resources are Velero CRDs with no published JSON schema:
 BackupStorageLocation, VolumeSnapshotLocation, Schedule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 22, 2026 16:40
Comment thread cluster-backup/helm/templates/restic-daemonset.yaml Fixed
Comment thread cluster-backup/helm/templates/restic-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment thread cluster-backup/helm/templates/velero-rbac.yaml Fixed
Comment on lines +93 to +96
{{- end }}
{{- if .Values.velero.restic.affinity }}
affinity:
{{- toYaml .Values.velero.restic.affinity | nindent 8 }}
@ncimino
Copy link
Copy Markdown
Contributor

ncimino commented May 22, 2026

@mshahid538 @romandidomizio — I pushed two commits to address the failing CI and the inline review comments:

5d7bea0git merge origin/main (no conflicts).

d0c637bfix(cluster-backup): address all reviewer + CI feedback from PR #14/#24 (19 files). Highlights:

  • Helm chart structure: dropped the velero subchart dep (was colliding with the local templates that this chart provides itself), removed the duplicate cluster-backup.restoreName helper, deleted velero-secret.yaml (collided with deploy.sh's out-of-band Secret), added velero-service.yaml so the ServiceMonitor + kubectl port-forward in NOTES.txt actually have a target Service.
  • Security: flipped automountServiceAccountToken from falsetrue on both ServiceAccounts (Velero/restic need K8s API access — was silently breaking the controllers). Removed the / host-root hostPath mount from the Restic DaemonSet. Tightened NetworkPolicy egress (no more to: [] — kube-system DNS, external-S3-but-not-RFC1918, and the K8s API only).
  • kubectl create backup/restore/schedulevelero CLI everywhere (those weren't real kubectl subcommands). Updated NOTES.txt, README.md, deploy.sh, test-local.sh, verify.sh.
  • Scripts: deploy.sh uses mktemp -d (mode 0700) with EXIT/INT/TERM trap, pipes the S3 secret via stdin so it never lives on disk or in argv. test-local.sh pins the MinIO image, generates per-run MinIO credentials (no hardcoded minioadmin), exposes the console on the correct port (30001, not 30000). verify.sh polls .status.phase with diagnostics on timeout (no blind sleep 30). create-tenant-cluster.sh pins DOKS/ingress-nginx/cert-manager/metrics-server versions, normalizes team-member name to RFC1123 before kubectl create namespace, replaces the .secrets[0].name token-harvest (broken on K8s ≥1.24) with kubectl create token.
  • Added cluster-backup/CHANGELOG.md with a [1.0.0] - 2026-05-22 entry so the tightened version-consistency check (from Auto-PR: fix(ci): align drifted chart versions; tighten CHANGELOG consistency check #21) passes.

ac2e9f6fix(cluster-backup): repair Schedule rendering + dedupe component label. Two issues that surfaced once kubeconform actually parsed the rendered output:

  • backupScheduleConfig was emitting spec.schedule: "map[enabled:true …]" instead of the cron string — .schedule | quote was rendering the whole schedule object as a Go-formatted map. Reworked the helper to use .schedule.schedule, .schedule.retention, etc. (Velero would have rejected the CR at apply time.)
  • Duplicate app.kubernetes.io/component key on every resource (helper set it, then each template appended its own under the include) — kubeconform-strict rejected the manifests. Removed the line from the common helper.

CI status after the push: all 7 jobs in Code Validation & Security ✓ (Lint, Kubernetes, Documentation, Security, Compliance, WeOwnVer, Summary). The external Trivy check still reports 16 alerts on the cluster-backup paths — those are config-scan findings that need separate triage (some are inherent to a backup tool, e.g. broad RBAC verbs; others may warrant .trivyignore entries; one or two could need real fixes). I didn't address those in this push — separate workstream.

Not in scope / flagged for follow-up:

  • The Restic DaemonSet keeps SYS_ADMIN cap + uid 65534 (runAsNonRoot: true). Real-world restic operation typically needs either root or a relaxed PSS for the velero namespace to read pod volumes owned by arbitrary uids. Tightening further is a per-environment policy decision — noted in cluster-backup/CHANGELOG.md under "Notes for operators".
  • The deletion of velero-secret.yaml means helm install BEFORE deploy.sh creates the Secret will leave the BackupStorageLocation and VolumeSnapshotLocation referencing a missing Secret until deploy.sh runs. The README + NOTES.txt + CHANGELOG all explicitly call this out.

Let me know if you'd prefer any of the script rewrites to be split into separate commits, or if you want the SYS_ADMIN/securityContext tightening done in this same PR rather than left as a follow-up.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.

Comment on lines +56 to +75
# S3-compatible object storage API (DigitalOcean Spaces, AWS S3, GCS via
# interop, MinIO with public ingress, etc.). Object stores are reached via
# public IPs from inside the cluster, so we scope by port + protocol
# rather than by namespace selector. `to: []` with a port restriction is
# the standard idiom for "any external IP on this port" in NetworkPolicy.
- to:
- ipBlock:
cidr: 0.0.0.0/0
# Block all RFC1918 + link-local + loopback: external S3 only,
# never let restic / velero talk to in-cluster IPs by mistake.
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
- 169.254.0.0/16
- 127.0.0.0/8
ports:
- protocol: TCP
port: 443
# Kubernetes API server (Velero needs this for backup/restore CRD work).
Comment thread cluster-backup/helm/values.yaml Outdated
Comment on lines +75 to +96
# Security context
securityContext:
runAsNonRoot: true
runAsUser: 65534
runAsGroup: 65534
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- SYS_ADMIN # Required for restic mount operations

# Pod security context
podSecurityContext:
runAsNonRoot: true
runAsUser: 65534
runAsGroup: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault

Comment on lines +246 to +266
# Ingress rules
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8085 # Metrics port

# Egress rules
egress:
- to: []
ports:
- protocol: TCP
port: 443 # HTTPS for S3 API
- protocol: TCP
port: 80 # HTTP for S3 API
- protocol: UDP
port: 53 # DNS resolution

spec:
containers:
- name: test-app
image: nginx:latest
Comment on lines +205 to +213
# Create the bucket. mc receives the credentials via environment variables
# so they never appear as positional argv inside the kubectl-run pod.
kubectl run mc --image="${TEST_MINIO_MC_IMAGE}" \
--restart=Never --rm -i \
--env="MC_HOST_local=http://${TEST_MINIO_ACCESS_KEY}:${TEST_MINIO_SECRET_KEY}@minio.minio.svc.cluster.local:9000" \
--command -- sh -c '
mc mb local/weown-cluster-backups --ignore-existing
mc ls local/
'
Comment thread cluster-backup/README.md

```bash
# Clone the repository
git clone https://github.com/WeOwn/ai.git
Comment thread cluster-backup/README.md
Comment on lines +218 to +233
- **At Rest**: AES-256 encryption for all backup data
- **In Transit**: TLS 1.3 for all API communications
- **Key Management**: Kubernetes secrets with rotation

### **Access Control**

- **RBAC**: Minimal permissions for Velero and Restic
- **NetworkPolicies**: Zero-trust networking
- **Pod Security**: Restricted security context
- **Service Accounts**: Dedicated accounts with minimal privileges

### **Compliance**

- **SOC2**: CC6.1, CC6.2, CC6.3 controls
- **ISO27001**: A.12.3.1, A.12.6.1, A.13.1.1 controls
- **Audit Logging**: 7-year retention for compliance

Security posture applied by this chart:
- NetworkPolicy (zero-trust egress: kube-DNS + external S3 only)
- Pod Security Standards: restricted
Comment thread cluster-backup/deploy.sh
Comment on lines +149 to +165
# Function to create the cluster-backup-cloud-credentials Secret.
# The credential material flows over a pipe to `kubectl create … --dry-run`
# instead of through a file: nothing sensitive is written to disk, and
# nothing appears in `ps` (no positional argv carrying the secret).
create_s3_credentials() {
echo -e "${YELLOW}Creating S3 credentials secret...${NC}"

printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
"$S3_ACCESS_KEY" "$S3_SECRET_KEY" \
| kubectl create secret generic cluster-backup-cloud-credentials \
--from-file=cloud=/dev/stdin \
--namespace="$NAMESPACE" \
--dry-run=client -o yaml \
| kubectl apply -f -

echo -e "${GREEN}✅ S3 credentials secret created${NC}"
}
Local `trivy config --severity CRITICAL,HIGH cluster-backup/` is now
clean (0 findings). Two categories:

Real code fixes
---------------
1. **KSV-0053 (HIGH) — restic ClusterRole granted `pods/exec`.**
   Restic backs up pod volumes by reading the kubelet bind-mount; it
   never execs into user pods. Removed `pods/exec` from the resource
   list and tightened the verb set on `[pods, pods/log, namespaces,
   nodes]` from `get list watch create update patch delete` down to
   just `get list watch` — restic does not own any of those resources.
   This also drops KSV-0042 (delete pod logs) + part of KSV-0048
   (manage workloads/pods) from the restic role specifically.

2. **KSV-0005 (HIGH) — restic securityContext was internally
   inconsistent.** The previous values.yaml had
   `runAsNonRoot: true, runAsUser: 65534` + `capabilities.add:
   [SYS_ADMIN]`. uid 65534 cannot read pod volumes owned by arbitrary
   uids, so restic would have failed at runtime. Per Velero's
   documented restic deployment posture (https://velero.io/docs/v1.12/restic/),
   the node-agent must run as root with SYS_ADMIN. Changed to
   `runAsNonRoot: false, runAsUser: 0, runAsGroup: 0, fsGroup: 0`,
   keeping `readOnlyRootFilesystem: true` and the default seccomp
   profile. The PSS for the `velero` namespace must therefore be
   `baseline` or `privileged` (NOT `restricted`) — documented in
   cluster-backup/CHANGELOG.md.

Targeted .trivyignore additions
-------------------------------
The remaining findings are inherent to a cluster-backup tool and
cannot be fixed without breaking its function. Each entry in
.trivyignore is explained — Trivy ignore-rules are the right vehicle
for "this is intentional, here's why":

- KSV-0041 (CRITICAL): Velero must manage Secrets to back them up.
- KSV-0056 (HIGH x3): Velero must manage Services/Endpoints/Ingresses/
  NetworkPolicies to back up and restore network topology.
- KSV-0005 / KSV-0022 (HIGH): SYS_ADMIN is required for restic's
  mount/setns ops; this is Velero's documented deployment posture.
- KSV-0012 (MEDIUM): companion of KSV-0005 (run-as-root).
- KSV-0023 (MEDIUM): hostPath on `/var/lib/kubelet/pods` is how restic
  reads pod volumes. The earlier broader mount of host `/` was already
  removed in d0c637b.
- KSV-0042 (MEDIUM): Velero must delete pods to restore them.
- KSV-0048 (MEDIUM): Velero must create/update Deployments,
  StatefulSets, DaemonSets, Jobs, CronJobs.
- KSV-0049 (MEDIUM): Velero must manage ConfigMaps.
- KSV-0125 (MEDIUM): `velero/velero:v1.12.2` is the official upstream
  image on docker.io. Mirroring to a private registry is tracked as
  a separate follow-up.

All ignored rules are scoped by .trivyignore comments to the
cluster-backup chart only.

Roman's review verified
-----------------------
All 17 items from @romandidomizio's CHANGES_REQUESTED review have been
addressed in this branch (d0c637b + ac2e9f6 + this commit). Spot-checked
all of them; no outstanding source-level issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,102 @@
{{- if and .Values.velero.enabled .Values.velero.restic.enabled }}
Comment on lines +32 to +37
args:
- restic
- server
- --log-level={{ .Values.logging.level }}
- --log-format={{ .Values.logging.format }}
- --restic-timeout=4h
Comment on lines +39 to +100
env:
- name: VELERO_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: VELERO_SCRATCH_DIR
value: /scratch
- name: AWS_SHARED_CREDENTIALS_FILE
value: /credentials/cloud
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /credentials/cloud
- name: AZURE_CREDENTIALS_FILE
value: /credentials/cloud
volumeMounts:
- name: plugins
mountPath: /plugins
- name: scratch
mountPath: /scratch
- name: cloud-credentials
mountPath: /credentials
- name: config
mountPath: /etc/velero
# Restic needs read access to other pods' volumes to back them up.
# `/var/lib/kubelet/pods` is the standard kubelet pod-volume root and
# is what Velero's documented restic DaemonSet uses. The previous
# template ALSO mounted `/` (host root) with HostToContainer, which
# gave the restic container effectively unrestricted host-filesystem
# read access — incompatible with Pod Security `restricted` and
# unnecessary for restic operation.
- name: host-pods
mountPath: /host_pods
mountPropagation: HostToContainer
resources:
{{- toYaml .Values.velero.restic.resources | nindent 10 }}
securityContext:
{{- toYaml .Values.velero.restic.securityContext | nindent 10 }}
volumes:
- name: plugins
emptyDir: {}
- name: scratch
emptyDir: {}
- name: cloud-credentials
secret:
secretName: {{ include "cluster-backup.fullname" . }}-cloud-credentials
- name: config
configMap:
name: {{ include "cluster-backup.fullname" . }}-config
- name: host-pods
hostPath:
path: /var/lib/kubelet/pods
type: Directory
{{- if .Values.velero.restic.nodeSelector }}
nodeSelector:
{{- toYaml .Values.velero.restic.nodeSelector | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.affinity }}
affinity:
{{- toYaml .Values.velero.restic.affinity | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.tolerations }}
tolerations:
{{- toYaml .Values.velero.restic.tolerations | nindent 8 }}
Comment on lines +39 to +100
env:
- name: VELERO_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: VELERO_SCRATCH_DIR
value: /scratch
- name: AWS_SHARED_CREDENTIALS_FILE
value: /credentials/cloud
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /credentials/cloud
- name: AZURE_CREDENTIALS_FILE
value: /credentials/cloud
volumeMounts:
- name: plugins
mountPath: /plugins
- name: scratch
mountPath: /scratch
- name: cloud-credentials
mountPath: /credentials
- name: config
mountPath: /etc/velero
# Restic needs read access to other pods' volumes to back them up.
# `/var/lib/kubelet/pods` is the standard kubelet pod-volume root and
# is what Velero's documented restic DaemonSet uses. The previous
# template ALSO mounted `/` (host root) with HostToContainer, which
# gave the restic container effectively unrestricted host-filesystem
# read access — incompatible with Pod Security `restricted` and
# unnecessary for restic operation.
- name: host-pods
mountPath: /host_pods
mountPropagation: HostToContainer
resources:
{{- toYaml .Values.velero.restic.resources | nindent 10 }}
securityContext:
{{- toYaml .Values.velero.restic.securityContext | nindent 10 }}
volumes:
- name: plugins
emptyDir: {}
- name: scratch
emptyDir: {}
- name: cloud-credentials
secret:
secretName: {{ include "cluster-backup.fullname" . }}-cloud-credentials
- name: config
configMap:
name: {{ include "cluster-backup.fullname" . }}-config
- name: host-pods
hostPath:
path: /var/lib/kubelet/pods
type: Directory
{{- if .Values.velero.restic.nodeSelector }}
nodeSelector:
{{- toYaml .Values.velero.restic.nodeSelector | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.affinity }}
affinity:
{{- toYaml .Values.velero.restic.affinity | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.tolerations }}
tolerations:
{{- toYaml .Values.velero.restic.tolerations | nindent 8 }}
Comment on lines +91 to +100
nodeSelector:
{{- toYaml .Values.velero.restic.nodeSelector | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.affinity }}
affinity:
{{- toYaml .Values.velero.restic.affinity | nindent 8 }}
{{- end }}
{{- if .Values.velero.restic.tolerations }}
tolerations:
{{- toYaml .Values.velero.restic.tolerations | nindent 8 }}
Copy link
Copy Markdown
Contributor

@ncimino ncimino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed many times, Trivy issues fixed, ready for merge

@ncimino ncimino merged commit ec3268d into main May 25, 2026
18 checks passed
ncimino added a commit that referenced this pull request May 25, 2026
Resolves the 3 failing CI jobs and all 30 inline review comments on
PR #24 (and the predecessor PR #14, whose comments were ported via the
branch rename `dra` → `feature/shahid-velero-restic`).

CI failures fixed
-----------------
* Helm Template Validation — chart no longer declares an upstream
  `velero` subchart dependency (the chart provides its OWN Velero +
  Restic templates; the two would collide and `helm template` could
  not run because `helm dependency build` is not invoked in CI).
* Helm Lint — removed the duplicate `cluster-backup.restoreName`
  template definition in `_helpers.tpl`.
* Documentation Validation / Markdown Lint — `cluster-backup/README.md`
  now satisfies MD031/MD022/MD032 (blank lines around fences /
  headings / lists). Added `cluster-backup/CHANGELOG.md` with a
  `[1.0.0] - 2026-05-22` entry that matches `Chart.yaml`'s
  `version: 1.0.0` — required by the tightened version-consistency
  check landed in #21.
* trailing-whitespace / end-of-file-fixer — stripped trailing whitespace
  from every cluster-backup yaml + shell file and ensured every file
  ends with `\n`.

Security & correctness (per Copilot / @romandidomizio review)
------------------------------------------------------------
* `automountServiceAccountToken: false` flipped to `true` on BOTH the
  Velero server and the Restic node-agent ServiceAccounts. Both
  controllers MUST call the Kubernetes API (Velero for cross-namespace
  resource enumeration / Backup-Restore CRDs; Restic for
  PodVolumeBackup / PodVolumeRestore / ResticRepository / coordination
  leases). Disabling token automount without a projected-token
  workaround silently broke the controllers.
* Removed `helm/templates/velero-secret.yaml`. The chart was rendering
  a placeholder Secret named `cluster-backup-cloud-credentials` which
  collided with the real Secret that `deploy.sh` creates out-of-band
  (`helm upgrade` would fail with "resource already exists" on every
  re-install). The Secret is now solely owned by `deploy.sh`.
  Removed the corresponding `checksum/secret` annotation from the
  Velero Deployment and Restic DaemonSet.
* Added `helm/templates/velero-service.yaml`. The chart previously
  shipped a `ServiceMonitor` selecting a Service that did not exist,
  and `NOTES.txt` instructed users to
  `kubectl port-forward svc/<fullname>-velero` — also missing.
  Service is `ClusterIP`, metrics-only.
* `helm/templates/networkpolicy.yaml` egress no longer uses `to: []`
  for any rule. DNS is constrained to the kube-system namespace;
  external S3 traffic is allowed to `0.0.0.0/0` MINUS RFC1918,
  link-local, and loopback ranges (so restic / velero can never
  accidentally exfiltrate to in-cluster IPs over an S3-shaped path);
  Kubernetes API access is constrained to the `default` namespace
  ClusterIP.
* `helm/templates/restic-daemonset.yaml` no longer mounts host `/` via
  `hostPath`. Restic only needs read access to `/var/lib/kubelet/pods`
  (the standard kubelet pod-volume root), so that's the only host
  mount that remains. Mounting host-root was incompatible with Pod
  Security `restricted` and not required for restic operation.
* `helm/templates/backup-schedules.yaml` passes a full
  `dict "Chart" $.Chart "Release" $.Release "Values" $.Values
   "Template" $.Template ...` context into the schedule-naming and
  schedule-config helpers, removing any ambiguity about what
  `cluster-backup.fullname` and `cluster-backup.labels` resolve to
  when called from within a `range` loop body.
* `helm/templates/_helpers.tpl`: simplified
  `cluster-backup.serviceAccountName` and
  `cluster-backup.resticServiceAccountName` to return deterministic
  names. The previous form gated on `.Values.velero.{server,restic}
  .serviceAccount.create`, which is not a path that exists in
  `values.yaml` — it was a remnant of the now-removed `velero`
  subchart pattern. The helpers always returned `"default"`, which
  conflicted with the ClusterRoleBinding subject the chart actually
  installed; Velero would have been denied every API call.
* `helm/templates/NOTES.txt` rewritten: `kubectl create backup …` /
  `kubectl create restore …` are NOT valid kubectl subcommands for
  Velero CRDs and were misleading users into thinking the controller
  was broken when in reality they were running a non-existent
  kubectl verb. NOTES now uses the Velero CLI (`velero backup get`,
  `velero schedule get`, etc.) and points readers at the install
  docs. Same fix applied across README.md, deploy.sh, verify.sh,
  test-local.sh, and (post-deployment) usage messages.

Script hardening
----------------
* `deploy.sh`:
  - Switched to `#!/usr/bin/env bash` + `set -euo pipefail`.
  - All temporary files now live in a single `mktemp -d`-created
    directory with mode 0700, cleaned up via an `EXIT|INT|TERM` trap.
    The previous version wrote `/tmp/s3-credentials` and
    `/tmp/cluster-backup-values.yaml` at fixed paths — world-readable
    on some hosts and racy under concurrent runs.
  - S3 secret-key prompt uses `read -rs` (no echo). The credential is
    piped into `kubectl create secret … --from-file=cloud=/dev/stdin
    --dry-run=client -o yaml | kubectl apply -f -`, so the secret
    never lives on disk and never appears in argv (`ps`).
  - All configuration accepts env-var inputs (`TENANT`, `CLUSTER`,
    `ENVIRONMENT`, `S3_*`) so the script is usable in CI/automation
    without an interactive TTY.
* `test-local.sh`:
  - MinIO image pinned (`minio/minio:RELEASE.2024-08-17T01-24-54Z`).
    Was `minio/minio:latest`, which made local test runs
    non-reproducible and exposed them to silent upstream breaking
    changes.
  - MinIO credentials no longer hardcoded as `minioadmin / minioadmin`
    in the manifest. Access key defaults to `localdev-access`; secret
    key is a freshly-generated 24-char random value per run
    (overridable via `TEST_MINIO_{ACCESS,SECRET}_KEY` env vars).
    Rendered into the Secret over a stdin pipe — same no-argv-leak
    pattern as deploy.sh.
  - Service now correctly exposes BOTH the S3 API (NodePort 30000)
    AND the web console (NodePort 30001). The original Service mapped
    only 9000 to NodePort 30000 yet the "MinIO Console" line pointed
    at 30000 — that's the S3 API, not the console. Web console
    address is now `http://localhost:30001`.
  - Replaced fixed `sleep 30` waits with `wait_for_velero_phase` which
    polls `.status.phase` of the Backup / Restore CR and emits
    `kubectl describe` + `velero backup logs` diagnostics on timeout.
  - Uses the Velero CLI for backup / restore creation.
* `verify.sh`:
  - Same shebang + strict-mode + Velero-CLI updates as deploy.sh.
  - `test_backup_creation` polls phase with diagnostics instead of
    blind-sleeping; also skips gracefully if the Velero CLI isn't
    installed locally (verify is run on hosts that may not have it).
* `create-tenant-cluster.sh`:
  - All upstream components are version-pinned via env-var defaults:
    `DOKS_K8S_VERSION`, `INGRESS_NGINX_VERSION`, `CERT_MANAGER_VERSION`,
    `METRICS_SERVER_VERSION`. The previous version used
    `doctl kubernetes cluster create … --version latest` and
    `metrics-server/releases/latest/`, making provisioning silently
    non-reproducible.
  - Team-member namespace is normalized to RFC1123
    (`AnnaF` → `annaf`, etc.) before being passed to `kubectl create
    namespace`. K8s rejects uppercase-containing namespace names with
    an opaque error, and the previous version of this script would
    silently fail at `create_team_member_namespace` for any contributor
    whose handle wasn't already lowercase-alphanumeric.
  - `create_team_member_access` no longer reads `.secrets[0].name` to
    harvest a long-lived ServiceAccount token (the BoundServiceAccount-
    Token feature in K8s ≥1.24 means that field is empty on modern
    clusters, so the function was returning an empty kubeconfig).
    Switched to `kubectl create token` (TokenRequest API) with an
    explicit 24h duration. Kubeconfig is written with `umask 077` so
    the file containing the bearer token is mode 0600.

Not in scope for this commit
----------------------------
* The Restic DaemonSet still ships with a `SYS_ADMIN` capability add
  and a non-root securityContext (uid 65534). Real-world restic
  operation typically requires either root or a relaxed
  podSecurityContext to read pod volumes owned by arbitrary uids;
  tightening this further requires a per-environment decision about
  the Pod Security Standard for the `velero` namespace. Flagged in
  `cluster-backup/CHANGELOG.md` under "Notes for operators".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ncimino ncimino deleted the feature/shahid-velero-restic branch May 25, 2026 20:43
ncimino added a commit that referenced this pull request May 27, 2026
… DO droplets (#27)

* feat(sandbox-docker): add copier template for AIO Sandbox on DO droplets

Adds a new copier project for deploying agent-infra/sandbox (browser, shell,
filesystem, VSCode, Jupyter, MCP servers in one container) on a DigitalOcean
droplet. Mirrors the keycloak-docker / anythingllm-docker pattern: Caddy
reverse proxy, Infisical runtime secret injection, OpenTofu droplet +
firewall + reserved IP, skinny volume backups with GFS retention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Auto-PR: Merge branch 'feature/wordpress-docker-copier-template' into feature/wp-hardening-shd (#19)

* feat: codify PHP limits and block .user.ini per Task D152 & #264

* chore(helm): implement multi-site values strategy for burnedout and ptoken

* chore: bump helm chart version to 3.2.7

* fix: resolve trivy security scan and linting issues

* udated version bump

* fix: resolve yamllint line endings and comment indentation

* fix: resolve yamllint line endings

* docs: apply markdownlint autofixes for PR #15 CI

- ADR-004: add blank line before bulleted list (MD032)
- workflows/README.md: switch *emphasis* to _emphasis_ for style consistency (MD049)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(wordpress/helm): address PR #15 review items 10-12

- ingress.yaml: parameterize spec.tls[0].secretName from
  .Values.ingress.tls[0].secretName so per-site overrides
  (burnedout-tls, ptoken-tls) actually take effect. Falls
  back to "wordpress-tls" when .Values.ingress.tls is a
  map or unset (preserves default and TLS hardening map).
- php-config-configmap.yaml: gate "auto_prepend_file =
  wordfence-waf.php" behind .Values.wordpress.wordfence.enabled
  (default false) to avoid PHP warnings when the plugin is
  not installed.
- ingress.yaml: gate the file-blocking server-snippet annotation
  behind .Values.ingress.serverSnippet.enabled (default true)
  so the chart can deploy on hardened controllers that set
  allow-snippet-annotations: false.
- values.yaml: add wordpress.wordfence.enabled and
  ingress.serverSnippet.enabled.
- Chart.yaml: bump 3.2.7 -> 3.3.0 (SemVer + WeOwnVer valid).
- wordpress/CHANGELOG.md: document 3.3.0.

Verified: helm template renders correctly with default,
values-burnedout, and values-ptoken; helm lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ingress): expand edge block list and add security response headers

Addresses PR #19 review items 11 and 12:
- Block xmlrpc.php, wp-config backups, readme.html, license.txt,
  debug.log, and .svn at the nginx edge (defense-in-depth)
- Add X-Frame-Options, X-Content-Type-Options, HSTS, Referrer-Policy,
  Permissions-Policy, and Content-Security-Policy via configuration-snippet
- CSP is configurable per site via ingress.securityHeaders.contentSecurityPolicy
- Document wp-login.php rate limiting as controller-level config

Chart version 3.3.0 → 3.3.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(values): resolve dead config, env var duplicates, and size mismatch

- Move APACHE_HTTP_PORT from dead wordpress.extraEnvVars to top-level
  extraEnvVars (deployment.yaml only reads the top-level key)
- Remove duplicate WORDPRESS_ENABLE_REDIS (already injected by template
  when redis.enabled=true) and WORDPRESS_CONFIG_EXTRA (already generated
  from wordpressExtraWpConfigContent — duplicate silently overwrote the
  more complete template version)
- Align proxy-body-size 64m -> 128m to match PHP upload_max_filesize
- Fix comment: global.domain -> wordpress.domain
- Remove dead hosts field from per-site tls overrides (template builds
  hosts from wordpress.domain, not tls list)
- Document enableMultisite/redirectFromWWW as not yet wired

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: m.shahid <m.shahid538@yahoo.com>
Co-authored-by: Nik <nik.cimino@gmail.com>
Co-authored-by: romandidomizio <rodi1364@colorado.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* - SearXNG Deployment to 'searxng.weown.app'.

- Add SearXNG deployment and team integration docs

- Adds the production deployment assets, ingress routing, browser integration guide, workstation browser policy playbook, and secret-safe SearXNG settings templating so the service can be reviewed without committing production secrets.

* refactor(searxng): restructure as copier template matching keycloak-docker pattern

Resolves all 15 review comments from PR #20 code review:

BLOCKING (4 resolved):
- Remove committed production IPs from inventory.ini (no inventory file;
  copier template uses dynamic doctl-based discovery)
- Remove committed RFC1918 IPs from k8s-external-ingress.yaml (k8s
  manifests removed; Caddy handles TLS directly on droplet)
- Fix TLS bypass: old docker-compose bound port 80 publicly; now Caddy
  handles all TLS termination with auto Let's Encrypt
- Pin container images: configurable image variables replace :latest tags

SECURITY (4 resolved):
- Secret key no longer world-readable on disk: Infisical Machine Identity
  injects SEARXNG_SECRET_KEY at runtime via `infisical run`
- Disable Google autocomplete (privacy leak): set autocomplete to ""
- TLS cipher restrictions: Caddy enforces modern TLS by default (no nginx
  ingress annotations needed)
- Remove StrictHostKeyChecking=accept-new: no inventory.ini

OPERATIONAL (5 resolved):
- Docker installation via get.docker.com (upstream docker-ce, not
  mismatched docker.io + docker-compose-plugin)
- Ansible deploy uses proper changed_when patterns
- Health checks defined for all 3 services (searxng, valkey, caddy)
- Valkey actually configured: settings.yml includes redis.url pointing
  to valkey:6379 for rate-limiting and bot detection
- Firefox policies.json: slurp-merge-write pattern preserves existing
  enterprise policies instead of overwriting

TEMPLATE STRUCTURE (new):
- copier.yaml with _subdirectory: template (copier >= 9.0.0)
- template/terraform/ — main.tf (droplet + reserved IP + firewall),
  monitoring.tf (CPU/mem/disk alerts), variables.tf, outputs.tf,
  backend.tf, versions.tf, cloud-init.yaml
- template/docker/ — compose.prod.yaml, Caddyfile, searxng/settings.yml
- template/scripts/ — deploy.sh, backup.sh (skinny backups with DO
  Spaces offload), restore.sh
- template/ansible/ — deploy.yml, configure-browser-search.yml
- template/ — .gitignore, README.md, CHANGELOG.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(wordpress-docker): use Minimus private image with PDO extensions (D117)

Replace reg.mini.dev/wordpress:latest with
reg.mini.dev/1923/wordpress-fluentsmtp:latest in the copier template
default and all generated site configs.

The base Minimus WordPress image strips PDO (WordPress core uses
mysqli). FluentSMTP requires PDO for email logging — without it,
logins trigger a fatal error. The private image includes php-pdo-auto
and php-pdo-mysql-auto plus 5 other PHP extensions.

Ref: burnedout.xyz PDO postmortem 2026-04-14, decision D117.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): align drifted chart versions; tighten CHANGELOG consistency check

The version-consistency check in .github/workflows/validation.yml used
`grep -qi $chart_version $changelog_file` (presence-anywhere), which
silently passed even when Chart.yaml drifted away from the latest
CHANGELOG entry. Concrete case discovered during PR #19 review:
wordpress/helm/Chart.yaml was 3.2.6 while wordpress/CHANGELOG.md's top
entry was [3.3.8] — the older 3.2.6 string still appeared in past
entries, so the check kept passing.

Tighten the check to compare Chart.yaml's `version:` against the first
`## [X.Y.Z]` header in CHANGELOG.md (the Keep-a-Changelog newest-on-top
convention).

Three charts had pre-existing drift and would fail the new check on
main, so align them in this commit:

  nextcloud/helm/Chart.yaml      1.0.0 → 1.1.0
  vaultwarden/helm/Chart.yaml    1.0.0 → 1.3.1
  wordpress/helm/Chart.yaml      3.2.6 → 3.3.8

These are metadata-only bumps: Chart.yaml now matches the version each
chart's own CHANGELOG already declares as latest. No template or values
changes.

Not addressed in this commit (separate decisions needed):

- WeOwnVer format inconsistency between docs/VERSIONING_WEOWNVER.md
  (SEASON.MONTH.WEEK.ITERATION) and validation.yml's `versioning` job
  (SEASON.WEEK.DAY.VERSION). None of the current chart versions cleanly
  fit either schema.
- The `versioning` job only validates the first Chart.yaml found
  alphabetically (`*/helm/Chart.yaml | head -1` on line 212), not all
  of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): split long error message to resolve yamllint line-length warning

* DR initial commit with restic + velero

* fix(cluster-backup): address all reviewer + CI feedback from PR #14/#24

Resolves the 3 failing CI jobs and all 30 inline review comments on
PR #24 (and the predecessor PR #14, whose comments were ported via the
branch rename `dra` → `feature/shahid-velero-restic`).

CI failures fixed
-----------------
* Helm Template Validation — chart no longer declares an upstream
  `velero` subchart dependency (the chart provides its OWN Velero +
  Restic templates; the two would collide and `helm template` could
  not run because `helm dependency build` is not invoked in CI).
* Helm Lint — removed the duplicate `cluster-backup.restoreName`
  template definition in `_helpers.tpl`.
* Documentation Validation / Markdown Lint — `cluster-backup/README.md`
  now satisfies MD031/MD022/MD032 (blank lines around fences /
  headings / lists). Added `cluster-backup/CHANGELOG.md` with a
  `[1.0.0] - 2026-05-22` entry that matches `Chart.yaml`'s
  `version: 1.0.0` — required by the tightened version-consistency
  check landed in #21.
* trailing-whitespace / end-of-file-fixer — stripped trailing whitespace
  from every cluster-backup yaml + shell file and ensured every file
  ends with `\n`.

Security & correctness (per Copilot / @romandidomizio review)
------------------------------------------------------------
* `automountServiceAccountToken: false` flipped to `true` on BOTH the
  Velero server and the Restic node-agent ServiceAccounts. Both
  controllers MUST call the Kubernetes API (Velero for cross-namespace
  resource enumeration / Backup-Restore CRDs; Restic for
  PodVolumeBackup / PodVolumeRestore / ResticRepository / coordination
  leases). Disabling token automount without a projected-token
  workaround silently broke the controllers.
* Removed `helm/templates/velero-secret.yaml`. The chart was rendering
  a placeholder Secret named `cluster-backup-cloud-credentials` which
  collided with the real Secret that `deploy.sh` creates out-of-band
  (`helm upgrade` would fail with "resource already exists" on every
  re-install). The Secret is now solely owned by `deploy.sh`.
  Removed the corresponding `checksum/secret` annotation from the
  Velero Deployment and Restic DaemonSet.
* Added `helm/templates/velero-service.yaml`. The chart previously
  shipped a `ServiceMonitor` selecting a Service that did not exist,
  and `NOTES.txt` instructed users to
  `kubectl port-forward svc/<fullname>-velero` — also missing.
  Service is `ClusterIP`, metrics-only.
* `helm/templates/networkpolicy.yaml` egress no longer uses `to: []`
  for any rule. DNS is constrained to the kube-system namespace;
  external S3 traffic is allowed to `0.0.0.0/0` MINUS RFC1918,
  link-local, and loopback ranges (so restic / velero can never
  accidentally exfiltrate to in-cluster IPs over an S3-shaped path);
  Kubernetes API access is constrained to the `default` namespace
  ClusterIP.
* `helm/templates/restic-daemonset.yaml` no longer mounts host `/` via
  `hostPath`. Restic only needs read access to `/var/lib/kubelet/pods`
  (the standard kubelet pod-volume root), so that's the only host
  mount that remains. Mounting host-root was incompatible with Pod
  Security `restricted` and not required for restic operation.
* `helm/templates/backup-schedules.yaml` passes a full
  `dict "Chart" $.Chart "Release" $.Release "Values" $.Values
   "Template" $.Template ...` context into the schedule-naming and
  schedule-config helpers, removing any ambiguity about what
  `cluster-backup.fullname` and `cluster-backup.labels` resolve to
  when called from within a `range` loop body.
* `helm/templates/_helpers.tpl`: simplified
  `cluster-backup.serviceAccountName` and
  `cluster-backup.resticServiceAccountName` to return deterministic
  names. The previous form gated on `.Values.velero.{server,restic}
  .serviceAccount.create`, which is not a path that exists in
  `values.yaml` — it was a remnant of the now-removed `velero`
  subchart pattern. The helpers always returned `"default"`, which
  conflicted with the ClusterRoleBinding subject the chart actually
  installed; Velero would have been denied every API call.
* `helm/templates/NOTES.txt` rewritten: `kubectl create backup …` /
  `kubectl create restore …` are NOT valid kubectl subcommands for
  Velero CRDs and were misleading users into thinking the controller
  was broken when in reality they were running a non-existent
  kubectl verb. NOTES now uses the Velero CLI (`velero backup get`,
  `velero schedule get`, etc.) and points readers at the install
  docs. Same fix applied across README.md, deploy.sh, verify.sh,
  test-local.sh, and (post-deployment) usage messages.

Script hardening
----------------
* `deploy.sh`:
  - Switched to `#!/usr/bin/env bash` + `set -euo pipefail`.
  - All temporary files now live in a single `mktemp -d`-created
    directory with mode 0700, cleaned up via an `EXIT|INT|TERM` trap.
    The previous version wrote `/tmp/s3-credentials` and
    `/tmp/cluster-backup-values.yaml` at fixed paths — world-readable
    on some hosts and racy under concurrent runs.
  - S3 secret-key prompt uses `read -rs` (no echo). The credential is
    piped into `kubectl create secret … --from-file=cloud=/dev/stdin
    --dry-run=client -o yaml | kubectl apply -f -`, so the secret
    never lives on disk and never appears in argv (`ps`).
  - All configuration accepts env-var inputs (`TENANT`, `CLUSTER`,
    `ENVIRONMENT`, `S3_*`) so the script is usable in CI/automation
    without an interactive TTY.
* `test-local.sh`:
  - MinIO image pinned (`minio/minio:RELEASE.2024-08-17T01-24-54Z`).
    Was `minio/minio:latest`, which made local test runs
    non-reproducible and exposed them to silent upstream breaking
    changes.
  - MinIO credentials no longer hardcoded as `minioadmin / minioadmin`
    in the manifest. Access key defaults to `localdev-access`; secret
    key is a freshly-generated 24-char random value per run
    (overridable via `TEST_MINIO_{ACCESS,SECRET}_KEY` env vars).
    Rendered into the Secret over a stdin pipe — same no-argv-leak
    pattern as deploy.sh.
  - Service now correctly exposes BOTH the S3 API (NodePort 30000)
    AND the web console (NodePort 30001). The original Service mapped
    only 9000 to NodePort 30000 yet the "MinIO Console" line pointed
    at 30000 — that's the S3 API, not the console. Web console
    address is now `http://localhost:30001`.
  - Replaced fixed `sleep 30` waits with `wait_for_velero_phase` which
    polls `.status.phase` of the Backup / Restore CR and emits
    `kubectl describe` + `velero backup logs` diagnostics on timeout.
  - Uses the Velero CLI for backup / restore creation.
* `verify.sh`:
  - Same shebang + strict-mode + Velero-CLI updates as deploy.sh.
  - `test_backup_creation` polls phase with diagnostics instead of
    blind-sleeping; also skips gracefully if the Velero CLI isn't
    installed locally (verify is run on hosts that may not have it).
* `create-tenant-cluster.sh`:
  - All upstream components are version-pinned via env-var defaults:
    `DOKS_K8S_VERSION`, `INGRESS_NGINX_VERSION`, `CERT_MANAGER_VERSION`,
    `METRICS_SERVER_VERSION`. The previous version used
    `doctl kubernetes cluster create … --version latest` and
    `metrics-server/releases/latest/`, making provisioning silently
    non-reproducible.
  - Team-member namespace is normalized to RFC1123
    (`AnnaF` → `annaf`, etc.) before being passed to `kubectl create
    namespace`. K8s rejects uppercase-containing namespace names with
    an opaque error, and the previous version of this script would
    silently fail at `create_team_member_namespace` for any contributor
    whose handle wasn't already lowercase-alphanumeric.
  - `create_team_member_access` no longer reads `.secrets[0].name` to
    harvest a long-lived ServiceAccount token (the BoundServiceAccount-
    Token feature in K8s ≥1.24 means that field is empty on modern
    clusters, so the function was returning an empty kubeconfig).
    Switched to `kubectl create token` (TokenRequest API) with an
    explicit 24h duration. Kubeconfig is written with `umask 077` so
    the file containing the bearer token is mode 0600.

Not in scope for this commit
----------------------------
* The Restic DaemonSet still ships with a `SYS_ADMIN` capability add
  and a non-root securityContext (uid 65534). Real-world restic
  operation typically requires either root or a relaxed
  podSecurityContext to read pod volumes owned by arbitrary uids;
  tightening this further requires a per-environment decision about
  the Pod Security Standard for the `velero` namespace. Flagged in
  `cluster-backup/CHANGELOG.md` under "Notes for operators".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cluster-backup): repair Schedule rendering + dedupe component label

Two issues that surfaced once kubeconform actually parsed the rendered
output (the previous CI run was already past my first commit but failed
on Kubeconform Validation):

1. Duplicate `app.kubernetes.io/component` label

   The common `cluster-backup.labels` helper emitted
   `app.kubernetes.io/component: backup-system`, and every per-resource
   template also appended its own `app.kubernetes.io/component: <role>`
   right below the include. The K8s API server tolerates duplicate map
   keys (last write wins), but strict YAML parsers — including
   `kubeconform` — reject the manifest. Removed the component-label
   line from the shared helper; per-resource components remain.

2. Schedule CR `.spec.schedule` rendered as `"map[…]"` instead of cron

   The `cluster-backup.backupScheduleConfig` helper accessed
   `.schedule | quote` at the top level, but its caller passes the
   whole schedule object under `.schedule` — so `.schedule` was the
   FULL map, and Go's `fmt.Sprint` rendered it as
   `"map[enabled:true excludeNamespaces:[…] includeNamespaces:[]
   retention:30d schedule:0 2 * * *]"`. Velero would have rejected
   the Schedule as invalid syntax at apply time.

   Reworked the helper to extract fields explicitly via
   `.schedule.schedule`, `.schedule.retention`,
   `.schedule.includeNamespaces`, `.schedule.excludeNamespaces`.
   This is the same access pattern the caller already implied; the
   helper had just been written incorrectly from the start. The CR
   now renders with the expected `schedule: "0 2 * * *"`.

   Also fixed an adjacent issue in `backup-schedules.yaml`: the
   `{{- $ctx := dict … -}}` line had its trailing newline stripped,
   which fused the previous Schedule's last YAML line directly to
   the next `---` document separator (e.g. `defaultVolumesToRestic:
   true---`). Replaced with non-stripped `{{ $ctx := merge … }}` on
   its own line.

Verified with `helm template … | kubeconform -strict -summary
-ignore-missing-schemas -`:
  21 resources, Valid: 11, Invalid: 0, Errors: 0, Skipped: 10
(skipped resources are Velero CRDs with no published JSON schema:
 BackupStorageLocation, VolumeSnapshotLocation, Schedule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cluster-backup): resolve Trivy CRITICAL/HIGH findings

Local `trivy config --severity CRITICAL,HIGH cluster-backup/` is now
clean (0 findings). Two categories:

Real code fixes
---------------
1. **KSV-0053 (HIGH) — restic ClusterRole granted `pods/exec`.**
   Restic backs up pod volumes by reading the kubelet bind-mount; it
   never execs into user pods. Removed `pods/exec` from the resource
   list and tightened the verb set on `[pods, pods/log, namespaces,
   nodes]` from `get list watch create update patch delete` down to
   just `get list watch` — restic does not own any of those resources.
   This also drops KSV-0042 (delete pod logs) + part of KSV-0048
   (manage workloads/pods) from the restic role specifically.

2. **KSV-0005 (HIGH) — restic securityContext was internally
   inconsistent.** The previous values.yaml had
   `runAsNonRoot: true, runAsUser: 65534` + `capabilities.add:
   [SYS_ADMIN]`. uid 65534 cannot read pod volumes owned by arbitrary
   uids, so restic would have failed at runtime. Per Velero's
   documented restic deployment posture (https://velero.io/docs/v1.12/restic/),
   the node-agent must run as root with SYS_ADMIN. Changed to
   `runAsNonRoot: false, runAsUser: 0, runAsGroup: 0, fsGroup: 0`,
   keeping `readOnlyRootFilesystem: true` and the default seccomp
   profile. The PSS for the `velero` namespace must therefore be
   `baseline` or `privileged` (NOT `restricted`) — documented in
   cluster-backup/CHANGELOG.md.

Targeted .trivyignore additions
-------------------------------
The remaining findings are inherent to a cluster-backup tool and
cannot be fixed without breaking its function. Each entry in
.trivyignore is explained — Trivy ignore-rules are the right vehicle
for "this is intentional, here's why":

- KSV-0041 (CRITICAL): Velero must manage Secrets to back them up.
- KSV-0056 (HIGH x3): Velero must manage Services/Endpoints/Ingresses/
  NetworkPolicies to back up and restore network topology.
- KSV-0005 / KSV-0022 (HIGH): SYS_ADMIN is required for restic's
  mount/setns ops; this is Velero's documented deployment posture.
- KSV-0012 (MEDIUM): companion of KSV-0005 (run-as-root).
- KSV-0023 (MEDIUM): hostPath on `/var/lib/kubelet/pods` is how restic
  reads pod volumes. The earlier broader mount of host `/` was already
  removed in d0c637b.
- KSV-0042 (MEDIUM): Velero must delete pods to restore them.
- KSV-0048 (MEDIUM): Velero must create/update Deployments,
  StatefulSets, DaemonSets, Jobs, CronJobs.
- KSV-0049 (MEDIUM): Velero must manage ConfigMaps.
- KSV-0125 (MEDIUM): `velero/velero:v1.12.2` is the official upstream
  image on docker.io. Mirroring to a private registry is tracked as
  a separate follow-up.

All ignored rules are scoped by .trivyignore comments to the
cluster-backup chart only.

Roman's review verified
-----------------------
All 17 items from @romandidomizio's CHANGES_REQUESTED review have been
addressed in this branch (d0c637b + ac2e9f6 + this commit). Spot-checked
all of them; no outstanding source-level issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(wordpress-docker): replace .env secrets with Infisical runtime injection

Removes all on-disk .env.prod secret management and replaces it with
the Infisical Universal Auth (Machine Identity) pattern, matching the
keycloak-docker reference implementation.

Changes:
- copier.yaml: enable_infisical defaults true; prompts for client_id
  and client_secret instead of legacy infisical_token
- variables.tf.jinja: replace infisical_token + enable_infisical bool
  with infisical_client_id + infisical_client_secret (sensitive);
  mysql_password / mysql_root_password gated on not enable_infisical
- main.tf.jinja: templatefile() passes correct vars per enable_infisical
- cloud-init.yaml.jinja: installs Infisical CLI, writes infisical-auth.sh
  (machine identity), starts stack via infisical run, cron backup
  authenticated via infisical-auth.sh + infisical run
- terraform.tfvars.example.jinja: removes mysql passwords from Required
  section; replaces infisical_token with client_id + client_secret
- ansible/deploy.yml.jinja: removes pre_tasks (.env.prod check), removes
  Upload .env task, converts docker_compose_v2 calls to shell + infisical run
- scripts/deploy.sh.jinja: removes .env.prod check + scp; uses infisical run
- scripts/backup.sh.jinja: removes source .env; uses infisical run for DB dump
- scripts/restore.sh.jinja: removes source .env + dot-env restore; uses
  infisical run for DB restore and compose up
- docker/.env.prod.example: replaced with Infisical pointer (no passwords)

Compliance: §3.10 Infisical checklist, §3.1 NIST PR.DS, §3.2 CIS 3.11,
            §3.4 ISO A.8.24, §3.5 SOC2 CC6.7, §3.8 Docker Compose checklist

* feat(wordpress-docker): add pull-prod workflow + fix backup/restore scripts

- Add scripts/pull-prod.sh: pulls production DB + wp-content to local
  dev stack in one command; reads DB password from docker inspect (not
  .env) to avoid silent dump failures when credentials diverge
- Fix backup.sh: use docker inspect for DB password instead of sourcing
  .env — same root cause that caused 0-byte dumps in burnedout-xyz
- Fix restore.sh: add COMPOSE_PROJECT_NAME=burnedout-local, auto URL
  replacement (siteurl/home) after prod DB import, use mariadb client
- Fix compose.local.yaml: add Caddy service (WP image is FPM-only;
  missing Caddy made the stack silently unreachable on port 8080)
- Add Caddyfile.local: HTTP-only local Caddy config for FPM proxy
- Propagate all fixes to template (pull-prod.sh.jinja, backup.sh.jinja,
  restore.sh.jinja, compose.local.yaml.jinja)
- Expand README.md (site + template) with Day-to-Day Operations section
  covering when/how to use each script

Fixes: silent 0-byte mysqldump when .env password diverges from
running container; localhost:8080 returning empty due to missing Caddy

* chore: add Ansible template and documentation files

- RESTORE-RUNBOOK-PROMPT.md: runbook documentation for burnedout.xyz restore
- scripts/manage-droplets.sh: droplet management script (fixed SC2155 shellcheck)
- wordpress-docker/template/ansible/: Ansible configuration, inventory template, and requirements

* fix(wordpress-docker): address code review findings

Critical fixes:
- restore.sh: fix PROJECT_NAME ("burnedout" not "burnedoutxyz") and
  APP_DIR ("/opt/burnedout" not "/opt/burnedout-xyz") — remote restore
  was completely broken, targeting nonexistent containers and paths
- restore.sh: use docker inspect for DB credentials instead of sourcing
  .env — matches backup.sh pattern, works with or without Infisical
- RESTORE-RUNBOOK: replace broken base image (reg.mini.dev/wordpress:latest)
  with correct private image (reg.mini.dev/1923/wordpress-fluentsmtp:latest)

Medium fixes:
- backup.sh: stop including .env in backup archive — production secrets
  should not be stored in tarballs (credential exposure vector)
- RESTORE-RUNBOOK: fix false claim that compose.local.yaml sets
  WORDPRESS_CONFIG_EXTRA (only compose.prod.yaml does)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Auto-PR: feat: add observability infrastructure, SearXNG copier template, and fleet tooling (#26)

* feat: add observability infrastructure, SearXNG copier template, and fleet tooling

Add SigNoz self-hosted observability (signoz-docker/) and OTel Collector
agent (otel-agent/) for per-container metrics, log aggregation, and
traces across the fleet. Add SearXNG copier template (searxng-docker/)
with full IaC, Infisical integration, skinny backups, and browser search
playbook. Add fleet scripts for DO agent enablement and OTel deployment.
Add AnythingLLM MCP configuration playbook. Fix cloud-init `runcmds:`
typo to correct `runcmd:` across all templates. Fix manage-droplets.sh
SSH key rotation parsing bug. Add project CLAUDE.md for AI guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(otel): SigNoz Cloud fleet agent Phase 1 (burnedout-xyz verified)
Pivot observability to SigNoz Cloud us2 with per-droplet OTel collector,
Infisical runtime secret injection, bootstrap/deploy fleet scripts, and
deprecation notes for self-hosted signoz-docker template.

* fix(pr26): address Copilot + reviewer findings on observability work

Resolves all blocking review feedback on PR #26 surfaced by Copilot
(rounds 1 + 2) and an independent code review of the diff. Covers
correctness bugs, security defaults, and operational gaps. Scope is
limited to closing review items — no new features.

Correctness bugs (deploy-blocking):
- signoz scripts (deploy/backup/restore + ansible/deploy.yml.jinja): use
  `replace('-', '_')` to match `main.tf.jinja` -> `templatefile` -> cloud-init
  path/volume normalization. Previously the no-separator form (`replace('-', '')`)
  caused /opt/signozobservability vs /opt/signoz_observability divergence and
  silent volume-name mismatches in backup/restore.
- cloud-init OTel gateway DSN: switch `password=$${CLICKHOUSE_PASSWORD}` to
  `password=$${env:CLICKHOUSE_PASSWORD}` so the OTel collector actually expands
  the env var (bare `${VAR}` is not interpolated by the collector).
- README.md.jinja: fix broken code fence at line 92 ("```nished Phase 1...").
- README.md.jinja: replace undefined `monitoring_*_threshold` references with
  the actual copier-defined names (`cpu_alert_threshold`, etc.).
- monitoring.tf.jinja: hardcoded `value = 8` for load-5 alert is now driven
  by a new `load_alert_threshold` copier variable (default 8 = 2x default vCPUs).
- CHANGELOG.md.jinja: drop the stub `[{{ version }}] - {{ date }}` line that
  referenced undefined copier variables.
- anythingllm/ansible/configure-allm.yml: rewrite to use plain Ansible
  templating (`{{ var }}`) instead of broken copier-style escapes
  (`{{ '{{' }} var {{ '}}' }}`). Wrap Docker Go-template (`{{.State.Running}}`)
  in `{% raw %}...{% endraw %}` so Ansible passes it through. Add
  `no_log: true` to tasks that handle merged MCP server configs (which can
  include credentials from `extra_mcp_servers`).
- otel-agent OTLP transport docs: fix gRPC/HTTP mismatch in comments across
  config.yaml, compose.yaml, deploy-otel-fleet.sh, otel-agent/README.md
  (the actual exporter is otlphttp).
- scripts/enable-do-agent.sh: drop unused `ALREADY=0` counter.

Security + IaC fixes:
- terraform backend.tf (signoz + searxng): remove invalid `var.*` references
  in the s3 backend block (backend cannot interpolate variables — init runs
  before vars are evaluated). Add the working DO Spaces pattern from
  keycloak-docker/sites/sso.weown.dev/ (skip_* flags) + a new `init.sh.jinja`
  that bridges terraform.tfvars -> `tofu init -backend-config=`. Also drop
  the unsupported `lock`/`lock_timeout` block. Moves `spaces_*` variable
  declarations to variables.tf.jinja and adds placeholders to
  terraform.tfvars.example.jinja.
- New `ssh_source_cidrs` copier variable (signoz + searxng) drives the SSH
  firewall rule. Default keeps backward-compat `["0.0.0.0/0", "::/0"]` with
  a loud help warning to pin to admin IP/32 or VPN range in production.
- signoz cloud-init Infisical CLI install: switch from the deprecated
  install-cli.sh channel (capped at v0.38 — broken `infisical run`) to the
  current artifacts-cli.infisical.com apt repo, mirroring the idempotent
  block in scripts/bootstrap-otel-agent.sh.
- searxng copier.yaml: pin `searxng_image` to a dated release tag (was
  `:latest` — drift risk on every pull); help text directs operators to
  Docker Hub for newer tags.
- searxng .gitignore: rewrite the `searxng/settings.yml` ignore rule to the
  actual generated path `docker/searxng/settings.yml`.
- restore.sh.jinja: validate `BACKUP_NAME` against a strict allowlist regex
  `^[A-Za-z0-9._-]+$` before forwarding it into the remote ssh `bash -c`
  command, closing the shell-injection vector flagged by both reviewers.
- otel-agent deploy paths (deploy-otel-fleet.sh + otel-agent/deploy.yml):
  reject `http://` URLs explicitly in the OTEL_URL normalization case; only
  `https://` is accepted (config.yaml sets `insecure: false`).
- otel-agent/README.md: add a "Threat model — host-level access by design"
  section that explicitly documents the root+docker.sock+host-root-mount
  topology, what `:ro` mitigates vs doesn't, and the mitigations the design
  relies on.

Operational quality:
- copier.yaml (signoz + searxng): remove `s3` from the
  `backup_remote_storage` choices — the backup/restore scripts only
  implement `do-spaces`, so picking `s3` was a silent no-op.
- copier.yaml + variables.tf.jinja + README.md.jinja (signoz): reword
  `clickhouse_retention_days` description to mark it as informational
  (actual TTL is set via SigNoz UI), since the value was never wired into
  any ClickHouse config.
- compose.prod.yaml.jinja + cloud-init.yaml.jinja (signoz + searxng): add
  `/var/log/caddy:/var/log/caddy` bind mount so Caddy access logs survive
  container recreation and the otel-agent filelog/caddy receiver can pick
  them up from the host.

Repo hygiene:
- Add `.claude/` to top-level `.gitignore` (Claude Code local workspace).

Deferred (require design decisions; tracked for follow-up):
- Infisical Machine Identity persistence in user_data / TF state. Documented
  as known risk in CHANGELOG; full remediation requires architectural change
  (cloud-init bootcmd + tmpfs, systemd credentials, or external bootstrap
  delivery). Rotate the Machine Identity after destroy.
- `lifecycle { ignore_changes = [user_data] }` masks user_data updates
  (intentional, to avoid droplet recreation on every var bump). Future:
  document `tofu taint` workflow in signoz-docker README.
- Docker-socket-proxy in front of otel-agent (significant scope, follow-up PR).
- The legacy `install-cli.sh` Infisical install pattern in `anythingllm-docker/`
  and `keycloak-docker/` cloud-init.yaml.jinja (out of this PR's scope; same
  fix applies — separate PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr26): address Copilot round-3 + own review pass

Resolves the 22 new Copilot comments on merge commit 59a3d0a plus
additional issues found in an own-review pass over the diff.

New Copilot findings (all addressed):
- signoz-docker/template/README.md.jinja: quick-start used `tofu init`
  directly, but the S3 backend requires `-backend-config` flags via the
  new `./init.sh`. Rewrote step 2 to use the init script + documented
  why (variables not allowed in backend blocks).
- signoz-docker/template/README.md.jinja: step 3 referenced the old
  positional-arg `deploy-otel-fleet.sh weown-ai <ip>` syntax which was
  removed in the SigNoz Cloud pivot. Rewrote to use the new flag-based
  selectors and added context that fleet OTel agents ship to SigNoz
  Cloud by default (not to this self-hosted gateway).
- signoz-docker/template/scripts/{deploy,restore}.sh.jinja: replaced
  `143.198.xxx.xxx` example IPs (real DO range) with RFC 5737 documentation
  ranges (`198.51.100.42`) so the public repo does not normalize references
  to real infrastructure.
- signoz-docker/template/terraform/templates/cloud-init.yaml.jinja:
  `docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'` would
  have been interpreted by Copier's Jinja pass as attribute lookups on the
  root context (likely failing rendering). Wrapped in `{% raw %}...{% endraw %}`.
- signoz-docker/template/terraform/templates/cloud-init.yaml.jinja:
  retention echo claimed "daily 30d / monthly 12mo / yearly forever" but
  the implementation only deletes local backups older than 30 days.
  Rewrote the message to match the actual behavior.
- scripts/bootstrap-otel-agent.sh: Infisical Machine Identity secrets were
  passed inline as env-var assignments in the ssh command string, which
  exposes them in the local `ps` listing during the SSH connection.
  Rewrote to pipe `export VAR=...` (via `printf %q` for safe quoting)
  through ssh stdin (process substitution) so the secrets never appear
  in argv.
- otel-agent/README.md: my prior "Threat model" section claimed that
  `:ro` on `/var/run/docker.sock` prevents Docker API writes. That is
  incorrect — `:ro` is a bind-mount flag affecting the socket inode,
  NOT the Docker API protocol. Any process that can `connect(2)` to the
  socket can issue write API calls (containers/create, exec, --privileged,
  etc.) regardless of mount mode. Rewrote the section to call out the
  socket access as effectively root-on-host, list what we actually rely
  on (image pinning, memory cap, no host-side secrets at rest), and
  document the docker-socket-proxy mitigation path for future work.
- otel-agent/compose.yaml + otel-agent/README.md "Safety" section:
  brought the "all mounts read-only ⇒ safe" claim in line with the
  corrected threat model — fs mounts are tamper-safe, but the docker
  socket isn't, and we say so explicitly.
- signoz-docker/template/docker/compose.prod.yaml.jinja +
  signoz-docker/template/terraform/templates/cloud-init.yaml.jinja:
  added a comment block on the ZooKeeper service explaining that
  `ALLOW_ANONYMOUS_LOGIN: "yes"` is intentional for this deprecated
  self-hosted fallback (port unpublished, only ClickHouse reaches it on
  `signoznet`); enabling SASL would require synchronized ClickHouse
  credentials and a new Infisical secret. Operators adopting this for
  production should enable auth before exposing the stack.

Own-review pass:
- scripts/deploy-otel-fleet.sh: updated the OTEL_URL normalization
  comment to match the actual behavior (https:// or scheme-less; reject
  plain http:// because `tls.insecure: false`). The previous comment
  said "otlphttp requires https:// or http://" which contradicted the
  reject-http logic.

Not addressed (deferred — same call as round 2):
- Infisical Machine Identity persisted in user_data / Terraform state
  (architectural change required; tracked in CHANGELOG).
- Copilot's two "configure-allm.yml jinja escapes" comments at line 33
  are stale — that file was fixed in a prior commit and currently uses
  plain `{{ var }}` form. No further action needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(anythingllm): move configure-allm.yml under anythingllm-docker/

The configure-allm.yml playbook uses `docker exec` + `docker inspect`,
so it is Docker-specific. The `anythingllm/` directory is reserved for
the helm/k8s deployment path; Docker-stack tooling belongs under
`anythingllm-docker/` next to the copier template + cloud-init that
provision the Docker-based instances this playbook configures.

Used `git mv` to preserve file history. The empty `anythingllm/ansible/`
directory has been removed. Updated the CHANGELOG entry to reflect the
new location and the rationale for the move.

No content changes — only relocation. Callers (currently none in-repo)
referencing `anythingllm/ansible/configure-allm.yml` should switch to
`anythingllm-docker/ansible/configure-allm.yml`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Nik <nik.cimino@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: m.shahid <m.shahid538@yahoo.com>
Co-authored-by: Nik Cimino <ncimino@users.noreply.github.com>

* Auto-PR: chore(s004): add baseline copier templates for monitoring and infrastructure scaffolding (#31)

* feat: add observability infrastructure, SearXNG copier template, and fleet tooling

Add SigNoz self-hosted observability (signoz-docker/) and OTel Collector
agent (otel-agent/) for per-container metrics, log aggregation, and
traces across the fleet. Add SearXNG copier template (searxng-docker/)
with full IaC, Infisical integration, skinny backups, and browser search
playbook. Add fleet scripts for DO agent enablement and OTel deployment.
Add AnythingLLM MCP configuration playbook. Fix cloud-init `runcmds:`
typo to correct `runcmd:` across all templates. Fix manage-droplets.sh
SSH key rotation parsing bug. Add project CLAUDE.md for AI guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(s004): sync live caddy stdout configuration and private registry updates

* chore(s004): add baseline copier templates for monitoring and infrastructure scaffolding

* fix(s004-deployment): address Copilot + reviewer findings — make deploy actually work

Resolves 8 new Copilot comments on the merge commit + 12 findings from
an independent reviewer pass on s004-deployment/. The directory was
committed as a skeleton with several deployment-blocking bugs.

CRITICAL — deployment-blocking bugs:

- Volume name mismatch (backup.sh, restore.sh, cloud-init backup script):
  scripts referenced `${PROJECT_NAME}_storage` where `PROJECT_NAME=s004anythingllm`
  (no underscore), so the resolved name was `s004anythingllm_storage`. But
  `docker/compose.prod.yaml` (and cloud-init's embedded compose) explicitly
  name volumes `s004_anythingllm_storage`. Every backup tarred an empty
  anonymous volume; every restore wrote into a phantom volume the app never
  reads. Hardcoded the actual volume names (`s004_anythingllm_storage`,
  `s004_anythingllm_caddy_data`) in all three sites.
- Empty Infisical credentials in cloud-init: the `infisical-auth.sh` helper,
  daily cron wrapper, and runcmd `infisical run` all had literal
  `INFISICAL_CLIENT_ID=''` / `--projectId=`. terraform/main.tf already plumbs
  `infisical_client_id`, `infisical_client_secret`, `infisical_project_id`,
  `infisical_environment` into `templatefile()`, so just reference them as
  `${infisical_*}`.
- Empty `--projectId=` in scripts/deploy.sh + scripts/restore.sh: workstation
  scripts now read INFISICAL_PROJECT_ID (and optional INFISICAL_ENV, default
  `prod`) from the environment with a `:?` guard that prints a clear error.
- Hardcoded `s004.ccc.bot` domain in cloud-init compose env + Caddyfile:
  switched to `${domain}` so var.domain takes effect.
- Image registry inconsistency: cloud-init pulled `mintplexlabs/anythingllm:latest`,
  compose used `reg.mini.dev/anythingllm:latest`. Both now use
  `${anythingllm_image}` from var.anythingllm_image
  (default `reg.mini.dev/anythingllm:1.7.2`, the WeOwn mirror).
- 23 unescaped shell variables in cloud-init `templatefile()` body
  (`${JWT_SECRET}`, `${ADMIN_EMAIL}`, `${BACKUP_NAME}`, `${SPACES_BUCKET}`,
  `${BASH_REMATCH[1]}`, etc.) would have made `tofu plan` error before
  producing any output. Escaped every uppercase shell var as `$${VAR}`.

HIGH — lessons from PR #26 applied here:

- SSH 0.0.0.0/0 firewall rule → introduced `var.ssh_source_cidrs`
  (default `["0.0.0.0/0", "::/0"]` with help text directing production
  to pin to admin/VPN CIDR). Same fix as signoz-docker.
- Deprecated Infisical install channel (`infisical.com/install-cli.sh`,
  capped at v0.38 with broken `infisical run` session handling) →
  rewrote to use the current `artifacts-cli.infisical.com` apt repo with
  the same idempotent legacy-purge logic from `bootstrap-otel-agent.sh`.
- Caddy access logs written to `/var/log/caddy/` but no host bind mount
  → added `- /var/log/caddy:/var/log/caddy` to both `docker/compose.prod.yaml`
  and the cloud-init embedded compose, so otel-agent's filelog/caddy
  receiver can read them.
- `restore.sh` shell-injection via unvalidated `$BACKUP_NAME` →
  `[[ ! "$BACKUP_NAME" =~ ^[A-Za-z0-9._-]+$ ]]` guard before the heredoc.
- Real DigitalOcean IP `143.198.xxx.xxx` in usage examples →
  RFC 5737 `198.51.100.42`.

MEDIUM — code quality:

- `terraform.tfvars.example` had unrendered Copier jinja stubs
  (`{{ project_name }}`, `{{ enable_skinny_backups | lower }}`, etc.) —
  operators copying the file to terraform.tfvars would get literal jinja
  that terraform rejects. Replaced every stub with a real example value
  matching the variable's default in variables.tf.
- `.gitignore` ignored `.terraform.lock.hcl`, breaking provider
  reproducibility. Removed the ignore + added an explanatory comment.
- `versions.tf` header still said `# {{ project_name }}` → `s004-anythingllm`.
- `monitoring.tf` declared three alert resources that always created
  regardless of `var.enable_monitoring`. Added `count = var.enable_monitoring ? 1 : 0`.
- `scripts/restore.sh` had dead REMOTE/BACKUP_NAME assignments at lines
  20-21, overwritten by the arg-count block. Removed.

Not addressed (would expand scope):

- Infisical Machine Identity rendered into user_data → terraform state.
  Same trade-off as signoz-docker; needs a remote encrypted backend +
  ideally one-time-use bootstrap secret retrieval.
- `lifecycle { ignore_changes = [user_data] }` means cloud-init fixes
  require `tofu taint` to deploy. Documented behavior, not changed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(infra): adopt Path C + Layer 2 in s004-deployment; document migration

Resolves the two architectural issues deferred from PR #26 / PR #31:
  1. Infisical Machine Identity persisted in terraform state / DO metadata
     (Layer 2 bootstrap-secret rotation).
  2. `lifecycle { ignore_changes = [user_data] }` silently swallowing
     cloud-init updates (Path C: thin cloud-init + ansible app layer).

Reference implementation lives in s004-deployment/. Repo-wide pattern and
per-project migration checklist live in docs/INFRA_BOOTSTRAP_PATTERN.md.

Layer 2 — bootstrap-secret rotation (s004-deployment/):

- New rotate-bootstrap-secret.sh embedded in cloud-init `write_files`. Runs
  once via runcmd at first boot. Flow:
    1. Log in to Infisical with v1 (from terraform.tfvars → state → cloud-init).
    2. Decode JWT to extract identityId (base64url-safe, handles padding).
    3. POST /api/v1/auth/universal-auth/identities/{id}/client-secrets → v2.
    4. Atomically swap .infisical-auth.env to v2 after verifying it auths.
    5. POST .../client-secrets/{v1Id}/revoke to disable v1.
    6. Touch .rotation-complete (idempotent on re-run).
  Net effect: v1 in terraform state + DO droplet metadata is revoked within
  minutes of provisioning. v2 only ever lives on the droplet filesystem.
  Best-effort with structured logging — if the Machine Identity lacks
  self-management permission, the script logs cleanly to
  /var/log/s004anythingllm-rotation.log and the operator follows a manual
  runbook in s004-deployment/README.md.

Path C — thin cloud-init + ansible (s004-deployment/):

- Slimmed terraform/templates/cloud-init.yaml. It now handles ONLY first-boot
  bootstrap: package + Docker install, Infisical CLI install (artifacts-cli
  apt repo), .infisical-auth.env write, Layer 2 rotation, .bootstrap-complete
  marker, unattended-upgrades. Removed: compose.yaml, Caddyfile, backup.sh,
  daily-backup cron, infisical helper, docker pulls, docker compose up.
- New ansible/deploy.yml owns the app layer. Uploads compose + Caddyfile +
  backup.sh, renders the daily backup cron, pulls images, runs `docker
  compose up -d --remove-orphans`, waits for AnythingLLM health. Pre-tasks
  assert INFISICAL_PROJECT_ID is set and .bootstrap-complete exists.
- scripts/deploy.sh rewritten as a thin wrapper around `ansible-playbook
  ansible/deploy.yml -i 'host,'`. Idempotent — re-runnable any time
  compose/Caddy/scripts change, without touching terraform.

Layer 1 — DO Spaces remote state backend (already standard elsewhere):

- Added terraform/backend.tf + terraform/init.sh + spaces_* vars. State is no
  longer a plain local JSON file. Same pattern as
  keycloak-docker/sites/sso.weown.dev/ and signoz-docker (PR #26). Required
  for Layer 2 to actually reduce risk — without it the v1 in state lives on
  the operator's workstation indefinitely.

Documentation + migration plan:

- New docs/INFRA_BOOTSTRAP_PATTERN.md (cross-cutting). Explains both
  problems, both solutions, failure modes, compliance mapping (NIST CSF
  2.0, CIS Controls v8, ISO/IEC 27001:2022), per-project migration
  checklist for signoz-docker / searxng-docker / anythingllm-docker /
  keycloak-docker / wordpress-docker, a "what Layer 2 does NOT solve"
  section, and a future-hardening section (Vault wrapping, cloud-IAM
  mediation, systemd-credentials + TPM).
- Each *-docker README has a MIGRATION PENDING / MIGRATION PARTIAL banner
  pointing at the pattern doc + s004-deployment reference impl. Created
  searxng-docker/README.md (was missing).
- s004-deployment/README.md: rewrote Quick Start to reflect Path C
  workflow, added "Updating the deployment" matrix, added manual rotation
  runbook.
- s004-deployment/CHANGELOG.md: Unreleased entry.

Migration sequencing recommendation (in INFRA_BOOTSTRAP_PATTERN.md): each
*-docker project gets its own focused follow-up PR, order by deployment
criticality (anythingllm → wordpress → keycloak → signoz → searxng).
Each is ~2-4h plus an operator-scheduled tofu taint + tofu apply against
existing droplets. Not changed in this commit: the *-docker template
implementations themselves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(infra): per-project Migration Status sections + audited status table

Replaces the one-line "MIGRATION PENDING" banners with a per-project
Migration Status section in each *-docker/README.md. Each section is a
table showing the project's current Layer 1 / Layer 2 / Path C / Infisical
CLI state with citations to the files that need work, plus project-specific
notes. The shared 6-step migration checklist + rationale stays in
docs/INFRA_BOOTSTRAP_PATTERN.md (single source of truth).

Audit findings (2026-05-25):

- anythingllm-docker: Layer 1 missing, Layer 2 missing, Path C not
  adopted (no ansible playbook exists), legacy Infisical install.
  **Highest priority** — source of live deployments.
- wordpress-docker: Layer 1 missing, Layer 2 missing, Path C partial
  (ansible exists with compose + Caddy + Wordfence upload, cloud-init
  also embeds the app layer), legacy Infisical install.
- keycloak-docker: Layer 1 partial (template has backend.tf.jinja but
  no init.sh.jinja; the rendered sites/sso.weown.dev/ subdir has both),
  Layer 2 missing, Path C partial (site.yml.jinja with
  community.docker.docker_compose_v2 — most ansible-shaped of any
  template, but cloud-init still embeds app layer), legacy Infisical
  install.
- signoz-docker: Layer 1 done (PR #26), Layer 2 missing, Path C partial
  (ansible exists, cloud-init also embeds), legacy Infisical install
  (with open ZooKeeper anon-login accepted-risk note).
- searxng-docker: Layer 1 done (PR #26), Layer 2 missing, Path C partial,
  legacy Infisical install.

Updated docs/INFRA_BOOTSTRAP_PATTERN.md status table to reflect the audit
(was less precise — said "Pending" without distinguishing Layer 1 vs Path C
state). The migration sequencing recommendation in the doc unchanged:
anythingllm → wordpress → keycloak → signoz → searxng.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(infra): migrate anythingllm-docker to Path C + Layer 2; relocate s004; add auto DO tagging

Big bundle: completes Option B (migrate the anythingllm-docker template to
the canonical bootstrap pattern, then relocate s004 as a generated site of
that template), AND adds the automatic DO droplet tagging system across
feature scripts + ansible.

Template migration (anythingllm-docker/template/):

- terraform/backend.tf.jinja + terraform/init.sh.jinja added (Layer 1 — DO
  Spaces remote state with SSE-C, same pattern as signoz-docker).
- terraform/templates/cloud-init.yaml.jinja replaced with the slim Path C
  version + embedded rotate-bootstrap-secret.sh (Layer 2). Cloud-init now
  handles only: package install, Docker install, Infisical CLI install
  (artifacts-cli apt repo — current channel), .infisical-auth.env write,
  bootstrap-secret rotation, .bootstrap-complete marker, unattended-upgrades.
  Compose, Caddy, backup script, cron all removed — they live in ansible now.
- terraform/main.tf.jinja: `replace('-', '_')` (was `replace('-', '')` —
  caused the volume-name mismatch we hit in s004), SSH firewall rule uses
  `var.ssh_source_cidrs`, `lifecycle { ignore_changes = [user_data, tags] }`
  so runtime tag mutations stick across `tofu apply` cycles.
- terraform/variables.tf.jinja: adds ssh_source_cidrs + spaces_* trio.
- terraform/terraform.tfvars.example: adds the new vars with placeholders.
- terraform/monitoring.tf.jinja: `count = var.enable_monitoring ? 1 : 0`
  on each alert so the variable actually gates resources.
- docker/compose.prod.yaml.jinja: image pinned via copier vars, Caddy
  /var/log/caddy bind mount so otel-agent's filelog/caddy receiver can read
  access logs.
- docker/Caddyfile.jinja: domain substitution via copier var.
- scripts/deploy.sh.jinja: thin ansible-playbook wrapper. Requires
  INFISICAL_PROJECT_ID env var.
- scripts/backup.sh.jinja: hardcoded volume names use
  `{{ project_name | replace('-', '_') }}_storage` etc. Wraps the
  `docker ps --format '{{ .Names }}'` lines in {% raw %}{% endraw %} so
  copier doesn't try to interpret the Go template syntax.
- scripts/restore.sh.jinja: same volume fix + BACKUP_NAME regex validation
  before the SSH heredoc + INFISICAL_PROJECT_ID env requirement.
- ansible/deploy.yml.jinja added (Path C app layer): uploads compose +
  Caddyfile + backup.sh, renders daily cron + logrotate, pulls images,
  reconciles compose stack, waits for health, then updates DO tags
  (skinny-backup + commit-<sha>) via the shared helper.
- copier.yaml: adds ssh_source_cidrs variable (json type).
- README.md.jinja: rewrote Quick Start for Path C workflow.
- .gitignore: stop ignoring .terraform.lock.hcl.

s004-deployment/ → anythingllm-docker/sites/s004/:

- `git mv` preserves history per file. Then overlaid the freshly-rendered
  template output (via `copier copy --data-file s004-answers.yaml`) so the
  site reflects exactly what the migrated template produces today, with
  underscore-consistent paths (/opt/s004_anythingllm/, s004_anythingllm_*
  volumes) that were previously inconsistent.
- All cross-references in INFRA_BOOTSTRAP_PATTERN.md + the 5 *-docker
  READMEs updated from s004-deployment/... → anythingllm-docker/sites/s004/...

Auto DO tagging (new):

- scripts/tag-droplet.sh — shared helper for tag mutation. Subcommands:
  add, remove, replace-prefix, set-commit, list. Chainable in a single call.
  Resolves droplet name → ID via doctl, mutates tags via
  `doctl compute droplet-action tag/untag --wait`. Idempotent.
- scripts/bootstrap-otel-agent.sh: after each successful bootstrap, looks
  up the droplet name from the target IP and adds the `otel` tag.
- scripts/deploy-otel-fleet.sh: same — adds `otel` after each successful
  deploy.
- anythingllm-docker/ansible/configure-allm.yml: after MCP config changes
  successfully apply, looks up the droplet from inventory_hostname and
  adds the `searxng-mcp` tag.
- anythingllm-docker/template/ansible/deploy.yml.jinja: on every deploy,
  invokes tag-droplet.sh with `replace-prefix commit- commit-<sha> add
  skinny-backup`. Restoring "the last working deployment" becomes read the
  tag, `git checkout <sha>`, re-run deploy.sh.
- terraform/main.tf.jinja `ignore_changes = [tags]` so feature/state tags
  added at runtime aren't reverted by subsequent `tofu apply` runs.

Documentation (docs/INFRA_BOOTSTRAP_PATTERN.md):

- New "DO tag taxonomy" section documenting the three layers: project tags
  (terraform-set), feature tags (script-driven — otel, skinny-backup,
  searxng-mcp), state tags (commit-<sha>, replaced each deploy).
- Project migration status table updated: anythingllm-docker promoted to
  reference implementation; wordpress now priority 1 follow-up.

Verified the migrated template renders cleanly via `copier copy` — produces
the 17 files that match anythingllm-docker/sites/s004/ in structure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(anythingllm-docker): Copilot round-4 findings on PR #31 ab65db6

Fixes 7 real findings from Copilot's review of the migration commit (other
~15 were stale — looking at pre-overlay file state or pointed at the
deferred Layer 2 limitation already documented). Each fix applied to the
TEMPLATE; `sites/s004/` then re-rendered via `copier copy` to keep both
in sync.

Real fixes:

- `template/docker/Caddyfile.jinja`: was `output stdout` while the compose
  service bind-mounts `/var/log/caddy:/var/log/caddy` — the mount was
  useless. Switched to
  `output file /var/log/caddy/{{ project_name }}.log` so the file lands
  on the host where otel-agent's `filelog/caddy` receiver can read it
  (and survives container recreation).
- `template/README.md.jinja`: backup-storage and migration-procedure
  examples used `/opt/{{ project_name }}/backups/` (hyphenated, e.g.
  `/opt/s004-anythingllm/backups/`) but the actual deployment path is
  `/opt/{{ project_name | replace('-', '_') }}/backups/` (underscore).
  Updated to underscore form everywhere paths appear.
- `template/README.md.jinja`: restore example was
  `anythingllm-ai_backup_20260115_120000` (stale placeholder from the
  pre-migration template). Updated to `{{ project_name }}_backup_…`.
- `template/scripts/backup.sh.jinja`: remote-mode ran
  `ssh "$host" "$BACKUP_CMDS"` directly — `SPACES_ACCESS_KEY` /
  `SPACES_SECRET_KEY` were NOT in the inner shell's env, so the
  `aws s3 cp` upload step would silently no-op. Wrapped the remote bash
  in `infisical run --projectId=… --env=…` (sources
  `/opt/<project>/.infisical-auth.env` the same way restore.sh does).
  Added `INFISICAL_PROJECT_ID:?` guard so remote mode fails fast.
- `template/terraform/templates/cloud-init.yaml.jinja`: re-added `awscli`
  to the packages list. The slim cloud-init had dropped it during the
  Path C refactor, but the daily backup cron uses `aws s3 cp` so without
  awscli the first cron tick fails.
- `template/scripts/deploy.sh.jinja`: `ansible-galaxy collection install
  community.docker` was unpinned — operators on different workstations
  could end up with different versions. Pinned to `==3.13.0` with a
  precise version check.

Re-rendered `anythingllm-docker/sites/s004/` via `copier copy
anythingllm-docker/ <tmp> --data-file s004-answers.yaml --defaults
--trust`, then overlaid the rendered output onto sites/s004/.

Stale/not-fixed comments (acknowledged):
- Several "volume name mismatch" comments were on the OLD code — current
  scripts use `s004_anythingllm_storage` matching compose.
- `terraform/main.tf` "Infisical secret in user_data → tfstate" is the
  deferred Layer 2 limitation (rotation runs at first boot; see
  docs/INFRA_BOOTSTRAP_PATTERN.md "What Layer 2 does NOT solve").
- `terraform/versions.tf` "no remote state backend" — backend IS in
  backend.tf (separate file). Terraform merges multiple `terraform {}`
  blocks across files. False positive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(anythingllm-docker): Copilot round-5 — fix HCL blocker + 8 other real issues

Resolves the 9 actual bugs from Copilot's review of `3e312a4` (other ~19
comments triaged as either stale or false-positives — captured below).

CRITICAL — tofu blocker:

- `template/terraform/variables.tf.jinja:58` (and tfvars.example): the
  `ssh_source_cidrs` default was `{{ ssh_source_cidrs }}`, which Copier
  renders as Python's list-repr `['0.0.0.0/0', '::/0']` (single quotes) —
  NOT valid HCL. `tofu plan` would error on "Invalid character". Switched
  to `{{ ssh_source_cidrs | tojson }}` so Copier emits the JSON form with
  double quotes (which HCL parses as a list).

Files-not-being-rendered-by-Copier blocker:

- `_templates_suffix` defaults to `.jinja` in Copier, so files without
  that suffix in template/ were copied as-is — including their `{{ ... }}`
  template syntax. The rendered s004 site had unrendered placeholders in
  `.gitignore`, `terraform/versions.tf`, and `terraform/terraform.tfvars.example`.
  Renamed all three to add the `.jinja` suffix.

Real correctness bugs:

- `anythingllm-docker/ansible/configure-allm.yml`: the tag step did
  `ip="{{ inventory_hostname }}"` then looked up the droplet by PublicIPv4.
  With the documented `-i 'root@<ip>,'` inventory pattern,
  `inventory_hostname` is `root@<ip>`, not `<ip>`, so the doctl lookup
  always failed silently. Added `ip="${target##*@}"` to strip the prefix.
- `template/scripts/backup.sh.jinja` remote-mode: was
  `bash -c '$BACKUP_CMDS'` where BACKUP_CMDS contains the literal
  `'table {{.Names}}...'` directives. Outer single quotes break on the
  inner single quotes. Restructured to invoke the droplet's own
  `/opt/<project>/backup.sh` (uploaded by ansible) inside `infisical run`.
- `template/scripts/restore.sh.jinja` remote-mode: same quoting issue +
  sourced `/opt/<project>/infisical-auth.sh` (a script that doesn't
  exist; cloud-init writes `.infisical-auth.env`). Restructured the same
  way: source `.infisical-auth.env`, run `infisical login`, then exec
  `/opt/<project>/restore.sh "$BACKUP_NAME"` under `infisical run`.
- `template/terraform/main.tf.jinja:36` tag-comment listed
  `anythingllm/ansible/configure-allm.yml` (no such file); the playbook
  lives at `anythingllm-docker/ansible/configure-allm.yml`. Fixed.
- `template/docker/compose.prod.yaml.jinja`: `LLM_PROVIDER` and `VECTOR_DB`
  were hardcoded to "openrouter"/"lancedb" while the template still
  prompts for `llm_provider` and `vector_db` Copier vars (and passes them
  through Terraform). Those Copier knobs were dead. Replaced with
  `{{ llm_provider }}` / `{{ vector_db }}`.

Stale/false-positive comments (verified, not fixed):

- Caddyfile/compose hardcoded `s004.ccc.bot` — correct for the SITE.
- main.tf:25 "Infisical secret in user_data" — deferred Layer 2.
- versions.tf:6 "no remote state backend" — false positive; backend IS
  in backend.tf (separate file).
- CHANGELOG anythingllm_image — already `reg.mini.dev/anythingllm:1.7.2`.
- cloud-init jq `$$V2` — terraform escape, becomes `$V2` post-templatefile
  inside single-quoted jq script (jq's `--arg V2` definition).
- Volume name mismatch comments — scripts already use the explicit
  `s004_anythingllm_storage` matching compose `name:` declarations.

Re-rendered sites/s004/ from the updated template via copier; all 9 fixes
verified in the rendered output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr31): independent-review findings — restore auto-tagging + harden rotation script

Final pre-merge pass from an independent code-review agent caught one HIGH
bug that broke the auto-tagging feature this PR was meant to add, plus
three LOW hardenings. Applied to the template and re-rendered sites/s004/.

HIGH — auto-tagging was silently dead-on-arrival:

  - `template/ansible/deploy.yml.jinja:202` used
    `playbook_dir | dirname | dirname | dirname` to find scripts/tag-droplet.sh.
    But the layout is:
      <repo-root>/anythingllm-docker/sites/<name>/ansible/deploy.yml
    So three dirname operations land at <repo-root>/anythingllm-docker/ —
    not the repo root. With `failed_when: false`, the broken tagging step
    never failed, so `skinny-backup` and `commit-<sha>` tags silently
    never appeared on droplets. Added one more `| dirname` to walk to the
    repo root. The sibling task in anythingllm-docker/ansible/configure-allm.yml
    is one level shallower and was already correct (dirname × 2).

LOW — rotation-log hardening:

  - cloud-init.yaml.jinja was logging the full Infisical API response on
    v2-mint failure. Replaced with `jq -r '.message // .error // .'`
    (head -c 500) so any future API change that echoes request fragments
    doesn't persist the bearer token.
  - Pre-chmod the rotation log to 0600 BEFORE writing so it's never
    world-readable even on failure paths.

LOW — config knob clarity:

  - `INFISICAL_HOST="$${INFISICAL_HOST:-https://app.infisical.com}"`
    looked overridable, but cloud-init runcmd: has a minimal env so the
    override couldn't actually be set externally. Dropped the indirection
    and inlined the SaaS URL.

LOW — tag-droplet.sh duplicate-name safety:

  - Previously did `awk … {print $2; exit}` — first-match-wins on
    duplicate droplet names. Now counts matches and errors out cleanly
    if 0 or >1.

Reviewer findings explicitly NOT fixed (purely cosmetic):
  - U+21BA arrow in usage text (Finding 6)
  - Per-project README status drift risk (Finding 5)

Re-rendered sites/s004/ via copier; HIGH fix verified in rendered
ansible/deploy.yml, LOW fixes verified in
sites/s004/terraform/templates/cloud-init.yaml.

This is the merge-ready state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr31): Copilot round-6 — 3 real bugs + 1 cosmetic

Triaged 25 Copilot comments on `0993c43`. Verified ~20 as stale (Copilot
re-flagged from older state — current code is correct). 4 actionable:

REAL (would break first deploy):

- `template/ansible/deploy.yml.jinja:156` — was using
  `community.docker.docker_image`, which requires the python3-docker
  Python SDK on the target host. The slim cloud-init only installs
  Docker itself, not the SDK. Switched to
  `community.docker.docker_image_pull` (CLI-based — no SDK needed).
  Available since community.docker 3.4; we pin 3.13.0.

REAL (silently-broken config knob):

- `template/scripts/backup.sh.jinja` + `restore.sh.jinja` — were hardcoding
  `REMOTE_STORAGE="do-spaces"` while `copier.yaml` offers
…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants