Merged
@@ -82,3 +82,7 @@ kubectl -n cpaas-system exec alertmanager-kube-prometheus-0 -c alertmanager -- \
```

A `SUCCESS` line with no preceding `parse.go:176 WARN` indicates the configuration is fully compliant with the UTF-8 matchers parser; a `SUCCESS` line preceded by one or more `parse.go:176 WARN` lines means the configuration is currently accepted only because of the classic-parser fallback and must be updated before the next Alertmanager upgrade.
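
The two outcomes can be told apart mechanically. The following is a minimal sketch, assuming the check output has been captured as text and that the `SUCCESS` and `parse.go:176` markers appear verbatim as described above. The helper name `classify_check_output` is hypothetical and is not part of Alertmanager or its tooling.

```bash
# Hypothetical helper: classify captured config-check output read from stdin.
# Assumes the literal markers "SUCCESS" and "parse.go:176" described above.
classify_check_output() {
  output="$(cat)"
  if ! printf '%s\n' "$output" | grep -q 'SUCCESS'; then
    # No SUCCESS line at all: the configuration was rejected.
    echo "failed"
  elif printf '%s\n' "$output" | grep -q 'parse\.go:176'; then
    # Accepted, but only through the classic-parser fallback.
    echo "fallback"
  else
    # Accepted with no fallback warning.
    echo "compliant"
  fi
}
```

Pipe the combined stdout and stderr of the check command into the function. A result of `fallback` means the matchers should be updated before the next Alertmanager upgrade.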

## See Also

- `Custom matchers with whitespace trigger Alertmanager UTF-8 parser fallback warning` — same root cause, walks through the unquoted-whitespace example explicitly.
@@ -77,3 +77,7 @@ kubectl -n cpaas-system exec alertmanager-kube-prometheus-0 -c alertmanager -- \
```

A `SUCCESS` line with no preceding `parse.go:176 WARN` indicates the configuration is fully compliant with the UTF-8 matchers parser; a `SUCCESS` line preceded by one or more `parse.go:176 WARN` lines means the configuration is currently accepted only because of the classic-parser fallback and must be updated before the next Alertmanager upgrade.

## See Also

- `Alertmanager 0.27+ UTF-8 matchers parser warning for custom routing rules` — same root cause, broader walk-through of UTF-8-parser back-compat changes.
@@ -14,7 +14,7 @@ id: KB260500817

CI/CD tooling such as Tekton needs persistent storage that survives pod restarts so successive tasks in a pipeline can hand workspace data to each other. The conventional pattern is dynamic provisioning: a Pod or PipelineRun references a PersistentVolumeClaim, the PVC names a StorageClass, and an external provisioner watches the PVC and creates a PersistentVolume on demand. Many on-prem deployments back this with NFS so a single export can serve `ReadWriteMany` pipeline workspaces and per-PVC subdirectories.

On Alauda Container Platform (Kubernetes `v1.34.5`, cluster `glean-lab-base-0529`), the upstream `nfs-subdir-external-provisioner` Helm chart is not part of the artifacts catalog and there is no first-party packaging for it. The cluster ships a different NFS dynamic-provisioning driver instead: the `nfs` ModulePlugin (`chart-csi-driver-nfs`, default channel `v4.4.0-beta.7`, repository `acp/chart-csi-driver-nfs`), which installs the upstream `kubernetes-csi/csi-driver-nfs` CSI driver. The driver registers under `nfs.csi.k8s.io` and plays the same dynamic-provisioning role through a CSI plugin instead of the sig-storage-lib external-provisioner pod that the article-style chart uses.
On Alauda Container Platform (Kubernetes `v1.34.5`), the upstream `nfs-subdir-external-provisioner` Helm chart is not part of the artifacts catalog and there is no first-party packaging for it. The cluster ships a different NFS dynamic-provisioning driver instead: the `nfs` ModulePlugin (`chart-csi-driver-nfs`, default channel `v4.4.0-beta.7`, repository `acp/chart-csi-driver-nfs`), which installs the upstream `kubernetes-csi/csi-driver-nfs` CSI driver. The driver registers under `nfs.csi.k8s.io` and plays the same dynamic-provisioning role through a CSI plugin instead of the sig-storage-lib external-provisioner pod that the article-style chart uses.

## Resolution

@@ -105,7 +105,7 @@ spec:
claimName: pipeline-shared
```

Under the NFS CSI driver the PVC can request `ReadWriteMany`, so tasks scheduled on different nodes can mount the same workspace concurrently. Backing the same workspace with the default `topolvm-hdd` SC works too, but topolvm is a local-volume provisioner and only allows `ReadWriteOnce`; for sequential tasks Tekton handles that by scheduling an affinity-assistant StatefulSet to colocate the pods. This pattern was verified on `glean-lab-base-0529`: a two-task pipeline (`write` then `read`) sharing a `topolvm-hdd`-backed PVC workspace ran to `SUCCEEDED`, with the `read` task printing the file the `write` task wrote.
Under the NFS CSI driver the PVC can request `ReadWriteMany`, so tasks scheduled on different nodes can mount the same workspace concurrently. Backing the same workspace with the default `topolvm-hdd` SC works too, but topolvm is a local-volume provisioner and only allows `ReadWriteOnce`; for sequential tasks Tekton handles that by scheduling an affinity-assistant StatefulSet to colocate the pods. The fallback was verified on a stock ACP cluster: a two-task pipeline (`write` then `read`) sharing a `topolvm-hdd`-backed PVC workspace ran to `SUCCEEDED`, with the `read` task printing the file the `write` task wrote.

## Diagnostic Steps

@@ -63,6 +63,13 @@ kubectl get pods -A -o json \

Restore VM connectivity by removing the workload that loaded `br_netfilter`, unloading the module on the worker, and verifying the bridge sysctls are no longer active. Each step is per-worker, because `br_netfilter` is a per-node kernel state — repeat the procedure on every worker that hosts a KubeVirt VM with a Linux Bridge `NetworkAttachmentDefinition` attachment.

**Disruptive node-level change — read before running.** Unloading `br_netfilter` and clearing the `bridge-nf-call-*` sysctls is a change to the worker's kernel networking layer. Plan it like a maintenance operation:

- Schedule a maintenance window for the worker; do not run during peak traffic.
- Confirm no platform component on the cluster depends on `br_netfilter` being present. The default ACP CNI is kube-ovn (no iptables-bridge dependency on the host), but third-party agents (security/observability sidecars, on-host iptables rules that match bridged traffic) installed by the customer may rely on the module. Audit any host-level workloads on the affected node before unloading.
- Capture a rollback path: note the current value of the three `bridge-nf-call-iptables/ip6tables/arptables` sysctls and the loaded-modules list before changing anything, so you can restore the prior state with `modprobe br_netfilter` plus `sysctl -w net.bridge.bridge-nf-call-*=<prior-value>` if the workload regresses.
- This procedure restores the running kernel state only. After a node reboot the module is not reloaded unless the privileged workload runs again. If the workload that originally loaded `br_netfilter` is reconciled by a controller (DaemonSet, operator), the module will be loaded again on the next pod creation — the durable fix is preventing the workload from running on the affected node (taint/`nodeSelector` exclusion) or removing the privileged `modprobe` call from the workload's startup, not just unloading the module once.
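
The rollback capture in the list above can be sketched as a small shell helper. This is an illustration under stated assumptions: the function name `snapshot_bridge_sysctls` is hypothetical, it reads the three sysctls from `/proc/sys/net/bridge/` (which exists only while `br_netfilter` is loaded), and the optional first argument exists only so the function can be pointed at a different root for testing.

```bash
# Hypothetical helper: print the current bridge-nf-call-* values as key=value
# lines. Run on the worker itself; $1 overrides the /proc/sys root (assumption,
# for testing only).
snapshot_bridge_sysctls() {
  root="${1:-/proc/sys}"
  for key in bridge-nf-call-iptables bridge-nf-call-ip6tables bridge-nf-call-arptables; do
    file="$root/net/bridge/$key"
    if [ -r "$file" ]; then
      printf 'net.bridge.%s=%s\n' "$key" "$(cat "$file")"
    else
      # The directory is absent when br_netfilter is not loaded.
      printf '# net.bridge.%s not present\n' "$key"
    fi
  done
}
```

Redirect the output to a file before changing anything, and save `lsmod` output alongside it for the loaded-modules list. Because the non-comment lines are in `key=value` form, `sysctl -p <file>` can reapply them after `modprobe br_netfilter` if a rollback is needed.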

Stop the privileged workload that has the bridge subsystem held — the workload identified in Diagnostic Steps whose `nodeName` is the affected worker and whose container is privileged. The kernel unload step in the next paragraph succeeds only once the holder process exits and the module's reference count drops to 0:

```bash
Expand Down
@@ -60,7 +60,11 @@ If those four names are absent from the merged spec, the webhook did not inject

## Resolution

Add a non-conflicting `volumeMount` path (e.g. `/tmp/.otel-instr-fix`) backed by a new `emptyDir` volume to the user's Deployment. This gives the OTel mutating webhook the headroom it needs to complete the Apache HTTPD instrumentation preparation steps so the merged layout is consistent. After the mitigation, the Apache HTTPD container no longer emits `No such file or directory` on startup and the OTel auto-instrumentation injection succeeds.
The resolution has two parts; do both in order, because step 1 addresses the root cause and step 2 is an empirical mitigation whose mechanism is not fully understood.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify whether both steps are required or conditional.

Line 63 states "do both in order," but line 67 introduces step 2 with "When step 1 does not visibly apply," suggesting step 2 is conditional and only needed if step 1 doesn't resolve the issue. Consider revising line 63 to something like "The resolution has two parts; attempt step 1 first, then apply step 2 if needed" to accurately reflect that step 2 is a fallback rather than always required.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/solutions/OpenTelemetry_Apache_HTTPD_auto_instrumentation_fails_to_inject_when_user_volumes_leave_no_room_for_the_agent.md`
at line 63, Update the ambiguous instruction at line 63 in the document so it
reflects that step 2 is conditional: replace "The resolution has two parts; do
both in order" with a clear instruction such as "Attempt step 1 first; apply
step 2 only if step 1 does not resolve the issue." Also scan and adjust the
subsequent phrasing that starts with "When step 1 does not visibly apply" so it
remains consistent with the new line 63 wording and clearly indicates step 2 is
a fallback rather than mandatory.


**Step 1 — remove the actual conflict (root cause).** Inspect the merged pod spec from the diagnostic above and identify any user `volumeMount` whose `mountPath` overlaps the webhook-injected paths `/opt/opentelemetry-webserver/agent` or `/usr/local/apache2/conf` (subpath mounts under those directories count). Either rename the user mount to a different path, or drop it and provide the same content through a different mechanism (e.g. bake the file into the image, or use a sidecar). With no overlap remaining, the next pod admission cycle lets the webhook lay down its volumes cleanly and the Apache HTTPD container starts without `No such file or directory`.

**Step 2 — extra `emptyDir` workaround (empirical mitigation).** When step 1 does not visibly apply — the merged pod spec shows no overlap with the OTel-injected paths but the injection still fails on this workload — adding a non-conflicting `volumeMount` path (e.g. `/tmp/.otel-instr-fix`) backed by a new `emptyDir` volume has been observed to unblock the webhook in practice. The precise mechanism is not characterised here; treat it as a workaround for cases where the merged pod inspection does not reveal an obvious conflict, and revisit if the upstream OTel operator fixes the underlying admission interaction. After the mitigation, the Apache HTTPD container no longer emits `No such file or directory` on startup and the OTel auto-instrumentation injection succeeds.
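
The overlap test from step 1 can be sketched as a path-prefix check. This is a minimal illustration, not the webhook's own logic: the function name `overlaps_otel_path` is hypothetical, and treating a user mount that is a parent of an injected path as an overlap is an assumption that goes beyond the subpath case the text names.

```bash
# Hypothetical helper: exit 0 if a user mountPath overlaps a webhook-injected
# path, exit 1 otherwise.
overlaps_otel_path() {
  mount_path="${1%/}"   # drop one trailing slash so "/a/b/" compares as "/a/b"
  for injected in /opt/opentelemetry-webserver/agent /usr/local/apache2/conf; do
    case "$mount_path" in
      # Same path, or a subpath mount under the injected directory.
      "$injected"|"$injected"/*) return 0 ;;
    esac
    case "$injected" in
      # Assumption: a parent directory mount also shadows the injected path.
      "$mount_path"/*) return 0 ;;
    esac
  done
  return 1
}
```

For example, `/usr/local/apache2/conf/httpd.conf` is reported as overlapping, while `/usr/local/apache2/conf.d` and `/tmp/.otel-instr-fix` are not. Feed it each `mountPath` from the merged pod spect to decide whether step 1 applies before falling back to step 2.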

First, ensure the `Instrumentation` CR for Apache HTTPD exists in the workload's namespace. The `apacheHttpd` block selects the agent image and the in-container Apache configuration directory:

@@ -69,9 +69,9 @@ After the JSON-patch (or after manually re-submitting), re-run the same `get pip
- The pending-gate mechanism (`spec.status: PipelineRunPending`) and the JSON-patch recovery (`--type=json -p='[{"op":"remove","path":"/spec/status"}]'`) are upstream Tekton and work the same on the ACP operator bundle — `tektoncd-operator.v4.2.0`, `TektonConfig` at `v0.76.0-c46274a` — as they do on a vanilla Tekton install.
- This article covers the generic Pending-PipelineRun recovery. It does not cover Git-event-trigger watchers or per-repository queue managers that submit PipelineRuns with the pending gate on the user's behalf — those components are not part of the default `TektonConfig` install profile on the bundle tested here, and any queue-level "unsticking" behavior they may add belongs to that component's documentation rather than to the generic PipelineRun gate.

## Evidence
## Verification

- ev:c1 — A PipelineRun created with `spec.status: PipelineRunPending` on the ACP DevOps Pipelines operator stays paused: `.spec.status` remains `PipelineRunPending`; `.status.conditions[*].reason=PipelineRunPending`; `.status.conditions[*].message='PipelineRun "pending-test" is pending'`.
- ev:c3 — While the gate is in place, `kubectl get taskrun -n <ns>` returns `No resources found`, and the only events emitted for the PipelineRun are `Started` and `FinalizerUpdate` — no scheduling or pod events.
- ev:c6 — A re-submitted PipelineRun (no `spec.status`) ran to a terminal state and produced a TaskRun; the parallel original PipelineRun with `spec.status=PipelineRunPending` stayed paused with no TaskRun.
- ev:c7 — `kubectl patch pipelinerun pending-test --type=json -p='[{"op":"remove","path":"/spec/status"}]'` cleared the gate; `.spec.status` became empty, `.status.conditions[*].reason` flipped to `Running`, and a TaskRun `pending-test-hello` was created within 5s.
- A PipelineRun created with `spec.status: PipelineRunPending` stays paused: `.spec.status` remains `PipelineRunPending`; `.status.conditions[*].reason=PipelineRunPending`; `.status.conditions[*].message='PipelineRun "<name>" is pending'`.
- While the gate is in place, `kubectl get taskrun -n <ns>` returns `No resources found`, and the only events emitted for the PipelineRun are `Started` and `FinalizerUpdate` — no scheduling or pod events.
- A re-submitted PipelineRun without `spec.status` runs to a terminal state and produces a TaskRun, while the parallel original PipelineRun with `spec.status=PipelineRunPending` stays paused with no TaskRun.
- After `kubectl patch pipelinerun <name> --type=json -p='[{"op":"remove","path":"/spec/status"}]'` the gate clears: `.spec.status` becomes empty, `.status.conditions[*].reason` flips to `Running`, and a TaskRun appears within a few seconds.
@@ -22,9 +22,9 @@ The reconcile of an already-resolved `ResolutionRequest` is structurally a no-op

## Resolution

Leave the ten-hour cadence in place. Because the wake-up of a completed `ResolutionRequest` returns immediately from `ReconcileKind`, the periodic reconcile costs nothing measurable — on lab-base the post-resolution settle reconcile and an annotate-triggered re-reconcile of the same `ResolutionRequest` were observed at `duration=0.000030088` and `duration=0.000122599` (sub-millisecond) respectively, with no API-server write to the object. There is no per-`ResolutionRequest` lifetime knob worth tuning here, and the controller-side knob that would raise the framework default to twenty-four hours is not exposed by the resolvers binary.
Leave the ten-hour cadence in place. Because the wake-up of a completed `ResolutionRequest` returns immediately from `ReconcileKind`, the periodic reconcile costs nothing measurable — on a stock ACP cluster the post-resolution settle reconcile and an annotate-triggered re-reconcile of the same `ResolutionRequest` were observed at `duration=0.000030088` and `duration=0.000122599` (sub-millisecond) respectively, with no API-server write to the object. There is no per-`ResolutionRequest` lifetime knob worth tuning here, and the controller-side knob that would raise the framework default to twenty-four hours is not exposed by the resolvers binary.

To confirm the no-op shape on a specific cluster, force a fresh reconcile of a completed `ResolutionRequest` by annotating it (so the work-queue picks it up without waiting for the ten-hour resync), then read the resolvers controller log for the same `knative.dev/key`; the `Reconcile succeeded` line for that key should report a sub-millisecond `duration` field on the second and subsequent reconciles, which is the live signature of the `IsDone()` short-circuit on this cluster.
To confirm the no-op shape on a specific cluster, force a fresh reconcile of a completed `ResolutionRequest` by annotating it (so the work-queue picks it up without waiting for the ten-hour resync), then read the resolvers controller log for the same `knative.dev/key`; the `Reconcile succeeded` line for that key should report a sub-millisecond `duration` field on the second and subsequent reconciles, which is the live signature of the `IsDone()` short-circuit.
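
The log check can be sketched as two small helpers. This is an illustration under stated assumptions: the function names are hypothetical, the log line is assumed to carry a `duration=<seconds>` token in the form quoted above, and the value is assumed to be in seconds (consistent with the article calling `0.000030088` sub-millisecond). If the controller log is structured differently on a given cluster, adjust the extraction accordingly.

```bash
# Hypothetical helper: pull the value of the last duration=<seconds> token
# out of a log line. Prints nothing if no such token is present.
extract_duration() {
  printf '%s\n' "$1" | sed -n 's/.*duration=\([0-9.][0-9.]*\).*/\1/p'
}

# Hypothetical helper: exit 0 if the given duration (seconds) is below 1 ms.
is_sub_millisecond() {
  [ -n "$1" ] || return 1
  awk -v d="$1" 'BEGIN { exit !((d + 0) < 0.001) }'
}
```

To trigger the reconcile, an annotation with any key works, for example `kubectl annotate resolutionrequest <name> -n <ns> example.com/force-reconcile="$(date +%s)" --overwrite` (the annotation key is arbitrary and chosen here only for illustration). Then pass the matching `Reconcile succeeded` line to `extract_duration` and the result to `is_sub_millisecond`.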

## Diagnostic Steps

@@ -57,14 +57,34 @@ The Pipelines-as-Code component specifically is exposed as the cluster-scoped CR

Two recovery paths apply, depending on whether the wedge is generic (an installer set whose finalizer cannot be removed by its owner) or component-specific (the operator's reconcile loop for one component is failing).

**Force-remove a stalled `TektonInstallerSet` finalizer.** When a `TektonInstallerSet` is marked for deletion but the owning component controller cannot drop its finalizer, the resource sits with a `deletionTimestamp` set and a single finalizer entry `tektoninstallersets.operator.tekton.dev`. Patching the finalizer list to `null` releases the API server-side delete:
**Re-create the stuck installer set first (preferred).** Re-creating the installer set is the operator-aware path: when its owning component controller is healthy it will lay the resource down again from its embedded manifests. Try this before patching finalizers, because force-removing a finalizer on a resource whose owner is still actively reconciling can orphan operator-managed children.

```bash
# Identify the failing component, then delete its installer set(s)
kubectl get tektoninstallerset \
-l operator.tekton.dev/created-by=<ComponentName>
kubectl delete tektoninstallerset <names>
```

If the delete completes cleanly the owner controller recreates the set within a few seconds.

**Force-remove a stalled `TektonInstallerSet` finalizer (last resort).** Only use this when the delete is genuinely stuck — meaning the resource carries a `deletionTimestamp` for more than a minute, the owning component reconcile is failing (operator pod logs show repeated errors against that component), and re-creation per the previous step is not unblocking the wedge.

Confirm preconditions before patching:

- The resource has a `deletionTimestamp` set: `kubectl get tektoninstallerset <name> -o jsonpath='{.metadata.deletionTimestamp}'` returns a non-empty timestamp.
- The only finalizer is the controller's own `tektoninstallersets.operator.tekton.dev`. If extra finalizers are present, investigate them first.
- Capture the current spec before patching: `kubectl get tektoninstallerset <name> -o yaml > /tmp/<name>.yaml`, so you can recover any operator-managed children if the operator does not lay them down again automatically.
- Confirm with the customer that the installer-set's owning manifests can be lost — the owner controller normally re-creates them, but a still-reconciling owner may not.
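
The first two preconditions are mechanical and can be sketched as a guard. This is a minimal illustration: the function name `force_remove_allowed` is hypothetical, and it deliberately covers only the `deletionTimestamp` and single-finalizer checks. Capturing the spec and confirming with the customer remain manual steps.

```bash
# Hypothetical guard: exit 0 only when the resource is already being deleted
# and its sole finalizer is the operator's own.
#   $1      deletionTimestamp (empty string if unset)
#   $2...   finalizer entries, one per argument
force_remove_allowed() {
  deletion_ts="$1"
  shift
  [ -n "$deletion_ts" ] || return 1   # not marked for deletion
  [ "$#" -eq 1 ] || return 1          # zero or extra finalizers: investigate first
  [ "$1" = "tektoninstallersets.operator.tekton.dev" ]
}

# Example wiring (requires cluster access; <name> is a placeholder):
#   ts="$(kubectl get tektoninstallerset <name> -o jsonpath='{.metadata.deletionTimestamp}')"
#   fins="$(kubectl get tektoninstallerset <name> -o jsonpath='{.metadata.finalizers[*]}')"
#   force_remove_allowed "$ts" $fins && echo "preconditions met"
```

Leaving `$fins` unquoted in the example is intentional so that each finalizer becomes its own argument.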

Then issue the patch; the resource leaves the API server immediately:

```bash
kubectl patch tektoninstallerset <name> \
--type=merge -p '{"metadata":{"finalizers":null}}'
```

On the running operator the finalizer pattern is exactly that — a single entry that the controller normally drops itself on a clean delete; the `null` patch is the manual override when the controller does not.
After the patch, watch the owning component re-create the installer set (`kubectl get tektoninstallerset -l operator.tekton.dev/created-by=<ComponentName> -w`). If it does not return within a minute, the owner is still failing — read the operator-pod logs for the underlying error before doing anything else.

**Re-reconcile a stuck Pipelines-as-Code component.** When the failing component is Pipelines-as-Code, list the `TektonInstallerSet`s that belong to the `pac` operand (their names are prefixed with the component, e.g. `openshiftpipelinesascode-main-deployment-*`, `openshiftpipelinesascode-main-static-*`, `openshiftpipelinesascode-post-*`) and delete them; the operator's component controller recreates them from its embedded manifests and the component returns to ready:

@@ -110,16 +110,3 @@ done
```

A sweep that succeeds up to `~1372` bytes (`1400 − 28` for the ICMP+IP header) and fails from `~1473` onward localises the break point to the kube-ovn 1400 overlay ceiling, not to the upstream network. Combined with the node-side and pod-side readings above, that is sufficient to attribute the failure to an MTU misconfiguration at one of the three layers — NAD CNI-JSON `mtu`, `Subnet.spec.mtu`, or the node interface MTU — and to point at which one to lower.
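
The `1400 − 28` arithmetic generalises to any MTU. The following is a minimal sketch, assuming IPv4 with no IP options (20-byte IP header) plus the 8-byte ICMP echo header. The function name `max_icmp_payload` is hypothetical.

```bash
# Hypothetical helper: largest ICMP echo payload that fits in one unfragmented
# IPv4 packet for a given MTU (20-byte IP header + 8-byte ICMP header).
max_icmp_payload() {
  mtu="$1"
  echo $((mtu - 20 - 8))
}
```

`max_icmp_payload 1400` prints `1372` and `max_icmp_payload 1500` prints `1472`. On Linux, the result can be used directly as the probe size with the don't-fragment flag set, for example `ping -M do -s "$(max_icmp_payload 1400)" -c 1 <target>`.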

phase2 ev10 (lab-base / global)
phase4 ev2 (lab-base) + phase2 ev2/ev7/ev8 (global)
phase5 ev7 (lab-base) — `--network-type=geneve`, `--encap-checksum=true`
phase5 ev5 + ev6 (lab-base) — pod eth0 mtu 1400 / node eth0 mtu 1500
phase5 ev6 (lab-base) — 23 veth host-ends all mtu 1400
phase2 ev11 (mechanism) + phase5 ev5/ev6/ev7 (lab-base ceiling proved)
phase2 ev11 (generic Geneve/TCP/PMTU behavior)
phase2 ev11 (generic TCP retransmit capture shape)
phase4 ev3 (lab-base) — KubeVirt Deployed
phase4 ev3 (lab-base) — KubeVirt Deployed
phase2 ev3 (NAD CRD on global) — note: on lab-base Multus is not installed, see phase4 ev4
phase5 ev8 (lab-base) — `subnet.spec.mtu` CRD field