Description
Add a "First Steps" section at the beginning of the troubleshooting workflow to prevent common misdiagnoses and reduce investigation time.
What: We are adding a new section that emphasizes three critical checks users should perform before detailed troubleshooting:
- Collect eck-diagnostics immediately (events expire after ~1 hour)
- Check Kubernetes security policies and permissions (most common blocker)
- Verify pod status before investigating application errors (causality awareness)
Why: Many ECK deployment issues are caused by Kubernetes admission layer blocks (security policies, quotas, admission webhooks) rather than application configuration. Without checking the Kubernetes layer first, users spend days investigating symptoms (operator errors, authentication failures) instead of the root cause (pods never created).
Details users need to know:
- eck-diagnostics captures critical namespace events that reveal pod creation failures
- The `UP-TO-DATE: 0` metric indicates Kubernetes is blocking pod creation (not an application failure)
- Operator errors (401, 503, connection refused) often occur because pods don't exist
- Security policy violations appear in events.json, not pod logs
- Events expire quickly - collect diagnostics early
Proposed Content
Section Title: First Steps: Critical Checks Before Detailed Investigation
Location: Add as first major section after page introduction, before existing troubleshooting steps
Content:
## First Steps: Critical Checks Before Detailed Investigation
Perform these checks first to catch common issues and prevent unnecessary investigation:
### Step 1: Collect eck-diagnostics
Collect diagnostics immediately for any ECK deployment issue. Kubernetes retains events for about 1 hour by default, so waiting risks losing the evidence.
```bash
# Download from https://github.com/elastic/eck-diagnostics/releases/latest
./eck-diagnostics -o <operator-namespace> -r <resource-namespace>
# Check for pod creation failures
unzip -p eck-diagnostics-*.zip <namespace>/events.json | \
  jq '.items[] | select(.reason=="FailedCreate")'
```
**When to collect:**
- Deployments show `READY: 0/1` or `UP-TO-DATE: 0`
- Reports of "no pods deployed"
- Any new ECK deployment issue
### Step 2: Check Kubernetes Security Policies
Most "no pods created" issues stem from Kubernetes security policies blocking admission.
```bash
# Check namespace Pod Security labels
kubectl get namespace <namespace> -o yaml | grep pod-security
# Check for FailedCreate events
kubectl get events -n <namespace> | grep FailedCreate
# Check deployment status
kubectl get deployment -n <namespace>
```
**Common patterns:**
| Symptom | Likely Cause | Action |
|---|---|---|
| `UP-TO-DATE: 0` | Kubernetes blocking pod creation | Check events for `FailedCreate` |
| "violates PodSecurity" in events | Security policy violation | See Kubernetes troubleshooting page |
| "exceeded quota" in events | Resource quota limit | Run `kubectl describe quota -n <namespace>` |
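
The two message patterns in the table could be matched with a small shell helper like the sketch below. The function name `classify_event` and the labels it prints are illustrative assumptions, not part of kubectl or ECK; only the matched substrings come from the table, and the sample message is made up for the demonstration.

```bash
# Hypothetical helper: map a Kubernetes event message to a likely cause.
# The printed labels are illustrative, not kubectl or ECK output.
classify_event() {
  case "$1" in
    *"violates PodSecurity"*) echo "security-policy" ;;
    *"exceeded quota"*)       echo "resource-quota" ;;
    *)                        echo "unknown" ;;
  esac
}

# Example: classify one message copied from the events listing
# (this message text is a made-up sample).
classify_event 'pods "example-0" is forbidden: violates PodSecurity "restricted:latest"'
```

Anything classified as `unknown` still needs to be read in full; the helper only flags the two patterns listed above.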
### Step 3: Verify Pod Status First
Operator errors are often symptoms of pods not existing.
```bash
kubectl get pods -n <namespace>
```
**Decision point:**
- No pods (`UP-TO-DATE: 0`)? → Kubernetes-layer issue (check events, security policies)
- Pods exist but failing? → Application-layer issue (check pod logs)

**Important:** Don't investigate operator errors (401, 503) before verifying pods exist.
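
The decision point above can be sketched as a shell function that reads a pod listing on standard input and reports which layer to look at first. The function name `triage_layer` and its output wording are assumptions made for illustration; the sketch only counts lines and does not inspect pod status.

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print which layer to investigate first.
triage_layer() {
  # Count non-empty lines; each one represents a pod.
  # grep exits non-zero when nothing matches, so guard it with `|| true`.
  pod_count=$(grep -c . || true)
  if [ "$pod_count" -eq 0 ]; then
    echo "kubernetes-layer: no pods, check events and security policies"
  else
    echo "application-layer: $pod_count pod(s) exist, check pod logs"
  fi
}

# Intended use against a cluster (not run here):
#   kubectl get pods -n <namespace> --no-headers 2>/dev/null | triage_layer

# Demonstration with a made-up, single-pod listing:
printf 'example-es-default-0   0/1   CrashLoopBackOff   4   5m\n' | triage_layer
```

Redirecting stderr matters in the real invocation because kubectl reports an empty namespace on stderr, leaving stdin empty for the function.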
---
## Rationale
**Problem:** Users often investigate application-layer errors (authentication, connectivity) for days without first checking if pods were ever created. Kubernetes security policies silently block pod creation at the admission layer.
**Impact:** This addition provides a clear entry point that catches Kubernetes-layer issues immediately, reducing multi-day investigations to hours.
**Placement:** At the top of troubleshooting workflow ensures all users see these critical checks first.
### Resources
**Target Page:**
https://www.elastic.co/docs/troubleshoot/deployments/cloud-on-k8s/troubleshooting-methods
### Which documentation set does this change impact?
Elastic On-Prem only
### Feature differences
N/A
### What release is this request related to?
9.1
### Serverless release
N/A
### Collaboration model
The documentation team
### Point of contact.
**Main contact:** @eedugon
**Stakeholders:** @damianpfister