@jeremyeder
Collaborator

This is a prototype UX for using Open WebUI to interact with the new Amber codebase agent added in #337.

jeremyeder and others added 2 commits November 18, 2025 20:16
Add a Kubernetes-native deployment of Open WebUI with LiteLLM proxy for
chatting with Claude models. This Phase 1 implementation provides a quick,
dev-friendly deployment to Kind cluster with minimal configuration.

Components:
- Base manifests (namespace, deployments, services, PVC, RBAC)
- LiteLLM proxy configured for Claude Sonnet 4.5, 3.7, and Haiku 3.5
- Open WebUI frontend with persistent storage
- Phase 1 overlay for Kind deployment with nginx-ingress
- Comprehensive documentation (README, Phase 1 guide, Phase 2 plan)
- Makefile for deployment automation

Architecture:
- Namespace: openwebui (isolated from ACP)
- Ingress: vteam.local/chat (reuses Kind cluster from e2e)
- Auth: Disabled in Phase 1 (dev/testing only)
- Storage: 500Mi PVC for chat history
- Images: ghcr.io/berriai/litellm, ghcr.io/open-webui/open-webui

Phase 2 (planned):
- OAuth authentication via oauth2-proxy
- Long-running Claude Code service for Amber integration
- Production hardening (secrets, RBAC, monitoring)
- OpenShift compatibility (Routes, SCC compliance)

Deployment:
```bash
cd components/open-webui-llm
# Edit overlays/phase1-kind/secrets.yaml with API key
make phase1-deploy
# Access: http://vteam.local:8080/chat (Podman) or /chat (Docker)
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Increase memory limit from 512Mi to 2Gi to prevent OOMKilled crashes
- Increase CPU limit from 500m to 1000m for better performance
- Update health probe paths to LiteLLM-specific endpoints:
  - /health/liveliness for liveness probe
  - /health/readiness for readiness probe
- Increase resource requests for stability

Fixes LiteLLM pod crash loop due to insufficient memory allocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
jeremyeder enabled auto-merge (squash) November 19, 2025 01:35
@github-actions
Contributor

Claude Code Review

Summary

This PR introduces a well-structured deployment of Open WebUI + LiteLLM for interacting with Claude models via the Amber codebase agent. The implementation is clean, well-documented, and follows Kubernetes best practices with Kustomize overlays. This is clearly marked as "Phase 1" - a development/prototype deployment with authentication disabled, which is appropriate for initial integration testing.

Overall Assessment: Strong implementation with excellent documentation. However, there are critical security issues that must be addressed before merge, plus several important improvements for production readiness.


Issues by Severity

🚫 Blocker Issues

1. Missing SecurityContext - Pod Security Standards Violation ⚠️

  • Location: base/litellm/deployment.yaml, base/open-webui/deployment.yaml
  • Issue: No securityContext defined for containers, violating CLAUDE.md backend development standards
  • Risk: Pods run as root by default, fail on OpenShift, bypass security policies
  • Required Fix: Add SecurityContext per CLAUDE.md patterns (components/backend/handlers/sessions.go:470+):
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      capabilities:
        drop:
          - ALL
      # readOnlyRootFilesystem: true  # Only if /tmp writes not needed
  • Why Blocker: Security vulnerability + won't deploy to OpenShift (SecurityContextConstraints)

2. Hardcoded Secrets in Git Repository 🔐

  • Location: overlays/phase1-kind/secrets.yaml:10-11,22
  • Issue: Master keys and placeholder API keys committed to git
  • Risk: Security anti-pattern, even for dev keys
  • Required Fix:
    • Remove secrets.yaml from git, add to .gitignore
    • Create secrets.yaml.example with placeholder values
    • Update README to instruct: cp secrets.yaml.example secrets.yaml
  • Why Blocker: Violates security best practices, sets bad precedent
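
The secrets refactor described above could be scripted roughly as follows. This is a minimal sketch: the `init_secrets` helper name is hypothetical, and only the file names (`secrets.yaml`, `secrets.yaml.example`, `.gitignore`) come from the fix list.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): seed a local secrets.yaml from the
# committed example file and make sure the real file is ignored by git.
init_secrets() {
  local dir="$1"

  # Copy the template only if the developer has not created secrets.yaml yet,
  # so an existing file with a real API key is never overwritten.
  if [ ! -f "$dir/secrets.yaml" ]; then
    cp "$dir/secrets.yaml.example" "$dir/secrets.yaml"
  fi

  # Add the ignore rule exactly once: -x matches the whole line, -F disables regex.
  touch "$dir/.gitignore"
  if ! grep -qxF 'secrets.yaml' "$dir/.gitignore"; then
    echo 'secrets.yaml' >> "$dir/.gitignore"
  fi
}
```

Usage would be `init_secrets overlays/phase1-kind`. Because secrets.yaml is already tracked, a one-time `git rm --cached overlays/phase1-kind/secrets.yaml` is also needed; .gitignore alone does not untrack a committed file.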

3. Authentication Disabled in Production-Capable Deployment 🔓

  • Location: base/open-webui/deployment.yaml:36-37
  • Issue: WEBUI_AUTH: "false" with no clear runtime override mechanism
  • Risk: Anyone with network access can use UI and consume API quota
  • Required Fix:
    • Keep auth disabled for Phase 1 BUT add prominent warnings in:
      • Deployment manifests (comments)
      • README (⚠️ WARNING sections)
      • Makefile help text
    • OR: Enable basic auth by default with documented override for dev
  • Why Blocker: Too easy to accidentally deploy to production without auth

🔴 Critical Issues

4. Resource Limits Too High for Default Kind Cluster

  • Location: base/litellm/deployment.yaml:54-55, base/open-webui/deployment.yaml:44-46
  • Issue: Total limits = 2 CPU, 3Gi RAM. Default Kind cluster has 4 CPU, 8Gi RAM
  • Impact: May not deploy on constrained environments, wastes resources
  • Recommendation:
    • Lower limits in base: litellm (500m CPU, 1Gi RAM), openwebui (500m CPU, 512Mi RAM)
    • Override in production overlay for higher limits
    • Add resource quota monitoring in Makefile

5. Using :main and :main-latest Image Tags 🏷️

  • Location: base/kustomization.yaml:22-24
  • Issue: Mutable tags mean non-reproducible deployments
  • Impact: Breaks in prod if upstream changes, hard to debug version issues
  • Recommendation: Pin to specific SHA or version tags:
    - name: ghcr.io/berriai/litellm
      newTag: v1.49.3  # Or pin by digest using the "digest:" field (sha256:abc123...) instead of newTag
    - name: ghcr.io/open-webui/open-webui
      newTag: v0.3.19
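
A pre-deploy check for mutable tags could look like the sketch below. The `is_mutable_image` name and the list of tags treated as mutable (`latest`, `main`, `main-latest`) are assumptions made for illustration; the two tags flagged in this issue are included.

```bash
#!/usr/bin/env bash
# Hypothetical check: exit status 0 means the image reference is mutable.
is_mutable_image() {
  local image="$1"

  # Digest-pinned references (name@sha256:...) are immutable.
  case "$image" in
    *@sha256:*) return 1 ;;
  esac

  # Keep only the last path segment so a registry port (host:5000/app)
  # is not mistaken for a tag.
  local last="${image##*/}"

  # No ':' in the last segment means the implicit "latest" tag.
  case "$last" in
    *:*) ;;
    *) return 0 ;;
  esac

  # Assumed list of floating tags; extend as needed.
  local tag="${last##*:}"
  case "$tag" in
    latest|main|main-latest) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_mutable_image ghcr.io/berriai/litellm:main-latest && echo "pin this image"` would print the warning, while a version tag such as `v1.49.3` would pass silently.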

6. No Network Policies 🌐

  • Location: Missing entirely
  • Issue: Pods can egress to any external service
  • Risk: Data exfiltration, unnecessary attack surface
  • Recommendation: Add NetworkPolicy to restrict:
    • LiteLLM → only Anthropic API (anthropic.com)
    • Open WebUI → only LiteLLM service
    • Deny all other egress
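
As a rough sketch of the Open WebUI rule, the helper below prints a NetworkPolicy that allows egress only to LiteLLM pods plus DNS. The `render_webui_egress_policy` name, the `app: open-webui` label, and port 4000 are assumptions and should be checked against the actual manifests. Note that a standard NetworkPolicy cannot match DNS names, so restricting LiteLLM to anthropic.com would need an `ipBlock` rule or a CNI-specific extension.

```bash
#!/usr/bin/env bash
# Hypothetical generator: prints an egress NetworkPolicy for Open WebUI.
# Label values and the LiteLLM port are assumptions, not taken from the PR.
render_webui_egress_policy() {
  local namespace="${1:-openwebui}"
  local litellm_port="${2:-4000}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: open-webui-egress
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app: open-webui
  policyTypes:
    - Egress
  egress:
    # Allow traffic to LiteLLM pods in the same namespace.
    - to:
        - podSelector:
            matchLabels:
              app: litellm
      ports:
        - protocol: TCP
          port: ${litellm_port}
    # Allow DNS lookups so the service name can be resolved.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

Something like `render_webui_egress_policy | kubectl apply -f -` would apply it. Once a pod is selected by a policy with `policyTypes: Egress`, any egress not listed is denied, which covers the "deny all other egress" item for that pod.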

7. Probes Reference Wrong Endpoints

  • Location: base/litellm/deployment.yaml:58,64
  • Issue: Probes use /health/liveliness, which reads like a typo for /health/liveness; the spelling LiteLLM actually serves has not been confirmed
  • Impact: Probes may fail if endpoint doesn't exist (need to verify upstream)
  • Action: Test actual LiteLLM endpoints or use /health for both

8. No Horizontal Pod Autoscaling

  • Location: Missing HPA resources
  • Issue: Single replica can't handle load spikes
  • Recommendation: Add HPA for both deployments (min: 1, max: 3, target: 70% CPU)
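
The recommended autoscaler settings could be expressed as in the sketch below, which prints an `autoscaling/v2` HorizontalPodAutoscaler with the min, max, and CPU target from the recommendation. The `render_hpa` name is hypothetical, and the Deployment name passed in is assumed to match the HPA name.

```bash
#!/usr/bin/env bash
# Hypothetical generator: prints an HPA (min 1, max 3, 70% CPU) for one Deployment.
render_hpa() {
  local name="$1"
  local namespace="${2:-openwebui}"
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${name}
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF
}
```

For instance, `render_hpa litellm | kubectl apply -f -`. CPU utilization is measured against each container's CPU request, so the requests must stay defined, and the cluster needs a metrics source such as metrics-server for the HPA to act.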

🟡 Major Issues

9. Missing Test Coverage

  • Location: No test files in components/open-webui-llm/
  • Issue: No automated tests for deployment health, connectivity, or functionality
  • Recommendation: Add tests following e2e pattern:
    • Smoke test: deploy → wait for ready → curl health endpoints → undeploy
    • Integration test: send test message via API → verify response
    • See: e2e/cypress/e2e/vteam.cy.ts for reference

10. PVC Has No Backup Strategy

  • Location: base/open-webui/pvc.yaml
  • Issue: Chat history lost on PVC deletion, no backup documented
  • Recommendation:
    • Add VolumeSnapshot CRD usage example in docs
    • Document export/import procedure in README
    • Add Makefile targets: make phase1-backup, make phase1-restore

11. Secrets Reference Missing Secret Objects

  • Location: base/open-webui/deployment.yaml:31-35
  • Issue: References openwebui-secrets but it's only in overlay, not base
  • Impact: Breaks Kustomize build if overlay not applied, confusing error
  • Recommendation: Move the secret template to base/ and document the optional: true flag, which is already set

12. Ingress Path Rewrite May Break Assets

  • Location: overlays/phase1-kind/ingress.yaml:9
  • Issue: rewrite-target: /$2 may break asset loading if app expects /chat prefix
  • Action: Test thoroughly, document known issues
  • Alternative: Use dedicated subdomain (chat.vteam.local) to avoid rewrites

13. No Health Check in Makefile Test Target

  • Location: Makefile:58-67
  • Issue: Test target only checks curl success, doesn't validate response
  • Recommendation: Parse JSON responses, check for expected fields
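
One way the test target could validate a response body rather than only the curl exit code is sketched below. The `json_field_equals` helper is hypothetical, and it uses grep to avoid adding a dependency; a real JSON parser such as jq would be more robust. The field name and value to check are assumptions that need confirming against the actual endpoint output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the body contains "field": "expected".
# This is a pattern match, not a JSON parser, so it suits flat responses only.
json_field_equals() {
  local body="$1" field="$2" expected="$3"
  printf '%s' "$body" \
    | grep -Eq "\"${field}\"[[:space:]]*:[[:space:]]*\"${expected}\""
}
```

A Makefile recipe could then capture the response with `curl -fsS` and call `json_field_equals "$body" status healthy`, failing the target when the field is absent or holds a different value.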

14. Hardcoded Master Key Exposed in Test Commands

  • Location: Makefile:65, README examples
  • Issue: Master key sk-litellm-dev-master-key shown in plaintext
  • Recommendation: Read from secret at runtime
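
Reading the key at runtime might look like the sketch below. The `read_secret_key` helper is hypothetical, and the Secret name and key must be taken from the overlay; none are assumed here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the decoded value of one key in a Kubernetes Secret.
# Works for keys made of letters, digits, and underscores; keys containing
# dots or dashes would need jsonpath escaping.
read_secret_key() {
  local namespace="$1" secret="$2" key="$3"
  kubectl get secret "$secret" -n "$namespace" \
    -o "jsonpath={.data.${key}}" | base64 --decode
}
```

The test command could then use `MASTER_KEY="$(read_secret_key openwebui <secret-name> <key-name>)"` and pass `$MASTER_KEY` in the Authorization header, keeping the literal value out of the Makefile and README.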

🔵 Minor Issues

15. Inconsistent Label Usage

  • Observation: Some resources use app: litellm, others add app.kubernetes.io/name: litellm
  • Recommendation: Standardize on both everywhere (Kubernetes recommended labels)

16. Documentation Uses Emojis Inconsistently

  • Location: README.md, PHASE1.md
  • Issue: CLAUDE.md states "Only use emojis if the user explicitly requests it"
  • Recommendation: Remove emojis from docs (✅, ❌, ⚠️) for consistency

17. Namespace Not Configurable via Kustomize

  • Location: base/namespace.yaml:3, Makefile:5
  • Issue: Hardcoded to openwebui, limits multi-environment deployments
  • Recommendation: Use Kustomize namespace transformation in overlays

18. No Pod Disruption Budget

  • Location: Missing PDB resources
  • Impact: Rolling updates may cause downtime
  • Recommendation: Add PDB for production overlay (minAvailable: 1)

19. Missing Prometheus ServiceMonitor

  • Location: No observability resources
  • Recommendation: Add ServiceMonitor if Prometheus available, document metrics endpoints

20. Verbose ConfigMap Could Be Externalized

  • Location: base/litellm/configmap.yaml:10-34
  • Observation: 25-line YAML in ConfigMap, hard to maintain
  • Recommendation: Mount from file in overlays for easier editing

Positive Highlights

Excellent Documentation

  • Comprehensive README with troubleshooting section
  • Clear phase separation (Phase 1 vs Phase 2 planning)
  • Step-by-step deployment instructions
  • Thoughtful Makefile with helpful targets

Proper Kustomize Structure

  • Clean base/overlay separation
  • Environment-specific patches (pvc-patch, secrets)
  • Reusable base resources

Good Operational Tooling

  • Makefile abstracts complexity well
  • Health check commands included
  • Shell access targets for debugging
  • Comprehensive help text

Well-Organized Component Structure

  • Logical directory layout
  • Clear separation of concerns (litellm/, open-webui/)
  • Phase-based roadmap (PHASE2.md shows forward planning)

Recent Commit Shows Iteration

  • Fixed real issues (memory limits, health probes)
  • Demonstrates testing and refinement
  • Good commit message with rationale

Recommendations

Immediate (Before Merge)

  1. Add SecurityContext to both deployments (see Blocker #1)
  2. Remove secrets.yaml, create .example file (see Blocker #2)
  3. Add prominent auth warning to README and deployment comments (see Blocker #3)
  4. Pin image tags to specific versions (see Critical #5)
  5. Verify health probe endpoints (/health/liveness vs /health/liveliness)
  6. Add basic smoke test (deploy → health check → cleanup)

Short-Term (Next PR)

  1. 🔄 Add NetworkPolicies for egress restriction
  2. 🔄 Reduce base resource limits, move higher limits to production overlay
  3. 🔄 Add HPA resources for scalability
  4. 🔄 Document backup/restore procedure for PVC

Long-Term (Phase 2)

  1. 🚀 Implement OAuth per PHASE2.md plan
  2. 🚀 Add OpenShift Routes alternative to Ingress
  3. 🚀 Integrate Amber agent for long-running sessions
  4. 🚀 Add comprehensive test suite following e2e patterns

Architecture Fit

Integration with Ambient Code Platform: ✅ Good

  • Uses same ingress pattern as existing e2e tests
  • Compatible with Kind cluster setup
  • Namespace isolation follows platform patterns
  • Ready for future integration with Amber (Feature/add codebase agent #337)

Deviations from CLAUDE.md:

  • ❌ Missing SecurityContext (Blocker - see backend/operator standards)
  • ⚠️ No tests (Expected per "Testing Strategy" section)
  • ⚠️ Secrets in git (Violates "Production Considerations" → Security)

Final Verdict

Do Not Merge until Blocker issues resolved (#1, #2, #3).

After fixes, this is a solid Phase 1 prototype that:

  • Provides immediate value (chat with Claude via UI)
  • Sets good foundation for Phase 2
  • Demonstrates proper Kubernetes patterns
  • Includes excellent documentation

The core implementation is strong - the issues are mostly about security hardening and production readiness, which is expected for a Phase 1 prototype. Focus on the 3 blockers, then this is good to merge.


Estimated Effort for Pre-Merge Fixes: ~2-3 hours

  • SecurityContext: 30 min
  • Secrets refactor: 60 min
  • Auth warnings: 30 min
  • Image pinning: 15 min
  • Testing: 30 min

Great work on this integration! The documentation quality and phased approach show excellent planning. 🎯
