Conversation

@jwm4 (Contributor) commented Oct 29, 2025

Yahav and I were both having trouble running vTeam locally with CRC. We were able to create projects and workspaces and start an ideate session, but when we get to the message tab, we see no messages and get no response when we send one.

I looked into the OpenShift console and saw errors like this:

[GIN] GET | 200 | 10.217.0.68 | /api/projects/lendyr-red-local/agentic-sessions/agentic-session-1761252727
[GIN] POST | 202 | 10.217.0.68 | /api/projects/lendyr-red-local/sessions/agentic-session-1761252727/messages
2025/10/28 18:08:25 persistMessageToS3: path: /workspace/sessions/agentic-session-1761252727/messages.jsonl
2025/10/28 18:08:25 persistMessage: open failed: open /workspace/sessions/agentic-session-1761252727/messages.jsonl: no such file or directory

This PR addresses the problem. It also adds a new CONTRIBUTING.md file with a section called "Local Development Setup" that covers getting and using CRC in detail. The CONTRIBUTING.md file contains a lot of generic contributing guidance as well, but I included it in this PR because the "Local Development Setup" section is important for replicating my results showing that this works.

I've addressed most of the CodeRabbit concerns. The remaining ones don't really apply here because these are CRC-specific local development manifests. The storage class is required because CRC has no default storage class (verified in testing); the production manifests live separately in components/manifests/.
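For anyone reproducing the storage class check, a quick way to confirm what the cluster offers is:

oc get storageclass

On my CRC instance this lists crc-csi-hostpath-provisioner with nothing marked as default, which is why the local PVC manifest sets storageClassName explicitly.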

Signed-off-by: Bill Murdock <bmurdock@redhat.com>
@github-actions (Contributor)

PR Review: Update file system config for CRC

Summary

This PR adds persistent volume support to the local development backend deployment to fix a file system issue where message persistence was failing with "no such file or directory" errors.

✅ Positive Aspects

  1. Aligns with Production Configuration: The changes mirror the production deployment pattern in components/manifests/, which already uses a PVC mounted at /workspace. This is good consistency.

  2. Correct PVC Configuration: The PVC spec matches the production workspace-pvc.yaml:

    • ReadWriteOnce access mode (appropriate for single backend replica)
    • 5Gi storage request
    • Proper labels for identification
  3. Proper Deployment Order: The script now creates the PVC before the backend deployment, which is the correct order of operations.

  4. Environment Variable Set: The STATE_BASE_DIR=/workspace env var is explicitly set, matching production configuration.
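For context, the wiring these points refer to looks roughly like this in the local backend deployment (an abridged sketch; the names are taken from the manifests discussed in this review, not a verbatim copy):

containers:
- name: backend
  env:
  - name: STATE_BASE_DIR
    value: /workspace            # matches the production default
  volumeMounts:
  - name: backend-state
    mountPath: /workspace        # where sessions/<id>/messages.jsonl is written
volumes:
- name: backend-state
  persistentVolumeClaim:
    claimName: backend-state-pvc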

🔍 Issues & Recommendations

Critical: Missing Directory Initialization

The root cause of your intermittent success is likely a race condition with directory creation. Looking at the backend code (components/backend/websocket/hub.go:165):

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

This creates the session-specific directory, but there's a potential issue:

  1. Empty PVC: When CRC creates a new PVC, it starts completely empty
  2. Timing Issue: If a message arrives before the first MkdirAll call succeeds, or if the pod restarts, the directory might not exist
  3. Error Handling: The code silently ignores the MkdirAll error (_ = ), so failures aren't logged

Recommendation: Add an init container or startup command to ensure the base directory structure exists:

# In backend-deployment.yaml, add:
initContainers:
- name: init-workspace
  image: quay.io/ambient_code/vteam_backend:latest
  command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 755 /workspace/sessions']
  volumeMounts:
  - name: backend-state
    mountPath: /workspace

Low: Inconsistent Label Values

The PVC uses app: vteam-backend but the deployment uses app: backend-api. While this doesn't break functionality, it's inconsistent with the production manifests.

Recommendation: Change backend-pvc.yaml line 6 to:

labels:
  app: backend-api  # Match deployment label
  component: state-storage

Info: Missing storageClassName

CRC will use its default storage class, but for clarity and to match production expectations, consider explicitly setting:

spec:
  storageClassName: crc-csi-hostpath-provisioner  # CRC's default
  accessModes:
    - ReadWriteOnce

Backend Code: Silent Error Handling

In components/backend/websocket/hub.go:165, the MkdirAll error is ignored:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

Recommendation: Log errors for debugging (separate PR):

if err := os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755); err != nil {
    log.Printf("Failed to create session directory: %v", err)
    return
}

🧪 Testing Recommendations

  1. Clean State Test: Delete the PVC and redeploy to simulate a fresh environment:

    oc delete pvc backend-state-pvc -n vteam-dev
    make dev-stop && make dev-start
  2. Rapid Message Test: Send multiple messages quickly after session creation to test race conditions

  3. Pod Restart Test: Kill the backend pod after creating a session, verify messages persist

  4. Directory Permissions: Exec into the backend pod and verify:

    oc exec -it deployment/backend-api -- ls -la /workspace

📋 Pre-Merge Checklist

  • Add init container to create base directory structure
  • Fix label consistency (vteam-backend → backend-api)
  • Test with clean PVC state
  • Verify messages persist across pod restarts
  • Consider adding backend logging improvements (can be separate PR)

Summary

The PR is on the right track and correctly identifies the need for persistent storage. The main issue is likely empty PVC initialization causing intermittent failures. Adding an init container should resolve the "no such file or directory" errors consistently.

Recommended Action: Add the init container, test with a clean PVC, and this should be ready to merge.


Review based on CLAUDE.md guidelines and production deployment patterns

@github-actions (Contributor)

Code Review: Update file system config for crc

I've reviewed the changes in this PR. Overall, this is a solid fix for the filesystem persistence issue in the local CRC development environment. Here's my detailed feedback:

✅ What's Working Well

  1. Root Cause Identified: The PR correctly identifies and addresses the issue where /workspace/sessions directory doesn't exist, causing message persistence to fail.

  2. Init Container Pattern: Using an initContainer to create the directory structure is the right approach - it ensures the directory exists before the main backend container starts.

  3. PVC Creation Order: The PR correctly creates the PVC before deploying the backend, which prevents race conditions.

  4. Consistency with Production: The new backend-pvc.yaml matches the production manifest at components/manifests/workspace-pvc.yaml, ensuring dev/prod parity.

🔍 Code Quality Issues

1. Missing Storage Class (Medium Priority)

The PVC doesn't specify a storageClassName. While this will use the cluster default, it's better to be explicit for CRC:

spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""  # Use default storage class explicitly
  resources:
    requests:
      storage: 5Gi

Or for CRC-specific storage:

  storageClassName: crc-csi-hostpath-provisioner  # CRC's default

2. Init Container Permissions (Low Priority)

The init container command uses chmod 755. Consider:

  • Security: Should this be 750 or 700 to restrict access?
  • Ownership: The command doesn't set ownership, relying on the container's default UID. This is fine, but document it.

Current:

command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 755 /workspace/sessions']

Consider:

command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 750 /workspace/sessions']

3. Missing SecurityContext (Medium Priority)

According to the CLAUDE.md Backend Development Standards, Job pods should have SecurityContext. The backend deployment is missing this:

containers:
- name: backend
  image: ...
  securityContext:
    allowPrivilegeEscalation: false
    runAsNonRoot: true
    capabilities:
      drop: ["ALL"]

This should be added for production-readiness, though it may not be the cause of the current issue.

🐛 Potential Issues

1. Init Container Image Pull (High Priority)

The init container uses:

image: image-registry.openshift-image-registry.svc:5000/vteam-dev/vteam-backend:latest

This assumes the backend image is already built and available. If someone runs make dev-start on a fresh cluster, this will fail because:

  • The init container runs before the main container
  • The image is built via oc start-build which happens before the deployment
  • But there's still a potential race condition if the build isn't complete

Recommendation: Add error handling or use a busybox-based init container:

initContainers:
- name: init-workspace
  image: registry.access.redhat.com/ubi9/ubi-minimal:latest
  command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 750 /workspace/sessions']
  volumeMounts:
  - name: backend-state
    mountPath: /workspace

This is more reliable and doesn't depend on the backend image.

2. Directory Creation in Application Code (Low Priority)

Looking at components/backend/websocket/hub.go:165:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

The code already tries to create the directory. The issue is that /workspace/sessions doesn't exist, so creating /workspace/sessions/SESSION_ID fails.

Why the init container works: It creates the parent /workspace/sessions directory.

Alternative approach: Fix the application code to create the full path:

sessionDir := fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID)
if err := os.MkdirAll(sessionDir, 0o755); err != nil {
    log.Printf("Failed to create session directory: %v", err)
    return
}

This should already work, so the issue might be permissions on /workspace itself. Verify PVC mount permissions.

📋 Testing Recommendations

  1. Fresh Cluster Test: Test on a completely fresh CRC instance to verify the build order works correctly.

  2. Permission Verification: After deployment, exec into the backend pod and verify:

    oc exec -it deployment/vteam-backend -- ls -la /workspace
    oc exec -it deployment/vteam-backend -- ls -la /workspace/sessions
  3. Message Persistence: Create a session, send messages, restart the backend pod, and verify messages persist.

  4. PVC Cleanup: Test that PVC cleanup works when running make dev-clean.

🔒 Security Considerations

  1. No sensitive data concerns: The PVC stores session messages, which is appropriate.

  2. RBAC: The backend service account needs permissions to create PVCs if not already granted. Verify in components/scripts/local-dev/manifests/ that RBAC allows this.

  3. Resource limits: The backend deployment has appropriate resource limits (512Mi memory, 500m CPU).
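For the RBAC point (item 2), a quick check is an impersonated can-i query; the service account name below is an assumption, so substitute whatever the local-dev manifests actually create:

oc auth can-i create persistentvolumeclaims -n vteam-dev \
  --as=system:serviceaccount:vteam-dev:backend-api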

🎯 Performance Considerations

  1. Storage Size: 5Gi is reasonable for local dev. Monitor usage over time.

  2. ReadWriteOnce: Correctly limits to single replica, which matches the deployment spec (replicas: 1).

  3. No performance concerns: PVC I/O should be sufficient for local dev message persistence.

✨ Suggestions for Improvement

  1. Add comments explaining why the init container is needed:

    initContainers:
    # Create /workspace/sessions directory before backend starts
    # Required for message persistence in websocket/hub.go
    - name: init-workspace
  2. Document the storage requirement in the local dev docs or Makefile.

  3. Consider adding health check for the PVC mount in the readiness probe.
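For suggestion 3, one minimal way to surface a missing or read-only /workspace mount is an exec-based probe that writes and removes a marker file (a sketch only; the probe timings are arbitrary):

readinessProbe:
  exec:
    command: ['sh', '-c', 'touch /workspace/.probe && rm -f /workspace/.probe']
  initialDelaySeconds: 5
  periodSeconds: 30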

📝 Summary

Recommendation: Approve with minor changes

The PR solves the immediate problem and follows the right patterns. The main improvements needed are:

  1. Use a minimal/busybox image for the init container (higher reliability)
  2. Add SecurityContext to the backend container
  3. Add inline comments explaining the init container purpose
  4. Test on a fresh cluster to verify build ordering

Priority fixes before merge:

  • High: Switch to busybox/UBI-minimal for init container
  • Medium: Add SecurityContext

Can be addressed in follow-up PRs:

  • Storage class specification
  • Improved error handling in application code
  • Documentation updates

Great work identifying and fixing this issue! 🎉

@github-actions (Contributor)

Code Review: Update file system config for crc

Thanks for working on this issue! I've reviewed the PR and have several observations and suggestions.


✅ Positive Aspects

  1. Root Cause Identified: You correctly identified that the backend needs persistent storage for the /workspace/sessions directory where messages are stored (via persistMessageToS3 in websocket/hub.go:158-175).

  2. Follows Production Pattern: Your approach mirrors the production deployment in components/manifests/backend-deployment.yaml which already uses a PVC mounted at /workspace.

  3. STATE_BASE_DIR Environment Variable: Correctly added STATE_BASE_DIR=/workspace which the backend reads in server/k8s.go:67-70.


🔍 Issues and Recommendations

1. Init Container Security Context Missing

Issue: The init container lacks a securityContext, which violates Kubernetes security best practices outlined in CLAUDE.md.

From CLAUDE.md:

Always set SecurityContext for Job pods

Recommendation: Add security context to the init container:

initContainers:
- name: init-workspace
  image: image-registry.openshift-image-registry.svc:5000/vteam-dev/vteam-backend:latest
  command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 755 /workspace/sessions']
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true
  volumeMounts:
  - name: backend-state
    mountPath: /workspace

2. Init Container May Be Unnecessary

Issue: The persistMessageToS3 function already creates directories:

// websocket/hub.go:165
_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

Question: Is the init container actually needed, or does the issue stem from the PVC not being mounted at all in the original deployment?

Recommendation: Test whether removing the init container still works. If the backend creates directories correctly, the init container adds unnecessary complexity.

3. Missing Resource Limits

Issue: Neither the PVC creation step nor the init container have resource constraints.

Recommendation:

  • The PVC size (5Gi) seems reasonable for message storage, but should be validated against expected usage patterns.
  • Add resource requests/limits to the init container if kept:
initContainers:
- name: init-workspace
  # ...
  resources:
    requests:
      memory: "32Mi"
      cpu: "10m"
    limits:
      memory: "64Mi"
      cpu: "50m"

4. PVC Creation Timing

Issue: The PVC is created immediately before the backend deployment (crc-start.sh:307-311), which is correct. However, consider adding a wait condition to ensure the PVC is bound before deploying the backend.

Recommendation:

log "Creating backend PVC..."
oc apply -f "${MANIFESTS_DIR}/backend-pvc.yaml" -n "$PROJECT_NAME"

# Wait for PVC to be bound
log "Waiting for PVC to be bound..."
oc wait --for=jsonpath='{.status.phase}'=Bound pvc/backend-state-pvc -n "$PROJECT_NAME" --timeout=60s || {
  log "WARNING: PVC did not bind within 60 seconds"
}

5. Storage Class Not Specified

Issue: The PVC does not specify a storageClassName, relying on the cluster default. This might work but could cause issues if CRC's default storage class has limitations.

Recommendation: Explicitly set the storage class for CRC:

spec:
  storageClassName: crc-csi-hostpath-provisioner  # CRC's default
  accessModes:
    - ReadWriteOnce

🧪 Testing Recommendations

  1. Verify PVC Creation: After running make dev-start, check:

    oc get pvc -n vteam-dev
    oc describe pvc backend-state-pvc -n vteam-dev
  2. Check Volume Mount: Verify the volume is mounted in the backend pod:

    oc exec -n vteam-dev deployment/vteam-backend -- ls -la /workspace
    oc exec -n vteam-dev deployment/vteam-backend -- df -h | grep workspace
  3. Test Message Persistence: Create a session, send messages, then restart the backend pod and verify messages are still retrievable.

  4. Check Permissions: Ensure the backend process can write to /workspace/sessions:

    oc exec -n vteam-dev deployment/vteam-backend -- touch /workspace/sessions/test.txt

🐛 Potential Root Cause Analysis

Based on the error message you shared:

open /workspace/sessions/agentic-session-1761252727/messages.jsonl: no such file or directory

The issue is likely NOT that the directory does not exist (since os.MkdirAll is called), but rather:

  1. PVC was never mounted in the original local-dev deployment (confirmed by comparing to origin/main)
  2. The /workspace directory might be ephemeral (container filesystem), so restarts lose data
  3. Permission issues - the backend might not have write permissions to the container's /workspace directory

Your fix (adding the PVC) should resolve issue #1, which is the most likely culprit.


📋 Suggested Changes Summary

  1. ✅ Add PVC: Already done
  2. ✅ Mount PVC: Already done
  3. ⚠️ Add security context to init container
  4. ❓ Consider removing init container (test without it first)
  5. 🔧 Add PVC wait condition to deployment script
  6. 🔧 Explicitly set storage class in PVC spec
  7. 🧪 Add the testing steps above to validate the fix

🎯 Next Steps

  1. Address the security context issue (required by project standards)
  2. Test whether the init container is actually necessary
  3. Add the PVC wait condition for more robust deployments
  4. Consider adding a comment in backend-pvc.yaml explaining why this PVC is needed for local dev (mirrors production pattern)

Overall, this is a solid fix that aligns with the production deployment pattern. With the minor adjustments above, it should reliably solve the message persistence issue in CRC. Great debugging work! 🚀

@github-actions (Contributor)

Code Review: Update file system config for crc

Thank you for addressing the local development filesystem issue! This PR correctly identifies and fixes the root cause of the message persistence problem.

Positive Aspects

  1. Correct Root Cause Identification: The error was correctly traced to missing PVC mount
  2. Consistency with Production: Aligns with production deployment patterns
  3. Security Best Practices: InitContainer properly configured with minimal permissions
  4. Proper Resource Naming: Clear and descriptive naming

Critical Issues

1. StorageClassName Hardcoded to CRC

File: components/scripts/local-dev/manifests/backend-pvc.yaml:11

Problem: storageClassName: crc-csi-hostpath-provisioner is CRC-specific and will fail on other clusters.

Impact: Production PVC omits storageClassName to use cluster default. This divergence breaks portability.

Recommendation: Remove the storageClassName field entirely to match production.

2. No PVC Binding Verification

File: components/scripts/local-dev/crc-start.sh:307-308

Problem: No verification that PVC is bound before deploying backend.

Recommendation: Add wait condition to ensure PVC is bound before proceeding.
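A minimal sketch of that wait in crc-start.sh (PVCs report binding via status.phase rather than a condition, so the jsonpath form is the one that works; note that with a WaitForFirstConsumer storage class the PVC may stay Pending until the backend pod schedules, so treat a timeout as a warning rather than a hard failure):

oc wait --for=jsonpath='{.status.phase}'=Bound pvc/backend-state-pvc \
  -n "$PROJECT_NAME" --timeout=60s || log "WARNING: PVC not bound after 60s"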

3. InitContainer Image Not Pinned

File: components/scripts/local-dev/manifests/backend-deployment.yaml:20

Recommendation: Pin to specific version (e.g., ubi9/ubi-minimal:9.5) instead of :latest

Testing Recommendations

  1. Test PVC recreation workflow
  2. Test message persistence across backend restarts
  3. Test on fresh CRC instance
  4. Test multi-session scenarios

Summary

Overall Assessment: ✅ Approve with Minor Changes

This PR correctly fixes the root cause. Implementation is sound and follows security best practices.

Primary blocker: Remove hardcoded storageClassName to match production

Secondary improvements: Add PVC binding verification, update documentation

Great work tracking this down! Once storageClassName is addressed, this should resolve the issues.

Related Files

  • Production PVC: components/manifests/workspace-pvc.yaml
  • Production Backend: components/manifests/backend-deployment.yaml
  • Backend State Logic: components/backend/websocket/hub.go:158-175

@github-actions (Contributor)

Code Review: Update file system config for CRC

Thanks for working on this important fix! I've reviewed the PR and have several observations about the implementation. The core approach of adding a PVC for persistent storage is sound, but there are some areas that need attention.


🎯 Summary

This PR addresses the no such file or directory error when persisting messages in CRC environments by:

  1. Adding a PVC (backend-state-pvc) for persistent storage
  2. Mounting the PVC at /workspace in the backend pod
  3. Using an init container to pre-create the sessions directory
  4. Adding the comprehensive CONTRIBUTING.md documentation

What's Working Well

1. Root Cause Analysis

  • Correctly identified that the backend needs persistent storage for the /workspace/sessions/ directory
  • The error open /workspace/sessions/agentic-session-1761252727/messages.jsonl: no such file or directory stems from missing directory structure

2. CONTRIBUTING.md

  • Excellent addition! This is comprehensive and well-structured
  • Clear setup instructions for CRC 2.54.0 (addresses known certificate issues)
  • Detailed troubleshooting section
  • Good coverage of development workflows and hot-reloading

3. Init Container Pattern

  • Using an init container to pre-create directories is a valid Kubernetes pattern
  • Security context follows best practices (drop all capabilities)

⚠️ Issues & Concerns

1. Critical: Directory Creation Logic Issue

Problem: The init container creates /workspace/sessions, but the backend code in websocket/hub.go:165 already does this:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

Why it might still fail:

  • The init container runs as ubi-minimal with user 1000700000 (OpenShift random UID)
  • The backend pod runs as a different UID (also OpenShift random UID)
  • Directory ownership mismatch could prevent the backend from writing files

Recommended Fix:

initContainers:
- name: init-workspace
  image: registry.access.redhat.com/ubi9/ubi-minimal:latest
  command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 777 /workspace/sessions']  # More permissive for multi-user
  securityContext:
    runAsNonRoot: true  # Add this for OpenShift compliance
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL

Or better yet, remove the init container entirely and let the backend's MkdirAll handle it (it should work if PVC permissions are correct).


2. Medium: Storage Class Hardcoded

storageClassName: crc-csi-hostpath-provisioner

Issues:

  • This storage class is CRC-specific and won't work on:
    • Production OpenShift clusters (uses ocs-storagecluster-ceph-rbd or similar)
    • Other Kubernetes distributions
    • Cloud providers (AWS EBS, Azure Disk, GCP PD)

Recommended Fix:

# Remove storageClassName entirely to use cluster default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  # storageClassName: omit for default, or parameterize via env var

Or add logic in crc-start.sh to detect the environment and apply the correct storage class.
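A rough sketch of that detection for crc-start.sh (this assumes the PVC manifest ships with the storageClassName line commented out, as in the snippet above; the variable names are taken from the script excerpts earlier in this thread):

if oc get storageclass crc-csi-hostpath-provisioner >/dev/null 2>&1; then
  # CRC detected: set the CRC-specific storage class on the fly
  sed 's/# storageClassName:.*/storageClassName: crc-csi-hostpath-provisioner/' \
    "${MANIFESTS_DIR}/backend-pvc.yaml" | oc apply -n "$PROJECT_NAME" -f -
else
  # Any other cluster: rely on the default storage class
  oc apply -f "${MANIFESTS_DIR}/backend-pvc.yaml" -n "$PROJECT_NAME"
fi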


3. Medium: Security Context Missing in Backend Container

The backend container lacks a securityContext, which violates the patterns documented in CLAUDE.md:

Current:

containers:
- name: backend
  image: ...
  # No securityContext

Should be (per CLAUDE.md standards):

containers:
- name: backend
  image: ...
  securityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: false  # Can't be true due to /workspace writes
    capabilities:
      drop:
      - ALL
    runAsNonRoot: true  # OpenShift compliance

4. Low: STATE_BASE_DIR Environment Variable

The new STATE_BASE_DIR=/workspace env var is being set, but this is already the default in server/k8s.go:69:

StateBaseDir = os.Getenv("STATE_BASE_DIR")
if StateBaseDir == "" {
    StateBaseDir = "/workspace"  // Already defaults to /workspace
}

Impact: Harmless redundancy, but unnecessary.

Recommendation: Remove the env var unless you need a different path.


5. Low: PVC Size May Be Insufficient

storage: 5Gi

Considerations:

  • Messages for active sessions with large outputs could accumulate quickly
  • Multiple concurrent sessions multiply storage needs
  • No cleanup mechanism visible in the code

Recommendations:

  • Consider 10Gi or 20Gi for production workloads
  • Implement message retention/cleanup policies
  • Monitor PVC usage
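For the retention point, nothing like this exists in the backend today, but a throwaway sweep along these lines (the path and 30-day window are assumptions) could run as a CronJob or on startup:

package main

import (
    "log"
    "os"
    "path/filepath"
    "time"
)

func main() {
    const retentionDays = 30
    cutoff := time.Now().AddDate(0, 0, -retentionDays)
    root := "/workspace/sessions"

    // Walk the sessions tree and delete message files older than the cutoff.
    err := filepath.Walk(root, func(path string, info os.FileInfo, walkErr error) error {
        if walkErr != nil {
            return walkErr
        }
        if !info.IsDir() && info.ModTime().Before(cutoff) {
            log.Printf("removing %s (last modified %s)", path, info.ModTime().Format(time.RFC3339))
            return os.Remove(path)
        }
        return nil
    })
    if err != nil {
        log.Printf("retention sweep failed: %v", err)
    }
}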

6. Code Quality: Missing Error Handling

In websocket/hub.go:165, the directory creation error is silently ignored:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

Per CLAUDE.md standards:

"Never": Silent failures (always log errors)

Recommended Fix:

sessionsDir := fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID)
if err := os.MkdirAll(sessionsDir, 0o755); err != nil {
    log.Printf("Failed to create sessions directory %s: %v", sessionsDir, err)
    return  // Don't attempt to write if directory creation failed
}

🔒 Security Review

✅ Good:

  • Init container drops all capabilities
  • Using allowPrivilegeEscalation: false
  • PVC uses ReadWriteOnce (appropriate for single replica)

⚠️ Needs Attention:

  • Backend container missing security context (see issue 3 above)
  • Directory permissions (755 vs 777) need testing with OpenShift random UIDs
  • No runAsNonRoot constraint (OpenShift will enforce, but should be explicit)

🚀 Performance Considerations

Positive:

  • PVC on CRC uses local hostpath storage (fast)
  • 5Gi should handle moderate message volumes

Potential Issues:

  • No message cleanup strategy: Messages accumulate indefinitely
  • JSONL append pattern: Sequential writes are fine, but no indexing for retrieval
  • Shared PVC: If you scale backend to multiple replicas, ReadWriteOnce will block scheduling

🧪 Testing Recommendations

Before merging, please verify:

  1. Basic Message Flow:

    make dev-start
    # Create session, send message, verify it appears in UI
  2. PVC Persistence:

    # Send messages
    oc delete pod -l app=vteam-backend -n vteam-dev
    # Wait for pod restart, verify messages still visible
  3. Directory Permissions:

    oc rsh deployment/vteam-backend -n vteam-dev
    ls -la /workspace/sessions
    # Verify backend can write files
  4. Error Handling:

    # Check logs for any permission errors
    make dev-logs-backend

📋 Action Items

Must Fix Before Merge:

  • Test with actual message flow (author mentioned "worked once, but can't replicate")
  • Fix directory ownership/permissions if backend still can't write
  • Add security context to backend container
  • Remove hardcoded storage class or make it conditional

Should Fix:

  • Add error logging for MkdirAll in websocket/hub.go:165
  • Consider removing init container (redundant with MkdirAll)
  • Test on production OpenShift to validate PVC provisioning

Nice to Have:

  • Implement message cleanup policy (retention period)
  • Add PVC usage monitoring/alerts
  • Document message persistence behavior in user docs

🎓 Learning Notes

This PR highlights a common Kubernetes pattern challenge: stateful data in containerized environments. Key lessons:

  1. Ephemeral vs Persistent Storage: Containers have ephemeral filesystems; use PVCs for data that must survive pod restarts
  2. OpenShift Random UIDs: Init containers and app containers run as different UIDs, requiring careful permission management
  3. Environment-Specific Configuration: Storage classes vary across environments (CRC vs production)


💡 Next Steps

  1. Address the "can't replicate success" issue - this suggests intermittent problems
  2. Check OpenShift console logs for permission denied errors
  3. Consider posting debug output from a failed attempt
  4. Test the permission fix (chmod 777 or removing init container)

Happy to help debug further if you share logs from a failed attempt! 🔍

@github-actions (Contributor)

PR Review: Update file system config for crc

I've reviewed PR #216 and the changes look good overall. The fix properly addresses the filesystem permission issue in CRC by adding a PVC and init container. Here's my detailed feedback:

✅ Strengths

  1. Root Cause Fix: The PR correctly identifies and fixes the "no such file or directory" error by ensuring the /workspace/sessions directory exists before the backend tries to write messages.

  2. Security Best Practices: The init container follows proper security patterns:

    • allowPrivilegeEscalation: false
    • Drops all capabilities
    • Uses Red Hat UBI minimal image
  3. Proper Ordering: The crc-start.sh script now creates the PVC before deploying the backend, ensuring the volume is available.

  4. Consistency with Production: The PVC configuration aligns with the production manifest (components/manifests/workspace-pvc.yaml).

  5. Excellent Documentation: The new CONTRIBUTING.md is comprehensive and provides clear CRC setup instructions that will help other developers reproduce the fix.

🔍 Issues & Recommendations

1. Security Context Missing on Backend Container (Medium Priority)

The production manifest (components/manifests/backend-deployment.yaml) doesn't have a securityContext on the backend container, but according to CLAUDE.md standards, it should:

CLAUDE.md requirement (line 1050-1057):

// Always set SecurityContext for Job pods
SecurityContext: &corev1.SecurityContext{
    AllowPrivilegeEscalation: boolPtr(false),
    ReadOnlyRootFilesystem:   boolPtr(false),  // Only if temp files needed
    Capabilities: &corev1.Capabilities{
        Drop: []corev1.Capability{"ALL"},
    },
},

Recommendation: Add a securityContext to the backend container in backend-deployment.yaml:

containers:
- name: backend
  image: image-registry.openshift-image-registry.svc:5000/vteam-dev/vteam-backend:latest
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
  # ... rest of container spec

This should be applied to both the local-dev and production manifests for consistency.

2. Init Container Permissions (Low Priority)

The init container runs as root (by default) to create directories. While this works, consider if it's necessary:

Current approach:

initContainers:
- name: init-workspace
  command: ['sh', '-c', 'mkdir -p /workspace/sessions && chmod 755 /workspace/sessions']

Alternative consideration: If the PVC mounts with proper permissions, the init container might not be needed at all since os.MkdirAll in line 165 of websocket/hub.go already creates directories:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

Question: Does CRC require the parent /workspace/sessions directory to exist before the backend can create subdirectories? If testing shows the init container is necessary, that's fine - just worth documenting why.
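(For reference, os.MkdirAll creates every missing parent, so a missing /workspace/sessions by itself would not make the backend's call fail; a quick standalone check, runnable anywhere:)

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    dir, _ := os.MkdirTemp("", "workspace")
    defer os.RemoveAll(dir)
    // Neither "sessions" nor the session subdirectory exists yet.
    err := os.MkdirAll(filepath.Join(dir, "sessions", "agentic-session-123"), 0o755)
    fmt.Println("MkdirAll error:", err) // prints <nil>: parents are created as needed
}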

3. Storage Class Hardcoding (Medium Priority)

The PVC manifest hardcodes the CRC storage class:

storageClassName: crc-csi-hostpath-provisioner

Issue: This will only work in CRC environments. If someone tries to use these manifests elsewhere, it will fail.

Recommendation: Make this configurable or document it clearly. Options:

  • Add a comment explaining this is CRC-specific
  • Use a script variable or kustomize overlay
  • Document in CONTRIBUTING.md that this needs to be changed for other environments

The production manifest (components/manifests/workspace-pvc.yaml) doesn't specify storageClassName, which allows the cluster's default storage class to be used.

4. Error Handling in persistMessageToS3 (Low Priority)

Line 165 in websocket/hub.go silently ignores MkdirAll errors:

_ = os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755)

While the subsequent OpenFile will fail if the directory can't be created, it would be better to log the error for debugging:

if err := os.MkdirAll(fmt.Sprintf("%s/sessions/%s", StateBaseDir, message.SessionID), 0o755); err != nil {
    log.Printf("persistMessage: MkdirAll failed: %v", err)
    return
}

This isn't a blocker for this PR since it's existing code, but worth noting for future improvements.

📝 Documentation Quality

The CONTRIBUTING.md file is excellent:

  • Clear CRC version recommendation (2.54.0) with rationale
  • Step-by-step setup instructions
  • Troubleshooting section
  • Hot-reload development workflow
  • Aligns with existing CLAUDE.md standards

🧪 Testing Validation

Question for the author: Have you tested that:

  1. Messages now persist correctly in CRC?
  2. Messages survive backend pod restarts (PVC persistence)?
  3. The init container successfully creates the directory structure?

⚡ Performance Considerations

No performance concerns - the changes are minimal and appropriate for local development.

🎯 Recommendations Summary

Before merging:

  1. ✅ Add securityContext to backend container (both local-dev and production manifests)
  2. ✅ Add comment explaining CRC-specific storage class or make it configurable

Optional improvements (can be separate PRs):
3. Consider if init container is truly necessary after adding PVC
4. Improve error logging in persistMessageToS3 (existing code improvement)

Conclusion

This is a solid fix that addresses a real issue preventing local development with CRC. The changes are well-thought-out and the documentation additions are valuable. With the minor security enhancements mentioned above, this will be ready to merge.

Great work identifying and fixing this issue! 🎉

@jwm4 marked this pull request as ready for review October 29, 2025 23:02

@Gkrumbach07 (Collaborator) left a comment


looks great

@Gkrumbach07 Gkrumbach07 merged commit 8cad2f2 into ambient-code:main Oct 31, 2025
19 checks passed
