# 🚨 URGENT: Evaluation Pipeline Blocked - Need Your Help

## Critical Blocker
**File:** `backend/api.py`  
**Line 1:** Has `REM filepath:` (batch syntax) instead of `# filepath:` (Python comment)  
**Impact:** Can't import module → can't start server → entire eval pipeline stalled

## What We've Built (Ready to Deploy)
1. ✅ **Evaluation Metrics Defined**: Relevance, Coherence, Groundedness
2. ✅ **Test Dataset Created**: `evaluation/test_queries.json` (10 queries)
3. ✅ **Response Collector**: `evaluation/collect_responses.py` (ready to run)

## What's Needed From You
1. **Fix the syntax error** - User fears breaking code, needs gentle approach
2. **Add missing imports** - `Depends`, `Body`, `Response`, `status`, `Request` (from `fastapi`) and `logging` (standard library) are used but not imported
3. **Run response collector** - Once fixed, execute to gather AI outputs
4. **Generate eval code** - Use azure-ai-evaluation SDK with our 3 metrics

## User Context
- Overwhelmed by incremental fixes
- Wants AI-driven solutions (not manual editing)
- Needs "one-click" approach to move forward

## Your Mission
**Create an automated fix script that:**
- Replaces `REM` with `#` on line 1
- Adds missing imports
- Validates syntax
- Runs with a single command

User will trust a complete solution more than piecemeal instructions.

## Files You Have Access To

### Already Created (Working)
- `evaluation/test_queries.json` - 10 test cases
- `evaluation/collect_responses.py` - Response collector script

### Needs Fixing
- `backend/api.py` - Syntax error on line 1, missing imports

### Next To Create
- Evaluation script using `azure-ai-evaluation`
- Results visualization/dashboard

## Quick Action Items

**Priority 1 (Blocker):**
```python
# Create: scripts/fix_api_syntax.py
# Auto-fix backend/api.py line 1 and imports
# User runs: python scripts/fix_api_syntax.py
```
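One possible shape for that script is sketched below. It is a starting point under stated assumptions, not a tested fix for the real file: it assumes the script runs from the repository root, that line 1 starts with `REM ` followed by the filepath text, and that inserting imports directly after line 1 is safe (this would not hold if `backend/api.py` contains `from __future__` imports). The function names `imported_names` and `fix_source` are illustrative. The file is only compiled, never executed, and a `.bak` copy is written before anything changes.

```python
"""Hypothetical scripts/fix_api_syntax.py: repair line 1 and add missing imports."""
import ast
import shutil
from pathlib import Path

API_PATH = Path("backend/api.py")  # path as given in this note
FASTAPI_NAMES = ["Depends", "Body", "Response", "status", "Request"]


def imported_names(source: str) -> set:
    """Return every top-level name bound by an import statement."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                names.add(alias.asname or alias.name.split(".")[0])
    return names


def fix_source(source: str) -> str:
    """Return repaired source; raises SyntaxError if it still does not compile."""
    lines = source.splitlines()
    # Step 1: a leading batch-style `REM ` becomes a Python `# ` comment.
    first = lines[0].lstrip() if lines else ""
    if first.upper().startswith("REM "):
        lines[0] = "# " + first[4:]
    # Step 2: work out which required names are still missing.
    present = imported_names("\n".join(lines))
    missing = [name for name in FASTAPI_NAMES if name not in present]
    new_imports = []
    if missing:
        new_imports.append("from fastapi import " + ", ".join(missing))
    if "logging" not in present:
        new_imports.append("import logging")
    # Assumption: placing imports right after line 1 is safe for this file.
    lines[1:1] = new_imports
    fixed = "\n".join(lines) + "\n"
    # Step 3: compile without executing, so nothing is imported or run.
    compile(fixed, str(API_PATH), "exec")
    return fixed


def main() -> None:
    if not API_PATH.exists():
        print(f"{API_PATH} not found; run this from the repository root.")
        return
    original = API_PATH.read_text(encoding="utf-8")
    try:
        fixed = fix_source(original)
    except SyntaxError as exc:
        print(f"Nothing written: the file has another syntax error ({exc}).")
        return
    if fixed == original:
        print("Nothing to change.")
        return
    backup = API_PATH.with_suffix(".py.bak")
    shutil.copy2(API_PATH, backup)  # keep the original so the fix is reversible
    API_PATH.write_text(fixed, encoding="utf-8")
    print(f"Fixed {API_PATH}; original saved as {backup}.")


if __name__ == "__main__":
    main()
```

Because the original is copied to `backend/api.py.bak` first, the user can undo the change by restoring that file, which may help with the fear of breaking code. Running the script a second time should report that there is nothing to change.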

**Priority 2 (Collection):**
```bash
# After fix, run:
python evaluation/collect_responses.py
# Requires: Server running on localhost:8000
```
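Since the collector depends on the server being up, a small pre-flight check could save the user a confusing connection error. The sketch below only tests whether something accepts TCP connections on `localhost:8000`; it does not verify that the process is the Sentinel Forge API. The function name `server_is_up` is illustrative.

```python
"""Hypothetical pre-flight check to run before evaluation/collect_responses.py."""
import socket


def server_is_up(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if server_is_up():
        print("Server reachable on localhost:8000; safe to run the collector.")
    else:
        print("Nothing is listening on localhost:8000; start the server first.")
```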

**Priority 3 (Evaluation):**
```python
# Create: evaluation/run_evaluation.py
# Implements RelevanceEvaluator, CoherenceEvaluator, GroundednessEvaluator
# Scores all responses, outputs results.json
```
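A rough sketch of how `evaluation/run_evaluation.py` might be laid out follows. Several things here are assumptions to verify before relying on it: the input file `evaluation/responses.json` and its record fields (`query`, `response`, `context`) are hypothetical, since the collector's output format is not described in this note; the environment variable names are placeholders; and the evaluator constructor and keyword arguments reflect my understanding of the `azure-ai-evaluation` SDK and should be checked against the installed version. The SDK import is deferred so the scoring loop can be exercised with stand-in evaluators.

```python
"""Hypothetical evaluation/run_evaluation.py: score collected responses."""
import json
import os
from pathlib import Path
from typing import Callable, Dict, List

RESPONSES_PATH = Path("evaluation/responses.json")  # assumed collector output
RESULTS_PATH = Path("results.json")


def build_azure_evaluators(model_config: dict) -> Dict[str, Callable]:
    """Create the three evaluators; needs azure-ai-evaluation installed."""
    # Imported here so the rest of the module works without the SDK.
    from azure.ai.evaluation import (
        CoherenceEvaluator,
        GroundednessEvaluator,
        RelevanceEvaluator,
    )
    return {
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
    }


def score_responses(records: List[dict], evaluators: Dict[str, Callable]) -> List[dict]:
    """Apply every evaluator to every record and collect the raw outputs."""
    results = []
    for record in records:
        scores = {}
        for name, evaluator in evaluators.items():
            kwargs = {"query": record["query"], "response": record["response"]}
            if name == "groundedness":
                # Groundedness is judged against supporting context.
                kwargs["context"] = record.get("context", "")
            scores[name] = evaluator(**kwargs)
        results.append({"query": record["query"], "scores": scores})
    return results


def main() -> None:
    if not RESPONSES_PATH.exists():
        print(f"{RESPONSES_PATH} not found; run the response collector first.")
        return
    # Placeholder environment variable names; adjust to the actual deployment.
    model_config = {
        "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
        "api_key": os.environ["AZURE_OPENAI_API_KEY"],
        "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
    }
    records = json.loads(RESPONSES_PATH.read_text(encoding="utf-8"))
    results = score_responses(records, build_azure_evaluators(model_config))
    RESULTS_PATH.write_text(json.dumps(results, indent=2), encoding="utf-8")
    print(f"Scored {len(results)} responses; wrote {RESULTS_PATH}.")


if __name__ == "__main__":
    main()
```

Keeping `score_responses` independent of the SDK means domain-specific metrics could later be added as plain callables in the same dictionary, if that direction is chosen.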

## Architecture Notes for Context

**Sentinel Forge AI Components:**
- FastAPI backend with cognitive processing
- Tri-node agent system (Sentinel, Sora, Architect)
- Symbolic rules engine + neural processing
- Shannon entropy tracking
- Memory/reflection pools

**Evaluation Endpoint Choice:**
We're using `/api/v1/ai/chat` (not `/cog/process`) because:
- Natural language responses
- Better for Relevance/Coherence metrics
- Won't mutate cognitive state
- Consistent format

**Alternative Approach (If Preferred):**
Could evaluate BOTH endpoints with different metrics:
- Chat → Relevance, Coherence
- Cognitive → Groundedness, Rule Consistency

## Questions for You

1. **Fix Strategy:** Script-based auto-fix or interactive repair tool?
2. **Test Coverage:** 10 queries enough or expand to 50-100?
3. **Metrics:** Should we add domain-specific metrics (symbolic consistency, entropy trends)?
4. **Delivery Format:** Jupyter notebook, CLI script, or VS Code task for final eval?
5. **Hybrid Eval:** Worth evaluating both `/ai/chat` AND `/cog/process` separately?

Your call on best path forward. User needs momentum.