diff --git a/fern/docs.yml b/fern/docs.yml
index 62bf5b256..3f6b946bc 100644
--- a/fern/docs.yml
+++ b/fern/docs.yml
@@ -222,16 +222,6 @@ navigation:
- page: Migrating from Trieve
path: knowledge-base/migrating-from-trieve.mdx
icon: fa-light fa-triangle-exclamation
- - section: Structured outputs
- icon: fa-light fa-database
- path: assistants/structured-outputs.mdx
- contents:
- - page: Quickstart
- path: assistants/structured-outputs-quickstart.mdx
- icon: fa-light fa-rocket
- - page: Examples
- path: assistants/structured-outputs-examples.mdx
- icon: fa-light fa-code
- page: Custom keywords
path: customization/custom-keywords.mdx
icon: fa-light fa-bullseye
@@ -280,6 +270,31 @@ navigation:
path: assistants/examples/multilingual-agent.mdx
icon: fa-light fa-globe
+ - section: Observability
+ contents:
+ - section: Evals
+ icon: fa-light fa-clipboard-check
+ contents:
+ - page: Quickstart
+ path: observability/evals-quickstart.mdx
+ icon: fa-light fa-clipboard-check
+ - page: Advanced
+ path: observability/evals-advanced.mdx
+ icon: fa-light fa-clipboard-check
+ - page: Boards
+ path: observability/boards-quickstart.mdx
+ icon: fa-light fa-chart-line
+ - section: Structured outputs
+ icon: fa-light fa-database
+ path: assistants/structured-outputs.mdx
+ contents:
+ - page: Quickstart
+ path: assistants/structured-outputs-quickstart.mdx
+ icon: fa-light fa-rocket
+ - page: Examples
+ path: assistants/structured-outputs-examples.mdx
+ icon: fa-light fa-code
+
- section: Squads
contents:
- page: Quickstart
@@ -331,12 +346,6 @@ navigation:
path: test/voice-testing.mdx
icon: fa-light fa-volume-high
- - section: Observability
- contents:
- - page: Boards
- path: observability/boards-quickstart.mdx
- icon: fa-light fa-chart-line
-
- section: Phone numbers
contents:
- page: Free Vapi number
diff --git a/fern/observability/evals-advanced.mdx b/fern/observability/evals-advanced.mdx
new file mode 100644
index 000000000..e3fbdcb09
--- /dev/null
+++ b/fern/observability/evals-advanced.mdx
@@ -0,0 +1,1127 @@
+---
+title: Advanced eval testing
+subtitle: Master testing strategies and best practices for production AI agents
+slug: observability/evals-advanced
+---
+
+## Overview
+
+This guide covers advanced evaluation strategies, testing patterns, and best practices for building robust test suites that ensure your AI agents work reliably in production.
+
+**You'll learn:**
+
+- Strategic testing approaches (smoke, regression, edge case)
+- Testing patterns for different scenarios
+- Performance optimization techniques
+- Maintenance and CI/CD integration strategies
+- Advanced troubleshooting methods
+
+## Testing strategies
+
+### Smoke tests
+
+Quick validation that core functionality works. Run these first to catch obvious issues.
+
+**Purpose:** Verify assistant responds and basic conversation flow works.
+
+```json
+{
+ "name": "Smoke Test - Basic Response",
+ "description": "Verify assistant responds to simple greeting",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".+"
+ }
+ }
+ ]
+}
+```
+
+**Characteristics:**
+
+- Minimal validation (just check for any response)
+- Fast execution (1-2 turns)
+- Run before detailed tests
+- Exit early if smoke tests fail
+
+**When to use:**
+
+- Before running expensive test suites (see the sketch below)
+- After deploying configuration changes
+- As health checks in monitoring
+- Quick validation during development
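+
+A minimal smoke-first runner sketch, assuming hypothetical `$SMOKE_EVAL_ID` and `$ASSISTANT_ID` variables and the run endpoints covered later in this guide:
+
+```bash
+# Kick off the smoke test and capture the run ID
+run_id=$(curl -s -X POST "https://api.vapi.ai/eval/run" \
+  -H "Authorization: Bearer $VAPI_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d "{\"evalId\": \"$SMOKE_EVAL_ID\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}" | jq -r '.id')
+
+# Poll until the run ends (runs start in "queued" status)
+until [ "$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+  -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.status')" = "ended" ]; do
+  sleep 2
+done
+
+# Exit before the expensive suite if the smoke test failed
+status=$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+  -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.results[0].status')
+[ "$status" = "pass" ] || { echo "Smoke test failed, skipping full suite"; exit 1; }
+```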
+
+### Regression tests
+
+Ensure fixes and updates don't break existing functionality.
+
+**Purpose:** Validate that known issues stay fixed and features keep working.
+
+
+**Dashboard:**
+
+ 1. Create evaluation named with "Regression: " prefix
+ 2. Include issue ticket number in description
+ 3. Add exact scenario that previously failed
+ 4. Validate the fix still works
+
+ Example:
+ - Name: "Regression: Date Parsing Bug #1234"
+ - Description: "Verify dates like '3/15' parse correctly after bug fix"
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Regression: Date Parsing Bug #1234",
+ "description": "Verify dates like 3/15 are parsed correctly after fix",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Book me for 3/15"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-03-15"
+ }
+ }]
+ }
+ }
+ ]
+ }'
+```
+
+
+
+**Best practices:**
+
+- Name tests after bugs they prevent
+- Include ticket/issue numbers in descriptions
+- Add regression tests when fixing bugs
+- Run full regression suite before major releases (see the sketch below)
+- Archive tests only when features are removed
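+
+To collect the regression suite before a release, one option is to filter evals by the "Regression: " naming prefix; a sketch assuming the list response shape shown in the quickstart:
+
+```bash
+# Gather the IDs of all evals whose name starts with "Regression: "
+regression_ids=$(curl -s "https://api.vapi.ai/eval" \
+  -H "Authorization: Bearer $VAPI_API_KEY" |
+  jq -r '.results[] | select(.name | startswith("Regression: ")) | .id')
+
+echo "Regression suite: $regression_ids"
+```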
+
+### Edge case testing
+
+Test boundary conditions and unusual inputs.
+
+**Common edge cases to test:**
+
+
+**Empty input:**
+
+ ```json
+ {
+ "messages": [
+ {"role": "user", "content": ""},
+ {"role": "assistant", "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "PASS if response asks for clarification politely. Output: pass or fail"
+ }]
+ }
+ }}
+ ]
+ }
+ ```
+
+
+**Very long input:**
+
+ ```json
+ {
+ "messages": [
+ {
+ "role": "user",
+ "content": "I need help with... (repeat 1000 times)"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".+"
+ },
+ "continuePlan": {
+ "exitOnFailureEnabled": true
+ }
+ }
+ ]
+ }
+ ```
+
+
+**Special characters:**
+
+ ```json
+ {
+ "messages": [
+ {
+ "role": "user",
+ "content": "My name is François José 王明"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "PASS if response correctly acknowledges the name with special characters. Output: pass or fail"
+ }]
+ }
+ }
+ }
+ ]
+ }
+ ```
+
+
+**Gibberish input:**
+
+ ```json
+ {
+ "messages": [
+ {
+ "role": "user",
+ "content": "asdfghjkl"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "PASS if response asks for clarification without being rude. Output: pass or fail"
+ }]
+ }
+ }
+ }
+ ]
+ }
+ ```
+
+
+**Topic change mid-conversation:**
+
+ ```json
+ {
+ "messages": [
+ {"role": "user", "content": "Book appointment"},
+ {"role": "assistant", "judgePlan": {"type": "regex", "content": ".*appointment.*"}},
+ {"role": "user", "content": "Actually, cancel that. I need tech support."},
+ {"role": "assistant", "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "PASS if response pivots to tech support without confusion. Output: pass or fail"
+ }]
+ }
+ }}
+ ]
+ }
+ ```
+
+
+
+**Edge case categories:**
+
+- **Input boundaries:** Empty, maximum length, special characters
+- **Data formats:** Invalid dates, malformed phone numbers, unusual names
+- **Conversation patterns:** Interruptions, topic changes, contradictions
+- **Timing:** Very fast responses, long pauses, timeout scenarios
+
+## Testing patterns
+
+### Happy path testing
+
+Validate ideal user journeys where everything works correctly.
+
+**Structure:**
+
+1. User provides clear, complete information
+2. Assistant responds appropriately
+3. Tools execute successfully
+4. Conversation completes with desired outcome
+
+**Example: Perfect booking flow**
+
+```json
+{
+ "name": "Happy Path - Complete Booking",
+ "description": "User provides all info clearly, booking succeeds",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I'd like to book an appointment"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response asks for date/time preferences. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ },
+ {
+ "role": "user",
+ "content": "Next Monday at 2pm please"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [
+ {
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-01-20",
+ "time": "14:00"
+ }
+ }
+ ]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*(confirmed|booked).*APT-12345.*"
+ }
+ }
+ ]
+}
+```
+
+**Happy path coverage:**
+
+- Test primary user goals
+- Verify expected tool executions
+- Validate success messages
+- Confirm data accuracy
+
+### Error handling testing
+
+Test how your assistant handles failures gracefully.
+
+**Tool failure scenarios:**
+
+```json
+{
+ "name": "Error Handling - Booking Unavailable",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Book Monday at 2pm"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{ "name": "bookAppointment" }]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"status\": \"error\", \"message\": \"Time slot unavailable\"}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges the time is unavailable\n- Response offers alternatives or asks for different time\n- Tone remains helpful (not apologetic to excess)\n\nFAIL if:\n- Response ignores the error\n- Response doesn't offer next steps\n- Tone is frustrated or rude\n\nOutput: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+**Invalid input handling:**
+
+```json
+{
+ "name": "Error Handling - Invalid Date Format",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Book me for the 45th of Octember"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response politely asks for valid date without mocking user. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+**API timeout simulation:**
+
+```json
+{
+ "name": "Error Handling - Tool Timeout",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Check my order status"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{ "name": "checkOrderStatus" }]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"status\": \"error\", \"message\": \"Request timeout\"}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response acknowledges technical issue and suggests retry or alternative. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+**Error categories to test:**
+
+- Tool/API failures
+- Invalid user input
+- Timeout scenarios
+- Rate limit errors
+- Partial data availability
+- Permission/authorization issues
+
+### Boundary testing
+
+Test limits and thresholds of your system.
+
+**Maximum conversation length:**
+
+```json
+{
+ "name": "Boundary - Max Turns",
+ "description": "Test assistant handles long conversations (20+ turns)",
+ "messages": [
+ { "role": "user", "content": "Question 1" },
+ { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
+ { "role": "user", "content": "Question 2" },
+ { "role": "assistant", "judgePlan": { "type": "regex", "content": ".+" } },
+ // ... repeat up to boundary ...
+ { "role": "user", "content": "Final question" },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response is coherent and maintains context from earlier conversation. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
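+
+Writing 20+ turns by hand is tedious; here is a sketch that generates the alternating turns with `jq` (the turn count and eval name are illustrative):
+
+```bash
+# Build 20 user/assistant turn pairs, each with a minimal regex judge
+messages=$(jq -n '[range(20) as $i |
+  ({role: "user", content: "Question \($i + 1)"},
+   {role: "assistant", judgePlan: {type: "regex", content: ".+"}})]')
+
+curl -X POST "https://api.vapi.ai/eval" \
+  -H "Authorization: Bearer $VAPI_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d "$(jq -n --argjson msgs "$messages" \
+    '{name: "Boundary - Max Turns", type: "chat.mockConversation", messages: $msgs}')"
+```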
+
+**Rate limits:**
+
+Test behavior at or near rate limits:
+
+- Multiple tool calls in succession
+- Rapid user input
+- Large data processing requests
+
+**Data size boundaries:**
+
+```json
+{
+ "name": "Boundary - Large Data Response",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Get all customer records"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{ "name": "getAllCustomers" }]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"customers\": [/* 1000 customer objects */]}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response summarizes data rather than reading full list. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+## Best practices
+
+### Evaluation design principles
+
+
+
+ Each evaluation should test one specific behavior or feature.
+
+ ✅ **Good:** "Test greeting acknowledgment"
+
+ ❌ **Bad:** "Test greeting, booking, and error handling"
+
+
+{" "}
+
+
+  Use descriptive names that explain what's being tested.
+
+  ✅ **Good:** "Booking - Validates Date Format"
+
+  ❌ **Bad:** "Test 1" or "Eval ABC"
+
+
+{" "}
+
+
+ Document why the test exists and what it validates. Include context: business
+ requirement, bug ticket, or feature spec.
+
+
+
+ Keep evaluations focused (5-10 turns max).
+
+ Split complex scenarios into multiple targeted tests.
+
+
+
+### Validation approach selection
+
+Choose the right judge type for each scenario:
+
+
+
+ **Ideal for:**
+ - Critical business data (confirmation IDs, totals, dates)
+ - Tool call validation with specific arguments
+ - Compliance-required exact wording
+ - Success/failure status messages
+
+ **Example:** Booking confirmation ID must be exact
+ ```json
+ {
+ "judgePlan": {
+ "type": "exact",
+ "content": "Your confirmation ID is APT-12345"
+ }
+ }
+ ```
+
+
+
+ **Ideal for:**
+ - Responses with variable data (names, dates, IDs)
+ - Pattern matching (email formats, phone numbers)
+ - Flexible phrasing with specific keywords
+ - Multiple acceptable phrasings
+
+ **Example:** Confirmation with variable ID format
+ ```json
+ {
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*confirmation (ID|number|code): [A-Z]{3}-[0-9]{5}.*"
+ }
+ }
+ ```
+
+
+
+ **Ideal for:**
+ - Semantic meaning validation
+ - Tone and sentiment evaluation
+ - Contextual appropriateness
+ - Complex multi-factor criteria
+ - Helpfulness assessment
+
+ **Example:** Validate polite rejection
+ ```json
+ {
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "PASS if response politely declines without being rude and offers alternative. Output: pass or fail"
+ }]
+ }
+ }
+ }
+ ```
+
+
+
+**Decision tree:**
+
+```
+Is the exact wording critical?
+├─ Yes → Use Exact Match
+└─ No → Does it follow a pattern?
+ ├─ Yes → Use Regex
+ └─ No → Does it require understanding context/tone?
+ ├─ Yes → Use AI Judge
+ └─ No → Use Regex with flexible pattern
+```
+
+### Performance optimization
+
+**Minimize test execution time:**
+
+1. **Use exit-on-failure for early steps:**
+
+```json
+{
+ "continuePlan": {
+ "exitOnFailureEnabled": true
+ }
+}
+```
+
+Stops test immediately when critical validation fails.
+
+2. **Run critical tests first:**
+ Organize test suites so smoke tests and critical validations run before expensive tests.
+
+3. **Keep conversations focused:**
+ Aim for 5-10 turns maximum. Split longer scenarios into multiple tests.
+
+4. **Batch related tests:**
+   Group similar evaluations and run them together rather than one at a time.
+
+5. **Optimize AI judge prompts:**
+
+- Use faster models (gpt-3.5-turbo) for simple validations
+- Use advanced models (gpt-4o) only for complex semantic evaluation
+- Keep prompts concise and specific
+
+**Performance comparison:**
+
+| Judge Type | Speed | Cost | Use Case |
+| ------------ | ----------- | ---------- | ---------------------- |
+| Exact | ⚡⚡⚡ Fast | $ Low | Critical exact matches |
+| Regex | ⚡⚡ Fast | $ Low | Pattern matching |
+| AI (GPT-3.5) | ⚡ Medium | $$ Medium | Simple semantic checks |
+| AI (GPT-4) | ⏱ Slower | $$$ Higher | Complex evaluation |
+
+### Maintenance strategies
+
+**Version control your evaluations:**
+
+Store evaluation definitions alongside your codebase:
+
+```bash
+/tests
+ /evals
+ /greeting
+ - basic-greeting.json
+ - multilingual-greeting.json
+ /booking
+ - happy-path-booking.json
+ - error-handling-booking.json
+ /regression
+ - date-parsing-bug-1234.json
+```
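+
+With definitions stored as JSON files like this, registering them with Vapi can be scripted; a minimal sketch, assuming one eval definition per file:
+
+```bash
+# Create every versioned eval definition via the API
+find tests/evals -name '*.json' | while read -r eval_file; do
+  curl -s -X POST "https://api.vapi.ai/eval" \
+    -H "Authorization: Bearer $VAPI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d @"$eval_file" | jq -r '"Created \(.name) (\(.id))"'
+done
+```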
+
+**Regular review cycle:**
+
+
+
+ Investigate all failures. Update tests if expectations changed, or fix assistant if behavior regressed.
+
+
+{" "}
+
+
+  Review test suite completeness:
+
+  - All critical user flows covered?
+  - New features have tests?
+  - Deprecated features removed?
+
+
+
+ - Remove duplicate tests
+ - Update outdated validation criteria
+ - Optimize slow-running tests
+ - Document test rationale
+
+
+
+**Update tests when:**
+
+- Assistant prompts or behavior change intentionally
+- New features are added
+- Bugs are fixed (add regression tests)
+- User feedback reveals edge cases
+- Business requirements evolve
+
+**Deprecation strategy:**
+
+Don't delete tests immediately when features change:
+
+1. Mark test as "deprecated" in description
+2. Update expected behavior to match new requirements
+3. Run for one release cycle to verify
+4. Archive after confirmed stable
+
+### CI/CD integration
+
+Automate evaluation runs in your deployment pipeline.
+
+**Basic workflow:**
+
+```yaml
+# .github/workflows/test-assistant.yml
+name: Test Assistant Changes
+
+on:
+ pull_request:
+ paths:
+ - "assistants/**"
+ - "prompts/**"
+
+jobs:
+ run-evals:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Run critical evals
+ run: |
+ # Run smoke tests
+ curl -X POST "https://api.vapi.ai/eval/run" \
+ -H "Authorization: Bearer ${{ secrets.VAPI_API_KEY }}" \
+ -d '{"evalId": "$SMOKE_TEST_ID", "target": {...}}'
+
+ # Check results
+ # Fail build if tests fail
+```
+
+**Advanced patterns:**
+
+
+
+ Run full eval suite against staging before production deploy:
+
+  ```bash
+  # Run all evals against the staging assistant
+  for eval_id in $EVAL_IDS; do
+    run_id=$(curl -s -X POST "https://api.vapi.ai/eval/run" \
+      -H "Authorization: Bearer $VAPI_API_KEY" \
+      -H "Content-Type: application/json" \
+      -d "{\"evalId\": \"$eval_id\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$STAGING_ASSISTANT_ID\"}}" | jq -r '.id')
+
+    # Runs are async (they start "queued"), so poll until the run ends
+    until [ "$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+      -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.status')" = "ended" ]; do
+      sleep 2
+    done
+
+    # Check if passed
+    run_result=$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+      -H "Authorization: Bearer $VAPI_API_KEY")
+    status=$(echo "$run_result" | jq -r '.results[0].status')
+    if [ "$status" != "pass" ]; then
+      echo "Eval $eval_id failed!"
+      exit 1
+    fi
+  done
+  ```
+
+
+
+ Run multiple evals concurrently to speed up CI:
+
+ ```bash
+ # Run evals in parallel
+ for eval_id in $EVAL_IDS; do
+ (curl -X POST "https://api.vapi.ai/eval/run" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -d "{\"evalId\": \"$eval_id\", ...}" > results_$eval_id.json) &
+ done
+ wait
+
+  # Aggregate results (each file holds a run creation response)
+  for result_file in results_*.json; do
+    run_id=$(jq -r '.id' "$result_file")
+    echo "Run $run_id created - poll it until it ends, then check .results[0].status"
+  done
+ ```
+
+
+{" "}
+
+
+  Block deployment if test pass rate falls below threshold:
+
+  ```bash
+  # Calculate pass rate
+  total_tests=10
+  passed_tests=$(grep -c '"status":"pass"' all_results.json)
+  pass_rate=$((passed_tests * 100 / total_tests))
+  if [ $pass_rate -lt 95 ]; then
+    echo "Pass rate $pass_rate% below threshold 95%"
+    exit 1
+  fi
+  ```
+
+
+
+ Run full regression suite nightly:
+
+ ```yaml
+ # .github/workflows/nightly-regression.yml
+ on:
+ schedule:
+ - cron: '0 2 * * *' # 2 AM daily
+
+ jobs:
+ regression-suite:
+ runs-on: ubuntu-latest
+ steps:
+ - name: Run regression tests
+ run: ./scripts/run-regression-suite.sh
+
+ - name: Notify on failures
+ if: failure()
+ run: |
+ # Send Slack notification
+ # Create GitHub issue
+ ```
+
+
+
+## Advanced troubleshooting
+
+### Debugging failed evaluations
+
+**Step-by-step investigation:**
+
+
+
+ Check `judge.failureReason` for specific details:
+
+ ```json
+ {
+ "judge": {
+ "status": "fail",
+ "failureReason": "Expected exact match: 'confirmed' but got: 'booked'"
+ }
+ }
+ ```
+
+ This tells you exactly what differed.
+
+
+{" "}
+
+
+  Look at `results[0].messages` to see the complete interaction:
+
+  - What did the user actually say?
+  - How did the assistant respond?
+  - Were tool calls made correctly?
+  - Did tool responses contain expected data?
+
+
+{" "}
+
+
+  For exact match failures:
+
+  - Check for extra spaces or newlines
+  - Verify punctuation matches exactly
+  - Look for case sensitivity issues
+
+  For tool call failures:
+
+  - Verify argument types (string vs number)
+  - Check for extra/missing arguments
+  - Validate argument values
+
+
+{" "}
+
+
+  For regex:
+
+  - Test pattern with online validators
+  - Try pattern against actual response
+  - Check for escaped special characters
+
+  For AI judge:
+
+  - Test prompt with known good/bad examples
+  - Verify binary pass/fail criteria
+  - Check for ambiguous requirements
+
+
+
+ Test the assistant interactively:
+ - Use same input as eval
+ - Compare live behavior to eval results
+ - Check if issue is with assistant or eval validation
+
+
+
+### Common failure patterns
+
+
+
+ **Problem:** Expected "Hello, how can I help?" but got "Hello, how may I help?"
+
+ **Solutions:**
+ - Switch to regex for flexibility: `Hello, how (can|may) I help\?`
+ - Use AI judge for semantic matching
+ - Update expected value if new phrasing is acceptable
+
+
+{" "}
+
+
+  **Problem:** Arguments have different types or extra fields
+
+  **Solutions:**
+  - Check argument types: `"14:00"` (string) vs `14` (number)
+  - Use partial matching: omit `arguments` to match only function name
+  - Normalize data formats in tool implementation
+
+
+{" "}
+
+
+  **Problem:** Same response sometimes passes, sometimes fails
+
+  **Solutions:**
+  - Make criteria more specific and binary
+  - Add explicit examples of pass/fail cases in prompt
+  - Use temperature=0 for deterministic evaluation
+  - Switch to regex if pattern-based validation works
+
+
+{" "}
+
+
+  **Problem:** Eval status stuck in "running"
+
+  **Solutions:**
+  - Check assistant configuration for errors
+  - Verify tool endpoints are accessible
+  - Reduce conversation complexity
+  - Check for infinite loops in assistant logic
+
+
+
+ **Problem:** Pattern seems correct but fails
+
+ **Solutions:**
+ - Escape special regex characters: `.`, `?`, `*`, `+`, `(`, `)`
+ - Use `.*` for flexible matching around keywords
+ - Test pattern with online regex validators
+ - Check for hidden characters or unicode
+
+
+
+### Debugging tools and techniques
+
+**Use structured logging:**
+
+Track eval executions systematically:
+
+```json
+{
+ "timestamp": "2024-01-15T10:30:00Z",
+ "evalId": "eval-123",
+ "evalName": "Booking Flow Test",
+ "runId": "run-456",
+ "target": "assistant-789",
+ "result": "fail",
+ "failedStep": 3,
+ "failureReason": "Tool call mismatch",
+ "actualBehavior": "Called cancelAppointment instead of bookAppointment"
+}
+```
+
+**Isolate variables:**
+
+When tests fail inconsistently:
+
+1. Run same eval multiple times
+2. Test with different assistants (A/B comparison)
+3. Simplify conversation to minimum reproduction
+4. Check for race conditions or timing issues
+
+**Progressive validation:**
+
+Build up complexity gradually:
+
+```json
+// Step 1: Verify basic response
+{"judgePlan": {"type": "regex", "content": ".+"}}
+
+// Step 2: Verify contains keyword
+{"judgePlan": {"type": "regex", "content": ".*appointment.*"}}
+
+// Step 3: Verify exact format
+{"judgePlan": {"type": "exact", "content": "Appointment confirmed"}}
+```
+
+## Troubleshooting reference
+
+### Status and error codes
+
+| Status | Ended Reason | Meaning | Action |
+| ------- | --------------------- | -------------------------------- | ------------------------------------- |
+| ended | mockConversation.done | ✅ Test completed normally | Check results[0].status for pass/fail |
+| ended | assistant-error | ❌ Assistant configuration error | Fix assistant setup, re-run |
+| ended | pipeline-error-\* | ❌ Provider API error | Check provider status, API keys |
+| running | - | ⏳ Test in progress | Wait or check for timeout |
+| queued | - | ⏳ Test waiting to start | Normal, should start soon |
+
+### Quick diagnostic checklist
+
+
+  **When an eval fails, check:**
+
+  - [ ] `endedReason` is "mockConversation.done"
+  - [ ] Assistant works correctly in manual testing
+  - [ ] Tool endpoints are accessible
+  - [ ] Validation criteria match actual behavior
+  - [ ] Regex patterns are properly escaped
+  - [ ] AI judge prompts are specific and binary
+  - [ ] Arguments match expected types (string vs number)
+  - [ ] API keys and permissions are valid
+  - [ ] No rate limits or quota issues
+
+
+### Getting help
+
+**Include these details when reporting issues:**
+
+- Eval ID and run ID
+- Full `endedReason` value
+- Conversation transcript (`results[0].messages`)
+- Expected vs actual behavior
+- Assistant/squad configuration
+- Provider and model being used
+
+**Resources:**
+
+- [Eval API Reference](/api-reference/eval/create)
+- [Discord Community](https://discord.gg/pUFNcf2WmH) - #testing channel
+- [Support](mailto:support@vapi.ai) - Include eval run ID
+
+## Next steps
+
+- [Evals quickstart](/observability/evals-quickstart) - Return to the quickstart guide for basic evaluation setup
+- Assistants - Learn to build and configure assistants for testing
+- Custom tools - Build custom tools and test their behavior
+- [Eval API Reference](/api-reference/eval/create) - Complete API documentation for evaluations
+
+## Summary
+
+
+ **Key takeaways for advanced eval testing:**
+
+ **Testing strategy:**
+
+ - Use smoke tests before comprehensive suites
+ - Build regression tests when fixing bugs
+ - Cover edge cases systematically
+
+ **Validation selection:**
+
+ - Exact match for critical data
+ - Regex for pattern matching
+ - AI judge for semantic evaluation
+
+ **Performance:**
+
+ - Exit early on critical failures
+ - Keep conversations focused (5-10 turns)
+ - Batch related tests together
+
+ **Maintenance:**
+
+ - Version control evaluations
+ - Review failures promptly
+ - Update tests with features
+ - Document test purpose clearly
+
+ **CI/CD:**
+
+ - Automate critical tests in pipelines
+ - Use staging for full suite validation
+ - Set quality gate thresholds
+ - Run regression suites regularly
+
+
diff --git a/fern/observability/evals-quickstart.mdx b/fern/observability/evals-quickstart.mdx
new file mode 100644
index 000000000..9b5e20de7
--- /dev/null
+++ b/fern/observability/evals-quickstart.mdx
@@ -0,0 +1,1412 @@
+---
+title: Evals quickstart
+subtitle: Get started with AI agent testing in 5 minutes
+slug: observability/evals-quickstart
+---
+
+## Overview
+
+This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you'll create mock conversations, define expected behaviors, and validate your agents work correctly before production.
+
+### What are Evals?
+
+Evals is Vapi's AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:
+
+1. **Creating mock conversations** - Define user messages and expected assistant responses
+2. **Validating behavior** - Use exact match, regex patterns, or AI-powered judging
+3. **Testing tool calls** - Verify function calls with specific arguments
+4. **Running automated tests** - Execute tests and receive detailed pass/fail results
+5. **Debugging failures** - Review full conversation transcripts with evaluation details
+
+### When are Evals useful?
+
+Evals help you maintain quality and catch issues early:
+
+- **Pre-deployment testing** - Validate new assistant configurations before going live
+- **Regression testing** - Ensure prompt or tool changes don't break existing behaviors
+- **Conversation flow validation** - Test multi-turn interactions and complex scenarios
+- **Tool calling verification** - Validate function calls with correct arguments
+- **Squad handoff testing** - Ensure smooth transitions between squad members
+- **CI/CD integration** - Automate quality gates in your deployment pipeline
+
+### What you'll build
+
+An evaluation suite for an appointment booking assistant that tests:
+
+- Greeting and initial response validation
+- Tool call execution with specific arguments
+- Response pattern matching with regex
+- Semantic validation using AI judges
+- Multi-turn conversation flows
+
+## Prerequisites
+
+
+1. Sign up at [dashboard.vapi.ai](https://dashboard.vapi.ai)
+2. Get your API key from **API Keys** in the sidebar
+
+
+
+
+ You'll also need an existing assistant or squad to test. You can create one in
+ the Dashboard or use the API.
+
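+The cURL examples in this guide assume your API key is exported in your shell:
+
+```bash
+# Replace with your actual API key from the dashboard
+export VAPI_API_KEY="your-api-key"
+```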
+
+## Step 1: Create your first evaluation
+
+Define a mock conversation to test your assistant's greeting behavior.
+
+
+
+
+**Dashboard:**
+
+ 1. Log in to [dashboard.vapi.ai](https://dashboard.vapi.ai)
+ 2. Click on **Evals** in the left sidebar (under Observability)
+ 3. Click **Create Evaluation**
+
+
+
+ 1. **Name**: Enter "Greeting Test"
+ 2. **Description**: Add "Verify assistant greets users appropriately"
+ 3. **Type**: Automatically set to "chat.mockConversation"
+
+
+
+ 1. Click **Add Message**
+ 2. Select **User** message type
+ 3. Enter content: "Hello"
+ 4. Click **Add Message** again
+ 5. Select **Assistant** message type
+ 6. Click **Enable Evaluation** toggle
+ 7. Select **Exact Match** as judge type
+ 8. Enter expected content: "Hello! How can I help you today?"
+ 9. Click **Save Evaluation**
+
+
+
+
+ Your evaluation is now saved. You can run it against any assistant or squad.
+
+
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Greeting Test",
+ "description": "Verify assistant greets users appropriately",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "content": "Hello! How can I help you today?"
+ }
+ }
+ ]
+ }'
+```
+
+**Response:**
+
+```json
+{
+ "id": "550e8400-e29b-41d4-a716-446655440000",
+ "orgId": "org-123",
+ "type": "chat.mockConversation",
+ "name": "Greeting Test",
+ "description": "Verify assistant greets users appropriately",
+ "messages": [...],
+ "createdAt": "2024-01-15T09:30:00Z",
+ "updatedAt": "2024-01-15T09:30:00Z"
+}
+```
+
+Save the returned `id` - you'll need it to run the evaluation.
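+
+If you're scripting this, you can capture the ID directly with `jq` (assuming the request body above is saved as a hypothetical `greeting-test.json`):
+
+```bash
+# Capture the new eval's ID for later runs
+EVAL_ID=$(curl -s -X POST "https://api.vapi.ai/eval" \
+  -H "Authorization: Bearer $VAPI_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d @greeting-test.json | jq -r '.id')
+```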
+
+For complete API details, see [Create Eval](/api-reference/eval/create).
+
+
+
+
+
+ **Message structure:** Each conversation turn has a `role` (user, assistant,
+ system, or tool). Assistant messages with `judgePlan` define what to validate.
+
+
+## Step 2: Run your evaluation
+
+Execute the evaluation against your assistant or squad.
+
+
+
+
+**Dashboard:**
+
+ 1. Navigate to **Evals** in the sidebar
+ 2. Click on "Greeting Test" from your evaluations list
+
+
+
+ 1. In the evaluation detail page, find the **Run Test** section
+ 2. Select **Assistant** or **Squad** as the target type
+ 3. Choose your assistant/squad from the dropdown
+ 4. Click **Run Evaluation**
+ 5. Watch real-time progress as the test executes
+
+
+
+ Results appear automatically when the test completes:
+ - ✅ **Green checkmark** indicates evaluation passed
+ - ❌ **Red X** indicates evaluation failed
+ - Click **View Details** to see full conversation transcript
+
+
+
+
+
+**API:**
+
+**Create an eval run:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval/run" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "evalId": "550e8400-e29b-41d4-a716-446655440000",
+ "target": {
+ "type": "assistant",
+ "assistantId": "your-assistant-id"
+ }
+ }'
+```
+
+**Response:**
+
+```json
+{
+ "id": "eval-run-123",
+ "evalId": "550e8400-e29b-41d4-a716-446655440000",
+ "orgId": "org-123",
+ "status": "queued",
+ "createdAt": "2024-01-15T09:35:00Z",
+ "updatedAt": "2024-01-15T09:35:00Z"
+}
+```
+
+**Check results:**
+
+```bash
+curl -X GET "https://api.vapi.ai/eval/run/eval-run-123" \
+ -H "Authorization: Bearer $VAPI_API_KEY"
+```
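+
+Runs start in "queued" status, so in scripts you'll typically poll until the run ends; a small sketch using `jq`:
+
+```bash
+# Poll every 2 seconds until the run reaches a terminal status
+until [ "$(curl -s "https://api.vapi.ai/eval/run/eval-run-123" \
+  -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.status')" = "ended" ]; do
+  sleep 2
+done
+```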
+
+For complete API details, see [Create Eval Run](/api-reference/eval/run) and [Get Eval Run](/api-reference/eval/get-run).
+
+
+
+
+
+ You can also run evaluations with transient assistant or squad configurations
+ by providing `assistant` or `squad` objects instead of IDs in the target.
+
+
+## Step 3: Understand test results
+
+Learn to interpret evaluation results and identify issues.
+
+### Successful evaluation
+
+When all checks pass, you'll see:
+
+```json
+{
+ "id": "eval-run-123",
+ "evalId": "550e8400-e29b-41d4-a716-446655440000",
+ "status": "ended",
+ "endedReason": "mockConversation.done",
+ "results": [
+ {
+ "status": "pass",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "content": "Hello! How can I help you today?",
+ "judge": {
+ "status": "pass"
+ }
+ }
+ ]
+ }
+ ]
+}
+```
+
+**Pass criteria:**
+
+- `status` is "ended"
+- `endedReason` is "mockConversation.done"
+- `results[0].status` is "pass"
+- All `judge.status` values are "pass"
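+
+These checks translate directly into a `jq` expression you can use in scripts; a sketch, assuming the run JSON is saved as `run.json`:
+
+```bash
+# Exit non-zero unless the run completed normally and every judge passed
+jq -e '.status == "ended"
+  and .endedReason == "mockConversation.done"
+  and (.results[0].status == "pass")
+  and ([.results[0].messages[] | select(.judge) | .judge.status == "pass"] | all)' run.json
+```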
+
+### Failed evaluation
+
+When validation fails, you'll see details:
+
+```json
+{
+ "status": "ended",
+ "endedReason": "mockConversation.done",
+ "results": [
+ {
+ "status": "fail",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "content": "Hi there! What can I do for you?",
+ "judge": {
+ "status": "fail",
+ "failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
+ }
+ }
+ ]
+ }
+ ]
+}
+```
+
+**Failure indicators:**
+
+- `results[0].status` is "fail"
+- `judge.status` is "fail"
+- `judge.failureReason` explains why validation failed
+
+
+ If `endedReason` is not "mockConversation.done", the test encountered an error
+ (like "assistant-error" or "pipeline-error-openai-llm-failed"). Check your
+ assistant configuration.
+
+
+## Step 4: Test tool/function calls
+
+Validate that your assistant calls functions with correct arguments.
+
+### Basic tool call validation
+
+Test appointment booking with exact argument matching:
+
+
+**Dashboard:**
+
+ 1. Create new evaluation: "Appointment Booking Test"
+ 2. Add user message: "Book me an appointment for next Monday at 2pm"
+ 3. Add assistant message with evaluation enabled
+ 4. Select **Exact Match** judge type
+ 5. Click **Add Tool Call**
+ 6. Enter function name: "bookAppointment"
+ 7. Add arguments:
+ - `date`: "2025-01-20"
+ - `time`: "14:00"
+ 8. Add tool response message:
+ - Type: **Tool**
+ - Content: `{"status": "success", "confirmationId": "APT-12345"}`
+ 9. Add final assistant message to verify confirmation
+ 10. Save evaluation
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Appointment Booking Test",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Book me an appointment for next Monday at 2pm"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-01-20",
+ "time": "14:00"
+ }
+ }]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*confirmed.*APT-12345.*"
+ }
+ }
+ ]
+ }'
+```
+
+For API details, see [Create Eval](/api-reference/eval/create).
+
+
+
+
+### Tool call validation modes
+
+**Exact match - Full validation:**
+
+```json
+{
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [
+ {
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-01-20",
+ "time": "14:00"
+ }
+ }
+ ]
+ }
+}
+```
+
+Validates both function name AND all argument values exactly.
+
+**Partial match - Name only:**
+
+```json
+{
+ "judgePlan": {
+ "type": "regex",
+ "toolCalls": [
+ {
+ "name": "bookAppointment"
+ }
+ ]
+ }
+}
+```
+
+Validates only that the function was called (arguments can vary).
+
+**Multiple tool calls:**
+
+```json
+{
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [
+ {
+ "name": "checkAvailability",
+ "arguments": { "date": "2025-01-20" }
+ },
+ {
+ "name": "bookAppointment",
+ "arguments": { "date": "2025-01-20", "time": "14:00" }
+ }
+ ]
+ }
+}
+```
+
+Validates multiple function calls in sequence.
+
+
+ Tool calls are validated in the order they're defined. Use `type: "exact"` for
+ strict validation or `type: "regex"` for flexible validation.
+
+
+## Step 5: Use regex for flexible validation
+
+When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.
+
+### Common regex patterns
+
+**Greeting variations:**
+
+```json
+{
+ "judgePlan": {
+ "type": "regex",
+ "content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
+ }
+}
+```
+
+Matches: "Hello, I can help...", "Hi I'll help...", "Hey let me help..."
+
+**Responses with variables:**
+
+```json
+{
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
+ }
+}
+```
+
+Matches any confirmation message with appointment ID format.
+
+**Date patterns:**
+
+```json
+{
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
+ }
+}
+```
+
+Matches responses mentioning weekdays.
+
+**Case-insensitive matching:**
+
+```json
+{
+ "judgePlan": {
+ "type": "regex",
+ "content": "(?i)booking confirmed"
+ }
+}
+```
+
+The `(?i)` flag makes matching case-insensitive.
+
+### Example: Flexible booking confirmation
+
+
+**Dashboard:**
+
+ 1. Add assistant message with evaluation enabled
+ 2. Select **Regex** as judge type
+ 3. Enter pattern: `.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*`
+ 4. This matches various confirmation phrasings with time mentions
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Flexible Booking Test",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I need to schedule an appointment"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*(schedule|book|set up).*appointment.*"
+ }
+ }
+ ]
+ }'
+```
+
+
+
+
+  **Regex tips:**
+
+  - Use `.*` to match any characters
+  - Use `(option1|option2)` for alternatives
+  - Use `\d` for digits, `\s` for whitespace
+  - Use `.*?` for non-greedy matching
+  - Test your patterns with sample responses first
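+
+You can sanity-check a pattern locally before putting it in a `judgePlan`; a quick sketch with `grep -E` (note that grep's ERE dialect may differ slightly from the eval engine's regex support):
+
+```bash
+# Should print "match" - same alternation pattern as the example above
+echo "Sure, I can set up an appointment for you." |
+  grep -qE '.*(schedule|book|set up).*appointment.*' && echo match || echo no match
+```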
+
+
+## Step 6: Use AI judge for semantic validation
+
+For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.
+
+### AI judge structure
+
+```json
+{
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "Your evaluation prompt here"
+ }
+ ]
+ }
+ }
+}
+```
+
+### Writing effective judge prompts
+
+**Template structure:**
+
+```
+You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.
+
+Include the full conversation history for context: {{messages}}
+
+Decision rule:
+- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
+- Otherwise FAIL.
+
+Pass criteria:
+- [Specific requirement 1]
+- [Specific requirement 2]
+
+Fail criteria (any one triggers FAIL):
+- [Specific failure condition 1]
+- [Specific failure condition 2]
+
+Output format: respond with exactly one word: pass or fail
+- No explanations
+- No punctuation
+- No additional text
+```
+
+
+**Template variables:**
+- `{{messages}}` - The entire conversation history (all messages exchanged)
+- `{{messages[-1]}}` - The last assistant message only
+
+
+### Example: Evaluate helpfulness and tone
+
+
+**Dashboard:**
+
+ 1. Add assistant message with evaluation enabled
+ 2. Select **AI Judge** as judge type
+ 3. Choose provider: **OpenAI**
+ 4. Select model: **gpt-4o**
+ 5. Enter evaluation prompt (see template above)
+ 6. Customize pass/fail criteria for your use case
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Helpfulness Test",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I need help with my account"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "You are an LLM-Judge. Evaluate ONLY the last assistant message: {{messages[-1]}}.\n\nInclude context: {{messages}}\n\nDecision rule:\n- PASS if ALL pass criteria are met AND NO fail criteria are triggered.\n- Otherwise FAIL.\n\nPass criteria:\n- Response acknowledges the user request\n- Response offers specific help or next steps\n- Tone is professional and friendly\n\nFail criteria (any triggers FAIL):\n- Response is rude or dismissive\n- Response ignores the user request\n- Response provides no actionable information\n\nOutput format: respond with exactly one word: pass or fail"
+ }]
+ }
+ }
+ }
+ ]
+ }'
+```
+
+
+
+### Supported AI judge providers
+
+- **OpenAI** (gpt-4o, gpt-4-turbo, gpt-3.5-turbo) - Best for general-purpose evaluation
+- **Anthropic** (claude-3-5-sonnet-20241022, claude-3-opus-20240229) - Best for nuanced evaluation
+- **Google** (gemini-1.5-pro, gemini-1.5-flash) - Best for multilingual content
+- **Groq** (llama-3.1-70b-versatile, mixtral-8x7b-32768) - Best for fast evaluation
+
+
+**Custom LLM:**
+
+```json
+{
+ "model": {
+ "provider": "custom-llm",
+ "model": "your-model-name",
+ "url": "https://your-api-endpoint.com/chat/completions",
+ "messages": [...]
+ }
+}
+```
+
+### AI judge best practices
+
+
+  **Tips for reliable AI judging:**
+
+  - Be specific with pass/fail criteria (avoid ambiguous requirements)
+  - Use "ALL pass criteria must be met" logic
+  - Use "ANY fail criteria triggers fail" logic
+  - Include conversation context with `{{messages}}` syntax
+  - Request exact "pass" or "fail" output (no explanations)
+  - Test criteria with known good/bad responses before production
+  - Use consistent evaluation standards across similar tests
+
+
+## Step 7: Control flow with Continue Plan
+
+Define what happens after an evaluation passes or fails using `continuePlan`.
+
+### Exit on failure
+
+Stop the test immediately if a critical check fails:
+
+```json
+{
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "content": "I can help you with that."
+ },
+ "continuePlan": {
+ "exitOnFailureEnabled": true
+ }
+}
+```
+
+**Use case:** Skip expensive subsequent tests when initial validation fails.
+
+### Override responses on failure
+
+Provide fallback responses to continue testing even when validation fails:
+
+```json
+{
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "content": "I've processed your request."
+ },
+ "continuePlan": {
+ "exitOnFailureEnabled": false,
+ "contentOverride": "Let me rephrase that...",
+ "toolCallsOverride": [
+ {
+ "name": "retryProcessing",
+ "arguments": { "retry": "true" }
+ }
+ ]
+ }
+}
+```
+
+**Use case:** Test error recovery paths or force specific tool calls for subsequent validation.
+
+### Example: Multi-step with exit control
+
+
+**Dashboard:**
+
+ 1. Create evaluation with multiple conversation turns
+ 2. For each assistant message with critical validation:
+ - Enable evaluation
+ - Configure judge plan (exact, regex, or AI)
+ - Toggle **Exit on Failure** to stop test early
+ 3. For non-critical checks, leave **Exit on Failure** disabled
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Multi-Step with Control",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I want to book an appointment"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "content": "I can help you book an appointment."
+ },
+ "continuePlan": {
+ "exitOnFailureEnabled": true
+ }
+ },
+ {
+ "role": "user",
+ "content": "Monday at 2pm"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{"name": "bookAppointment"}]
+ },
+ "continuePlan": {
+ "exitOnFailureEnabled": false,
+ "contentOverride": "Booking confirmed for Monday at 2pm.",
+ "toolCallsOverride": [{
+ "name": "bookAppointment",
+ "arguments": {"date": "2025-01-20", "time": "14:00"}
+ }]
+ }
+ }
+ ]
+ }'
+```
+
+
+
+
+ If `exitOnFailureEnabled` is `true` and validation fails, the test stops
+ immediately. Subsequent conversation turns are not executed. Use this for
+ critical checkpoints.
+
+
+## Step 8: Test complete conversation flows
+
+Validate multi-turn interactions that simulate real user conversations.
+
+### Complete booking flow example
+
+
+**Dashboard:**
+
+ Create a comprehensive test:
+
+ 1. **Turn 1 - Initial request:**
+ - User: "I need to schedule an appointment"
+ - Assistant evaluation: AI judge checking acknowledgment
+
+ 2. **Turn 2 - Provide details:**
+ - User: "Next Monday at 2pm"
+ - Assistant evaluation: Exact match on tool call `bookAppointment`
+
+ 3. **Turn 3 - Tool response:**
+ - Tool: `{"status": "success", "confirmationId": "APT-12345"}`
+
+ 4. **Turn 4 - Confirmation:**
+ - Assistant evaluation: Regex matching confirmation with ID
+
+ 5. **Turn 5 - Follow-up:**
+ - User: "Can I get that via email?"
+ - Assistant evaluation: Exact match on tool call `sendEmail`
+
+
+**API:**
+
+```bash
+curl -X POST "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Complete Booking Flow",
+ "description": "Test full appointment booking conversation",
+ "type": "chat.mockConversation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I need to schedule an appointment"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [{
+ "role": "system",
+ "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges appointment request\n- Response asks for details or preferences\n\nFAIL if:\n- Response is dismissive\n- Response ignores request\n\nOutput: pass or fail"
+ }]
+ }
+ }
+ },
+ {
+ "role": "user",
+ "content": "Next Monday at 2pm"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-01-20",
+ "time": "14:00"
+ }
+ }]
+ }
+ },
+ {
+ "role": "tool",
+ "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*confirmed.*APT-12345.*"
+ }
+ },
+ {
+ "role": "user",
+ "content": "Can I get that via email?"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [{
+ "name": "sendEmail"
+ }]
+ }
+ }
+ ]
+ }'
+```
+
+For API details, see [Create Eval](/api-reference/eval/create).
+
+
+
+
+### System message injection
+
+Inject system prompts mid-conversation to test dynamic behavior changes:
+
+```json
+{
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*help.*"
+ }
+ },
+ {
+ "role": "system",
+ "content": "You are now in urgent mode. Prioritize speed."
+ },
+ {
+ "role": "user",
+ "content": "I need immediate help"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+
+  **Multi-turn testing tips:**
+
+  - Keep conversations focused (5-10 turns for most tests)
+  - Use exit-on-failure for early turns to save time
+  - Test one primary flow per evaluation
+  - Mix judge types (exact, regex, AI) for comprehensive validation
+  - Include tool responses to simulate real interactions
+
+
+## Step 9: Manage evaluations
+
+List, update, and organize your evaluation suite.
+
+### List all evaluations
+
+
+**Dashboard:**
+
+ 1. Navigate to **Evals** in the sidebar
+ 2. View all evaluations in a table with:
+ - Name and description
+ - Created date
+ - Last run status
+ - Actions (Edit, Run, Delete)
+ 3. Use search to filter by name
+ 4. Sort by date or status
+
+
+**API:**
+
+```bash
+curl -X GET "https://api.vapi.ai/eval" \
+ -H "Authorization: Bearer $VAPI_API_KEY"
+```
+
+**Response:**
+
+```json
+{
+ "results": [
+ {
+ "id": "550e8400-e29b-41d4-a716-446655440000",
+ "name": "Greeting Test",
+ "description": "Verify assistant greets users appropriately",
+ "type": "chat.mockConversation",
+ "createdAt": "2024-01-15T09:30:00Z",
+ "updatedAt": "2024-01-15T09:30:00Z"
+ }
+ ],
+ "page": 1,
+ "total": 1
+}
+```
+
+For API details, see [List Evals](/api-reference/eval/list).
+
+
+
+
+### Update an evaluation
+
+
+**Dashboard:**
+
+ 1. Navigate to **Evals** and click on an evaluation
+ 2. Click **Edit** button
+ 3. Modify conversation turns, judge plans, or settings
+ 4. Click **Save Changes**
+ 5. Previous test runs remain unchanged
+
+
+**API:**
+
+```bash
+curl -X PATCH "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
+ -H "Authorization: Bearer $VAPI_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "name": "Updated Greeting Test",
+ "description": "Enhanced greeting validation",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hi there"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": "^(Hello|Hi|Hey).*"
+ }
+ }
+ ]
+ }'
+```
+
+For API details, see [Update Eval](/api-reference/eval/update).
+
+
+
+
+### Delete an evaluation
+
+
+**Dashboard:**
+
+ 1. Navigate to **Evals**
+ 2. Click on an evaluation
+ 3. Click **Delete** button
+ 4. Confirm deletion
+
+
+ Deleting an evaluation does NOT delete its run history. Past run results remain accessible.
+
+
+
+**API:**
+
+```bash
+curl -X DELETE "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
+ -H "Authorization: Bearer $VAPI_API_KEY"
+```
+
+For API details, see [Delete Eval](/api-reference/eval/delete).
+
+
+
+
+### View run history
+
+
+**Dashboard:**
+
+ 1. Navigate to **Evals**
+ 2. Click on an evaluation
+ 3. View **Runs** tab showing:
+ - Run timestamp
+ - Target (assistant/squad)
+ - Status (pass/fail)
+ - Duration
+ 4. Click any run to view detailed results
+
+
+**API:**
+
+**List all runs:**
+```bash
+curl -X GET "https://api.vapi.ai/eval/run" \
+ -H "Authorization: Bearer $VAPI_API_KEY"
+```
+
+**Filter by eval ID:**
+
+```bash
+curl -X GET "https://api.vapi.ai/eval/run?evalId=550e8400-e29b-41d4-a716-446655440000" \
+ -H "Authorization: Bearer $VAPI_API_KEY"
+```
+
+For API details, see [List Eval Runs](/api-reference/eval/list-runs).
+
+
+
+
+## Expected output
+
+### Successful run
+
+```json
+{
+ "id": "eval-run-123",
+ "evalId": "550e8400-e29b-41d4-a716-446655440000",
+ "orgId": "org-123",
+ "status": "ended",
+ "endedReason": "mockConversation.done",
+ "createdAt": "2024-01-15T09:35:00Z",
+ "updatedAt": "2024-01-15T09:35:45Z",
+ "results": [
+ {
+ "status": "pass",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "content": "Hello! How can I help you today?",
+ "judge": {
+ "status": "pass"
+ }
+ }
+ ]
+ }
+ ],
+ "target": {
+ "type": "assistant",
+ "assistantId": "your-assistant-id"
+ }
+}
+```
+
+**Indicators of success:**
+
+- ✅ `status` is "ended"
+- ✅ `endedReason` is "mockConversation.done"
+- ✅ `results[0].status` is "pass"
+- ✅ All `judge.status` values are "pass"
+
+### Failed run
+
+```json
+{
+ "id": "eval-run-124",
+ "status": "ended",
+ "endedReason": "mockConversation.done",
+ "results": [
+ {
+ "status": "fail",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Book an appointment for Monday at 2pm"
+ },
+ {
+ "role": "assistant",
+ "content": "Sure, let me help you with that.",
+ "toolCalls": [
+ {
+ "name": "bookAppointment",
+ "arguments": {
+ "date": "2025-01-20",
+ "time": "2:00 PM"
+ }
+ }
+ ],
+ "judge": {
+ "status": "fail",
+ "failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
+ }
+ }
+ ]
+ }
+ ]
+}
+```
+
+**Indicators of failure:**
+
+- ❌ `results[0].status` is "fail"
+- ❌ `judge.status` is "fail"
+- ❌ `judge.failureReason` provides specific details
+
+
+ Full conversation transcripts show both expected and actual values, making
+ debugging straightforward.
+
+
+## Common patterns
+
+### Multiple validation types in one eval
+
+Combine exact, regex, and AI judges for comprehensive testing:
+
+```json
+{
+ "messages": [
+ {
+ "role": "user",
+ "content": "Hello"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "content": "Hello! How can I help you?"
+ }
+ },
+ {
+ "role": "user",
+ "content": "Book appointment for Monday"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "regex",
+ "content": ".*(Monday|next week).*"
+ }
+ },
+ {
+ "role": "user",
+ "content": "Thanks for your help"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "ai",
+ "model": {
+ "provider": "openai",
+ "model": "gpt-4o",
+ "messages": [
+ {
+ "role": "system",
+ "content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
+ }
+ ]
+ }
+ }
+ }
+ ]
+}
+```
+
+### Test squad handoffs
+
+Validate smooth transitions between squad members:
+
+```json
+{
+ "name": "Squad Handoff Test",
+ "messages": [
+ {
+ "role": "user",
+ "content": "I need technical support"
+ },
+ {
+ "role": "assistant",
+ "judgePlan": {
+ "type": "exact",
+ "toolCalls": [
+ {
+ "name": "transferToSquadMember",
+ "arguments": {
+ "destination": "technical-support-agent"
+ }
+ }
+ ]
+ }
+ }
+ ],
+ "target": {
+ "type": "squad",
+ "squadId": "your-squad-id"
+ }
+}
+```
+
+### Regression test suite
+
+Organize related tests for systematic validation:
+
+```json
+{
+ "name": "Greeting Regression Suite",
+ "tests": [
+ "Greeting Test - Formal",
+ "Greeting Test - Casual",
+ "Greeting Test - Multilingual"
+ ]
+}
+```
+
+Run multiple evals sequentially to validate all greeting scenarios.
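+
+A sketch of a sequential suite runner, assuming a hypothetical `$SUITE_EVAL_IDS` listing the three greeting eval IDs and `$ASSISTANT_ID` as the target:
+
+```bash
+failed=0
+for eval_id in $SUITE_EVAL_IDS; do
+  run_id=$(curl -s -X POST "https://api.vapi.ai/eval/run" \
+    -H "Authorization: Bearer $VAPI_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d "{\"evalId\": \"$eval_id\", \"target\": {\"type\": \"assistant\", \"assistantId\": \"$ASSISTANT_ID\"}}" | jq -r '.id')
+
+  # Wait for the run to finish, then record the result
+  until [ "$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+    -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.status')" = "ended" ]; do
+    sleep 2
+  done
+
+  result=$(curl -s "https://api.vapi.ai/eval/run/$run_id" \
+    -H "Authorization: Bearer $VAPI_API_KEY" | jq -r '.results[0].status')
+  echo "$eval_id: $result"
+  [ "$result" = "pass" ] || failed=1
+done
+exit $failed
+```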
+
+## Troubleshooting
+
+| Issue | Solution |
+| ----------------------- | --------------------------------------------------------------------------------------- |
+| Eval always fails | Verify exact match strings character-by-character. Consider using regex for flexibility |
+| AI judge inconsistent | Make pass/fail criteria more specific and binary. Test with known examples |
+| Tool calls not matching | Check argument types (string vs number). Ensure exact spelling of function names |
+| Run stuck in "running" | Verify assistant configuration. Check for errors in assistant's tools or prompts |
+| Timeout errors | Reduce conversation length or simplify evaluations. Check assistant response times |
+| Regex not matching | Test regex patterns separately. Remember to escape special characters like `.` or `?` |
+| Empty results array | Check `endedReason` field. Assistant may have encountered an error before completion |
+| Missing judge results | Verify `judgePlan` is properly configured in assistant messages |
+
+### Common errors
+
+**"mockConversation.done" not reached:**
+
+- Check `endedReason` for actual error (e.g., "assistant-error", "pipeline-error-openai-llm-failed")
+- Verify assistant configuration (model, voice, tools)
+- Check API key validity and rate limits
+
+**Judge validation fails unexpectedly:**
+
+- Review actual vs expected output in `failureReason`
+- For exact match: Check for extra spaces, punctuation, or case differences
+- For regex: Test pattern with online regex validators
+- For AI judge: Verify prompt clarity and binary pass/fail logic
+
+**Tool calls not validated:**
+
+- Ensure tool is properly configured in assistant
+- Check argument types match exactly (string "14:00" vs number 14)
+- Verify tool function names are spelled correctly
+
+
+ If you see `endedReason: "assistant-error"`, your assistant configuration has
+ issues. Test the assistant manually first before running evals.
+
+
+## Next steps
+
+- [Advanced eval testing](/observability/evals-advanced) - Learn testing patterns, best practices, and CI/CD integration
+- Assistants - Create and configure assistants to test
+- Custom tools - Build custom tools and validate their behavior
+- [Eval API Reference](/api-reference/eval/create) - Complete API documentation for evals
+
+## Tips for success
+
+
+  **Best practices for reliable testing:**
+
+  - Start simple with exact matches, then add complexity
+  - One behavior per evaluation turn keeps tests focused
+  - Use descriptive names that explain what's being tested
+  - Test both happy paths and edge cases
+  - Version control your evals alongside assistant configs
+  - Run critical tests first to fail fast
+  - Review failure reasons promptly and iterate
+  - Document why each test exists (use descriptions)
+
+
+## Get help
+
+Need assistance? We're here to help:
+
+- [Eval API Reference](/api-reference/eval/create)
+- [Discord Community](https://discord.gg/pUFNcf2WmH)
+- [Support](mailto:support@vapi.ai)