Conversation

@jirispilka

No description provided.

@jirispilka jirispilka self-assigned this Jan 17, 2025
@jirispilka jirispilka merged commit ec6e9b0 into master Jan 17, 2025
2 checks passed
@jirispilka jirispilka deleted the fix/ci branch January 17, 2025 13:38
yfe404 added a commit to yfe404/apify-mcp-server that referenced this pull request Nov 10, 2025
feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy

This commit implements comprehensive improvements to the MCP tool selection evaluation system (v1.4),
focusing on adding complete tool descriptions to the judge prompt, clarifying tool intent, implementing
bidirectional tool equivalence, and fixing test case quality issues.

## Performance Improvements

Comparing baseline (v1.4 experiments apify#1-apify#4) vs current (v1.4 experiments apify#5-apify#8):

### Exact-Match Evaluator (Baseline → Current)
- GPT-4o Mini: 99% → **97%** (-2%, minor regression)
- Claude Haiku 4.5: 95% → **99%** (+4%)
- Gemini 2.5 Flash: 91% → **96%** (+5%)
- GPT-5: 91% → **99%** (+8%)

### LLM-Judge Evaluator (Baseline → Current)
- GPT-4o Mini: 93% → **97%** (+4%)
- Claude Haiku 4.5: 91% → **95%** (+4%)
- Gemini 2.5 Flash: 89% → **97%** (+8%)
- GPT-5: 88% → **99%** (+11%) ← Largest improvement

**All models now significantly exceed the 70% threshold with more consistent performance.**

**Key Insight:** Adding complete tool descriptions to the judge prompt eliminated false negatives
and improved judge accuracy significantly, especially for GPT-5 (+11%) and Gemini (+8%).

## Technical Changes

### 1. Added Complete Tool Context to Judge Prompt (evals/config.ts:84-115)

**Before:** Judge prompt had NO tool descriptions at all. The judge was evaluating tool selections
without understanding what each tool does, leading to arbitrary penalization.

**After:** Added comprehensive "Important Tool Context" section with descriptions for ALL tools:

**Tool descriptions added:**
- **search-actors:** Searches Apify Store to find scraping tools/Actors (NOT celebrity actors). Emphasizes informational intent.
- **apify-slash-rag-web-browser:** Browses web to get data immediately (one-time data retrieval). Emphasizes time indicators.
- **call-actor:** Mandatory two-step workflow (step="info" then step="call"). Explains info step is CORRECT and required.
- **fetch-actor-details:** Gets Actor documentation without running it. Notes overlap with call-actor step="info".
- **search-apify-docs:** Searches Apify documentation for platform/feature info.
- **get-actor-output:** Retrieves output data from completed Actor runs using datasetId.
- **fetch-apify-docs:** Fetches full content of specific Apify docs page by URL.

**Keyword Length Guidelines** section added to prevent judge from penalizing thoughtful keyword additions.

**Impact:** The judge now understands tool purposes and evaluates tool selections correctly instead of penalizing them arbitrarily. This was the PRIMARY cause of the LLM-judge improvements (+4% to +11%).
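
Concretely, the added section is a block of plain text interpolated into the judge's evaluation prompt. The sketch below is a condensed, hypothetical rendering of that block and of how it might be wired into the prompt template; the exact wording and structure in evals/config.ts:84-115 differ.

```typescript
// Condensed, hypothetical sketch of the tool-context block added to the
// judge prompt; the real wording in evals/config.ts differs.
const TOOL_CONTEXT = `Important Tool Context:
- search-actors: searches Apify Store for scraping tools/Actors (informational; does NOT retrieve data)
- apify-slash-rag-web-browser: browses the web to return data immediately (one-time retrieval)
- call-actor: mandatory two-step workflow (step="info", then step="call"); the info step is correct and required
- fetch-actor-details: gets Actor documentation without running it (overlaps with call-actor step="info")
- search-apify-docs: searches Apify documentation for platform/feature info
- get-actor-output: retrieves output of a completed Actor run by datasetId
- fetch-apify-docs: fetches the full content of a specific Apify docs page by URL`;

// The judge template interpolates this context alongside the query, the
// tools the model actually called, and the reference answer.
const judgePrompt = (query: string, tools: string, reference: string): string => `
You are judging whether the model selected the right tool for the query.

${TOOL_CONTEXT}

Query: ${query}
Tools called: ${tools}
Reference: ${reference}
`;
```

The point of the change is simply that the judge sees the same tool semantics the model under test saw, so it no longer has to guess what each tool does.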

### 2. Implemented Bidirectional Tool Equivalence (evals/run-evaluation.ts:102-144)

**Before:** No tool normalization existed - direct string comparison only.

**After:** Bidirectional normalization treats `call-actor(step="info")` and `fetch-actor-details` as equivalent.

**Why:** The `call-actor` tool has a mandatory two-step workflow:
- Step 1: `call-actor(step="info")` → Get Actor details
- Step 2: `call-actor(step="call")` → Execute Actor

Since step 1 is functionally identical to `fetch-actor-details`, both should be accepted as correct.

**Implementation:**
- Added `normalizeToolName()` - normalizes expected tools
- Added `normalizeToolCall()` - normalizes actual tool calls, checking step parameter
- Both functions map `call-actor` and `fetch-actor-details` → `fetch-actor-details` for comparison

**Impact:** Eliminates false negatives when models correctly use either equivalent tool.
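
A minimal sketch of that equivalence check follows. The function names `normalizeToolName` and `normalizeToolCall` come from evals/run-evaluation.ts, but the signatures, the `ToolCall` shape, and the handling of `step="call"` here are assumptions for illustration only.

```typescript
// Hypothetical sketch of the bidirectional equivalence described above;
// the real implementation in evals/run-evaluation.ts may differ.
interface ToolCall {
  name: string;
  args?: Record<string, unknown>;
}

// Normalize an *expected* tool name: either tool of the equivalent pair
// collapses to the canonical fetch-actor-details for comparison.
function normalizeToolName(name: string): string {
  return name === 'call-actor' ? 'fetch-actor-details' : name;
}

// Normalize an *actual* tool call, checking the step parameter:
// call-actor(step="info") is functionally identical to fetch-actor-details.
function normalizeToolCall(call: ToolCall): string {
  if (call.name === 'call-actor' && call.args?.step === 'info') {
    return 'fetch-actor-details';
  }
  return call.name;
}

// A call satisfies an expectation if their normalized forms agree.
function toolMatches(expected: string, actual: ToolCall): boolean {
  return normalizeToolName(expected) === normalizeToolCall(actual);
}
```

With this in place, an expectation of `fetch-actor-details` is satisfied by either `fetch-actor-details` or `call-actor(step="info")`, and vice versa.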

### 3. Clarified Information vs Data Retrieval Intent (src/tools/store_collection.ts:90-126, src/const.ts:51-59)

**Problem:** Models confused when to use `search-actors` (finding tools) vs `apify-slash-rag-web-browser` (getting data).

**Root Cause:**
- `search-actors` incorrectly said "Use this tool whenever user needs to scrape data" → Made it sound like it retrieves data
- `RAG_WEB_BROWSER_ADDITIONAL_DESC` said "for specific sites it is always better to search for a specific Actor" → Discouraged using rag-web-browser for specific sites

**Solution - search-actors (informational intent):**
- Emphasizes: "FIND and DISCOVER what scraping tools/Actors exist"
- Makes clear: "This tool provides INFORMATION about available Actors - it does NOT retrieve actual data"
- Examples: "What tools can scrape Instagram?", "Find an Actor for Amazon products"
- Guidance: "Do NOT use when user wants immediate data retrieval - use apify-slash-rag-web-browser instead"

**Solution - rag-web-browser (data retrieval intent):**
- Emphasizes: "GET or RETRIEVE actual data immediately (one-time data retrieval)"
- Makes clear: "This tool directly fetches and returns data - it does NOT just find tools"
- Examples: "Get flight prices for tomorrow", "What's the weather today?"
- Time indicators: "today", "current", "latest", "recent", "now"

**Impact:** Models now clearly distinguish informational intent from data-retrieval intent.
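
For illustration, the revised descriptions might read like the constants below. This is a condensed sketch: `RAG_WEB_BROWSER_ADDITIONAL_DESC` is the real constant name from src/const.ts, but `SEARCH_ACTORS_DESC` is a hypothetical name, and the shipped wording in both files differs.

```typescript
// Illustrative sketch only; the real descriptions live in
// src/tools/store_collection.ts and src/const.ts and are worded differently.
// SEARCH_ACTORS_DESC is a hypothetical constant name.
const SEARCH_ACTORS_DESC = `FIND and DISCOVER what scraping tools/Actors exist in Apify Store.
This tool provides INFORMATION about available Actors - it does NOT retrieve actual data.
Do NOT use when the user wants immediate data retrieval - use apify-slash-rag-web-browser instead.`;

const RAG_WEB_BROWSER_ADDITIONAL_DESC = `GET or RETRIEVE actual data immediately (one-time data retrieval).
This tool directly fetches and returns data - it does NOT just find tools.
Time indicators like "today", "current", "latest", "recent", "now" suggest this tool.`;
```

The design point is that each description names its own intent and explicitly redirects the opposite intent to the other tool, so the model never has to infer the boundary.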

### 4. Fixed Test Case Quality Issues (evals/test-cases.json)

**Changes:**
- Fixed contradictory test cases (search-actors-1, search-actors-15)
- Removed misleading-query-2 (contradictory intent)
- Disambiguated intent-ambiguous queries by adding time indicators ("recent", "current") or "Actor" mentions
- Split search-vs-rag-7 into two clear variants (7a for immediate data, 7b for tool search)
- Updated fetch-actor-details-7 to accept both `fetch-actor-details` and `call-actor`
- Made vague queries more specific (added context to ambiguous-query-3, ambiguous-query-1)

**Example fix - search-actors-1:**
```
Before: Query "How to scrape Instagram posts" with expectedTools=[]
        Reference: "Either explain OR call search-actors"  ← Contradictory
After:  Query "What Actors can scrape Instagram posts?"
        expectedTools=["search-actors"]  ← Clear intent
```

**Impact:** Test expectations are now internally consistent and aligned with model behavior.
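
Expressed as a dataset entry, the corrected case might look like the sketch below. The field names (`id`, `query`, `expectedTools`, `reference`) are assumptions based on the descriptions above, not the exact schema of evals/test-cases.json.

```json
{
  "id": "search-actors-1",
  "query": "What Actors can scrape Instagram posts?",
  "expectedTools": ["search-actors"],
  "reference": "The assistant should call search-actors to list Actors that can scrape Instagram posts."
}
```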

### 5. Updated Documentation (evals/README.md:67-78)

Added comprehensive v1.4 changelog documenting all improvements for future reference.

## Files Changed

- evals/config.ts - **Added complete tool context section to judge prompt (PRIMARY CHANGE)**
- evals/run-evaluation.ts - Implemented bidirectional tool equivalence normalization
- evals/test-cases.json - Dataset v1.4 with 74 test cases (fixed contradictions, disambiguated queries)
- evals/README.md - Documented v1.4 changes
- src/tools/store_collection.ts - Clarified search-actors as informational intent
- src/const.ts - Clarified rag-web-browser as data retrieval intent

## Validation

All evaluations significantly exceed the 70% threshold (Phoenix v1.4 experiments apify#5-apify#8):
- ✓ Claude Haiku 4.5: 99% exact-match, 95% judge
- ✓ Gemini 2.5 Flash: 96% exact-match, 97% judge
- ✓ GPT-4o Mini: 97% exact-match, 97% judge
- ✓ GPT-5: 99% exact-match, 99% judge
yfe404 added a commit to yfe404/apify-mcp-server that referenced this pull request Nov 26, 2025
feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy
jirispilka pushed a commit that referenced this pull request Nov 27, 2025
feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy (#334)

* feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy

This commit implements comprehensive improvements to the MCP tool selection evaluation system (v1.4),
focusing on adding complete tool descriptions to the judge prompt, clarifying tool intent, implementing
bidirectional tool equivalence, and fixing test case quality issues.

Comparing baseline (v1.4 experiments #1-#4) vs current (v1.4 experiments #5-#8):

- GPT-4o Mini: 99% → **97%** (-2%) - Minor regression
- Claude Haiku 4.5: 95% → **99%** (+4%)
- Gemini 2.5 Flash: 91% → **96%** (+5%)
- GPT-5: 91% → **99%** (+8%)

- GPT-4o Mini: 93% → **97%** (+4%)
- Claude Haiku 4.5: 91% → **95%** (+4%)
- Gemini 2.5 Flash: 89% → **97%** (+8%)
- GPT-5: 88% → **99%** (+11%) ← Largest improvement

**All models now significantly exceed the 70% threshold with more consistent performance.**

**Key Insight:** Adding complete tool descriptions to the judge prompt eliminated false negatives
and improved judge accuracy significantly, especially for GPT-5 (+11%) and Gemini (+8%).

**Before:** Judge prompt had NO tool descriptions at all. The judge was evaluating tool selections
without understanding what each tool does, leading to arbitrary penalization.

**After:** Added comprehensive "Important Tool Context" section with descriptions for ALL tools:

**Tool descriptions added:**
- **search-actors:** Searches Apify Store to find scraping tools/Actors (NOT celebrity actors). Emphasizes informational intent.
- **apify-slash-rag-web-browser:** Browses web to get data immediately (one-time data retrieval). Emphasizes time indicators.
- **call-actor:** Mandatory two-step workflow (step="info" then step="call"). Explains info step is CORRECT and required.
- **fetch-actor-details:** Gets Actor documentation without running it. Notes overlap with call-actor step="info".
- **search-apify-docs:** Searches Apify documentation for platform/feature info.
- **get-actor-output:** Retrieves output data from completed Actor runs using datasetId.
- **fetch-apify-docs:** Fetches full content of specific Apify docs page by URL.

**Keyword Length Guidelines** section added to prevent judge from penalizing thoughtful keyword additions.

**Impact:** Judge now understands tool purposes and correctly evaluates tool selections instead of
arbitrary penalization. This was the PRIMARY cause of LLM-judge improvements (+4% to +11%).

**Before:** No tool normalization existed - direct string comparison only.

**After:** Bidirectional normalization treats `call-actor(step="info")` and `fetch-actor-details` as equivalent.

**Why:** The `call-actor` tool has a mandatory two-step workflow:
- Step 1: `call-actor(step="info")` → Get Actor details
- Step 2: `call-actor(step="call")` → Execute Actor

Since step 1 is functionally identical to `fetch-actor-details`, both should be accepted as correct.

**Implementation:**
- Added `normalizeToolName()` - normalizes expected tools
- Added `normalizeToolCall()` - normalizes actual tool calls, checking step parameter
- Both functions map `call-actor` and `fetch-actor-details` → `fetch-actor-details` for comparison

**Impact:** Eliminates false negatives when models correctly use either equivalent tool.

**Problem:** Models confused when to use `search-actors` (finding tools) vs `apify-slash-rag-web-browser` (getting data).

**Root Cause:**
- `search-actors` incorrectly said "Use this tool whenever user needs to scrape data" → Made it sound like it retrieves data
- `RAG_WEB_BROWSER_ADDITIONAL_DESC` said "for specific sites it is always better to search for a specific Actor" → Discouraged using rag for specific sites

**Solution - search-actors (informational intent):**
- Emphasizes: "FIND and DISCOVER what scraping tools/Actors exist"
- Makes clear: "This tool provides INFORMATION about available Actors - it does NOT retrieve actual data"
- Examples: "What tools can scrape Instagram?", "Find an Actor for Amazon products"
- Guidance: "Do NOT use when user wants immediate data retrieval - use apify-slash-rag-web-browser instead"

**Solution - rag-web-browser (data retrieval intent):**
- Emphasizes: "GET or RETRIEVE actual data immediately (one-time data retrieval)"
- Makes clear: "This tool directly fetches and returns data - it does NOT just find tools"
- Examples: "Get flight prices for tomorrow", "What's the weather today?"
- Time indicators: "today", "current", "latest", "recent", "now"

**Impact:** Models now clearly distinguish between informational intent vs data retrieval intent.

**Changes:**
- Fixed contradictory test cases (search-actors-1, search-actors-15)
- Removed misleading-query-2 (contradictory intent)
- Disambiguated intent-ambiguous queries by adding time indicators ("recent", "current") or "Actor" mentions
- Split search-vs-rag-7 into two clear variants (7a for immediate data, 7b for tool search)
- Updated fetch-actor-details-7 to accept both `fetch-actor-details` and `call-actor`
- Made vague queries more specific (added context to ambiguous-query-3, ambiguous-query-1)

**Example fix - search-actors-1:**
```
Before: Query "How to scrape Instagram posts" with expectedTools=[]
        Reference: "Either explain OR call search-actors"  ← Contradictory
After:  Query "What Actors can scrape Instagram posts?"
        expectedTools=["search-actors"]  ← Clear intent
```

**Impact:** More consistent test expectations align with model behavior.

Added comprehensive v1.4 changelog documenting all improvements for future reference.

- evals/config.ts - **Added complete tool context section to judge prompt (PRIMARY CHANGE)**
- evals/run-evaluation.ts - Implemented bidirectional tool equivalence normalization
- evals/test-cases.json - Dataset v1.4 with 74 test cases (fixed contradictions, disambiguated queries)
- evals/README.md - Documented v1.4 changes
- src/tools/store_collection.ts - Clarified search-actors as informational intent
- src/const.ts - Clarified rag-web-browser as data retrieval intent

All evaluations significantly exceed the 70% threshold (Phoenix v1.4 experiments #5-#8):
- ✓ Claude Haiku 4.5: 99% exact-match, 95% judge
- ✓ Gemini 2.5 Flash: 96% exact-match, 97% judge
- ✓ GPT-4o Mini: 97% exact-match, 97% judge
- ✓ GPT-5: 99% exact-match, 99% judge

* Address PR review comments: clean up references and fix capitalization

- Fix capitalization: "Important Tool Context" -> "Important tool context"
- Remove change explanation notes from reference fields
- Remove references that only contained PR change notes without judge instructions