Add app_oracle API prediction mode for AppWorld benchmarks #18

chughtapan · 2025-10-23T20:12:12Z

Implements a new intermediate API prediction mode that uses oracle data to identify required services, then exposes all APIs from those services.

Changes:

- Add app_oracle mode: Uses ground truth to identify apps (e.g., spotify, venmo), then loads all APIs from those apps. System apps (supervisor) only include ground truth APIs.
- Refactor: Split appworld_helpers.py into api_predictor.py (API prediction) and prompts.py (prompt management) for better separation of concerns
- Fix: Remove 20-API limit for "all" mode (now returns all 473 APIs)
- Fix: Eliminate duplicate Task loading in predict_apis()

API count comparison for typical task:

- ground_truth: 6 APIs (exact oracle)
- app_oracle: 95 APIs (3 supervisor + 92 spotify)
- all: 473 APIs (no limit)
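The app_oracle selection rule described under Changes can be sketched roughly as follows. This is an illustration only, not the code in api_predictor.py: the function name `app_oracle_apis`, the `"app.api"` naming scheme, the `SYSTEM_APPS` constant, and the toy catalog are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: "supervisor" is the only system app; the real list may differ.
SYSTEM_APPS = frozenset({"supervisor"})


def app_oracle_apis(ground_truth, catalog, system_apps=SYSTEM_APPS):
    """Return the API names exposed by an app_oracle-style mode.

    ground_truth: iterable of "app.api" names used by the oracle solution.
    catalog: mapping of app name -> list of every API that app offers.
    """
    # Group the oracle APIs by the app they belong to.
    oracle_by_app = defaultdict(list)
    for name in ground_truth:
        app, _, api = name.partition(".")
        oracle_by_app[app].append(api)

    selected = []
    for app in sorted(oracle_by_app):
        if app in system_apps:
            # System apps contribute only the APIs the oracle actually used.
            apis = sorted(set(oracle_by_app[app]))
        else:
            # Every other required app contributes its full API surface.
            apis = sorted(catalog[app])
        selected.extend(f"{app}.{api}" for api in apis)
    return selected


# Toy data with made-up API names, not the real AppWorld catalog.
catalog = {
    "supervisor": ["show_profile", "show_account_passwords", "complete_task", "show_addresses"],
    "spotify": ["login", "search_songs", "show_song", "like_song", "show_playlist"],
    "venmo": ["login", "send_money"],
}
ground_truth = [
    "supervisor.show_account_passwords",
    "supervisor.complete_task",
    "spotify.login",
    "spotify.like_song",
]

apis = app_oracle_apis(ground_truth, catalog)
```

With this toy data the result holds the two supervisor APIs from the ground truth plus all five spotify APIs, and nothing from venmo. The same split gives the 3 + 92 = 95 figure quoted for the typical task.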

Usage:
    pytest tests/benchmarks/appworld/test_appworld.py --api-mode app_oracle \
      --dataset train --limit 5 --model gpt-4o
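For context, a command-line option like `--api-mode` is typically registered in conftest.py through pytest's `pytest_addoption` hook. The sketch below shows one plausible shape. The `API_MODES` tuple lists only the three modes named in this PR (the actual choices may include others), and the default value is an assumption.

```python
# Modes named in this PR; the real list in conftest.py may contain more.
API_MODES = ("ground_truth", "app_oracle", "all")


def pytest_addoption(parser):
    """Register --api-mode so pytest rejects unknown modes at startup."""
    parser.addoption(
        "--api-mode",
        action="store",
        default="ground_truth",  # assumed default, not stated in the PR
        choices=API_MODES,
        help="Which APIs to expose to the agent: " + ", ".join(API_MODES),
    )
```

A test or fixture can then read the selected mode with `request.config.getoption("--api-mode")`.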

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot

Pull Request Overview

This PR implements an intermediate API prediction mode called app_oracle that uses ground truth data to identify required services, then exposes all APIs from those services. This provides a middle ground between exact oracle APIs (ground_truth) and all available APIs (all).

Key changes:

- Added app_oracle mode that returns ~50-100 APIs by identifying required apps from ground truth, then loading all APIs from those apps
- Refactored code by splitting appworld_helpers.py into separate api_predictor.py and prompts.py modules
- Removed the 20-API limit for "all" mode to return all 473 available APIs

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/benchmarks/appworld/test_appworld.py | Updated imports to use new `api_predictor` and `prompts` modules |
| tests/benchmarks/appworld/prompts.py | Removed API prediction logic, keeping only prompt management functions |
| tests/benchmarks/appworld/conftest.py | Added `app_oracle` to API mode choices and updated documentation |
| tests/benchmarks/appworld/api_predictor.py | New module implementing all API prediction modes including the new `app_oracle` mode |



chughtapan force-pushed the appworld-api-oracle branch from 5aa7d58 to fd1cf5e on October 23, 2025 20:24

chughtapan force-pushed the appworld-api-oracle branch from fd1cf5e to c1a4e9e on October 23, 2025 20:28

chughtapan requested a review from Copilot October 23, 2025 20:29

Copilot AI reviewed Oct 23, 2025

tests/benchmarks/appworld/api_predictor.py (comment resolved)

chughtapan merged commit 533f386 into main Oct 23, 2025
3 checks passed
