Add evals for each model by kamath · Pull Request #196 · browserbase/stagehand

kamath · 2024-11-21T01:13:09Z

why

We want evals for OpenAI and Anthropic. This does that and adds tags for each model if we want to break down performance.

Also, OpenAIClient wasn't working as expected because LogLine was being imported from the wrong place. This must have slipped through in a previous PR, thanks to the typing hell we're in. This fixes that.

what changed

Added evals for both Claude 3.5 Sonnet and GPT-4o across all evals, each tagged with the model name
Fixed example.ts to work with both OpenAI and Anthropic
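
One way to picture the per-model breakdown is as a cross product of eval tasks and model names, where each resulting case carries a tag for its model. This is a minimal sketch, not the code from this PR: `EvalCase`, `buildCases`, `countByTag`, and the task names are hypothetical, and only the two model names are taken from the test plan.

```typescript
// Hypothetical types and names for illustration; not taken from the stagehand repo.
type ModelName = "gpt-4o" | "claude-3-5-sonnet-20241022";

interface EvalCase {
  task: string;
  modelName: ModelName;
  tags: string[];
}

const MODELS: ModelName[] = ["gpt-4o", "claude-3-5-sonnet-20241022"];

// Cross every task with every model and tag each case with its model name,
// so results can later be filtered or grouped per model.
function buildCases(tasks: string[], models: ModelName[] = MODELS): EvalCase[] {
  const cases: EvalCase[] = [];
  for (const task of tasks) {
    for (const modelName of models) {
      cases.push({ task, modelName, tags: [modelName] });
    }
  }
  return cases;
}

// Count cases per tag to show how a per-model breakdown falls out of the tags.
function countByTag(cases: EvalCase[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    for (const tag of c.tags) {
      counts[tag] = (counts[tag] || 0) + 1;
    }
  }
  return counts;
}

// Placeholder task names; the real eval names are not listed in this PR.
const exampleCases = buildCases(["task_a", "task_b"]);
console.log(countByTag(exampleCases));
```

Tagging by model name rather than splitting into separate suites keeps one list of tasks while still allowing results to be filtered per model.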

test plan

Braintrust run

Also ran example.ts with modelName set to gpt-4o, set to claude-3-5-sonnet-20241022, and left empty; all three work

pkiv

lgtm

jeremypress

looks good, might make sense in the future to have "stable" models that we can pin in the LLM Providers file, then anything we run could pull stable Anthropic, stable OpenAI, etc., and we wouldn't have to manually set model names
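
The "stable" model idea could look roughly like the sketch below: a single map pins one model per provider, and callers resolve a model through it unless they pass an explicit name. Everything here is an assumption for illustration (`Provider`, `STABLE_MODELS`, `resolveModel`); it is not how the LLM Providers file is written, and the pinned values are simply the two model names from the test plan.

```typescript
// Hypothetical sketch of pinned "stable" models; names and structure are assumptions.
type Provider = "openai" | "anthropic";

// One pinned model per provider, updated in a single place.
const STABLE_MODELS: Record<Provider, string> = {
  openai: "gpt-4o",
  anthropic: "claude-3-5-sonnet-20241022",
};

// Use the explicit model name when one is given; otherwise fall back
// to the pinned stable model for the provider.
function resolveModel(provider: Provider, override?: string): string {
  if (override !== undefined && override !== "") {
    return override;
  }
  return STABLE_MODELS[provider];
}

console.log(resolveModel("openai"));
console.log(resolveModel("anthropic", "some-other-model"));
```

With this shape, evals and examples would only name a provider, and bumping the pinned model becomes a one-line change.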

kamath added 2 commits November 20, 2024 17:10

fix: openai example wasn't working

1dbbcfd

add braintrust evals for openai and claude

8f30905

kamath marked this pull request as ready for review November 21, 2024 01:52

kamath requested review from filip-michalsky, navidkpr and pkiv November 21, 2024 01:52

kamath changed the title ~~fix: openai example wasn't working~~ Add evals for each model Nov 21, 2024

rm console log anthropic

30b10a0

kamath requested a review from jeremypress November 21, 2024 01:59

pkiv approved these changes Nov 21, 2024

kamath merged commit 377d7d4 into main Nov 21, 2024

jeremypress reviewed Nov 21, 2024

kamath merged 3 commits into main from anirudh/fix/openai-example
