@yuliia-fryshko
Contributor

@yuliia-fryshko yuliia-fryshko commented Nov 5, 2025

Closes https://github.com/elastic/obs-ai-assistant-team/issues/373

This PR adds the evaluation ratings for Llama 4 Maverick based on evaluation results.

Screenshot 2025-11-06 at 16 34 42

I had to run the evaluation a couple of times, since the first run had some rate limit errors. It turns out that OpenRouter routes requests across multiple providers, each with different context limits - I added more details here.

In the latest run, I was lucky and didn’t get any errors.
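For future runs, a small retry wrapper might avoid having to rerun the whole evaluation by hand when a request is rejected. The following is only a minimal sketch, not part of this PR or of the evaluation framework: `RateLimitError`, `call_with_retries`, and `request_fn` are hypothetical names, and it assumes that a rejected request surfaces as an exception and that simply retrying later can succeed.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when a request is rejected (e.g. HTTP 429)."""


def call_with_retries(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on RateLimitError with exponential backoff.

    request_fn is any zero-argument callable that performs one request.
    The sleep parameter is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller.
                raise
            # Delay doubles on each attempt: base_delay, 2x, 4x, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Add up to 50% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Backoff only helps with transient rejections. If a request exceeds the context limit of every provider it can be routed to, retrying will not fix it.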

@yuliia-fryshko yuliia-fryshko requested a review from a team as a code owner November 5, 2025 18:12
@yuliia-fryshko yuliia-fryshko self-assigned this Nov 5, 2025
@github-actions

github-actions bot commented Nov 5, 2025

🔍 Preview links for changed docs

| OpenAI | **gpt-oss-20b** | Poor | Poor | Great | Poor | Good | Poor | Good | Good |
| OpenAI | **gpt-oss-120b** | Excellent | Poor | Great | Great | Excellent | Good | Good | Excellent |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Meta | **Llama-4-Maverick-17B** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |
Contributor

@viduni94 viduni94 Nov 5, 2025

Let's stick to the Hugging Face model name for consistency.

Suggested change
| Meta | **Llama-4-Maverick-17B** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |
| Meta | **Llama-4-Maverick-17B-128E-Instruct** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |

Contributor

@viduni94 viduni94 Nov 5, 2025

Most of the results are very different from Llama 3.3. Did you run into the token limit issue a lot?

Contributor

The screenshot says Excellent for contextual insights, but I noticed that it's specified as Good here.

Contributor

@viduni94 viduni94 left a comment

LGTM 🎉
Thanks @yuliia-fryshko

@viduni94 viduni94 changed the title from "[Obs AI Assistant] Add Llama 4 Maveric model ratings to the LLM performance matrix" to "[Obs AI Assistant] Add Llama 4 Maverick model ratings to the LLM performance matrix" Nov 6, 2025
Member

@pmoust pmoust left a comment

lgtm - thanks both for the work here and the reviews

@yuliia-fryshko yuliia-fryshko enabled auto-merge (squash) November 6, 2025 16:11
@yuliia-fryshko yuliia-fryshko merged commit ca595c0 into elastic:main Nov 6, 2025
5 of 6 checks passed