[Obs AI Assistant] Add Llama 4 Maverick model ratings to the LLM performance matrix #3825
Conversation
🔍 Preview links for changed docs
| OpenAI | **gpt-oss-20b** | Poor | Poor | Great | Poor | Good | Poor | Good | Good |
| OpenAI | **gpt-oss-120b** | Excellent | Poor | Great | Great | Excellent | Good | Good | Excellent |
| Meta | **Llama-3.3-70B-Instruct** | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Meta | **Llama-4-Maverick-17B** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |
Let's stick to the Hugging Face model name for consistency.
- | Meta | **Llama-4-Maverick-17B** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |
+ | Meta | **Llama-4-Maverick-17B-128E-Instruct** | Good | Good | Good | Excellent | Excellent | Good | Good | Great |
Most of the results are very different from Llama 3.3 - did you run into the token limit issue a lot?
The screenshot says Excellent for contextual insights, but I noticed it's specified as Good here.
viduni94 left a comment
LGTM 🎉
Thanks @yuliia-fryshko
pmoust left a comment
lgtm - thanks both for the work here and the reviews
Closes https://github.com/elastic/obs-ai-assistant-team/issues/373
This PR adds ratings for Llama 4 Maverick to the LLM performance matrix, based on the evaluation results.
I had to run the evaluation a couple of times, since the first run had some rate limit errors. It turns out that OpenRouter routes requests across multiple providers, each with different context limits - I added more details here.
In the latest run, I was lucky and didn’t get any errors.
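For anyone hitting the same context-limit errors: they can likely be avoided by pinning OpenRouter to a single provider, so every request gets the same context window. Below is a minimal sketch using OpenRouter's provider routing options; the model slug and provider name are illustrative assumptions, not necessarily what was used in this evaluation.

```typescript
// Minimal sketch: pin OpenRouter to one provider so requests aren't routed
// to providers with smaller context limits mid-evaluation.
// Assumptions: the model slug and provider name below are illustrative.
async function main() {
  const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'meta-llama/llama-4-maverick', // assumed slug
      provider: {
        order: ['DeepInfra'], // illustrative provider name
        allow_fallbacks: false, // fail fast instead of falling back to another provider
      },
      messages: [{ role: 'user', content: 'ping' }],
    }),
  });
  console.log(await response.json());
}

main();
```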