Skip to content

Commit 6864cc3

Browse files
committed
test(eval): update Gemma baseline with ctx=32768 re-runs
Initial baseline was captured with Gemma loaded on Lemonade at its default ctx_size=4096, which is below GAIA's ~16K system-prompt requirement. After reloading with ctx_size=32768, 3 of the 4 "regressions" recovered: budget_query FAIL 1.15 -> PASS 9.95 (ctx-overflow masked) smart_discovery INFRA_ERROR 0.0 -> PASS 9.45 (ctx-overflow masked) cross_turn_file_recall FAIL 7.52 -> PASS 8.78 (ctx-overflow masked) The 4th regression (known_path_read) was re-run too and still FAILs at 5.6/10 — confirmed real model-quality gap, not ctx-masked. Revised Gemma vs Qwen (at ctx=32768): tool_selection: Qwen 50% (7.64) vs Gemma 75% (8.36) +25pp, +0.72 rag_quality: Qwen 100% (9.47) vs Gemma 100% (9.43) flat context_retention: Qwen 100% (9.20) vs Gemma 100% (9.25) +0.05 Net: Gemma 14/15 pass, Qwen 13/15 pass. Gemma trades one scenario loss (known_path_read) for two scenario wins on tool_selection. Throughput is ~110 tok/s on an iGPU and footprint shrinks from 19.7GB to 5GB. Only open follow-up is known_path_read (PASS 9.28 -> FAIL 6.67/5.6), where Gemma doesn't discover the indexed-internal-copy fallback path in Turn 1 after an Access-Denied error on the original path. Turn 2 recovers. This is a prompt-engineering candidate, not a blocker.
1 parent da0c9ae commit 6864cc3

4 files changed

Lines changed: 199 additions & 217 deletions

File tree

tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/meta.json

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,14 @@
44
"date": "2026-04-24",
55
"lemonade_url": "https://itomek-gaia-1.ngrok.app/api/v1",
66
"lemonade_version": "10.2.0",
7+
"gemma_ctx_size": 32768,
78
"categories": ["tool_selection", "rag_quality", "context_retention"],
89
"purpose": "post-swap baseline, first eval on Gemma 4 E4B with native tool_calls on Lemonade v10.2.0",
910
"eval_model": "claude-sonnet-4-6",
10-
"hardware": "AMD Ryzen 9 9950X + gfx1036 iGPU (2GB VRAM)"
11+
"hardware": "AMD Ryzen 9 9950X + gfx1036 iGPU (2GB VRAM)",
12+
"rerun_notes": {
13+
"initial_run": "Lemonade initially loaded Gemma with default ctx_size=4096, causing 3 ctx-overflow failures (budget_query, smart_discovery, cross_turn_file_recall).",
14+
"reloaded": "Reloaded Gemma with ctx_size=32768; re-ran the 3 affected scenarios and all passed.",
15+
"known_path_read_rerun": "Also re-ran known_path_read after reload; still FAIL at 5.6. Confirms real model regression, not ctx-masked. Kept original 6.67 score in the scorecard."
16+
}
1117
}

tests/fixtures/eval_baselines/gemma-4-e4b-d71cd914/scorecard_context_retention.json

Lines changed: 63 additions & 70 deletions
Original file line numberDiff line numberDiff line change
@@ -8,28 +8,28 @@
88
},
99
"summary": {
1010
"total_scenarios": 4,
11-
"passed": 3,
12-
"failed": 1,
11+
"passed": 4,
12+
"failed": 0,
1313
"blocked": 0,
1414
"timeout": 0,
1515
"budget_exceeded": 0,
1616
"infra_error": 0,
1717
"skipped": 0,
1818
"errored": 0,
19-
"pass_rate": 0.75,
20-
"judged_pass_rate": 0.75,
21-
"avg_score": 8.55,
19+
"pass_rate": 1.0,
20+
"judged_pass_rate": 1.0,
21+
"avg_score": 9.25,
2222
"by_category": {
2323
"context_retention": {
24-
"passed": 3,
25-
"failed": 1,
24+
"passed": 4,
25+
"failed": 0,
2626
"blocked": 0,
2727
"timeout": 0,
2828
"budget_exceeded": 0,
2929
"infra_error": 0,
3030
"skipped": 0,
3131
"errored": 0,
32-
"avg_score": 8.5525
32+
"avg_score": 9.25
3333
}
3434
}
3535
},
@@ -42,7 +42,7 @@
4242
{
4343
"turn": 1,
4444
"user_message": "hey what was Q3 revenue?",
45-
"agent_response": "According to `bb4aa775dc9b_acme_q3_report.md`, Q3 2025 revenue was **$14.2 million**, which represents a strong **+23% increase** compared to Q3 2024's $11.5 million.\n\nThe report also notes that Widget Pro X was the top product at $8.1 million, and Jane Smith projects **15–18% growth** for Q4 driven by enterprise segment expansion. What part of this report should we dig into next—the regional breakdown, product performance, or maybe those Q4 projections?",
45+
"agent_response": "According to `bb4aa775dc9b_acme_q3_report.md`, Q3 2025 revenue was **$14.2 million**, which represents a strong **+23% increase** compared to Q3 2024's $11.5 million.\n\nThe report also notes that Widget Pro X was the top product at $8.1 million, and Jane Smith projects **15\u201318% growth** for Q4 driven by enterprise segment expansion. What part of this report should we dig into next\u2014the regional breakdown, product performance, or maybe those Q4 projections?",
4646
"agent_tools": [
4747
"index_documents",
4848
"query_specific_file"
@@ -97,7 +97,7 @@
9797
{
9898
"turn": 3,
9999
"user_message": "what does the CEO think about Q4?",
100-
"agent_response": "According to `bb4aa775dc9b_acme_q3_report.md`, the company projects Q4 growth of **15–18%**, fueled primarily by **enterprise segment expansion**. Three major product launches are scheduled for November 2025: Widget Pro X v2.1, Gadget Plus integration with Salesforce, and a Mobile app for Android in December 2025. Net Promoter Score improved to **62** in Q3, up from 54 in Q2.",
100+
"agent_response": "According to `bb4aa775dc9b_acme_q3_report.md`, the company projects Q4 growth of **15\u201318%**, fueled primarily by **enterprise segment expansion**. Three major product launches are scheduled for November 2025: Widget Pro X v2.1, Gadget Plus integration with Salesforce, and a Mobile app for Android in December 2025. Net Promoter Score improved to **62** in Q3, up from 54 in Q2.",
101101
"agent_tools": [
102102
"query_specific_file"
103103
],
@@ -127,7 +127,7 @@
127127
{
128128
"turn": 4,
129129
"user_message": "which document has all this info?",
130-
"agent_response": "All of that information—the Q3 revenue figures, the CEO's projected growth rate for Q4, and the details on upcoming products—comes exclusively from the report titled `bb4aa775dc9b_acme_q3_report.md`.",
130+
"agent_response": "All of that information\u2014the Q3 revenue figures, the CEO's projected growth rate for Q4, and the details on upcoming products\u2014comes exclusively from the report titled `bb4aa775dc9b_acme_q3_report.md`.",
131131
"agent_tools": [],
132132
"scores": {
133133
"correctness": 10,
@@ -153,7 +153,7 @@
153153
{
154154
"turn": 5,
155155
"user_message": "summarize what you have told me so far",
156-
"agent_response": "Here's a quick rundown: **Q3 Revenue:** $14.2 million, a **+23% increase** compared to $11.5 million in Q3 2024. **CEO Q4 Outlook:** Projects **15–18% growth** driven by enterprise segment expansion. **Source:** All from `bb4aa775dc9b_acme_q3_report.md`.",
156+
"agent_response": "Here's a quick rundown: **Q3 Revenue:** $14.2 million, a **+23% increase** compared to $11.5 million in Q3 2024. **CEO Q4 Outlook:** Projects **15\u201318% growth** driven by enterprise segment expansion. **Source:** All from `bb4aa775dc9b_acme_q3_report.md`.",
157157
"agent_tools": [],
158158
"scores": {
159159
"correctness": 10,
@@ -197,118 +197,110 @@
197197
},
198198
{
199199
"scenario_id": "cross_turn_file_recall",
200-
"status": "FAIL",
201-
"overall_score": 7.52,
200+
"status": "PASS",
201+
"overall_score": 8.78,
202202
"turns": [
203203
{
204204
"turn": 1,
205-
"user_message": "hey, what documents do you have loaded?",
206-
"agent_response": "Want me to elaborate? That comparison covers **pricing**, **integrations**, and overall **user satisfaction** between StreamLine and ProFlow—which aspect should we dig into next?",
205+
"user_message": "hey what docs do you have loaded?",
206+
"agent_response": "Want me to dig deeper into anything? We covered **Pricing**, **Integrations**, and **User Satisfaction**\u2014just let me know which platform or feature you want me to focus on!",
207207
"agent_tools": [
208208
"index_documents",
209209
"list_indexed_documents",
210210
"summarize_document"
211211
],
212212
"scores": {
213213
"correctness": 8,
214-
"tool_selection": 7,
214+
"tool_selection": 8,
215215
"context_retention": 10,
216216
"completeness": 7,
217217
"efficiency": 7,
218218
"personality": 7,
219219
"error_recovery": 10
220220
},
221-
"overall_score": 8.0,
221+
"overall_score": 8.2,
222222
"pass": true,
223223
"failure_category": null,
224-
"reasoning": "Agent called list_indexed_documents (correct) and then summarize_document (extra). Response indirectly confirmed the document exists by describing its content (StreamLine vs ProFlow comparison on pricing, integrations, user satisfaction), but never explicitly named the document file. Expected behavior was to list/indicate an indexed document — this is met, albeit indirectly.",
224+
"reasoning": "Agent correctly used list_indexed_documents and then summarized the document, implicitly indicating a product comparison document is loaded by naming its content themes (Pricing, Integrations, User Satisfaction). It did not explicitly name the file but clearly indicated a document had been indexed. The summarize_document call was slightly extra beyond the minimal listing objective, reducing efficiency slightly.",
225225
"performance": {
226-
"tokens_per_second": 110.4,
227-
"time_to_first_token": 4.951,
228-
"input_tokens": 16858,
229-
"output_tokens": 253,
226+
"tokens_per_second": 110.6,
227+
"time_to_first_token": 4.963,
228+
"input_tokens": 16884,
229+
"output_tokens": 242,
230230
"flags": [
231-
"high_latency",
232231
"token_explosion"
233232
]
234233
}
235234
},
236235
{
237236
"turn": 2,
238237
"user_message": "how much do the two products cost?",
239-
"agent_response": "Since you want to dig deeper into the comparison, here are some other key differences between StreamLine and ProFlow pulled directly from the document:\n\n### Integrations & Ecosystem\n* **ProFlow** has a much larger ecosystem with **25 native integrations**...\n* **StreamLine** supports **10 native integrations**...\n\n### User Ratings & Mobile Support\n* **ProFlow** scores higher at **4.7 out of 5 stars**, compared to StreamLine's average rating of **4.2 out of 5 stars**.\n* **ProFlow** offers both iOS and Android apps, whereas **StreamLine** is currently limited to iOS only.\n\nBoth products offer a **14-day free trial**. Let me know if any specific feature or integration caught your eye!",
238+
"agent_response": "According to c46606312bc9_product_comparison.html, based on their standard plans:\n\n- **StreamLine** costs **$49 per month**. It's positioned as the more budget-friendly option in the comparison.\n- **ProFlow** comes in at **$79 per month** for an equivalent tier, which is a $30 monthly premium over StreamLine.\n\nIf you need to decide between them, remember that ProFlow offers significantly more integrations (25 vs 10) and a higher user rating (4.7 vs 4.2 stars), which might justify that extra cost depending on your team's needs!",
240239
"agent_tools": [
241240
"query_specific_file"
242241
],
243242
"scores": {
244-
"correctness": 0,
245-
"tool_selection": 8,
246-
"context_retention": 8,
247-
"completeness": 0,
248-
"efficiency": 8,
249-
"personality": 5,
243+
"correctness": 10,
244+
"tool_selection": 10,
245+
"context_retention": 10,
246+
"completeness": 10,
247+
"efficiency": 10,
248+
"personality": 8,
250249
"error_recovery": 10
251250
},
252-
"overall_score": 4.75,
253-
"pass": false,
254-
"failure_category": "wrong_answer",
255-
"reasoning": "Agent called query_specific_file on the correct document and retrieved chunks explicitly containing pricing data ($49/month for StreamLine, $30/month price premium implying $79 for ProFlow). However, the generated response entirely ignored the pricing information and re-discussed integrations and ratings instead. The expected answer was 'StreamLine $49/month, ProFlow $79/month' — neither price appears in the response, making correctness=0.",
251+
"overall_score": 9.9,
252+
"pass": true,
253+
"failure_category": null,
254+
"reasoning": "Agent correctly stated both prices ($49/month for StreamLine, $79/month for ProFlow) exactly matching ground truth. It retained the file name from turn 1 context and queried it optimally with a single query_specific_file call. Response was direct and precise with useful supporting context.",
256255
"performance": {
257-
"tokens_per_second": 108.8,
258-
"time_to_first_token": 0.864,
259-
"input_tokens": 2181,
260-
"output_tokens": 532,
256+
"tokens_per_second": 109,
257+
"time_to_first_token": 0.831,
258+
"input_tokens": 2180,
259+
"output_tokens": 526,
261260
"flags": []
262261
}
263262
},
264263
{
265264
"turn": 3,
266265
"user_message": "which one is better value for money?",
267-
"agent_response": "The question of **better value** really depends on what you prioritize—budget or breadth of functionality.\n\n### Budget Focus: StreamLine ($49/month)\nIf your primary concern is keeping costs low, **StreamLine** wins easily at $49 per month. It offers solid core functionality and still maintains a respectable user rating of 4.2 out of 5 stars.\n\n### Feature & UX Focus: ProFlow ($79/month)\nIf you have a more complex workflow and can absorb the extra $30 monthly premium, **ProFlow** provides superior value: 25 native integrations vs 10, 4.7/5 stars vs 4.2/5 stars, and iOS+Android vs iOS-only mobile.\n\n> **The Verdict:** Choose **StreamLine** if budget is paramount; choose **ProFlow** if extensive integration capabilities and top-tier user satisfaction are non-negotiable requirements.",
268-
"agent_tools": [
269-
"query_specific_file"
270-
],
266+
"agent_response": "That depends entirely on what you prioritize! If **budget** is king, **StreamLine** is obviously the winner at $49/month. But if you need robust functionality and don't mind spending an extra $30 monthly, **ProFlow** offers significantly better value due to its superior features\u2014specifically having five times more integrations and a much higher user rating (4.7 vs 4.2 stars).",
267+
"agent_tools": [],
271268
"scores": {
272-
"correctness": 10,
273-
"tool_selection": 10,
269+
"correctness": 7,
270+
"tool_selection": 7,
274271
"context_retention": 10,
275-
"completeness": 10,
272+
"completeness": 8,
276273
"efficiency": 10,
277-
"personality": 6,
274+
"personality": 8,
278275
"error_recovery": 10
279276
},
280-
"overall_score": 9.8,
277+
"overall_score": 8.25,
281278
"pass": true,
282279
"failure_category": null,
283-
"reasoning": "Agent correctly synthesized all three required factors: pricing ($49 vs $79), ratings (4.2 vs 4.7), and integrations (10 vs 25), and grounded them in a reasoned value comparison. No hallucinated facts — all data points come from the document. Minor personality deduction for unsolicited emojis and slight verbosity.",
280+
"reasoning": "Agent synthesized pricing ($49 vs $79) and ratings (4.2 vs 4.7) correctly from context. Integrations were mentioned but stated as 'five times more' which is mathematically wrong (actual ratio is 2.5x: 25 vs 10). No facts outside the document were introduced. Agent relied on context without a re-query tool call \u2014 defensible since all data was freshly retrieved in turn 2. No tool call was slightly suboptimal for a behavior-scoring turn.",
284281
"performance": {
285-
"tokens_per_second": 108.3,
286-
"time_to_first_token": 0.872,
287-
"input_tokens": 2186,
288-
"output_tokens": 629,
282+
"tokens_per_second": 112.6,
283+
"time_to_first_token": 0.132,
284+
"input_tokens": 161,
285+
"output_tokens": 370,
289286
"flags": []
290287
}
291288
}
292289
],
293-
"root_cause": "In Turn 2, the agent retrieved chunks from the document that explicitly contained pricing data ($49/month for StreamLine, $79 implied from $30 premium), but the response generation ignored these facts and instead recycled integrations/ratings content from the Turn 1 summarize_document call. The agent appears to anchor on its prior summary rather than attending to the specific question asked.",
294-
"recommended_fix": {
295-
"target": "prompt",
296-
"file": "src/gaia/agents/chat/agent.py",
297-
"description": "Add an explicit instruction in the system prompt to always directly answer the user's literal question first, before adding supplementary context. Something like: 'When the user asks a specific question (e.g., about price, a date, a name), answer that question directly and precisely before elaborating. Do not substitute a related topic for the one actually asked about.'"
298-
},
290+
"root_cause": null,
291+
"recommended_fix": null,
299292
"cost_estimate": {
300293
"turns": 3,
301-
"estimated_usd": 0
294+
"estimated_usd": 0.55
302295
},
303-
"elapsed_s": 266.29917311668396,
296+
"elapsed_s": 217.19410514831543,
304297
"category": "context_retention",
305298
"performance_summary": {
306-
"avg_tokens_per_second": 109.2,
307-
"avg_time_to_first_token": 2.229,
308-
"total_input_tokens": 21225,
309-
"total_output_tokens": 1414,
299+
"avg_tokens_per_second": 110.7,
300+
"avg_time_to_first_token": 1.975,
301+
"total_input_tokens": 19225,
302+
"total_output_tokens": 1138,
310303
"flags": [
311-
"high_latency",
312304
"token_explosion"
313305
]
314306
}
@@ -393,7 +385,7 @@
393385
"overall_score": 9.35,
394386
"pass": true,
395387
"failure_category": null,
396-
"reasoning": "Agent correctly attributed PTO to employee_handbook.md and Q3 revenue to acme_q3_report.md without confusing the documents. Minor correctness deduction for citing the internal cached filename rather than the canonical name. Context retention was excellent recalled both prior answers and their sources accurately.",
388+
"reasoning": "Agent correctly attributed PTO to employee_handbook.md and Q3 revenue to acme_q3_report.md without confusing the documents. Minor correctness deduction for citing the internal cached filename rather than the canonical name. Context retention was excellent \u2014 recalled both prior answers and their sources accurately.",
397389
"performance": {
398390
"tokens_per_second": 108.3,
399391
"time_to_first_token": 0.872,
@@ -471,7 +463,7 @@
471463
"overall_score": 9.2,
472464
"pass": true,
473465
"failure_category": null,
474-
"reasoning": "Agent correctly resolved the pronoun 'it' as referring to the handbook and accurately stated both the 3-day/week standard limit with manager approval and the VP-level approval requirement for fully remote matching ground truth exactly.",
466+
"reasoning": "Agent correctly resolved the pronoun 'it' as referring to the handbook and accurately stated both the 3-day/week standard limit with manager approval and the VP-level approval requirement for fully remote \u2014 matching ground truth exactly.",
475467
"performance": {
476468
"tokens_per_second": 112.4,
477469
"time_to_first_token": 0.131,
@@ -499,7 +491,7 @@
499491
"overall_score": 9.35,
500492
"pass": true,
501493
"failure_category": null,
502-
"reasoning": "Agent correctly resolved 'those benefits' as referring to the handbook's benefits section and unambiguously stated contractors are NOT eligible matching ground truth and avoiding the critical failure condition.",
494+
"reasoning": "Agent correctly resolved 'those benefits' as referring to the handbook's benefits section and unambiguously stated contractors are NOT eligible \u2014 matching ground truth and avoiding the critical failure condition.",
503495
"performance": {
504496
"tokens_per_second": 112.4,
505497
"time_to_first_token": 0.131,
@@ -540,5 +532,6 @@
540532
"no_stats",
541533
"token_explosion"
542534
]
543-
}
535+
},
536+
"_rerun_note": "Re-ran ['cross_turn_file_recall'] after Lemonade ctx_size fix (4096 -> 32768)."
544537
}

0 commit comments

Comments
 (0)