Produce a session db for the evals #1583
Signed-off-by: David Gageot <david.gageot@docker.com>
Review Summary
Found one issue with timestamp handling in the new SessionFromEvents function. All reconstructed messages use time.Now() which assigns the database creation time rather than preserving when events actually occurred. This reduces the utility of the evaluation session database for timing analysis.
The core functionality looks good otherwise - the database creation, session reconstruction, and error handling are all working correctly.
    ToolDefinitions: currentToolDefinitions,
    CreatedAt:       time.Now().Format(time.RFC3339),
    Model:           currentModel,
    Usage:           currentUsage,
ISSUE: Message timestamps use time.Now() instead of preserving event timing
The SessionFromEvents function reconstructs sessions from historical evaluation events, but uses time.Now() for message timestamps. This means:
- All messages get the timestamp of when the database was created, not when events actually occurred
- Messages that happened seconds/minutes apart during execution will appear simultaneous
- Temporal ordering is lost, and latency analysis is no longer possible
This affects three locations:
- Line 92 (this location): Assistant messages in flushAssistantMessage()
- Line 161: Tool call response messages
- Line 201: Error messages
Since runtime events don't contain timestamp information (verified in pkg/runtime/event.go), consider one of the following:
- Adding timestamps to runtime events
- Using incrementing timestamps based on event sequence to preserve ordering
- Using the eval run start time plus an offset based on position
While this doesn't break functionality, it does reduce the debugging value of the eval session database for investigating timing and latency issues.
The issue is that our runtime events don't carry any timing information for now. I'll add this to my TODO list.
Now, when we run cagent eval, it'll produce a full session store with all the sessions. That makes it super easy to go back to an evaluation session and investigate what went right or wrong.