feat: agents desktop recordings backend #23894
Conversation
Force-pushed 0116573 to db2414f
// MaxRecordingSize is the largest desktop recording (in bytes)
// that will be accepted. Used by both the agent-side stop handler
// and the server-side storage pipeline.
const MaxRecordingSize = 100 << 20 // 100 MB
How large are these? Do they ever actually reach 100 MB?
in my manual testing it's ~150 kB/second, so you'd need a 10-12 minute recording to get to 100 MB. These videos are massively sped up though, so I think the agent would need to work for hours and hours to actually produce a 100 MB video.
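A quick sanity check of that estimate (the constant and helper names here are invented for this sketch, not from the PR):

```go
package main

import "fmt"

// maxRecordingSize matches the 100 MB cap in the diff above.
const maxRecordingSize = 100 << 20

// minutesToCap returns how many minutes of recording fit under the cap
// at the given byte rate.
func minutesToCap(bytesPerSecond float64) float64 {
	return float64(maxRecordingSize) / bytesPerSecond / 60
}

func main() {
	// ~150 kB/s was observed in manual testing per the comment above.
	fmt.Printf("%.1f minutes\n", minutesToCap(150*1024))
}
```

At ~150 kB/s the cap is reached after roughly 11.4 minutes of raw output, consistent with the 10-12 minute estimate.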
mafredri
left a comment
Solid approach to the recording lifecycle. The three-state idempotency in StartRecording, stopOnce coordination, and the idle monitor's timer-reset-with-remaining-budget are well-engineered. The decision to leave recordings running on timeout for seamless continuation is a good design choice.
Severity count: 1 P1, 6 P2, 4 P3. No P0s.
The P1 is an authorization ordering issue: recordings start on a target agent before verifying the caller owns the target chat. The P2s cluster around two themes: (1) the stop/store pipeline sharing timeouts and buffering 100MB in memory, and (2) the API mutex held during blocking Desktop calls. All have straightforward fixes.
As Killua put it: "The decision to leave recordings running across wait_agent timeouts and resume seamlessly on the next call is well thought out. Huh, that's clean."
🤖 This review was automatically generated with Coder Agents.
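The three-state idempotency the summary praises isn't shown in the thread; as a rough illustration only, a minimal sketch of the idea (the Recorder type and state names are invented here, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// recState models a three-state start lifecycle: repeated StartRecording
// calls are no-ops while a recording is starting or active.
type recState int

const (
	stateIdle recState = iota
	stateStarting
	stateActive
)

type Recorder struct {
	mu    sync.Mutex
	state recState
	id    string
}

// StartRecording starts a new recording only from the idle state and
// otherwise returns the in-flight recording's ID, making retries safe.
func (r *Recorder) StartRecording(id string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.state != stateIdle {
		return r.id, nil // idempotent: reuse the existing recording
	}
	r.state = stateActive // a real implementation would pass through stateStarting
	r.id = id
	return id, nil
}

func main() {
	r := &Recorder{}
	first, _ := r.StartRecording("rec-1")
	second, _ := r.StartRecording("rec-2") // ignored: already active
	fmt.Println(first, second)
}
```

This mirrors the later reply that repeated StartRecording calls in short succession are idempotent.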
agent/x/agentdesktop/api.go
Outdated
	return
}

a.mu.Lock()
P2 handleRecordingStart holds a.mu for the entire handler, including blocking Desktop calls. (Meruem P2, Killua P2, Mafuuu P2, Kurapika P3)
a.mu.Lock() at line 566 with defer. While held, it calls a.desktop.Start(ctx) (line 576, process spawn on a cold desktop) and a.desktop.StartRecording(ctx, recordingID) (line 584, process spawn). Meanwhile, every handleAction call blocks at line 143 waiting to update lastDesktopActionAt.
On a warm desktop the lock hold time is <10ms. But if the desktop process died and needs respawning, Start can block for hundreds of milliseconds to seconds. During that time, every mouse/keyboard/screenshot action from the computer-use agent stalls.
Fix: acquire a.mu only for the a.closed check and idle monitor map insert. Call desktop.Start and desktop.StartRecording outside the lock. The Desktop has its own lock.
a.mu.Lock()
if a.closed {
	a.mu.Unlock()
	// return error
}
a.mu.Unlock()
a.desktop.Start(ctx)
a.desktop.StartRecording(ctx, recordingID)
a.mu.Lock()
// start idle monitor
a.mu.Unlock()

🤖
This is intentional. Holding the mutex for the entire duration of handleRecordingStart makes it easier to reason whether the code is concurrency-safe. In a previous iteration the agent was being clever with mutex usage and that led to subtle race conditions. Won't change.
I would be more comfortable changing this behavior tbh. This can have repercussions all the way to shutting down the workspace. By keeping this server blocked from shutting down there's a chance we prevent proper workspace shutdown as well.
What are we actually guarding against here? Multiple Starts? An implementation that handles that could be fairly simple; in addition to what the agent suggested above, something like:
a.mu.Lock()
if a.closed {
a.mu.Unlock()
return
}
if ch := a.starting; ch != nil {
a.mu.Unlock()
<-ch
// do something?, a.mu.Lock(), return, etc.
} else {
starting := make(chan struct{})
defer close(starting)
a.starting = starting
}
a.mu.Unlock()
// start...

This is just off the top of my head without fully understanding the goal of this design.
We should always be very careful about locking interactions with Close, it should be possible to Close whenever without waiting for long-running operations.
What are we actually guarding against here?
We're guarding against concurrent StartRecording and StopRecording calls. Holding a mutex for the entire duration of these functions makes it simple to reason about whether their interactions are safe.
It's true that this blocks all other desktop operations including Close, but I believe the simplicity of the code is worth the tradeoff. Is it such a big deal that in the worst-case (quite rare) scenario, where the workspace agent is shut down at the exact moment a desktop recording is being started, Close is delayed by a second?
If we implemented your suggestion, I'm also unsure how Close should work for processes in the starting state. Would it just leave them hanging?
Force-pushed db2414f to 5e16133
mafredri
left a comment
Round 2. 9 of 11 R1 findings resolved. The P1 (auth ordering) fix is correct: isSubagentDescendant now runs before any agent connection or recording start. The semaphore (cap 25) and 90s timeout address the memory and timing concerns.
Two R1 findings were contested (mutex scope, temp file lifecycle). Both defenses are reasonable given the current usage pattern.
Three new findings this round, all minor. The most notable gap is the portableDesktop recording logic (200+ lines) having zero unit tests at the implementation layer. The rest is cosmetic.
Severity count: 0 P0, 0 P1, 1 P2, 2 P3.
As Bisky noted about the idle timeout test: "TestRecordingIdleTimeoutResetsOnAction is especially well-crafted: quartz traps, precise clock advancement, verified both the 'reset' and 'eventual fire' branches. These are real gems."
🤖 This review was automatically generated with Coder Agents.
Force-pushed 5e16133 to 51d8d08
mafredri
left a comment
Cool feature 👍🏻. I do have some concerns about the agentdesktop API mutex design though for two reasons:
- They may block workspace agent shutdown
- They're called from chatd and may block there as well
The portabledesktop implementation has similar concerns, but I did not go through those as thoroughly.
I think the mutex design can easily be made safer.
agent/x/agentdesktop/api.go
Outdated
	return
}

if err := a.desktop.StartRecording(ctx, recordingID); err != nil {
What if we clobber the recordingID here? E.g. network request intermittent failure, first one went through, then second one also goes through.
Shouldn't we have a check earlier before we retry Start + StartRecording?
What if we clobber the recordingID here?
What do you mean? I'm not sure if this answers your question, but repeated calls in short succession to StartRecording will be idempotent.
agent/x/agentdesktop/api.go
Outdated
	return
}

if info.Size() > workspacesdk.MaxRecordingSize {
Do we actually have any protection that stops recording once we hit max size? Should we consider returning a truncated video?
We don't. I'd consider this a possible future improvement. The portabledesktop cli would need support for truncation.
Force-pushed e172782 to e6fae35
mafredri
left a comment
Round 5, post-refactor. The restructuring is a significant improvement: recording management consolidated in portableDesktop under a single mutex, the API layer is thin, and the RecordingArtifact abstraction cleanly separates file ownership from the recording lifecycle. Most of the human review findings from R4 are resolved.
Severity count: 0 P0, 0 P1, 5 P2, 3 P3.
The most interesting new finding: on Windows, interruptRecordingProcess calls Kill() but the caller never marks rec.killed = true, so every graceful stop silently serves a corrupt MP4 as valid. Currently dormant (X11 required), but structurally wrong. As Hisoka put it: "The code presents one face (interruptRecordingProcess means 'graceful') and hides another (Kill() means 'destroyed'). What happens when someone notices the videos don't play?"
Gon identified a naming collision worth fixing: RecordActivity() in a recording package reads as "start recording the activity" rather than "note that activity happened."
A few Gon observations not posted inline: the 15s/30s/90s timeout chain across three files forms an undocumented dependency. If any timeout is changed independently, the chain breaks silently.
🤖 This review was automatically generated with Coder Agents.
lastAction := time.Unix(0, lastNano)
elapsed := p.clock.Since(lastAction)
if elapsed >= idleTimeout {
	p.mu.Lock()
P2 Idle monitor holds p.mu for up to 15 seconds during graceful stop, blocking Close() and delaying workspace shutdown. (Mafuuu P2)
monitorRecordingIdle (line 736) acquires p.mu, then calls lockedStopRecordingProcess(ctx, rec, false). The false flag means SIGINT followed by a 15-second timer wait. p.mu is held for the entire duration. Close() needs p.mu.Lock() to begin cleanup. If the idle monitor is mid-stop, Close is blocked for up to 15 seconds.
The prior defense (#comment-3028467808) argued "delayed by a second" was acceptable. That was about Start. This is a 15-second graceful SIGINT stop triggered by a background goroutine. The idle monitor knows nobody is actively waiting for a clean artifact, so force=true (or releasing mu before the wait) would avoid blocking Close.
🤖
Waiting on these processes is essential to create valid recordings. If we used force=true, recordings would get corrupted. It's possible that an idle recording will be read by a future tool call.
Force-pushed 4bf9386 to 31df469
mafredri
left a comment
Thanks for the refactor, looks a lot better 👍🏻. I flagged a few more/new things, but I don't foresee fixing them causing any issues.
	}
}
return active
}
This works, so consider this minor bike-shedding (optional), but each active recording could instead increment and decrement an int (e.g. atomic.Add*, etc.).
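A sketch of that suggestion using Go 1.19's atomic.Int64 (the tracker type is invented for illustration):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tracker counts active recordings with an atomic counter instead of
// scanning a map on every query.
type tracker struct {
	active atomic.Int64
}

func (t *tracker) recordingStarted() { t.active.Add(1) }
func (t *tracker) recordingStopped() { t.active.Add(-1) }

func main() {
	var t tracker
	t.recordingStarted()
	t.recordingStarted()
	t.recordingStopped()
	fmt.Println(t.active.Load())
}
```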
OrganizationID: ws.OrganizationID,
Name: fmt.Sprintf("recording-%s.mp4", p.clock.Now().UTC().Format("2006-01-02T15-04-05Z")),
Mimetype: "video/mp4",
Data: data,
Wish we would switch from lib/pq to pgx so we could stream the data rather than allocating it upfront. 🥲
Force-pushed bbe92eb to 96af52e
Force-pushed 96af52e to 88fcec6
This PR introduces screen recording of the computer use agent using the virtual desktop.
- Recording is started by a wait_agent tool call. Recording is stopped by a successful wait_agent tool call or when there hasn't been any desktop activity for 10 minutes.
- Recording is done by the portabledesktop cli via the record command. The videos are sped up in periods of inactivity.
- Recordings are stored in the chat_files table. There's a hard limit of 100 MB per recording. Larger recordings are dropped.
- wait_agent on a computer use subagent tool call returns a recording_file_id, later allowing the frontend to display the corresponding video.
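The hard-limit rule from the description, as a minimal sketch (withinLimit is an invented helper, not the PR's code; the real pipeline stores accepted recordings in the chat_files table):

```go
package main

import "fmt"

// MaxRecordingSize is the 100 MB hard limit from the PR description.
const MaxRecordingSize = 100 << 20

// withinLimit sketches the drop rule: recordings over the hard limit
// are rejected rather than stored.
func withinLimit(sizeBytes int64) bool {
	return sizeBytes <= MaxRecordingSize
}

func main() {
	fmt.Println(withinLimit(50<<20), withinLimit(101<<20))
}
```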