
engine: isolate buildkit client+session to each client #6806

Merged: 6 commits into dagger:main on Mar 6, 2024

Conversation

@sipsma (Contributor) commented Mar 2, 2024:

This is a spin-off of work on #6747, which has proven itself to be a fountain of side-quests all over the codebase. Splitting that effort up into separate PRs for everyone's sanity. There will be at least one more pre-req PR (for improving dagql ID digest performance to support walking IDs efficiently) before the "final act" of the actual user-facing features.

This change results in each client getting its own buildkit.Client and buildkit Session. As a side-effect, this results in each client also getting its own buildkit solver.Job since Job+Session are tied one-to-one in the buildkit implementation.

  • It turns out each Job in buildkit doesn't have much expensive work associated with initializing it or any state we want to be shared; all of that is in the solver.Solver which is still shared across the whole engine process.

The ultimate goal here is to scope session attachables (sockets, secrets, auth tokens, terminal endpoints, etc.) to each Function based on what it was passed as args (either directly or transitively as part of the DAG underlying an arg). That isn't fully implemented yet; even though each Function call now gets its own buildkit.Client, the session attachables are currently still shared amongst those clients (same as before). But this gets the buildkit-related groundwork+refactoring out of the way.

Implementation details:

  • Each schema.Query for a given module client is now stored in the ClientCallContext. Most of the state is still shared, except for the buildkit.Client, which is now set per Query object (a rough sketch follows this list).
  • This all required some rearranging of the BuildkitController, DaggerServer, APIServer, and buildkit.Client objects and their interactions. I also did some extra cleanup+consolidation while I was in the area.
    • schema.APIServer is no more; I consolidated it into server.DaggerServer. The split between the two didn't really make sense anymore after these changes (and not sure how much it ever did tbh, they were just two layers of HTTP servers). cc @vito @jedevc in case you are wondering where that code went 🙂
    • I fixed up the dumb hack of (ab)using the BuildkitController's Solve method to implement upstream cache exports; that's now done in the /shutdown endpoint. So now clients only ever interact with the Session method.
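For orientation, here is a rough sketch of the shape this lands on; the type and field names approximate the description above and are stand-ins, not copies of the actual code:

```go
package server

// Sketch only: placeholder for the real buildkit.Client, which now
// carries a per-client bk session and solver.Job.
type BuildkitClient struct{}

// Query is the per-client GraphQL root object; each module client
// gets its own instance when it connects.
type Query struct {
	Buildkit *BuildkitClient // created per client, torn down with it
	// heavier engine-wide state (solver.Solver, caches, content
	// store, ...) stays shared and is referenced, not duplicated
}

// ClientCallContext tracks a single client's calls into the server.
type ClientCallContext struct {
	Root *Query // this client's Query, with its own BuildkitClient
}
```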

There are a couple of minor TODO comments left to double-check a few things that might have subtleties, but this is good to review.

@sipsma sipsma requested review from vito and jedevc March 2, 2024 01:25
Review threads on core/query.go (outdated, resolved). The code under discussion:
// TODO: double check rm of this defer and explain
// saw panic in engine logs (though no test error?)
// https://github.com/golang/go/issues/19959
// defer flushWriter.Flush()
@sipsma (author) commented on the diff above:
cc @vito I saw panics in the engine logs from internal net/http code that seemed to be a result of this Flush running after next.ServeHTTP returned (which golang/go#19959 says is a no-no). I think/hope any flushing would be taken care of by net/http internally, but let me know if you see any problems with removing this defer.

@vito commented Mar 2, 2024:

Hmm I just debugged the same panic myself and it turned out to be user error - an endpoint I added (/trace, for exporting otel traces) wasn't synchronizing to ensure its writes were flushed before the handler returned. That makes me think it's worth double-checking that commenting this out isn't hiding a real issue. Given how much trouble we're already having with getting the last bits of progress out of a client that's shutting down, I'm pretty afraid of removing anything that resembles flushing. 😅

edit: just noticing the code I added to "fix" it was one big no-op yet I haven't seen the panic since, so maybe there's something hairier lurking here, i.e. a race condition.

@sipsma (author) replied:

The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as is.

Looking at the implementation, there is "flush on close" logic: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1745, and it does look like it should be called provided the conn has not been hijacked: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1874-L1879

We do hijack the conn in the case of Terminal, but at that point we have size limits on each websocket message that I suspect inherently mean we already aren't ever going to write a massive amount of data (e.g. https://github.com/sipsma/dagger/blob/1fefa4a198ffe2d533825006a10f6846f2422b99/core/terminal.go#L106-L106).
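To make the hazard concrete, here is a minimal sketch of the unsafe pattern versus a safe one; this is illustrative only and is not the actual flusher middleware from this codebase:

```go
package middleware

import (
	"net/http"
	"time"
)

// Unsafe: the goroutine can call Flush after ServeHTTP has returned,
// which golang/go#19959 describes as undefined behavior (net/http may
// have already recycled the response's bufio.Writer, hence the panic).
func unsafeFlusher(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if f, ok := w.(http.Flusher); ok {
			go func() { // nothing ties this goroutine to the request lifetime
				time.Sleep(10 * time.Millisecond)
				f.Flush()
			}()
		}
		next.ServeHTTP(w, r)
	})
}

// Safe: flushing happens before this handler (and therefore the
// outermost ServeHTTP) returns; net/http also flushes on close for
// non-hijacked conns, which is why dropping the defer should be OK.
func safeFlusher(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if f, ok := w.(http.Flusher); ok {
			f.Flush()
		}
	})
}
```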

@vito replied:

> The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as is.

That's absolutely true, but this is in a middleware chain; the inner ServeHTTP has completed, but not the outer one, right? The outer one is the one I would expect to matter.

@sipsma (author) replied:

Right, but does net/http know about all that? It seems like from its perspective we are just calling ServeHTTP, so it couldn't know that it should defer some behavior until the outermost ServeHTTP has returned, right? Unless there's some magic going on under the hood there.

@vito commented Mar 4, 2024:

I don't think it needs to know all that - from net/http's perspective there's just the "real" ServeHTTP that it's calling, and all this middleware is just a series of function calls running as part of that outer call, which happen to align on the ServeHTTP name for composability. So from net/http's perspective I would expect that defer to already run while the outermost ServeHTTP is running, which is all that matters.

At least, that's how middleware normally works - we're doing all sorts of extra effort, maybe wires are getting crossed somewhere in there?

Are you able to consistently repro an issue without this commented out?
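For reference, the pattern being described looks like this; a small illustrative sketch of middleware as plain nested function calls, where each defer unwinds before net/http regains control:

```go
package middleware

import (
	"log"
	"net/http"
)

// logDone wraps next; its defer fires when this function returns,
// i.e. while the outermost ServeHTTP call is still on the stack.
func logDone(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer log.Printf("handled %s", r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

// The chain is just nested calls: net/http invokes the outer handler,
// which invokes the inner one, and all defers run on the way back out.
var chain http.Handler = logDone(http.HandlerFunc(
	func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	},
))
```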

@sipsma (author) replied:

> Are you able to consistently repro an issue without this commented out?

Yeah, actually you can see it in CI on main too if you look at the engine logs. Here's one extracted from this run:

2024-03-04T12:36:30.9756009Z 393: [102.2s] time="2024-03-04T12:36:30Z" level=error msg="panic serving schema: runtime error: invalid memory address or nil pointer dereference goroutine 734584 [running]:
runtime/debug.Stack()
  /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.func2()
  /app/core/schema/schema.go:117 +0x77
panic({0x1bd4420?, 0x3134150?})
  /usr/local/go/src/runtime/panic.go:914 +0x21f
bufio.(*Writer).Flush(0xc006d0cc40?)
  /usr/local/go/src/bufio/bufio.go:636 +0x12
net/http.(*response).FlushError(0xc006d0cc40)
  /usr/local/go/src/net/http/server.go:1718 +0x36
net/http.(*response).Flush(0xc006d0cc01?)
  /usr/local/go/src/net/http/server.go:1711 +0x13
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.(*APIServer).ServeHTTP.flushAfterNBytes.func4.func5({0x228ba60?, 0xc006d0cc40}, 0xc000bdd801?)
  /app/core/schema/flusher.go:42 +0x20f
net/http.HandlerFunc.ServeHTTP(0xc01f96c5c0?, {0x228ba60?, 0xc006d0cc40?}, 0x2272e00?)
  /usr/local/go/src/net/http/server.go:2136 +0x29
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP(0xc00f1459a8, {0x228ba60?, 0xc006d0cc40}, 0xc0003cff00)
  /app/core/schema/schema.go:169 +0x4fd
github.com/dagger/dagger/engine/server.(*DaggerServer).ServeClientConn.(*DaggerServer).HTTPHandlerForClient.func3({0x228ba60, 0xc006d0cc40}, 0xc00d1a49a0?)
  /app/engine/server/server.go:194 +0x249
net/http.HandlerFunc.ServeHTTP(0x31976a0?, {0x228ba60?, 0xc006d0cc40?}, 0xc0028e9b50?)
  /usr/local/go/src/net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc0208fda40?}, {0x228ba60?, 0xc006d0cc40?}, 0x6?)
  /usr/local/go/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc025494990, {0x2295af8, 0xc0208fd980})
  /usr/local/go/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 734581
  /usr/local/go/src/net/http/server.go:3086 +0x5cb"

My understanding is that at the "bottom" of the ServeHTTP chain of middleware, there's the "real" ServeHTTP from the net/http implementation, and it's that one which has the rule of not writing to ResponseWriter after ServeHTTP has returned. So since these other wrappers are above that ServeHTTP call, they must also follow that rule.

@vito replied:

Hmm alright. Still having a hard time grokking where exactly Go's net/http code even comes into play here but don't have time for another rabbit hole. This behavior flies in the face of how I would expect middleware to work - it's weird that the request/response is effectively "dead" as soon as the innermost handler returns instead of respecting the wrapping scheme, where the middleware should all just be logical Go code running within the "real" ServeHTTP call that kicked off the whole request.

Either way it's broken as-is so I'm good with changing this for now. Hopefully I don't end up back in this spot when I'm troubleshooting progress draining.

@jedevc (Member) left a comment:

On reading through jobs.go in buildkit, I think this is fine 🤔 Looks like a nice refactor 🎉

Future note - we have to be really careful when we start splitting out the session. A super contrived example: if I create llb.Local with a session ID from one module, and consume it in another module, we'd need to make sure to add the required Session ID to that local op as it moves between sessions - or add the requested local name to the second session.

This applies to everything that can be attached to sessions (the above is contrived since we don't use llb.Local in this way today). These cases actually might be missed by our tests because they're affected by caching as well (if module 1 Syncs the output, then module 2 gets the cached result, even though it could have a different session). So we should probably make sure we are handling these test cases when we do this refactor.
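The contrived case might look like the following sketch; it uses buildkit's client/llb API, and the module/session wiring here is hypothetical, not how dagger constructs locals today:

```go
package example

import "github.com/moby/buildkit/client/llb"

// Module 1 pins a local source to its own session.
func module1Source(sessionA string) llb.State {
	return llb.Local("ctx", llb.SessionID(sessionA))
}

// If module 2 (running under session B) solves this state uncached,
// buildkit will try to fetch "ctx" from session A. Unless session A is
// still attached, or the "ctx" local dir is re-attached to session B,
// the solve fails - so the op's SessionID (or the second session's
// attachables) would need fixing up as states cross module boundaries.
func module2Consumes(src llb.State) llb.State {
	return llb.Image("alpine").
		Run(llb.Shlex("cp -r /src /out"), llb.AddMount("/src", src)).
		Root()
}
```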

Review threads on engine/server/server.go and engine/buildkit/client.go (outdated, resolved)
@sipsma (author) commented Mar 4, 2024:

> Future note - we have to be really careful when we start splitting out the session. A super contrived example: if I create llb.Local with a session ID from one module, and consume it in another module, we'd need to make sure to add the required Session ID to that local op as it moves between sessions - or add the requested local name to the second session.
>
> This applies to everything that can be attached to sessions (the above is contrived since we don't use llb.Local in this way today). These cases actually might be missed by our tests because they're affected by caching as well. So we should probably make sure we are handling these test cases when we do this refactor.

Yeah today we do put the SessionID in the local LLB (code) and then also avoid all of this entirely due to Blob-ification. So until all that changes I think we have multiple levels of safety guarding against this. But yeah if that changes sometime down the line it will indeed require careful handling.

Commit messages:

This is a pre-req for upcoming changes to support passing host sockets
to modules and improvements to secret isolation.

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
It seems like it's conclusively not safe to call this and it should be
taken care of internally by net/http anyways.

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
@sipsma merged commit e5f0bc7 into dagger:main on Mar 6, 2024
43 checks passed