engine: isolate buildkit client+session to each client #6806
Conversation
engine/server/flusher.go (outdated):

// TODO: double check rm of this defer and explain
// saw panic in engine logs (though no test error?)
// https://github.com/golang/go/issues/19959
// defer flushWriter.Flush()
cc @vito I saw panics in the engine logs from internal net/http code that seemed to be a result of this Flush running after next.ServeHTTP returned (which golang/go#19959 says is a no-no). I think/hope any flushing would be taken care of by net/http internally, but let me know if you see any problems with removing this defer.
Hmm I just debugged the same panic myself and it turned out to be user error - an endpoint I added (/trace, for exporting otel traces) wasn't synchronizing to ensure its writes were flushed before the handler returned. That makes me think it's worth double-checking that commenting this out isn't hiding a real issue. Given how much trouble we're already having with getting the last bits of progress out of a client that's shutting down, I'm pretty afraid of removing anything that resembles flushing. 😅
edit: just noticed the code I added to "fix" it was one big no-op, yet I haven't seen the panic since, so maybe there's something hairier lurking here, i.e. a race condition.
The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as-is.
Looking at the implementation, there is "flush on close" logic: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1745, and it does look like it should be called provided the conn has not been hijacked: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1874-L1879
We do hijack the conn in the case of Terminal, but at that point we have size limits on each websocket message that I suspect inherently mean we already aren't ever going to write a massive amount of data (e.g. https://github.com/sipsma/dagger/blob/1fefa4a198ffe2d533825006a10f6846f2422b99/core/terminal.go#L106-L106).
> The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as is.

That's absolutely true, but this is in a middleware chain; the inner ServeHTTP has completed, but not the outer one, right? The outer one is the one I would expect to matter.
Right, but does net/http know about all that? It seems like from its perspective we are just calling ServeHTTP, so it couldn't know that it should defer some behavior until the outermost ServeHTTP has returned, right? Unless there's some magic going on under the hood there.
I don't think it needs to know all that - from net/http's perspective there's just the "real" ServeHTTP that it's calling, and all this middleware is just a series of function calls running as part of that outer call, which happen to align on the ServeHTTP name for composability. So from net/http's perspective I would expect that defer to already run while the outermost ServeHTTP is running, which is all that matters.
At least, that's how middleware normally works - we're doing all sorts of extra effort, maybe wires are getting crossed somewhere in there?
Are you able to consistently repro an issue without this commented out?
> Are you able to consistently repro an issue without this commented out?
Yeah, actually you can see it in CI on main too if you look at the engine logs. Here's one extracted out from this run:
2024-03-04T12:36:30.9756009Z 393: [102.2s] time="2024-03-04T12:36:30Z" level=error msg="panic serving schema: runtime error: invalid memory address or nil pointer dereference goroutine 734584 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.func2()
/app/core/schema/schema.go:117 +0x77
panic({0x1bd4420?, 0x3134150?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
bufio.(*Writer).Flush(0xc006d0cc40?)
/usr/local/go/src/bufio/bufio.go:636 +0x12
net/http.(*response).FlushError(0xc006d0cc40)
/usr/local/go/src/net/http/server.go:1718 +0x36
net/http.(*response).Flush(0xc006d0cc01?)
/usr/local/go/src/net/http/server.go:1711 +0x13
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.(*APIServer).ServeHTTP.flushAfterNBytes.func4.func5({0x228ba60?, 0xc006d0cc40}, 0xc000bdd801?)
/app/core/schema/flusher.go:42 +0x20f
net/http.HandlerFunc.ServeHTTP(0xc01f96c5c0?, {0x228ba60?, 0xc006d0cc40?}, 0x2272e00?)
/usr/local/go/src/net/http/server.go:2136 +0x29
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP(0xc00f1459a8, {0x228ba60?, 0xc006d0cc40}, 0xc0003cff00)
/app/core/schema/schema.go:169 +0x4fd
github.com/dagger/dagger/engine/server.(*DaggerServer).ServeClientConn.(*DaggerServer).HTTPHandlerForClient.func3({0x228ba60, 0xc006d0cc40}, 0xc00d1a49a0?)
/app/engine/server/server.go:194 +0x249
net/http.HandlerFunc.ServeHTTP(0x31976a0?, {0x228ba60?, 0xc006d0cc40?}, 0xc0028e9b50?)
/usr/local/go/src/net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc0208fda40?}, {0x228ba60?, 0xc006d0cc40?}, 0x6?)
/usr/local/go/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc025494990, {0x2295af8, 0xc0208fd980})
/usr/local/go/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 734581
/usr/local/go/src/net/http/server.go:3086 +0x5cb"
My understanding is that at the "bottom" of the ServeHTTP chain of middleware, there's the "real" ServeHTTP from the net/http implementation, and it's that one which has the rule of not writing to ResponseWriter after ServeHTTP has returned. So since these other wrappers are above that ServeHTTP call, they must also follow that rule.
Hmm alright. Still having a hard time grokking where exactly Go's net/http code even comes into play here, but I don't have time for another rabbit hole. This behavior flies in the face of how I would expect middleware to work - it's weird that the request/response is effectively "dead" as soon as the innermost handler returns instead of respecting the wrapping scheme, where the middleware should all just be logical Go code running within the "real" ServeHTTP call that kicked off the whole request.
Either way it's broken as-is, so I'm good with changing this for now. Hopefully I don't end up back in this spot when I'm troubleshooting progress draining.
On reading through jobs.go in buildkit, I think this is fine 🤔 Looks like a nice refactor 🎉
Future note - we have to be really careful when we start splitting out the session. A super contrived example: if I create llb.Local with a session ID from one module and consume it in another module, we'd need to make sure to add the required session ID to that local op as it moves between sessions - or add the requested local name to the second session.
This applies to everything that can be attached to sessions (the above is contrived since we don't use llb.Local in this way today). These cases actually might be missed by our tests because they're affected by caching as well (if module 1 Syncs the output, then module 2 gets the cached result, even though it could have a different session). So we should probably make sure we are handling these test cases when we do this refactor.
Yeah today we do put the SessionID in the local LLB (code) and then also avoid all of this entirely due to
This is a pre-req for upcoming changes to support passing host sockets to modules and improvements to secret isolation. Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
It seems like it's conclusively not safe to call this and it should be taken care of internally by net/http anyways. Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
This is a spin-off of work on #6747, which has proven itself to be a fountain of side-quests all over the codebase. Splitting that effort up into separate PRs for everyone's sanity. There will be at least one more pre-req PR (for improving dagql ID digest performance to support walking IDs efficiently) before the "final act" of the actual user-facing features.
This change results in each client getting its own buildkit.Client and buildkit Session. As a side-effect, this results in each client also getting its own buildkit solver.Job, since Job + Session are tied one-to-one in the buildkit implementation. Job in buildkit doesn't have much expensive work associated with initializing it or any state we want to be shared; all of that is in the solver.Solver, which is still shared across the whole engine process.
The ultimate goal here is to scope session attachables (sockets, secrets, auth tokens, terminal endpoints, etc.) to each Function based on what it was passed as args (either directly or transitively as part of the DAG underlying an arg). That isn't fully implemented yet; even though each Function call now gets its own buildkit.Client, the session attachables are currently still shared amongst those clients (same as before). But this gets the buildkit-related groundwork+refactoring out of the way.
Implementation details:
- The schema.Query for a given module client is now stored in the ClientCallContext. Most of the state is still shared, except for the buildkit.Client, which is now set per-Query object.
- Refactored the BuildkitController, DaggerServer, APIServer, and buildkit.Client objects and their interactions. I also did some extra cleanup+consolidation while I was in the area.
- schema.APIServer is no more; I consolidated it into server.DaggerServer. The split between the two didn't really make sense anymore after these changes (and not sure how much it ever did tbh, they were just two layers of HTTP servers). cc @vito @jedevc in case you are wondering where that code went 🙂
- Clients no longer call BuildkitController's Solve method to implement upstream cache exports; that's now done in the /shutdown endpoint. So now clients only ever interact with the Session method.
There are a couple of minor TODO comments left to double check a few things that might have subtleties, but it's good to review.