
engine: isolate buildkit client+session to each client #6806

Merged: 6 commits into dagger:main on Mar 6, 2024

Conversation

@sipsma (Contributor) commented Mar 2, 2024:

This is a spin-off of work on #6747, which has proven itself to be a fountain of side-quests all over the codebase. Splitting that effort up into separate PRs for everyone's sanity. There will be at least one more pre-req PR (for improving dagql ID digest performance to support walking IDs efficiently) before the "final act" of the actual user-facing features.

This change results in each client getting its own buildkit.Client and buildkit Session. As a side-effect, this results in each client also getting its own buildkit solver.Job since Job+Session are tied one-to-one in the buildkit implementation.

  • It turns out each Job in buildkit doesn't have much expensive work associated with initializing it or any state we want to be shared; all of that is in the solver.Solver which is still shared across the whole engine process.

The ultimate goal here is to scope session attachables (sockets, secrets, auth tokens, terminal endpoints, etc.) to each Function based on what it was passed as args (either directly or transitively as part of the DAG underlying an arg). That isn't fully implemented yet; even though each Function call now gets its own buildkit.Client, the session attachables are currently still shared amongst those clients (same as before). But this gets the buildkit-related groundwork+refactoring out of the way.

Implementation details:

  • Each schema.Query for a given module client is now stored in the ClientCallContext. Most of the state is still shared, except for the buildkit.Client, which is now set per Query object (a rough sketch follows this list).
  • This all required some rearranging of the BuildkitController, DaggerServer, APIServer, and buildkit.Client objects and their interactions. I also did some extra cleanup+consolidation while I was in the area.
    • schema.APIServer is no more; I consolidated it into server.DaggerServer. The split between the two didn't really make sense anymore after these changes (and not sure how much it ever did tbh, they were just two layers of HTTP servers). cc @vito @jedevc in case you are wondering where that code went 🙂
    • I fixed up the dumb hack of (ab)using the BuildkitController's Solve method to implement upstream cache exports; that's now done in the /shutdown endpoint. So now clients only ever interact with the Session method.
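For orientation, here is a rough sketch of the shape this lands on; the type and field names approximate the description above and are stand-ins, not copies of the actual code:

```go
package server

// Sketch only: placeholder for the real buildkit.Client, which now
// carries a per-client bk session and solver.Job.
type BuildkitClient struct{}

// Query is the per-client GraphQL root object; each module client
// gets its own instance when it connects.
type Query struct {
	Buildkit *BuildkitClient // created per client, torn down with it
	// heavier engine-wide state (solver.Solver, caches, content
	// store, ...) stays shared and is referenced, not duplicated
}

// ClientCallContext tracks a single client's calls into the server.
type ClientCallContext struct {
	Root *Query // this client's Query, with its own BuildkitClient
}
```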

There are a couple of minor TODO comments left to double-check a few things that might have subtleties, but this is good to review.

@sipsma sipsma requested review from vito and jedevc March 2, 2024 01:25
Review threads on core/query.go (outdated, resolved). The code under discussion:
// TODO: double check rm of this defer and explain
// saw panic in engine logs (though no test error?)
// https://github.com/golang/go/issues/19959
// defer flushWriter.Flush()
@sipsma (author) commented on the diff above:
cc @vito I saw panics in the engine logs from internal net/http code that seemed to be a result of this Flush running after next.ServeHTTP returned (which golang/go#19959 says is a no-no). I think/hope any flushing would be taken care of by net/http internally, but let me know if you see any problems with removing this defer.

@vito commented Mar 2, 2024:

Hmm I just debugged the same panic myself and it turned out to be user error - an endpoint I added (/trace, for exporting otel traces) wasn't synchronizing to ensure its writes were flushed before the handler returned. That makes me think it's worth double-checking that commenting this out isn't hiding a real issue. Given how much trouble we're already having with getting the last bits of progress out of a client that's shutting down, I'm pretty afraid of removing anything that resembles flushing. 😅

edit: just noticing the code I added to "fix" it was one big no-op yet I haven't seen the panic since, so maybe there's something hairier lurking here, i.e. a race condition.

@sipsma (author) replied:

The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as is.

Looking at the implementation, there is "flush on close" logic: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1745, and it does look like it should be called provided the conn has not been hijacked: https://github.com/golang/go/blob/f29208030ab80769ce61dedb5a419821abf92113/src/net/http/server.go#L1874-L1879

We do hijack the conn in the case of Terminal, but at that point we have size limits on each websocket message that I suspect inherently mean we already aren't ever going to write a massive amount of data (e.g. https://github.com/sipsma/dagger/blob/1fefa4a198ffe2d533825006a10f6846f2422b99/core/terminal.go#L106-L106).
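To make the hazard concrete, here is a minimal sketch of the unsafe pattern versus a safe one; this is illustrative only and is not the actual flusher middleware from this codebase:

```go
package middleware

import (
	"net/http"
	"time"
)

// Unsafe: the goroutine can call Flush after ServeHTTP has returned,
// which golang/go#19959 describes as undefined behavior (net/http may
// have already recycled the response's bufio.Writer, hence the panic).
func unsafeFlusher(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if f, ok := w.(http.Flusher); ok {
			go func() { // nothing ties this goroutine to the request lifetime
				time.Sleep(10 * time.Millisecond)
				f.Flush()
			}()
		}
		next.ServeHTTP(w, r)
	})
}

// Safe: flushing happens before this handler (and therefore the
// outermost ServeHTTP) returns; net/http also flushes on close for
// non-hijacked conns, which is why dropping the defer should be OK.
func safeFlusher(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if f, ok := w.(http.Flusher); ok {
			f.Flush()
		}
	})
}
```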

@vito replied:

> The thing is golang/go#19959 seems to pretty conclusively say that you can't call .Flush after ServeHTTP has returned or you'll get undefined behavior. So I don't think we can leave it in as is.

That's absolutely true, but this is in a middleware chain; the inner ServeHTTP has completed, but not the outer one, right? The outer one is the one I would expect to matter.

@sipsma (author) replied:

Right, but does net/http know about all that? It seems like from its perspective we are just calling ServeHTTP, so it couldn't know that it should defer some behavior until the outermost ServeHTTP has returned, right? Unless there's some magic going on under the hood there.

@vito commented Mar 4, 2024:

I don't think it needs to know all that - from net/http's perspective there's just the "real" ServeHTTP that it's calling, and all this middleware is just a series of function calls running as part of that outer call, which happen to align on the ServeHTTP name for composability. So from net/http's perspective I would expect that defer to already run while the outermost ServeHTTP is running, which is all that matters.

At least, that's how middleware normally works - we're doing all sorts of extra effort, maybe wires are getting crossed somewhere in there?

Are you able to consistently repro an issue without this commented out?
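For reference, the pattern being described looks like this; a small illustrative sketch of middleware as plain nested function calls, where each defer unwinds before net/http regains control:

```go
package middleware

import (
	"log"
	"net/http"
)

// logDone wraps next; its defer fires when this function returns,
// i.e. while the outermost ServeHTTP call is still on the stack.
func logDone(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer log.Printf("handled %s", r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

// The chain is just nested calls: net/http invokes the outer handler,
// which invokes the inner one, and all defers run on the way back out.
var chain http.Handler = logDone(http.HandlerFunc(
	func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	},
))
```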

@sipsma (author) replied:

> Are you able to consistently repro an issue without this commented out?

Yeah, actually you can see it in CI on main too if you look at the engine logs. Here's one extracted from this run:

2024-03-04T12:36:30.9756009Z 393: [102.2s] time="2024-03-04T12:36:30Z" level=error msg="panic serving schema: runtime error: invalid memory address or nil pointer dereference goroutine 734584 [running]:
runtime/debug.Stack()
  /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.func2()
  /app/core/schema/schema.go:117 +0x77
panic({0x1bd4420?, 0x3134150?})
  /usr/local/go/src/runtime/panic.go:914 +0x21f
bufio.(*Writer).Flush(0xc006d0cc40?)
  /usr/local/go/src/bufio/bufio.go:636 +0x12
net/http.(*response).FlushError(0xc006d0cc40)
  /usr/local/go/src/net/http/server.go:1718 +0x36
net/http.(*response).Flush(0xc006d0cc01?)
  /usr/local/go/src/net/http/server.go:1711 +0x13
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP.(*APIServer).ServeHTTP.flushAfterNBytes.func4.func5({0x228ba60?, 0xc006d0cc40}, 0xc000bdd801?)
  /app/core/schema/flusher.go:42 +0x20f
net/http.HandlerFunc.ServeHTTP(0xc01f96c5c0?, {0x228ba60?, 0xc006d0cc40?}, 0x2272e00?)
  /usr/local/go/src/net/http/server.go:2136 +0x29
github.com/dagger/dagger/core/schema.(*APIServer).ServeHTTP(0xc00f1459a8, {0x228ba60?, 0xc006d0cc40}, 0xc0003cff00)
  /app/core/schema/schema.go:169 +0x4fd
github.com/dagger/dagger/engine/server.(*DaggerServer).ServeClientConn.(*DaggerServer).HTTPHandlerForClient.func3({0x228ba60, 0xc006d0cc40}, 0xc00d1a49a0?)
  /app/engine/server/server.go:194 +0x249
net/http.HandlerFunc.ServeHTTP(0x31976a0?, {0x228ba60?, 0xc006d0cc40?}, 0xc0028e9b50?)
  /usr/local/go/src/net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc0208fda40?}, {0x228ba60?, 0xc006d0cc40?}, 0x6?)
  /usr/local/go/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc025494990, {0x2295af8, 0xc0208fd980})
  /usr/local/go/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 734581
  /usr/local/go/src/net/http/server.go:3086 +0x5cb"

My understanding is that at the "bottom" of the ServeHTTP chain of middleware, there's the "real" ServeHTTP from the net/http implementation, and it's that one which has the rule of not writing to ResponseWriter after ServeHTTP has returned. So since these other wrappers are above that ServeHTTP call, they must also follow that rule.

@vito replied:

Hmm alright. Still having a hard time grokking where exactly Go's net/http code even comes into play here but don't have time for another rabbit hole. This behavior flies in the face of how I would expect middleware to work - it's weird that the request/response is effectively "dead" as soon as the innermost handler returns instead of respecting the wrapping scheme, where the middleware should all just be logical Go code running within the "real" ServeHTTP call that kicked off the whole request.

Either way it's broken as-is so I'm good with changing this for now. Hopefully I don't end up back in this spot when I'm troubleshooting progress draining.

@jedevc (Member) left a comment:

On reading through jobs.go in buildkit, I think this is fine 🤔 Looks like a nice refactor 🎉

Future note - we have to be really careful when we start splitting out the session. A super contrived example: if I create llb.Local with a session ID from one module, and consume it in another module, we'd need to make sure to add the required Session ID to that local op as it moves between sessions - or add the requested local name to the second session.

This applies to everything that can be attached to sessions (the above is contrived since we don't use llb.Local in this way today). These cases actually might be missed by our tests because they're affected by caching as well (if module 1 Syncs the output, then module 2 gets the cached result, even though it could have a different session). So we should probably make sure we are handling these test cases when we do this refactor.
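The contrived case might look like the following sketch; it uses buildkit's client/llb API, and the module/session wiring here is hypothetical, not how dagger constructs locals today:

```go
package example

import "github.com/moby/buildkit/client/llb"

// Module 1 pins a local source to its own session.
func module1Source(sessionA string) llb.State {
	return llb.Local("ctx", llb.SessionID(sessionA))
}

// If module 2 (running under session B) solves this state uncached,
// buildkit will try to fetch "ctx" from session A. Unless session A is
// still attached, or the "ctx" local dir is re-attached to session B,
// the solve fails - so the op's SessionID (or the second session's
// attachables) would need fixing up as states cross module boundaries.
func module2Consumes(src llb.State) llb.State {
	return llb.Image("alpine").
		Run(llb.Shlex("cp -r /src /out"), llb.AddMount("/src", src)).
		Root()
}
```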

Review threads on engine/server/server.go and engine/buildkit/client.go (outdated, resolved)
@sipsma (author) commented Mar 4, 2024:

> Future note - we have to be really careful when we start splitting out the session. A super contrived example: if I create llb.Local with a session ID from one module, and consume it in another module, we'd need to make sure to add the required Session ID to that local op as it moves between sessions - or add the requested local name to the second session.
>
> This applies to everything that can be attached to sessions (the above is contrived since we don't use llb.Local in this way today). These cases actually might be missed by our tests because they're affected by caching as well. So we should probably make sure we are handling these test cases when we do this refactor.

Yeah today we do put the SessionID in the local LLB (code) and then also avoid all of this entirely due to Blob-ification. So until all that changes I think we have multiple levels of safety guarding against this. But yeah if that changes sometime down the line it will indeed require careful handling.

Commit messages:

This is a pre-req for upcoming changes to support passing host sockets
to modules and improvements to secret isolation.

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
It seems like it's conclusively not safe to call this and it should be
taken care of internally by net/http anyways.

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
@sipsma merged commit e5f0bc7 into dagger:main on Mar 6, 2024
43 checks passed