ci: enable buildkit scheduler debug logs #7128
Conversation
Well, I hit #6111 first try here, so I'll take a look at the logs. Unfortunately, the scheduler logs are proving to be extremely verbose, possibly to the point that we shouldn't include them at DEBUG level, but I'll figure out what to do there before merging.

EDIT: yeah, I can see now that the raw logs file is 750 MB, so this can't merge as is 😆 😭, marking as draft for now
For now, I'm gonna see how far I can get just debugging these errors in this PR. Best case scenario, that'll be enough to figure it out and we can just fix the issues and close this PR. Worst case scenario, we can do some logrus filtering to only include the really important scheduler debug logs (sketched below) and merge it with less than 750MB of engine logs 😄
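A minimal sketch of what that logrus filtering could look like, assuming we tag the scheduler logs with a marker field (the `scheduler` field name and the drop-via-formatter approach are my assumptions here, not what the PR actually does):

```go
package main

import "github.com/sirupsen/logrus"

// schedulerOnlyFormatter wraps another formatter and emits nothing for
// DEBUG entries that lack a marker field, so only the scheduler debug
// logs survive while INFO and above pass through untouched.
type schedulerOnlyFormatter struct {
	next logrus.Formatter
}

func (f *schedulerOnlyFormatter) Format(e *logrus.Entry) ([]byte, error) {
	if e.Level == logrus.DebugLevel {
		if _, ok := e.Data["scheduler"]; !ok { // assumed marker field
			return nil, nil // nothing written for this entry
		}
	}
	return f.next.Format(e)
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	logrus.SetFormatter(&schedulerOnlyFormatter{next: &logrus.TextFormatter{}})

	logrus.WithField("scheduler", true).Debug("kept: scheduler edge update")
	logrus.Debug("dropped: unrelated debug noise")
	logrus.Info("kept: info and above always pass through")
}
```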
force-pushed from 12dd3f5 to aa6faa4
Made some genuine progress in debugging; still working on root cause. Two vertices are involved:

Chronologically, the following happens. The dep vertex is deleted due to being unreferenced:

That dep vertex then shows up as an input to the child:

Finally,
I started seeing some incomprehensible behavior where vertex/edge digests would show up out of nowhere despite never being added. I added a goroutine that prints incrementing logs every 100ms and found that a bunch are missing, so I think we are dropping lots of logs from the dagger engine service, which unfortunately makes debugging this with logs next to impossible. cc @vito, not sure if this is a known issue or not, and also not sure whether it's a telemetry issue, a logrus issue, a GHA job logs issue, etc. I am unable to repro the inconsistent graph state/compute cache errors locally for some reason, so I think I may need to do some hacks to have the engine write its logs to a file in a cache mount and then cat that at the end of the run...

EDIT: I did just realize that the way I implemented the incrementing logs could plausibly have an issue if it's unable to call bklog.Debugf within 100ms (plausible if the engine is outrageously overloaded), so I'll see if I can use a buffered chan instead and still replicate the missing logs.

EDIT2: added the buffered chan (sketched below) and am still missing logs...
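A sketch of that heartbeat detector under the stated assumptions (all names here are mine, and plain logrus stands in for the engine's bklog): a ticker goroutine pushes an incrementing sequence number into a buffered channel so a slow log path can't stall the counter, and a separate consumer logs each number. Gaps in the emitted sequence then point to logs being dropped downstream rather than a stalled producer.

```go
package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

func startHeartbeat() {
	seq := make(chan uint64, 1024) // buffer absorbs bursts where logging is slow

	// Producer: tick every 100ms, never block on the log path itself.
	go func() {
		var n uint64
		for range time.Tick(100 * time.Millisecond) {
			n++
			select {
			case seq <- n:
			default: // buffer full: the consumer is stuck, not the ticker
			}
		}
	}()

	// Consumer: drain the channel and emit one debug line per heartbeat.
	go func() {
		for n := range seq {
			logrus.Debugf("heartbeat %d", n)
		}
	}()
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	startHeartbeat()
	time.Sleep(2 * time.Second) // demo: expect ~20 consecutive heartbeats
}
```

If the emitted sequence still skips numbers with the buffered channel in place, the drops are happening after the log call, which matches what's described above.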
force-pushed from 4ba90f8 to ed4f9fe
force-pushed from efa70bc to 5a916a2
force-pushed from 4d65d3e to a89f5c7
force-pushed from a89f5c7 to a88120f
force-pushed from 6136224 to f7b4049
As the commit history here suggests, I've been deeply struggling just to get all the logs (as opposed to the thing I'm actually trying to debug, the error itself), but thankfully I finally managed to repro the compute cache error locally with a non-nested dev engine and now have a glorious 1.4GB engine log file to look through 🎉
With the full logs I can see at least one of the root causes, which is a problem where edge merges happen to "stale" vertices that are no longer in the solver's actives map (outlined here previously: moby/buildkit#4818 (comment)). However, taking that understanding and trying to repro it in a single isolated test case, I still can't quite hit it (the logs show almost the same thing happening, but the state is slightly different in a way that avoids the error). I would feel much better if I could confirm the understanding with a repro, so I'm going to try that, timeboxed to the next day. If I can't get an exact isolated repro by then, I'll just move forward with fixing the problem I'm seeing upstream, which is a bug no matter what (and decently likely to be the fundamental root cause here).
Hit the point where looking through the logs as is was too convoluted, so I put together a lil parser that extracts the relevant logs, filters out unrelated vertices, and displays everything left in a (somewhat) more readable order/format: https://gist.github.com/sipsma/e67119d96cebe06e69e69f9ba9be2e78 (a rough sketch of the idea is below). Proved to be super useful:
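A rough sketch of the kind of triage that parser does (the real one is in the gist; the digest allowlist and invocation here are placeholders of my own): keep only the log lines mentioning vertices under suspicion.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Digests of the vertices being investigated (placeholders).
	keep := []string{"sha256:aaaa", "sha256:bbbb"}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // scheduler lines can be long
	for sc.Scan() {
		line := sc.Text()
		for _, d := range keep {
			if strings.Contains(line, d) {
				fmt.Println(line) // keep lines touching a suspect vertex
				break
			}
		}
	}
}
```

Run as e.g. `go run parse.go < engine.log` to cut a multi-GB log down to just the vertices involved in the error.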
I haven't confirmed that's the only way the error can get triggered, but that seems to be at least one way. The new super interesting/weird thing I found is that underneath the stale edge that hits this error is a long chain of deps stuck in

That edge seems to stay stuck in

Not totally sure what's going on here; could be a separate scheduler bug that leaves it in that state.
Can replicate that weird behavior of getting stuck on

At this point it's probably best to just fix the bug and see if all is well or if any other weirdness remains. Will do that tomorrow. I don't think the fix is exceptionally complicated; the idea should just be to skip the edge merge if the ultimate target is no longer in the actives map (simplified sketch below).
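A simplified sketch of that guard (these are stand-in types and names of my own, not BuildKit's actual solver internals or the eventual upstream patch): before merging, check that the target edge's vertex is still tracked in actives and bail out otherwise.

```go
package main

import "fmt"

// Stand-ins for the solver's internals; the real types live in
// moby/buildkit's solver package.
type vertex struct{ digest string }
type edge struct{ vertex *vertex }
type state struct{}

type solver struct {
	actives map[string]*state // vertices the solver still tracks
}

// tryMergeEdge refuses to merge src into target when target's vertex has
// already been dropped from actives, since merging into a stale vertex
// would leave the graph in an inconsistent state.
func (s *solver) tryMergeEdge(target, src *edge) bool {
	if _, ok := s.actives[target.vertex.digest]; !ok {
		return false // stale target: skip the merge entirely
	}
	// ...the normal edge-merge path would run here...
	return true
}

func main() {
	s := &solver{actives: map[string]*state{"sha256:live": {}}}
	live := &edge{vertex: &vertex{digest: "sha256:live"}}
	stale := &edge{vertex: &vertex{digest: "sha256:gone"}}
	fmt.Println(s.tryMergeEdge(live, stale)) // true: target still active
	fmt.Println(s.tryMergeEdge(stale, live)) // false: target already deleted
}
```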
@sipsma nice finds ❤️

I do think the behavior described above about getting stuck in slow and fast states sounds suspicious - I wonder if there are some performance issues lurking there? Hopefully this goes away with your proposed fix, but if not, it would probably be worth earmarking to come back to; it could be part of the explanation for the low CPU/network usage that some users have observed.
Upstream fix here: moby/buildkit#4887

TODOs left:
@jedevc When I looked a bit more last night, even though I was seeing these edges stuck in those states, getting a SIGQUIT stack trace showed there was no code actually stuck computing the cache, so I suspect this is just a scheduler problem. Going to see if that's still happening after my fix.

EDIT: looks like it's gone now after the fix, so I think we should be good for now
Think we're good here; will track the rest of the work to pick up the upstream fix in this issue: #6111
Closes #7020
Just enabling everywhere (including published images) for now, since users are hitting this too and it may be helpful to grab these logs from them.

Issue for removing these once we're done debugging: #7129