
ci: enable buildkit scheduler debug logs #7128

Closed
wants to merge 20 commits

Conversation


sipsma commented Apr 18, 2024

Closes #7020

Just enabling this everywhere (including published images) for now, since users are hitting this too and it may be helpful to grab these logs from them.

Issue for removing these once we're done debugging: #7129
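For reference, gating this kind of verbose scheduler logging behind an env var has roughly the shape sketched below (the variable name and wiring here are illustrative only, not the actual change in this PR):

```go
// Minimal sketch of gating extra scheduler logging behind an env var so it can
// be enabled in CI and in published images. Names here are hypothetical.
package main

import (
	"os"

	"github.com/sirupsen/logrus"
)

// debugScheduler mirrors the kind of package-level toggle buildkit's solver
// uses to guard its extra scheduler logging.
var debugScheduler = os.Getenv("EXAMPLE_SCHEDULER_DEBUG") == "1"

func schedulerDebugf(format string, args ...interface{}) {
	if !debugScheduler {
		return
	}
	logrus.Debugf(format, args...)
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	schedulerDebugf("edge %s moved to state %s", "sha256:...", "cache-slow")
}
```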


sipsma commented Apr 19, 2024

Well, I hit #6111 on the first try here, so I'll take a look at the logs. Unfortunately, the scheduler logs are proving to be extremely verbose, possibly to the point that we shouldn't include them at the DEBUG level, but I'll figure out what to do there before merging.

EDIT: yeah, I can see now that the raw log file is 750 MB, so we can't merge this as-is 😆 😭; marking as draft for now

sipsma marked this pull request as draft April 19, 2024 01:46

sipsma commented Apr 19, 2024

For now, I'm gonna see how far I can get just debugging these errors in this PR. Best case scenario, that'll be enough to figure it out and we can just fix the issues and close this PR. Worst case scenario, we can do some logrus filtering to only include the really important scheduler debug logs and merge it with less than 750 MB of engine logs 😄
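For illustration, one way to do that kind of logrus filtering is a wrapping formatter that only emits entries matching a keep-list; the sketch below shows the general approach with example keywords, not anything committed in this PR:

```go
// Sketch of a logrus formatter wrapper that only lets through log lines
// containing certain keywords; everything else is dropped by returning no
// bytes. The keyword list is illustrative.
package main

import (
	"strings"

	"github.com/sirupsen/logrus"
)

type filterFormatter struct {
	inner logrus.Formatter
	keep  []string
}

func (f *filterFormatter) Format(e *logrus.Entry) ([]byte, error) {
	for _, k := range f.keep {
		if strings.Contains(e.Message, k) {
			return f.inner.Format(e)
		}
	}
	// No bytes returned means nothing is written for this entry.
	return nil, nil
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	logrus.SetFormatter(&filterFormatter{
		inner: &logrus.TextFormatter{},
		keep:  []string{"unpark", "deleting unreferenced active vertex"},
	})
	logrus.Debug("unpark copy /runtime/template/runtime.py /runtime") // kept
	logrus.Debug("some unrelated chatter")                            // dropped
}
```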


sipsma commented Apr 19, 2024

Made some genuine progress in debugging; still working on the root cause.

Two vertices are involved:

  • A dep vertex:
    • name: mkfile /runtime/dagger.json
    • digest: sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9
  • A child vertex (depends on the above):
    • name: copy /runtime/template/runtime.py /runtime
    • digest: sha256:948bd1d0b0340f6abaa961af6373ee95f48f35dc5d94cbf35a85f66539ee0ab4

Chronologically, the following happens:

The dep vertex is deleted due to being unreferenced:

2024-04-19T20:16:19.2569252Z level=debug msg="deleting job from state" actives_key="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" job=rvq588ojkbcilh724kbpfrlv4 vertex_digest="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" vertex_name="mkfile /runtime/dagger.json"
2024-04-19T20:16:19.2571409Z level=debug msg="deleting unreferenced active vertex" actives_key="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" vertex_digest="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" vertex_name="mkfile /runtime/dagger.json"
2024-04-19T20:16:19.2572818Z level=debug msg="edge in deleted unreferenced state" index=0 state=cache-slow vertex_digest="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" vertex_name="mkfile /runtime/dagger.json"

That dep vertex shows up as an input to the child:

2024-04-19T20:16:27.2965805Z level=debug msg=">> unpark copy /runtime/template/runtime.py /runtime req=1 upt=0 out=0 state=cache-fast sha256:948bd1d0b0340f6abaa961af6373ee95f48f35dc5d94cbf35a85f66539ee0ab4"                                                                                         
2024-04-19T20:16:27.2967033Z level=debug msg=":: dep0 mkfile /runtime/dagger.json state=complete des=cache-fast keys=1 hasslowcache=true preprocessfunc=true"                                                                                                                                          
2024-04-19T20:16:27.2967999Z level=debug msg="> incoming-0: 0xc01b4c7900 dstate=cache-slow canceled=false"                                                                                                                                                                                             
2024-04-19T20:16:27.2969186Z level=debug msg="status state=cache-fast cancomplete=false hasouts=false noPossibleCache=false depsCacheFast=true keys=1 cacheRecords=1"                                                                                                                                  
2024-04-19T20:16:27.2970708Z level=debug msg="updateIncoming copy /runtime/template/runtime.py /runtime solver.edgeState{state:1, result:(*solver.SharedCachedResult)(nil), cacheMap:(*solver.CacheMap)(0xc0177ad920), keys:[]solver.ExportableCacheKey{solver.ExportableCacheKey{CacheKey:(*solver.Cac (0xc01a3abcb0), Exporter(*solver.exporter)(0xc0140554c0)}}} desired=cache-slow"                                                                                                                                                                                                                      
2024-04-19T20:16:27.2973069Z level=debug msg="skip input request based on dep state" dep_desired_state=complete dep_index=0 dep_state=complete dep_vertex_digest="sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9" dep_vertex_name="mkfile /runtime/dagger.json" edge_vertex_di sha256:948bd1d0b0340f6abaa961af6373ee95f48f35dc5d94cbf35a85f66539ee0ab4" edge_vertex_name="copy /runtime/template/runtime.py /runtime"       

Finally, getState fails to find the dep vertex (since it was deleted):

2024-04-19T20:16:27.3101679Z level=debug msg=">> unpark copy /runtime/template/runtime.py /runtime req=1 upt=1 out=1 state=cache-fast sha256:948bd1d0b0340f6abaa961af6373ee95f48f35dc5d94cbf35a85f66539ee0ab4"
2024-04-19T20:16:27.3102871Z level=debug msg=":: dep0 mkfile /runtime/dagger.json state=complete des=cache-fast keys=1 hasslowcache=true preprocessfunc=true"
2024-04-19T20:16:27.3103908Z level=debug msg="> incoming-0: 0xc01b4c7900 dstate=cache-slow canceled=false"
2024-04-19T20:16:27.3104742Z level=debug msg="> update-0: unknown"
2024-04-19T20:16:27.3105925Z level=debug msg="status state=cache-fast cancomplete=true hasouts=false noPossibleCache=false depsCacheFast=true keys=1 cacheRecords=1"
2024-04-19T20:16:27.3107444Z level=debug msg="finishIncoming copy /runtime/template/runtime.py /runtime failed to compute cache key: failed to get state for index 0 on copy /runtime/template/runtime.py /runtime solver.edgeState{state:1, result:(*solver.SharedCachedResult)(nil), cacheMap:(*solve
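In other words, the failure mode boils down to something like the following (an illustrative sketch with made-up types, not buildkit's actual solver code): the dep's vertex gets removed from the actives map while a child edge still references it, so the later state lookup for that dep fails.

```go
// Illustrative sketch only: a vertex is deleted from an actives map while a
// child edge still references its digest, so a later getState lookup fails.
package main

import "fmt"

type vertexState struct{ name string }

type solver struct {
	actives map[string]*vertexState
}

func (s *solver) getState(dgst string) (*vertexState, error) {
	st, ok := s.actives[dgst]
	if !ok {
		return nil, fmt.Errorf("failed to get state for %s", dgst)
	}
	return st, nil
}

func main() {
	dep := "sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9"
	s := &solver{actives: map[string]*vertexState{
		dep: {name: "mkfile /runtime/dagger.json"},
	}}

	// The dep vertex becomes unreferenced and is deleted...
	delete(s.actives, dep)

	// ...but the child edge still points at it and later asks for its state.
	if _, err := s.getState(dep); err != nil {
		fmt.Println(err) // failed to get state for sha256:2e11b9...
	}
}
```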


sipsma commented Apr 22, 2024

I started seeing some incomprehensible behavior where vertex/edge digests would show up out of nowhere despite never being added. I added a goroutine that prints incrementing logs every 100ms and found that a bunch of them are missing, so I think we are dropping lots of logs from the dagger engine service, which unfortunately makes debugging this with logs next to impossible. cc @vito: not sure if this is a known issue, and not sure whether it's a telemetry issue, a logrus issue, a GHA job logs issue, etc.

I am unable to repro the inconsistent graph state/compute cache errors locally for some reason, so I think I may need to do some hacks to have the engine write its logs to a file in a cache mount and then cat that at the end of the run...

EDIT: I did just realize that the way I implemented the incrementing logs could plausibly have an issue if it's unable to call bklog.Debugf within 100ms (plausible if the engine is outrageously overloaded), so I'll see if I can use a buffered chan instead and still replicate missing logs.

EDIT2: added the buffered chan and am still missing logs...
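For context, the heartbeat roughly has this shape (a simplified sketch, not the exact code in this branch): a ticker pushes incrementing sequence numbers into a buffered channel so the counter keeps advancing even if a logging call momentarily blocks, and a separate goroutine drains the channel and logs each value; gaps in the logged sequence then point at drops downstream rather than a stalled sender.

```go
// Simplified sketch of the heartbeat used to detect dropped logs.
package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

func startHeartbeat() {
	seq := make(chan uint64, 1024) // buffered so the ticker never blocks on logging

	go func() {
		var n uint64
		for range time.Tick(100 * time.Millisecond) {
			n++
			select {
			case seq <- n:
			default: // channel full; the logging side itself is falling behind
			}
		}
	}()

	go func() {
		for n := range seq {
			logrus.Debugf("heartbeat %d", n)
		}
	}()
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	startHeartbeat()
	time.Sleep(2 * time.Second)
}
```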

sipsma force-pushed the enable-bk-sched-debug branch 3 times, most recently from 4ba90f8 to ed4f9fe, on April 22, 2024 at 21:12

sipsma commented Apr 24, 2024

As the commit history here suggests, I've been deeply struggling with actually getting all the logs (as opposed to debugging the error itself), but thankfully I finally managed to repro the compute cache error locally with a non-nested dev engine and have a glorious 1.4GB engine log file to look through now 🎉


sipsma commented Apr 25, 2024

With the full logs I can see at least one of the root causes, which is a problem where edge merges happen to "stale" vertices that are no longer in the solver's actives map (I outlined that previously here: moby/buildkit#4818 (comment)).

However, I've been taking that understanding and trying to repro it in a single isolated test case, but I still can't quite hit it (the logs show almost the same thing happening, but the state is slightly different in a way that avoids it). I would feel much better if I could confirm the understanding with a repro, so I'm going to keep trying that, timeboxed to the next day.

If I can't get an exact isolated repro by then, I'll just move forward with fixing the problem I'm seeing upstream, which is a bug no matter what (and decently likely to be the fundamental root cause here).


sipsma commented Apr 26, 2024

Hit the point where looking through the logs as-is was too convoluted, so I put together a lil parser that extracts relevant logs, filters out unrelated vertices, and displays everything left in a (somewhat) more readable order/format: https://gist.github.com/sipsma/e67119d96cebe06e69e69f9ba9be2e78
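The real parser is in the gist above; stripped way down, the filtering idea is just this (the digests here are the two from the earlier comment):

```go
// Very reduced sketch of the gist's filtering idea: read the raw engine log
// and keep only lines mentioning the vertex digests under investigation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	interesting := []string{
		"sha256:2e11b96cee7222164411bea7dd67d26891ac7746f35805e2ad92c735a88684c9",
		"sha256:948bd1d0b0340f6abaa961af6373ee95f48f35dc5d94cbf35a85f66539ee0ab4",
	}

	f, err := os.Open("engine.log") // path is an example
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // scheduler log lines can be very long
	for sc.Scan() {
		line := sc.Text()
		for _, d := range interesting {
			if strings.Contains(line, d) {
				fmt.Println(line)
				break
			}
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```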

Proved to be super useful:

  1. The aforementioned stale edge merge is one part of the problem
  2. It seems that the other ingredient to trigger the error is that the edge hitting the error has a stale dep that:
    1. is in state complete
    2. transitions from desired-state cache-fast to desired-state cache-slow
    3. the stale dep itself has deps that are "stuck" in state cache-slow ("stuck" is explained more below)

I haven't confirmed that's the only way the error can get triggered, but that seems to be at least one way.

The new super interesting/weird thing I found is that underneath the stale edge that hits this error is a long chain of deps stuck in cache-slow, ending at this edge:

  • name: copy /runtime /runtime
  • digest: sha256:be638109a1a7ddf1d96d4acb544ef6f787df9eabc16ced03b0f4e199b513a0fe
  • output index: 1
  • has a dep on a diffOp that's in state complete

That edge seems to stay stuck in cache-slow for a very long time (possibly indefinitely; the tests ended with it still in that state 10+ minutes after the compute cache error was hit).

Not totally sure what's going on here; could be a separate scheduler bug that leaves it in that cache-slow state, a bug in the slow cache func that computes the digest, or something else.

  • The op itself only has one input, which means it's copying from a state to itself; copying one directory onto itself is especially odd. It's possible that's an inefficient/unneeded op in our own code, but it nonetheless should not be causing problems in the buildkit scheduler...


sipsma commented Apr 26, 2024

I can replicate that weird behavior of getting stuck on cache-slow, plus some variations of getting stuck on cache-fast. I did notice that the whole chain of stuck edges also appears to be stale edges still getting used, so it's possible this is just another bug caused by that same problem.

At this point it's probably best to just fix the bug and see if all is well or if any other weirdness remains. Will do that tomorrow. I don't think the fix is exceptionally complicated; the idea should just be to skip the edge merge if the ultimate target is no longer in the actives map.
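Roughly, the fix idea is the check sketched below (all types and names are illustrative stand-ins, not the actual solver code or the eventual upstream patch):

```go
// Sketch of the idea: before merging one edge into another, verify the merge
// target's vertex is still tracked in the actives map and skip the merge if
// it is not. Everything here is an illustrative stand-in.
package main

import "fmt"

type edge struct{ vertexDigest string }

type solver struct {
	actives map[string]struct{}
}

func (s *solver) tryMergeEdges(src, target *edge) bool {
	if _, ok := s.actives[target.vertexDigest]; !ok {
		// The target vertex was already deleted as unreferenced; merging src
		// into it would leave src depending on state the solver no longer tracks.
		return false
	}
	// ...perform the normal edge merge here...
	return true
}

func main() {
	s := &solver{actives: map[string]struct{}{"sha256:live": {}}}
	fmt.Println(s.tryMergeEdges(&edge{"sha256:src"}, &edge{"sha256:live"}))  // true: merge proceeds
	fmt.Println(s.tryMergeEdges(&edge{"sha256:src"}, &edge{"sha256:stale"})) // false: merge skipped
}
```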


jedevc commented Apr 26, 2024

@sipsma nice finds ❤️

At this point it's probably best to just fix the bug and see if all is well or if any other weirdness remains.

I do think the behavior described above about getting stuck in slow and fast states definitely sounds suspicious - I wonder if there are some performance issues lurking there? Hopefully this goes away with your proposed fix, but if not, it would probably be worth earmarking to come back to; I wonder if this could be part of explaining some of the low CPU/network usage that some users have observed.


sipsma commented Apr 26, 2024

Upstream fix here: moby/buildkit#4887

TODOs left:

  • check to see if progress-related goroutine leaks were pre-existing or caused by the fix here
    • pre-existing issue, will look at separately
  • check to see if the issue with edges being stuck in cache-fast/slow is gone now or if it's a separate problem
    • seems to be gone now, problematic vertices now transition from cache-slow to complete

I do think the behavior described above about getting stuck in slow and fast states definitely sounds suspicious - I wonder if there are some performance issues lurking there? Hopefully this goes away with your proposed fix, but if not, it would probably be worth earmarking to come back to; I wonder if this could be part of explaining some of the low CPU/network usage that some users have observed.

@jedevc When I looked a bit more last night, even though I was seeing these edges stuck in those states, getting a SIGQUIT stack trace showed there was no code actually stuck computing the cache, so I suspect this is just a scheduler problem. Going to see if that's still even happening after my fix now too.

EDIT: looks like it's gone now after the fix, so I think we should be good for now
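Aside, in case it's useful to anyone else debugging this: the same kind of goroutine dump can be grabbed without killing the process by dumping stacks on a signal, e.g. something like the sketch below. This is a generic debugging aid, not part of the change here.

```go
// Dump all goroutine stacks to stderr on SIGUSR1, as an alternative to
// SIGQUIT (which also terminates the process).
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func dumpGoroutinesOnSignal() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			// debug=2 prints a full stack trace for every goroutine.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpGoroutinesOnSignal()
	select {} // block forever; `kill -USR1 <pid>` triggers the dump
}
```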


sipsma commented Apr 26, 2024

Think we're good here; will track the rest of the work to pick up the upstream fix in this issue: #6111
