Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitor: add debug-shell and on-error #1640

Merged
merged 7 commits into from
Apr 18, 2023

Conversation

ktock
Copy link
Collaborator

@ktock ktock commented Feb 21, 2023

Related to #1104

PR7
docker buildx build --invoke=debug-shell should go directly to monitor mode where processes can be created with exec. We can also have docker buildx debug-shell to start monitor mode without a specific container context.

PR8
Add docker buildx build --invoke=on-error. In this mode, if build ends with an error debug shell will be opened from the error location. The error returned by buildkit is typed and contains the references to the state of the error and state in the beginning of the failed step. Monitor commands allow to switch in-between of these states. Error also includes source map that can be shown to the user.

debug-shell subcommand

This directly starts the monitor without performing the build.
Users can also attach to other sessions (if any) and perform reload, rollback, etc... in that session.

On terminal 1:

$ BUILDX_EXPERIMENTAL=1 /tmp/out/buildx build --invoke sh /tmp/ctx3

On terminal 2:

$ BUILDX_EXPERIMENTAL=1 /tmp/out/buildx debug-shell
INFO: connecting to buildx server
Launching interactive container. Press Ctrl-a-c to switch to monitor console
Interactive container was restarted with process "spc3i3vpqzezebdfun503tt13". Press Ctrl-a-c to switch to the new container
Switched IO
(buildx) list
ID				CURRENT_SESSION
b3ylh0tj7mzb3l8x35tdsko78	false
(buildx) attach b3ylh0tj7mzb3l8x35tdsko78
Attached to process "". Press Ctrl-a-c to switch to the new container
(buildx) ps
PID				CURRENT_SESSION	COMMAND
hf5ne0940cnlqdlo6khdn8t11	false		[sh]
(buildx) attach hf5ne0940cnlqdlo6khdn8t11
Attached to process "hf5ne0940cnlqdlo6khdn8t11". Press Ctrl-a-c to switch to the new container
(buildx) Switched IO

/ # ls /
a      dev    hi     lib    proc   sys    usr
bin    etc    home   lib64  root   tmp    var
/ # Switched IO
(buildx) rollback
invoke error: context canceled
Interactive container was restarted with process "whuvsxe4lizynxli7rhrga3jk". Press Ctrl-a-c to switch to the new container
(buildx) Switched IO
/ # ls
a      dev    hi     lib    proc   sys    usr
bin    etc    home   lib64  root   tmp    var
/ # Switched IO
(buildx) reload
invoke error: rpc error: code = Canceled desc = Canceled: exit code: 4294967295: context canceled
[+] Building 0.9s (6/6) FINISHED                                                             
 => [internal] load .dockerignore                                                       0.0s
 => => transferring context: 2B                                                         0.0s
 => [internal] load build definition from Dockerfile                                    0.0s
 => => transferring dockerfile: 87B                                                     0.0s
 => [internal] load metadata for docker.io/library/busybox:latest                       0.7s
 => [1/3] FROM docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => => resolve docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => CACHED [3/3] RUN echo aaaa > /a                                                     0.0s
 => CACHED [2/3] RUN echo hi > /hi                                                      0.0s
Interactive container was restarted with process "2utxvq9ij8yrwjiiqr3qtte71". Press Ctrl-a-c to switch to the new container
(buildx) Switched IO
/ # ls /
a      dev    hi     lib    proc   sys    usr
bin    etc    home   lib64  root   tmp    var

--invoke=debug-shell

This directly starts the monitor without performing the build.
Users can also attach to other sessions (if any) and perform reload, rollback, etc... in that session.

The difference between this and debug-shell subcommand is that --invoke=debug-shell allows performing reload with the specified arguments of the buildx build and starting a new session based on that result.

$ BUILDX_EXPERIMENTAL=1 /tmp/out/buildx build --invoke debug-shell /tmp/ctx3
INFO: connecting to buildx server
Launching interactive container. Press Ctrl-a-c to switch to monitor console
Interactive container was restarted with process "ltqny4fj99df1cgyjkvmbyyst". Press Ctrl-a-c to switch to the new container
Switched IO
(buildx) list
ID	CURRENT_SESSION
(buildx) reload
[+] Building 0.9s (6/6) FINISHED                                                             
 => [internal] load build definition from Dockerfile                                    0.0s
 => => transferring dockerfile: 87B                                                     0.0s
 => [internal] load .dockerignore                                                       0.1s
 => => transferring context: 2B                                                         0.0s
 => [internal] load metadata for docker.io/library/busybox:latest                       0.7s
 => [1/3] FROM docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => => resolve docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => CACHED [2/3] RUN echo hi > /hi                                                      0.0s
 => CACHED [3/3] RUN echo aaaa > /a                                                     0.0s
Interactive container was restarted with process "p3uhdcmimfhzmmpy0okrvurae". Press Ctrl-a-c to switch to the new container
(buildx) Switched IO
/ # Switched IO
(buildx) list
ID				CURRENT_SESSION
15pqy92xvlibmz66aac0doag9	true
(buildx) Switched IO

/ # ls /
a      dev    hi     lib    proc   sys    usr
bin    etc    home   lib64  root   tmp    var

--invoke=on-error

The container is invoked based on the state of the error returned by BuildKit.

Buildx controller saves that state and returns a typed error that contains the reference to that state and the client(monitor) uses it for starting containers/processes. For the typed error, we use the same logic as BuildKit does in errdefs.SolveError etc.

rollback --init command allows to start the container on the initial state of the failed step instead of the result state.

FROM busybox
RUN echo hi > /hi
RUN echo failed > /failed && echo fail > /
$ BUILDX_EXPERIMENTAL=1 /tmp/out/buildx build --invoke=on-error /tmp/ctx2
INFO: connecting to buildx server
[+] Building 1.7s (7/7) FINISHED                                                             
 => [internal] load .dockerignore                                                       0.0s
 => => transferring context: 2B                                                         0.0s
 => [internal] load build definition from Dockerfile                                    0.1s
 => => transferring dockerfile: 111B                                                    0.0s
 => [internal] load metadata for docker.io/library/busybox:latest                       1.3s
 => [auth] library/busybox:pull token for registry-1.docker.io                          0.0s
 => [1/3] FROM docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => => resolve docker.io/library/busybox@sha256:c118f538365369207c12e5794c3cbfb7b042d9  0.0s
 => CACHED [2/3] RUN echo hi > /hi                                                      0.0s
 => ERROR [3/3] RUN echo failed > /failed && echo fail > /                              0.2s
------
 > [3/3] RUN echo failed > /failed && echo fail > /:
#0 0.122 /bin/sh: can't create /: Is a directory
------
Launching interactive container. Press Ctrl-a-c to switch to monitor console
Interactive container was restarted with process "pk87itbolpib2frm851tdr6fq". Press Ctrl-a-c to switch to the new container
/ # ls /
bin     etc     hi      lib     proc    sys     usr
dev     failed  home    lib64   root    tmp     var
/ # cat failed
failed
/ # Switched IO
(buildx) rollback --init
invoke error: context canceled
Interactive container was restarted with process "bl6shbfd1hgnmj4syv5g8i16u". Press Ctrl-a-c to switch to the new container
(buildx) Switched IO
/ # ls
bin    etc    home   lib64  root   tmp    var
dev    hi     lib    proc   sys    usr
/ # cat failed
cat: can't open 'failed': No such file or directory

@ktock
Copy link
Collaborator Author

ktock commented Feb 21, 2023

Ready for review about codes but it seems that documents didn't pass CI but make docs doesn't fix this. How can we pass the docs CI on this PR? https://github.com/docker/buildx/actions/runs/4230634394/jobs/7348218338

commands/build.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the monitor-invoke-mode-restore branch from 3328ff8 to f2f69fb Compare February 24, 2023 12:15
@ktock
Copy link
Collaborator Author

ktock commented Mar 1, 2023

CI failure seems unrelated to the change. https://github.com/docker/buildx/actions/runs/4301437650/jobs/7498699046#step:10:63

ERROR: failed to solve: busybox: failed to do request: Head "https://registry-1.docker.io/v2/library/busybox/manifests/latest": dial tcp: lookup registry-1.docker.io on 10.43.0.10:53: read udp 10.42.0.2:36373->10.43.0.10:53: i/o timeout

@ktock ktock force-pushed the monitor-invoke-mode-restore branch from df16bae to 0ac64b8 Compare March 2, 2023 11:26
@ktock
Copy link
Collaborator Author

ktock commented Mar 2, 2023

All green 🎉

@ktock ktock force-pushed the monitor-invoke-mode-restore branch from 0ac64b8 to c7f16e3 Compare March 8, 2023 07:54
@ktock
Copy link
Collaborator Author

ktock commented Mar 8, 2023

@jedevc @tonistiigi @crazy-max Rebased, PTAL 🙏

@ktock
Copy link
Collaborator Author

ktock commented Apr 3, 2023

Could we move this PR and related ones forward?

Copy link
Collaborator

@jedevc jedevc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had some pending comments and forgot to post them - I'll try and take a more in-depth look asap.

controller/local/controller.go Show resolved Hide resolved
build/build.go Outdated Show resolved Hide resolved
case "on-error":
// NOTE: we overwrite the command to run because the original one should fail on the failed step.
// TODO: make this configurable.
cfg.Cmd = []string{"/bin/sh"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we don't have any idea what the shell at this point during execution is kinda sad 😢

For some reason, this feels familiar, like I've had this discussion before somewhere - but can't remember where 🤔 Regardless, I think we should follow-up with a buildkit patch to somehow store the current shell into the LLB? That way, when attaching a debugger, we can always find the right shell configuration to use. I guess this applies to trying to find the working directory and similar (if we try and invoke at a vertex that isn't a ExecOp).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

commands/build.go Outdated Show resolved Hide resolved
Comment on lines +365 to +355
switch op := solveErr.Solve.Op.GetOp().(type) {
case *pb.Op_Exec:
return op.Exec, nil
default:
return nil, errors.Errorf("invoke: unsupported error type")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support this on other OpTypes? e.g. if a copy fails, would we want to be able to break there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. It looks like that the error doesn't propagates the execution information (e.g. args,usr,cwd,envvar,...) on non-exec steps so we need to modify BuildKit maybe. We also need to think about the UI for how to make COPY's source and/or destination visible to the user.

commands/debug-shell.go Show resolved Hide resolved
controller/build/build.go Outdated Show resolved Hide resolved
@jedevc jedevc added this to the v0.11.0 milestone Apr 4, 2023
@ktock ktock force-pushed the monitor-invoke-mode-restore branch from 152318d to ea63f48 Compare April 9, 2023 23:52
@ktock
Copy link
Collaborator Author

ktock commented Apr 10, 2023

Thanks for the comment. Created separate commits for changes.

Copy link
Collaborator

@jedevc jedevc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting up commits a bit more, it's significantly easier to look through a review 🎉 Looks like the split has a few chunks in the wrong commits though, each commit should be self-contained and compile by itself.

Aside from that, just a couple minor comments, otherwise I'm happy 🎉

controller/build/build.go Outdated Show resolved Hide resolved
build/build.go Show resolved Hide resolved
build/result.go Show resolved Hide resolved
build/result.go Outdated Show resolved Hide resolved
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@ktock ktock force-pushed the monitor-invoke-mode-restore branch from ea63f48 to 0e60410 Compare April 15, 2023 06:31
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@ktock ktock force-pushed the monitor-invoke-mode-restore branch from 0e60410 to fd5d90c Compare April 15, 2023 06:38
@ktock ktock requested a review from jedevc April 15, 2023 06:47
@jedevc jedevc merged commit 0b432cc into docker:master Apr 18, 2023
56 checks passed
@ktock ktock deleted the monitor-invoke-mode-restore branch April 18, 2023 09:46
@jedevc
Copy link
Collaborator

jedevc commented Apr 18, 2023

I would say that before we merge #1656, I'd like to see a couple follow-ups to this:

  • We should support breaking on other failed commands, like COPY - atm, on-error stops in the right place, but I can't spawn any session. Even if we just drop into a default shell, with no extra args, environment variables, that's alright for now. This will be necessary to provide a good experience with monitor: breakpoint debugger on monitor and on IDEs (via DAP)  #1656 as well.

  • The commands in the monitor are getting long - we really need to split these up, out of a switch statement, and ideally into something a bit more dynamic. Something like implementations of a monitor.Command interface?

    As inspiration:

    type Command interface {
        // executes the command
        Exec(args []string /* anything else we might need*/)
    
        // returns command information
        Info() Info
    }
    
    type Info struct {
        // a usage string, e.g. "rollback [--init] [entrypoint] [cmd...]"
        Usage string
    
        // a short description, e.g. "re-runs the interactive container with initial rootfs contents"
        Description string
    
        // a longer description, describes in detail what the command does, the
        // behavior of each arg, etc.
        Help string
    }

@ktock
Copy link
Collaborator Author

ktock commented Apr 18, 2023

Thank you for the suggestion. SGTM.

@jedevc
Copy link
Collaborator

jedevc commented Apr 18, 2023

@ktock I quickly hit a couple of issues just now (though this seems to be an issue before this PR tbf):

  • The second .Build request we create always seems to exit with context cancelled - this means the buildkit logs are full of time="2023-04-18T12:02:37Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled". This now happens on every build, since we always spin up a second Build right away in getResultAt, instead of waiting until invoke is called.

    Reproducer: buildx build . --builder=dev --detach=false

    Seemingly related to this, we don't seem to get this for the remote controller when --detach=true and --sbom=true, instead, in this case we get a BuildKit panic 😢

    I'm not sure why we get different results between the local and remote controller? I'm guessing one gets cancelled, and the other doesn't?

    Reproducer: buildx build . --builder=dev --detach=true --sbom=true

    panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x125f688]
    
    goroutine 21695 [running]:
    go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.func1()
        /src/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:383 +0x2a
    go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc000c3c300, {0x0, 0x0, 0x2652920?})
        /src/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:421 +0xa89
    panic({0x1697f20, 0x2624f80})
        /usr/local/go/src/runtime/panic.go:884 +0x213
    github.com/moby/buildkit/solver/llbsolver/proc.SBOMProcessor.func1({0x1b58a08, 0xc0034f5f50}, 0xc004f9f8d0, 0xc00036a960, 0xc002729ec0)
        /src/solver/llbsolver/proc/sbom.go:41 +0x448
    github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve(0xc00036a960, {0x1b58a08, 0xc0034f5f50}, {0xc00086a720, _}, {_, _}, {0x0, 0x0, {0x0, ...}, ...}, ...)
        /src/solver/llbsolver/solver.go:511 +0x12ad
    github.com/moby/buildkit/control.(*Controller).Solve(0xc00016cee0, {0x1b58a08, 0xc0034f5f50}, 0xc002611b80)
        /src/control/control.go:421 +0x1208
    github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1({0x1b58a08, 0xc0034f5f50}, {0x18b48e0?, 0xc002611b80})
        /src/api/services/control/control.pb.go:2438 +0x78
    github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor({0x1b58a08, 0xc0034f5f50}, {0x18b48e0?, 0xc002611b80?}, 0xc00055b508?, 0x4dcff7?)
        /src/util/grpcerrors/intercept.go:14 +0x3d
    github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1b58a08?, 0xc0034f5f50?}, {0x18b48e0?, 0xc002611b80?})
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
    go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x1b58960, 0xc0027a6d70}, {0x18b48e0, 0xc002611b80}, 0xc001c710c0, 0xc001c71100)
        /src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:342 +0x528
    main.unaryInterceptor.func1({0x1b58a08?, 0xc0034f5e00?}, {0x18b48e0, 0xc002611b80}, 0xc001c710c0, 0xc001c71100)
        /src/cmd/buildkitd/main.go:576 +0x1b5
    github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1b58a08?, 0xc0034f5e00?}, {0x18b48e0?, 0xc002611b80?})
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
    github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x1b58a08, 0xc0034f5e00}, {0x18b48e0, 0xc002611b80}, 0xc002baea50?, 0x1696c00?)
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xbe
    github.com/moby/buildkit/api/services/control._Control_Solve_Handler({0x1868080?, 0xc00016cee0}, {0x1b58a08, 0xc0034f5e00}, 0xc0029f4b60, 0xc00031bb60)
        /src/api/services/control/control.pb.go:2440 +0x138
    google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005723c0, {0x1b617e0, 0xc003fc1040}, 0xc002c7fb00, 0xc00028d500, 0x2637458, 0x0)
        /src/vendor/google.golang.org/grpc/server.go:1336 +0xd33
    google.golang.org/grpc.(*Server).handleStream(0xc0005723c0, {0x1b617e0, 0xc003fc1040}, 0xc002c7fb00, 0x0)
        /src/vendor/google.golang.org/grpc/server.go:1704 +0xa36
    google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /src/vendor/google.golang.org/grpc/server.go:965 +0x98
    created by google.golang.org/grpc.(*Server).serveStreams.func1
        /src/vendor/google.golang.org/grpc/server.go:963 +0x28a
    

    We shouldn't be always returning nil from this build function, while passing it the same solve opt - the solve opt contains instructions to generate an sbom from the result, which we shouldn't do for an empty result. We need to keep the SolveOpt empty (or ideally reuse the previous gateway client instead of creating a new one, though I guess that's not possible). I opened a buildkit PR to fix the panic upstream, but I think we still need to fix this here -- e.g. does this mean we're running the exporter twice?

  • This call now seems suspicious to me:

    buildx/build/result.go

    Lines 80 to 83 in ba6e5cd

    res2, err := c.Solve(ctx, gateway.SolveRequest{
    Evaluate: true,
    Definition: target,
    })
    . Because Evaluate: true is set, we don't actually get any logs until we return from the parent function. I think we might need to keep this lazy?

    @tonistiigi is this what we'd expect from Evaluate: true? Maybe I made a mistake while implementing it.

    Actually, looking at this again, this looks as simple as "currently the status channel is discarded". If we just forward it to the right place, we should get the status logs properly as desired.

@jedevc
Copy link
Collaborator

jedevc commented Apr 18, 2023

Another thing, we seem to keep getting tripped up here:

ref, err := res.SingleRef()

If the result is empty, because we did a buildx build --print, or if it has multiple elements, because it had multiple --platforms, then we'll panic/fail.

This doesn't just break for BUILDX_EXPERIMENTAL but even without 😱 I've tried isolating this effects of this PR to just when experimental is enabled by writing #1747.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants