Cannot determine if a container was unable to start #13729

Closed
paralin opened this issue Mar 30, 2022 · 21 comments · Fixed by #16806
Labels
kind/feature Categorizes issue or PR as related to a new feature. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@paralin
Contributor

paralin commented Mar 30, 2022

/kind feature

Description

Say I execute a pod with the following pod spec:

        restartPolicy: OnFailure
        containers:
        - image: docker.io/library/alpine:edge
          name: hello
          args:
          - efcho
          - Hello world
          tty: true

The resulting container, of course, cannot start:

podman start container-id
Error: unable to start container "container-id": crun: executable file `efcho` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

... due to command not found.

However, this error does not appear anywhere in the "podman inspect" output:

[
     {
          "State": {
               "OciVersion": "1.0.2-dev",
               "Status": "created",
               "Running": false,
               "Paused": false,
               "Restarting": false,
               "OOMKilled": false,
               "Dead": false,
               "Pid": 0,
               "ExitCode": 0,
               "Error": "",
               "StartedAt": "0001-01-01T00:00:00Z",
               "FinishedAt": "0001-01-01T00:00:00Z",
               "Health": {
                    "Status": "",
                    "FailingStreak": 0,
                    "Log": null
               },
               "CheckpointedAt": "0001-01-01T00:00:00Z",
               "RestoredAt": "0001-01-01T00:00:00Z"
          }
     }
]

... the container just appears as "created" and there's no way to tell why it failed.

Describe the results you received:

What is the correct way to check, via the API, whether a container is in this kind of failed state, and how can I get the error message?

Is the best way to check whether the container is in the "created" state, try to start it via the API, and check for an error?

Describe the results you expected:

This seems like a bit of a hack/workaround; it would be best if the error were recorded somewhere in the container status.

@openshift-ci openshift-ci bot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 30, 2022
@paralin
Contributor Author

paralin commented Mar 30, 2022

For reference, this is what I'm doing now as a workaround:

https://gist.github.com/paralin/ad4279d14a588eb3e897519b9f299907

... the error is only exposed when calling Start on the container, so if the Pod fails to start (created state), I have to loop over the containers & call Start on each until one returns an error.

@mheon
Member

mheon commented Apr 4, 2022

So, to be perfectly clear, the request is to include the error in the Error field of container inspect's State struct?

@paralin
Contributor Author

paralin commented Apr 4, 2022

@mheon That would work, ideally along with some indication of the exit code or the nature of the failure.

@github-actions

github-actions bot commented May 5, 2022

A friendly reminder that this issue had no activity for 30 days.

@paralin
Contributor Author

paralin commented May 5, 2022

This is still relevant, I think: there's no non-hacky way to determine whether a container failed to start.

@github-actions

github-actions bot commented Jun 5, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jun 6, 2022

@mheon any chance you can make this happen? Or should we hand this to an intern?

@mheon
Member

mheon commented Jun 6, 2022

Sure. Just need to add a field to the DB to store the error message from the last run, and throw that in podman inspect.

@github-actions

github-actions bot commented Jul 7, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jul 7, 2022

@mheon Any chance you can implement this?

@mheon
Member

mheon commented Jul 7, 2022

Sure, I'll put it on the list

@mheon mheon self-assigned this Jul 7, 2022
@github-actions

github-actions bot commented Aug 7, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 7, 2022

@mheon Any progress on this?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 14, 2022

@mheon Reminder.

@mheon
Member

mheon commented Sep 14, 2022

On it

@jakecorrenti
Member

@mheon if you're not too far along with this I wouldn't mind taking a stab at it

@mheon
Member

mheon commented Oct 27, 2022

@jakecorrenti Feel free to take a stab at it!

@mheon mheon assigned jakecorrenti and unassigned mheon Oct 27, 2022
@jakecorrenti
Member

jakecorrenti commented Nov 17, 2022

I currently have a partial fix for this, but it doesn't cover the entire scope of the issue. In

func (c *Container) Start(ctx context.Context, recursive bool) error {

I currently have something along these lines:

// saveErrorState records the error message in the container state and
// persists it, returning any error from the save itself.
saveErrorState := func(e error) error {
    c.state.Error = e.Error()
    return c.save()
}

which is called when an error occurs in the function. The problem is that this doesn't solve the entire issue. If an error occurs in the container engine's ContainerRun function, I am not able (to the best of my knowledge) to modify the container's state and save it without modifying the API in some way. Is there a way to do this without making any changes to the API?

For reference, this is the result of the above fix for the case described in this issue:

         "State": {
              "OciVersion": "1.0.2-dev",
              "Status": "created",
              "Running": false,
              "Paused": false,
              "Restarting": false,
              "OOMKilled": false,
              "Dead": false,
              "Pid": 0,
              "ExitCode": 0,
              "Error": "crun: executable file `efcho` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found",
              "StartedAt": "0001-01-01T00:00:00Z",
              "FinishedAt": "0001-01-01T00:00:00Z",
              "Health": {
                   "Status": "",
                   "FailingStreak": 0,
                   "Log": null
              },
              "CheckpointedAt": "0001-01-01T00:00:00Z",
              "RestoredAt": "0001-01-01T00:00:00Z"
         },
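With the error recorded in State.Error, a client could detect a failed start from the inspect output alone. The following is a rough sketch of that check, assuming the array-shaped JSON shown earlier in this issue. `inspectEntry` and `startFailure` are illustrative names rather than Podman API, and the sample message is abbreviated. Treating "created" plus a non-empty Error as a failed start is an assumption based on the output quoted in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inspectEntry models only the inspect fields this sketch needs.
type inspectEntry struct {
	State struct {
		Status string `json:"Status"`
		Error  string `json:"Error"`
	} `json:"State"`
}

// startFailure parses inspect JSON and reports the recorded error of the
// first entry that is still in the "created" state with a non-empty Error.
func startFailure(inspectJSON []byte) (string, bool, error) {
	var entries []inspectEntry
	if err := json.Unmarshal(inspectJSON, &entries); err != nil {
		return "", false, err
	}
	for _, e := range entries {
		if e.State.Status == "created" && e.State.Error != "" {
			return e.State.Error, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Abbreviated sample modeled on the inspect output quoted in this thread.
	sample := []byte(`[{"State": {"Status": "created", "Error": "crun: executable file not found in $PATH"}}]`)
	msg, failed, err := startFailure(sample)
	if err != nil {
		fmt.Println("could not parse inspect output:", err)
		return
	}
	if failed {
		fmt.Println("container failed to start:", msg)
	}
}
```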

@mheon
Member

mheon commented Nov 17, 2022

Honestly, I think it's sufficient to just capture the Libpod bits... It would probably even be sufficient to capture just the errors coming out of the OCI runtime.

@jakecorrenti
Member

sounds good

jakecorrenti added a commit to jakecorrenti/podman that referenced this issue Dec 11, 2022
This change aims to store an error message to the ContainerState struct
when an error comes out of the OCI Runtime.

Fixes: containers#13729

Signed-off-by: Jake Correnti <jakecorrenti+github@proton.me>
jakecorrenti added a commit to jakecorrenti/podman that referenced this issue Dec 13, 2022
This change aims to store an error message to the ContainerState struct
with the last known error from the Start, StartAndAttach, and Stop OCI
Runtime functions.

The goal was to act in accordance with Docker's behavior.

Fixes: containers#13729

Signed-off-by: Jake Correnti <jakecorrenti+github@proton.me>
jakecorrenti added a commit to jakecorrenti/podman that referenced this issue Dec 31, 2022
This change aims to store an error message to the ContainerState struct
with the last known error from the Start, StartAndAttach, and Stop OCI
Runtime functions.

The goal was to act in accordance with Docker's behavior.

Fixes: containers#13729

Signed-off-by: Jake Correnti <jakecorrenti+github@proton.me>
jakecorrenti added a commit to jakecorrenti/podman that referenced this issue Jan 3, 2023
This change aims to store an error message to the ContainerState struct
with the last known error from the Start, StartAndAttach, and Stop OCI
Runtime functions.

The goal was to act in accordance with Docker's behavior.

Fixes: containers#13729

Signed-off-by: Jake Correnti <jakecorrenti+github@proton.me>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023