Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spin Pod stuck at the Terminating state #207

Closed
rgl opened this issue Feb 5, 2024 · 8 comments
Closed

Spin Pod stuck at the Terminating state #207

rgl opened this issue Feb 5, 2024 · 8 comments

Comments

@rgl
Copy link

rgl commented Feb 5, 2024

I've created my own independent example tinygo spin application at https://github.com/rgl/spin-http-go-example and updated my k3s playground at https://github.com/rgl/k3s-vagrant to deploy the spin-http-go-example example.

Creating and using the Pod works fine, except deleting the Pod. Although the delete request is accepted by kubernetes, the Pod is stuck at the Terminating state.

It seems something is broken, but I'm not sure what.

I've added support for containerd-shim-spin-v2 0.10.0 (spin 2.0.1 webassembly runtime) in rgl/k3s-vagrant@e5df418

Can you please shed some light on how to troubleshoot this? kubectl describe on the pod, does not seem to show anything useful.

@Mossaka
Copy link
Member

Mossaka commented Feb 5, 2024

It might related to containerd/runwasi#418

Can you please shed some light on how to troubleshoot this?

The best way I am aware of is to examine the containerd logs for the spin shim.

If you are using the spin shim in k3s, you can refer to this guide on how to find the containerd logs

@rgl
Copy link
Author

rgl commented Feb 6, 2024

Thank you for the pointers!

I'm starting the pod as described at https://github.com/rgl/k3s-vagrant/blob/master/example-spin.yml

Although I can submit kubectl delete for the pod, the following command never seems to return:

$ kubectl delete pod/example-spin-654dccd659-8dskm 
pod "example-spin-654dccd659-8dskm" deleted

Here's what the tail -f of the containerd logs looks like:

# tail -f /var/lib/rancher/k3s/agent/containerd/containerd.log
time="2024-02-06T09:22:42.068337280Z" level=info msg="StopContainer for \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" with timeout 30 (s)"
time="2024-02-06T09:22:42.411348488Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:example-spin-654dccd659-8tltx,Uid:d479a9b2-e1b6-4509-b35d-8f89aea4f16e,Namespace:default,Attempt:0,}"

map[string]interface {}{"cniVersion":"1.0.0", "forceAddress":true, "hairpinMode":true, "ipMasq":false, "ipam":map[string]interface {}{"ranges":[][]map[string]interface {}{[]map[string]interface {}{map[string]interface {}{"subnet":"10.12.1.0/24"}}}, "routes":[]types.Route{types.Route{Dst:net.IPNet{IP:net.IP{0xa, 0xc, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0x0, 0x0}}, GW:net.IP(nil)}}, "type":"host-local"}, "isDefaultGateway":true, "isGateway":true, "mtu":(*uint)(0xc0000a8738), "name":"cbr0", "type":"bridge"}
delegateAdd: netconf sent to delegate plugin:
{"cniVersion":"1.0.0","forceAddress":true,"hairpinMode":true,"ipMasq":false,"ipam":{"ranges":[[{"subnet":"10.12.1.0/24"}]],"routes":[{"dst":"10.12.0.0/16"}],"type":"host-local"},"isDefaultGateway":true,"isGateway":true,"mtu":1500,"name":"cbr0","type":"bridge"}time="2024-02-06T09:22:42.488650624Z" level=info msg="server listen started"

time="2024-02-06T09:22:42.488707028Z" level=info msg="server started"

time="2024-02-06T09:22:42.4887162Z" level=info msg="Shim successfully started, waiting for exit signal..."

time="2024-02-06T09:22:42.494708048Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:example-spin-654dccd659-8tltx,Uid:d479a9b2-e1b6-4509-b35d-8f89aea4f16e,Namespace:default,Attempt:0,} returns sandbox id \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\""
time="2024-02-06T09:22:42.496768712Z" level=info msg="CreateContainer within sandbox \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\" for container &ContainerMetadata{Name:example,Attempt:0,}"
time="2024-02-06T09:22:42.507516899Z" level=info msg="CreateContainer within sandbox \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\" for &ContainerMetadata{Name:example,Attempt:0,} returns container id \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\""
time="2024-02-06T09:22:42.508273480Z" level=info msg="StartContainer for \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\""
time="2024-02-06T09:22:42.517526823Z" level=info msg="manifest is not in WASM OCI image format"

time="2024-02-06T09:22:42.518658869Z" level=info msg="cgroup manager V2 will be used"

time="2024-02-06T09:22:42.518766618Z" level=warn msg="Controller rdma is not yet implemented."

time="2024-02-06T09:22:42.518784762Z" level=warn msg="Controller misc is not yet implemented."

time="2024-02-06T09:22:42.539281631Z" level=warn msg="Controller rdma is not yet implemented."

time="2024-02-06T09:22:42.539340343Z" level=warn msg="Controller misc is not yet implemented."

time="2024-02-06T09:22:42.545786429Z" level=info msg="close_range; preserve_fds=0"

time="2024-02-06T09:22:42.972849735Z" level=warn msg="intermediate process already reaped"

time="2024-02-06T09:22:42.976602376Z" level=info msg="starting instance: dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141"

time="2024-02-06T09:22:42.977461079Z" level=info msg="calling start function"

time="2024-02-06T09:22:42.977593832Z" level=info msg="setting up wasi"

time="2024-02-06T09:22:42.983485892Z" level=info msg="StartContainer for \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\" returns successfully"
time="2024-02-06T09:22:44.069400677Z" level=error msg="get state for 0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919" error="context deadline exceeded: unknown"
time="2024-02-06T09:22:44.069484079Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:22:44.069543062Z" level=info msg="Stop container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" with signal terminated"
time="2024-02-06T09:22:46.200462576Z" level=info msg="stats: StatsRequest { id: "dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"

time="2024-02-06T09:22:58.77219967Z" level=info msg="instantiate_pre;"

time="2024-02-06T09:22:58.773132879Z" level=info msg=" >>> running spin trigger"

time="2024-02-06T09:22:58.773738621Z" level=info msg=" >>> notifying main thread we are about to start"

time="2024-02-06T09:22:58.774493624Z" level=info msg="Serving http://0.0.0.0:80"

time="2024-02-06T09:23:40.806373573Z" level=error msg="StopContainer for \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" failed" error="failed to kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": context canceled: unknown"
time="2024-02-06T09:23:40.808160649Z" level=info msg="StopPodSandbox for \"e23b2fe57bcd33134ac414a091e42b25c1aeefd0887a3753720ba640939f9fd9\""
time="2024-02-06T09:23:42.809070520Z" level=error msg="get state for ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f" error="context deadline exceeded: unknown"
time="2024-02-06T09:23:42.809150816Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:23:42.809185282Z" level=info msg="Kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\""
time="2024-02-06T09:25:12.068440580Z" level=error msg="StopContainer for \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" failed" error="failed to stop container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\": context deadline exceeded: unknown"
time="2024-02-06T09:25:12.069488406Z" level=info msg="StopPodSandbox for \"cc4cd0cb2be413f03b06ae3d52c18dbcfd4992672f6a10af8edf4ca91ff25474\""
time="2024-02-06T09:25:14.070067749Z" level=error msg="get state for 0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919" error="context deadline exceeded: unknown"
time="2024-02-06T09:25:14.070151733Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:25:14.070191783Z" level=info msg="Kill container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\""
time="2024-02-06T09:25:40.809165923Z" level=error msg="StopPodSandbox for \"e23b2fe57bcd33134ac414a091e42b25c1aeefd0887a3753720ba640939f9fd9\" failed" error="failed to stop container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": failed to kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": context deadline exceeded: unknown"
time="2024-02-06T09:25:40.809863923Z" level=info msg="StopPodSandbox for \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\""
time="2024-02-06T09:25:40.810007290Z" level=info msg="Container to stop \"cfeb2a380219c92f3d0fab2d57cad5542812736e185433888a2e5f1ad4adba2b\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-02-06T09:25:40.840434562Z" level=info msg="TearDown network for sandbox \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\" successfully"
time="2024-02-06T09:25:40.840480615Z" level=info msg="StopPodSandbox for \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\" returns successfully"
time="2024-02-06T09:25:41.758165380Z" level=info msg="StopContainer for \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" with timeout 30 (s)"
time="2024-02-06T09:25:43.759072133Z" level=error msg="get state for ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f" error="context deadline exceeded: unknown"
time="2024-02-06T09:25:43.759133691Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:25:43.759198951Z" level=info msg="Skipping the sending of signal terminated to container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" because a prior stop with timeout>0 request already sent the signal"
time="2024-02-06T09:26:13.759386968Z" level=info msg="Kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\""

After that the logs all look the same, it looping trying/failing to kill the container.

@Mossaka
Copy link
Member

Mossaka commented Feb 6, 2024

time="2024-02-06T09:25:12.068440580Z" level=error msg="StopContainer for "0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919" failed" error="failed to stop container "0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919": context deadline exceeded: unknown"

This log looks pretty interesting. What version of containerd are you using?

@rgl
Copy link
Author

rgl commented Feb 6, 2024

I'm using the version bundled with k3s from https://github.com/k3s-io/k3s/releases/tag/v1.28.5+k3s1. At that page, there's a Embedded Component Versions section, which points to Containerd v1.7.11-k3s2.

@alexis-langlet
Copy link

Hello,
I'm experiencing the same issue, is there a fix planned?

@rgl
Copy link
Author

rgl commented May 15, 2024

I think containerd-shim-spin was moved out of this repo into https://github.com/spinkube/containerd-shim-spin.

I also think this issue was fixed in spinkube/containerd-shim-spin#23 but I didn't yet have a chance to try it in k3s.

I did have a chance to try https://github.com/spinkube/containerd-shim-spin/releases/tag/v0.14.1 in talos at rgl/terraform-libvirt-talos@1c32bb7 and there, the issue reported here no longer happens! :-)

@alexis-langlet
Copy link

Ok nice, thank you

@rgl
Copy link
Author

rgl commented May 18, 2024

I've just upgraded to https://github.com/spinkube/containerd-shim-spin/releases/tag/v0.14.1 in k3s at rgl/k3s-vagrant@5b85081 and the issue reported here no longer happens! :-)

@rgl rgl closed this as completed May 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants