Missing builder pod launch #199
So I'm pretty sure that we need to handle all possible pod states in the initial […]. In this case, we must also handle […].
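For illustration only, here is a minimal Go sketch of what "handle all possible pod states" could look like. It uses current client-go types and an invented helper name, not the builder's actual code:

```go
// Hypothetical sketch only: exhaustively handle every pod phase instead of
// assuming the happy Pending -> Running -> Succeeded path.
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// classifyPhase is an invented helper name for this sketch.
func classifyPhase(pod *corev1.Pod) (done bool, err error) {
	switch pod.Status.Phase {
	case corev1.PodPending, corev1.PodRunning:
		return false, nil // keep waiting / keep streaming logs
	case corev1.PodSucceeded:
		return true, nil // build finished cleanly
	case corev1.PodFailed:
		return true, fmt.Errorf("build pod %s failed", pod.Name)
	case corev1.PodUnknown:
		return true, fmt.Errorf("build pod %s is in an unknown state", pod.Name)
	default:
		return true, fmt.Errorf("unhandled pod phase %q for %s", pod.Status.Phase, pod.Name)
	}
}
```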
So I added a bunch more debugging information to a local builder. We've got a mess of possible state transitions for the builder pods, and I'm seeing many instances where the slugbuilder pod will […].

I think the most sensible move going forward will be for builder to subscribe to all events in the deis namespace before launching the slugbuilder pod, so that builder can see the "record of truth" from the stream. We should be able to see that the pod came up, completed, and exited. Any attempt to use the k8s API with respect to pods directly will probably suffer from the […].

This most certainly leaves builder log streaming in an unfortunate position. Logs will most likely be "best effort" for beta even with event processing (and possibly for v2.0.0). Ultimately, we may need to move away from […] toward something that should be more dependable and reliable than fetching logs from the k8s API.
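To make the "subscribe before launch" idea concrete, here is a rough Go sketch. It assumes current client-go (not the exact client library builder used at the time), and watchBuildEvents is a hypothetical helper name:

```go
// A minimal sketch: subscribe to events in the deis namespace *before* the
// slugbuilder pod is created, so no transition is missed.
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchBuildEvents(ctx context.Context, client kubernetes.Interface, podName string) error {
	// Start the watch first; only after this succeeds should the caller
	// actually create the slugbuilder pod.
	w, err := client.CoreV1().Events("deis").Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*corev1.Event)
		if !ok || e.InvolvedObject.Name != podName {
			continue
		}
		// The event stream is the "record of truth": Scheduled, Pulled,
		// Created, Started, Killing, etc. all show up here.
		fmt.Printf("%s: %s: %s\n", podName, e.Reason, e.Message)
	}
	return nil
}
```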
I'm late to the party here (and on mobile), so apologies if I'm repeating anything. I have a prototype PR for watching the event stream to observe all of the pod's state transitions. That PR also begins streaming logs after the pod has been launched, but with 'previous' set to true, which afaik solves the log streaming issue. Even though I'd love to build a log streaming system for builder pod -> builder, I'd like to try as hard as possible to stick to the k8s API here.

The state of that PR is purely prototypical and I'm not even sure it builds right now, but I just wanted to throw out the info and the current state of affairs. Given that we're talking about a race condition here (and have been for at least the past few weeks), we haven't solved the bug until we make a fundamental change such as this one. Finally, there are a few other strategies we could take as well, which I'll throw out in a separate issue or comment when I have better access to the internets.

Sent from my iPhone
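As a sketch of the 'previous' approach described above (again assuming current client-go rather than the PR's actual code, with an invented helper name):

```go
// Illustrative only: fetch logs with the Previous option set, so container
// logs are still retrievable after the container has exited.
package sketch

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

func streamPreviousLogs(ctx context.Context, client kubernetes.Interface, podName string) error {
	req := client.CoreV1().Pods("deis").GetLogs(podName, &corev1.PodLogOptions{
		Previous: true, // logs of the last terminated container instance
	})
	rc, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()
	_, err = io.Copy(os.Stdout, rc)
	return err
}
```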
This should probably be a blocker and deserves the events/watch fix, IMO. Trying to stick to k8s APIs makes sense as long as streaming with […].
Still debugging this with @smothiki. We've found that, when things go wrong with slugbuilder, k8s is killing the pod in such a way that its status goes directly to Terminating:

```
$ kubectl --namespace=deis get po -w
NAME                                          READY     STATUS        RESTARTS   AGE
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending       0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending       0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0          0s
```
While the pod was in the Terminating state, we tried to get its status with kubectl. kubectl errored out saying the pod was not available, while -w was still showing the pod's status as Terminating. Also, this bug is not limited to builder. We're trying to debug from the Kubernetes API server and kubelet end; not much luck as of now.
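For reference, the equivalent of that kubectl check in Go might look roughly like this (assuming current client-go; the helper name is invented). It shows how a direct Get can already return NotFound while a watch on the same pod is still reporting Terminating:

```go
// A minimal sketch of polling a pod directly, which is racy by nature:
// by the time we ask, the pod may already be gone from the API.
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func checkPod(ctx context.Context, client kubernetes.Interface, podName string) {
	pod, err := client.CoreV1().Pods("deis").Get(ctx, podName, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		fmt.Printf("pod %s is already gone from the API\n", podName)
	case err != nil:
		fmt.Printf("error fetching pod %s: %v\n", podName, err)
	default:
		fmt.Printf("pod %s phase: %s\n", podName, pod.Status.Phase)
	}
}
```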
The namespace-wide events stream is what you want to look at. I think I've always seen […].
In the case where a pod transitions to terminated and is GC'd before we get logs, we're hosed for logs. Thus the recommendation to move log streaming into our application layer rather than relying on k8s.
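A purely hypothetical sketch of what "application-layer" log streaming could mean: the build pod pushes its own output to builder over HTTP instead of builder pulling logs through the k8s API. The /logs endpoint and everything else here is invented for illustration, not an actual design decision:

```go
// Hypothetical: ship build output line-by-line to builder over HTTP.
package sketch

import (
	"bufio"
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func shipLogs(builderURL string) error {
	scanner := bufio.NewScanner(os.Stdin) // build output piped into this process
	for scanner.Scan() {
		// A real implementation would batch, retry, and authenticate.
		resp, err := http.Post(builderURL+"/logs", "text/plain", bytes.NewReader(scanner.Bytes()))
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("builder rejected log line: %s", resp.Status)
		}
	}
	return scanner.Err()
}
```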
Here are the relevant events after a failed Dockerfile build:

```
$ kubectl get events -w --all-namespaces
...
NAMESPACE FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod Scheduled {scheduler } Successfully assigned dockerbuild-oldest-yodeling-2e2054e7-135a237c to 192.168.64.2
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod implicitly required container POD Pulled {kubelet 192.168.64.2} Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod implicitly required container POD Created {kubelet 192.168.64.2} Created with docker id 9656c4dc735c
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod implicitly required container POD Started {kubelet 192.168.64.2} Started with docker id 9656c4dc735c
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod spec.containers{deis-dockerbuilder} Pulling {kubelet 192.168.64.2} Pulling image "quay.io/deisci/dockerbuilder:v2-beta"
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod spec.containers{deis-dockerbuilder} Pulled {kubelet 192.168.64.2} Successfully pulled image "quay.io/deisci/dockerbuilder:v2-beta"
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod spec.containers{deis-dockerbuilder} Created {kubelet 192.168.64.2} Created with docker id 5b9e9f608634
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod spec.containers{deis-dockerbuilder} Started {kubelet 192.168.64.2} Started with docker id 5b9e9f608634
deis 0s 0s 1 dockerbuild-oldest-yodeling-2e2054e7-135a237c Pod implicitly required container POD Killing {kubelet 192.168.64.2} Killing with docker id 9656c4dc735c
```
@mboersma what does 0s mean? Execution time?
"FIRST SEEN" and "LAST SEEN" |
Sorry, I pasted the header back in to hopefully make that output easier to decipher.
This is fixed by #206. Yes, really.
@smothiki I'd like to keep this open as a showstopper for RC. I believe there is still a race in the code between pod launch and observing the event stream.
This (#185) should definitely solve the issue.
👍 I've noted that fact at the end of the description in #185.
promoting to beta3 |
Image: smothiki/builder:v2.1
Builder is hung on git push (and will wait for timeout):
It looks like the slugbuilder launched and completed though:
And the logs from the container (pulled off the host):
Rel #207
Rel #185
Rel #298