Missing builder pod launch #199

Closed
slack opened this issue Feb 24, 2016 · 18 comments


slack commented Feb 24, 2016

Image: smothiki/builder:v2.1

Builder is hung on git push (and will wait for timeout):

k:beef example-go [master]$ g push deis master
Warning: the RSA host key for '[deis.beef.slack.io]:2222' differs from the key for the IP address '[54.200.165.125]:2222'
Offending key for IP in /Users/jhansen/.ssh/known_hosts:109
Matching host key in /Users/jhansen/.ssh/known_hosts:113
Counting objects: 529, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (524/524), done.
Writing objects: 100% (529/529), 54.51 KiB | 0 bytes/s, done.
Total 529 (delta 482), reused 0 (delta 0)
remote: --->

It looks like the slugbuilder launched and completed though:

deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod                                           Scheduled   {scheduler }                                        Successfully assigned slugbuild-earthy-instinct-d4c807b8-a94f52b6 to ip-10-0-0-98.us-west-2.compute.internal
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Pulled      {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Created     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id a626b5939b47
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Started     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id a626b5939b47
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Started     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id 710055360b5b
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Pulled      {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Successfully pulled image "quay.io/deisci/slugbuilder:v2-beta"
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Created     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id 710055360b5b
deis        14s         14s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Killing     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id a626b5939b47
deis        14s         14s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Killing     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id 710055360b5b

And the logs from the container (pulled off the host):

{"log":"\u001b[1G-----\u003e Go app detected\n","stream":"stdout","time":"2016-02-24T00:58:32.299975079Z"}
{"log":"\u001b[1G-----\u003e Checking Godeps/Godeps.json file.\n","stream":"stdout","time":"2016-02-24T00:58:32.310713312Z"}
{"log":"\u001b[1G-----\u003e Installing go1.4.2... done\n","stream":"stdout","time":"2016-02-24T00:58:36.717534582Z"}
{"log":"\u001b[1G-----\u003e Running: godep go install -tags heroku ./...\n","stream":"stdout","time":"2016-02-24T00:58:36.722983276Z"}
{"log":"\u001b[1G-----\u003e Discovering process types\n","stream":"stdout","time":"2016-02-24T00:58:37.321394269Z"}
{"log":"\u001b[1G       Procfile declares types -\u003e web\n","stream":"stdout","time":"2016-02-24T00:58:37.377768932Z"}
{"log":"\u001b[1G-----\u003e Compiled slug size is 1.7M\n","stream":"stdout","time":"2016-02-24T00:58:37.733511004Z"}

Rel #207
Rel #185
Rel #298

@smothiki smothiki self-assigned this Feb 24, 2016
@smothiki smothiki added the bug label Feb 24, 2016
@smothiki smothiki added this to the v2.0-beta1 milestone Feb 24, 2016

slack commented Feb 24, 2016

So I'm pretty sure we need to handle all possible pod states in the initial waitForPod.

In this case, we must also handle api.PodSucceeded.
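A minimal sketch of what that could look like, written against modern client-go; the function name, the polling approach, and the timeout handling are assumptions for illustration, not the builder's actual waitForPod:

package builder

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPod polls the pod until it reaches a phase we can act on, treating
// PodSucceeded the same as PodRunning so a fast build that finishes before
// the first check does not leave the git push hanging.
func waitForPod(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// A 404 here may mean the pod already completed and was cleaned up.
			return err
		}
		switch pod.Status.Phase {
		case corev1.PodRunning, corev1.PodSucceeded:
			return nil
		case corev1.PodFailed:
			return fmt.Errorf("pod %s/%s failed", ns, name)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for pod %s/%s", ns, name)
}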


slack commented Feb 25, 2016

So I added a bunch more debugging information to a local builder. We've got a mess of possible state transitions for the builder pods.

I'm seeing many instances where the slugbuilder pod starts and goes Pending -> Running -> Terminated before we are able to get through waitForPod and all the associated machinery. In its final state, the pod simply appears as a 404, so we are unable to determine whether the pod never existed or existed and has since gone away.

I think the most sensible move going forward will be for builder to subscribe to all events in the deis namespace before launching the slugbuilder pod so that builder can see the "record of truth" from the stream. We should be able to see that the Pod came up, completed and exited. Any attempt to use the k8s API with respect to Pods directly will probably suffer from the Terminated == 404 behavior.

This most certainly leaves builder log streaming in an unfortunate position. Logs will most likely be "best effort" for beta even with event processing (and possibly for v2.0.0). Ultimately, we may need to move away from pod logs and have a small utility/API for the slug and docker builders to stream IO back to the builder component.

That should be more dependable and reliable than fetching logs from the k8s API.
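A rough sketch of that event subscription, again in modern client-go terms; the field selector and the check on the "Killing" reason are illustrative assumptions, not the builder's actual code. The important part is that the watch is opened before the slugbuilder pod is created:

package builder

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchBuildEvents subscribes to events in the deis namespace for a given pod
// name so every transition is captured, even if the pod terminates and 404s
// before we can GET it directly.
func watchBuildEvents(ctx context.Context, client kubernetes.Interface, podName string) error {
	watcher, err := client.CoreV1().Events("deis").Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("involvedObject.name", podName).String(),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	// ... create the slugbuilder pod here, only after the watch is established ...

	for ev := range watcher.ResultChan() {
		event, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		log.Printf("%s: %s %s", podName, event.Reason, event.Message)
		if event.Reason == "Killing" { // the kubelet is tearing the pod down
			return nil
		}
	}
	return nil
}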

arschles commented

I'm late to the party here (and on mobile), so apologies if I'm repeating anything. I have a prototype PR for watching the event stream to observe all of the pod's state transitions. That PR also begins streaming logs after the pod has been launched, but with 'previous' set to true, which AFAIK solves the log streaming issue. (Even though I'd love to build a log streaming system for builder pod -> builder, I'd like to try as hard as possible to stick to the k8s API here.)
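For reference, a minimal sketch of reading logs with the 'previous' option, in modern client-go; the namespace and container name are taken from the events above, and the helper itself is illustrative rather than the PR's actual code:

package builder

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// previousLogs reads the logs of the last terminated instance of the
// slugbuilder container, which is what makes output retrievable even after
// the pod has finished.
func previousLogs(ctx context.Context, client kubernetes.Interface, podName string) error {
	req := client.CoreV1().Pods("deis").GetLogs(podName, &corev1.PodLogOptions{
		Container: "deis-slugbuilder",
		Previous:  true, // read from the previous (terminated) container instance
	})
	stream, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	_, err = io.Copy(os.Stdout, stream)
	return err
}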

That PR is purely a prototype and I'm not even sure it builds right now, but I just wanted to throw out the info and the current state of affairs.

Given that we're talking about a race condition here (and have been for at least the past few weeks), we won't have solved the bug until we make a fundamental change such as this one.

Finally, there are a few other strategies we can take, which I'll throw out in a separate issue or comment when I have better access to the internets.



slack commented Feb 25, 2016

This should probably be a blocker and deserves the events/watch fix IMO.

Trying to stick to k8s APIs makes sense. As long as streaming with previous works for deleted pods, I'm 👍 for that approach.

mboersma commented

Still debugging this with @smothiki, and we've found that when things go wrong with slugbuilder, k8s kills the pod in such a way that its status goes directly from Pending to Terminating. So even watching the event stream might not work, since by the time it's Terminating there's little we can do.

$ kubectl --namespace=deis get po -w
NAME                                          READY     STATUS    RESTARTS   AGE
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending   0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s

smothiki commented

While the pod was in the Terminating state, we tried to get its status with kubectl. kubectl errored out saying the pod was not available, while -w still showed the pod status as Terminating. Also, this bug is not related only to builder; we're trying to debug from the Kubernetes API server and kubelet end, but haven't had much luck so far.
We did check the container of the terminating pod: it ran successfully and exited with exit status 0.


slack commented Feb 25, 2016

The namespace-wide events stream is what you want to look at. I think get po -w suffers from the same pod-oriented view:

I've always seen Started -> Pulled -> Created via kubectl get events -w --all-namespaces.

deis      59m       59m       1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Pulled    {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Successfully pulled image "quay.io/deisci/slugbuilder:v2-beta"
deis      59m       59m       1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Created   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id afd4fa8599ae
deis      1h        1h        1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Started   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id afd4fa8599ae
deis      1h        1h        1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Killing   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id afd4fa8599ae

In the case where a pod transitions to terminated and is GC'd before we get logs, we're hosed. Hence the recommendation to move log streaming into our application layer rather than relying on k8s.

mboersma commented

Here are the relevant events after a failed Dockerfile git push deis master on current v2:

$ kubectl get events -w --all-namespaces
...
NAMESPACE   FIRSTSEEN   LASTSEEN   COUNT     NAME           KIND                    SUBOBJECT   REASON             SOURCE                      MESSAGE
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod                 Scheduled   {scheduler }   Successfully assigned dockerbuild-oldest-yodeling-2e2054e7-135a237c to 192.168.64.2
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Pulled    {kubelet 192.168.64.2}   Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Created   {kubelet 192.168.64.2}   Created with docker id 9656c4dc735c
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Started   {kubelet 192.168.64.2}   Started with docker id 9656c4dc735c
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Pulling   {kubelet 192.168.64.2}   Pulling image "quay.io/deisci/dockerbuilder:v2-beta"
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Pulled    {kubelet 192.168.64.2}   Successfully pulled image "quay.io/deisci/dockerbuilder:v2-beta"
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Created   {kubelet 192.168.64.2}   Created with docker id 5b9e9f608634
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Started   {kubelet 192.168.64.2}   Started with docker id 5b9e9f608634
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Killing   {kubelet 192.168.64.2}   Killing with docker id 9656c4dc735c

smothiki commented

@mboersma what does 0s mean? Execution time?


slack commented Feb 26, 2016

"FIRST SEEN" and "LAST SEEN"

mboersma commented

Sorry--I pasted the header back in to hopefully make that output easier to decipher.

mboersma commented

This is fixed by #206. Yes, really.


arschles commented Mar 3, 2016

@gabrtv found a similar issue in #225, so I'm going to reopen this under RC1. It's unlikely that we'll be able to close this issue until we re-architect the way we stream logs. I've proposed a solution in #207.

@arschles arschles reopened this Mar 3, 2016
@arschles arschles modified the milestones: v2.0-rc1, v2.0-beta1 Mar 3, 2016
smothiki commented

@gabrtv @arschles are we still observing this?

arschles commented

@smothiki I'd like to keep this open as a showstopper for RC. I believe there is still a race in the code between launching the pod and observing the event.

smothiki commented

This (#185) should definitely solve the issue.

arschles commented

👍

I've noted that fact at the end of the description in #185

@arschles arschles added this to the v2.0-beta3 milestone Apr 15, 2016
@arschles arschles removed this from the v2.0-rc1 milestone Apr 15, 2016
arschles commented

promoting to beta3
