Missing builder pod launch #199

Closed
slack opened this issue Feb 24, 2016 · 18 comments


slack commented Feb 24, 2016

Image: smothiki/builder:v2.1

Builder is hung on git push (and will wait for timeout):

k:beef example-go [master]$ g push deis master
Warning: the RSA host key for '[deis.beef.slack.io]:2222' differs from the key for the IP address '[54.200.165.125]:2222'
Offending key for IP in /Users/jhansen/.ssh/known_hosts:109
Matching host key in /Users/jhansen/.ssh/known_hosts:113
Counting objects: 529, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (524/524), done.
Writing objects: 100% (529/529), 54.51 KiB | 0 bytes/s, done.
Total 529 (delta 482), reused 0 (delta 0)
remote: --->

It looks like the slugbuilder launched and completed though:

deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod                                           Scheduled   {scheduler }                                        Successfully assigned slugbuild-earthy-instinct-d4c807b8-a94f52b6 to ip-10-0-0-98.us-west-2.compute.internal
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Pulled      {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Created     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id a626b5939b47
deis        22s         22s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Started     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id a626b5939b47
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Started     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id 710055360b5b
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Pulled      {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Successfully pulled image "quay.io/deisci/slugbuilder:v2-beta"
deis        20s         20s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Created     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id 710055360b5b
deis        14s         14s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       implicitly required container POD   Killing     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id a626b5939b47
deis        14s         14s        1         slugbuild-earthy-instinct-d4c807b8-a94f52b6   Pod       spec.containers{deis-slugbuilder}   Killing     {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id 710055360b5b

And the logs from the container (pulled off the host):

{"log":"\u001b[1G-----\u003e Go app detected\n","stream":"stdout","time":"2016-02-24T00:58:32.299975079Z"}
{"log":"\u001b[1G-----\u003e Checking Godeps/Godeps.json file.\n","stream":"stdout","time":"2016-02-24T00:58:32.310713312Z"}
{"log":"\u001b[1G-----\u003e Installing go1.4.2... done\n","stream":"stdout","time":"2016-02-24T00:58:36.717534582Z"}
{"log":"\u001b[1G-----\u003e Running: godep go install -tags heroku ./...\n","stream":"stdout","time":"2016-02-24T00:58:36.722983276Z"}
{"log":"\u001b[1G-----\u003e Discovering process types\n","stream":"stdout","time":"2016-02-24T00:58:37.321394269Z"}
{"log":"\u001b[1G       Procfile declares types -\u003e web\n","stream":"stdout","time":"2016-02-24T00:58:37.377768932Z"}
{"log":"\u001b[1G-----\u003e Compiled slug size is 1.7M\n","stream":"stdout","time":"2016-02-24T00:58:37.733511004Z"}

Rel #207
Rel #185
Rel #298

@smothiki smothiki self-assigned this Feb 24, 2016
@smothiki smothiki added the bug label Feb 24, 2016
@smothiki smothiki added this to the v2.0-beta1 milestone Feb 24, 2016

slack commented Feb 24, 2016

So I'm pretty sure we need to handle all possible pod states in the initial waitForPod.

In this case, we must also handle api.PodSucceeded.
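A minimal sketch of what that could look like, written against modern client-go; the function name, the polling approach, and the timeout handling are assumptions for illustration, not the builder's actual waitForPod:

package builder

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPod polls the pod until it reaches a phase we can act on, treating
// PodSucceeded the same as PodRunning so a fast build that finishes before
// the first check does not leave the git push hanging.
func waitForPod(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// A 404 here may mean the pod already completed and was cleaned up.
			return err
		}
		switch pod.Status.Phase {
		case corev1.PodRunning, corev1.PodSucceeded:
			return nil
		case corev1.PodFailed:
			return fmt.Errorf("pod %s/%s failed", ns, name)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for pod %s/%s", ns, name)
}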


slack commented Feb 25, 2016

So I added a bunch more debugging information to a local builder. We've got a mess of possible state transitions for the builder pods.

I'm seeing many instances where the slugbuilder pod starts and goes Pending -> Running -> Terminated before we are able to get through waitForPod and all the associated machinery. In its final state, the pod simply appears as a 404, so we are unable to determine whether the pod never existed or existed and has since gone away.

I think the most sensible move going forward will be for builder to subscribe to all events in the deis namespace before launching the slugbuilder pod so that builder can see the "record of truth" from the stream. We should be able to see that the Pod came up, completed and exited. Any attempt to use the k8s API with respect to Pods directly will probably suffer from the Terminated == 404 behavior.

This most certainly leaves builder log streaming in an unfortunate position. Logs will most likely be "best effort" for beta even with event processing (and possibly for v2.0.0). Ultimately, we may need to move away from pod logs and have a small utility/API for the slug and docker builders to stream IO back to the builder component.

That should be more dependable and reliable than fetching logs from the k8s API.
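A rough sketch of that event subscription, again in modern client-go terms; the field selector and the check on the "Killing" reason are illustrative assumptions, not the builder's actual code. The important part is that the watch is opened before the slugbuilder pod is created:

package builder

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchBuildEvents subscribes to events in the deis namespace for a given pod
// name so every transition is captured, even if the pod terminates and 404s
// before we can GET it directly.
func watchBuildEvents(ctx context.Context, client kubernetes.Interface, podName string) error {
	watcher, err := client.CoreV1().Events("deis").Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("involvedObject.name", podName).String(),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	// ... create the slugbuilder pod here, only after the watch is established ...

	for ev := range watcher.ResultChan() {
		event, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		log.Printf("%s: %s %s", podName, event.Reason, event.Message)
		if event.Reason == "Killing" { // the kubelet is tearing the pod down
			return nil
		}
	}
	return nil
}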

arschles commented

I'm late to the party here (and on mobile), so apologies if I'm repeating anything. I have a prototype PR for watching the event stream to observe all of the pod's state transitions. That PR also begins streaming logs after the pod has been launched, but with 'previous' set to true, which AFAIK solves the log streaming issue. (Even though I'd love to build a log streaming system for builder pod -> builder, I'd like to try as hard as possible to stick to the k8s API here.)
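For reference, a minimal sketch of reading logs with the 'previous' option, in modern client-go; the namespace and container name are taken from the events above, and the helper itself is illustrative rather than the PR's actual code:

package builder

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// previousLogs reads the logs of the last terminated instance of the
// slugbuilder container, which is what makes output retrievable even after
// the pod has finished.
func previousLogs(ctx context.Context, client kubernetes.Interface, podName string) error {
	req := client.CoreV1().Pods("deis").GetLogs(podName, &corev1.PodLogOptions{
		Container: "deis-slugbuilder",
		Previous:  true, // read from the previous (terminated) container instance
	})
	stream, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	_, err = io.Copy(os.Stdout, stream)
	return err
}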

That PR is purely a prototype and I'm not even sure it builds right now, but I just wanted to throw out the info and the current state of affairs.

Given that we're talking about a race condition here (and have been for at least the past few weeks), we won't have solved the bug until we make a fundamental change such as this one.

Finally, there are a few other strategies we can take, which I'll throw out in a separate issue or comment when I have better access to the internets.



slack commented Feb 25, 2016

This should probably be a blocker and deserves the events/watch fix IMO.

Trying to stick to k8s APIs makes sense. As long as streaming with previous works for deleted pods, I'm 👍 for that approach.

mboersma commented

Still debugging this with @smothiki, and we've found that when things go wrong with slugbuilder, k8s kills the pod in such a way that its status goes directly from Pending to Terminating. So even watching the event stream might not work, since by the time it's Terminating there's little we can do.

$ kubectl --namespace=deis get po -w
NAME                                          READY     STATUS    RESTARTS   AGE
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending   0          0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Pending   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s
slugbuild-trendy-jetliner-7cfb303f-fddf5fed   0/1       Terminating   0         0s

smothiki commented

While the pod was in the Terminating state, we tried to get its status with kubectl. kubectl errored out saying the pod was not available, while -w still showed the pod status as Terminating. Also, this bug is not related only to builder; we're trying to debug from the Kubernetes API server and kubelet end, but haven't had much luck so far.
We did check the container of the terminating pod: it ran successfully and exited with exit status 0.


slack commented Feb 25, 2016

The namespace-wide events stream is what you want to look at. I think get po -w suffers from the same pod-oriented view:

I've always seen Started -> Pulled -> Created via kubectl get events -w --all-namespaces.

deis      59m       59m       1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Pulled    {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Successfully pulled image "quay.io/deisci/slugbuilder:v2-beta"
deis      59m       59m       1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Created   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Created with docker id afd4fa8599ae
deis      1h        1h        1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Started   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Started with docker id afd4fa8599ae
deis      1h        1h        1         slugbuild-convex-airfield-1b9813a5-449e9397   Pod       spec.containers{deis-slugbuilder}   Killing   {kubelet ip-10-0-0-98.us-west-2.compute.internal}   Killing with docker id afd4fa8599ae

In the case where a pod transitions to terminated and is GC'd before we get logs, we're hosed. Hence the recommendation to move log streaming into our application layer rather than relying on k8s.

mboersma commented

Here are the relevant events after a failed Dockerfile git push deis master on current v2:

$ kubectl get events -w --all-namespaces
...
NAMESPACE   FIRSTSEEN   LASTSEEN   COUNT     NAME           KIND                    SUBOBJECT   REASON             SOURCE                      MESSAGE
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod                 Scheduled   {scheduler }   Successfully assigned dockerbuild-oldest-yodeling-2e2054e7-135a237c to 192.168.64.2
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Pulled    {kubelet 192.168.64.2}   Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Created   {kubelet 192.168.64.2}   Created with docker id 9656c4dc735c
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Started   {kubelet 192.168.64.2}   Started with docker id 9656c4dc735c
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Pulling   {kubelet 192.168.64.2}   Pulling image "quay.io/deisci/dockerbuilder:v2-beta"
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Pulled    {kubelet 192.168.64.2}   Successfully pulled image "quay.io/deisci/dockerbuilder:v2-beta"
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Created   {kubelet 192.168.64.2}   Created with docker id 5b9e9f608634
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       spec.containers{deis-dockerbuilder}   Started   {kubelet 192.168.64.2}   Started with docker id 5b9e9f608634
deis      0s        0s        1         dockerbuild-oldest-yodeling-2e2054e7-135a237c   Pod       implicitly required container POD   Killing   {kubelet 192.168.64.2}   Killing with docker id 9656c4dc735c

smothiki commented

@mboersma what does 0s mean? Execution time?


slack commented Feb 26, 2016

"FIRST SEEN" and "LAST SEEN"

mboersma commented

Sorry--I pasted the header back in to hopefully make that output easier to decipher.

mboersma commented

This is fixed by #206. Yes, really.


arschles commented Mar 3, 2016

@gabrtv found a similar issue in #225, so I'm going to reopen this under RC1. It's unlikely that we'll be able to close this issue until we re-architect the way we stream logs. I've proposed a solution in #207.

@arschles arschles reopened this Mar 3, 2016
@arschles arschles modified the milestones: v2.0-rc1, v2.0-beta1 Mar 3, 2016
smothiki commented

@gabrtv @arschles are we still observing this?

arschles commented

@smothiki I'd like to keep this open as a showstopper for RC. I believe there is still a race in the code between launching the pod and observing the event.

smothiki commented

This (#185) should definitely solve the issue.

arschles commented

👍

I've noted that fact at the end of the description in #185

@arschles arschles added this to the v2.0-beta3 milestone Apr 15, 2016
@arschles arschles removed this from the v2.0-rc1 milestone Apr 15, 2016
arschles commented

promoting to beta3
