
Prebuild process gets stuck (when updating .gitpod.Dockerfile) #4856

Closed
shaal opened this issue Jul 17, 2021 · 14 comments · Fixed by #4875

Comments

shaal (Contributor) commented Jul 17, 2021

Bug description

When I update .gitpod.Dockerfile and push it to a branch of the repo, the prebuild seems to get stuck.
It takes over half an hour (I am not sure how long exactly, perhaps 1 hour?) during which all I see is this screen with the jumping "G", while the console doesn't display any errors.
(screenshot attached)

Coming back to the computer after a while, I see this screen instead (screenshot attached).
Here's the output I found in the console:
https://gist.github.com/shaal/b34b1e2ffba2d50deb4e42a3b4ed961c

I checked gitpod.io/workspaces and saw 5(?) workspaces running, even though I only opened 2 windows of the following URL, expecting to see the prebuild output: https://gitpod.io/#https://github.com/shaal/DrupalPod/tree/debug-with-tmate
(screenshot attached)

Steps to reproduce

In my experience, it is difficult to replicate or anticipate how prebuilds behave (especially with a custom Docker image build).
This also happened to me the other day while working on https://github.com/phase2/outline/tree/gitpod, when I opened an issue at https://community.gitpod.io/t/prebuild-is-stuck/4171.

I usually trigger a manual prebuild by putting #prebuild/ in the URL, but in repos that have the Gitpod bot I get a message that a prebuild is already running.

These workspaces are supposed to run prebuilds, but they are stuck. Once 4 workspaces are "running" I cannot open another one, yet at the same time I cannot stop a prebuild that is stuck like that.
So I usually wait a few hours until the workspaces are shut down; then I can run a prebuild using #prebuild/ in the URL, and it finally works.

Expected behavior

No response

Example repository

No response

Anything else?

No response

shaal (Contributor, Author) commented Jul 18, 2021

Update 1:

One of the workspaces (https://gitpod.io/start/#black-ferret-dd1kh8y8) that seemed to be stuck on the "Build an image" screen with empty output finally shows some error output.
The console says: /start: started workspace instance: d8a47bd4-07f0-4cfe-8ac9-3a43dac47463
(screenshot attached)
The actual output in the 'terminal preview':

Error: 13 INTERNAL: cannot resolve base image ref: Error response from daemon: manifest unknown: Failed to fetch "e737d8012fc85bce44573b7673879707ba3aa94a2dc313806e6a3e3edea8f455" from request "/v2/gitpod-dev/base-images/manifests/e737d8012fc85bce44573b7673879707ba3aa94a2dc313806e6a3e3edea8f455".
(the same error line repeats 11 times in total, with varying indentation)

shaal (Contributor, Author) commented Jul 18, 2021

Update 2:

Gitpod seems to keep restarting these prebuilds; you can see below that these are 4 new workspaces (screenshot attached).
In the meantime I cannot do anything: I cannot stop these workspaces from rebuilding, and I cannot work on other things, because my account is maxed out at 4 active workspaces.

shaal (Contributor, Author) commented Jul 18, 2021

Update 3:

After a few hours it was no longer stuck.
I opened a workspace with the updated repo and saw the "build image" page, now displaying terminal messages as I expected. (I was able to SSH into the image-build process, since I use tmate in there.)

shaal (Contributor, Author) commented Jul 18, 2021

Update 4:

I added ENV TRIGGER_REBUILD 1 to .gitpod.Dockerfile, and now I see this screen, which seems to be stuck again. No console errors, and nothing displayed in the "terminal" output (screenshot attached).

After 1-2 hours: (screenshot attached)

ghuntley (Contributor) commented:

https://gitpod.io/#https://github.com/ghuntley/learn-graphql/tree/gh/gitpodify

Confirming that I've also witnessed the following as of moments ago:

> now I see this screen, which seems to be stuck again. No console errors, and nothing displayed in the "terminal" output.

(screenshot attached)

ghuntley added the "feature: prebuilds" and "type: bug" labels Jul 19, 2021
ghuntley added this to Inbox in [DEPRECATED] Product Engineering Groundwork via automation Jul 19, 2021
ghuntley added the "priority: highest (user impact)" label Jul 19, 2021
geropl (Member) commented Jul 19, 2021

Ouch... this feels like a restart cycle because we miss some signal. It might have been introduced with the recent headless-log changes. 😕 Will dig into this shortly.

csweichel moved this from Inbox to Scheduled (limit: 25 WIP) in [DEPRECATED] Product Engineering Groundwork Jul 19, 2021
geropl self-assigned this Jul 19, 2021
geropl moved this from Scheduled (limit: 25 WIP) to In Progress in [DEPRECATED] Product Engineering Groundwork Jul 19, 2021
geropl (Member) commented Jul 19, 2021

Thx so much @shaal for this treasure trove of data.

It's a bit entangled, but it seems like there are multiple strange things (bugs?) happening here:

  • the root cause of the "cannot resolve base image ref" error: not sure where this comes from yet
  • we seem not to terminate prebuilds properly when they do not reach the Kubernetes control plane (to be verified!)
  • we start new workspaces in a loop on a (failed) prebuild

Will try to verify and fix one-by-one.

shaal (Contributor, Author) commented Jul 19, 2021

Thank you for the quick response, let me know if I can help testing.

geropl (Member) commented Jul 20, 2021

Ok, pinned it down:

  • the frontend does not properly end its retry loop for loading headless logs, but assumes we will navigate away once we have the first clear "prebuild done" signal. If there is an error during workspace start before we generate an ideUrl (e.g., during image builds), we can never navigate away from the prebuild logs view.
  • we do not handle the image-builder error properly: instead of failing and stopping the workspace, it runs into a timeout (1h). It seems the image build takes very long (longer than the default timeout of 1h), which, in combination with a race condition in image-builder and the frontend trying to start a lot of workspaces at once, results in these error messages.
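The frontend retry-loop bug described in the first bullet can be sketched as a polling predicate. This is a purely illustrative TypeScript sketch, not Gitpod's actual frontend code (the names `InstanceStatus`, `shouldKeepPollingBuggy`, and `shouldKeepPollingFixed` are hypothetical): the buggy variant only stops once an ideUrl appears, so a workspace that fails during the image build, and therefore never gets an ideUrl, keeps the log view retrying forever.

```typescript
// Hypothetical sketch of the retry-loop termination bug.
// None of these names come from Gitpod's codebase.
type Phase = "building" | "running" | "stopped";

interface InstanceStatus {
  phase: Phase;
  ideUrl?: string; // only set once the workspace has started successfully
}

// Buggy: keeps retrying until an ideUrl shows up. A prebuild whose
// image build fails never gets an ideUrl, so this never returns false.
function shouldKeepPollingBuggy(status: InstanceStatus): boolean {
  return status.ideUrl === undefined;
}

// Fixed: a terminal phase is also a stop signal, even without an ideUrl.
function shouldKeepPollingFixed(status: InstanceStatus): boolean {
  return status.ideUrl === undefined && status.phase !== "stopped";
}

const failedImageBuild: InstanceStatus = { phase: "stopped" };
console.log(shouldKeepPollingBuggy(failedImageBuild)); // true: retries forever
console.log(shouldKeepPollingFixed(failedImageBuild)); // false: loop ends
```

The fix in #4875 addresses this class of problem by treating a failed workspace start as a terminal signal for the log view, rather than waiting for a navigation that can never happen.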

Besides the first error, which we're going to fix in #4875, there are two issues left:

  • UX: Dead end when image-build fails during prebuild #4879: a UI/UX gap for exactly this situation (no ideUrl, but workspace start fails), which this PR will not attempt to fix because it needs a bit more discussion. I will open a new issue for that.
  • the race condition in image-builder, and (maybe) potential to improve the reporting of build errors. But as we're planning to migrate to a new service very soon, I bet we won't do anything about that for now.

shaal (Contributor, Author) commented Jul 20, 2021

@geropl Do you know why the image build takes so long? At least in my case, .gitpod.Dockerfile takes less than a minute to build.

> it seems the image build takes very long (longer than the default timeout of 1h)

geropl (Member) commented Jul 21, 2021

> @geropl Do you know why the image build takes so long? At least in my case, .gitpod.Dockerfile takes less than a minute to build.

Hm, not sure. Maybe my hypothesis (long build) is wrong, but it was the only way I could reproduce the error case.

It could be that there is another error within image-builder. Thinking about it, I always tested EU; it could be that this error manifested in US only. 🤔 Will try that instance with the contexts you shared. But if that's the case, it's likely to go away with the new image-builder as well, as the core process simplifies a lot.

shaal (Contributor, Author) commented Jul 22, 2021

I updated the .gitpod.Dockerfile, and then checked this URL
https://gitpod.io/#https://github.com/shaal/DrupalPod/blob/debug-with-tmate/.gitpod/.gitpod.Dockerfile

It still looks like the prebuild is stuck. I am assuming that's expected, because the fix will be done through the follow-up issues?

geropl (Member) commented Jul 27, 2021

@shaal I tried the context you linked above (https://github.com/shaal/DrupalPod/blob/debug-with-tmate/.gitpod/.gitpod.Dockerfile) and it hung for me as well, but during image build, not prebuild. And it did so in this RUN: https://github.com/shaal/DrupalPod/blob/debug-with-tmate/.gitpod/.gitpod.Dockerfile#L33-L36

top says:
(screenshot attached)

This feels unrelated to the original problem in this issue (multiple workspaces being started on broken image builds), which got fixed in the meantime, so I opt for closing this. Especially as I see you opened #4936, which looks like a follow-up.

Agreed?

shaal (Contributor, Author) commented Jul 27, 2021

@geropl Yeah, I think the original problem of this issue was resolved, and #4936 better explains the problem I'm seeing now.
Closing this issue. Thank you all!

shaal closed this as completed Jul 27, 2021
geropl moved this from In Progress to Done in [DEPRECATED] Product Engineering Groundwork Jul 27, 2021