
update images #311

Closed
wants to merge 1 commit into from

Conversation

edsantiago (Collaborator)

Signed-off-by: Ed Santiago <santiago@redhat.com>


github-actions bot commented Nov 6, 2023

Cirrus CI build successful. Found built image names and IDs:

Stage  Image Name                 IMAGE_SUFFIX
base   debian                     do-not-use
base   fedora                     do-not-use
base   fedora-aws                 do-not-use
base   fedora-aws-arm64           do-not-use
base   image-builder              do-not-use
base   prior-fedora               do-not-use
cache  build-push                 c20231106t160529z-f39f38d13
cache  debian                     c20231106t160529z-f39f38d13
cache  fedora                     c20231106t160529z-f39f38d13
cache  fedora-aws                 c20231106t160529z-f39f38d13
cache  fedora-netavark            c20231106t160529z-f39f38d13
cache  fedora-netavark-aws-arm64  c20231106t160529z-f39f38d13
cache  fedora-podman-aws-arm64    c20231106t160529z-f39f38d13
cache  fedora-podman-py           c20231106t160529z-f39f38d13
cache  prior-fedora               c20231106t160529z-f39f38d13
cache  rawhide                    c20231106t160529z-f39f38d13
cache  win-server-wsl             c20231106t160529z-f39f38d13

Requires emergency override of containers.conf SNAFU with zstd:chunked

containers/common#1730

Signed-off-by: Ed Santiago <santiago@redhat.com>
edsantiago (Collaborator, Author)

@cevich if you have a spare moment could you look at the fedora-aws Base Image failure please?

==> fedora-aws: Stopping the source instance...
    fedora-aws: Stopping instance
==> fedora-aws: Waiting for the instance to stop...
==> fedora-aws: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
==> fedora-aws: Provisioning step had errors: Running the cleanup provisioner, if present...
==> fedora-aws: Terminating the source AWS instance...
==> fedora-aws: ResourceNotReady: failed waiting for successful resource state

I can't find the string "Waiting for the instance to stop" anywhere in the likely source trees, so I have no idea what is running or what the bug is.
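
One thing that might narrow it down, assuming the message comes from the Amazon builder plugin rather than Packer core (the builders were split out into their own repositories a while back), is a grep of that plugin's tree; this is only a sketch:

    # Assumption: the message lives in the split-out Amazon plugin, not Packer core
    git clone https://github.com/hashicorp/packer-plugin-amazon
    grep -rn "Waiting for the instance to stop" packer-plugin-amazon/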

FWIW, the python-3.12 bug is a red herring; my last build threw the same error but worked anyway.

cevich (Member) commented Nov 10, 2023

I believe Urvashi sorted out the podman-py stuff. There's an actual bug in pylint, and she found a workaround.

The error you got is coming from Packer. I've seen similar things before; it looks like a flake to me. It probably orphaned a VM (we can worry about that later). I restarted the task and will keep an eye on it as I'm able today...

cevich (Member) commented Nov 10, 2023

...uggg. Amazon is having a bad day, re-running again...

edsantiago (Collaborator, Author)

It doesn't seem to be a flake. I restarted it four times yesterday.

cevich (Member) commented Nov 10, 2023

I don't think we've changed the Packer version recently, so it must be something on the Amazon side, perhaps triggering a bug in Packer.

In my last attempt, I found the line:
fedora-aws: Instance ID: i-0ac23dd69f36c7d41

I looked that instance up on the AWS EC2 console, and it shows the status as "terminated", which is correct.

I found a few other instances in a "stopped" state, which shouldn't happen.
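
For anyone else poking at the leftovers, the stray instances can also be listed from the command line; a minimal sketch using the standard EC2 state filter (region/profile flags omitted):

    # List instances sitting in the "stopped" state, with ID and launch time
    aws ec2 describe-instances \
        --filters Name=instance-state-name,Values=stopped \
        --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' \
        --output table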

If you'd like to try figuring out and bumping up the Packer timeout, that may get you past the hump. Otherwise we may need a newer Packer version (which may not accept our current cloud.yml).

cevich (Member) commented Nov 10, 2023

Looking again:

==> fedora-aws: Waiting for the instance to stop...
==> fedora-aws: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts

I bet Amazon changed some timings on their end, such that (for example) it tries an ACPI shutdown, waits, tries again, waits, then "yanks the plug". If the timings of any of that collide with what Packer is expecting, we'd get this problem.

It's highly likely there's a timeout setting for this, which probably needs to be added to the cloud.yml. Sometimes HashiCorp exposes these as a CLI option or via environment variables, but I doubt it in this case.
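
For the record, Packer's Amazon builders do document a pair of waiter environment variables, so a quick experiment could be to export them before the build; the values below are purely illustrative, and whether the failing step actually honors them is an assumption:

    # Illustrative values only: total wait is roughly max_attempts * delay_seconds
    export AWS_MAX_ATTEMPTS=90
    export AWS_POLL_DELAY_SECONDS=20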

cevich (Member) commented Nov 10, 2023

If we need to dig deeper, there are options here as well. AWS keeps a log of basically every API request per user, so it's pretty easy to see if and when a request came in. In this case, it does look like a StopInstances request is received, runs for ~10 minutes, and then there's a TerminateInstances call.
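
The per-request log mentioned here is presumably CloudTrail; a minimal sketch of pulling recent events for the instance from the failed run (instance ID taken from the log above):

    # Look up recent API events that touched the orphaned/terminated instance
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0ac23dd69f36c7d41 \
        --max-results 20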

edsantiago (Collaborator, Author) commented Nov 14, 2023

Closing in favor of #312. Hoping all these timeouts and errors go away.
