
New buildah/timeout/external container flakes #18631

Closed
edsantiago opened this issue May 19, 2023 · 20 comments
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago (Collaborator)

New weird flakes seen in the buildah treadmill. I'm filing them all under one issue, but they might be unrelated:

not ok 119 podman ps --external
...
# podman build -t thiswillneverexist -
STEP 1/3: FROM quay.io/libpod/testimage:20221018
STEP 2/3: RUN touch /intermediate.image.to.be.pruned
--> 0c116d0cdcf9
STEP 3/3: RUN sleep 30
timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
[ rc=124 (expected) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: podman build did not get killed within 10 seconds
#| expected: -le 10
#|   actual:     32

(I seem to recall something changing in buildah signal handling recently.)
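
A rough sketch of the failing check, reconstructed from the log above (the image name and timings come from the log; this is not the actual BATS test code):

# not the real test -- a sketch of the shape of the check
t0=$SECONDS
timeout 10 podman build -t thiswillneverexist - <<EOF
FROM quay.io/libpod/testimage:20221018
RUN touch /intermediate.image.to.be.pruned
RUN sleep 30
EOF
rc=$?                       # 124 = timeout(1) fired, which is expected
delta_t=$((SECONDS - t0))   # the flake: podman lingered ~32s after SIGTERM
if [ $rc -ne 124 ] || [ $delta_t -gt 10 ]; then
    echo "FAIL: podman build did not get killed within 10 seconds (rc=$rc, took ${delta_t}s)"
fi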

And:

not ok 248 podman diff with buildah container 
...
$ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
quay.io/libpod/testimage:20221018 f5a99120db64
buildahctr
$ podman diff --format json buildahctr
Error: buildahctr not found: layer not known
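
(The buildahctr container above is presumably created beforehand with buildah, something like the following; this step is assumed, not shown in the log:)

$ buildah from --name buildahctr quay.io/libpod/testimage:20221018
buildahctr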

And:

not ok 137 podman rm container from storage
...
$ podman container exists --external testimage-working-container
[ rc=1 (** EXPECTED 0 **) ]
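
(For reference, an "external" container here is one created outside podman, such as a buildah working container. A minimal sketch of what the test presumably exercises; the container name comes from the log, the exact steps are assumed:)

$ buildah from --name testimage-working-container quay.io/libpod/testimage:20221018
testimage-working-container
$ podman container exists --external testimage-working-container; echo rc=$?
rc=0        # the flake: this intermittently returned rc=1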

All of these are buildah-related. I haven't seen them in any other PRs, but this has been the lightest PR week in memory. If we don't see these in regular podman CI, it might be something in the new buildah vendoring.

  • debian-12 : sys podman debian-12 root host boltdb
  • debian-12 : sys podman debian-12 rootless host boltdb
    • 05-19 09:05 in [sys] podman rm container from storage
    • 05-19 09:05 in [sys] podman diff with buildah container
  • rawhide : sys podman rawhide rootless host sqlite
edsantiago added the flakes label May 19, 2023
@edsantiago (Collaborator, Author)

Flakes being seen in the wild, outside of the treadmill. Next guess: something to do with the buildah version in new VM images?

@edsantiago (Collaborator, Author)

Everything points to something broken in containers/storage.

@edsantiago (Collaborator, Author)

All the #13808 failures below are from the buildah treadmill PR.

All the #17831 failures below are from my hammer-CI PR, during a period this weekend when I had cherry-picked #18634, a test PR that included a new containers/storage. That is why I believe the problem is in containers/storage. Today I reverted that PR from my hammer-CI PR, and am no longer seeing those buildah failures.

@mtrmac (Collaborator) commented May 22, 2023

> All the #17831 failures below are from my hammer-CI PR, during a period this weekend when I had cherry-picked #18634, a test PR that included a new containers/storage.

Assuming that was github.com/mtrmac/storage v0.0.0-20230519165505-36c85d0b2b46, that means upstream c/storage between v1.46.1 and ceaf6ce2c54a0a2b184b831a732c722758979e81.

A short list of relevant PRs (skipping dependency updates, non-overlay changes, and entirely new unused features):

To me, it seems containers/storage#1549 is the highest-risk change in this area, but there are certainly other candidates.

@edsantiago (Collaborator, Author)

Uh-oh. This might be the same bug, *without* the new containers/storage. It's really hard to tell, because nobody has been submitting PRs these last few weeks.

# podman [options] rmi -f rmtest:1
Error: committing container for step {Env:[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin] Command:run Args:[touch rmtest:8] Flags:[] Attrs:map[] Message:RUN touch rmtest:8 Original:RUN touch rmtest:8}: copying layers and metadata for container "802e6990add8b793b1cb8bbef78ae6cff2d027aa57397fa913d3fdca49170be1": committing the finished image: creating image "cde9f59aba1cf8c2625eceef9009afd9225fa12b2cfacf7d8bc6a850d41eafad": layer not known

@mtrmac (Collaborator) commented May 22, 2023

That one (podman image rm running concurrently, with shared layers) is, I think at first glance, entirely unrelated.

It seems to me that the test is buggy — it claims to test concurrent removal, but it tests concurrent (build + removal); and if an image is removed while another is being built from the same layers, I can imagine that failing (a build doesn’t hold a race-free “reference” to a layer that would protect it against concurrent removals).

Actually, well, the lack of a race-free “reference” is arguably a bug, but it’s not the bug reportedly being tested by the test; #9266 only intended to make removes not fail against concurrent removals.
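
(A rough sketch of the interleaving being described, with hypothetical image names; not the actual test code:)

# Build an image while removing another image that shares its base layers.
# Nothing holds a race-free "reference" to the shared layer, so the build
# can lose the race and fail with "... layer not known":
podman build -t racer . &
build_pid=$!
podman rmi -f sibling-image    # hypothetical image sharing base layers
wait $build_pid                # may fail: committing the finished image: layer not known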

@edsantiago
Copy link
Collaborator Author

Thank you @mtrmac. I just panicked without looking closely.

@mtrmac (Collaborator) commented May 22, 2023

@edsantiago I do think that build&rm one is worth tracking and fixing.

@mtrmac (Collaborator) commented May 22, 2023

Filed #18659 for my concurrent rm theory.

@nalind (Member) commented May 22, 2023

Aside from containers/buildah#4813, I think containers/storage#1571 is causing podman (with a containers/storage that incorporates it) and the system default version of buildah (built with a containers/storage that didn't incorporate it) to default to different storage drivers, so they see different sets of layers and images.

There are multiple options for getting them to agree (the first is sketched below):
  • Set STORAGE_DRIVER in the environment to $($PODMAN $_PODMAN_TEST_OPTS info -f '{{.Store.GraphDriverName}}').
  • Create ~/.config/containers/storage.conf.
  • Incorporate the default that podman selects into the configuration file for CI.
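
A minimal sketch of that first option, assuming $PODMAN and $_PODMAN_TEST_OPTS are the test harness's variables as above:

# Ask podman which graph driver it selected, and export it so buildah
# (via containers/storage) resolves the same one:
export STORAGE_DRIVER=$($PODMAN $_PODMAN_TEST_OPTS info -f '{{.Store.GraphDriverName}}')
buildah info    # subsequent buildah calls should now report the same driver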

@flouthoc (Collaborator)

> To me, it seems containers/storage#1549 is the highest-risk change in this area, but there are certainly other candidates.

@mtrmac I was testing it with podman in #18224 and never saw any races or CI issues until I rebased recently and saw some debian failures (the debian failures looked unrelated to the PR). (But let me take a look again and see if I missed anything.)

@flouthoc (Collaborator)

Regarding the following tests:

[sys] podman rm container from storage
[sys] podman diff with buildah container

I think the buildah version in the VM images might very well be the issue.

> There are multiple options for getting them to agree. Set STORAGE_DRIVER in the environment to $($PODMAN $_PODMAN_TEST_OPTS info -f '{{.Store.GraphDriverName}}'). Create ~/.config/containers/storage.conf. Incorporate the default that podman selects into the configuration file for CI.

podman diff with buildah container fails for me when the storage driver is not set in storage.conf (though the error differs from what I see in CI), but both tests pass for me with the latest buildah binary and with storage configured correctly.
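
(For anyone reproducing this, a minimal way to pin the driver in storage.conf; "overlay" is an example value, use whatever podman reports:)

$ mkdir -p ~/.config/containers
$ cat > ~/.config/containers/storage.conf <<EOF
[storage]
driver = "overlay"
EOF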

@edsantiago (Collaborator, Author)

I'm having trouble following the thread, and I don't have a way to play with debian. But I just think it's important to point out that system tests do not (and must not) muck around with setting STORAGE_DRIVER or adding --this --that options. They are intended to test our tools as configured on a real system. If there is an incompatibility between podman and buildah, on a system that could conceivably look like a real-world one, my inclination would be to fix our tools so they handle this situation better.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Jun 23, 2023

@edsantiago update on this issue?

@flouthoc (Collaborator)

I think this is similar to what we saw in the treadmill, and it was recently fixed by a workaround for debian. But I think @edsantiago can confirm this.

@edsantiago (Collaborator, Author)

I do not understand the status of this bug. As best I can tell, #18822 changed the CI setup something something overlay/vfs/debian. Since then I have only run the buildah treadmill 2-3 times, because there have been very few buildah PRs. I have not seen these errors.

To me that means the bug has been swept under the carpet, not fixed. Opinions differ on this: some may believe the issue is fixed; some may believe that it's a difficult situation with no real good fix. I will step out of this discussion for now. If someone truly believes this is fixed, feel free to close this issue. If at all possible, please include a comment defending your actions.

@mtrmac (Collaborator) commented Jun 26, 2023

My very imprecise impression, just skimming the conversations and not checking the details myself, is that Nalin diagnosed this in #18631 (comment), and that @flouthoc fixed the c/storage behavior in containers/storage#1618 / containers/storage#1637.

Those c/storage changes are included in Podman since e5399aa.

@flouthoc PTAL and correct me if I’m wrong.

@flouthoc (Collaborator)

@mtrmac Yes. Regarding #18822: we discussed at the watercooler meeting that it is hard to ensure this compatibility between (new) podman <-> (old) buildah for older versions, hence we decided on the debian workaround in #18822. For newer versions this should be fixed via containers/storage#1618 / containers/storage#1637.

@edsantiago (Collaborator, Author)

#13808 is passing again. Closing.

github-actions bot added the locked - please file new issue/PR label Oct 9, 2023
github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2023