
standard_init_linux.go:203: exec user process caused \"no such file or directory\" #1844

Closed
cjeanner opened this issue Nov 21, 2018 · 63 comments

@cjeanner commented Nov 21, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

kind bug

Description

From time to time, we hit this uncommon issue while running a podman run command.

Steps to reproduce the issue:

  1. It's a random failure, no reproducer. yay.

Describe the results you received:
From time to time, it fails to run a container.

Describe the results you expected:
Well. It should run the container ;).

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:       0.11.1.1
Go Version:    go1.10.2
OS/Arch:       linux/amd64

Output of podman info:

host:
  BuildahVersion: 1.5-dev
  Conmon:
    package: podman-0.11.1.1-3.git594495d.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.12.0-dev, commit: ccde1bf8e093607ecf92b404368033f645cad755-dirty'
  Distribution:
    distribution: '"centos"'
    version: "7"
  MemFree: 874045440
  MemTotal: 8365137920
  OCIRuntime:
    package: runc-1.0.0-55.dev.git2abd837.el7.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 8557686784
  SwapTotal: 8588881920
  arch: amd64
  cpus: 8
  hostname: undercloud.localdomain
  kernel: 3.10.0-862.14.4.el7.x86_64
  os: linux
  rootless: false
  uptime: 59m 56.33s
insecure registries:
  registries:
  - 192.168.24.1:8787
  - 192.168.24.3:8787
registries:
  registries:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions:
  - overlay.override_kernel_check=true
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 19
  RunRoot: /var/run/containers/storage

Additional environment details (AWS, VirtualBox, physical, etc.):

@mheon (Collaborator) commented Nov 21, 2018 (comment minimized)

@EmilienM (Contributor) commented Nov 21, 2018

A few more details:

The container that causes the problem:
2018-11-19 21:42:01 | "Error running ['podman', 'run', '--name', 'nova_wait_for_db_sync', '--label', 'config_id=tripleo_step3', '--label', 'container_name=nova_wait_for_db_sync', '--label', 'managed_by=paunch', '--label', 'config_data={"command": "/docker-config-scripts/nova_wait_for_db_sync.py", "detach": false, "image": "docker.io/tripleomaster/centos-binary-nova-placement-api:3ed8ac0e93367a02ad53d9fa93467057724b6621_fd8eb74b", "net": "host", "privileged": false, "start_order": 1, "user": "root", "volumes": ["/var/lib/nova:/var/lib/nova:shared", "/var/lib/docker-config-scripts/:/docker-config-scripts/", "/var/lib/config-data/puppet-generated/nova_placement/etc/nova:/etc/nova:ro"]}', '--net=host', '--privileged=false', '--user=root', '--volume=/var/lib/nova:/var/lib/nova:shared', '--volume=/var/lib/docker-config-scripts/:/docker-config-scripts/', '--volume=/var/lib/config-data/puppet-generated/nova_placement/etc/nova:/etc/nova:ro', 'docker.io/tripleomaster/centos-binary-nova-placement-api:3ed8ac0e93367a02ad53d9fa93467057724b6621_fd8eb74b', '/docker-config-scripts/nova_wait_for_db_sync.py']. [1]",
2018-11-19 21:42:01 | "stderr: standard_init_linux.go:203: exec user process caused "no such file or directory"",

Logstash query for this error: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22exec%20user%20process%20caused%5C%22%20AND%20message%3A%5C%22no%20such%20file%20or%20directory%5C%22 (it shows we have been hitting this error since November 19th).

@giuseppe (Member) commented Nov 21, 2018

Who is creating the file `/docker-config-scripts/nova_wait_for_db_sync.py`? Is it static, or can it change while the container is being created/running?

@cjeanner (comment minimized)

@stuggi commented Nov 21, 2018

Who is creating the file `/docker-config-scripts/nova_wait_for_db_sync.py`? Is it static, or can it change while the container is being created/running?

nova_wait_for_db_sync.py is a static script that is part of the bind-mounted volume '--volume=/var/lib/docker-config-scripts/:/docker-config-scripts/'.

The config scripts get written to the node before they are used:

2018-11-19 21:31:07 | TASK [Write docker config scripts] *********************************************
2018-11-19 21:31:08 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=create_swift_secret.sh)
2018-11-19 21:31:08 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=docker_puppet_apply.sh)
2018-11-19 21:31:09 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=neutron_ovs_agent_launcher.sh)
2018-11-19 21:31:09 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=nova_api_discover_hosts.sh)
2018-11-19 21:31:10 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=nova_api_ensure_default_cell.sh)
2018-11-19 21:31:10 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=nova_statedir_ownership.py)
2018-11-19 21:31:10 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=nova_wait_for_db_sync.py)
2018-11-19 21:31:11 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=pyshim.sh)
2018-11-19 21:31:11 | changed: [fedora-28-vexxhost-sjc1-0000573726] => (item=set_swift_keymaster_key_id.sh)

2018-11-19 21:42:01 | "Running container: nova_wait_for_db_sync",
2018-11-19 21:42:01 | "$ podman run --name nova_wait_for_db_sync --label config_id=tripleo_step3 --label container_name=nova_wait_for_db_sync --label managed_by=paunch --label config_data={"command": "/docker-config-scripts/nova_wait_for_db_sync.py", "detach": false, "image": "docker.io/tripleomaster/centos-binary-nova-placement-api:3ed8ac0e93367a02ad53d9fa93467057724b6621_fd8eb74b", "net": "host", "privileged": false, "start_order": 1, "user": "root", "volumes": ["/var/lib/nova:/var/lib/nova:shared", "/var/lib/docker-config-scripts/:/docker-config-scripts/", "/var/lib/config-data/puppet-generated/nova_placement/etc/nova:/etc/nova:ro"]} --net=host --privileged=false --user=root --volume=/var/lib/nova:/var/lib/nova:shared --volume=/var/lib/docker-config-scripts/:/docker-config-scripts/ --volume=/var/lib/config-data/puppet-generated/nova_placement/etc/nova:/etc/nova:ro docker.io/tripleomaster/centos-binary-nova-placement-api:3ed8ac0e93367a02ad53d9fa93467057724b6621_fd8eb74b /docker-config-scripts/nova_wait_for_db_sync.py",

@rhatdan (Member) commented Nov 21, 2018

Could things be unmounted before runc runs? If there were a race between podman run and podman umount, could this happen?

@mheon (Collaborator) commented Nov 21, 2018

@rhatdan The code to prevent unmount while c.state.Mounted is set should prevent that from happening - let me verify

@EmilienM (Contributor) commented Nov 22, 2018

@giuseppe pointed out to me that it also fails the same way with Docker:
http://logs.openstack.org/98/619598/3/check/tripleo-ci-fedora-28-standalone-docker/b3b6940/logs/undercloud/var/log/extra/errors.txt

So at this point I wouldn't spend much time on it; it's probably something on our side and we'll keep looking.

@giuseppe (Member) commented Nov 23, 2018

I am going to close this issue for now, since it happens also with Docker. Most likely the problem is somewhere else.

Let's reopen it if it needs more investigation in podman.

@EmilienM (Contributor) commented Dec 19, 2018

@giuseppe please re-open, we can still see this one with podman 0.12:

standard_init_linux.go:203: exec user process caused \"no such file or directory\"",

http://logs.openstack.org/03/625903/2/gate/tripleo-ci-centos-7-containers-multinode/23f5730/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-12-19_20_05_52

@EmilienM (Contributor) commented Dec 19, 2018 (comment minimized)

@rhatdan reopened this Dec 19, 2018

@giuseppe (Member) commented Dec 19, 2018

There were no fixes that went into Podman to address this potential problem, AFAIK. But since it also happens with Docker, why do you think it is in Podman? From what I can see, that error happens if the interpreter for the script you use as entrypoint doesn't exist.

@EmilienM (Contributor) commented Dec 19, 2018 (comment minimized)

openstack-gerrit pushed a commit to openstack/tripleo-quickstart that referenced this issue Dec 22, 2018

temporarily turn off podman
We're hitting several issues in the upstream
ci related to podman :(

https://bugs.launchpad.net/tripleo/+bug/1804434
containers/libpod#1844

Related-Bug: #1804434
Related-Bug: #1809218
Change-Id: I19aa04382ba159768a1748d44412bbc670facaf3

@EmilienM (Contributor) commented Jan 2, 2019

We still have the issue at this time, looking at our Kibana (logstash-podman screenshot):

podman version:

Version:       0.12.1.2
Go Version:    go1.10.2
OS/Arch:       linux/amd64

podman info:

host:
  BuildahVersion: 1.6-dev
  Conmon:
    package: podman-0.12.1.2-2.git9551f6b.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.12.0-dev, commit: 97d5f1232f62307696c8b4b2d430e8b771a45873-dirty'
  Distribution:
    distribution: '"centos"'
    version: "7"
  MemFree: 1249722368
  MemTotal: 8369278976
  OCIRuntime:
    package: runc-1.0.0-57.dev.git2abd837.el7.centos.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 8589930496
  SwapTotal: 8589930496
  arch: amd64
  cpus: 8
  hostname: centos-7-inap-mtl01-0001426023
  kernel: 3.10.0-957.1.3.el7.x86_64
  os: linux
  rootless: false
  uptime: 1h 29m 13.69s (Approximately 0.04 days)
insecure registries:
  registries:
  - 192.168.24.1:8787
registries:
  registries:
  - registry.centos.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 14
  RunRoot: /var/run/containers/storage

We could only reproduce the error with Podman, not with Docker. I'll comment if I have further findings.

@mheon (Collaborator) commented Jan 2, 2019

This is definitely a runc error, telling us that the entrypoint binary for the container doesn't exist.

If it works on Docker, it's most likely down to how we assemble the container entrypoint; perhaps we're doing something differently that results in us generating a different path than Docker?

@mheon (Collaborator) commented Jan 2, 2019

It's always docker-puppet.py.

I think we need to figure out where that lives in the image, and figure out why Podman isn't finding it properly.

@EmilienM (Contributor) commented Jan 2, 2019

@mheon
The container options are:
--volume /var/lib/docker-puppet/docker-puppet.sh:/var/lib/docker-puppet/docker-puppet.sh:rw,z --security-opt label=disable --entrypoint /var/lib/docker-puppet/docker-puppet.sh

I suspect it's /var/lib/docker-puppet/docker-puppet.sh which contains this code:
https://github.com/openstack/tripleo-heat-templates/blob/beb8ae7dd53c7b36313956c4ea847f487199468e/docker/docker-puppet.py#L260-L371

There is something in that snippet that doesn't run fine, I think. I'm digging.

@EmilienM (Contributor) commented Jan 2, 2019

Note to self: since docker-puppet.py generates docker-puppet.sh, I wonder if we have a race here where docker-puppet.sh is sometimes executed before we create it.

@EmilienM (Contributor) commented Jan 2, 2019

@mheon while I keep debugging on my side, how could we find out which file or directory isn't found by runc or podman? I wish I could strace it, but it happens randomly in CI and I can't reproduce it for now.

@EmilienM (Contributor) commented Jan 2, 2019

My first feeling is that there is a race when your entrypoint is bind-mounted from the host.

@EmilienM (Contributor) commented Jan 2, 2019

FTR #2068 can help with logging.

@cjeanner (Author) commented Jan 3, 2019

@EmilienM hmmm, I don't think so. According to the runc code involved, if it were due to a missing entrypoint, it would fail earlier, somewhere around line 174 instead of the current line 203.

Line 203 seems to point to some seccomp issue instead... deeper in runc/podman?

@cjeanner (Author) commented Jan 3, 2019

Small steps:
When the --entrypoint script is missing inside the container, we get a different error:

podman run --volume $(pwd)/entry.sh:/entry.s --entrypoint /entry.sh alpine
container create failed: container_linux.go:344: starting container process caused "exec: \"/entry.sh\": stat /entry.sh: no such file or directory"
: internal libpod error

If the file doesn't exist on the host, podman itself returns an error:

podman run --volume $(pwd)/entry.s:/entry.sh --entrypoint /entry.sh alpine
error checking path "/root/entry.s": stat /root/entry.s: no such file or directory

So it's really, really strange. And line 203 involves the "seccomp init" message, NOT "exec user process" :(.

@mheon (Collaborator) commented Jan 3, 2019

This is definitely the syscall.Exec() at the very end of the Linux init code.

Per the execve manpage, ENOENT means: "The file filename or a script or ELF interpreter does not exist, or a shared library needed for file or interpreter cannot be found."

Could we be missing a library necessary to run the script?
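
A quick way to see this failure mode outside of containers is a script whose shebang interpreter doesn't exist; a minimal sketch (throwaway paths, purely illustrative, not from the issue):

# Hypothetical demo: execve() fails with ENOENT when the interpreter named in
# the shebang (or a required shared library) is missing, even though the
# script file itself exists; this mirrors the final exec in runc's init.
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/no/such/interpreter\necho hello\n")  # interpreter does not exist
    script = f.name
os.chmod(script, 0o755)

try:
    os.execv(script, [script])  # the same underlying execve() runc performs
except FileNotFoundError as err:
    # The kernel returns ENOENT, surfaced as "no such file or directory"
    # even though the script is present on disk.
    print("exec failed:", err)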

@cevich (Collaborator) commented Jan 3, 2019

Couldn't this be related at all to the occasional failure we see from the Python tests, where the cgroup mount isn't there (for some/whatever reason)?

@giuseppe (Member) commented Jan 4, 2019

@EmilienM could you add a debugging message showing what common_dcmd looks like? I cannot find it in the log you've mentioned.

Could it be that some additional volume hides /var/lib/docker-puppet/docker-puppet.sh (e.g. a new volume for /var/lib)? In any case, I'd suggest putting the entrypoint bind mount last, so we can be sure it is accessible.

@mheon (Collaborator) commented Jan 10, 2019

Hm. I don't think we do any checks/locking on volumes.

Possible solution: mount the container, copy the bash script in, and then run it?

@cjeanner (Author) commented Jan 11, 2019

@mheon uhu, I was thinking about such a solution. As a last resort though.

IIRC, there isn't any "podman copy"; instead you do something like:

ct_mount=$(podman mount <container>)
cp /foo ${ct_mount}/foo
podman umount <container>

right? Or is there now some internal wrapper for that? Care to list the steps just to be sure? :)

@EmilienM (Contributor) commented Jan 11, 2019

@mheon We'll try your solution, @cjeanner is working on a patch. In the meantime, I'm proposing to disable concurrency on that script, which has proven to help after 3 days of heavy testing in our CI: https://review.openstack.org/629546

@bogdando commented Jan 11, 2019

@mheon why would we want any checks/locking on volumes mounted :ro? I mean, what does disabling concurrency buy us if the volumes are mounted read-only?

@mheon (Collaborator) commented Jan 11, 2019

@bogdando I don't think we would want such a thing - so I strongly suspect that any issues with volume contents probably aren't caused by Podman itself (we just make the bind mounts).

@cjeanner Yeah, no podman copy right now (I'd like to eventually get such a thing written, it wouldn't be hard to mimic those steps under the covers)

@EmilienM (Contributor) commented Jan 11, 2019

@cjeanner another option is to retry the exec and keep concurrency. I'll try that out as well.

@EmilienM (Contributor) commented Jan 11, 2019 (comment minimized)

@giuseppe (Member) commented Jan 11, 2019

Could you use temporary files that are different for each run? Then it should not matter how many containers run in parallel

@EmilienM (Contributor) commented Jan 11, 2019 (comment minimized)

@EmilienM (Contributor) commented Jan 14, 2019

Before I go write the code, here are 2 options (at least):

  • Create docker-puppet-.sh files in /var/lib/docker (we would end up with ~30 files that have the same content). Not a HUGE deal, but still ugly from an engineering point of view.
  • Create a tempfile, which should be fine since the entrypoint is only used while the containers are being configured (i.e. when puppet runs inside the container during initial deployments or upgrades); after a host reboot the files are gone, but since the entrypoint isn't used at that point, that's OK.

I'll try option 2 (rough sketch below).
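
For reference, a rough sketch of what option 2 could look like (illustrative names and paths only, not the actual TripleO patch):

# Sketch of "option 2": give each podman run its own temporary copy of the
# entrypoint, so concurrent runs never share (or race on) a single file.
# SHARED_ENTRYPOINT and the podman arguments are illustrative assumptions.
import os
import shutil
import subprocess
import tempfile

SHARED_ENTRYPOINT = "/var/lib/docker-puppet/docker-puppet.sh"

def run_config_container(name, image):
    # Copy the shared script into a unique temp file for this run.
    fd, entrypoint = tempfile.mkstemp(prefix="docker-puppet-", suffix=".sh")
    os.close(fd)
    shutil.copy2(SHARED_ENTRYPOINT, entrypoint)
    os.chmod(entrypoint, 0o755)

    return subprocess.call([
        "podman", "run", "--name", name,
        "--volume", f"{entrypoint}:{entrypoint}:ro",
        "--entrypoint", entrypoint,
        image,
    ])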

openstack-gerrit pushed a commit to openstack/tripleo-heat-templates that referenced this issue Jan 16, 2019

docker-puppet: retry container run command
Context: containers/libpod#1844
We have concurrency issue when podman is enabled, where
the bind-mounted entrypoint can't be found.

This patch will retry the podman run commands 3 times before declaring
a failure.
Also, everytime it fails we'll log the number of attempts to configure
the container. So we can track these numbers in CI.

I'll allow us to keep doing concurrent calls, but with less chance
to fail with the issue #1844.

Note: we hate this patch and we hope to revert it soon. But now it's how
we'll reduce issues in CI.

Change-Id: I6af89bf54e562e7c6bbcdb82041a7274789dcf28
Related-Bug: #1811383

openstack-gerrit added a commit to openstack/openstack that referenced this issue Jan 16, 2019

Update git submodules
* Update tripleo-heat-templates from branch 'master'
  - Merge "docker-puppet: retry container run command"
  - docker-puppet: retry container run command (same commit message as above)

    Change-Id: I6af89bf54e562e7c6bbcdb82041a7274789dcf28
    Related-Bug: #1811383

@EmilienM (Contributor) commented Jan 16, 2019

@rhatdan @mheon @giuseppe

Until we find out more about this bug and whether the root cause is in libpod, we decided to go with an ugly workaround in our deployment, where we simply retry the podman run command 3 times during the configuration steps.

Adding the retry will:

  • greatly reduce the chances of hitting this bug in our CI
  • increase deployment time when the bug is hit and the retries kick in
  • log the retries and make them visible in logstash, so we'll have more metrics on how often we still hit the bug.

@cjeanner is also working on a python implementation of "podman cp" in https://review.openstack.org/#/c/630287/ - it would be great to have this native feature in podman directly, as we are dealing with multi-threads etc.

Let's keep this issue open, as there is a bug somewhere and we seem to be alone in hitting it so far.
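
For the record, the workaround is essentially a retry loop around the podman run call; a simplified sketch (not the actual tripleo-heat-templates patch, command details are illustrative):

# Simplified sketch of the retry workaround: run `podman run ...` up to three
# times, logging each failed attempt so retries stay visible in the CI logs.
import logging
import subprocess
import time

def podman_run_with_retry(cmd, attempts=3, delay=3):
    rc = 1
    for attempt in range(1, attempts + 1):
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return 0
        logging.warning("podman run failed (attempt %d/%d)", attempt, attempts)
        if attempt < attempts:
            time.sleep(delay)
    return rc

# Hypothetical usage:
# podman_run_with_retry(["podman", "run", "--rm", "--entrypoint",
#                        "/var/lib/docker-puppet/docker-puppet.sh", "some-image"])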

@ssbarnea commented Jan 16, 2019

The retry approach is acceptable and needed, but we should also create an elastic-recheck query to measure how often this happens and make sure it gets fixed.

Look at https://review.openstack.org/#/c/619059/ and raise a similar one for this bug; add me as a reviewer. This will create a nice graph on http://status.openstack.org/elastic-recheck/

@giuseppe (Member) commented Jan 16, 2019

@cjeanner is also working on a python implementation of "podman cp" in https://review.openstack.org/#/c/630287/ - it would be great to have this native feature in podman directly, as we are dealing with multi-threads etc.

There is already a PR to add cp to podman: https://github.com/containers/libpod/pull/2151

openstack-gerrit added a commit to openstack/openstack that referenced this issue Jan 16, 2019

Update git submodules
* Update tripleo-heat-templates from branch 'master'
  - Merge "Reuse the container in case we have a temporary podman failure"
  - Reuse the container in case we have a temporary podman failure
    
    The "retry" patch[1] didn't take care of the existing container. This patch
    intends to allow to reuse the container in case it has failed, in order to
    avoid an error when the container is already existing.
    
    [1] https://review.openstack.org/#/c/614639/
    
    Change-Id: I5c7258c8687582f56b59ed410c0cc8f6ba4c2d4f
    Context: containers/libpod#1844
    Related-Bug: #1811383

openstack-gerrit pushed a commit to openstack/tripleo-heat-templates that referenced this issue Jan 16, 2019

Reuse the container in case we have a temporary podman failure
(same commit message as the submodule update above)

Change-Id: I5c7258c8687582f56b59ed410c0cc8f6ba4c2d4f
Context: containers/libpod#1844
Related-Bug: #1811383

@cjeanner (Author) commented Jan 17, 2019

Hello there,

Sooooo. I now have a patch that pushes the entrypoint script into the container before starting it.
Basically, it does the following (rough sketch after the list):

  • creates the container with all the mounts, options and so on
  • mounts that container, copies content, umounts it
  • starts the container
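
A hedged sketch of that sequence (container name, image and paths are illustrative, not the exact patch):

# Sketch of the create / copy / start sequence described above.
import os
import shutil
import subprocess

def create_push_start(name, image,
                      entrypoint="/var/lib/docker-puppet/docker-puppet.sh"):
    # 1. Create the container (not yet started) with all mounts and options.
    subprocess.run(["podman", "create", "--name", name,
                    "--entrypoint", entrypoint, image], check=True)

    # 2. Mount the container's root filesystem and copy the script into it.
    merged = subprocess.run(["podman", "mount", name], check=True,
                            capture_output=True, text=True).stdout.strip()
    dest = merged + entrypoint
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(entrypoint, dest)
    subprocess.run(["podman", "umount", name], check=True)

    # 3. Start the container; the entrypoint should now exist inside it.
    subprocess.run(["podman", "start", "-a", name], check=True)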

And, well, it still fails randomly because it "doesn't find the script".

I've added a lot of debug, and here's what I get:

2019-01-17 08:58:30 |         "2019-01-17 08:56:48,287 INFO: 24656 -- podman: Creating container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "1995372dae0aba414626cd6d79309fc80d189c69a5cc7276b6012b6d6ded66f4",
2019-01-17 08:58:30 |         "2019-01-17 08:56:48,619 INFO: 24657 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-mistral",
2019-01-17 08:58:30 |         "2019-01-17 08:56:49,149 INFO: 24657 -- docker-puppet-mistral: cp /var/lib/docker-puppet/docker-puppet.sh /var/lib/containers/storage/overlay/ee7b8a4c83e204b02e34db795e1ed7f531c879df528af486056c87c55aa86eb8/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:49,162 INFO: 24657 -- Content of /var/lib/containers/storage/overlay/ee7b8a4c83e204b02e34db795e1ed7f531c879df528af486056c87c55aa86eb8/merged/var/lib/docker-puppet for docker-puppet-mistral: docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:49,167 INFO: 24657 -- Entrypoint checksum for docker-puppet-mistral: 41171bd3e13723c3586b3957418c0bb8  /var/lib/containers/storage/overlay/ee7b8a4c83e204b02e34db795e1ed7f531c879df528af486056c87c55aa86eb8/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "c1a8039f20b7e915c4f7da9ac77dd1884fb8bc78c4e4ee35f65a15d2d713fc79",
2019-01-17 08:58:30 |         "2019-01-17 08:56:49,220 INFO: 24656 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,096 INFO: 24656 -- docker-puppet-zaqar: cp /var/lib/docker-puppet/docker-puppet.sh /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,114 INFO: 24656 -- Content of /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet for docker-puppet-zaqar: docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,123 INFO: 24656 -- Entrypoint checksum for docker-puppet-zaqar: 41171bd3e13723c3586b3957418c0bb8  /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,320 INFO: 24657 -- Container diff for docker-puppet-mistral: C /var/lib",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,320 INFO: 24657 -- podman: Start container docker-puppet-mistral",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,619 INFO: 24656 -- Container diff for docker-puppet-zaqar: ",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,619 INFO: 24656 -- podman: Start container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:56:58,019 INFO: 24658 -- podman: Creating container docker-puppet-iscsid",
2019-01-17 08:58:30 |         "2019-01-17 08:56:58,134 INFO: 24655 -- podman: Creating container docker-puppet-ironic_inspector",
2019-01-17 08:58:30 |         "2019-01-17 08:57:05,049 INFO: 24654 -- podman: Creating container docker-puppet-nova",
2019-01-17 08:58:30 |         "2019-01-17 08:57:06,662 INFO: 24653 -- podman: Creating container docker-puppet-glance_api",
2019-01-17 08:58:30 |         "15e9c9f7f07ae50979d90ae8bc99a747157278dbba5d1fcf87983acced3896b8",
2019-01-17 08:58:30 |         "2019-01-17 08:57:06,678 INFO: 24658 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-iscsid",
2019-01-17 08:58:30 |         "bf38fa325b1a7e343623f0d445ea8e1a4d2763f63545cfd2074f6b8540aa68dd",
2019-01-17 08:58:30 |         "2019-01-17 08:57:06,871 INFO: 24655 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-ironic_inspector",
2019-01-17 08:58:30 |         "ae2c2a1968fca547c26ddbbea8746f53e0fb573b2e90204a9355918519a481ad",
2019-01-17 08:58:30 |         "2019-01-17 08:57:06,971 INFO: 24654 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-nova",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,553 INFO: 24655 -- docker-puppet-ironic_inspector: cp /var/lib/docker-puppet/docker-puppet.sh /var/lib/containers/storage/overlay/3845785a6e67fdc80603256a99e7526cac86d7f04f41ffa577e988662a11d931/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,553 INFO: 24658 -- docker-puppet-iscsid: cp /var/lib/docker-puppet/docker-puppet.sh /var/lib/containers/storage/overlay/dfe732ae6fed26d8fe6ac1b994093ffee688df80ffaad51dc2102e44125b80e0/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,555 ERROR: 24656 -- Failed running docker-puppet.py for zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,555 ERROR: 24656 -- unable to start container c1a8039f20b7e915c4f7da9ac77dd1884fb8bc78c4e4ee35f65a15d2d713fc79: container create failed: container_linux.go:336: starting container process caused \"exec: \\\"/var/lib/docker-puppet/docker-puppet.sh\\\": stat /var/lib/docker-puppet/docker-puppet.sh: no such file or directory\"",
2019-01-17 08:58:30 |         ": internal libpod error",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,555 INFO: 24656 -- Finished processing puppet configs for zaqar",

If we take only the info for the failed container:

2019-01-17 08:58:30 |         "2019-01-17 08:56:48,287 INFO: 24656 -- podman: Creating container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:56:49,220 INFO: 24656 -- podman: Pushing /var/lib/docker-puppet/docker-puppet.sh in container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,096 INFO: 24656 -- docker-puppet-zaqar: cp /var/lib/docker-puppet/docker-puppet.sh /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,114 INFO: 24656 -- Content of /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet for docker-puppet-zaqar: docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,123 INFO: 24656 -- Entrypoint checksum for docker-puppet-zaqar: 41171bd3e13723c3586b3957418c0bb8  /var/lib/containers/storage/overlay/18b1f91747dcf5d44856a8521ec1bf1bca3f450c6ebfb526a8503ecf5aeeaba3/merged/var/lib/docker-puppet/docker-puppet.sh",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,619 INFO: 24656 -- Container diff for docker-puppet-zaqar: ",
2019-01-17 08:58:30 |         "2019-01-17 08:56:50,619 INFO: 24656 -- podman: Start container docker-puppet-zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,555 ERROR: 24656 -- Failed running docker-puppet.py for zaqar",
2019-01-17 08:58:30 |         "2019-01-17 08:57:07,555 INFO: 24656 -- Finished processing puppet configs for zaqar",

And what's interesting is this:
"2019-01-17 08:56:50,619 INFO: 24656 -- Container diff for docker-puppet-zaqar: "

Here, we should see a "C /var/lib" after the colon (yeah, apparently I have a small issue with the logging; it doesn't show the full diff).

So, apparently, the file isn't properly detected in the container, even though we can actually get its md5sum from the mounted container....

Any idea what's going on?

I'll try to rework the output of "podman diff" a bit in order to get its full content in our CI. But that's an interesting data point.

Also, we're still running 6 instances of those commands in parallel. Maybe there's a small issue in podman's layer management?

Cheers,

C.

@cjeanner (Author) commented Jan 17, 2019

Also taking the opportunity to note the podman version - yes, it also happens with the latest 1.0 release:

+ podman version
Version:       1.0.0
Go Version:    go1.10.2
OS/Arch:       linux/amd64


+ podman info
host:
  BuildahVersion: 1.6-dev
  Conmon:
    package: podman-1.0.0-1.git82e8011.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.14.0-dev, commit: 635328ab9b2058e40335533f35d595604666f1e4-dirty'
  Distribution:
    distribution: '"centos"'
    version: "7"
  MemFree: 867885056
  MemTotal: 8369270784
  OCIRuntime:
    package: runc-1.0.0-57.dev.git2abd837.el7.centos.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 8579125248
  SwapTotal: 8589930496
  arch: amd64
  cpus: 8
  hostname: undercloud.localdomain
  kernel: 3.10.0-957.1.3.el7.x86_64
  os: linux
  rootless: false
  uptime: 42m 41.68s
insecure registries:
  registries:
  - 192.168.24.1:8787
  - 192.168.24.3:8787
registries:
  registries:
  - registry.centos.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 19
  RunRoot: /var/run/containers/storage

@mheon (Collaborator) commented Jan 17, 2019

Alright. So we can confirm that it's being successfully copied into the container, that its MD5 sum is good, and that podman diff shows that we made changes to the directory it exists in, yet stat is telling us it doesn't exist.

@cjeanner To confirm, this is SELinux non-enforcing, right?

@rhatdan @giuseppe Any ideas?

@cjeanner (Author) commented Jan 17, 2019

@mheon it's on the CI, so it's set to "permissive", indeed. I would point to some "commit" part in the container management.

@cjeanner (Author) commented Jan 17, 2019

Some more info about the system we're running this on, where we're encountering the issue:
System:
CentOS Linux release 7.6.1810 (Core)
CPU:
8x Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)
Installed kernel:
kernel-3.10.0-957.1.3.el7.x86_64

@giuseppe (Member) commented Jan 17, 2019

Can you please show the latest version of the script?

@cjeanner (comment minimized)

openstack-gerrit added a commit to openstack/openstack that referenced this issue Jan 31, 2019

Update git submodules
* Update ansible-role-tripleo-modify-image from branch 'master'
  - Merge "yum_update_buildah: add retry x3"
  - yum_update_buildah: add retry x3
    
    The "buildah run" randomly fails on centos7 kernel, with:
    standard_init_linux.go:203: exec user process caused "no such file or directory"
    
    We think it's related to:
    containers/libpod#1844
    
    To workaround this issue, we'll retry 3 times with a delay of 3 seconds
    between each "buildah run" command which would fail to produce an exit
    code of 0.
    
    Change-Id: Ic50fd359c9bf50a6e0247d7743b26191d2f3dcb5

openstack-gerrit pushed a commit to openstack/ansible-role-tripleo-modify-image that referenced this issue Jan 31, 2019

yum_update_buildah: add retry x3
(same commit message as the submodule update above)

Change-Id: Ic50fd359c9bf50a6e0247d7743b26191d2f3dcb5

@rhatdan (Member) commented Mar 8, 2019

@giuseppe Any progress on this issue?

@giuseppe (Member) commented Mar 8, 2019

@rhatdan I still believe the problem is not in Podman but in how the script is created/deleted outside of the container.

We haven't heard anything back for a while and I've never seen this issue anywhere else, so let's close this issue for now; if there is any update we can re-open it.

@giuseppe closed this Mar 8, 2019
