
Run docker-compose down before build? #228

Closed
ianwremmel opened this issue Jul 9, 2019 · 12 comments

@ianwremmel

I'm getting errors like the following, but I only run one agent per host. I'm pretty sure a prior build got cancelled and didn't properly clean up the running containers. Can an option be added to kill any containers from previous jobs? Or is there a more robust way to force cleanup to happen after the build completes?

ERROR: for buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1  Cannot start service redis: driver failed programming external connectivity on endpoint buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 (d35a813f99837fb20d793bcfe8ba884c04c9e24eb8725f1791ecf914afdba253): Bind for 0.0.0.0:6379 failed: port is already allocated
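
As a possible workaround (a sketch, not an option the plugin currently provides): an agent-level pre-command hook that force-removes anything left behind by a cancelled build before the next job starts. The hook path below is the stock agent hooks directory, and the name filter just matches the buildkite-prefixed container names in the error above; this is only safe because there's one agent per host.

#!/bin/bash
# /etc/buildkite-agent/hooks/pre-command (path assumed; adjust for your install)
# Best-effort removal of containers left over from a previous, cancelled build.
# The docker-compose plugin names containers with a "buildkite<build-id>" prefix,
# so filtering on "buildkite" catches them. Only safe with a single agent per host.
set -euo pipefail

docker ps -aq --filter "name=buildkite" | xargs -r docker rm -f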
@ianwremmel

Looking at the log from one of the cancelled builds, it looks like it's terminating without running most of the lifecycle hooks (maybe this is really a buildkite bug?). It seems odd that the agent is lost in these cases since the agent gets reused on the next build.

[Screenshot: log from the cancelled build, 2019-07-10]

@toolmantim

Hi @ianwremmel. Sorry you've been hitting that. Are you seeing this with the latest version of the agent? We had some signal handling bugs in previous versions, which might have prevented the plugin from having a chance to cleanup properly.

@ianwremmel

I'm seeing it on v3.0.3

@toolmantim

Thanks @ianwremmel! I can’t find 3.0.3 in https://github.com/buildkite/agent/releases 🤔

@ianwremmel

ianwremmel commented Jul 17, 2019

Oh, sorry, I misread; that's the version of the docker-compose plugin.

I'm using version v4.3.3 of the elastic ci stack, so it's whatever agent is in image ami-057de5cbfd86cfe88. If I'm reading the dashboard right, it's agent version v3.12.0.
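
If it helps to double-check, the versions can also be confirmed on the instance itself; a quick sketch, assuming you can SSH into the box:

# Confirm the agent and compose versions directly on the host
buildkite-agent --version    # should report something like 3.12.0
docker-compose version       # the compose binary on the host, separate from the plugin version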

@toolmantim

Ah cool, thanks for finding out the exact agent version @ianwremmel!

Between the hook and this function, we should be cleaning everything up.
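
In other words, the end result should be equivalent to running something like this against the build's compose project (a sketch of the shape of that cleanup rather than the plugin's actual source; the compose file name is a placeholder, and the project name is the buildkite<build-id> prefix visible in the error above):

# Tear down everything the build's compose project started
docker-compose -f docker-compose.yml \
  -p buildkite0163351ce43d43c792276a2da3d2b4e2 \
  down --remove-orphans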

The -1 "agent lost" log output is usually a sign that the instance got forcefully terminated (or that a spot instance didn't terminate gracefully). If you head to the timeline tab on the job with the -1 exit status and click through to the agent, did it run any jobs after that one, or was that the last job it was running before it disappeared?

For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?

I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?

Sorry for all the questions! Hopefully something will lead us to a clue, because we should already be doing a docker-compose down 🤔

@ianwremmel

I've seen the -1 agent lost when I've run out of memory (turns out eslint can't be run on a t2.micro), but in this case I'm pushing a new commit and Buildkite is aggressively killing the job (let's call it Job A). I think the -1 agent lost is a bit misleading here; the agent is still available and takes on additional jobs (let's call the next one Job B).

Job B fails with the Bind for 0.0.0.0:6379 failed: port is already allocated error because redis is still running since Job A never had a chance to run docker-compose down.

> For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?

Yes. It's still taking jobs. I've had to go into AWS to manually kill it so a new one with a normal environment would boot.

> I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?

We're approximating a Heroku deployment. The relevant compose files (we use the array form of the buildkite docker-compose config option) are posted below.

> Hopefully something will lead us to a clue, because we should already be doing a docker-compose down

Yeah, I looked into the plugin code and I can see that it's supposed to be calling down. It seems like one of these (see the screenshot below) might be causing the job to be killed in a way that doesn't run the cleanup lifecycle hooks.

[Screenshot: plugin code, 2019-07-17]

docker-compose configs:

version: "3.6"

services:
  app:
    build: .
    command: "rails server"
    depends_on:
      - chrome
      - firefox
      - postgres
      - redis
      - selenium
    environment:
      CAPYBARA_SERVER_HOST: "0.0.0.0"
      DATABASE_URL: postgresql://postgres@postgres:5432/postgres
      DISABLE_SSL: "I AM SURE"
      MEMCACHEDCLOUD_SERVERS: "memcached:11211"
      PORT: 3000
      RAILS_ENV: test
      RAILS_LOG_TO_STDOUT: "true"
      REDIS_URL: "redis://redis:6379"
      SELENIUM_DRIVER: ${SELENIUM_DRIVER-docker_chrome}
    networks:
      redacted:
        aliases:
          - provider.udlocal.com
          - admin.udlocal.com
          - api.udlocal.com
          - www.udlocal.com
    volumes:
      - source: ./db
        type: bind
        target: /app/db

version: "3.6"

services:
  chrome: &selenium_node_config
    image: selenium/node-chrome
    depends_on:
      - selenium
    environment:
      - HUB_HOST=selenium
      - HUB_PORT=4444
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - redacted

  firefox:
    <<: *selenium_node_config
    image: selenium/node-firefox

  memcached:
    image: memcached
    networks:
      - redacted
    ports:
      - "11211:11211"

  postgres:
    image: ${DATABASE_IMAGE_URI:-postgres:9.6}
    networks:
      - redacted
    ports:
      - "5432:5432"

  redis:
    image: redis:4.0.11
    networks:
      - redacted
    ports:
      - "6379:6379"

  selenium:
    image: selenium/hub
    ports:
      - "4444:4444"
    networks:
      - redacted

networks:
  redacted: {}

volumes:
  postgres: {}
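
A side note on the port collision itself: because each of these services publishes a fixed host port (6379, 5432, 11211, 4444), any container left over from a previous build will block the next one. A quick way to see what is still holding a port and clear it, sketched here with the redis port from the error:

# Show whichever container is still publishing host port 6379
docker ps --filter "publish=6379" --format "table {{.ID}}\t{{.Names}}\t{{.Ports}}"

# Once it's confirmed to be a stale build container, force-remove it
docker ps -q --filter "publish=6379" | xargs -r docker rm -f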

@alexkohler

Has any headway been made on this issue? I'm also running into something similar.

@glittershark

Just commenting to say I've started running into this on my pipeline again, after a relatively long time of not seeing it, and now I'm seeing it on a majority of builds.

@pzeballos

Hi @glittershark! Which agent version are you running? We made several changes to exit status handling and directory cleanup in the latest releases. Could you confirm whether this is still happening on the latest version (v3.33.3)?

@toote

toote commented Sep 21, 2022

Based on the fact that it has not been reported or upvoted in almost a year, I will proceed to close this, but feel free to re-open or create a new issue if this is still happening.

toote closed this as completed Sep 21, 2022
@glittershark

yeah, this seems to have cleared itself up for us 🤔
