
Run docker-compose down before build? #228

Closed
ianwremmel opened this issue Jul 9, 2019 · 12 comments

@ianwremmel

I'm getting errors like the following, but I only run one agent per host. I'm pretty sure a prior build got cancelled and didn't properly clean up the running containers. Can an option be added to kill any containers from previous jobs? Or is there a more robust way to force cleanup to happen after the build completes?

ERROR: for buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1  Cannot start service redis: driver failed programming external connectivity on endpoint buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 (d35a813f99837fb20d793bcfe8ba884c04c9e24eb8725f1791ecf914afdba253): Bind for 0.0.0.0:6379 failed: port is already allocated
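
As a possible workaround (a sketch, not an option the plugin currently provides): an agent-level pre-command hook that force-removes anything left behind by a cancelled build before the next job starts. The hook path below is the stock agent hooks directory, and the name filter just matches the buildkite-prefixed container names in the error above; this is only safe because there's one agent per host.

#!/bin/bash
# /etc/buildkite-agent/hooks/pre-command (path assumed; adjust for your install)
# Best-effort removal of containers left over from a previous, cancelled build.
# The docker-compose plugin names containers with a "buildkite<build-id>" prefix,
# so filtering on "buildkite" catches them. Only safe with a single agent per host.
set -euo pipefail

docker ps -aq --filter "name=buildkite" | xargs -r docker rm -f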
@ianwremmel

Looking at the log from one of the cancelled builds, it looks like it's terminating without running most of the lifecycle hooks (maybe this is really a buildkite bug?). It seems odd that the agent is lost in these cases since the agent gets reused on the next build.

[Screenshot: log from the cancelled build, 2019-07-10]

@toolmantim

Hi @ianwremmel. Sorry you've been hitting that. Are you seeing this with the latest version of the agent? We had some signal handling bugs in previous versions, which might have prevented the plugin from having a chance to cleanup properly.

@ianwremmel

I'm seeing it on v3.0.3

@toolmantim

Thanks @ianwremmel! I can’t find 3.0.3 in https://github.com/buildkite/agent/releases 🤔

@ianwremmel

ianwremmel commented Jul 17, 2019

Oh, sorry, I misread; that's the version of the docker-compose plugin.

I'm using version v4.3.3 of the elastic ci stack, so it's whatever agent is in image ami-057de5cbfd86cfe88. If I'm reading the dashboard right, it's agent version v3.12.0.
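
If it helps to double-check, the versions can also be confirmed on the instance itself; a quick sketch, assuming you can SSH into the box:

# Confirm the agent and compose versions directly on the host
buildkite-agent --version    # should report something like 3.12.0
docker-compose version       # the compose binary on the host, separate from the plugin version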

@toolmantim

Ah cool, thanks for finding out the exact agent version @ianwremmel!

Between the hook and this function, we should be cleaning everything up.
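
In other words, the end result should be equivalent to running something like this against the build's compose project (a sketch of the shape of that cleanup rather than the plugin's actual source; the compose file name is a placeholder, and the project name is the buildkite<build-id> prefix visible in the error above):

# Tear down everything the build's compose project started
docker-compose -f docker-compose.yml \
  -p buildkite0163351ce43d43c792276a2da3d2b4e2 \
  down --remove-orphans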

The -1 "agent lost" log output is usually a sign that the instance got forcefully terminated (or that a spot instance didn't terminate gracefully). If you head to the timeline tab on the job with the -1 exit status and click through to the agent, did it run any jobs after that one, or was that the last job it was running before it disappeared?

For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?

I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?

Sorry for all the questions! Hopefully something will lead us to a clue, because we should already be doing a docker-compose down 🤔

@ianwremmel

I've seen the -1 agent lost when I've run out of memory (turns out eslint can't be run on a t2.micro), but in this case I'm pushing a new commit and Buildkite is aggressively killing the job (let's call it Job A). I think the -1 agent lost is a bit misleading here; the agent is still available and takes on additional jobs (let's call the next one Job B).

Job B fails with the Bind for 0.0.0.0:6379 failed: port is already allocated error because redis is still running since Job A never had a chance to run docker-compose down.

> For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?

Yes. It's still taking jobs. I've had to go into AWS to manually kill it so a new one with a normal environment would boot.

> I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?

We're approximating a Heroku deployment. The relevant compose files (we use the array form of the buildkite docker-compose config option) are posted below.

> Hopefully something will lead us to a clue, because we should already be doing a docker-compose down

Yeah, I looked into the plugin code and I can see that it's supposed to be calling down. It seems like one of these (see the screenshot below) might be causing the job to be killed in a way that doesn't run the cleanup lifecycle hooks.

[Screenshot: plugin code, 2019-07-17]

docker-compose configs:

version: "3.6"

services:
  app:
    build: .
    command: "rails server"
    depends_on:
      - chrome
      - firefox
      - postgres
      - redis
      - selenium
    environment:
      CAPYBARA_SERVER_HOST: "0.0.0.0"
      DATABASE_URL: postgresql://postgres@postgres:5432/postgres
      DISABLE_SSL: "I AM SURE"
      MEMCACHEDCLOUD_SERVERS: "memcached:11211"
      PORT: 3000
      RAILS_ENV: test
      RAILS_LOG_TO_STDOUT: "true"
      REDIS_URL: "redis://redis:6379"
      SELENIUM_DRIVER: ${SELENIUM_DRIVER-docker_chrome}
    networks:
      redacted:
        aliases:
          - provider.udlocal.com
          - admin.udlocal.com
          - api.udlocal.com
          - www.udlocal.com
    volumes:
      - source: ./db
        type: bind
        target: /app/db

version: "3.6"

services:
  chrome: &selenium_node_config
    image: selenium/node-chrome
    depends_on:
      - selenium
    environment:
      - HUB_HOST=selenium
      - HUB_PORT=4444
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - redacted

  firefox:
    <<: *selenium_node_config
    image: selenium/node-firefox

  memcached:
    image: memcached
    networks:
      - redacted
    ports:
      - "11211:11211"

  postgres:
    image: ${DATABASE_IMAGE_URI:-postgres:9.6}
    networks:
      - redacted
    ports:
      - "5432:5432"

  redis:
    image: redis:4.0.11
    networks:
      - redacted
    ports:
      - "6379:6379"

  selenium:
    image: selenium/hub
    ports:
      - "4444:4444"
    networks:
      - redacted

networks:
  redacted: {}

volumes:
  postgres: {}
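
A side note on the port collision itself: because each of these services publishes a fixed host port (6379, 5432, 11211, 4444), any container left over from a previous build will block the next one. A quick way to see what is still holding a port and clear it, sketched here with the redis port from the error:

# Show whichever container is still publishing host port 6379
docker ps --filter "publish=6379" --format "table {{.ID}}\t{{.Names}}\t{{.Ports}}"

# Once it's confirmed to be a stale build container, force-remove it
docker ps -q --filter "publish=6379" | xargs -r docker rm -f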

@alexkohler

Has any headway been made on this issue? I'm also running into something similar.

@glittershark

Just commenting to say I've started running into this on my pipeline again, after a relatively long time of not seeing it, and now I'm seeing it on a majority of builds.

@pzeballos

Hi @glittershark! Which agent version are you running? We made several changes to exit status handling and directory cleanup in the latest releases. Could you confirm whether this is still happening on the latest version (v3.33.3)?

@toote

toote commented Sep 21, 2022

Based on the fact that it has not been reported or upvoted in almost a year, I will proceed to close this, but feel free to re-open or create a new issue if this is still happening.

toote closed this as completed Sep 21, 2022
@glittershark

yeah, this seems to have cleared itself up for us 🤔
