mrsk deployments filling up disk without cleaning overlay2 directory #403

Closed
danthegoodman1 opened this issue Jul 26, 2023 · 9 comments · Fixed by #456

Comments

@danthegoodman1
Contributor

It seems that the filesystem layers of old images are retained, so disks fill up indefinitely. I've got workers that I've deployed updated images to fewer than 100 times, and each update is mostly a small single-layer change.

--- /var/lib/docker -----------------------------------------------------------------------------------------------------
                         /..
   36.7 GiB [##########] /overlay2
   71.0 MiB [          ] /containers
   68.6 MiB [          ] /image
  104.0 KiB [          ] /buildkit
   68.0 KiB [          ] /network
   28.0 KiB [          ] /volumes
   24.0 KiB [          ] /plugins
e   4.0 KiB [          ] /tmp
e   4.0 KiB [          ] /swarm
e   4.0 KiB [          ] /runtimes
    4.0 KiB [          ]  engine-id

overlay2 dir:

    2.1 GiB [##########] /5e4ace9160a0553be7b4e0c4a9370aa04072eba077f775ecd92d23d407e415e3
    1.0 GiB [####      ] /aa99cf5f7dee0c04cea13e086ee6725ab9a4e79a37934458db0ae2db558ad9a3
    1.0 GiB [####      ] /9aab8a6a708dcfae79b6340a44db30e4bbed9cdd36702494d334605bbe77b2df
    1.0 GiB [####      ] /0870087b132848602db4b1b1a5938dd29e24133a0bbdf0c65d04fcc74cf49194
    1.0 GiB [####      ] /764478f2a457581fd576181c7e92ffd68a18e9712a2e025861edc8767f86c0ca
  974.3 MiB [####      ] /3874976f79577d4f440ee2110762099991797e7ed2b7befc6fb8a54610ab53bf
  974.3 MiB [####      ] /08374222ee4577ac0e71b7a33f9d882345c8ebc582700c11da00f8115c74863e
  501.9 MiB [##        ] /9d1ca2306f374de9ab5c55ba18340dc3cd96b5e2f9ca121cc0909596ee6a34a5
  501.9 MiB [##        ] /7f177923f8b8ff57168765da9e3b88c57c5737a7da5bdafa01f679c8ced81e9d
  501.9 MiB [##        ] /794069a96278dba8549597d4f0858a6cd68fbe31f735510a41927fc4b5f45458
  369.0 MiB [#         ] /5167db57e964d6771cae16f0dc1abb55524cfc8645a5fde550d87bf32368171f
  369.0 MiB [#         ] /94d01efa25b4ce2361ecc3746be5dbfa3a5fdac3085d898ebadb1d953536167a
  368.8 MiB [#         ] /59c6c4ce1d86507a775b6a5b2937f9c726e057f360790daa42a538c444f00464
  368.8 MiB [#         ] /d70a0204f17f7713e42162b71169a9e979ef01b8956af1f75880521836dbc811
  368.8 MiB [#         ] /ddc840b4f2be97fd1fd5c65b326d0ff96a151cf28518fa368439c2de2ea6aa60
  368.8 MiB [#         ] /87a2fb4119b45752dd13e54c97e7186fc34a2ef3fb412821e0a32fd543e5bf34
  368.8 MiB [#         ] /7a916c94d1b6c7406f3e66a920ea0e1c3b1bfdbae15e45e7b966535b9ab57c20
  368.7 MiB [#         ] /9b148ec250472f78cecbdc97e019ae5b5ee36a3f9173f0b7efda22aeedbdcdba
  354.8 MiB [#         ] /31ea5de4a74ad537543de61f31c89bbb77c906b45203977c51d85eb0a92ac206
  354.7 MiB [#         ] /11a77b5de6cec4e893270fa5ee8ccc0c984b17aa36f51398f771f528e0c1fdbb
  354.7 MiB [#         ] /e9f9e821dfe32dc82dbbdf2612ba5138101fc8fea764440f22cdcf2e7fbf71c8
  354.7 MiB [#         ] /4385f60fb8435b7a605f7b84e6933924899d550c953ef2a624f8f41f1e467b6e
  354.7 MiB [#         ] /a7db5c22a6ffa2a573ed7a46ce969b2e4de3441431558bcd354aa90c5b2204fc
  354.6 MiB [#         ] /ca7739dabc9209a63592407d73b226b4e868384a29a23a46039a0003cc6f0e4f
  354.6 MiB [#         ] /a8eee5e12531862db1aafb1539b036fbce780650ed8847ff7c0f6493dbc5de3c
  354.5 MiB [#         ] /ffba26d30ea201e63b746ca059c807807d1304f57ff5e7b2f5fc4f3f90fe7dca
  354.4 MiB [#         ] /1a698f71e43504f369d5361ba5fd96918f292099b9e03d37ed2ad2f932b48963
  354.4 MiB [#         ] /e17a0c825c7738eaaadef895a5b136696e7f2849a9917f37e8b2b0fbd197347b
  354.3 MiB [#         ] /72b291b3b23bdafb901b6540e0002aad70a4583f683b8396aa15b90aa3126af9
  354.3 MiB [#         ] /08b62b11dcaa31c808809407b25d670d4cbe01a494fdf1ffedab2187b7c09d76
  354.2 MiB [#         ] /7ad3c329f06537b049d1934281ddcbb594655a62c36a88019fb73434c53ad1cc
  354.1 MiB [#         ] /2452d7b103d1ac568031c7010d91bc0a3ee345990c81c4878c65cffa2e57d933
  354.1 MiB [#         ] /6c6e7715173d3e2fce24c66dcd21add55fc54020b4f4fee11d9ede688047aebd
  354.0 MiB [#         ] /d1b85b15fe7fa87df99bf306f4287bdfa7cd4c89e874d662b1a851c2f9d28926
  354.0 MiB [#         ] /3c2cc9d3be3995cbd9a6d263c6a1dfcdc866d170425d9b0218f128577f454628
  354.0 MiB [#         ] /7578e87485fff52f78d940c1ca7b64b5cb1b6ad762fd434579a4eaa54753ddfd
  354.0 MiB [#         ] /2fd6771778386c034537e5e6bcbbad1409c3463f61ac288622675abe22e4ca7b
  353.9 MiB [#         ] /726596cdcc3ca9e9c8ede245c299187803a66b53d0f3b5f0e1de6d12a69bdacb
  353.7 MiB [#         ] /5fd1f498f0709ea6d5d379899d823bfb8c2ee508a90d38a21e1ab8f49f9f1e70
  340.6 MiB [#         ] /c21623702dfba4cf8c1bf1267c701e19e6adfa5368372e8c0687b0e18fc12fe1
  340.6 MiB [#         ] /dad63ec22276ae4178f5dd1f078912ba9ddc01522b4bf9c294fe8f017281e4be
  340.6 MiB [#         ] /4a71b4380bd3f6913956fd480fc45bbb7b392d5365a0c74e481c81d1c36e7c26
  340.6 MiB [#         ] /477276ca72bd812009610c973f20b26854c2bde28e3dbd107ec5d6d0d37573c3
  340.6 MiB [#         ] /ee4227bcd76e37172756504bf3755c5f6fbc4fa0444d291f7f537bae6aa59c38
  340.6 MiB [#         ] /95fcdb3df1c31fc6b8d4a7031c22f4a6bfb0758b9b001eb3ff64ba994e269955
  340.5 MiB [#         ] /09c96a5b44a517f04bdb224aa1732741bbf0fcb79c22e68ba7148ec873aedc47
  340.5 MiB [#         ] /3daf9e3031bf00af47c46a865988d620bb6d37fc7df5cd9fe42069386ff49beb
  340.4 MiB [#         ] /be9f12d17d2516b0617aa55da6fd508a472ee81f9d293db8f3402d4596a2ce66
  340.4 MiB [#         ] /9b91a45ef3013da7a714d3b569e6c502baf321d789e9ed66130ab1a23507d725
  340.4 MiB [#         ] /371eda75ffcca76365118753e04644edb200723daf1fff58bdf7968c32f985c2
  340.3 MiB [#         ] /dd200bec784ed709f3c4dba097b74142c6b457e9688e5fc892b302e95e667f73
  340.3 MiB [#         ] /1d57d36d12aa314b772b80fd37d83e0b411fc3a82d82200de161c7582b5d36ab
  340.3 MiB [#         ] /d85cc16624ecc80a599d0d9460ef5d6c3326222b6d40161ae6283609e6c7f69c
  340.3 MiB [#         ] /d54718f5b6e9a46c98a0d1b2e7a9d189ae99e7e46a002bbb0176175029ddb697
  340.3 MiB [#         ] /f72223f4a2ac997480809c1a6c3cd5624eca2f1e2d3a81dbd5900a9960c112ad
  340.1 MiB [#         ] /17124a51574dbc7b3f06dc7770405f7890bf8a5d9a848a44e9f02de5a1b6f615
  340.0 MiB [#         ] /fa0bea066f14fbd67d811a7d2c93012be9388bfd12a6a51f6d254810f18fb986
  340.0 MiB [#         ] /4cb2f4975e1cd6d8ea1d82748e25850522ea04fe77b268f8e56abaa6a83c3851
  339.9 MiB [#         ] /c48d13843f55e0899f8a88b6501c67f6cc78b90a3e91e4b6e72ad78cf266c712
  339.8 MiB [#         ] /b324cdf2b81222f38446bb6bc4c533a58460b626d7373d9fb71fd27fad6469e6
  339.8 MiB [#         ] /922776493ee05cedb32dbf20aa3dc0b5d381bff18b1c29127f67c94d94055c0c
  339.8 MiB [#         ] /2dc56fcb46b802599bb19aadc42f8bd1c0c28ee21857664cfd0d259efaddbeda
  337.6 MiB [#         ] /fcc5965ddb6ac63b9f02e98ce1fb572327cdb588735076d19088ba0816cd7e22
  337.5 MiB [#         ] /65d89172c4a4b80886de0488ea31631c23a3d1536c3cab0f3dd9db1bd416bd03
  337.5 MiB [#         ] /9f154901902a2e976cb5343c9ebb1b5f6265e17149ca13d7991956de2a7317dd
  218.7 MiB [#         ] /979c2e1af539509ebe943b7afc7a59081050e4c29080127461f4273f6410a7a1
  168.9 MiB [          ] /762d48a821e7f4bb35db17ae9edd482fcaf5cee2d73336723e2a60cc1cb5aba0
  168.9 MiB [          ] /0db52438bef910bb8d7c1c65ea4eb551c7be42bc62cfccc2ce1946ca3abf7892
  168.5 MiB [          ] /dba128fe875422ab744e07b6a5250dc75e4edef9e91066f5c16c4bdf402142c1
  143.2 MiB [          ] /cc82546d84001dd05359fd85daabc69bf48fc4ff24e463ddd6f4da713ce0ff4e
  143.2 MiB [          ] /72d5ed5817adc9715a33748d7f04740d5731c45c00b63d6ce717cc65afee6bb7
  143.2 MiB [          ] /478a55836964f47a90d1f379e06928cb1c531e647770d11f7cb50fd43ecdc4b5
  143.2 MiB [          ] /8ec468b6cbd2989dc8bbfa893dfb853a6a06ab640337a0db1a154bb91c8e1ea7
 Total disk usage:  36.7 GiB  Apparent size:  34.2 GiB  Items: 1272075
@danthegoodman1
Contributor Author

docker system prune seems to solve the issue, but according to the README this should happen automatically for 3-day-old images on the next deploy?

@danthegoodman1
Contributor Author

danthegoodman1 commented Jul 26, 2023

Running mrsk prune all did not seem to remove them on one of the servers, and neither did docker system prune --force. I had to manually clear out the overlay2 directory.

@djmb
Collaborator

djmb commented Aug 8, 2023

@danthegoodman1 - MRSK doesn't call docker system prune as that doesn't give enough control over which containers and images are deleted.

It will run commands like these:

  1. Delete all containers except the last 5:
docker ps -q -a --filter label=service=app --filter status=created --filter status=exited --filter status=dead | tail -n +6 | while read container_id; do docker rm $container_id; done
  2. Delete dangling images:
docker image prune --force --filter label=service=app --filter dangling=true
  3. Delete all image tags that don't match running containers:
docker image ls --filter label=service=app --format '{{.ID}} {{.Repository}}:{{.Tag}}' | grep -v -w "$(docker container ls -a --format '{{.Image}}\|' --filter label=service=app | tr -d '\n')registry:4443/app:latest\|registry:4443/app:<none>" | while read image tag; do docker rmi $tag; done

This ensures we have just the last 5 containers and their images.

These commands are all filtered by --filter label=service=<your app>. Are you building your images externally to MRSK? If so, and if you are not adding a service label to the images, they'll be missed by the pruning.

Please let me know if that's the case. Either way we should add a check for that label and reject images that are missing it.
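
If you are building externally, one way to make the images visible to the pruning is to attach the label at build time. A rough sketch, using the placeholder names from the commands above:

# Attach the service label so MRSK's label-filtered pruning can see the image
docker build --label service=app -t registry:4443/app:latest .
docker push registry:4443/app:latest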

@danthegoodman1
Contributor Author

I am building them externally; perhaps that's why they were ignored by the prune? I deploy like this:

docker build -t xxxx .

docker push xxxx

mrsk redeploy -P --version latest

deploy.yaml:

# Name of your application. Used to uniquely configure containers.
service: xxx

# Name of the container image.
image: xxx

# Deploy to these servers.
servers:
...

# Credentials for your image host.
registry:
  # Specify the registry server, if you're not using Docker Hub
  server: xxx
  username: AWS

  # Always use an access token rather than real password when possible.
  password: <%= %x(aws ecr get-login-password --region us-east-1 --profile xxx) %>

# Inject ENV variables into containers (secrets come from .env).
...

# Use a different ssh user than root
ssh:
  user: ubuntu

# Configure a custom healthcheck (default is /up on port 3000)
healthcheck:
  path: /up
  port: 8080
  max_attempts: 12
  interval: 5s

@danthegoodman1
Contributor Author

I'm also noticing that docker system prune -af and docker image prune -af are not getting everything in the overlay2 directory. One machine has about 40 GB of junk in there, others ~17 GB.
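
For comparison, a rough way to check what Docker itself accounts for versus what is actually on disk (assuming the default /var/lib/docker data root):

# Per-image and per-container usage as Docker sees it
docker system df -v

# Actual size of the overlay2 directory
sudo du -sh /var/lib/docker/overlay2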

@danthegoodman1
Contributor Author

It seems to mostly be from one layer as well.

@yoelcabo
Contributor

yoelcabo commented Sep 6, 2023

I am observing the same issue; our hosts are being filled with images:

root@media-worker:~# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          233       5         67.52GB   65.88GB (97%)
Containers      6         1         377.8MB   148.2MB (39%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

docker system prune -a or docker image prune --force -a fixed it for us, but I'd like to get a permanent fix.

Debugging a bit, I found kamal is using the following command:

docker image prune --force --filter label=service=happyscribe-media-worker --filter dangling=true

I found two issues with this:

  1. The unused images are still tagged. So either there is a bug where they are not being untagged, or we should add -a to the command.

  2. Even with -a it still doesn't work, because the images don't have the label (but the containers do):

root@media-worker:~# docker image prune --force --filter label=service=happyscribe-media-worker --filter dangling=true
Total reclaimed space: 0B
root@media-worker:~# docker image prune -a --force --filter label=service=happyscribe-media-worker --filter dangling=true
Error response from daemon: invalid filter 'dangling=[true false]'
root@media-worker:~# docker image prune -a --force --filter label=service=happyscribe-media-worker
Total reclaimed space: 0B
root@media-worker:~# docker ps --filter label=service=happyscribe-media-worker
CONTAINER ID   IMAGE                                                           COMMAND                  CREATED             STATUS             PORTS     NAMES
528d16bb28e4   happyscribe/main-app:90026513e5cff6888e7055cb183aa14107908b19   "bundle exec sidekiq…"   About an hour ago   Up About an hour             happyscribe-media-worker-workers-production-90026513e5cff6888e7055cb183aa14107908b19
b19c24f4d02b   happyscribe/main-app:44940d1065eb7cd1ec2db5e9ed16f1fe31053246   "bundle exec sidekiq…"   3 hours ago         Up 3 hours                   happyscribe-media-worker-workers-production-44940d1065eb7cd1ec2db5e9ed16f1fe31053246_replaced_004bae419dfa7c8a
root@media-worker:~# docker images --filter label=service=happyscribe-media-worker
REPOSITORY   TAG       IMAGE ID   CREATED   SIZE
root@media-worker:~# 
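
To confirm whether the image itself carries the label (as opposed to the container), something like this can be used, taking the tag from the output above:

docker image inspect happyscribe/main-app:90026513e5cff6888e7055cb183aa14107908b19 --format '{{ index .Config.Labels "service" }}'
# prints an empty line when the image has no service label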

Important note: we are building the Docker images outside of Kamal and pushing them to Docker Hub, then deploying with --skip-push. So the label issue is on us.

For anyone using GitHub Actions, I fixed it like this:

    - name: Build and push Docker image
      uses: docker/build-push-action@v2
      with:
        context: .
        push: true
        tags: happyscribe/main-app:${{ github.sha }},happyscribe/main-app:latest
        cache-from: type=gha
        cache-to: type=gha,mode=max
        labels: service=happyscribe-media-worker ## Added this line

The image tagging is also "on us" because we are tagging each image with the GitHub commit. That said, I think Kamal should just prune all images that don't have a running container:

docker image prune -a --force --filter label=service=happyscribe-media-worker

If only one app is supposed to be running on a server at once, it makes sense that we only keep that container's image, right?

@djmb would you be open to a PR changing this?

For now I added a post-deploy step in our CD that runs:

kamal app remove_images 

(I found the command by digging through the code.)
This achieves what I wanted: removing all images that are not being used, while keeping the running ones.
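
Putting it together, our external-build flow now looks roughly like this (the names and the version variable are from our GitHub Actions setup; the exact flags may differ for yours):

# Build outside Kamal, attaching the service label so pruning can find the image
docker build --label service=happyscribe-media-worker -t happyscribe/main-app:$GITHUB_SHA .
docker push happyscribe/main-app:$GITHUB_SHA

# Deploy the already-pushed image, then clean up unused images on the hosts
kamal deploy --skip-push --version $GITHUB_SHA
kamal app remove_images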

@djmb
Collaborator

djmb commented Sep 8, 2023

Hi @yoelcabo!

Thanks for the investigation! I think we should introduce a check for external images to ensure that they have the correct labels, which should prevent this in the future. I'll get to this soon.

Regarding docker image prune -a - we used to use it but had to stop.

When you deploy, the image is tagged with the version and also with latest. Running docker image prune -a removes the version tag (I assume because it sees the container as only using the latest tag). Then, when you deploy a new version, the latest tag moves to it and the old image is left untagged. This breaks rollbacks. See #270.
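
For illustration, the tags on a host can be listed with the same label filter and format as the pruning commands above; after a healthy deploy, both the version tag and latest should point at the same image ID, and it is the version tag that a rollback relies on:

docker image ls --filter label=service=app --format '{{.ID}} {{.Repository}}:{{.Tag}}'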

The integration tests include a rollback, so you could test this out there if you want to look into it in more detail.
