
Improve docker images #1359

Merged
merged 5 commits into cantino:master from dsander:docker on Apr 10, 2016

Conversation

dsander
Collaborator

@dsander dsander commented Mar 19, 2016

In my opinion our docker images still have a few issues:

  1. We do not leverage layer caching: every new image build is done from scratch, and packages and gems are installed even if we just changed one line.
  2. We cannot have tagged images on Docker Hub; tagging for automated builds only works with git tags (which need to be configured explicitly), rollbacks to a previous version are hard, and matching an image to a commit on the master branch is nearly impossible.
  3. One cannot build custom Huginn images without first pushing the code to a git repository and changing the docker scripts.

This PR tries to address all three problems, with one downside: we can no longer use automated Docker Hub builds.

Problem 3

The image build is now based on the current working directory, so every change made locally (or checked out on the build server) is included in the image. Building custom images is now easy: docker build -t myusername/huginn -f docker/multi-process/Dockerfile .

Problem 2

By using CircleCI to build the images, we can configure when we want to tag images and push the tags to Docker Hub after a build finishes. What I would really like to do is build images for every PR, to be able to test the changes locally, but since we need the Docker Hub credentials stored in ENV this is not possible right now.
The build status is viewable here: https://circleci.com/gh/dsander/huginn
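As a sketch, the tag-and-push step on CircleCI could look roughly like the following hypothetical circle.yml. The image names, environment variable names, and commands here are illustrative, not the actual configuration from this PR:

```yaml
# Hypothetical circle.yml sketch (CircleCI 1.0 style) — names are illustrative.
machine:
  services:
    - docker

dependencies:
  override:
    # Build from the working directory so layer caching applies.
    - docker build -t cantino/huginn -f docker/multi-process/Dockerfile .

deployment:
  hub:
    branch: master
    commands:
      - docker login -u "$DOCKER_USER" -p "$DOCKER_PASS"
      # Tag with the commit SHA so every image can be matched to a commit
      # on the master branch, and rollbacks become a simple pull.
      - docker tag cantino/huginn cantino/huginn:$CIRCLE_SHA1
      - docker push cantino/huginn:$CIRCLE_SHA1
      - docker push cantino/huginn:latest
```

Restricting the deployment section to the master branch is what keeps the credentials out of PR builds.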

We could set up a private Jenkins instance to do the builds (and I would gladly pay the hosting bill), but I was not sure how the community would feel about a build process that is not public. An option could be to do only the PR builds on a private server.

Problem 1

Image layer caching is leveraged by using the current working directory as the base for the image build. This could also be achieved on Docker Hub, but there the Dockerfiles would have to be placed in the Huginn root directory, which would then use the root README.md for the image description. Once we move away from automated builds, the image description needs to be updated manually, but only when we change it, not for every build.
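The ordering trick that makes the caching effective is to copy the dependency manifests before the application code, so the expensive layers are only rebuilt when their inputs change. A minimal sketch, illustrative only and not the actual Dockerfile from this PR:

```dockerfile
# Illustrative sketch — base image and paths are assumptions.
FROM ubuntu:14.04

WORKDIR /app

# Copy only the dependency manifests first: the bundle install layer
# below is reused from cache as long as the Gemfiles are unchanged.
COPY Gemfile Gemfile.lock /app/
RUN bundle install --deployment

# Copy the application code last, so a one-line code change only
# invalidates this small layer, not the gem installation above.
COPY . /app/
```

This is exactly why the layer diffs in the "After" section below shrink from hundreds of megabytes to a few.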

For a comparison of image sizes I used e5fcb24 as the baseline for the first image, then built images after the merge of #1330 and another after #1301.

Before

I will only show the layer diff for the first change (after the merge of #1330); in the second build the same layers changed.

cantino/huginn-single-process:

> 6184db4c5108        25 hours ago        /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> 1237355c58ab        25 hours ago        /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> f6a392bdce47        25 hours ago        /bin/sh -c #(nop) ADD file:c02e2b155230fdf907   1.946 kB
> 0661a9bac381        25 hours ago        /bin/sh -c #(nop) WORKDIR /app                  0 B
> 20d55078086d        25 hours ago        /bin/sh -c /scripts/setup                       196.3 MB
> 39a5d340265a        25 hours ago        /bin/sh -c #(nop) ADD file:feb8aee3f31639be37   1.384 kB
> a03a924590ab        25 hours ago        /bin/sh -c /scripts/prepare                     270.3 MB
> bee631232603        25 hours ago        /bin/sh -c #(nop) ADD file:05c9559d7956058565   1.148 kB
> a0faca291444        25 hours ago        /bin/sh -c #(nop) MAINTAINER Dominik Sander     0 B

This is everything: for every update the user needs to download the whole image, even if the change was just one line. For the cantino/huginn image it is even worse:

> 69635f1d0837        26 hours ago        /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> 850776335c60        26 hours ago        /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> 9ba5a5d0e5c2        26 hours ago        /bin/sh -c #(nop) VOLUME [/var/lib/mysql]       0 B
> 875f14571cbd        26 hours ago        /bin/sh -c #(nop) ADD file:a67575f9e580cbbbdd   6.356 kB
> 60d0d2da6c5f        26 hours ago        /bin/sh -c /scripts/standalone-packages         193.3 MB
> 7f36705ffbe5        26 hours ago        /bin/sh -c #(nop) ADD file:3a61c7a67f98478b9d   282 B
> 7387e9d7d25e        26 hours ago        /bin/sh -c #(nop) WORKDIR /app                  0 B
> 673ef7acb3f1        26 hours ago        /bin/sh -c #(nop) MAINTAINER Andrew Cantino     0 B
> a4f93035eb3f        26 hours ago        /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> 1237355c58ab        26 hours ago        /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> f6a392bdce47        26 hours ago        /bin/sh -c #(nop) ADD file:c02e2b155230fdf907   1.946 kB
> 0661a9bac381        26 hours ago        /bin/sh -c #(nop) WORKDIR /app                  0 B
> 20d55078086d        26 hours ago        /bin/sh -c /scripts/setup                       196.3 MB
> 39a5d340265a        26 hours ago        /bin/sh -c #(nop) ADD file:feb8aee3f31639be37   1.384 kB
> a03a924590ab        26 hours ago        /bin/sh -c /scripts/prepare                     270.3 MB
> bee631232603        26 hours ago        /bin/sh -c #(nop) ADD file:05c9559d7956058565   1.148 kB
> a0faca291444        26 hours ago        /bin/sh -c #(nop) MAINTAINER Dominik Sander     0 B

After

After merging #1330 (bf7c2fe), in which only some code was changed (~13 MB uncompressed).

huginn-single-process:

> c2bbbe3a9a09        About an hour ago   /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> 2c77dedbfc4c        About an hour ago   /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> b9e2b108b584        About an hour ago   /bin/sh -c /scripts/setup                       3.771 MB
> c0f5ff6d133e        About an hour ago   /bin/sh -c #(nop) ADD multi:03226194c6e1c964e   3.203 kB
> 619dcca9e29e        About an hour ago   /bin/sh -c #(nop) COPY dir:1c4e202eb2e5bed2b6   1.929 MB

And the same small changeset for the multi-process images 🎉 :

> 7bed47eeb156        About an hour ago   /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> 6d2f33ac50d1        About an hour ago   /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> 387a6a0e9bf4        About an hour ago   /bin/sh -c #(nop) VOLUME [/var/lib/mysql]       0 B
> 89298a606b1e        About an hour ago   /bin/sh -c /scripts/setup                       3.771 MB
> 7af26e5158f6        About an hour ago   /bin/sh -c #(nop) ADD multi:224d2bac862998f34   7.633 kB
> a4f01acc69e4        About an hour ago   /bin/sh -c #(nop) COPY dir:1c4e202eb2e5bed2b6   1.929 MB

After merging #1301 (9a588e0), we added new gems, so the gems needed to be reinstalled (~200 MB uncompressed).

huginn-single-process:

> 2f36b68220d6        About an hour ago   /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> c8fa70731d08        About an hour ago   /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> 9044f4f1d25e        About an hour ago   /bin/sh -c /scripts/setup                       3.771 MB
> 303424e95c1c        About an hour ago   /bin/sh -c #(nop) ADD multi:03226194c6e1c964e   3.203 kB
> e8b7e9add23b        About an hour ago   /bin/sh -c #(nop) COPY dir:87808e94c248cf9181   1.954 MB
> 218475f5fb45        About an hour ago   /bin/sh -c chown -R huginn:huginn /app &&       193.5 MB
> 62a8cd05532e        About an hour ago   /bin/sh -c #(nop) ADD dir:d2a851ae8c2e0715263   16.34 kB
> b922d489af33        About an hour ago   /bin/sh -c #(nop) ADD file:7f48e629b343e9cf65   1.136 kB
> 3c850d8620af        About an hour ago   /bin/sh -c #(nop) ADD multi:4d61c5fca0c04e181   22.22 kB

The diff stays small for the multi-process image:

> 97107a410fb7        About an hour ago   /bin/sh -c #(nop) CMD ["/scripts/init"]         0 B
> db498063263c        About an hour ago   /bin/sh -c #(nop) EXPOSE 3000/tcp               0 B
> 7d33a4f76a4a        About an hour ago   /bin/sh -c #(nop) VOLUME [/var/lib/mysql]       0 B
> 84aa3f6b7a75        About an hour ago   /bin/sh -c /scripts/setup                       3.771 MB
> 25d0f1f5b04a        About an hour ago   /bin/sh -c #(nop) ADD multi:224d2bac862998f34   7.633 kB
> 16c4181eab24        About an hour ago   /bin/sh -c #(nop) COPY dir:87808e94c248cf9181   1.954 MB
> a7a70cefd4c4        About an hour ago   /bin/sh -c chown -R huginn:huginn /app &&       193.5 MB
> 2b251c93171b        About an hour ago   /bin/sh -c #(nop) ADD dir:d2a851ae8c2e0715263   16.34 kB
> becb9172f0eb        About an hour ago   /bin/sh -c #(nop) ADD file:7f48e629b343e9cf65   1.136 kB
> da89499502fc        About an hour ago   /bin/sh -c #(nop) ADD multi:4d61c5fca0c04e181   22.22 kB

Summary

Disadvantages:

  • No more automated builds, which means manually updating the image descriptions on Docker Hub

Advantages:

  • Layer caching, which leads to faster updates and fewer bytes to download
  • Faster image builds (even without an optimized CircleCI config)
  • Tags for images which match the commit they are based on
  • Ease of building custom images
  • Potential to build images for every PR

Since the init script, which does most of the heavy lifting, has been nearly untouched, I do not think this should cause any issues, but we should still get some verification that the multi-process image still works (I switched to the updated single-process image myself).

@elvetemedve @gdomod @csu @peterseverin @uri @sky-chen @skray @mhow2 @dannysu @xu-cheng @daturkel @kennethkalmer @apopiak If you are still using one of the Huginn Docker images would you mind testing if everything still works when you switch to dsander/huginn or dsander/huginn-single-process (depending on which image you used before)?

TODO

  • Move shared docker scripts to /docker/scripts/ and update the Dockerfiles (omitted for now to make the diff more readable)
  • Squash commits
  • Change namespace in .travis.yml
  • Possibly reencrypt the environment variables in .travis.yml

Needs to be done by @cantino

  • Update Docker Hub configuration
    • Add huginnbuilder as contributor on Docker Hub
    • Disable automated build for cantino/huginn-single-process and build triggers for cantino/huginn

@coveralls

Coverage Status

Coverage remained the same at 90.715% when pulling 961d96e on dsander:docker into 9a588e0 on cantino:master.

@elvetemedve
Contributor

Hi @dsander,
I've switched to the dsander/huginn image and everything seems to be working so far.

I found the errors below in the log, but they look like warnings rather than faults.

Mar 20 12:00:50 ZBox docker[54209]: mysqld stderr | 160320 11:00:48 mysqld_safe Can't log to error log and syslog at the same time.  Remove all --log-error configuration options for --syslog to take effect.
Mar 20 12:01:45 ZBox docker[54209]: foreman stderr | huginn_production already exists foreman stderr |
Mar 20 12:01:45 ZBox docker[54209]: foreman stderr | + exec bundle exec foreman start

I will keep this container running and let you know if something goes wrong.

@dsander
Collaborator Author

dsander commented Mar 20, 2016

@elvetemedve Thanks for testing! I don't really know what the mysqld stderr message means, but I think it was there before; the foreman log lines are no problem.

@cantino
Member

cantino commented Mar 22, 2016

Awesome! I'll admit that I still use Capistrano and don't really know Docker, though. :) Let me know how I can help.

@dsander dsander force-pushed the docker branch 4 times, most recently from 00fc4ed to 6e9b690 Compare April 2, 2016 10:18
@dsander
Collaborator Author

dsander commented Apr 2, 2016

@cantino I think two weeks are enough time for everyone who wanted to test it. If you are OK with having to update the descriptions on Docker Hub manually when we change them, I think we are ready to merge this.

The TODO list is updated with a few tasks only you can do due to the GitHub and Docker Hub permissions. I will send you the environment variables for the CircleCI configuration via email. Except for disabling the automated builds on Docker Hub, everything can be done whenever you have time (the CircleCI builds will just fail if you enable them before we merge this).

Disabling the automated Docker Hub builds, finalizing the circle.yml configuration, and merging this need to be coordinated, so we have to find a time when we are both available. As always, no rush 😄

@elvetemedve
Contributor

@dsander The container you created had been running for 1 week and 1 day without any issues, but now it has stopped because it ran out of memory. Looks like a memory leak to me.
Here is the end of the log:

foreman stdout | 20:46:51 jobs.1 | Queuing event propagation
foreman stdout | 20:46:51 jobs.1 | Queuing schedule for every_1m
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.receive! (id=74072) RUNNING
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.receive! (id=74072) COMPLETED after 0.0092
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.run_schedule (id=74073) RUNNING
foreman stdout | 20:46:51 jobs.1 | [ActiveJob] Enqueued AgentCheckJob (Job ID: 7e36b6d3-82bd-4a45-9928-5dd20d8f6257) to DelayedJob(default) with arguments: 9
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.run_schedule (id=74073) COMPLETED after 0.0164
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=74074) RUNNING
20:46:51 jobs.1 | [ActiveJob] [AgentCheckJob] [7e36b6d3-82bd-4a45-9928-5dd20d8f6257] Performing AgentCheckJob from DelayedJob(default) with arguments: 9
foreman stdout | 20:46:51 jobs.1 | [ActiveJob] [AgentCheckJob] [7e36b6d3-82bd-4a45-9928-5dd20d8f6257] Performed AgentCheckJob from DelayedJob(default) in 25.69ms
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=74074) COMPLETED after 0.0329
foreman stdout | 20:46:51 jobs.1 | 2016-04-05T22:46:51+0200: [Worker(host:ca7c0ce05afe pid:480)] 3 jobs processed at 32.7096 j/s, 0 failed
foreman stdout | 20:47:51 jobs.1 | Queuing event propagation
foreman stdout | 20:47:51 jobs.1 | Queuing schedule for every_1m
foreman stdout | 20:47:51 jobs.1 | 2016-04-05T22:47:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.receive! (id=74075) RUNNING
foreman stdout | 20:47:51 jobs.1 | 2016-04-05T22:47:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.receive! (id=74075) COMPLETED after 0.0107
foreman stdout | 20:47:51 jobs.1 | 2016-04-05T22:47:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.run_schedule (id=74076) RUNNING
foreman stdout | 20:47:51 jobs.1 | [ActiveJob] Enqueued AgentCheckJob (Job ID: 28d470d6-a7c8-4581-be3a-26df8b02914e) to DelayedJob(default) with arguments: 9
foreman stdout | 20:47:51 jobs.1 | 2016-04-05T22:47:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job Agent.run_schedule (id=74076) COMPLETED after 0.0182
foreman stdout | 20:47:51 jobs.1 | 2016-04-05T22:47:51+0200: [Worker(host:ca7c0ce05afe pid:480)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=74077) RUNNING
foreman stdout | 20:47:52 jobs.1 | [ActiveJob] [AgentCheckJob] [28d470d6-a7c8-4581-be3a-26df8b02914e] Performing AgentCheckJob from DelayedJob(default) with arguments: 9
foreman stdout | 20:48:39 jobs.1 | 
foreman stdout | 20:48:39 jobs.1 | #
20:48:39 jobs.1 | # Fatal error in Heap::ReserveSpace
20:48:39 jobs.1 | # Allocation failed - process out of memory
20:48:39 jobs.1 | #
foreman stdout | 20:48:39 jobs.1 | 
foreman stdout | 20:48:59 jobs.1 | exited with code 133
foreman stdout | 20:48:59 system | sending SIGTERM to all processes
foreman stdout | 20:49:00        | Trace/breakpoint trap (core dumped)
foreman stdout | 20:49:00 web.1  | terminated by SIGTERM

The coredump does not tell much about where the failure happened:

           PID: 5295 (ruby2.2)
           UID: 1000 (rtorrent)
           GID: 1000 (rtorrent)
        Signal: 5 (TRAP)
     Timestamp: Tue 2016-04-05 22:48:39 CEST (22h ago)
  Command Line: bin/threaded.rb                                  
    Executable: /usr/bin/ruby2.2
 Control Group: /
         Slice: -.slice
       Boot ID: 9891fcffec134677a1065d5f5330396e
    Machine ID: 452c9feff2fe4c738e39b5ba9ced7aa6
      Hostname: ZBox
      Coredump: /var/lib/systemd/coredump/core.ruby2\x2e2.1000.9891fcffec134677a1065d5f5330396e.5295.1459889319000000000000.lz4
       Message: Process 5295 (ruby2.2) of user 1000 dumped core.

                Stack trace of thread 546:
                #0  0x00007f944307b771 n/a (n/a)

The machine I'm using to run Huginn has 4 GB of RAM and 4 GB of swap, of which only 1 GB is used at the moment.
Do you have any idea how to track down which part of the application is eating all the memory?

@dsander
Collaborator Author

dsander commented Apr 7, 2016

@elvetemedve I do not think that is related to the Dockerfile changes; we tried to locate the memory leak a few times, but I was never able to find where it happens (#1114, #978).

@elvetemedve
Contributor

@dsander In my opinion the long-term solution would be tracking the memory usage of the application for each Agent and writing it to a log file. That way, sooner or later, the faulty code block could be revealed.

Regarding Docker, there is a short-term solution you could implement. I noticed that the container was still running after the crash of foreman. It would be nice if you could tell supervisord to restart the process when it dies. There is an autorestart setting which is good for this: see http://supervisord.org/configuration.html#program-x-section-values
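For reference, such a program section might look like the sketch below. The program name, command, and paths are hypothetical, not taken from the actual Huginn image:

```ini
; Hypothetical supervisord program section — a sketch, not the actual
; Huginn configuration. Names and paths are illustrative.
[program:foreman]
command=bundle exec foreman start
directory=/app
autorestart=true      ; restart foreman whenever it exits
startretries=3        ; give up after three rapid failed restarts
stopsignal=TERM
```

The key line is autorestart=true, which makes supervisord restart the process instead of leaving the container running without it.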

@dsander
Collaborator Author

dsander commented Apr 7, 2016

@elvetemedve Thanks for the tip. I tried changing the autorestart setting, but that causes an issue with the web server foreman starts: A server is already running. Check /app/tmp/pids/server.pid.. Not sure if I will find the motivation to dig deep into the multi-process image again anytime soon. You can use the single-process image, which allows you to use Docker's restart=always option.

@dsander dsander force-pushed the docker branch 5 times, most recently from c67ec48 to efc9d19 Compare April 10, 2016 14:45
@dsander dsander force-pushed the docker branch 2 times, most recently from 0f4706b to 8e894eb Compare April 10, 2016 16:39
@dsander
Collaborator Author

dsander commented Apr 10, 2016

After trying a bunch of CI services I went back to Travis and finally got it working. Travis allows us to encrypt environment variables; those will only be exposed on branch builds of the Huginn repository, not for pull requests. We only need to make sure not to merge code that would expose the Docker Hub credentials.
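The resulting setup could be sketched roughly like this hypothetical .travis.yml fragment; the variable names and image names are illustrative, not the actual configuration:

```yaml
# Hypothetical .travis.yml fragment — a sketch, not the actual config.
sudo: required
services:
  - docker

env:
  global:
    # Encrypted DOCKER_USER/DOCKER_PASS. Travis only decrypts secure
    # variables for branch builds of the main repository, never for
    # pull requests, which is what keeps the credentials safe.
    - secure: "encrypted-value-here"

after_success:
  # Push only when the credentials actually exist (branch builds).
  - if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then docker push cantino/huginn:$TRAVIS_COMMIT; fi
```

Tagging with $TRAVIS_COMMIT is one way to make every pushed image traceable to a commit on master.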

A build on Travis that pushes the images can be viewed here: https://travis-ci.org/dsander/huginn/builds/122077312
After rebasing the branch: https://travis-ci.org/dsander/huginn/builds/122079443

@cantino To merge this we only need to disable the automated builds on Docker Hub and add huginnbuilder as a collaborator. I think I can encrypt the variables myself.

@cantino
Member

cantino commented Apr 10, 2016

Any idea what the "Coveralls encountered an exception:" warning at the bottom is?

Want me to disable the automatic builds now?

@dsander
Collaborator Author

dsander commented Apr 10, 2016

Coveralls is not accepting the data because the builds I linked were done on my fork directly, and Coveralls only accepts PR builds of this repository.

Want me to disable the automatic builds now?

Sure, I have time to update the credentials in the travis config.

@cantino
Member

cantino commented Apr 10, 2016

Should I disable auto builds for both cantino/huginn and cantino/huginn-single-process?

@dsander
Collaborator Author

dsander commented Apr 10, 2016

Yes, if I remember correctly the cantino/huginn image is built via a webhook which is sent after the cantino/huginn-single-process build is done.

@dsander
Collaborator Author

dsander commented Apr 10, 2016

Everything looks good: the master build succeeded and the images are tagged 🎉

@dsander dsander deleted the docker branch April 10, 2016 20:00
@cantino
Member

cantino commented Apr 10, 2016

Nicely done! This is a great step forward. 😄 🎉
