[WIP] Dockerfile for CKAN 2.2 release #1724
Conversation
This commit adds a Dockerfile and support files (including nginx and runit configuration) for a binary CKAN 2.2 docker image. Specifically, this allows you to build a docker image including CKAN 2.2, running behind nginx and gunicorn, by running `docker build .` For example, to build an image called `ckan/ckan` tagged at version 2.2, you might run:

```shell
docker build -t ckan/ckan:2.2 .
```

The resulting image contains only CKAN, with a nearly vanilla configuration. In order to use it, you must do one of two things. You can either use the vanilla configuration as-is, and this requires that you specify the location of a Postgres database and a Solr core on startup:

```shell
docker run -i -t -p 80:80 \
  -e DATABASE_URL=postgres://user:pass@hostname/db \
  -e SOLR_URL=http://hostname:8983/solr/ckan_default \
  ckan/ckan:2.2
```

This will run CKAN, connect to the database, and initialise it if need be. Configuring Solr will have to be done separately.

Alternatively, you can use this image as a base for extension. If a configuration file is injected at /etc/ckan/default.ini, the image will use that and ignore the `DATABASE_URL` and `SOLR_URL` environment variables.

Lastly, by default the CKAN file store is at /var/lib/ckan, and you may well wish to mount this data volume outside the running container:

```shell
docker run ... -v /mnt/storage:/var/lib/ckan ...
```
This ensures that we can configure error emails from the CKAN instance inside the container. An optional environment variable, ERROR_EMAIL, can be set for the container. If set, it will configure CKAN to send error emails to $ERROR_EMAIL. If unset, no emails will be sent.
Any comments: @nigelbabu, @wardi, @seanh?
@nickstenning love it. seems like the right design for sure. does putting this in the ckan repo make something easier? I would like to use something like this to manage multiple different versions of ckan, so wouldn't it be better to package this separately?
Yes. It makes including files from the source distribution easier. More importantly, the tools to build CKAN and its system dependencies may change over time, so it makes most sense for it to be versioned alongside CKAN itself, just like documentation. I'm concerned that we have at least four different mechanisms to build CKAN (1, 2, 3, and this one) and most of these are versioned in their own repositories, which makes achieving reproducibility difficult ("Which version of ckan-packaging do I need to check out to build a .deb for CKAN 2.1?"). I'd like all mechanisms to converge, and for development and test environments to use the same process (as far as is reasonable), in order to achieve some degree of dev-prod parity. For example:
Anyway, to reiterate, I strongly think this stuff should be in the main CKAN repository, and we should aim to deduplicate and simplify the various different installation and deployment mechanisms we currently maintain.
+1 let's discuss on Thursday
Great job, @nickstenning. This is very useful, and I agree 100% that it should live inside this repository.
+1, I'd like to see this for dev, production and Travis installs
I like it inside the repo (and in the core docs) as well. Configuring Solr is one of the more awkward steps of the CKAN install (and one of the main sources of problems), so it would be good to have some kind of (optional) automatic configuration of it. I suggest any automatic Solr config should use the multi-core Solr setup so that it doesn't conflict with any other Solr cores on the same machine, present or future. With CKAN's current default single-core Solr setup, you can only have two CKAN sites using the same Solr server if those two sites always use the same version of CKAN (or the same version of CKAN's Solr schema file, actually). The single-core Solr setup should go away, imho.

Having CKAN support environment variables is probably the best solution, but failing that we could get rid of sed and have the docker script create the config file by rendering a Jinja2 template, substituting in the same variable values that it currently puts in using sed?

Some CKAN config values can in fact be changed in the database, and if there's a value in the db it overrides the one from the config file.
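To make the sed-vs-template point concrete, here is a minimal, self-contained sketch of the sed-style substitution being discussed. The placeholder tokens and file name are invented for this example and are not the ones this PR actually uses:

```shell
# Illustrative sketch of sed-based config templating (placeholder and file
# names are made up for this example, not taken from the PR).
DATABASE_URL="postgres://user:pass@hostname/db"
SOLR_URL="http://hostname:8983/solr/ckan_default"

cat > default.ini <<'EOF'
sqlalchemy.url = __DATABASE_URL__
solr_url = __SOLR_URL__
EOF

# Use '|' as the sed delimiter so the slashes in the URLs need no escaping.
sed -i \
  -e "s|__DATABASE_URL__|${DATABASE_URL}|" \
  -e "s|__SOLR_URL__|${SOLR_URL}|" \
  default.ini

cat default.ini
```

A Jinja2 template would replace the placeholders with `{{ database_url }}`-style variables rendered by a small Python script, which avoids sed's escaping pitfalls.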
@amercader Would be good to get your take on this as well |
P.S. So this also upgrades CKAN to Ubuntu 14.04?
Actually, this could let us avoid needing to document multi-core Solr. How about making the DB and Solr URLs optional and, when they're not specified, creating them inside the instance?
The last one might not be really useful, but I'd love to have the second one for working with a bunch of copies of existing ckan databases, and the third would be perfect for testing.
@wardi So the way I was doing this on the development branch was having another Dockerfile for Solr. The hookup of a separate Solr container can be automated using docker links. In development, the automation can be taken one step further with fig: again, see this branch.

Installing Solr and/or Postgres on spinup is something I'd like to avoid, primarily because as far as possible, we want to avoid runtime setup. Especially runtime setup that relies on the network (e.g. to download external packages). A docker image is intended to be a complete binary distribution, so that once it's downloaded, the cycle time of create, destroy, create can be measured in seconds, not minutes or hours.
It Just Works ™️, so yes.
I think we're still gonna want to document how to set up Postgres and Solr manually somewhere, even if the recommended approach is just to use our docker images. If other people would rather remove those docs though, I'm always happy to see the docs get smaller! I guess if Solr is going to be running inside another docker container (which seems right) then we wouldn't need the multi-core setup, the single would do. But for anyone doing a traditional source or package install, the multi-core Solr setup would still be advisable, so just for consistency (which helps with documentation and support) I'd say maybe the Solr Dockerfile should do a multi-core even though it doesn't really need to. Then we could delete any mention of the single-core Solr setup from the docs.
@nickstenning sounds perfect. I absolutely agree we shouldn't be downloading and installing the way we are now with travis.
I think it might be desirable to put … This will probably result in a lot of incorrect and inconsistent docs, confusing error messages, and user support emails. There used to be a time (before we made these paths consistent across source, package, single-instance-on-a-server and multiple) when I was tasked with making sure that every email to ckan-dev and -discuss got an answer, and for a lot of emails my first reply would be "Did you do a package install or a source install?"

This also applies to things like database names, database user names, also for the datastore, Solr core names ... they are all ckan_default and not just ckan. Unless of course we drop …

I think just leaving them in there is probably best.
Thanks @nickstenning for your work and the really thorough explanation, this seems like a good direction to explore. I totally agree that deploying CKAN can be complicated and that we should consolidate deployment options. Having said that, I'm going to dampen the enthusiasm a bit with some points for discussion. Just to clarify a couple of things first:
Some things that bother me:
Anyway, I don't want to appear completely opposed to this; I'm super excited about using it myself and on our servers (as soon as I learn some docker!). I only think that we should think through the implications before jumping on it. It would be really good to discuss it at tomorrow's dev meeting or even a dedicated call.
Okay, in which case that's a documentation bug. The current installation docs say:
I think so. Indeed, that's my entire motivation for this work. The cognitive overhead involved in running a Docker image is very small. Far smaller than setting up and running Ansible, in my opinion:
Compared to Ansible (install ansible, git clone ansible playbooks, create ansible inventory file, work out how to invoke ansible-playbook correctly for your environment) this is much simpler. People who already use another configuration management tool (Puppet, Chef) are also going to be reluctant to use an Ansible-based deployment mechanism. As for Vagrant -- it's just not a solution for production deployments. It's aimed (pretty much exclusively) at development environments. People aren't going to install Vagrant + VirtualBox on a production server. Vagrant + vagrant-lxc is more likely, but at that point why aren't you using Docker (which is also LXC-based)? I'd also point out that both Ansible and Vagrant are fundamentally "blueprint-based". They provide a recipe for building an environment, but all the actual building has to be done when you run `ansible-playbook` or `vagrant up`.
No, not in the simple case. They'd need to know how to run a command:
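Presumably the command in question is the one from the PR description:

```shell
docker run -i -t -p 80:80 \
  -e DATABASE_URL=postgres://user:pass@hostname/db \
  -e SOLR_URL=http://hostname:8983/solr/ckan_default \
  ckan/ckan:2.2
```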
I think there might be some misunderstanding about how docker works. The Dockerfile should track the state of CKAN. We can make improvements to it alongside the improvements we make to CKAN. We don't release the Dockerfile, we release built images. That's why this is a PR against the release-2.2 branch -- it builds a v2.2 image which would be built and pushed to the public Docker index as `ckan/ckan:2.2`.
IMO this is a feature, not a bug. I'd argue that part of the reason the other build scripts have clearly atrophied is that this was not true.
It's just WSGI. Running both Apache and Nginx to serve a single application is just plain wasteful, both in terms of compute resource and developer brain power. That's two different big applications with two different config syntaxes, and two sets of HTTP requests to debug. I've been running production WSGI applications under gunicorn for several years -- I don't think this will cause us any problems.
Let's discuss tomorrow, but I don't think we can provide support in general to users on how to administer Solr. We can certainly help, and we can provide a preconfigured CKAN Solr docker image as well so that users can do something like
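For example (the Solr image name and link alias here are my assumptions, not something this PR provides):

```shell
# Run a preconfigured Solr container, then link the CKAN container to it.
docker run -d --name ckan_solr ckan/solr:2.2
docker run -i -t -p 80:80 \
  --link ckan_solr:solr \
  -e DATABASE_URL=postgres://user:pass@hostname/db \
  ckan/ckan:2.2
```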
and have the two hooked up and talking to one another automatically.
That is a docs bug, yes
Ok, let me reformulate the question: would a user need to know Docker to run and maintain a production CKAN site? The answer seems to be yes. After running the …

Perhaps the answers to all these questions are trivially easy, but at the very least users will need to flick through a slideshow and perhaps more likely learn a bit of docker to understand how their deployment will work and how they will need to manage it. It does reduce the install to a single command, yes, but I would argue that at the expense of more overhead later on.
It's a matter of what most users will be familiar with. In any case that's a separate discussion from the docker one
We've never given support for administering Solr other than getting it working as part of the CKAN install. Again I think the single/multiple core issue is a red herring, independent of whether we're using Docker or not.
I just want to testify that easier and simpler deployment is something I've heard quite a bit from talking with users (esp people who are less technical but who oversee a project) and that some kind of standardized deployment like docker would really help here (i've even explicitly mentioned docker to folks in this connection over the last ~6m and had very positive reactions). So to answer @amercader's point: yes I think docker would really help here ... cf also this roadmap issue: ckan/ideas#25 (5m deployment for geeks)
Yep, we all agree on this
I guess the title of the issue you link to proves my point a bit: "5m (30s?) deployment of CKAN (by geeks)". I'm sure that for these geeks (CKAN dev team included) docker is an ideal solution. But having answered many install support requests, I think that for many users, Docker would be an extra layer of complexity.
Deploying a CKAN package install is about as simple as it can get (could be improved some, but it's close): docker will not make deploying simpler for production. It may bring other advantages for production though, like easily running many CKANs on a server, backups, rolling back after an upgrade. For development installs the package can't be used, and docker potentially makes installing CKAN for dev much quicker and easier than doing a source install manually. The other big potential advantage of docker that I see is that we could use the same Dockerfile (or multiple docker files, based off each other and without duplication) for production installs on any OS, single or multiple CKAN instances on a server, developer installs, and Travis. I like the sound of that unification.
@amercader i hear you but +1 to @seanh's points - i.e. it would make dev installs easier and would provide unification. I agree docker adds some complexity but also much simplification and in some ways you move the support step to docker (i.e. getting docker on your machine) which has widespread support :-)
Apologies for some of my reticent comments, just caused by some misunderstandings cleared up after today's chat at the dev meeting. Just to clarify, this will not replace any existing supported install mechanism. There is broad support for getting this merged and used, just some changes needed:
Agreed to leave tests as a separate piece of work
Thanks for this, @amercader. I'll try and get to this in the next few days.
I think we want ckan-postgres and ckan-solr Dockerfiles to go with this, both for production and development: setting up Postgres (including the DataStore) and Solr for CKAN is a PITA, time-consuming and error-prone. For a development environment, I think it should be:
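Something like the following, where the image names are assumptions of mine rather than images this PR ships:

```shell
# Start the supporting services, then a dev CKAN linked to them.
docker run -d --name db ckan/postgres
docker run -d --name solr ckan/solr
docker run -i -t -p 80:80 \
  --link db:db --link solr:solr \
  ckan/ckan:master
```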
@nickstenning Is that what you were thinking? I wonder if it's possible to make the CKAN Dockerfile for development environments automatically set up the links to the Postgres and Solr containers, rather than having to pass `--link` options manually.

In production this seems a little more complicated. Presumably we can't use docker's inter-container communication, because CKAN, Postgres and Solr may not be on the same host? So they'll have to communicate over the network, and we'll have to pass the Postgres and Solr URLs into the CKAN container at `docker run` time.

If we can't use docker links in production, then do we want to use them in development? It may be simpler just to use URLs for both prod and dev.

Should the Postgres and Solr containers both be configured to support connecting multiple CKAN containers to (different databases and Solr cores on) the same Postgres and Solr containers? Or are you imagining it's one Postgres container and one Solr container for each CKAN container, even in production? (I guess you can still run lots of Postgres containers on a single host, so why not?)

Update: Using one postgres and one solr container for each ckan container would simplify things as well; database and solr core names can be the same for every CKAN so they can just be "hardcoded" in the default config file.
For configuration, would it be better to simply share the CKAN config file with the host machine using a container volume? Then the user can just edit the file themselves to set things like the postgres and solr URLs, site_title, etc., and probably have to restart the container to have the changes take effect.

The config file could come with hard-coded defaults for everything including the postgres and solr URLs. These defaults would work if you ran postgres and solr containers on the same host, using our default Dockerfiles and following the commands in our docs. So for a dev install or "just trying it out" you wouldn't need to edit them. For a production install with postgres and solr on different hosts, the user would need to edit the CKAN config file manually and we may need to come up with some way of managing this.
```
libxslt1-dev \
nginx-light \
postfix \
build-essential
```
These are slightly different from the packages we install for our package and source installs. What are the reasons for the changes? Do we want to make the same changes to our package and source installs?
I'm aiming for installing the smallest number of packages needed to make CKAN work. You're right though, the source install omits `libxml2-dev` and `libxslt1-dev`: perhaps they aren't actually required.
Unless I missed it, I think we need to add the datastore and datapusher to this:

http://docs.ckan.org/en/latest/maintaining/datastore.html
http://docs.ckan.org/projects/datapusher/

Datastore is just a few extra commands to set up some more databases, db users and db permissions, and then a few more settings to set in the CKAN config file. It can just be run in the same container that the first postgres db runs in.

Datapusher is actually a separate Python web app that needs to be running, so maybe that's a fourth container we need for each CKAN site? Or maybe datapusher can just be run in the same docker container where CKAN is running? I think running multiple processes in one container at the same time is possible (using a script as the container's CMD, or using a process management tool) and that may simplify things, but somehow one process per container seems to be the Right Way with docker. Anyway, once datapusher is running one way or another, the datapusher URL needs setting in the CKAN config file.

DataStore and DataPusher will be needed as they enable the Data API and also many of the data preview/visualisation features.
Another thing missing is extensions, if the user wants to deploy a custom CKAN extension for either a dev or production install, how do they do that? Something like this should be possible I think:
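Perhaps (purely speculative, with a made-up extension name) something along the lines of mounting the extension source into the container:

```shell
# Mount a local extension checkout into the container; a complete solution
# would also need to pip-install it and enable its plugin in the config.
docker run -i -t -p 80:80 \
  -v /home/me/ckanext-myext:/opt/ckanext-myext \
  ckan/ckan:2.2
```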
But I think we probably want to support building an image with a set of extensions already installed and configured. I guess that can be done simply enough with a Dockerfile that's based on one of the released CKAN Docker images.
Some CKAN features and extensions require cron jobs, how would those work with docker?
More-or-less. But let's discuss that when we get there. In the meantime, have a look at this branch.
From the point of view of the connecting container, Docker links just expose some environment variables which can be used to discover the network address of a given service. Docker links are just a special case of using environment variables such as `DATABASE_URL` and `SOLR_URL`.
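For illustration (the address and alias are made up): with a link alias of `solr`, Docker of this era injects variables of roughly this shape, which a startup script can fold into a service URL:

```shell
# Values of this shape are injected by docker into a container started with
# --link ckan_solr:solr. The address is assigned at runtime; these are
# example values for illustration only.
SOLR_PORT_8983_TCP_ADDR=172.17.0.2
SOLR_PORT_8983_TCP_PORT=8983

# A startup script can assemble a service URL from them:
SOLR_URL="http://${SOLR_PORT_8983_TCP_ADDR}:${SOLR_PORT_8983_TCP_PORT}/solr/ckan_default"
echo "$SOLR_URL"
```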
No. In a high density production environment I probably wouldn't even run Postgres in a container. I'd just run Postgres. But multiple CKANs talking to one Postgres, yes, absolutely.
I think the answer to that depends on questions of philosophy. In my opinion, configuration and data are not the same thing, and while data can be exported out of the container, configuration should not be. Part of the appeal of Docker is the clarity around how the container state is defined (in the arguments to `docker run`).
Correct.
Yes.
It is, and this container already does it.
This is the point where I think you need to build your own image, à la:
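i.e. a child Dockerfile along these lines. The FROM line matches this PR's image; the datapusher install and the runit service setup are my guesses at how this would be wired up with baseimage-docker:

```dockerfile
FROM ckan/ckan:2.2
# Install the extra service into the image (exact package name may differ)
RUN pip install datapusher
# baseimage-docker supervises any /etc/service/*/run script under runit
ADD datapusher-run.sh /etc/service/datapusher/run
```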
(I know that the …)
This Docker image already runs crond. Just drop a file in …

I'll respond to your commit comments inline.
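For reference, a standard cron drop-in looks like this. The file path and the paster location are illustrative (`tracking update` is a real CKAN paster command, but the install path is a guess):

```
# /etc/cron.d/ckan-tracking (illustrative)
# m h dom mon dow user command
0 * * * * root /usr/lib/ckan/default/bin/paster --plugin=ckan tracking update -c /etc/ckan/default.ini
```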
@nickstenning any update on this? Will this go in soon?
FYI, Docker 1.0 was released today http://blog.docker.com/2014/06/its-here-docker-1-0/
Excuse me poking in here, but just an FYI. I don't see David Raznick (kindly) on this thread. I'm sure you're aware he had a docker solution in place for some time, available at https://github.com/kindly/ckan_dockered ... I've been using it for the past 6 months and it works well. Works as follows: … Rgds, Colum
Closing in favour of #1755. (Sorry, but I can't change the base of a pull request.)
What is this?
This PR adds a Dockerfile and support files (including nginx and postfix configuration) for a binary CKAN 2.2 docker image. Specifically, this allows you to build a docker image including CKAN 2.2, running behind nginx and gunicorn, by running `docker build .`
For example, to build an image called `ckan/ckan` tagged at version 2.2, you might run `docker build -t ckan/ckan:2.2 .`

The resulting image contains only CKAN, with a nearly vanilla configuration. In order to use it, you must do one of two things. You can either use the vanilla configuration as-is, and this requires that you specify the location of a Postgres database and a Solr core on startup:
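As given in the opening summary of this thread, the startup command is:

```shell
docker run -i -t -p 80:80 \
  -e DATABASE_URL=postgres://user:pass@hostname/db \
  -e SOLR_URL=http://hostname:8983/solr/ckan_default \
  ckan/ckan:2.2
```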
This will run CKAN, connect to the database, and initialise it if need be. Configuring Solr will have to be done separately. There are a couple of other environment variables you can use to customise the deployment, including `GUNICORN_NUM_WORKERS` and `ERROR_EMAIL`, which do what you might expect.

Alternatively, and perhaps more realistically, you can use this image as a base for extension. If a configuration file is injected at /etc/ckan/default.ini, the image will use that and ignore the `DATABASE_URL`, `SOLR_URL`, and `ERROR_EMAIL` environment variables. (`GUNICORN_NUM_WORKERS` is still used.)

A minimal Dockerfile that uses this as a base might look something like this:
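For instance (the config file name is an invention of mine; the target path matches the PR):

```dockerfile
FROM ckan/ckan:2.2
# Inject a full config file, overriding the environment-variable based setup
ADD production.ini /etc/ckan/default.ini
```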
Lastly, by default the CKAN file store is at `/var/lib/ckan`, and in a production environment you would almost certainly mount this data volume outside the running container (e.g. `docker run ... -v /mnt/storage:/var/lib/ckan ...`).

Why should I care?
I'm of the opinion that deploying CKAN at the moment is too complicated. The package installation makes certain assumptions (Postgres and Solr on the same server; only one CKAN per machine) which seem unrealistically restrictive (not to say unwise) for production environments.
This setup allows you to trivially run multiple CKAN instances on a single machine, all pointing to Postgres and Solr by URL (either local or remote, it doesn't matter), without having to resort to the 10-step source install. It also makes it trivial to use LXC to memory-constrain individual instances (`docker -m 1g ...`), which will be important in high-density deployments.

Perhaps more excitingly, this can be used as a binary build base for more complicated deployments, including those which need additional extensions and configuration. You can simply start from this image and add extensions/additional services as necessary. The image builds on baseimage-docker which makes running additional supervised services very easy.
What's not right yet?
- Perhaps some of the config (`site_title`, `site_logo`, `site_description` ...) should move into the database?

What else should I know?
- … `ckan/ckan:2.2` we can.
- In the meantime, you can use the `nickstenning/ckan:2.2` image.
- This PR is against the `release-v2.2` branch.