
[WIP] Dockerfile for CKAN 2.2 release #1724

Closed

nickstenning (Contributor)

What is this?

This PR adds a Dockerfile and support files (including nginx and postfix configuration) for a binary CKAN 2.2 docker image. Specifically, this allows you to build a docker image including CKAN 2.2, running behind nginx and gunicorn, by running

    docker build .

For example, to build an image called ckan/ckan tagged at version 2.2, you might run

    docker build -t ckan/ckan:2.2 .

The resulting image contains only CKAN, with a nearly vanilla configuration. To use it, you can do one of two things. The first is to use the vanilla configuration as-is, which requires that you specify the location of a Postgres database and a Solr core on startup:

    docker run -i -t -p 80:80 \
      -e DATABASE_URL=postgres://user:pass@hostname/db \
      -e SOLR_URL=http://hostname:8983/solr/ckan_default \
      ckan/ckan:2.2

This will run CKAN, connect to the database, and initialise it if need be. Configuring Solr will have to be done separately. There are a couple of other environment variables you can use to customise the deployment, including GUNICORN_NUM_WORKERS and ERROR_EMAIL, which do what you might expect.

Alternatively, and perhaps more realistically, you can use this image as a base for extension. If a configuration file is injected at /etc/ckan/default.ini, the image will use that and ignore the DATABASE_URL, SOLR_URL, and ERROR_EMAIL environment variables. (GUNICORN_NUM_WORKERS is still used.)
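That startup behaviour could be sketched as a small shell function. This is a hypothetical sketch, not the PR's actual entrypoint script: the function name and file paths are illustrative, though sqlalchemy.url and solr_url are CKAN's real config keys. The demonstration runs against temp files rather than a container filesystem.

```shell
#!/bin/sh
# Hypothetical sketch of the entrypoint's config handling: prefer an injected
# config file; otherwise substitute env vars into the vanilla template.
write_config() {
  injected="$1"   # path an extending image may ADD a config at
  template="$2"   # vanilla config shipped in the image
  target="$3"     # config CKAN will actually read
  if [ -f "$injected" ]; then
    # An injected config wins; DATABASE_URL and SOLR_URL are ignored.
    cp "$injected" "$target"
  else
    sed -e "s|^sqlalchemy.url =.*|sqlalchemy.url = ${DATABASE_URL}|" \
        -e "s|^solr_url =.*|solr_url = ${SOLR_URL}|" \
        "$template" > "$target"
  fi
}

# Demonstrate against temp files rather than a real container filesystem.
tmp=$(mktemp -d)
printf 'sqlalchemy.url = CHANGEME\nsolr_url = CHANGEME\n' > "$tmp/template.ini"
DATABASE_URL='postgres://user:pass@hostname/db'
SOLR_URL='http://hostname:8983/solr/ckan_default'
write_config "$tmp/injected.ini" "$tmp/template.ini" "$tmp/default.ini"
cat "$tmp/default.ini"
```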

A minimal Dockerfile that uses this as a base might look something like this:

    FROM ckan/ckan:2.2

    ADD ./mycustomconfig.ini /etc/ckan/default.ini

Lastly, by default the CKAN file store is at /var/lib/ckan, and in a production environment you would almost certainly mount this data volume outside the running container:

    docker run ... -v /mnt/storage:/var/lib/ckan ...

Why should I care?

I'm of the opinion that deploying CKAN at the moment is too complicated. The package installation makes certain assumptions (Postgres and Solr on the same server; only one CKAN per machine) which seem unrealistically restrictive (not to say unwise) for production environments.

This setup allows you to trivially run multiple CKAN instances on a single machine, all pointing to Postgres and Solr by URL (either local or remote, it doesn't matter), without having to resort to the 10-step source install. It also makes it trivial to use LXC to memory constrain individual instances (docker -m 1g ...) which will be important in high-density deployments.

Perhaps more excitingly, this can be used as a binary build base for more complicated deployments, including those which need additional extensions and configuration. You can simply start from this image and add extensions/additional services as necessary. The image builds on baseimage-docker which makes running additional supervised services very easy.

What's not right yet?

  1. Perhaps the most important omission from this PR is documentation updates. I'd like to have the discussion with people about whether this is something they want to see in the main CKAN repository before I commit to writing up the docs.
  2. CKAN is currently autoconfigured using some questionable techniques. Perhaps we should instead permit configuring certain key config properties using environment variables? Perhaps some of what is currently configuration (site_title, site_logo, site_description...) should move into the database?
  3. I don't know. You guys know far more about what a sensible CKAN deployment does than I do. Tell me what's missing.

What else should I know?

  • I've taken the liberty of reserving the "ckan" username at the Docker index, so if we want to push this, as suggested, to ckan/ckan:2.2, we can.
  • For the time being, you can try out the nickstenning/ckan:2.2 image.
  • NB: This PR merges to the release-v2.2 branch.

This commit adds a Dockerfile and support files (including nginx and
runit configuration) for a binary CKAN 2.2 docker image.

Specifically, this allows you to build a docker image including CKAN
2.2, running behind nginx and gunicorn, by running

    docker build .

For example, to build an image called ckan/ckan tagged at version 2.2,
you might run

    docker build -t ckan/ckan:2.2 .

The resulting image contains only CKAN, with a nearly vanilla
configuration. In order to use it, you must do one of two things. You
can either use the vanilla configuration as-is, and this requires that
you specify the location of a Postgres database and a Solr core on
startup:

    docker run -i -t -p 80:80 \
      -e DATABASE_URL=postgres://user:pass@hostname/db \
      -e SOLR_URL=http://hostname:8983/solr/ckan_default \
      ckan/ckan:2.2

This will run CKAN, connect to the database, and initialise it if need
be. Configuring Solr will have to be done separately.

Alternatively, you can use this image as a base for extension. If a
configuration file is injected to /etc/ckan/default.ini, the image will
use that and ignore the DATABASE_URL and SOLR_URL environment variables.

Lastly, by default the CKAN file store is at /var/lib/ckan, and you may
well wish to mount this data volume outside the running container:

    docker run ... -v /mnt/storage:/var/lib/ckan ...

This ensures that we can configure error emails from the CKAN instance
inside the container.

An optional environment variable, ERROR_EMAIL, can be set for the
container. If set, it will configure CKAN to send error emails to
$ERROR_EMAIL. If unset, no emails will be sent.
@nickstenning (Contributor Author)

Any comments: @nigelbabu, @wardi, @seanh?

@wardi (Contributor) commented May 19, 2014

@nickstenning love it. seems like the right design for sure.

does putting this in the ckan repo make something easier? I would like to use something like this to manage multiple different versions of ckan, so wouldn't it be better to package this separately?

@nickstenning (Contributor Author)

Yes. It makes including files from the source distribution easier. More importantly, the tools to build CKAN and its system dependencies may change over time, so it makes most sense for it to be versioned alongside CKAN itself, just like documentation.

I'm concerned that we have at least four different mechanisms to build CKAN (1, 2, 3, and this one) and most of these are versioned in their own repositories, which makes achieving reproducibility difficult ("Which version of ckan-packaging do I need to check out to build a .deb for CKAN 2.1?"). I'd like all mechanisms to converge, and for development and test environments to use the same process (as far as is reasonable), in order to achieve some degree of dev-prod parity.

For example:

  1. The .deb builder could easily be a Dockerfile that extends this one by doing something like:

    FROM ckan/ckan
    RUN apt-get install -q -y <builddepends>
    VOLUME ["/data"]
    RUN fpm ... > /data/ckan.deb
  2. We could tweak this Dockerfile so that it was suitable for development, allowing you to mount your working copy live inside the container, switching to paster serve --reload on an environment variable, and automating spinup of Postgres and Solr so that creating a dev env from scratch was as simple as fig up. See this branch for an example.

  3. Travis could also use this, by bootstrapping Docker and then running the tests in the container. This would have the secondary effect of verifying the container build process.

Anyway, to reiterate, I strongly think this stuff should be in the main CKAN repository, and we should aim to deduplicate and simplify the various different installation and deployment mechanisms we currently maintain.

@wardi (Contributor) commented May 19, 2014

+1

let's discuss on Thursday

@vitorbaptista (Contributor)

Great job, @nickstenning. This is very useful, and I agree 100% that it should live inside this repository.

@seanh (Contributor) commented May 21, 2014

+1, I'd like to see this for dev, production and Travis installs

@seanh (Contributor) commented May 21, 2014

I like it inside the repo (and in the core docs) as well.

Configuring Solr is one of the more awkward steps of the CKAN install (and one of the main sources of problems), so it would be good to have some kind of (optional) automatic configuration of it. Suggest any automatic Solr config should use the multi-core Solr setup so that it doesn't conflict with any other Solr cores on the same machine, present or future. With CKAN's current default single-core Solr setup, you can only have two CKAN sites using the same Solr server if those two sites always use the same version of CKAN (or the same version of CKAN's Solr schema file, actually). The single-core Solr setup should go away, imho.

Having CKAN support environment variables is probably the best solution, but failing that we could get rid of sed and have the docker script create the config file by rendering a Jinja2 template, substituting in the same variable values that it currently puts in using sed? CKAN already has a paster make-config command that could accept variables as command-line args (or read env vars) and do this templating.

Some CKAN config values can in fact be changed in the database, and if there's a value in the db it overrides the one from the config file. See /ckan-admin/config on your CKAN instance (you must be logged in to a sysadmin account). The way config settings are handled in CKAN is a mess and needs overhauling. But even without overhauling it you can probably get it to take the settings that you need from the db, if it doesn't already.

@seanh (Contributor) commented May 21, 2014

@amercader Would be good to get your take on this as well

@seanh (Contributor) commented May 21, 2014

P.S. So this also upgrades CKAN to Ubuntu 14.04?

@wardi (Contributor) commented May 21, 2014

Actually, this could let us avoid needing to document multi-core Solr. How about making the DB and Solr URLs optional and, when they are not specified, creating them inside the instance?

    DATABASE_URL given   SOLR_URL given   docker action
    yes                  yes              no change
    yes                  no               install solr; paster search-index rebuild
    no                   no               install postgres + solr; paster db init
    no                   yes              install postgres; paster db init; paster search-index clear

The last one might not be really useful, but I'd love to have the second one for working with a bunch of copies of existing ckan databases, and the third would be perfect for testing.
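The table above, expressed as a dispatch sketch (the action strings are the ones from the table; the function itself is a hypothetical illustration, not anything in the PR):

```shell
#!/bin/sh
# Hypothetical dispatch on which connection URLs were provided at
# `docker run` time, mirroring the table above.
choose_action() {
  if [ -n "$DATABASE_URL" ] && [ -n "$SOLR_URL" ]; then
    echo "no change"
  elif [ -n "$DATABASE_URL" ]; then
    echo "install solr; paster search-index rebuild"
  elif [ -n "$SOLR_URL" ]; then
    echo "install postgres; paster db init; paster search-index clear"
  else
    echo "install postgres + solr; paster db init"
  fi
}

# Example: database given, no Solr URL.
DATABASE_URL='postgres://user:pass@hostname/db'
SOLR_URL=''
choose_action
```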

@nickstenning (Contributor Author)

@wardi So the way that I was doing this on the development branch was having another Dockerfile (that lives in ckan/config/solr/, alongside our schema) to build a CKAN-compatible Solr image (which we could publish as ckan/ckan-solr). See here.

The hookup of a separate Solr container can be automated using docker links. In development, the automation can be taken one step further with fig: again, see this branch.

Installing Solr and/or Postgres on spinup is something I'd like to avoid, primarily because as far as possible, we want to avoid runtime setup. Especially runtime setup that relies on the network (e.g. to download external packages). A docker image is intended to be a complete binary distribution, so that once it's downloaded, the cycle time of create, destroy, create can be measured in seconds, not minutes or hours.

@nickstenning (Contributor Author)

P.S. So this also upgrades CKAN to Ubuntu 14.04?

It Just Works ™️, so yes.

@seanh (Contributor) commented May 21, 2014

I think we're still gonna want to document how to setup Postgres and Solr manually somewhere, even if the recommended approach is just to use our docker images. If other people would rather remove those docs though, I'm always happy to see the docs get smaller!

I guess if Solr is going to be running inside another docker container (which seems right) then we wouldn't need the multi-core setup, the single would do. But for anyone doing a traditional source or package install, the multicore Solr setup would still be advisable, so just for consistency (which helps with documentation and support) I'd say maybe the Solr Dockerfile should do a multi-core even though it doesn't really need to. Then we could delete any mention of the single-core Solr setup from the docs.

@wardi (Contributor) commented May 21, 2014

@nickstenning sounds perfect. I absolutely agree we shouldn't be downloading and installing the way we are now with travis.

@seanh (Contributor) commented May 21, 2014

I think it might be desirable to put /usr/lib/ckan/default and /etc/ckan/default in the docker install, even though I understand that docker doesn't need these directories. If anyone doing a traditional source or package install wants to install a second CKAN on the same machine, they will still need the default dirs. And if the docker setup differs from the traditional ones, then every time you give a virtualenv or config file path in the docs (or on the mailing list, stack overflow, wiki, in IRC..) you'll have to say "if docker type this ... if not type this". (In the case of old mailing list and stack overflow answers, we won't be able to update them.)

This will probably result in a lot of incorrect and inconsistent docs, confusing error messages, and user support emails.

There used to be a time (before we made these paths consistent across source, package, single-instance-on-a-server and multiple) when I was tasked with making sure that every email to ckan-dev and -discuss got an answer, and for a lot of emails my first reply would be "Did you do a package install or a source install?"

This also applies to things like database names, database user names, also for the datastore, Solr core names ... they are all ckan_default and not just ckan.

Unless of course we drop default from the traditional installs as well, and say that multiple CKANs on a filesystem is not supported (at least not in our docs), and the way to do that is by using docker instead. But then you have to think about upgrades/migration (or that the new docs won't work for people who first installed from an older version), and there will still be a difference between people with "old-style" and "new-style" installs...

I think just leaving them in there is probably best.

@amercader (Member)

Thanks @nickstenning for your work and the really thorough explanation, this seems like a good direction to explore. I totally agree that deploying CKAN can be complicated and that we should consolidate deployment options.

Having said that, I'm going to dampen the enthusiasm a bit with some points for discussion.

Just to clarify a couple of things first:

  • The package installation makes certain assumptions (Postgres and Solr on the same server; only one CKAN per machine) which seem unrealistically restrictive (not to say unwise) for production environments.

    That's not true: we explicitly don't include Solr and Postgres in the CKAN package, so users can install and customize them however they prefer (though this causes inconvenience to users who just want the whole lot installed on the same machine so they can start playing with CKAN). One CKAN per machine is true, and I agree that's a limitation, but not for most users.

  • we have at least four different mechanisms to build CKAN (1, 2, 3, and this one)

    1 is super deprecated (it was used to build 1.x, and it should say so in the repo), 2 only takes CKAN itself into account (+ DataPusher), and 3 tries to replicate a whole environment for the tests to run.

Some things that bother me:

  • The main one: as I said, I totally agree that deploying CKAN at the moment is too complicated, but is docker a tool that will make things easier for most users deploying CKAN? I can see how this approach would be perfect for developers working on local dev instances, or even for providers like Open Knowledge managing several sites and servers (i.e. "us"), but I'd say that asking the average user to understand how docker works, and the implications of the different commands, just to install a web app is a big ask. My gut feeling is that something like vagrant or ansible is more popular and accessible.

    It may well be that I misunderstood and that docker would only be the base to build other more user-friendly stuff, but your notes seem to suggest otherwise. I guess my question is, would a user need to know docker to install CKAN?

  • Regarding having this script in the main CKAN repo I have only two concerns:

    • The frequency of updates and fixes on it will depend on actual CKAN releases, which right now take a while. A separate repo with branches and tags matching the ones on CKAN core is perhaps less convenient but allows us to tweak it constantly (as long as we maintain backwards compatibility) without waiting on a proper release. I guess that once the scripts are well tested and mature enough, that's less of an issue.
    • Moving it into the main repo implies that the Technical team is responsible for maintaining it, making sure it works across releases and supporting people's issues on the repo, lists, etc. even if Nick is not around :) This means learning and understanding docker.
  • I personally would stick with Apache + mod_wsgi rather than gunicorn. It is perhaps less sexy, but far more used, and all our docs and scripts rely on it.

  • Regarding Solr multi-core, I have said this in the past: on Solr 1.x setting up multi-cores is a pain, and something that the vast majority of users don't need. The result of getting rid of the single-core instructions would be lots more extra installation steps. In any case, on Solr 3.x (the one that comes with Ubuntu 14.04) and Solr 4.x multi-core is the default setup, so that would greatly simplify things (as would using Tomcat instead of Jetty).

Anyway, I don't want to appear as completely opposed to this, I'm super excited about using it myself and on our servers (as soon as I learn some docker!), I only think that we should think of the implications before jumping on it.

It would be really good to discuss it at tomorrow's dev meeting or even on a dedicated call.

@nickstenning (Contributor Author)

The package installation makes certain assumptions (Postgres and Solr on the same server; only one CKAN per machine) which seem unrealistically restrictive (not to say unwise) for production environments.

That's not true, we explicitly don't include Solr and Postgres on the CKAN package so users can install it and customize it however they prefer

Okay, in which case that's a documentation bug. The current installation docs say:

You should install CKAN from package if [...] You want to run CKAN, Solr and PostgreSQL on the same server
[...]
You should install CKAN from source if [...] You want to run CKAN, Solr and PostgreSQL on different servers


is docker a tool that will make things easier for most users deploying CKAN?

I think so. Indeed, that's my entire motivation for this work. The cognitive overhead involved in running a Docker image is very small. Far smaller than setting up and running Ansible, in my opinion:

  1. Install docker (one command on Ubuntu 14.04, RHEL, Arch linux, and only going to get easier over time) or deploy a machine with Docker preinstalled
  2. Run docker run ckan/ckan:2.2

Compared to Ansible (install ansible, git clone ansible playbooks, create ansible inventory file, work out how to invoke ansible-playbook correctly for your environment) this is much simpler. People who already use another configuration management tool (Puppet, Chef) are also going to be reluctant to use an Ansible-based deployment mechanism.

As for Vagrant -- it's just not a solution for production deployments. It's aimed (pretty much exclusively) at development environments. People aren't going to install Vagrant + VirtualBox on a production server. Vagrant + vagrant-lxc is more likely, but at that point why aren't you using Docker (which is also LXC-based)?

I'd also point out that both Ansible and Vagrant are fundamentally "blueprint-based". They provide a recipe for building an environment, but all the actual building has to be done when you run ansible or vagrant. Docker images are "binaries" -- you download one and then you run it. This allows us to know that if an image works for one user, it will work for the next. It also allows us to fully decouple the implementation of the docker image from the environment on the user's production server. We can run an Ubuntu 14.04-based docker image on a RHEL server. The user doesn't even need to know we're doing so. It might be worth flicking through this slideshow if you haven't seen it -- it has a pretty good explanation of the difference between docker containerisation and full VM virtualisation.


would a user need to know docker to install CKAN

No, not in the simple case. They'd need to know how to run a command: docker run ....


The frequency of updates and fixes on it will depend on actual CKAN releases

I think there might be some misunderstanding about how docker works. The Dockerfile should track the state of CKAN. We can make improvements to it alongside the improvements we make to CKAN. We don't release the Dockerfile, we release built images. That's why this is a PR against the release-2.2 branch -- it builds a v2.2 image which would be built and pushed to the public Docker index as ckan/ckan:2.2. We could build a Docker image on every git push which is published as ckan/ckan:develop. The key point is that (most) users don't build their own Docker images from the Dockerfile, they use the built binary images.


Moving it into the main repo implies that the Technical team is responsible for maintaining it

IMO this is a feature, not a bug. I'd argue that part of the reason the other build scripts have clearly atrophied is that this was not true.


I personally would stick with Apache + mod_wsgi rather than gunicorn

It's just WSGI. Running both Apache and Nginx to serve a single application is just plain wasteful, both in terms of compute resource and developer brain power. That's two different big applications with two different config syntaxes, and two sets of HTTP requests to debug. I've been running production WSGI applications under gunicorn for several years -- I don't think this will cause us any problems.


Regarding Solr multi-core

Let's discuss tomorrow, but I don't think we can provide support in general to users on how to administer Solr. We can certainly help, and we can provide a preconfigured CKAN Solr docker image as well so that users can do something like

    docker run --name solr ckan/solr
    docker run --link solr:solr ckan/ckan

and have the two hooked up and talking to one another automatically.
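With `--link solr:solr`, Docker (as of 2014) injects environment variables such as `SOLR_PORT_8983_TCP_ADDR` and `SOLR_PORT_8983_TCP_PORT` into the ckan container, so an entrypoint could derive `SOLR_URL` automatically when it isn't given explicitly. A sketch follows; the injected values are simulated here, and the `/solr/ckan` core path is an assumption, not the PR's configuration:

```shell
#!/bin/sh
# Simulate the env vars Docker injects for `--link solr:solr`.
SOLR_PORT_8983_TCP_ADDR=172.17.0.2
SOLR_PORT_8983_TCP_PORT=8983

# Derive SOLR_URL only if the user didn't supply one with -e SOLR_URL=...
unset SOLR_URL
: "${SOLR_URL:=http://${SOLR_PORT_8983_TCP_ADDR}:${SOLR_PORT_8983_TCP_PORT}/solr/ckan}"
echo "$SOLR_URL"
```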

@seanh (Contributor) commented May 21, 2014

That is a docs bug, yes

@amercader (Member)

would a user need to know docker to install CKAN

No, not in the simple case. They'd need to know how to run a command: docker run ....

Ok, let me reformulate the question: would a user need to know Docker to run and maintain a production CKAN site?

The answer seems to be yes. After running the docker run ... command what happens? Do I trust everything is working fine? How do I manage the network for connecting to other servers, memory, processes... Do I need to scale the container at some point? How do I do it?

Perhaps the answers to all these questions are trivially easy, but at the very least users will need to flick through a slideshow, and more likely learn a bit of docker, to understand how their deployment will work and how they will need to manage it.

It does reduce the install to a single command, yes, but I would argue that at the expense of more overhead later on.

I personally would stick with Apache + mod_wsgi rather than gunicorn

It's just WSGI

It's a matter of what most users will be familiar with. In any case that's a separate discussion from the docker one

Regarding Solr multi-core

Let's discuss tomorrow, but I don't think we can provide support in general to users on how to administer Solr

We've never given support for administering Solr other than getting it working as part of the CKAN install. Again, I think the single/multiple core issue is a red herring, independent of whether we use Docker or not.

@rufuspollock (Member)

I just want to testify that easier and simpler deployment is something I've heard quite a bit from talking with users (esp. people who are less technical but who oversee a project), and that some kind of standardized deployment like docker would really help here (I've even explicitly mentioned docker to folks in this connection over the last ~6m and had very positive reactions).

So to answer @amercader's point: yes I think docker would really help here ...

cf also this roadmap issue: ckan/ideas#25 (5m deployment for geeks)

@amercader (Member)

@rgrp

I just want to testify that easier and simpler deployment is something I've heard quite a bit from talking with users (esp people who are less technical but who oversee a project)

Yep, we all agree on this

and that some kind of standardized deployment like docker would really help here (i've even explicitly mentioned docker to folks in this connection over the last ~6m and had very positive reactions).

I guess the title of the issue you link to proves a bit my point: "5m (30s?) deployment of CKAN (by geeks)". I'm sure that for these geeks (CKAN dev team included) docker is an ideal solution. But having answered many install support requests, I think that for many users, Docker would be an extra layer of complexity.

@seanh (Contributor) commented May 22, 2014

Deploying a CKAN package install is about as simple as it can get (it could be improved some, but it's close) - docker will not make deploying simpler for production. It may bring other advantages for production though, like easily running many CKANs on a server, backups, and rolling back after an upgrade.

For development installs the package can't be used, and docker potentially makes installing CKAN for dev much quicker and easier than doing a source install manually.

The other big potential advantage of docker that I see is that we could use the same Dockerfile (or multiple Dockerfiles, based off each other and without duplication) for production installs on any OS, single or multiple CKAN instances on a server, developer installs, and Travis. I like the sound of that unification.

@rufuspollock (Member)

@amercader I hear you, but +1 to @seanh's points - i.e. it would make dev installs easier and would provide unification.

I agree docker adds some complexity but also much simplification and in some ways you move the support step to docker (i.e. getting docker on your machine) which has widespread support :-)

@amercader (Member)

Apologies for some of my reticent comments, caused by some misunderstandings that were cleared up after today's chat at the dev meeting.

Just to clarify, this will not replace any existing supported install mechanism.

There is big support to get this merged and used, just some changes needed:

  • Work against master, and later backport to 2.2 if necessary
  • Keep Apache + mod_wsgi for the time being and move the discussion around gunicorn to a separate issue
  • Keep the default paths /usr/lib/ckan/default and /etc/ckan/default for now
  • Needs some docs

Agreed to leave tests as a separate piece of work

@nickstenning (Contributor Author)

Thanks for this, @amercader. I'll try and get to this in the next few days.

@seanh (Contributor) commented May 26, 2014

I think we want ckan-postgres and ckan-solr Dockerfiles to go with this, both for production and development; setting up Postgres (including the DataStore) and Solr for CKAN is a PITA, time-consuming and error-prone.

For a development environment, I think it should be:

  • Git clone CKAN
  • Docker build the CKAN, Postgres and Solr images from the three Dockerfiles in the CKAN core git repo
  • Docker run the Postgres and Solr images, then docker run the CKAN image passing it the names of the Postgres and Solr ones with --link arguments, so we connect them with Docker's inter-container communication

@nickstenning Is that what you were thinking?

I wonder if it's possible to make the CKAN Dockerfile for development environments automatically set up the links to the Postgres and Solr containers, rather than having to pass --link all the time? It can assume that the two containers are on the same host, and can assume they have some default names like ckan-postgres and ckan-solr.

In production this seems a little more complicated. Presumably we can't use docker's inter-container communication, because CKAN, Postgres and Solr may not be on the same host? So they'll have to communicate over the network, and we'll have to pass the Postgres and Solr URLs into the CKAN container at docker run time.

If we can't use docker links in production, then do we want to use them in development? It may be simpler just to use URLs for both prod and dev.

Should the Postgres and Solr containers both be configured to support connecting multiple CKAN containers to (different databases and Solr cores on) the same Postgres and Solr containers? Or are you imagining it's one Postgres container and one Solr container for each CKAN container, even in production? (I guess you can still run lots of Postgres containers on a single host, so why not?)

Update: Using one postgres and one solr container for each ckan container would simplify things as well, database and solr core names can be the same for every CKAN so they can just be "hardcoded" in the default config file
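For the development case, this wiring is roughly what fig (mentioned earlier in the thread) automates. A hypothetical fig.yml, where all image names are assumptions and not published images:

```yaml
# Hypothetical fig.yml: one postgres and one solr container per ckan container.
db:
  image: ckan/postgres
solr:
  image: ckan/solr
ckan:
  image: ckan/ckan:2.2
  links:
    - db
    - solr
  ports:
    - "80:80"
```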

@seanh (Contributor) commented May 26, 2014

For configuration, would it be better to simply share the CKAN config file with the host machine using a container volume? Then the user can just edit the file themselves to set things like the postgres and solr URLs, site_title, etc. And probably have to restart the container to have the changes take effect.

The config file could come with hard-coded defaults for everything including the postgres and solr URLs. These defaults would work if you ran postgres and solr containers on the same host, using our default Dockerfiles and following the commands in our docs. So for a dev install or "just trying it out" you wouldn't need to edit them.

For a production install with postgres and solr on different hosts, the user would need to edit the CKAN config file manually and we may need to come up with some way of managing this.

    libxslt1-dev \
    nginx-light \
    postfix \
    build-essential
Contributor (inline review comment)

These are slightly different from the packages we install for our package and source installs. What are the reasons for the changes? Do we want to make the same changes to our package and source installs?

Contributor Author (inline review comment)

I'm aiming for installing the smallest number of packages needed to make CKAN work. You're right though, the source install omits libxml2-dev and libxslt1-dev: perhaps they aren't actually required.

@seanh (Contributor) commented May 26, 2014

I don't know. You guys know far more about what a sensible CKAN deployment does than I do. Tell me what's missing.

Unless I missed it I think we need to add the datastore and datapusher to this:

http://docs.ckan.org/en/latest/maintaining/datastore.html

http://docs.ckan.org/projects/datapusher/

Datastore is just a few extra commands to set up some more databases, db users and db permissions, and then a few more settings in the CKAN config file. It can be run in the same container that the first Postgres db runs in.
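Those extra commands are roughly the following (a sketch: the database/user names are assumptions, and the exact `paster` invocation for setting permissions varies by CKAN version, so check the DataStore docs rather than copying this verbatim):

```shell
# Create a read-only user and the datastore database (names are placeholders)
sudo -u postgres createuser -S -D -R -P datastore_default
sudo -u postgres createdb -O ckan_default datastore_default -E utf-8

# Then, in the CKAN config file:
#   ckan.plugins = datastore
#   ckan.datastore.write_url = postgresql://ckan_default:pass@db/datastore_default
#   ckan.datastore.read_url  = postgresql://datastore_default:pass@db/datastore_default
```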

Datapusher is actually a separate Python web app that needs to be running, so maybe that's a fourth container we need for each CKAN site? Or maybe datapusher can just be run in the same docker container where CKAN is running? I think running multiple processes in one container at the same time is possible (using a script as the container's CMD, or using a process management tool), and that might simplify things, but one process per container seems to be the Right Way with docker.

Anyway once datapusher is running one way or another, the datapusher URL needs setting in the CKAN config file.

DataStore and DataPusher will be needed as they enable the Data API and also many of the data preview/visualisation features.
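However the DataPusher ends up running, pointing CKAN at it is only a couple of config lines, roughly as below (a sketch; the URL assumes the DataPusher listens on its default port in the same container):

```ini
ckan.plugins = datastore datapusher
ckan.datapusher.url = http://127.0.0.1:8800/
```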

@seanh

seanh commented May 26, 2014

Another thing missing is extensions: if the user wants to deploy a custom CKAN extension for either a dev or production install, how do they do that? Something like this should be possible, I think:

  1. Git clone the extension on your host machine
  2. Share the extension's directory with the container
  3. Run the pip install and any other commands needed to install the extension in the container using docker run, also do any config file editing or other setup the extension requires
  4. Restart the apache or paster process

But I think we probably want to support building an image with a set of extensions already installed and configured. I guess that can be done simply enough with a Dockerfile that's based on one of the released CKAN Docker images.
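Such a derived image might look something like this (a sketch: the extension name, virtualenv path, and config file are all hypothetical):

```dockerfile
FROM ckan/ckan:2.2

# Install a hypothetical extension into the image's virtualenv
RUN /usr/lib/ckan/default/bin/pip install \
    -e git+https://github.com/ckan/ckanext-foo.git#egg=ckanext-foo

# Bake in a config that enables the extension
ADD ./mycustomconfig.ini /etc/ckan/default.ini
```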

@seanh

seanh commented May 26, 2014

Some CKAN features and extensions require cron jobs, how would those work with docker?

@nickstenning

[On Solr and Postgres images:] @nickstenning Is that what you were thinking?

More-or-less. But let's discuss that when we get there. In the meantime, have a look at this branch.

If we can't use docker links in production, then do we want to use them in development? It may be simpler just to use URLs for both prod and dev.

From the point of view of the connecting container, Docker links just expose some environment variables which can be used to discover the network address of a given service. Docker links are just a special case of using environment variables such as DATABASE_URL for configuration.
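For illustration, a startup script inside the container could turn those link-injected variables into a DATABASE_URL like so (a sketch: the variable names follow Docker's link convention for a link aliased `db` to a container exposing port 5432; the credentials and database name are placeholders):

```python
import os

def database_url_from_link(alias="db", user="ckan", password="pass", dbname="ckan"):
    # A docker link with alias "db" to a container exposing 5432 injects
    # DB_PORT_5432_TCP_ADDR and DB_PORT_5432_TCP_PORT into our environment.
    prefix = alias.upper()
    host = os.environ.get(prefix + "_PORT_5432_TCP_ADDR", "localhost")
    port = os.environ.get(prefix + "_PORT_5432_TCP_PORT", "5432")
    return "postgres://%s:%s@%s:%s/%s" % (user, password, host, port, dbname)

# Simulate the variables docker would inject for `--link mydb:db`
os.environ["DB_PORT_5432_TCP_ADDR"] = "172.17.0.2"
os.environ["DB_PORT_5432_TCP_PORT"] = "5432"
print(database_url_from_link())  # postgres://ckan:pass@172.17.0.2:5432/ckan
```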

Or are you imagining it's one Postgres container and one Solr container for each CKAN container

No. In a high density production environment I probably wouldn't even run Postgres in a container. I'd just run Postgres. But multiple CKANs talking to one Postgres, yes, absolutely.

For configuration, would it be better to simply share the CKAN config file with the host machine using a container volume?

I think the answer to that depends on questions of philosophy. In my opinion, configuration and data are not the same thing, and while data can be exported out of the container, configuration should not be. Part of the appeal of Docker is the clarity around how the container state is defined (in the arguments to docker run, basically). Putting configuration in a file that's shared outside the container is creating shared state, and I don't think it's a good idea in this context.

Unless I missed it I think we need to add the datastore and datapusher to this:

Correct.

Or maybe datapusher can just be run in the same docker container where CKAN is running?

Yes.

I think running multiple processes in one container at the same time is possible

It is and this container already does it. See baseimage-docker for the base image used for this Dockerfile, and see the contrib/docker/svc directory.
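With baseimage-docker, each service is a runit `run` script dropped under `/etc/service/`; the gunicorn service in this image presumably looks something like the following (a sketch: the virtualenv path, bind address, and option choices are assumptions, not the actual contents of `contrib/docker/svc`):

```shell
#!/bin/sh
# /etc/service/ckan/run -- runit starts and supervises this in the foreground
exec /usr/lib/ckan/default/bin/gunicorn \
    --paste /etc/ckan/default.ini \
    --workers "${GUNICORN_NUM_WORKERS:-2}" \
    --bind 127.0.0.1:8080
```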

if the user wants to deploy a custom CKAN extension for either a dev or production install, how do they do that?

This is the point where I think you need to build your own image, à la:

FROM ckan/ckan:2.2
RUN ckan ext install https://github.com/ckan/ckanext-foo.git

(I know that the ckan ext install command doesn't (yet) exist).

Some CKAN features and extensions require cron jobs, how would those work with docker?

This Docker image already runs crond. Just drop a file in /etc/cron.d.
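For example, CKAN's page-view tracking needs a periodic `paster tracking update`; a drop-in for it might look like this (a sketch; the virtualenv and config paths are assumptions):

```shell
# /etc/cron.d/ckan-tracking (hypothetical)
*/10 * * * * root /usr/lib/ckan/default/bin/paster --plugin=ckan tracking update -c /etc/ckan/default.ini
```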


I'll respond to your commit comments inline.

@rufuspollock

@nickstenning any update on this? Will this go in soon?

@vitorbaptista

FYI, Docker 1.0 was released today http://blog.docker.com/2014/06/its-here-docker-1-0/

@Analect

Analect commented Jun 9, 2014

Excuse me poking in here, but just an FYI. I don't see David Raznick (@kindly) on this thread. I'm sure you're aware he has had a docker solution in place for some time, available at https://github.com/kindly/ckan_dockered ... I've been using it for the past 6 months and it works well.

Works as follows:
sudo docker run -name=newstuff -i -t -p 5000:80 kindly/ckan_base:2.2 bash /usr/lib/ckan/startup.sh http://packaging.ckan.org/build/python-ckan_2.2-6_amd64.deb

Rgds, Colum

@nickstenning

Closing in favour of #1755.

(Sorry, but I can't change the base of a pull request.)

@nickstenning nickstenning deleted the release-v2.2-docker branch June 10, 2014 16:31