CKAN High Availability

This document describes a basic way to set up a CKAN high availability cluster. It was written primarily for CKAN 2.0 running on Ubuntu 12.04, but similar steps should work with all recent versions of CKAN. We first describe how the frontend components can be duplicated to provide redundancy, followed by suggestions for configuring the main backend dependencies of CKAN 2.0: PostgreSQL and Solr.

Frontend

Redundancy on the frontend can be provided by having multiple web servers, each with their own copy of the CKAN code. As CKAN is a WSGI app, any suitable web server setup can be used, but we generally recommend Apache with mod_wsgi.

A load balancer is then placed in front of the web servers (we recommend nginx). To configure nginx to load-balance the servers, create a file in the sites-available directory containing:

upstream backend  {
    ip_hash;  # send requests from given IP to same server each time
    server <server 1 IP>:<server 1 port> max_fails=3 fail_timeout=10s;
    server <server 2 IP>:<server 2 port> max_fails=3 fail_timeout=10s;
}

server {
    location / {
        proxy_pass http://backend;
    }
}
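
To activate the configuration, symlink it into sites-enabled and reload nginx. A minimal sketch, assuming the file was saved as /etc/nginx/sites-available/ckan-lb (adjust the file name to match your setup):

    sudo ln -s /etc/nginx/sites-available/ckan-lb /etc/nginx/sites-enabled/ckan-lb
    sudo nginx -t              # check the configuration for errors
    sudo service nginx reload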

Notes:

  • Each instance must have the same settings for beaker.session.key and beaker.session.secret in the CKAN config file (see the sketch after this list).
  • If OpenID is used, something like Memcache will additionally be required in order to share session information between servers.
  • Without Memcache it is also possible that some flash messages will not be displayed (as they are currently stored to disk), but the use of ip_hash should minimise this.
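
For example, every web server's CKAN config file might carry identical session settings like the following sketch (the key name shown is the CKAN default; generate your own secret value):

    beaker.session.key = ckan
    beaker.session.secret = <shared random string>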

Backend: PostgreSQL

There are various ways that a PostgreSQL cluster can be configured; see the PostgreSQL wiki for a brief overview. Here we describe how to set up two PostgreSQL servers so that one acts as the master and the other is available as a warm standby machine (see the PostgreSQL documentation on warm standby for reference). The basic idea is that the CKAN instance(s) all use the same master server. If a failure is detected, the standby server is brought online and the instance(s) are updated to use it. This process is not automatic; if automatic failover is required then a more complex setup (possibly using a tool like Slony) will be needed. The steps to set up the warm standby configuration are as follows:

On the master server:

  • As the postgres user, create a new ssh key (with no passphrase). This will be used to connect to the standby server.

    sudo -u postgres mkdir /var/lib/postgresql/.ssh
    sudo -u postgres chmod 700 /var/lib/postgresql/.ssh
    sudo -u postgres ssh-keygen -t rsa -b 2048 -f /var/lib/postgresql/.ssh/rsync-key

On the standby server:

  • Copy the newly created public key to the authorized_keys file of the postgres user on the standby server. On Ubuntu 12.04 this defaults to /var/lib/postgresql/.ssh/authorized_keys.
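
One way to do this is to run ssh-copy-id from the master (a sketch, assuming password authentication to the standby is temporarily available):

    sudo -u postgres ssh-copy-id -i /var/lib/postgresql/.ssh/rsync-key.pub postgres@<standby server IP>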

On the master server:

  • Verify that you can connect to the standby server as the postgres user.

    sudo -u postgres ssh postgres@<standby server IP> -i /var/lib/postgresql/.ssh/rsync-key

  • Edit the file /etc/postgresql/9.1/main/postgresql.conf: on line 153 set wal_level = archive; on line 181 set archive_mode = on; on line 183 set archive_command = 'rsync -avz -e "ssh -i /var/lib/postgresql/.ssh/rsync-key" %p postgres@<standby server IP>:/var/lib/postgresql/9.1/archive/%f' (where /var/lib/postgresql/9.1/archive is the directory on the standby server in which the WAL files will be stored).
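
The relevant settings should end up looking like this sketch (line numbers may differ between installations):

    wal_level = archive
    archive_mode = on
    archive_command = 'rsync -avz -e "ssh -i /var/lib/postgresql/.ssh/rsync-key" %p postgres@<standby server IP>:/var/lib/postgresql/9.1/archive/%f'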

  • Restart postgres.
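
On Ubuntu 12.04 this is typically done with (a sketch; the service name may differ on other platforms):

    sudo service postgresql restart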

On the standby server:

  • Stop PostgreSQL if it is currently running.
  • Move the existing data directory out of the way:

    sudo mv /var/lib/postgresql/9.1/main /var/lib/postgresql/9.1/main.backup

On the master server:

  • Save a backup of the current database to the standby server.

    sudo -u postgres psql -c "select pg_start_backup('ckan-initial-backup', true);"
    sudo -u postgres rsync -avz -e "ssh -i /var/lib/postgresql/.ssh/rsync-key" --exclude 'pg_log/*' --exclude 'pg_xlog/*' --exclude postmaster.pid /var/lib/postgresql/9.1/main/ postgres@<standby server IP>:/var/lib/postgresql/9.1/main
    sudo -u postgres psql -c "select pg_stop_backup();"

On the standby server:

  • Install the postgresql-contrib package to get the pg_standby program.

    sudo apt-get install postgresql-contrib-9.1

  • Create a file in the postgres data directory called recovery.conf, containing:

    restore_command = '/usr/lib/postgresql/9.1/bin/pg_standby -t /var/lib/postgresql/9.1/recovery.trigger /var/lib/postgresql/9.1/archive/ %f %p %r'

    Here /var/lib/postgresql/9.1/recovery.trigger is the path to the trigger file; creating this file will cause the standby server to come online.

  • Start postgres.

On the master server:

  • Add some data (enough to create several WAL files).
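
If there is no real traffic yet, WAL segments can be forced out manually (a sketch; pg_switch_xlog() is the PostgreSQL 9.x function, renamed pg_switch_wal() in PostgreSQL 10+):

    sudo -u postgres psql -c "select pg_switch_xlog();"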

On the standby server:

  • Verify that the WAL files are being stored in the /var/lib/postgresql/9.1/archive directory (or equivalent).
  • Read the postgres log file to verify that the WAL files are being read (see the sketch after this list).
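
A quick way to check both (a sketch; the log file path assumes the Ubuntu 12.04 packaging defaults):

    ls -l /var/lib/postgresql/9.1/archive
    tail -f /var/log/postgresql/postgresql-9.1-main.log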

Backend: Solr

Solr replication is described on the Solr wiki. The steps necessary to set up replication using a single-core master server and a single slave server are as follows:

On the master server:

  • Edit /etc/solr/conf/solrconfig.xml, adding a new replication request handler at around line 505.

      <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="master">
              <str name="replicateAfter">commit</str>
              <str name="replicateAfter">startup</str>

              <str name="backupAfter">optimize</str>
              <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
              <str name="commitReserveDuration">00:00:10</str>
          </lst>

          <int name="maxNumberOfBackups">2</int>
          <lst name="invariants">
              <str name="maxWriteMBPerSec">16</str>
          </lst>
      </requestHandler>
    
  • Restart Jetty (Solr 6 or newer: restart Solr).

On the slave server:

  • Edit /etc/solr/conf/solrconfig.xml, adding a new replication request handler at around line 505.

      <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
              <!--fully qualified url for the replication handler of master-->
              <str name="masterUrl">http://master_host:port/solr/corename/</str>

              <!--Interval in which the slave should poll master. Format is HH:mm:ss-->
              <str name="pollInterval">00:00:20</str>
          </lst>
      </requestHandler>
    
  • Restart Jetty (Solr 6 or newer: restart Solr).
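
Once both servers have been restarted, replication can be verified through the replication handler's HTTP API (a sketch; substitute your own host, port and core name):

      curl 'http://slave_host:port/solr/corename/replication?command=details'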

Backend: Redis

CKAN 2.7+ requires Redis; in a high availability deployment this should itself be a Redis cluster (or otherwise replicated) so that it is not a single point of failure.
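
Each CKAN instance (and each job worker, see below) is pointed at Redis through the CKAN config file; a minimal sketch, assuming a Redis endpoint reachable at <redis host>:

    ckan.redis.url = redis://<redis host>:6379/0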

Backend: Job workers

The asynchronous job workers can be hosted separately from the CKAN instance. In an HA setup this may improve stability and responsiveness (since a heavily loaded job queue will not directly affect the web application), and it allows the application and the jobs to be scaled up or out independently of each other.

Job workers require a copy of the CKAN config file; access to the database, Redis cluster, and Solr; and usually a management process such as Supervisord. They do not require a load balancer.

On the job server:

  • Create a virtualenv and install the CKAN source and its dependencies, eg:

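    A minimal sketch, assuming CKAN 2.9 installed from source into /usr/lib/ckan/default (match the version tag and paths used on your web servers):

      # create a virtualenv for CKAN (path assumed)
      python3 -m venv /usr/lib/ckan/default
      # install the CKAN source at the same tag as your other instances
      git clone https://github.com/ckan/ckan.git
      cd ckan && git checkout ckan-2.9.5
      /usr/lib/ckan/default/bin/pip install -r requirements.txt
      /usr/lib/ckan/default/bin/pip install -e .
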
  • Copy the config file from your CKAN instance to the same location, eg /etc/ckan/default/production.ini

  • Install Supervisor, eg:

      sudo apt-get install supervisor
    
  • Set up a config file for each type of worker you want to run, eg:

      [unix_http_server]
      file=/var/tmp/supervisor.sock
    
      [rpcinterface:supervisor]
      supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
    
      ; =======================================================
      ; Supervisor configuration for CKAN background job worker
      ; =======================================================
    
      [program:ckan-worker]
    
      ; Use the full paths to the virtualenv and your configuration file here.
      command=/usr/lib/ckan/default/bin/ckan -c /etc/ckan/default/production.ini jobs worker
    
      ; User the worker runs as.
      user=ckan
    
      ; Start just a single worker. Increase this number if you have many or
      ; particularly long running background jobs.
      numprocs=1
      process_name=%(program_name)s-%(process_num)02d
    
      ; Log files.
      stdout_logfile=/var/log/ckan/ckan-worker.log
      stderr_logfile=/var/log/ckan/ckan-worker.log
    
      ; Make sure that the worker is started on system start and automatically
      ; restarted if it crashes unexpectedly.
      autostart=true
      autorestart=true
    
      ; Number of seconds the process has to run before it is considered to have
      ; started successfully.
      startsecs=10
    
      ; Need to wait for currently executing tasks to finish at shutdown.
      ; Increase this if you have very long running tasks.
      stopwaitsecs = 600
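
    After saving this file (eg as /etc/supervisor/conf.d/ckan-worker.conf, an assumed path), reload Supervisor and confirm the worker is running; a sketch:

      sudo supervisorctl reread
      sudo supervisorctl update
      sudo supervisorctl status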
    