CKAN High Availability

Rick Leir edited this page May 27, 2016 · 6 revisions

This document describes a basic way to set up a CKAN high-availability cluster. It is written primarily for CKAN 2.0 running on Ubuntu 12.04, but similar steps should work with all recent versions of CKAN. We first describe how the frontend components can be duplicated to provide redundancy, followed by suggested configurations for the two main backend dependencies of CKAN 2.0: PostgreSQL and Solr.

Frontend

Redundancy on the frontend can be provided by running multiple web servers, each with its own copy of the CKAN code. As CKAN is a WSGI app, any suitable web server setup can be used, but we generally recommend Apache with mod_wsgi.

A load balancer is then placed in front of the web servers (we recommend nginx). To configure nginx to load-balance the servers, create a file in its sites-available directory (/etc/nginx/sites-available on Ubuntu) containing:

upstream backend  {
    ip_hash;  # send requests from given IP to same server each time
    server <server 1 IP>:<server 1 port> max_fails=3 fail_timeout=10s;
    server <server 2 IP>:<server 2 port> max_fails=3 fail_timeout=10s;
}

server {
    location / {
        proxy_pass http://backend;
    }
}
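Assuming the file above was saved as /etc/nginx/sites-available/ckan-cluster (the name is arbitrary), the site can then be enabled and nginx reloaded; this is a sketch of the standard Ubuntu procedure:

```shell
# Enable the load-balancer config by linking it into sites-enabled
sudo ln -s /etc/nginx/sites-available/ckan-cluster /etc/nginx/sites-enabled/ckan-cluster

# Check the configuration for syntax errors before reloading
sudo nginx -t
sudo service nginx reload
```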

Notes:

  • Each instance must have the same settings for beaker.session.key and beaker.session.secret in the CKAN config file.
  • If OpenID is used, something like Memcache will additionally be required in order to share session information between servers.
  • Without Memcache it is also possible that some flash messages will not be displayed (as they are currently stored on disk), but the use of ip_hash should minimise this.
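For example, every instance's CKAN config file should contain an identical pair of session settings (the values below are placeholders; generate your own secret):

```ini
# Must be identical across all CKAN web servers
beaker.session.key = ckan
beaker.session.secret = <shared random secret>
```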

Backend: PostgreSQL

There are various ways that a PostgreSQL cluster can be configured; see the PostgreSQL wiki for a brief overview. Here we describe how to set up two PostgreSQL servers so that one acts as a master and the other is available as a warm standby machine. Also see these documents for reference. The basic idea is that the CKAN instance(s) all use the same master server. If a failure is detected, the standby server is brought online and the instance(s) are updated to use it as the new master. This process is not automatic; if automatic failover is required, a more complex setup (possibly using a tool like Slony) will be needed. The steps to set up the warm standby configuration are as follows:

On the master server:

  • As the postgres user, create a new ssh key (with no passphrase). This will be used to connect to the standby server.

    sudo -u postgres mkdir /var/lib/postgresql/.ssh
    sudo -u postgres chmod 700 /var/lib/postgresql/.ssh
    sudo -u postgres ssh-keygen -t rsa -b 2048 -f /var/lib/postgresql/.ssh/rsync-key

On the standby server:

  • Copy the newly created public key to the authorized_keys file of the postgres user on the standby server. On Ubuntu 12.04 this defaults to /var/lib/postgresql/.ssh/authorized_keys.
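A sketch of this step, assuming the public key has first been copied to the standby server (e.g. with scp, to /tmp in this example):

```shell
# On the standby server: append the master's public key to authorized_keys
sudo -u postgres mkdir -p /var/lib/postgresql/.ssh
sudo -u postgres sh -c 'cat /tmp/rsync-key.pub >> /var/lib/postgresql/.ssh/authorized_keys'
sudo -u postgres chmod 600 /var/lib/postgresql/.ssh/authorized_keys
```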

On the master server:

  • Verify that you can connect to the standby server as the postgres user:

    sudo -u postgres ssh postgres@<standby server ip> -i /var/lib/postgresql/.ssh/rsync-key

  • Edit the file /etc/postgresql/9.1/main/postgresql.conf, setting (around lines 153, 181 and 183 in the default file):

        wal_level = archive
        archive_mode = on
        archive_command = 'rsync -avz -e "ssh -i /var/lib/postgresql/.ssh/rsync-key" %p postgres@<standby server IP>:/var/lib/postgresql/9.1/archive/%f'

    Here /var/lib/postgresql/9.1/archive is the directory on the standby server where the WAL files will be stored.

  • Restart postgres.

On the standby server:

  • Stop PostgreSQL if it is currently running.
  • Move the existing data directory out of the way: sudo mv /var/lib/postgresql/9.1/main /var/lib/postgresql/9.1/main.backup

On the master server:

  • Save a backup of the current database to the standby server:

        sudo -u postgres psql -c "select pg_start_backup('ckan-initial-backup', true);"
        sudo -u postgres rsync -avz -e "ssh -i /var/lib/postgresql/.ssh/rsync-key" --exclude 'pg_log/*' --exclude 'pg_xlog/*' --exclude postmaster.pid /var/lib/postgresql/9.1/main/ postgres@<standby server IP>:/var/lib/postgresql/9.1/main
        sudo -u postgres psql -c "select pg_stop_backup();"

On the standby server:

  • Install the postgresql-contrib package to get the pg_standby program:

        sudo apt-get install postgresql-contrib-9.1

  • Create a file in the postgres data directory called recovery.conf, containing:

        restore_command = '/usr/lib/postgresql/9.1/bin/pg_standby -t /var/lib/postgresql/9.1/recovery.trigger /var/lib/postgresql/9.1/archive/ %f %p %r'

    /var/lib/postgresql/9.1/recovery.trigger is the path to the trigger file; creating this file will cause the standby server to come online.

  • Start postgres.
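When the master fails, the standby is promoted manually by creating the trigger file that pg_standby watches for; this is the manual failover step mentioned earlier:

```shell
# On the standby server: pg_standby sees the trigger file, stops replaying
# WAL segments, and brings the database online for read-write use
sudo -u postgres touch /var/lib/postgresql/9.1/recovery.trigger
```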

On the master server:

  • Add some data (enough to create several WAL files).
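If there is not enough write activity to fill a 16 MB WAL segment, a segment switch can be forced so that a file is shipped to the standby immediately (pg_switch_xlog() is the PostgreSQL 9.1 name; later versions renamed it pg_switch_wal()):

```shell
# Force the current WAL segment to be archived without waiting for it to fill
sudo -u postgres psql -c "select pg_switch_xlog();"
```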

On the standby server:

  • Verify that the WAL files are being stored in the /var/lib/postgresql/9.1/archive directory (or equivalent).
  • Read the postgres log file to verify that the WAL files are being read.
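Both checks can be done from the shell; the log path below is the Ubuntu 12.04 default:

```shell
# Newest shipped WAL segments should appear here as the master archives them
ls -lt /var/lib/postgresql/9.1/archive/ | head

# pg_standby logs each segment it restores
sudo tail -f /var/log/postgresql/postgresql-9.1-main.log
```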

Backend: Solr

Solr replication is described on the Solr wiki. The steps necessary to set up replication using a single-core master server and a single slave server are as follows:

On the master server:

  • Edit /etc/solr/conf/solrconfig.xml, adding a new replication request handler at around line 505:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
        <lst name="master">
            <str name="replicateAfter">commit</str>
            <str name="replicateAfter">startup</str>
    
            <str name="backupAfter">optimize</str>
            <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
            <str name="commitReserveDuration">00:00:10</str>
        </lst>
    
        <int name="maxNumberOfBackups">2</int>
        <lst name="invariants">
            <str name="maxWriteMBPerSec">16</str>
        </lst>
    </requestHandler>
    
  • Restart Jetty (Solr 6 or newer: restart Solr).

On the slave server:

  • Edit /etc/solr/conf/solrconfig.xml, adding a new replication request handler at around line 505:

    <requestHandler name="/replication" class="solr.ReplicationHandler" >
        <lst name="slave">
            <!--fully qualified url for the replication handler of master-->
            <str name="masterUrl">http://master_host:port/solr/corename/</str>
    
            <!--Interval in which the slave should poll master. Format is HH:mm:ss-->
            <str name="pollInterval">00:00:20</str>
         </lst>
    </requestHandler>
    
  • Restart Jetty (Solr 6 or newer: restart Solr).
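Once both servers are up, replication can be verified through the replication handler itself; command=details is part of Solr's ReplicationHandler API (substitute your real host, port and core name):

```shell
# Ask the slave for its replication status, including the master it polls
# and the index version it last fetched
curl "http://<slave host>:<port>/solr/corename/replication?command=details"
```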