Fetching contributors…
Cannot retrieve contributors at this time
495 lines (348 sloc) 17.5 KB
This document gives an overview of the installation process for
BioSQL on 3 different platforms: MySQL, Postgres, and Oracle.
It's written from a Bioperl and Unix perspective but most of the
instructions will work for Unix generally, independent of language.
In each case the installation has multiple steps, and there may
be peculiarities for a given server and a given operating system.
If you've read this document and the documentation for your server
and are still having a problem with the install please contact
BioSQL is a database for storing bioinformatics data, particularly
sequences and associated data. It has been written jointly by the
members of the BioJava, BioPerl, BioRuby and BioPython projects,
and you can use applications from any of these projects to load and
fetch data. You don't have to use these projects - you can query
the database directly with SQL (Structured Query Language) or using
the direct bindings of the different languages (e.g. JDBC, Perl DBI).
BioSQL needs a SQL database server system, currently one of MySQL,
Postgres or Oracle (but any SQL database will work with a little
tweaking - we test regularly against these engines). BioSQL will
typically be used with a client program written in one of Perl,
Java or Python (for example, the Perl client is the bioperl-db
package). Sensible uses of BioSQL include:
(a) Storing the entire GenBank or EMBL or Swissprot public
repository in the BioSQL schema, providing efficient access to
individual records there.
(b) Storing your own sequences which you are working on in the lab.
(c) Storing a particular genome sequence.
Yes, there are many other SQL-compatible databases which can serve
these purposes. The great thing about BioSQL is the interoperability
between the various Bio* projects (see www.open-bio.org for more on
all these efforts).
Server Setup
To run BioSQL yourself you need to have a database server. Depending
on your circumstances there will be different setups to use. Here are
a couple of examples:
- Running a BioSQL database on a laptop providing Swissprot in a
flexible and useable way. Here one might use MySQL or Postgres
running locally on the laptop.
- Running a BioSQL database for a small lab group in order to share
sequence annotations. Here one might use Postgres, or MySQL with
transaction support. Postgres comes with transaction support built-in.
- Running a BioSQL database in a company to manage a mid-scale EST
project. Here one might use an Oracle or Postgres database.
The database server you use should really be chosen by ease of use
for people; for example, in a group with a lot of Oracle experience,
use Oracle, similarly for a MySQL group, use MySQL. If you are
starting out, probably MySQL is the easiest database to work with, but
Postgres is a very sensible option which comes as a more "standard"
database. Below gives you the basic set up of the three databases.
For the MySQL version to instantiate and run properly you need to
have MySQL version 3.23.50 or later, because only from that version on
are nullable columns handled in a bug-free manner, according to the
MySQL documentation. If you are lucky your Linux or Mac distribution,
or whatever, came with MySQL. Look at your service tools and switch
on MySQL. If not, don't worry, installing MySQL is easy.
Most of us on Unix typically install from source. Download the tar
file from one of the mirrors linked off www.mysql.com. Uncompress and
untar the tar file.
>tar -zxvf mysql-4.0.10-gamma.tar.gz
>make install
You'll probably need root privileges to run the 'make install'.
Once the MySQL files are installed you are ready to create setup
accounts and start the server, and the steps are well-documented
in the MySQL manual. Another good document to look at is in the
bioperl-db package: bioperl-db/docs/HOWTO-MySQL.html. This one
discusses both MySQL installation and bioperl-db.
If you are on Windows you're better off using a binary. Please
see the installation instructions at www.mysql.com.
If you want InnoDB compiled in MySQL you should probably download the
stable MySQL 4* version, but if you have already installed MySQL 3.23
MAX and you want to activate InnoDB in your server you need to
uncomment InnoDB-related lines in the my.cnf MySQL configuration file.
By default there is no configuration file, so if you don't have one
you need to copy one of the sample files in the supporting_files/
directory to /etc/. There are four my.cnf files called my-small.cnf,
my-medium.cnf, my-large.cnf, and my-huge.cnf corresponding to
predefined defaults for small, medium and large databases
respectively. Pick any of them and copy it to /etc/.
For example:
>cp /usr/local/mysql/support-files/my-small.cf /etc/my.cnf
Then edit the /etc/my.cnf to uncomment lines referring to InnoDB,
specifically the following lines:
innodb_data_home_dir = /usr/local/mysql/var/
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /usr/local/mysql/var/
innodb_log_arch_dir = /usr/local/mysql/var/
set-variable = innodb_buffer_pool_size=16M
set-variable = innodb_additional_mem_pool_size=2M
set-variable = innodb_log_file_size=5M
set-variable = innodb_log_buffer_size=8M
set-variable = innodb_lock_wait_timeout=50
Finally you need to make sure the innodb_data_home_dir exists and is
owned by the mysql user, e.g.
>mkdir /usr/local/mysql/var/
>chown mysql /usr/local/mysql/var/
Finally start the mysql server as per normal:
>safe_mysqld &
Like Mysql, the best thing is to install from source. Go to
www.postgresql.org and choose a mirror to download the postgres tar
Unzip and Untar, in one step:
>tar -zxvf postgresql-8.3.6.tar.gz
Or in two steps:
>gunzip postgresql-8.3.6.tar.gz
>tar -xvf postgresql-8.3.6.tar
NOTE: In version 8.3 and higher, PostgreSQL introduced a significant
change to the behavior of the database server. Specifically,
PostgreSQL v8.3+ removes the automatic typecasts of numeric values to
strings, and hence comparing a character-type column to an unquoted
numeric value results in an error.
This may break client code that includes unquoted number literals in
query constraints for text-type columns, for example for accession
numbers (which ironically often are not numbers).
None of the Bio* language-bindings is currently known to suffer from
this, and hence PostgreSQL v8.3 is supported. Custom code, however,
may be affected, and we recommend to test such code first before
upgrading from an older version of PostgreSQL.
Now make the package by going:
>make install
You may have to be root to do this.
In some systems, readline is not installed. In this case either:
(a) install the readline package from ftp.gnu.org, or
(b) do ./configure --without-readline
Now make a new postgres user, for example 'postgres', which will
actually run the postgres server on your machine. On linux and other
unix's this is done by something like:
>adduser postgres
On Mac OS X, go to SystemPreferences --> Users --> New User
You now need to make a place for your data directory for postgres.
For example, as root:
>cd /usr/local
# pgsql/ may already exist from installing PostgreSQL
>mkdir pgsql
>mkdir pgsql/data
>chown -R postgres pgsql/
The chown command makes the postgres user the owner of the pgsql/
Since this is where the postgres data will be stored, you may well want
to have this directory on a separate disk partion. If so, simply build
a directory tree (e.g. /somewhere/disk12/pgsql/data). Remember that
the postgres server will be accessing this directory all the time, so
wherever possible make sure this disk is local storage, *not* network
Now log in as the postgres user to install the postgres server data files:
>su postgres
The postgres server needs to know where to find the data and where
the postgres binaries are. For the postgres user, it is best to store
this once in a init shell file, e.g. ~/.tcshrc for tcsh:
>setenv PGDATA /usr/local/pgsql/data
>setenv PATH ${PATH}:/usr/local/pgsql/bin
The PGDATA environment variable tells postgres where the data files
are stored; the path is to put all the postgres binaries on the path
The initdb command builds all the metadata files and basic starting
files for the postgres server. you only need to run this once on your
>/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/
*WARNING* In some situations, you need to init the database with the
--no-locale command (this disables some of the sorting cases in the
postgres server). If when you start the postmaster command you see a
message like:
FATAL: invalid value for option 'LC_MONETARY': 'en_US'
Then remove the files in /usr/local/postgres/data/ and rerun initdb
with --no-locale command.
>rm -rf /usr/local/postgres/data/*
>/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/ --no-locale
Now start the postgres server. This actually pulls up a instance of
the postgres server for running things:
>/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ -l logfile start
To test that your postgres server works, you need to make a test
database and then connect to the server using the psql command
>createdb test
>psql test
*WARNING* If the above commands don't work, then perhaps your locale
settings have gotten messed up. Check out the logfile for the psqldb user
and if it has message like:
FATAL: invalid value for option 'LC_MONETARY': 'en_US'
Then remove the files in /usr/local/postgres/data/ and rerun initdb
with --no-locale command
>rm -rf /usr/local/pgsql/data/*
>/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/ --no-locale
The postgres server is now up and running.
ORACLE Installation
Oracle installations are certainly more complicated than MySQL
installs. Depending on the situation one could choose from the free
Personal Oracle or Oracle Enterprise. The latter package is installed
using the Oracle Universal Installer which uses a graphic interface.
Both available through the Oracle Technology Network.
Installing Oracle, with its relative complexity, is somewhat beyond
the scope of this document. However, if this is your choice then bear
in mind that there are people in bioperl-l@bioperl.org who can assist
should you run into trouble. Just as with MySQL and Postgres you will
probably need to create special users to install the database and
you will need root privileges. Unlike MySQL and Postgres Oracle uses
configuration files, such as tnsnames.ora and listener.ora, that keep
track of essential information and many problems are solved by making
changes in these files. Other common problems concern the
environmental variables $ORACLE_HOME and $ORACLE_SID, make sure that
these are set correctly.
Starting BioSQL and loading the first database
You have to first choose the language or project you want to use in
to load your data. You currently have three options: Perl and BioPerl,
Java and BioJava and Python and BioPython. You will need to download
the schema, the same for all three languages, and the supporting modules
for each language. In each case the package is a mixture of generic
modules for database access and then the BioSQL binding code.
Schema Loading
The BioSQL schema is distributed separately from the language
bindings. Use git clone to get a local copy:
>git clone https://github.com/biosql/biosql.git
Create a database, which we'll call 'biosql', in the data instance:
For MySQL do:
>mysqladmin -u root create biosql
For postgres do, as the postgres user:
>createdb biosql
To load the schema, use the appropiate SQL dialect in
For mysql do:
>mysql -u root biosql < biosqldb-mysql.sql
For postgres do:
>psql biosql < biosqldb-pg.sql
For Oracle there's more than a single script, and the database itself
is more complex, complete with triggers, indices, and different
configurations. Please take a look at the INSTALL file in the
biosql/biosql-schema/sql/biosql-ora directory for more information.
At this point we now have a database instance with the server set up
correctly, we now need to load it with data.
Perl Loading
To load with Perl, you need to have the DBI module installed as well
as the driver for your SQL database of choice. The Oracle, MySQL,
and Postgres modules are available at CPAN:
>sudo perl -MCPAN -e shell
'sudo' is a command that allows users to issue commands as root.
Alternatively become root.
At the CPAN shell prompt go:
cpan>install DBI
Then for the particular SQL flavour you need to install the DBD.
For MySQL:
cpan>install DBD::mysql
For Postgres:
cpan>install DBD::Pg
For Oracle:
cpan>install DBD::Oracle
The DBD modules will need the database client libraries for
the relevant database on the local system to compile and link
correctly. Occasionally this install fails in an indecipherable way.
If the install in the CPAN shell fails it is usually best to try the
install by hand, looking at the compile error message carefully.
If you interrupt CPAN the files are downloaded to ~/.cpan/build.
If you are root, this will be either off /.cpan/ or /root/.cpan,
depending on the operating system.
For example, in my case, I got the following error on installing DBD::Pg:
/usr/bin/ld: table of contents for archive:
/usr/local/pgsql/lib/libpq.a is out of date; rerun ranlib(1) (can't load from it)
make: *** [blib/arch/auto/DBD/Pg/Pg.bundle] Error 1
Looking at this error I then did, as root:
>ranlib /usr/local/pgsql/lib/libpq
and then re-ran make and make install.
Another common source of error comes from Perl assuming a library file
is in a certain place when it's elsewhere. For example, this partial
command, after the "make" command:
... gcc -c -L/usr/lib/mysql -lmysqlclient -lm -lz ...
means that gcc assumes the file "libmysqlclient.a" might be found in
the directory /usr/lib/mysql, or in one of the standard library
directories like /lib or /usr/lib. If it's actually in
/usr/local/lib/mysql, for example, then you'll need to tell Perl to
construct the Makefile appropriately. First clean up:
>make clean
Then make the Makefile:
>perl Makefile.PL --libs="-L/usr/local/lib/mysql -lmysqlclient -lm -lz"
Then proceed to "make test" and "make install".
Downloading bioperl-db and installing
With DBI and the driver of choice set up you now should download
the bioperl and bioperl-db package.
bioperl-1.5.2 is available from CPAN or www.bioperl.org, and
bioperl-db is available from CPAN, or from git:
>git clone https://github.com/bioperl/bioperl-db.git
Look at the INSTALL files in the bioperl and bioperl-db directories,
they will give specific instructions. Bioperl uses other CPAN modules
but don't make the mistake of trying to install each and every
accessory module, most of these are related to specific modules in
Bioperl that you may not be using.
Also, you don't need root permission to use bioperl and bioperl-db and
you don't need to run the final 'make install'. You could use these packages
with the packages just untar'd in your local directory and PERL5LIB
set appropriately, e.g.:
>setenv PERL5LIB /sw/lib/perl5:/Users/birney/src/bioperl-live:/Users/birney/src/bioperl-db
Many of the bioperl installation details are discussed in the INSTALL
file found in the bioperl package, look there for more information.
Loading taxonomy data
With bioperl and bioperl-db installed you are ready to load some data.
It is advisable to pre-load the NCBI taxonomy database (use
scripts/load_ncbi_taxonomy.pl in the biosql-schema package, the details are
in its documentation). Otherwise you'll see errors from misparsed
Loading sequence data
To load sequence data you can use the load_seqdatabase.pl script in
bioperl-db/scripts/biosql/, for example:
>perl load_seqdatabase.pl -dbuser postgres -dbname biosql -namespace swissprot -format swiss sprot40.dat
Note that in this example for a PostgreSQL database it is the database
superuser who loads the sequences. Just as one would not use the
system root user on a Unix system as a normal user account, one would
create separate users with more limited privileges as the ones
"owning" a schema, and loading or manipulating data. Each RDBMS has
its own way of creating users; look up the documentation for yours for
the 'CREATE USER' command.
The -dbname parameter is the name of the database you have set up for
your data.
The -dbuser parameter is the user who will load the data, who has
write-privileges on the database.
The -namespace 'swissprot' provides the string under which this dataset
is stored. A single bioSQL database can store many different "sequence
databases" (e.g., it can store both Swissprot and EMBL and
The -format parameter indicates the format of the files provided. All the
SeqIO formats are supported (see www.bioperl.org/HOWTOs for more on
SeqIO), though of course not all formats have the same amount of
information in them.
The last argument to the command above is the sequence file, "sprot40.dat".
Because you have your choice of formats you have your choice of input
files. The complete Swissprot file can be obtained at ftp.expasy.ch,
along many sub-databases. All of Genbank is available at
ftp://ftp.ncbi.nih.gov/genbank, in genbank format. Many microbial
genomes are available at ftp://ftp.tigr.org/pub/data/. Enjoy!