Skip to content

ahstram/al3c

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

         _______     
  ____ _/ /__  /_____  a software framework for numerical approximations
 / __ `/ / /_ >/ ___/  of posterior distributions by Approximate
/ /_/ / /___/ / /__    Bayesian Computation Sequential Monte Carlo
\__,_/_//____/\___/    in parallel 

al3c is written for two types of users:

  1. Those who have written their own software to address niche scientific questions, and would like to speed things up by using ABC in parallel without too much parallel programming or effort.

  2. Those seeking use ABC with coalescent theory models.

To accomodate the latter users, we provide a number of examples using al3c with MaCS, a Markovian coalescent theory simulator, which can easily be modified for the user's specific case.

Publication

al3c: High-performance software for parameter inference using Approximate Bayesian Computation in Bioinformatics

Support

For help with using al3c, please email Alexander Stram at astram@usc.edu

al3c+MaCS Example

We provide an al3c example for parallelized ABC-SMC using the coalescent simulator MaCS, available at https://github.com/gchen98/macs

Here are some quick examples that can get you running al3c without much thought:

Warning! If you have MPI libraries already installed, please skip to the "From source" example. Existing MPI libraries will likely interfere with the Linux and OS-X binaries we provide below.

Linux 64-bit

git clone https://github.com/ahstram/al3c.git  
cd al3c/  
bin/al3c_linux_x86-64 config/macs_linux-x86-64.xml  

Mac OS-X

git clone https://github.com/ahstram/al3c.git  
cd al3c/  
bin/al3c_darwin_x86-64 config/macs_darwin-x86-64.xml  

From source

To compile the example from source, you will need to ensure that an MPI library such as Intel MPI Library, OpenMPI or MPICH2 is installed, then follow:

git clone https://github.com/ahstram/al3c.git  
cd al3c/  
## The following command makes the al3c binary
make  
## The following command compiles the MaCS binary, which requires C++ Boost libraries 
git clone https://github.com/gchen98/macs.git macs-binary && cd macs-binary && make && ln -s ../macs-binary/macs ../macs/macs && cd .. 
## This makes the MaCS plugin, which uses the above MaCS binary
cd macs/ && make && cd ../  
## This initiates al3c, using the "config/macs.xml" configuration, which loads the MaCS plugin, which in turn calls the MaCS binary
bin/al3c config/macs.xml  

Modifications to the make.inc files in the al3c/ and al3c/macs/ directory may be necessary for your environment. ##Tutorial

al3c consists of three components: the al3c binary, a user-defined shared library (.so file), and an XML file, which specifies parameters to be used with the user-defined shared library (.so file).

 ___________
|           |
| XML file  |                      ________________        
|___________|                     |                |
             \  _________         |                |
              \|         |        |                |
               |   al3c  | -----> | desired output |
              /|_________|        |                |
 ___________ /                    |                |
|           |                     |________________|
| .so file  |
|___________|

No modification is necessary to the al3c binary-- it is readily available in the git repository.

We must create a shared library which will specify the:

  • parameter space
  • prior distribution
  • parameter perturbation kernel
  • simulation model
  • distance function to compare observed & simulated data

and an XML configuration file, which will specify

  • the location of the shared library (.so file)
  • the observed data
  • number of processors to use
  • basic ABC-SMC parameters

A rough diagram of the relation between functions in our shared library follows:

void user_t::prior();
        |               +-----------------------+
        |               |                       |
        |               V                       |
        |        user_t::perturb();<-----+      |
        |               |                |      |
        |               |                |      |
        |               V                |      |
        +-----> user_t::simulate();      |      |
                        |                |      |
                        |                |      |
                        V                |      |
                user_t::distance();------+      |
                        |                       |
                        |   prior_density();    |        
                     (weight)                   |
                        |   perturb_density();  |
                        |                       |
                        V                       |
                user_summary_t::summarize();    |
                        |                       |
                        |                       |
                        V                       |
                user_t::print(); ---------------+
                        |
                        |
                        V
                  (output file)

###Compiling the al3c binary

Note that compiling al3c requires MPI libraries be installed. To avoid installing MPI libraries on a Linux 64-bit or OS X system, you may instead use the static binary "bin/al3c_linux_x86-64" (Linux) or "bin/al3c_darwin_x86-64" (OS X) and skip this step.

The following commands will obtain the al3c source code and and compile the al3c binary

git clone http://github.com/ahstram/al3c.git
cd al3c
make

###Coding the shared library Shared libraries for use with al3c should be written in C++, and include the "al3c.hpp" file which is found in the "include" directory.

al3c.hpp defines two functions:

u01, which must be used as the shared library's random number generator, or used to seed any external random number generator, in order to ensure that different processors do not use the same random number seed.

exec_cmd, an optional function call external programs to simulate data

al3c.hpp declares two templates:

user_t, the users' parameters of interest

user_summary_t, a summary statistic for user_t

The user must then define all functions listed in C/C++ Requirements, in order to complete the definition of these two templates.

For a sample shared library's source code, please see this sample. ###Compiling the shared library

We may compile our source file as a shared library using the "-shared" and "-fPIC" compilers flags, like such:

        gcc -shared -o macs.so -fPIC macs.so.cpp

###XML configuration

The XML configuration must define all nodes listed in XML Requirements.

A sample XML file is available here.

###Running al3c al3c is run with the XML files to load as its sole argument, for example:

bin/al3c config/macs.xml

##Reference Manual

1 C/C++ Provisions
        1.1 Variables
                1.1.1 param 
                1.1.2 param_summary
                1.1.3 S
                1.1.4 O
                1.1.5 N
                1.1.6 D
                1.1.7 d
                1.1.8 w
        1.2 Functions
                1.2.1 u01
                1.2.2 exec_cmd
2 C/C++ Requirements (sample)
        2.1 Parameter & Parameter Summary Statistic Data Types
                2.1.1 param_t
                2.1.2 param_summary_t
        2.2 Printing Parameter Data Type & Summarizing Acceptances
                2.2.1 user_t::print
                2.2.2 user_summary_t::summarize
        2.3 Generating the Prior Distribution
                2.3.1 user_t::prior 
                2.3.2 user_t::prior_density
        2.4 Perturbing Parameters
                2.4.1 user_t::perturb
                2.4.2 user_t::perturb_density
        2.5 Simulating Data & Measuring Goodness of Fit
                2.5.1 user_t::simulate
                2.5.2 user_t::distance
3 XML Requirements (sample)
        3.1 lib - Shared library
        3.2 MPI
                3.2.1 NP - Number of Processors
        3.3 ABC
                3.3.1 G - Generations
                3.3.5 R - Rank
                3.3.4 E - Epsilon
                3.3.3 A - Acceptances
                3.3.2 T - Trials
        3.4 O - Observed data

###C/C++ Provisions

####u01

float u01();

Uniform random number generator

Example

seed=u01();
seed_rng(seed);

while (1) {
        x=rng();
        ...
}

####exec_cmd

uint exec_cmd(const char *cmd);

Run command via system call

Example

See simulate()

###C/C++: Parameter & Parameter Summary Statistics

####param_t

struct param_t;

A user defined struct with the parameters we are investigating

Example

struct param_t {

        float MigrationRate_EurToAfr,
              MigrationRate_AsnToAfr,
              MigrationRate_AsnToEur,
              EffectivePopulationSize_Afr,
              GrowthRate_Eur,
              GrowthRate_Asn,
              PastEvent_EurToAfrMigration;
};

####print

const char *framework_t::print();

Format parameter for printing

Relevant Variables

param_t *param // parameter to format 
float *d //  the distance of param's simulation
float *w // // the weight of param
bool header // printing header or not

Return Value

const char *d <return value>;

Example

user_t::print(bool header) {

        ofstringstream output;

        if (header)
                output<<"#distance        weight        MigrationRate_EurToAfr        MigrationRate_AsnToAfr        MigrationRate_AsnToEur        EffectivePopulationSize_Afr        GrowthRate_Eur        GrowthRate_Asn        PastEvent_EurToAfrMigration"<<endl;
        else
                output<<*d<<*w<<param->MigrationRate_EurToAfr<<param->MigrationRate_AsnToAfr<<param->MigrationRate_AsnToEur<<param->EffectivePopulationSize_Afr<<param->GrowthRate_Eur<<param->GrowthRate_Asn<<param->PastEvent_EurToAfrMigration<<endl;
                
        return output.str().c_str();
}

####param_summary_t

struct param_summary_t;

A user defined struct giving summary statistics necessary for a dynamic perturbation kernel

Example

struct param_summary_t {

        float MigrationRate_EurToAfr_Variance,
              MigrationRate_AsnToAfr_Variance,
              MigrationRate_AsnToEur_Variance,
              EffectivePopulationSize_Afr_Variance,
              GrowthRate_Eur_Variance,
              GrowthRate_Asn_Variance,
              PastEvent_EurToAfrMigration_Variance;

};

####summarize

framework_t::summarize();

Write to param_summary_t based on an array of param_t's

Relevant Variables

param_t **params; // parameter to format 
uint A; // number of parameters 

Return Value

param_summary_t *param_summary; // Desired statistics of accepted parameters

Example

user_summary_t::summarize() {

        float m1=0,m2=0;

        for (uint a=0;a<A;a++) {
                m1+=params[a]->MigrationRate_EurToAfr;
                m2+=params[a]->MigrationRate_EurToAfr*params[a]->MigrationRate_EurToAfr;
        }
        m1/=(float)A;
        m2/=(float)A;

        param_summary->MigrationRate_EurToAfr_Variance=m2-m1*m1;

/*
        ...
        
        (repeat for other parameters)

        ...
*/

        m1=0, m2=0;

        for (uint a=0;a<A;a++) {
                m1+=params[a]->PastEvent_EurToAfrMigration;
                m2+=params[a]->PastEvent_EurToAfrMigration*params[a]->MigrationRate_PastEvent_EurToAfrMigration;
        } 
        m1/=(float)A;
        m2/=(float)A;

        param_summary->PastEvent_EurToAfrMigration_Variance=m2-m1*m1;


}

###C/C++: Generating the Prior Distribution

####prior

void framework_t::prior();

Generate random parameters from the prior distribution

Return Variables

param_t *param; // Parameter to write to according to the desired prior distribution

Example

void user_t::prior() {

        param->MigrationRate_EurToAfr=u01()*0.4+0.84; // Unif[0.84,1.24]
        param->MigrationRate_AsnToAfr=u01()*0.32+0.16; // Unif[0.16,0.48]
        param->MigrationRate_AsnToEur=u01()*0.84+0.72; // Unif[0.72,1.56]
        param->EffectivePopulationSize_Afr=u01()*0.2994+1.319; // Unif[1.319, 1.6184]
        param->GrowthRate_Eur=u01()*0.31+0.28; // Unif[0.28, 0.59]
        param->GrowthRate_Asn=u01()*0.45+0.30; // Unif[0.30, 0.75]
        param->PastEvent_EurToAfrMigration=u01()*2.8+4.8; // Unif[4.8,7.6]
}

####prior_density

framework_t::prior_density();

Give the probability density of a parameter generated by prior()

Relevant Variables

param_t *param; // A parameter for which we want the pdf of according to the prior distribution

Return Variables

float <return value>; // The density of param according to the prior distribution

Example

void user_t::prior_density() {

        if (0.84<=param->MigrationRate_EurToAfr && param->MigrationRate_EurToAfr<=1.24
                 && 0.16<=param->MigrationRate_AsnToAfr && param->MigrationRate_AsnToAfr<=0.48
                 && 0.72<=param->MigrationRate_AsnToEur && param->MigrationRate_AsnToEur<=1.56
                 && 1.319<=param->EffectivePopulationSize_Afr && param->EffectivePopulationSize_Afr<=1.6184
                && 0.28<=param->GrowthRate_Eur && param->GrowthRate_Eur<=0.59
                 && 0.3<=param->GrowthRate_Asn && param->GrowthRate_Asn<=0.75
                && 4.8<=param->PastEvent_EurToAfrMigration && param->PastEvent_EurToAfrMigration<=7.6)
                return 1;
        else
                return 0;
}

###C/C++: Perturbing Variables

####perturb

void framework_t::perturb();

Perturb parameters according to a desired perturbation kernel

Relevant Variables

param_summary_t *param_summary; // A user defined struct giving summary statistics of the last generation's accepted parameters
param_t *param; // The parameter to be perturbed

Return Variables

param_t *param; // The perturbed parameter

Example

void user_t::perturb() {

        param->MigrationRate_EurToAfr+=(u01()-0.5f)*sqrt(2*param_summary->MigrationRate_EurToAfr_Variance*12);

        /* ... repeat for other parameters ... */

        param->PastEvent_EurToAfrMigration+=(u01()-0.5f)*sqrt(2*param_summary->PastEvent_EurToAfrMigration*12);
}

####perturb_density

float framework_t::perturb_density(param_t *old_param);

Give the probability density of a parameter perturbation

Relevant Variables

param_t *old_param; // A parameter from last generation's acceptances
param_t *param; // The perturbed parameter
param_summary_t *summary; // A user defined struct giving summary statistics of the last generation's accepted parameters

Return Variables

float <return value> // The probability density of param being perturbed from old_param, according to the perturbation kernel specified in perturb()

Example

float user_t::perturb_density() {

        if ( fabs(param->MigrationRate_EurToAfr - old_param->MigrationRate_EurToAfr) > sqrt(2*param_summary->MigrationRate_EurToAfr_Variance*12)/2.f ) 
                return 0.f;

        /* ... repeat for other parameters ... */

        if ( fabs(param->PastEvent_EurToAfrMigration - old_param->PastEvent_EurToAfrMigration) > sqrt(2*param_summary->PastEvent_EurToAfrMigration_Variance*12)/2.f ) 
                return 0.f;

        return 1.f;        
}

###C/C++: Data simulation

####simulate

framework_t::simulate();

Simulate data according to given parameters

Revelant Variables

param_t *param; // The parameter to simulate with
uint N; // The number of rows of simulated/observed data
uint D; // The number of columns (<b>D</b>imensions) of simulated/observed data

Return Variables

float **S; // An N*D array of floating points containing simulated data

Example

user_t::simulate() {

        ostringstream cmd;

        uint seed=(uint)u01()*UINT_MAX;

        cmd<<"macs.sh 718 100000 -s "<<seed<<" -t .001 -I 3 176 170 372 0 -m 2 1 "<<param->MigrationRate_EurToAfr<<" -m 3 1 "<<param->MigrationRate_AsnToAfr<<" -m 3 2 "<<param->MigrationRate_AsnToEur<<" -n 1 "<<param->EffectivePopulationSize_Afr<<" -g 2 "<<param->GrowthRate_Eur<<" -g 3 "<<param->GrowthRate_Asn<<" -eg .0230000 2 0 -eg .0230001 3 0 -ej .0230002 3 2 -em .0230003 2 1 "<<param->PastEvent_EurToAfrMigration<<" -en .0230004 2 0.1861 -ej .051 2 1 -en .148 1 0.731 -r 0.0006"<<endl;

        //this is a helper function that will write to **S
        exec_cmd(cmd.str().c_str());
}

####distance

float framework_t::distance();

Give the distance between simulated & observed data

Relevant Variables

uint N; // The number of rows of simulated/observed data
uint D; // The number of columns ("D"imensions) of simulated/observed data
float **S; // An N*D matrix of floating points containing simulated data
float **O; // An N*D matrix of floating points containing observed data

Return Variables

float <return value>; // The distance between observed & simulated data

Example

user_t::distance() {

        float r=0;

        for (uint n=0;n<N;n++)
                for (uint d=0;d<D;d++)
                        r+=pow(S[n][d]-O[n][d],2);

        return sqrt(r);
}

###XML: lib

####lib

C++ library to load

Notes

Example

<lib>lib/libmacs.so</lib>

###XML: MPI

####NP Number of processors to use

Notes

This should be the total number of processors al3c will be run with. If there are 8 nodes, each with 8 processors, we specify NP as "64" and let the MPI system do the rest.

Example

<MPI>
        <NP>64</NP>  
</MPI>

###XML: ABC

####G Minimum number of generations of ABC-SMC to run before al3c quits

Acceptable values

integers in {0,1,...}

Notes

This is the minimum number of generations of ABC-SMC to run.

al3c will not quit until both "G" and "E" are satisfied. If you'd like to quit strictly based on "G", set "E" to be "0".

Example

<ABC>
        <G>15</G>  
</ABC>

####R Rank of accepted parameter to set next generation's epsilon to

Acceptable values

integers in {1,2,...,A}

integers in {1,2,...,A}G

Notes

After each generation, the distances for each accepted parameter are sorted, least to greatest. If we would like to set the next generation's epsilon ("acceptance threshold") according to the present generation's accepted parameter distances, we do so here. This allows us to adaptively set our epsilon schedule according to quantiles, rather than having to set it explicitly beforehand.

If "A" is set to "1000", and we want to set our epsilon to the simulation with the 250th best rank, we set "250" here. All accepted simulations in the next generation will have a distance less than or equal to this generation's 250th best simulation.

If we would like to change the quantile used for each generation, we can specify R as a tab/space delimited vector.

0 means "ignore". al3c will fall back on an "E" vector to set its epsilon schedule. If "E" is a scalar, acceptances will be made according to "A" and "T" only.

Example

<ABC>
        <R>250</R>  
</ABC>

<!-- Or, if we'd like to vary the rank over generations, do it like this: -->

<ABC>
        <R>750 500 400 300 250 250 250 250 250 250 250 250 250 300 400</R>
</ABC>

####E Maximum distance to observed data all accepted simulations in a generation must have before al3c quits

Acceptable values

reals in [0,∞) strictly decreasing reals in [0,∞)G

Notes

When all accepted parameters in the most recent generation have simulated datasets with a distance from observed data less than or equal to "E", we quit.

If you desire to explicitly set a rejection threshold schedule ("epsilon schedule"), you can do so by letting this value be a vector of G strictly decreasing values, delimited by spaces or tabs. In that case, "R" must be set to "0". Otherwise, use "R" for dynamic rank-based epsilon schedules.

al3c will not quit until both "G" and "E" are satisfied. If you'd like to quit strictly based on "E", set "G" to be "0".

Example

<ABC>
        <E>1.4</E>  
</ABC>

<!-- Or, if we'd like to set an explicit epsilon schedule, do it like this: -->

<ABC>
        <E>100 90 80 70 60 50 40 30 20 10 5 4 3 2 1</E>  
</ABC>

####A Number of accepted trials ("acceptances") per generation

Acceptable values

integers in {1,2,...,T}

Notes

This is the number of accepted parameters we desire per generation.

The actual value used may be slightly higher than specified here, so that "NP" divides "A", allowing for even allocation of computation across processors.

Example

<ABC>
        <A>1000</A>  
</ABC>

####T Minimum number of trials per generation

Acceptable values

integers in {1,2,...}

Notes

This is the minimum number of trials we must have per generation. Each trial is either accepted or rejected. There may be more trials than specified here depending on the values of "E" and "R" are set.

The actual value used may be slightly higher than specified here, so that "NP" divides "T", allowing for even allocation of computation across processors.

Example

<ABC>
        <T>10000</T>  
</ABC>

###XML: O

####O Observed data to simulate

Notes

This can be a 2-dimensional matrix, rows delimited by newlines, columns delimited by tabs/spaces/columns. The number of rows and columns will be counted, and the values of "N" and "D" will be inferred accordingly.

Example

<O>259        108        103        147
119        53        66        132
100        46        61        128</O>

Releases

No releases published

Packages

No packages published