## (&#x1F4D7;) ipyrad Cookbook: `abba-baba` admixture tests

The `ipyrad.analysis` Python module includes functions to calculate abba-baba admixture statistics (including several variants of these measures), to perform significance tests, and to produce plots for summarizing and interpreting results. All code in this notebook is written in Python, which you can copy/paste into an IPython terminal to execute, or, preferably, run in a Jupyter notebook like this one. See the other analysis cookbooks for [instructions](http://ipyrad.readthedocs.io/analysis.html) on using Jupyter notebooks. All of the software required for this tutorial is included with `ipyrad` (v.6.12+).

### Load packages

In [80]:
import ipyrad.analysis as ipa
import ipyparallel as ipp

### Connect to cluster
The code can be easily parallelized across the cores of your machine, or across many nodes of an HPC cluster, using the `ipyparallel` library (see our [ipyparallel tutorial]()). Before you can connect, an `ipcluster` instance must be running; you can start one by running `'ipcluster start'` in a terminal.

In [81]:
ipyclient = ipp.Client()

### Load in your .loci data file and a tree hypothesis

In [69]:
locifile = "./analysis-ipyrad/ped-min4_outfiles/ped-min4.loci"
newick = "./analysis-tetrad/pedtest1.full.tre"

In [38]:
## parse the newick tree, re-root it, and plot it.
tre = ipa.tree(newick=newick)
tre.root(wildcard="prz")
tre.draw(vsize=8, ewidth=2);

## store rooted tree back into a newick string.
newick = tre.tree.write()

### Short tutorial: calculating abba-baba statistics
To give a gist of what this code can do, here is a quick tutorial version, each step of which we explain in greater detail below. We first create a `'baba'` analysis object that is linked to our data file, then we tell it which tests to perform (here we auto-generate a number of tests using the `generate_tests_from_tree()` function), and then we calculate the results and plot them. 

In [31]:
## create a baba object linked to data file and newick tree
bb = ipa.baba(data=locifile, newick=newick)

## generate all possible abba-baba tests meeting a set of constraints
bb.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno"],
    })

## run all tests linked to bb 
bb.run(ipyclient)

## print the results table
print bb.results_table

## save the results table to a csv file
bb.results_table.to_csv("bb.abba-baba.csv", sep="\t")

  44 tests generated from tree
  [####################] 100%  calculating D-stats  | 0:02:39 |  
       dstat  bootmean   bootstd         Z        ABBA        BABA    nloci
0  -0.080823 -0.078616  0.034492  2.279212  369.078125  433.984375   8922.0
1  -0.108041 -0.107661  0.043234  2.490187  263.281250  327.062500   6692.0
2  -0.126447 -0.125456  0.038744  3.238067  305.375000  393.781250   8192.0
3  -0.081898 -0.082782  0.035068  2.360593  358.031250  421.906250   8876.0
4  -0.110147 -0.112951  0.040461  2.791580  256.500000  320.000000   6670.0
5  -0.122080 -0.122642  0.040290  3.043995  300.687500  384.312500   8150.0
6  -0.084039 -0.085001  0.034793  2.443077  337.875000  399.875000   8366.0
7  -0.111342 -0.111351  0.043323  2.570227  240.187500  300.375000   6328.0
8  -0.135614 -0.134856  0.038726  3.482285  280.250000  368.187500   7697.0
9   0.028870  0.024802  0.051120  0.485173  184.875000  174.500000   6216.0
10 -0.165644 -0.166281  0.062270  2.670303   29.750000   41.562500 

### Plotting and interpreting results
Interpreting the results of D-statistic tests is actually *very* complicated. You cannot treat every test as if it were independent because introgression between one pair of species may cause one or both of those species to *appear* as if they have also introgressed with other taxa in your data set. This problem is described in great detail in [this paper (Eaton et al. 2015)](http://onlinelibrary.wiley.com/doi/10.1111/evo.12758/abstract). A good place to start, then, is to perform many tests and focus on those which have the strongest signal of admixture. Then, perform additional tests, such as `partitioned D-statistics` (described further below) to tease apart whether a single or multiple introgression events are likely to have occurred. 

In the example plot below we find evidence of admixture between the sample **33413_thamno** (black) and several other samples, but the signal is strongest with respect to **30556_thamno** (test 33). It also appears that admixture is consistently detected with samples of (**40578_rex** & **35855_rex**) when contrasted against **35236_rex** (tests 14, 20, 29, 32).

In [40]:
## plot the results, showing here some plotting options.
bb.plot(height=800, 
        pct_tree_y=0.15,  
        tree_style='c',
        ewidth=2, 
        alpha=4.,
        style_test_labels={"font-size":"10px"});

## Full Tutorial

### Creating a `baba` object

The fundamental object for running abba-baba tests is the `ipa.baba()` object. This stores all of the information about the data, tests, and results of your analysis, and is used to generate plots. If you only have one data file that you want to run many tests on then you will only need to enter the path to your data once. The data file must be a `'.loci'` file from an ipyrad analysis. In general, you will probably want to use the largest data file possible for these tests (`min_samples_locus=4`), to maximize the amount of data available for any test. Once a `baba` object is created you can make copies of it to perform different tests, as shown below.

In [82]:
## create an initial object linked to your data in 'locifile'
aa = ipa.baba(data=locifile)

## create two other copies
bb = aa.copy()
cc = aa.copy()

## print these objects
print aa
print bb
print cc

<ipyrad.analysis.baba.Baba object at 0x7f96be806f10>
<ipyrad.analysis.baba.Baba object at 0x7f96bd4aa2d0>
<ipyrad.analysis.baba.Baba object at 0x7f96bf0bad90>


### Linking tests to the baba object

The next thing we need to do is to link a `'test'` to each of these objects, or a list of tests. In the [Short tutorial](#Short-tutorial:-calculating-abba-baba-statistics) above we auto-generated a list of tests from an input tree, but to be more explicit about how things work we will write out each test by hand here. A test is described by a Python dictionary that tells it which samples (individuals) should represent the 'p1', 'p2', 'p3', and 'p4' taxa in the ABBA-BABA test. You can see in the example below that we set two samples to represent the outgroup taxon (p4). This means that the SNP frequency for those two samples combined will represent the p4 taxon. For the `baba` object named `'cc'` below we enter two tests using a list to show how multiple tests can be linked to a single `baba` object. 

In [83]:
aa.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"], 
    "p2": ["33413_thamno"], 
    "p1": ["40578_rex"],
}

bb.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"], 
    "p2": ["33413_thamno"], 
    "p1": ["40578_rex"],
}

cc.tests = [
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41954_cyathophylloides"], 
     "p2": ["33413_thamno"], 
     "p1": ["40578_rex"],
    },
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41478_cyathophylloides"], 
     "p2": ["33413_thamno"], 
     "p1": ["40578_rex"],
    },
]
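
Because a test is just a dictionary, it can be sanity-checked before it is linked to a `baba` object. The helper below is a hypothetical sketch and is not part of `ipyrad`. It assumes that a four-taxon test must define the keys 'p1' to 'p4', that every value is a non-empty list of sample names, and that no sample is assigned to more than one tip.

```python
def validate_test(test, keys=("p1", "p2", "p3", "p4")):
    """Hypothetical helper (not an ipyrad function): raise ValueError if
    a test dictionary is malformed, otherwise return True."""
    missing = [key for key in keys if key not in test]
    if missing:
        raise ValueError("missing keys: {}".format(missing))

    seen = set()
    for key in keys:
        samples = test[key]
        if not isinstance(samples, list) or not samples:
            raise ValueError("{} must be a non-empty list".format(key))
        overlap = seen.intersection(samples)
        if overlap:
            raise ValueError("samples used twice: {}".format(sorted(overlap)))
        seen.update(samples)
    return True


## the same test that is linked to object 'aa' above
example = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}
validate_test(example)
```

For a list of tests, such as the one linked to `'cc'`, the same check could be applied to each dictionary in a loop.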

### Other parameters
Each `baba` object has a set of parameters associated with it that are used to filter the loci that will be used in the test and to set some other optional settings. If the `'mincov'` parameter is set to 1 (the default) then a locus will only be used in a test if at least one sample from every tip of the tree has data for that locus. For example, in the tests above where we entered two samples to represent "p4", only one of those two samples *needs* to be present for the locus to be included in our analysis. If you want to require that both samples have data at the locus in order for it to be included then you could set `mincov=2`. However, for the tests above setting `mincov=2` would filter out *all* of the data, because 'p3', 'p2', and 'p1' each have only one sample and so can never reach a coverage of 2. Therefore, you can also enter the `mincov` parameter as a dictionary setting a different minimum for each tip taxon, which we demonstrate below for the `baba` object `'bb'`.

In [84]:
## print params for object aa
aa.params

database   None                
mincov     1                   
nboots     1000                
quiet      False               

In [85]:
## set the mincov value as a dictionary for object bb
bb.mincov = {"p4":2, "p3":1, "p2":1, "p1":1}
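
To make the effect of `mincov` concrete, here is a minimal sketch of the filtering rule as described above. It is not the code that `ipyrad` runs internally; the function name `locus_passes` and the way a locus is represented (a set of sample names that have data) are assumptions made for illustration.

```python
def locus_passes(present, test, mincov=1):
    """Hypothetical sketch of the mincov filter (not ipyrad's code).

    present: set of sample names that have data at this locus
    test:    dict mapping tip names ('p1'..'p4') to lists of samples
    mincov:  int applied to every tip, or dict with one minimum per tip
    """
    for tip, samples in test.items():
        minimum = mincov[tip] if isinstance(mincov, dict) else mincov
        covered = sum(1 for name in samples if name in present)
        if covered < minimum:
            return False
    return True


test = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

## a locus where every sample in the test has data
full = set(name for samples in test.values() for name in samples)
## a locus where one of the two outgroup samples is missing
partial = full - {"33588_przewalskii"}

per_tip = {"p4": 2, "p3": 1, "p2": 1, "p1": 1}
print(locus_passes(partial, test, mincov=1))        # one p4 sample is enough
print(locus_passes(full, test, mincov=2))           # single-sample tips cannot reach 2
print(locus_passes(full, test, mincov=per_tip))     # both p4 samples are present
print(locus_passes(partial, test, mincov=per_tip))  # p4 is short by one sample
```

The second call shows why a single integer `mincov=2` removes every locus for these tests, while the dictionary form keeps loci in which both outgroup samples are present.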


### Running the tests
When you execute the `'run()'` command all of the tests for the object will be distributed to run in parallel on your cluster (or the cores available on your machine) as connected to your `ipyclient` object. The results of the tests will be stored in your `baba` object under the attributes `'results_table'` and `'results_boots'`. 

In [86]:
## run tests for each of our objects
aa.run(ipyclient)
bb.run(ipyclient)
cc.run(ipyclient)

  [####################] 100%  calculating D-stats  | 0:00:19 |  
  [####################] 100%  calculating D-stats  | 0:00:19 |  
  [####################] 100%  calculating D-stats  | 0:00:25 |  


### The results table
The results of the tests are stored as a data frame (pandas.DataFrame) in `results_table`, which can be easily accessed and manipulated. The tests are listed in order and can be referenced by their `'index'` (the number in the left-most column). For example, below we see the results for object `'cc'` tests 0 and 1. You can see which taxa were used in each test by accessing them as `'cc.tests[0]'` or `'cc.tests[1]'`. An even better way to see which individuals were involved in each test, however, is to use our plotting functions, which we describe further below.

In [87]:
## print the results
print cc.results_table

## you can sort the results by Z-score (this returns a new, sorted
## DataFrame; the table stored in cc.results_table is left unchanged)
cc.results_table.sort_values(by="Z", ascending=False)

## save the table to a file 
cc.results_table.to_csv("cc.abba-baba.csv")

      dstat  bootmean   bootstd         Z    ABBA     BABA   nloci
0  0.040534  0.044190  0.057974  0.762233  126.75  116.875  5433.0
1  0.025745  0.027328  0.057322  0.476742  127.00  120.625  5433.0


### Auto-generating tests
Entering all of the tests by hand can be a pain, which is why we wrote functions to auto-generate tests given an input **rooted** tree and a number of constraints on the tests to generate from that tree. It is important to add constraints, otherwise the number of tests that can be produced becomes very large very quickly. Calculating results runs pretty fast, but summarizing and interpreting thousands of results is pretty much impossible, so it is generally better to limit the tests to those which make some intuitive sense to run. You can see in this example that implementing constraints reduces the number of tests from 1608 to 13.

In [51]:
dd = bb.copy()
dd.newick = newick

## all possible tests
dd.generate_tests_from_tree()

## constrained number of tests
constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["40578_rex", "35855_rex"],
    }
dd.generate_tests_from_tree(constraint_dict=constraint_dict, constraint_exact=False)

## further constrained tests
dd.generate_tests_from_tree(constraint_dict=constraint_dict, constraint_exact=True)

  1608 tests generated from tree
  117 tests generated from tree
  13 tests generated from tree


In [56]:
dd.run(ipyclient)
print dd.results_table
dd.plot(height=500, pct_tree_y=0.2, alpha=4., tree_style='c');

       dstat  bootmean   bootstd         Z        ABBA        BABA    nloci
0  -0.106317 -0.106516  0.026445  4.027774  623.281250  771.578125  14925.0
1  -0.099541 -0.099239  0.031920  3.108960  441.421875  539.015625  10874.0
2  -0.130011 -0.129287  0.030074  4.298969  535.750000  695.875000  13558.0
3  -0.092725 -0.093508  0.025917  3.607933  611.156250  736.078125  14861.0
4  -0.093148 -0.092276  0.031281  2.949909  434.296875  523.515625  10839.0
5  -0.112616 -0.112537  0.029607  3.801033  529.421875  663.796875  13504.0
6  -0.119912 -0.119424  0.027591  4.328389  563.703125  717.312500  13842.0
7  -0.115168 -0.115400  0.033445  3.450443  397.234375  500.640625  10183.0
8  -0.144355 -0.145753  0.030007  4.857237  491.046875  656.734375  12614.0
9   0.023032  0.023117  0.039497  0.585292  311.625000  297.593750   9960.0
10 -0.227141 -0.226871  0.055667  4.075483   54.015625   85.765625  14823.0
11 -0.049829 -0.049580  0.022499  2.203656  900.093750  994.500000  15238.0
12 -0.008594

### More about input file paths (i/o)
The default (required) input data file is the `.loci` file produced by `ipyrad`. When performing D-statistic calculations this file will be parsed to retain the maximal amount of information useful for each test. An additional optional input file that you can enter is a newick string tree file. You do not *need* a tree to run ABBA-BABA tests, but you do need at least *a hypothesis* for how your samples are related to one another in order to set up your tests. By loading in a tree for your data set we can use it to easily set up hypotheses to test, and to plot results on the tree.

In [76]:
## path to a locifile created by ipyrad
locifile = "./branch-test/pedtest_outfiles/pedtest.loci"

## path to an unrooted tree inferred with tetrad
newick = "./analysis-tetrad/ped-min4.tree"

### (optional): root the tree
For abba-baba tests you will pretty much always want your tree to be rooted, since the test relies on an assumption about which alleles are ancestral. We have created a simple tree plotting library for `ipyrad` which uses Toyplot as its plotting backend, and `ete3` as its tree manipulation backend. Using this you can re-root your tree. 

Using the `ipa.tree()` function load in a newick string as a tree object and then root the tree on the two *P. przewalskii* samples using the `root()` function. You can either enter the names of the outgroup samples explicitly or enter a wildcard to select them. We show the rooted tree from a tetrad analysis below. 

Lastly, save the rooted tree back as a newick string. We will pass the newick string representation to our `baba` analysis objects when we create them.   

In [78]:
## load in the tree
tre = ipa.tree(newick)

## set the outgroup either as a list or using a wildcard selector
tre.root(outgroup=["32082_przewalskii", "33588_przewalskii"])
tre.root(wildcard="prz")

## draw the tree
tre.draw();

## save the rooted newick string back to a variable and print
newick = tre.newick

### Interpreting results

You can see in the `results_table` above that the D-statistic ranged between -0.5 and 0.5. These values are not very informative on their own, so we instead generally focus on the Z-score, which represents how far the distribution of D-statistic values across bootstrap replicates deviates from its expected value of zero. The default number of bootstrap replicates to perform per test is 1000. Each replicate resamples the loci (`nloci` of them) with replacement.

As you can see from the counts of ABBA and BABA patterns, the two tend to occur in fairly equal proportions. The values are calculated using SNP frequencies, which is why they are floats instead of integers, and this is also why you are able to combine multiple samples to represent a single tip in the tree (e.g., see the test we set up above).

In [89]:
print cc.results_table

      dstat  bootmean   bootstd         Z    ABBA     BABA   nloci
0  0.040534  0.044190  0.057974  0.762233  126.75  116.875  5433.0
1  0.025745  0.027328  0.057322  0.476742  127.00  120.625  5433.0
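
The bootstrap procedure described above can be sketched in a few lines of plain Python. This is an illustration rather than the `ipyrad` implementation: the per-locus ABBA and BABA scores are randomly generated toy values, and the use of the population standard deviation is an assumption. The Z values in the table above are consistent with `|bootmean| / bootstd` (for test 0, 0.044190 / 0.057974 is about 0.762), so that is the definition used here.

```python
import math
import random


def dstat(abba, baba):
    """D = (sum ABBA - sum BABA) / (sum ABBA + sum BABA) across loci."""
    total_abba, total_baba = sum(abba), sum(baba)
    return (total_abba - total_baba) / (total_abba + total_baba)


def bootstrap_z(abba, baba, nboots=1000, seed=12345):
    """Resample loci with replacement and summarize the D distribution.
    Returns (bootmean, bootstd, Z)."""
    rng = random.Random(seed)
    nloci = len(abba)
    boots = []
    for _ in range(nboots):
        ## draw nloci locus indices with replacement
        idxs = [rng.randrange(nloci) for _ in range(nloci)]
        boots.append(dstat([abba[i] for i in idxs], [baba[i] for i in idxs]))
    bootmean = sum(boots) / nboots
    ## population standard deviation (an assumption of this sketch)
    bootstd = math.sqrt(sum((b - bootmean) ** 2 for b in boots) / nboots)
    return bootmean, bootstd, abs(bootmean) / bootstd


## toy per-locus scores; floats because they come from SNP frequencies
toy = random.Random(1)
abba = [toy.random() for _ in range(200)]
baba = [toy.random() for _ in range(200)]

bootmean, bootstd, z = bootstrap_z(abba, baba, nboots=200)
print(dstat(abba, baba), bootmean, bootstd, z)
```

Because whole loci are resampled, SNPs that are linked within a locus stay together in every replicate.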


### Run many tests (without having to write them out by hand)
Writing many tests out by hand can quickly become cumbersome as the number grows large. For that reason we provide a convenient function, `.generate_tests_from_tree()`, for generating all possible tests on a given topology. This is useful for exploring your data and generating hypotheses. Once you've figured out which tests are interesting, you may wish to only retain those with interesting results for your final tables and figures. 

You can restrict the tests that will be generated with this function by using `constraints`, which will return only tests that meet your constraint requirements. In the example below when we allow all possible tests to be generated it creates 859 tests, which is far too many results to summarize easily. If we constrain the tests to those which are most relevant that number can be greatly reduced. In this example we set the two *P. przewalskii* samples as the outgroup, and set *P. cyathophylla* as the P3 taxon. This reduces the number of tests to 33. By default, this will allow tests in which either one of the two *P. przewalskii* samples is the outgroup, or both. We can further restrict it to only tests that meet our constraints *exactly*, which in this case means that both *P. przewalskii* samples represent the outgroup (meaning their pooled SNP frequency is always calculated) and not just one. This is enforced with the argument `constraint_exact=True`. This reduces the number of tests to just 11.
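
The difference between the default and the exact matching of constraints can be illustrated with a small filter over candidate tests. This is a hypothetical sketch of the matching rule as described in the paragraph above, not the code behind `generate_tests_from_tree()`; the function name `meets_constraints` and the three hand-written candidate tests are assumptions made for the example.

```python
def meets_constraints(test, constraint_dict, constraint_exact=False):
    """Hypothetical sketch (not ipyrad's implementation) of how a
    constraint_dict could be matched against one candidate test."""
    for tip, allowed in constraint_dict.items():
        chosen, allowed = set(test[tip]), set(allowed)
        if constraint_exact:
            ## the tip must contain exactly the listed samples
            if chosen != allowed:
                return False
        elif not chosen or not chosen <= allowed:
            ## the tip may contain any non-empty subset of them
            return False
    return True


constraints = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"],
}

candidates = [
    ## both outgroup samples pooled
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["30686_cyathophylla"], "p2": ["33413_thamno"], "p1": ["40578_rex"]},
    ## only one outgroup sample
    {"p4": ["32082_przewalskii"],
     "p3": ["30686_cyathophylla"], "p2": ["33413_thamno"], "p1": ["40578_rex"]},
    ## a different p3 taxon
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["29154_superba"], "p2": ["33413_thamno"], "p1": ["40578_rex"]},
]

loose = [t for t in candidates if meets_constraints(t, constraints)]
exact = [t for t in candidates if meets_constraints(t, constraints, True)]
print(len(candidates), len(loose), len(exact))
```

The candidate with a single outgroup sample passes the default matching but is dropped when an exact match is required, which is the same kind of reduction reported above.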

### Assessing significance
The test above does not show significant evidence of admixture between *P. cyathophylla* and any of the *P. rex* or *P. thamnophila* samples. For each test a Z-score is calculated from the distribution of bootstrap replicate D-statistic values, and significant results are indicated by colored D-statistic distributions, whereas non-significant tests are shown in grey. You can modify the default significance value (Z=3.1) to a different value of your choosing by setting the alpha parameter in the plot function. You may wish to choose your significance value based on converting Z-score to a p-value and correcting for multiple tests. Below I show a result that does show significant evidence of admixture. 
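
As a rough guide to choosing a cutoff, the sketch below converts a Z-score to a two-sided p-value under a standard normal distribution and finds the Z cutoff implied by a Bonferroni correction. The choice of a two-sided test, the Bonferroni method, the family-wise error rate of 0.01, and the function names are all assumptions made for this example; the Z-scores are a few values copied from the results tables in this notebook.

```python
import math


def z_to_p(z):
    """Two-sided p-value for a Z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_z(alpha, ntests, lo=0.0, hi=40.0):
    """Z cutoff whose two-sided p-value equals alpha / ntests,
    located by bisection."""
    target = alpha / ntests
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if z_to_p(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


## Z-scores copied from the results tables in this notebook
z_scores = [4.027774, 3.108960, 0.585292]

## family-wise error rate of 0.01 spread over 13 tests
cutoff = bonferroni_z(alpha=0.01, ntests=13)
significant = [z for z in z_scores if z >= cutoff]
print(cutoff, significant)
```

The resulting cutoff could then be passed as the `alpha` argument of the plot function, in the same way that `alpha=4.` is used in the plotting examples in this notebook.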

### Find all tests for a given tree
Using the `tree2tests()` function you can automatically generate a list of all possible 4-taxon tests on a rooted tree. 

For even small trees this can very quickly generate a massive number of tests. You can add constraints on the tests to reduce the number of tests, and restrict it to the tests you are particularly interested in. Here we will require that *P. przewalskii* is the outgroup, and we'll find all possible tests contrasting other samples against "33413_thamno". 

### Running 5-taxon (partitioned) D-statistics
Performing partitioned D-statistic tests is no harder than running the standard four-taxon D-statistic tests. You simply enter your tests with five taxa in them, listed as p1-p5.

## Simulation scenario of 12 taxon tree

Simulated data on a 12 taxon tree. Big split: (A,B,C,D) ((E,F,G,H),(I,J,K,L)). Gene flow occurs from IJ -> H, and from C->B. 



In [9]:
import msprime as ms
import ete3 as ete
import numpy as np
import ipyrad.analysis.baba as baba
tree = baba.Tree()

In [38]:
nreps = 10
admix = None
Ns = int(5e5)
gen = 20

In [80]:
## todo: rotate tip names 90 deg. & offset & show idx for all
## also add hover stats for [edge.length, edge.Ns, node.idx, node.name]
tree.draw(width=500, height=250);

In [184]:
tree.tree.get_leaf_names()

['d', 'c', 'a', 'b', 'h', 'g', 'e', 'f', 'l', 'k', 'i', 'j']

In [200]:
## node ages
Taus = np.array(list(set(tree.verts[:, 1]))) * 1e4 * gen

## The tips samples, ordered alphanumerically
## Population IDs correspond to their indexes in pop config
ntips = len(tree.tree)
names = {name: idx for idx, name in enumerate(sorted(tree.tree.get_leaf_names()))}
pop_config = [
    ms.PopulationConfiguration(sample_size=2, initial_size=Ns)
    for i in range(ntips)
]

## migration matrix all zeros init
migmat = np.zeros((ntips, ntips)).tolist()

## a list for storing demographic events
demog = []

## coalescent times
coals = sorted(list(set(tree.verts[:, 1])))[1:]
for ct in xrange(len(coals)):
    ## check for admix event before next coalescence
    ## ...
    
    ## print coals[ct], nidxs, time
    nidxs = np.where(tree.verts[:, 1] == coals[ct])[0]
    time = Taus[ct+1]

    ## add coalescence at each node
    for nidx in nidxs:
        node = tree.tree.search_nodes(name=str(nidx))[0]

        ## get destination (lowest child idx number), and other
        dest = sorted(node.get_leaves(), key=lambda x: x.idx)[0]
        otherchild = [i for i in node.children if not i.get_leaves_by_name(dest.name)][0]

        ## get source
        if otherchild.is_leaf():
            source = otherchild
        else:
            source = sorted(otherchild.get_leaves(), key=lambda x: x.idx)[0]
        
        ## add coal events
        event = ms.MassMigration(
                    time=int(time),
                    source=names[source.name], 
                    destination=names[dest.name], 
                    proportion=1.0)
        print int(time), source.name, dest.name, [names[source.name], names[dest.name]]
    
        ## ...
        demog.append(event)
        
        
## sim the data
replicates = ms.simulate(
    population_configurations=pop_config,
    migration_matrix=migmat,
    demographic_events=demog,
    num_replicates=10000,
    length=100, 
    mutation_rate=1e-8)

200000 b a [1, 0]
200000 f e [5, 4]
200000 j i [9, 8]
400000 c a [2, 0]
400000 g e [6, 4]
400000 k i [10, 8]
600000 d a [3, 0]
600000 h e [7, 4]
600000 l i [11, 8]
800000 i e [8, 4]
1000000 e a [4, 0]
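
The merge schedule printed above can also be derived without the `baba.Tree` object. The sketch below is a simplified stand-in rather than the code used in this notebook: it assumes the 12-taxon tree is written as nested tuples, that each internal node sits one height unit above its tallest child, that one height unit equals `1e4 * gen` generations (as in the `Taus` calculation), and that alphabetical order of tip names can replace the node `idx` ordering.

```python
## the 12-taxon topology as nested tuples (an assumed representation)
TREE = (
    ((("a", "b"), "c"), "d"),
    (((("e", "f"), "g"), "h"), ((("i", "j"), "k"), "l")),
)

## population IDs follow the alphabetical order of tip names
names = {name: idx for idx, name in enumerate("abcdefghijkl")}


def merge_schedule(tree, gen=20, unit=1e4):
    """Return (time, source, destination) tuples, oldest events last."""
    events = []

    def visit(node):
        ## a tip represents itself and has height 0
        if isinstance(node, str):
            return node, 0
        (left, left_height), (right, right_height) = visit(node[0]), visit(node[1])
        height = 1 + max(left_height, right_height)
        ## lineages move from 'source' into the alphabetically first tip
        dest, source = sorted([left, right])
        events.append((int(height * unit * gen), source, dest))
        return dest, height

    visit(tree)
    return sorted(events, key=lambda event: (event[0], event[2]))


for time, source, dest in merge_schedule(TREE):
    print("{} {} {} {}".format(time, source, dest, [names[source], names[dest]]))
```

Each tuple corresponds to one `ms.MassMigration` event with `proportion=1.0`, in which all lineages of the source population move into the destination population going backward in time.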


In [201]:
replicates.next()


<msprime.trees.TreeSequence at 0x7f742eff8a90>

In [128]:
r, b = baba.baba(replicates, test, nboots=1000)


TypeError: 'source' is not number

In [38]:
np.where(tree.verts[:, 1] == coals[0])[0]

array([ 3,  7, 10])

In [46]:
[i.idx for i in tree.tree.get_descendants() if i.is_leaf()]

[14, 13, 18, 22, 11, 12, 17, 21, 15, 16, 19, 20]

In [42]:
print tree.verts
print tree.edges

[[  7.125   5.   ]
 [ 10.125   3.   ]
 [  9.25    2.   ]
 [  8.5     1.   ]
 [  4.125   4.   ]
 [  6.125   3.   ]
 [  5.25    2.   ]
 [  4.5     1.   ]
 [  2.125   3.   ]
 [  1.25    2.   ]
 [  0.5     1.   ]
 [  9.      0.   ]
 [  8.      0.   ]
 [ 10.      0.   ]
 [ 11.      0.   ]
 [  5.      0.   ]
 [  4.      0.   ]
 [  6.      0.   ]
 [  7.      0.   ]
 [  1.      0.   ]
 [  0.      0.   ]
 [  2.      0.   ]
 [  3.      0.   ]]
[[ 2 13]
 [ 3 11]
 [ 3 12]
 [ 2  3]
 [ 1  2]
 [ 0  1]
 [ 5 18]
 [ 6 17]
 [ 7 15]
 [ 7 16]
 [ 6  7]
 [ 5  6]
 [ 4  5]
 [ 8 22]
 [ 9 21]
 [10 19]
 [10 20]
 [ 9 10]
 [ 8  9]
 [ 4  8]
 [ 0  4]
 [ 1 14]]


In [23]:
for node in tree.tree.traverse("postorder"):
    print node.name

d
c
a
b
3
2
1
h
g
e
f
7
6
5
l
k
i
j
10
9
8
4
0


In [7]:

def demography(nreps, Ns=500000, gen=10, mut=1e-9, mig=1e-9, scen=0):
    
    # years are in units of 1e6 years, divide to get units in generations
    Taus = (np.array([0, 1, 2, 3, 4, 5]) * 1e6) / gen

    # Migration rates C -> B and from IJ -> EF
    m_C_B = mig
    m_IJ_EF = mig
    
    # Population IDs correspond to their indexes in pop_config.
    pop_config = [
        ms.PopulationConfiguration(sample_size=2, initial_size=Ns)
        for i in range(12)]
    
    ## migration matrix all zeros time=0
    migmat = np.zeros((12, 12)).tolist()
    
    ## set up demography
    if scen:
        ## this one is INTO IJ (forward in time), and INTO C
        x = ms.MigrationRateChange(time=0., rate=m_C_B, matrix_index=(2, 1))
        y = ms.MigrationRateChange(time=Taus[1], rate=m_IJ_EF, matrix_index=(8, 4)) 

    else:
        ## this one is INTO EF (forward) and INTO B
        x = ms.MigrationRateChange(time=0, rate=m_C_B, matrix_index=(1, 2))
        y = ms.MigrationRateChange(time=Taus[1], rate=m_IJ_EF, matrix_index=(4, 8)) 
        
    #ms.MigrationRateChange(time=0., rate=m_IJ_EF, matrix_index=(9, 4)) 
    #ms.MigrationRateChange(time=0., rate=m_IJ_EF, matrix_index=(8, 4)) 

    demog = [
        ## initial migration from C -> B
        x,
        ms.MigrationRateChange(time=Taus[1], rate=0),

        # merge events at time 1 (b,a), (f,e), (j,i)
        ms.MassMigration(time=Taus[1], source=1, destination=0, proportion=1.0), 
        ms.MassMigration(time=Taus[1], source=5, destination=4, proportion=1.0), 
        ms.MassMigration(time=Taus[1], source=9, destination=8, proportion=1.0), 
        
        ## migration from IJ -> EF (backward in time)
        y,
        
        ## merge events at time 2 (c,a), (g,e), (k,i)
        ms.MassMigration(time=Taus[2], source=2, destination=0, proportion=1.0), 
        ms.MassMigration(time=Taus[2], source=6, destination=4, proportion=1.0), 
        ms.MassMigration(time=Taus[2], source=10, destination=8, proportion=1.0), 

        ## end migration at ABC and merge
        ms.MigrationRateChange(time=Taus[2], rate=0),
        ms.MassMigration(time=Taus[3], source=3, destination=0, proportion=1.0), 
        ms.MassMigration(time=Taus[3], source=7, destination=4, proportion=1.0), 
        ms.MassMigration(time=Taus[3], source=11, destination=8, proportion=1.0),   
        
        ## merge IJKL -> EFGH (source=8 is I, destination=4 is E)
        ms.MassMigration(time=Taus[4], source=8, destination=4, proportion=1.0),   
        
        ## merge EFGHIJKL -> ABCD (source=4 is E, destination=0 is A)
        ms.MassMigration(time=Taus[5], source=4, destination=0, proportion=1.0),   
    ]

    ## sim the data
    replicates = ms.simulate(
        population_configurations=pop_config,
        migration_matrix=migmat,
        demographic_events=demog,
        num_replicates=nreps,
        length=100, 
        mutation_rate=mut)
    
    return replicates

### TESTING THE SIMS

In [20]:
## introgression ij -> ef
test = {
    'p5': ['a', 'b', 'c', 'd'],
    'p4': ['k'],
    'p3': ['i', 'j'],
    'p2': ['e', 'f'],
    'p1': ['g'],
}

sims = demography(10000, Ns=1e6, mut=1e-8, mig=5e-5, gen=1, scen=1)
r, b = baba.baba(sims, test, nboots=1000)
print r

           dstat  bootmean   bootstd        abxxa        baxxa          Z
p3      0.105043  0.105103  0.008031  4347.031250  3520.593750  13.079410
p4      0.020286  0.020316  0.009568  3316.695312  3184.804688   2.120156
shared  0.507796  0.508296  0.008812  9140.445313  2983.799479  57.627364


In [19]:
sims = demography(10000, Ns=1e6, mut=1e-8, mig=5e-5, gen=1, scen=0)
r, b = baba.baba(sims, test, nboots=1000)
print r

           dstat  bootmean   bootstd       abxxa       baxxa          Z
p3      0.090130  0.090191  0.009824  4224.21875  3525.71875   9.174368
p4      0.015188  0.015300  0.009858  3294.12500  3195.56250   1.540635
shared  0.432020  0.431656  0.011232  6834.56250  2710.78125  38.463642


           dstat  bootmean   bootstd       abxxa       baxxa          Z
p3      0.090130  0.090636  0.009607  4224.21875  3525.71875   9.381222
p4      0.015188  0.015511  0.009963  3294.12500  3195.56250   1.524395
shared  0.432020  0.431724  0.011572  6834.56250  2710.78125  37.333960


### Run on sim data

In [10]:
## there should be no imbalance in this test
test = {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['l'],
    'p2': ['k'],
    'p1': ['i', 'j'],
}
## mindict
mindict = {key: 1 for key in test}

### Run a single simulated data set

In [11]:
sims = demography(10000, Ns=1e6, mut=1e-9, mig=1e-7, gen=1)
res, boots = baba.baba(sims, test, mindict, 1000)
print res

                   0
dstat       0.003868
bootmean    0.001674
bootstd     0.037506
abba      330.539062
baba      327.992188
Z           0.044628


### 4-taxon tests

In [12]:
tests = [
    ## no introgression
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['l'],
    'p2': ['j'],
    'p1': ['k'],
    }, 
    ## no introgression
    {
    'p4': ['e','f','g','h'],
    'p3': ['d'],
    'p2': ['c'],
    'p1': ['a'],
    }, 
    ## no introgression
    {
    'p4': ['h'],
    'p3': ['g'],
    'p2': ['f'],
    'p1': ['e'],
    }, 
    ## no introgression
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['g'],
    'p2': ['f'],
    'p1': ['e'],
    }, 
    ## no introgression
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['k'],
    'p2': ['j'],
    'p1': ['i'],
    }, 
    ## introgression 
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['i', 'j'],
    'p2': ['g'],
    'p1': ['f'],
    },
    ## introgression 
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['i', 'j'],
    'p2': ['g'],
    'p1': ['e'],
    },
    ## introgression 
    {
    'p4': ['a', 'b', 'c', 'd'],
    'p3': ['i', 'j'],
    'p2': ['g'],
    'p1': ['e', 'f'],
    },   
    ## introgression C->B
    {
    'p4': ['e','f','g','h'],
    'p3': ['c'],
    'p2': ['b'],
    'p1': ['a'],
    },     
    ## introgression C->B
    {
    'p4': ['i', 'j', 'k', 'l'],
    'p3': ['c'],
    'p2': ['b'],
    'p1': ['a'],
    },         
]

tests    

[{'p1': ['k'], 'p2': ['j'], 'p3': ['l'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['a'], 'p2': ['c'], 'p3': ['d'], 'p4': ['e', 'f', 'g', 'h']},
 {'p1': ['e'], 'p2': ['f'], 'p3': ['g'], 'p4': ['h']},
 {'p1': ['e'], 'p2': ['f'], 'p3': ['g'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['i'], 'p2': ['j'], 'p3': ['k'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['f'], 'p2': ['g'], 'p3': ['i', 'j'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['e'], 'p2': ['g'], 'p3': ['i', 'j'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['e', 'f'], 'p2': ['g'], 'p3': ['i', 'j'], 'p4': ['a', 'b', 'c', 'd']},
 {'p1': ['a'], 'p2': ['b'], 'p3': ['c'], 'p4': ['e', 'f', 'g', 'h']},
 {'p1': ['a'], 'p2': ['b'], 'p3': ['c'], 'p4': ['i', 'j', 'k', 'l']}]

### Run a batch of simulated data sets

In [13]:
## a simulation generator
sims = demography(10000, Ns=1e6, mut=1e-9, mig=1e-8, gen=250)

## pass it as first arg to batch func
r, b = baba.batch(sims, tests, nboots=1000, ipyclient=ipyclient)

  [####################] 100%  calculating D-stats  | 0:00:17 |  


## Calculate 4-taxon statistics
<br>

$
    D = \frac{\Sigma(ABBA - BABA)}{\Sigma(ABBA+BABA)}
$

<br>

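$
    D_p = \frac{\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ (1-p_4) ] - [ p_1 ~ (1-p_2) ~ p_3 ~ (1-p_4) ] \right)}      {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ (1-p_4) ] + [ p_1 ~ (1-p_2) ~ p_3 ~ (1-p_4) ] \right)}
$

Here $p_i$ is the frequency of the derived (B) allele in taxon $i$, so the first bracketed term weights ABBA sites and the second weights BABA sites, matching the order used in $D$.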

<br>
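
The frequency-based statistic can be computed directly from per-site derived allele frequencies. The sketch below is a minimal illustration and not the `ipyrad` implementation. It assumes that $p_i$ is the derived (B) allele frequency in taxon $i$, so a site is weighted as ABBA by $(1-p_1) p_2 p_3 (1-p_4)$ and as BABA by $p_1 (1-p_2) p_3 (1-p_4)$; the four example sites are invented.

```python
def dstat_from_freqs(sites):
    """sites: iterable of (p1, p2, p3, p4) derived-allele frequencies.
    Returns (D, summed ABBA weight, summed BABA weight)."""
    abba = baba = 0.0
    for p1, p2, p3, p4 in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p4)
        baba += p1 * (1 - p2) * p3 * (1 - p4)
    return (abba - baba) / (abba + baba), abba, baba


sites = [
    (0.0, 1.0, 1.0, 0.0),  # fixed ABBA site
    (1.0, 0.0, 1.0, 0.0),  # fixed BABA site
    (0.0, 1.0, 0.5, 0.0),  # p3 pooled from two samples, one with the derived allele
    (0.0, 0.5, 1.0, 0.0),  # p2 pooled from two samples, one with the derived allele
]

d, abba, baba = dstat_from_freqs(sites)
print(d, abba, baba)
```

Sites at which a tip is polymorphic contribute fractional weights, which is why the ABBA and BABA columns of the results tables contain floats.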

## Calculate 5-taxon statistics

<br>

$
    D_{12} = \frac{\Sigma(ABBBA - BABBA)}{\Sigma(ABBBA+BABBA)}
$

$
    D_{1} = \frac{\Sigma(ABBAA - BABAA)}{\Sigma(ABBAA+BABAA)}
$

$
    D_{2} = \frac{\Sigma(ABABA - BAABA)}{\Sigma(ABABA+BAABA)}
$

<br>

$
    D_{p12} = \frac 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ p_4 ~ (1-p_5) ] - [ p_1 ~ (1-p_2) ~ p_3 ~ p_4 ~ (1-p_5) ] \right)} 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ p_4 ~ (1-p_5) ] + [ p_1 ~ (1-p_2) ~ p_3 ~ p_4 ~ (1-p_5) ] \right)}
$


$
    D_{p1} = \frac 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ (1-p_4) ~ (1-p_5) ] - [ p_1 ~ (1-p_2) ~ p_3 ~ (1-p_4) ~ (1-p_5) ] \right)} 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ p_3 ~ (1-p_4) ~ (1-p_5) ] + [ p_1 ~ (1-p_2) ~ p_3 ~ (1-p_4) ~ (1-p_5) ] \right)}
$


$
    D_{p2} = \frac 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ (1-p_3) ~ p_4 ~ (1-p_5) ] - [ p_1 ~ (1-p_2) ~ (1-p_3) ~ p_4 ~ (1-p_5) ] \right)} 
        {\Sigma ~ \left( [ (1-p_1) ~ p_2 ~ (1-p_3) ~ p_4 ~ (1-p_5) ] + [ p_1 ~ (1-p_2) ~ (1-p_3) ~ p_4 ~ (1-p_5) ] \right)}
$

<br>
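
The three partitioned statistics differ only in the pattern required at p3 and p4, which the following sketch makes explicit. It is an illustration under assumptions, not the `ipyrad` implementation: $p_i$ is taken to be the derived (B) allele frequency in taxon $i$, the function name and the example sites are invented, and a statistic with no informative sites is returned as NaN.

```python
def partitioned_dstats(sites):
    """sites: iterable of (p1, p2, p3, p4, p5) derived-allele frequencies.
    Returns a dict with the D_12, D_1 and D_2 statistics."""
    sums = {"12": [0.0, 0.0], "1": [0.0, 0.0], "2": [0.0, 0.0]}
    for p1, p2, p3, p4, p5 in sites:
        ab = (1 - p1) * p2   # 'AB' pattern in p1 and p2
        ba = p1 * (1 - p2)   # 'BA' pattern in p1 and p2
        out = 1 - p5         # ancestral allele in the outgroup
        patterns = {
            "12": p3 * p4,        # derived in both p3 and p4 (xxBBA)
            "1": p3 * (1 - p4),   # derived in p3 only (xxBAA)
            "2": (1 - p3) * p4,   # derived in p4 only (xxABA)
        }
        for key, weight in patterns.items():
            sums[key][0] += ab * weight * out
            sums[key][1] += ba * weight * out
    return {
        key: (ab - ba) / (ab + ba) if (ab + ba) else float("nan")
        for key, (ab, ba) in sums.items()
    }


sites = [
    (0.0, 1.0, 1.0, 1.0, 0.0),  # ABBBA
    (0.0, 1.0, 1.0, 1.0, 0.0),  # ABBBA
    (1.0, 0.0, 1.0, 1.0, 0.0),  # BABBA
    (0.0, 1.0, 1.0, 0.0, 0.0),  # ABBAA
    (1.0, 0.0, 0.0, 1.0, 0.0),  # BAABA
]
result = partitioned_dstats(sites)
print(result["12"], result["1"], result["2"])
```

In the five-taxon result tables of this notebook the rows are labeled `p3`, `p4`, and `shared`, which appear to correspond to $D_1$, $D_2$, and $D_{12}$ respectively.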

### Setup CLI data test

In [17]:
## a dictionary with [required] key names 
## optional: additional 'p5' key for 5-taxon tests.
test = {
    'p5': ["3L_0", "3J_0", "3K_0"], 
    'p4': ["2G_0", "2H_0"],
    'p3': ["2E_0", "2F_0"],
    'p2': ["1D_0"],
    'p1': ["1A_0", "1B_0", "1C_0"],
}

## optional: dict for min samples per taxon (default=1 per tax)
## used to filter loci for inclusion in data set
mindict = {
    'p1': 1,
    'p2': 1,
    'p3': 1, 
    'p4': 1,
    'p5': 1,
}

## loci input file
handle = data.outfiles.loci

In [18]:
## run baba.baba() for five taxa
r, b = baba.baba(handle, test, None, 1000)
print r

           dstat  bootmean   bootstd      abxxa      baxxa         Z
p3     -0.333333 -0.426138  0.505859   0.250000   0.500000  0.658945
p4      0.080460  0.072276  0.319822   1.958333   1.666667  0.251576
shared  0.005321 -0.001116  0.204465  12.465278  12.333333  0.026022


### Setup msprime data test

In [38]:
## introgression ij -> ef
test = {
    'p5': ['a', 'b', 'c', 'd'],
    'p4': ['k'],
    'p3': ['i', 'j'],
    'p2': ['e', 'f'],
    'p1': ['g'],
}

## sim and calc
sims = demography(20000, Ns=1e5, mut=1e-8, mig=1e-6, gen=1)
r, b = baba.baba(sims, test, None, 1000)

In [39]:
print r

           dstat  bootmean   bootstd         abxxa    baxxa          Z
p3      0.212910  0.212728  0.008965   5061.437500  3284.50  23.748952
p4      0.043302  0.043464  0.010466   3838.375000  3519.75   4.137329
shared  1.000000  1.000000  0.000000  30292.041667     0.00   0.000000


In [40]:
sims = demography(20000, Ns=1e6, mut=1e-9, mig=0, gen=250, scen=1)
r, b = baba.baba(sims, test, None, 1000)
print r

           dstat  bootmean   bootstd     abxxa     baxxa         Z
p3      0.006717  0.005618  0.028301  384.0625  378.9375  0.237336
p4      0.082609  0.081418  0.140152   31.1250   26.3750  0.589423
shared  0.006858  0.006667  0.021331  683.6250  674.3125  0.321494


In [52]:
#arr[:3, :, :5]
r,b = runit(20000, Ns=500000, mut=1e-8, mig=1e-7, gen=200, nboots=1000, test=tt, scen=0)