<h2><span style="color:gray">ipyrad-analysis toolkit:</span> treemix</h2>

The program [TreeMix](https://bitbucket.org/nygcresearch/treemix/wiki/Home) by [Pickrell & Pritchard (2012)](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002967) infers population splits and admixture events from allele frequency data. From the TreeMix documentation: "In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph."

### Required software

In [1]:
# conda install ipyrad -c bioconda
# conda install treemix -c bioconda
# conda install toytree -c eaton-lab

In [2]:
import ipyrad.analysis as ipa
import toytree

In [3]:
print('ipyrad', ipa.__version__)
print('toytree', toytree.__version__)
! treemix --version | grep 'TreeMix v. '

ipyrad 0.9.6-dev
toytree 0.2.0
TreeMix v. 1.12


### Short Tutorial:

If you entered population information during data assembly, you may have already produced a `.treemix.gz` output file that can be used as input to the treemix command line program. Alternatively, you can run treemix using the ipyrad tool described here, which offers additional flexibility for filtering SNP data and for running treemix programmatically over many parameter settings.

The key features offered by `ipa.treemix` include: 

1. Subsample unlinked SNPs (1 per locus) repeatedly for replicate analyses.
2. Filter SNPs by sample or population coverage.
3. Plotting functions for admixture graphs and covariance matrices.
4. Easy scripting with for-loops over parameter settings.
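
To make the first feature concrete, here is a minimal sketch of what drawing one random SNP per locus means. It is not ipyrad's implementation: the function `subsample_unlinked` and the toy `snp_loci` list are hypothetical names and made-up data used only for illustration.

```python
import random


def subsample_unlinked(snp_loci, seed=None):
    """Pick one random SNP index per locus.

    snp_loci[i] is the locus ID of SNP i (hypothetical input layout).
    Returns a sorted list of SNP indices, exactly one per locus.
    """
    rng = random.Random(seed)

    # group SNP indices by the locus they belong to
    by_locus = {}
    for idx, locus in enumerate(snp_loci):
        by_locus.setdefault(locus, []).append(idx)

    # draw a single SNP from every locus
    return sorted(rng.choice(idxs) for idxs in by_locus.values())


# toy data: 7 SNPs spread across 3 loci (made up for illustration)
snp_loci = [0, 0, 0, 1, 1, 2, 2]
rep1 = subsample_unlinked(snp_loci, seed=1)
rep2 = subsample_unlinked(snp_loci, seed=2)
print(rep1, rep2)
```

Each call returns one SNP index per locus, and different seeds can return different SNPs, which is why replicate analyses over several subsamples are informative.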

In [6]:
# the path to your HDF5 formatted snps file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"

In [7]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CRL0001", "CUCA4", "CUSV6", "CUMM5"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017", "CRL0001"],
}

# minimum number of samples per group that must have data at a SNP for it to be retained
minmap = {
    "virg": 3,
    "mini": 2,
    "gemi": 2,
    "bran": 2,
    "fusi": 2,
    "sagr": 2,
    "oleo": 3,
}
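
Before starting an analysis it can be worth checking the two dictionaries for simple mistakes. In the `imap` above, `CRL0001` is listed under both `sagr` and `oleo`, which may be why 30 names are listed while the tool reports 29 samples. The helper below is a hypothetical sketch, not part of ipyrad, and the toy dictionaries use made-up names; it flags samples assigned to more than one group and `minmap` values larger than the group size.

```python
from collections import Counter


def check_popmaps(imap, minmap):
    """Return a list of human-readable problems found in imap and minmap."""
    problems = []

    # count how often each sample name appears across all groups
    counts = Counter(name for names in imap.values() for name in names)
    for name, n in sorted(counts.items()):
        if n > 1:
            pops = [pop for pop, names in imap.items() if name in names]
            problems.append(
                f"sample {name} listed {n} times (in: {', '.join(pops)})"
            )

    # a minimum cannot exceed the number of distinct samples in the group
    for pop, minimum in minmap.items():
        if pop not in imap:
            problems.append(f"minmap group {pop} is missing from imap")
        elif minimum > len(set(imap[pop])):
            problems.append(
                f"minmap for {pop} is {minimum} but only "
                f"{len(set(imap[pop]))} samples are listed"
            )
    return problems


# toy dictionaries with made-up names, for illustration only
toy_imap = {"popA": ["a1", "a2", "a3"], "popB": ["b1", "b2", "a3"]}
toy_minmap = {"popA": 2, "popB": 4}
problems = check_popmaps(toy_imap, toy_minmap)
for line in problems:
    print(line)
```

Calling `check_popmaps(imap, minmap)` on the dictionaries defined above would apply the same checks to the real data.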

In [8]:
# init a treemix analysis object with some param arguments
tmx = ipa.treemix(
    data=data, 
    imap=imap,
    minmap=minmap, 
    seed=1234,
    root="bran,fusi",
    bootstrap=10,
    m=2,
)

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 0
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
subsampled 30621 unlinked SNPs


In [9]:
# print the command string that will be called and run it
print(tmx.command)
tmx.run()


treemix -i /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-treemix/test.treemix.in.gz -o /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-treemix/test -m 2 -bootstrap 10 -seed 1234 -root bran,fusi


In [8]:
# draw the best scoring admixture graph
tmx.draw_tree();

In [9]:
# draw the covariance matrix
tmx.draw_cov();

# Cookbook

### 1. Finding the best value for `m`

As with structure plots, there is no single true best value of `m`, but you can use model selection methods to decide whether one value gives a statistically better fit to your data than another. Adding admixture edges will always improve the likelihood score, but with diminishing returns as the extra edges explain less and less of the variation in the data. You can examine the log-likelihood score of each model fit by running a for-loop like the one below. You may want to run this within another for-loop that iterates over different subsampled sets of SNPs.

In [10]:
# init a treemix analysis object with some param arguments
tmx = ipa.treemix(
    data=data, 
    imap=imap,
    minmap=minmap, 
    seed=1234,
    root="bran,fusi",
)

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 0
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
subsampled 30621 unlinked SNPs


In [11]:
tests = {}
nadmix = [0, 1, 2, 3, 4, 5]

# iterate over n admixture edges
for adm in nadmix:
    tmx.params.m = adm
    tmx.run()
    tests[adm] = tmx.results.llik

In [12]:
import toyplot
toyplot.plot(
    nadmix,
    [tests[i] for i in nadmix],
    width=350, 
    height=275,
    stroke_width=3,
    xlabel="n admixture edges",
    ylabel="ln(likelihood)",
);
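
Alongside the plot, the diminishing returns can be read off as the gain in log-likelihood from each additional edge. The sketch below assumes a dictionary like `tests` that maps each value of `m` to a numeric log-likelihood. The function `llik_gains` is a hypothetical helper, and the numbers in `example` are placeholders, not results from this data set.

```python
def llik_gains(tests):
    """Return {m: llik(m) - llik(previous m)} for the sorted values of m."""
    ms = sorted(tests)
    return {m: tests[m] - tests[prev] for prev, m in zip(ms, ms[1:])}


# placeholder values for illustration only, NOT results from this data set
example = {0: -350.0, 1: -120.0, 2: -60.0, 3: -52.0, 4: -50.0, 5: -49.5}

gains = llik_gains(example)
for m, gain in gains.items():
    print(f"m={m}: gain of {gain:.1f} log-likelihood units")
```

A gain that shrinks toward zero suggests that further edges explain little additional variation. This is only a descriptive summary, not a formal model selection test.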

### 2. Iterate over different subsamples of SNPs

The treemix tool randomly subsamples 1 SNP per locus to reduce the effect of linkage on the results. However, depending on the size of your data set and the strength of the signal, different subsamples may yield slightly different results. You can check this by re-initializing the treemix tool with a different random seed (or with no seed). Below I plot the results of 9 iterations for m=2.

In [14]:
# a gridded canvas to plot trees on 
canvas = toyplot.Canvas(width=800, height=800)

# iterate over multiple sets of SNPs
for i in range(9):
    
    # init a treemix analysis object with no seed, so each iteration draws a new random SNP subsample
    tmx = ipa.treemix(
        data=data, 
        imap=imap,
        minmap=minmap,
        root="bran,fusi",
        global_=True,
        m=2,
        quiet=True
    )
    
    # run model fit
    tmx.run()
    
    # create a grid axis and add tree to axes
    axes = canvas.cartesian(grid=(3, 3, i))
    tmx.draw_tree(axes)
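
Because the loop above sets no seed, its nine subsamples cannot be reproduced exactly. One option, sketched here under the assumption that reproducibility matters for your analysis, is to derive a distinct seed for each replicate from a single master seed and pass it as the `seed` argument shown earlier. The helper `replicate_seeds` is hypothetical and not part of ipyrad.

```python
import random


def replicate_seeds(master_seed, nreps):
    """Draw nreps distinct integer seeds from a single master seed."""
    rng = random.Random(master_seed)
    return rng.sample(range(1, 2**31 - 1), nreps)


seeds = replicate_seeds(master_seed=1234, nreps=9)
for i, seed in enumerate(seeds):
    # in the loop above, one would pass seed=seed when creating the
    # treemix object for replicate i
    print(i, seed)
```

Re-running with the same master seed regenerates the same list of seeds, so each replicate can be recreated later.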