# <span style="color:gray">ipyrad-analysis toolkit:</span> tetrad

`tetrad` is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to infer a quartet tree from SNPs for each sampled set of four taxa in a larger tree, and then joins the quartet trees together into a supertree.
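
As a rough illustration of the invariants idea, the sketch below scores the three possible splits of one quartet by how far each 16x16 site-pattern count matrix is from rank 10, and keeps the split with the lowest score. It is a toy under stated assumptions, not tetrad's implementation: the simulated sequences, the `mutate`, `flattening` and `svd_score` helpers, and the fixed rank of 10 (borrowed from the `min(10, rank.min())` line later in this notebook) are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def mutate(seq, prob):
    """Replace each base by a uniformly random base with probability `prob`."""
    change = rng.random(seq.size) < prob
    return np.where(change, rng.integers(0, 4, seq.size), seq)

# Toy data (an assumption, not tetrad's model): four tips related by the
# tree ((0, 1), (2, 3)), bases coded 0-3, no missing data.
nsites = 20_000
root = rng.integers(0, 4, nsites)
left, right = mutate(root, 0.2), mutate(root, 0.2)
seqs = np.stack([
    mutate(left, 0.1), mutate(left, 0.1),
    mutate(right, 0.1), mutate(right, 0.1),
])

def flattening(seqs, order):
    """16x16 site-pattern count matrix for the split (a, b | c, d)."""
    a, b, c, d = order
    mat = np.zeros((16, 16), dtype=np.float64)
    np.add.at(mat, (seqs[a] * 4 + seqs[b], seqs[c] * 4 + seqs[d]), 1)
    return mat

def svd_score(mat, rank=10):
    """Frobenius distance from `mat` to the nearest matrix of rank `rank`."""
    svals = np.linalg.svd(mat, compute_uv=False)
    return np.sqrt(np.sum(svals[rank:] ** 2))

# the three ways to split four samples into two pairs
orders = [(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2)]
scores = np.array([svd_score(flattening(seqs, o)) for o in orders])
best = orders[int(np.argmin(scores))]
print(scores, best)
```

With data simulated on ((0, 1), (2, 3)), the first ordering should receive by far the lowest score, because its count matrix is close to low rank while the other two are not.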

### Required software

In [1]:
# conda install ipyrad -c bioconda
# conda install tetrad -c eaton-lab

In [1]:
import ipyrad.analysis as ipa
import toytree

### Input arguments:


In [8]:
tet = ipa.tetrad(
    name='ficus-min10', 
    data="/home/deren/Downloads/ficus-min10-10K.snps.hdf5",
    nquartets=1e6, 
    nboots=10,
)

loading snps array [113 taxa x 1052734 snps]
max unlinked SNPs per quartet [nloci]: 61480
quartet sampler [random]: 1000000 / 6438740
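
The denominator in the sampler line is the number of ways to choose 4 of the 113 taxa, so `nquartets=1e6` asks for a random subset of all possible quartets. A quick check of that arithmetic (variable names are illustrative):

```python
import math

ntaxa = 113
total = math.comb(ntaxa, 4)      # all possible quartets
nquartets = int(1e6)             # value passed to ipa.tetrad above

print(total)                         # 6438740
print(round(nquartets / total, 3))   # 0.155
```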


In [None]:
tet.run(auto=True, show_cluster=True)

Parallel connection | latituba: 8 cores
initializing quartet sets database
[####################] 100% 2:20:51 | inferring full tree * | mean SNPs/quartet: 36945 
[####################] 100% 0:48:58 | bootstrap inference 1 | mean SNPs/quartet: 37992 
[####################] 100% 0:48:18 | bootstrap inference 2 | mean SNPs/quartet: 36435 
[####################] 100% 0:48:13 | bootstrap inference 3 | mean SNPs/quartet: 37288 
[####################] 100% 0:48:25 | bootstrap inference 4 | mean SNPs/quartet: 36691 
[####################] 100% 0:48:49 | bootstrap inference 5 | mean SNPs/quartet: 37128 
[####################] 100% 0:48:48 | bootstrap inference 6 | mean SNPs/quartet: 36946 
[####################] 100% 0:48:27 | bootstrap inference 7 | mean SNPs/quartet: 36576 
[#####               ]  25% 0:13:18 | bootstrap inference 8 | mean SNPs/quartet: 35088 

In [6]:
from tetrad.distributor import *
import ipyparallel as ipp
ipyclient = ipp.Client()

In [7]:
tet._store_N_samples(True, ipyclient)
self = Distributor(tet, ipyclient, None, False)

initializing quartet sets database


In [60]:
self.jobs

range(0, 10334625, 40530)
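
The `jobs` range lists the start index of each chunk of quartets. A small sketch of how that range breaks down, using only the numbers printed above (the observation that 10334625 equals the number of quartets among 127 taxa is arithmetic, not something the output states):

```python
import math

jobs = range(0, 10334625, 40530)

nchunks = len(jobs)                  # number of chunks to distribute
last_size = jobs.stop - jobs[-1]     # the final chunk is shorter
print(nchunks, last_size)            # 255 40005

# 10334625 happens to equal C(127, 4)
print(math.comb(127, 4) == jobs.stop)   # True

# the start used in the next cell, 40530 * 5, is the sixth chunk
print(jobs[5])                       # 202650
```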

In [72]:
chunk = 40530 * 5

In [73]:
from tetrad.worker import *

with single_threaded(np):

    # open seqarray view, the modified arr is in bootsarr
    with h5py.File(tet.files.idb, 'r') as io5:
        seqview = io5["bootsarr"][:]
        maparr = io5["bootsmap"][:, 0]
        smps = io5["quartets"][chunk:chunk + tet._chunksize]

        # create an N-mask array of all seq cols
        nall_mask = seqview[:] == 78

    # init arrays to fill with results
    rquartets = np.zeros((smps.shape[0], 4), dtype=np.uint16)
    rinvariants = np.zeros((smps.shape[0], 16, 16), dtype=np.uint16)

    # TODO: test again numbafying the loop below, but on a super large 
    # matrix. Maybe two strategies should be used for different sized 
    # problems...

In [74]:
from collections import Counter


In [143]:
idx = 5500
sidx = smps[idx]
print(sidx)

seqs = seqview[sidx]
#print(seqs)

nmask = np.any(nall_mask[sidx], axis=0)
nmask += np.all(seqs == seqs[0], axis=0) 
nmask3 = np.array([
    Counter(i).most_common()[0][1] == 2
    for i in seqs[:, ~nmask].T
])

s2 = seqs[:, ~nmask].shape, seqs[:, ~nmask][:, nmask3].shape
s2

[ 0 37 39 82]


((4, 3233), (4, 691))
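
To make the two filters in the cell above easier to follow, here is the same logic on a hand-made 4 x 6 array. The data are invented for illustration; 78 is the byte value of `N`, as in `nall_mask`. The first mask drops columns that contain an N or are invariant; the second keeps only columns whose most common base occurs exactly twice.

```python
from collections import Counter

import numpy as np

# toy quartet: rows are samples, columns are sites (78 == ord("N"))
seqs = np.array([
    [0, 0, 78, 1, 2, 1],
    [0, 1, 0, 1, 2, 1],
    [0, 0, 0, 3, 3, 1],
    [0, 1, 0, 3, 0, 2],
], dtype=np.uint8)

# mask columns with any missing base, or with no variation
nmask = np.any(seqs == 78, axis=0)
nmask |= np.all(seqs == seqs[0], axis=0)
kept = seqs[:, ~nmask]

# among kept columns, flag those whose most common base occurs exactly twice
top2 = np.array([
    Counter(col.tolist()).most_common(1)[0][1] == 2
    for col in kept.T
])

print(nmask)                           # [ True False  True False False False]
print(kept.shape, kept[:, top2].shape)  # (4, 4) (4, 3)
```

Note that the second filter keeps both 2-2 patterns (such as columns 1 and 3) and 2-1-1 patterns (column 4), and drops 3-1 patterns (column 5).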

In [137]:
bidx, invar = calculate(seqs, maparr, nmask, TESTS)

In [138]:
#import toyplot
#toyplot.matrix(invar[0]);

In [139]:
seqs[:, ~nmask][:, nmask3]

array([[2, 2, 1, ..., 3, 3, 3],
       [1, 0, 3, ..., 1, 1, 1],
       [3, 2, 1, ..., 3, 3, 3],
       [1, 0, 3, ..., 1, 1, 1]], dtype=uint8)

In [140]:
from tetrad.jitted import *

In [141]:
mats = chunk_to_matrices(seqs, maparr, nmask)

In [148]:
from tetrad.utils import TESTS
tests = TESTS

# empty arrs to fill
svds = np.zeros((3, 16), dtype=np.float64)
scor = np.zeros(3, dtype=np.float64)
rank = np.zeros(3, dtype=np.float64)

# svd and rank.
for test in range(3):
    svds[test] = np.linalg.svd(mats[test].astype(np.float64))[1]
    rank[test] = np.linalg.matrix_rank(mats[test].astype(np.float64))

minrank = int(min(10, rank.min()))
print(minrank)
for test in range(3):
    scor[test] = np.sqrt(np.sum(svds[test, minrank:]**2))
    
# sort to find the best qorder
best = np.where(scor == scor.min())[0]
bidx = tests[best][0]

print(scor, bidx)

10
[8.18150234 0.36098334 8.70600503] [0 2 1 3]


In [149]:
#mats

In [150]:
sidx[bidx]

array([ 0, 39, 37, 82], dtype=uint16)

In [151]:
print([tet.samples[i] for i in sidx])
print([tet.samples[i] for i in sidx[bidx]])

['Adoxa_MJD_120', 'divaricatum_PWS_1773', 'elatum_PWS_3084', 'obtusatum_Tzont']
['Adoxa_MJD_120', 'elatum_PWS_3084', 'divaricatum_PWS_1773', 'obtusatum_Tzont']
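
The reordering above can be reproduced from the printed scores alone. The sketch assumes that `TESTS` holds the three pairings shown below (consistent with the `[0 2 1 3]` printed earlier, but not confirmed here) and that the first two entries of the reordered quartet form one side of the split.

```python
import numpy as np

# ASSUMPTION: the three possible pairings of four samples, as in tetrad.utils.TESTS
TESTS = np.array([[0, 1, 2, 3], [0, 2, 1, 3], [0, 3, 1, 2]])

# values copied from the outputs above
scor = np.array([8.18150234, 0.36098334, 8.70600503])
sidx = np.array([0, 37, 39, 82], dtype=np.uint16)
names = {
    0: "Adoxa_MJD_120",
    37: "divaricatum_PWS_1773",
    39: "elatum_PWS_3084",
    82: "obtusatum_Tzont",
}

bidx = TESTS[np.argmin(scor)]      # pairing with the lowest SVD score
resolved = sidx[bidx]              # sample indices in resolved order
pair1 = [names[i] for i in resolved[:2].tolist()]
pair2 = [names[i] for i in resolved[2:].tolist()]

print(bidx, resolved)              # [0 2 1 3] [ 0 39 37 82]
print(pair1, "|", pair2)
```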


In [185]:
np.random.randint(3)#(range(3))

0

In [135]:
# Fill the result arrays as we compute them. This iterates over
# every quartet set in this sample chunk. It would be nice to
# have this whole loop JIT-compiled with numba (STOP TRYING...).
for idx in range(smps.shape[0]):
    sidx = smps[idx]
    seqs = seqview[sidx]

    # These axis calls could not be sped up with numba: I can't
    # find a faster JIT-compiled way, and I've really, really,
    # really tried. Tried again now that numba supports axis args
    # for np.sum, and tried guvectorize too; still no speed
    # improvement from JIT-compiling this loop.
    nmask = np.any(nall_mask[sidx], axis=0)
    nmask += np.all(seqs == seqs[0], axis=0)

    # here are the jitted funcs
    bidx, invar = calculate(seqs, maparr, nmask, TESTS)

    # store results
    rquartets[idx] = sidx[bidx]
    rinvariants[idx] = invar

KeyboardInterrupt: 

In [26]:
nmask1 = np.any(nall_mask[sidx], axis=0)
nmask2 = np.sum(seqs == seqs[0], axis=0) > 2
#nmask3 = np.sum(seqs == seqs[0], axis=0) < 3
#print(nmask.sum(), nmask.size - nmask.sum())
seqs[:, np.invert(nmask1 + nmask2)]

array([[2, 3, 0, ..., 3, 0, 2],
       [2, 3, 0, ..., 3, 0, 2],
       [0, 2, 2, ..., 1, 1, 0],
       [0, 2, 2, ..., 1, 1, 0]], dtype=uint8)

In [27]:
from collections import Counter

In [28]:
Counter(seqs[:,  ~(nmask1+nmask2)][:, 2]).most_common()

[(0, 2), (2, 2)]

In [29]:
seqs[:, ~(nmask1+nmask2)]

array([[2, 3, 0, ..., 3, 0, 2],
       [2, 3, 0, ..., 3, 0, 2],
       [0, 2, 2, ..., 1, 1, 0],
       [0, 2, 2, ..., 1, 1, 0]], dtype=uint8)

In [30]:
nmask = nmask1 + nmask2

In [32]:
tet.samples

['29154_superba_SRR1754715',
 '30556_thamno_SRR1754720',
 '30686_cyathophylla_SRR1754730',
 '32082_przewalskii_SRR1754729',
 '33413_thamno_SRR1754728',
 '33588_przewalskii_SRR1754727',
 '35236_rex_SRR1754731',
 '35855_rex_SRR1754726',
 '38362_rex_SRR1754725',
 '39618_rex_SRR1754723',
 '40578_rex_SRR1754724',
 '41478_cyathophylloides_SRR1754722',
 '41954_cyathophylloides_SRR1754721']

In [33]:
sidx

array([ 0,  2,  8, 10], dtype=uint16)

In [14]:
a, b = calculate(seqs, maparr, nmask, TESTS)

import pandas as pd
df = pd.DataFrame(b)
df.style.background_gradient(cmap='Blues')
#df

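|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 0 | 1 | 63 | 1 | 0 | 2 | 2 | 57 |
| 1 | 0 | 5 | 0 | 0 | 41 | 46 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 14 | 0 | 1 | 0 | 0 | 0 | 152 | 0 | 140 | 1 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 47 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 126 | 1 | 0 | 380 |
| 4 | 34 | 50 | 1 | 1 | 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 19 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 3 | 0 | 0 | 3 | 61 |
| 6 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 39 | 28 | 0 | 0 | 1 | 1 | 0 |
| 7 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 23 | 0 | 0 | 0 | 0 | 2 | 165 | 1 | 195 |
| 8 | 143 | 0 | 170 | 2 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 2 | 0 | 2 | 0 |
| 9 | 0 | 0 | 5 | 0 | 1 | 33 | 49 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |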


In [31]:
tre = mtre.treelist[0]
tre.draw(
    #layout='c', 
    #width=800, 
    #height=800,
    tip_labels_align=True,
    node_labels="support",
    #use_edge_lengths=False,
);

In [42]:
tet.run(auto=True)

Parallel connection | latituba: 8 cores
initializing quartet sets database
[####################] 100% 0:25:02 | inferring full tree    


In [67]:
import os

os.path.exists(tet.trees.cons)

False

In [61]:
tre = toytree.tree(tet.trees.cons)
tre.root('reference').draw(
    layout='c', 
    width=800, 
    height=800,
    tip_labels_align=True,
    node_labels="support",
    #use_edge_lengths=False,
);

ToytreeError: Sample ['reference'] is not in the tree

In [45]:
tet.trees.tree

'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-tetrad/test.tree'

In [41]:
import h5py

with h5py.File(tet.files.idb, 'r') as io5:
    
    print("idb bootsmap")
    print(io5["bootsmap"][5:])

    print("idb seqarr")
    print(io5["seqarr"][:, :])

    print("idb spans")
    print(io5["spans"][-5:])

    print("idb bootsarr")
    print(io5["bootsarr"][-5:])
    
    print("idb quartets")
    print(io5["quartets"][-5:])


idb bootsmap
[[      0       5]
 [      0       6]
 [      0       7]
 ...
 [  61929 1052731]
 [  61929 1052732]
 [  61929 1052733]]
idb seqarr
[[65 71 71 ... 71 84 84]
 [78 78 78 ... 71 67 84]
 [78 78 78 ... 71 67 84]
 ...
 [78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]]
idb spans
[[1052613 1052645]
 [1052645 1052653]
 [1052653 1052675]
 [1052675 1052715]
 [1052715 1052734]]
idb bootsarr
[[78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]
 [78 78 78 ... 78 78 78]]
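
The rows read from `io5["spans"]` abut one another and end at 1052734, the number of SNPs, which suggests each row is a half-open `[start, end)` range of SNP columns for one locus. Under that assumption (it is an inference from the output, not documented behavior), drawing one unlinked SNP per locus can be sketched on toy data as follows; the tiny `spans` array is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy spans: three loci covering SNP columns 0-2, 3, and 4-8
spans = np.array([[0, 3], [3, 4], [4, 9]])

# locus index of every SNP column, similar in spirit to bootsmap[:, 0]
sizes = spans[:, 1] - spans[:, 0]
locus_ids = np.repeat(np.arange(len(spans)), sizes)

# one random SNP column per locus: low is inclusive, high is exclusive
picks = rng.integers(spans[:, 0], spans[:, 1])

print(locus_ids)   # [0 0 0 1 2 2 2 2 2]
print(picks)       # one column index per locus
```

Each entry of `picks` falls inside its own locus, so the sampled columns are unlinked in the sense used by the `max unlinked SNPs per quartet [nloci]` line at the top of the notebook.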


In [28]:
with h5py.File(tet.files.data, 'r') as io5:
    print(io5["snpsmap"][:50])


[[    1     0     5     1  2558]
 [    1     1    12     1  2565]
 [    1     2    13     1  2566]
 [    1     3    15     1  2568]
 [    1     4    41     1  2594]
 [    1     5    42     1  2595]
 [    1     6    53     1  2606]
 [    1     7    58     1  2611]
 [    1     8    60     1  2613]
 [    1     9    62     1  2615]
 [    1    10    63     1  2616]
 [    1    11    65     1  2618]
 [    1    12    71     1  2624]
 [    1    13    74     1  2627]
 [    1    14    78     1  2631]
 [    1    15    89     1  2642]
 [    1    16    91     1  2644]
 [    1    17    98     1  2651]
 [    1    18   108     1  2661]
 [    1    19   112     1  2665]
 [    1    20   115     1  2668]
 [    1    21   117     1  2670]
 [    1    22   119     1  2672]
 [    1    23   122     1  2675]
 [    1    24   125     1  2678]
 [    1    25   126     1  2679]
 [    1    26   134     1  2687]
 [    1    27   138     1  2691]
 [    1    28   146     1  2699]
 [    1    29   149     1  2702]
 [    1   

In [7]:
# init raxml object with input data and (optional) parameter options
rax = ipa.raxml(data=phyfile, T=4, N=10)

# print the raxml command string for posterity
print(rax.command)

# run the command, (options: block until finishes; overwrite existing)
rax.run(block=True, force=True)

raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 10 -x 12345
job test finished successfully


In [5]:
# (optional) draw your tree in the notebook
import toytree

# load from the .trees attribute of the raxml object, or from the saved tree file
tre = toytree.tree(rax.trees.bipartitions)

# draw the tree
rtre = tre.root(wildcard="prz")
rtre.draw(tip_labels_align=True, node_labels="support");

ToytreeError: No Samples matched the wildcard
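
Both `ToytreeError` messages in this notebook come from asking to root on a name or wildcard that is not among the tip labels of the loaded tree. One way to check before rooting is sketched below. The `tips_matching` helper is hypothetical and assumes plain substring matching; the label list is the `tet.samples` output shown earlier, whereas in practice the labels would come from the tree itself (for example from `tre.get_tip_labels()`).

```python
# labels copied from the tet.samples output above
samples = [
    "29154_superba_SRR1754715",
    "30556_thamno_SRR1754720",
    "30686_cyathophylla_SRR1754730",
    "32082_przewalskii_SRR1754729",
    "33413_thamno_SRR1754728",
    "33588_przewalskii_SRR1754727",
    "35236_rex_SRR1754731",
    "35855_rex_SRR1754726",
    "38362_rex_SRR1754725",
    "39618_rex_SRR1754723",
    "40578_rex_SRR1754724",
    "41478_cyathophylloides_SRR1754722",
    "41954_cyathophylloides_SRR1754721",
]

def tips_matching(tip_labels, wildcard):
    """Hypothetical helper: return the labels that contain `wildcard`."""
    return [name for name in tip_labels if wildcard in name]

prz = tips_matching(samples, "prz")
ref = tips_matching(samples, "reference")
print(prz)   # ['32082_przewalskii_SRR1754729', '33588_przewalskii_SRR1754727']
print(ref)   # []
```

An empty result means the rooting call would fail, which usually indicates that the tree being drawn was inferred from a different dataset than expected.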

### Longer tutorial

By default, several parameters are preset in the raxml object. To remove a parameter from the command string, set it to None. You can also build complex raxml command line strings by passing almost any parameter to the raxml object at init, as shown below. This tool probably can't do everything that raxml can; it is only meant as a convenience. You can always write the raxml command line string by hand instead.

In [23]:
# init raxml object
rax = ipa.raxml(data=phyfile, T=4, N=10)

# parameter dictionary for a raxml object
rax.params

N        10                  
T        4                   
binary   raxmlHPC-PTHREADS-SSE3
f        a                   
m        GTRGAMMA            
n        test                
p        54321               
s        ~/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy
w        ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml
x        12345               

In [24]:
# paths to output files produced by raxml inference
rax.trees

bestTree                   ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bestTree.test
bipartitions               ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test
bipartitionsBranchLabels   ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitionsBranchLabels.test
bootstrap                  ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bootstrap.test
info                       ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_info.test

### Cookbook

Most frequently used: perform 100 rapid bootstrap analyses followed by 10 rapid hill-climbing ML searches from random starting trees under the GTRGAMMA substitution model. 

In [9]:
rax = ipa.raxml(
    data=phyfile,
    name="test",
    workdir="analysis-raxml",
    m="GTRGAMMA",
    T=20,
    f="a",
    N=100,
)
print(rax.command)

raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 100 -x 12345


Another common option: Perform N rapid hill-climbing ML analyses from random starting trees, with no bootstrap replicates.

In [10]:
rax = ipa.raxml(
    data=phyfile,
    name="test",
    workdir="analysis-raxml",
    m="GTRGAMMA",
    T=20,
    f="d",
    N=10,
    x=None,
)
print(rax.command)

raxmlHPC-PTHREADS-SSE3 -f d -T 20 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 10
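
The effect of `x=None` in the command above can be mimicked with a few lines of plain Python. This is only a sketch of the idea that parameters set to None are left out of the command string: `build_command` is a hypothetical helper, not part of ipyrad, and the shortened paths are placeholders.

```python
def build_command(binary, params):
    """Join flags into a command string, skipping any set to None."""
    parts = [binary]
    for flag, value in params.items():
        if value is None:
            continue
        parts.extend([f"-{flag}", str(value)])
    return " ".join(parts)

# flags in the same order as the printed command; paths are placeholders
params = {
    "f": "d",
    "T": 20,
    "m": "GTRGAMMA",
    "n": "test",
    "w": "analysis-raxml",
    "s": "data10.phy",
    "p": 54321,
    "N": 10,
    "x": None,   # dropping -x disables rapid bootstrapping
}

cmd = build_command("raxmlHPC-PTHREADS-SSE3", params)
print(cmd)
```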


### What's next?

If you have reference mapped data then you should see the `.treeslider()` tool to infer trees in sliding windows along scaffolds; or the `.window_extracter()` tool to extract, filter, and concatenate RAD loci within a given window (e.g., near some known gene).