# Building a concatenated species tree with ete build
Besides reconstructing gene phylogenies, ETE build provides basic workflows for building concatenated (supermatrix) trees.
You only need to provide a file defining the target COGs to use; that is, a file where each line contains a tab-delimited list of sequences that are considered orthologs among all taxa.

- Relevant documentation: [ete3 build supermatrix](http://etetoolkit.org/cookbook/ete_build_supermatrix.ipynb)

# Infer species tree based on a concatenated alignment of marker gene families

## Create a COGS file that ete build can read
To play well with ETE, the COGs file should contain sequence names from which the species code can be easily extracted.

Remember that, in our case, all sequences have the following format:

- `spcode.seqname`, where `.` acts as a delimiter.
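
As a minimal sketch of why this naming convention matters, the snippet below counts how many sequences each species contributes to one COG line and lists any species that appears more than once. The helper name `species_in_cog` is an illustrative assumption (it is not part of ETE), and the sample line reuses a few sequence names from the marker file in this tutorial.

```python
from collections import Counter

def species_in_cog(line, delimiter="."):
    """Count how many sequences each species contributes to one COG line."""
    names = line.rstrip("\n").split("\t")
    # The species code is everything before the first delimiter
    return Counter(name.split(delimiter)[0] for name in names)

# Sample COG line built from names that appear in data/marker_cogs.tsv
cog_line = "933801.Ahos_0739\t273063.STK_02680\t224324999.sul008\t224324.aq_008\n"
counts = species_in_cog(cog_line)

# Species represented by more than one sequence in this COG
duplicated = [sp for sp, n in counts.items() if n > 1]
print(sorted(counts), duplicated)
```

An empty `duplicated` list means every species contributes a single sequence to that COG.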

In [1]:
# Load trees and marker trees
import pickle

def extract_spcode(nodename):
    return nodename.split('.')[0]

with open('data/alltrees.pkl', 'rb') as f:
    all_trees = pickle.load(f)
with open('data/sptree_markers.pkl', 'rb') as f:
    sptree_markers = pickle.load(f)


In [2]:
# TASK: Create a text file where each line contains a tab-delimited list of sequences of a marker gene family (COG)
# Save the file as "data/marker_cogs.tsv" in your data folder

with open('data/marker_cogs.tsv', 'w') as COGS:
    for tname in sptree_markers: 
        t = all_trees[tname]
        print('\t'.join(t.get_leaf_names()), file=COGS)
    


In [3]:
# Let's have a look at the COGs file
!head data/marker_cogs.tsv

933801.Ahos_0739	273063.STK_02680	673860.AciM339_0977	263820.PTO0416	1051632.TPY_0322	224324999.sul008	224324.aq_008	525897.Dbac_2775	743299.Acife_2711	637389.Acaty_c0617	555778.Hneap_0322	1255043.TVNIR_2575	713587.THITH_14295	1158165.KB898880_gene1474	1121405.dsmv_3590
273063.STK_04260	933801.Ahos_0848	673860.AciM339_1102	263820.PTO0644	1121405.dsmv_3585	525897.Dbac_2770	224324999.sul015	224324.aq_015	1051632.TPY_0327	1158165.KB898880_gene1479	1255043.TVNIR_2570	713587.THITH_14270	555778.Hneap_0327	637389.Acaty_c0622	743299.Acife_2706
273063.STK_04220	933801.Ahos_0854	263820.PTO0650	673860.AciM339_1108	224324.aq_1654	224324999.sul1654	525897.Dbac_2764	637389.Acaty_c0628	743299.Acife_2700	555778.Hneap_0333	713587.THITH_14240	1255043.TVNIR_2564	1158165.KB898880_gene1485	1121405.dsmv_3579	1051632.TPY_0333
933801.Ahos_0662	273063.STK_20680	673860.AciM339_1558	263820.PTO1221	555778.Hneap_0345	637389.Acaty_c0643	743299.Acife_2685	1255043.TVNIR_2552	713587.THITH_14175	1158165.KB898880_gen

## Use ete build to infer a supermatrix tree out of your marker families

- specify a gene tree workflow (used here for the gene alignments)
- specify a supermatrix workflow (how to concatenate COGs and which program to use to infer the final tree)
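
These two choices map onto the `-w` and `-m` options of the command in the next cell. As a hedged sketch, the hypothetical helper `build_supermatrix_cmd` below only assembles that argument list in Python, using the flags and file names from this tutorial; it does not run anything.

```python
import shlex

def build_supermatrix_cmd(cogs, proteins, outdir,
                          gene_workflow="mafft_default-none-none-none",
                          sptree_workflow="sptree_fasttree_all",
                          cpu=5):
    """Assemble the `ete3 build` argument list used in this tutorial."""
    return [
        "ete3", "build",
        "-w", gene_workflow,      # gene tree workflow (alignment step)
        "-m", sptree_workflow,    # supermatrix workflow
        "-o", outdir,
        "--cogs", cogs,
        "-a", proteins,
        "--spname-delimiter", ".",
        "--clearall", "--cpu", str(cpu), "--noimg",
    ]

cmd = build_supermatrix_cmd("data/marker_cogs.tsv", "data/all_prots.faa", "data/sptree")
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing `cmd` to `subprocess.run` would be one way to launch the same workflow from a script instead of a notebook cell.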


In [6]:
# TASK: Execute ete build to infer a species tree out of the marker COGs
!ete3 build -w mafft_default-none-none-none -m sptree_fasttree_all -o data/sptree \
           --cogs data/marker_cogs.tsv -a data/all_prots.faa --spname-delimiter . --clearall --cpu 5 --noimg


Toolchain path: /home/huerta/eteconda/envs/etecourse/bin/ete3_apps/bin 
Toolchain version: unknown
['mafft_default-none-none-none']
['cog_all-alg_concat_default-fasttree_full']

      --------------------------------------------------------------------------------
                  ETE build (3.1.2) - reproducible phylogenetic workflows

      Citation:

       Huerta-Cepas J, Serra F and Bork P. ETE 3: Reconstruction, analysis and
       visualization of phylogenomic data. Mol Biol Evol (2016)
       doi:10.1093/molbev/msw046

      (Note that a list of the external programs used to complete all necessary
      computations will be shown after workflow execution. Those programs should
      also be cited.)

      --------------------------------------------------------------------------------
      
[32mINFO[0m -  Testing x86-64  portable applications...
       clustalo: [32mOK[0m - 1.2.4
      dialigntx: [32mOK[0m - This is DIALIGN-TX Version 1.0.2 - A Multiple Sequence alignme

[32mINFO[0m -  [1;37;40mWaiting 2 seconds[0m
[32mINFO[0m -  [1;33mwaiting for 2 cores
[32mINFO[0m -  [1;33mLaunched[0m 2 jobs. 2(R), 27(W). Cores usage: 4/5
[32mINFO[0m -  [1;37;40m Updating tasks status:[0m (Tue Sep  1 12:42:05 2020)
[32mINFO[0m -  Thread [1;37;40mcog_all-alg_concat_default-fasttree_full[0m: pending tasks: [1;33m1[0m of sizes: 15
[32mINFO[0m -   (Q[0m) [31mConcatAlgTask[0m (15 species, 33 COGs, ConcatAlg, /[1;37;40mcog_all-al...ttree_full[0m)
[32mINFO[0m -  [1;37;40mWaiting 2 seconds[0m
[32mINFO[0m -  [1;33mwaiting for 2 cores
[32mINFO[0m -  [1;33mLaunched[0m 2 jobs. 2(R), 25(W). Cores usage: 4/5
[32mINFO[0m -  [1;37;40m Updating tasks status:[0m (Tue Sep  1 12:42:07 2020)
[32mINFO[0m -  Thread [1;37;40mcog_all-alg_concat_default-fasttree_full[0m: pending tasks: [1;33m1[0m of sizes: 15
[32mINFO[0m -   (Q[0m) [31mConcatAlgTask[0m (15 species, 33 COGs, ConcatAlg, /[1;37;40mcog_all-al...ttree_full[0m)
[32mINFO[0m 

[32mINFO[0m -  [1;33mwaiting for 2 cores
[32mINFO[0m -  [1;33mLaunched[0m 2 jobs. 2(R), 3(W). Cores usage: 4/5
[32mINFO[0m -  [1;37;40m Updating tasks status:[0m (Tue Sep  1 12:42:42 2020)
[32mINFO[0m -  Thread [1;37;40mcog_all-alg_concat_default-fasttree_full[0m: pending tasks: [1;33m1[0m of sizes: 15
[32mINFO[0m -   (Q[0m) [31mConcatAlgTask[0m (15 species, 33 COGs, ConcatAlg, /[1;37;40mcog_all-al...ttree_full[0m)
[32mINFO[0m -  [1;37;40mWaiting 2 seconds[0m
[32mINFO[0m -  [1;33mwaiting for 2 cores
[32mINFO[0m -  [1;33mLaunched[0m 2 jobs. 2(R), 1(W). Cores usage: 4/5
[32mINFO[0m -  [1;37;40m Updating tasks status:[0m (Tue Sep  1 12:42:44 2020)
[32mINFO[0m -  Thread [1;37;40mcog_all-alg_concat_default-fasttree_full[0m: pending tasks: [1;33m1[0m of sizes: 15
[32mINFO[0m -   (Q[0m) [31mConcatAlgTask[0m (15 species, 33 COGs, ConcatAlg, /[1;37;40mcog_all-al...ttree_full[0m)
[32mINFO[0m -  [1;37;40mWaiting 2 seconds[0m
[32mINFO[0m - 

You can also use codon alignments to infer the tree from nucleotide rather than amino acid sequences, which is useful when the sequences are too similar at the amino acid level.
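
Codon alignments need a nucleotide sequence for each protein. The following is a minimal pre-flight sketch, assuming that both FASTA files use the same sequence names and that each coding sequence is three times the protein length, optionally plus one stop codon. `read_fasta` and `check_codon_pairs` are hypothetical helpers written for illustration, not ETE functions, and the sequences are toy data.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dict."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs

def check_codon_pairs(prots, nts):
    """Return names whose nucleotide sequence is missing or has an unexpected length."""
    problems = []
    for name, aa in prots.items():
        nt = nts.get(name)
        # Accept exactly 3 nt per residue, with or without a trailing stop codon
        if nt is None or len(nt) not in (3 * len(aa), 3 * len(aa) + 3):
            problems.append(name)
    return problems

# Toy data; in practice, pass open file handles instead of lists
prots = read_fasta([">sp1.geneA", "MKT", ">sp2.geneA", "MK"])
nts = read_fasta([">sp1.geneA", "ATGAAAACC", ">sp2.geneA", "ATGAA"])
print(check_codon_pairs(prots, nts))
```

Here `sp2.geneA` is flagged because its nucleotide sequence has 5 bases, which matches neither 6 nor 9.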

In [17]:
# TASK: Execute ete build to infer a species tree out of the marker COGs, using codon alignments and nt sequences
!ete3 build -w mafft_default-none-none-none -m sptree_fasttree_all -o data/sptree \
           --cogs data/marker_cogs.tsv -a data/all_prots.faa -n data/all_prots.fna --nt-switch-threshold 0.0 \
           --spname-delimiter . --clearall --cpu 5 --noimg


Toolchain path: /home/huerta/miniconda3/envs/eccb20/bin/ete3_apps/bin 
Toolchain version: unknown
['mafft_default-none-none-none']
['cog_all-alg_concat_default-fasttree_full']

      --------------------------------------------------------------------------------
                  ETE build (3.1.2) - reproducible phylogenetic workflows

      Citation:

       Huerta-Cepas J, Serra F and Bork P. ETE 3: Reconstruction, analysis and
       visualization of phylogenomic data. Mol Biol Evol (2016)
       doi:10.1093/molbev/msw046

      (Note that a list of the external programs used to complete all necessary
      computations will be shown after workflow execution. Those programs should
      also be cited.)

      --------------------------------------------------------------------------------
      
[32mINFO[0m -  Testing x86-64  portable applications...
       clustalo: [32mOK[0m - 1.2.4
      dialigntx: [32mOK[0m - This is DIALIGN-TX Version 1.0.2 - A Multiple Sequence alignmen

In [1]:
!cp "data/sptree/cog_all-alg_concat_default-fasttree_full/all_prots.faa.final_tree.nw" "data/sptree.nw" 

!ete3 annotate --ncbi -t "data/sptree.nw"| ete3 view --ncbi 

Traceback (most recent call last):
  File "/home/huerta/miniconda3/envs/etecourse/bin/ete3", line 8, in <module>
    sys.exit(main())
  File "/home/huerta/miniconda3/envs/etecourse/lib/python3.6/site-packages/ete3/tools/ete.py", line 95, in main
    _main(sys.argv)
  File "/home/huerta/miniconda3/envs/etecourse/lib/python3.6/site-packages/ete3/tools/ete.py", line 269, in _main
    args.func(args)
  File "/home/huerta/miniconda3/envs/etecourse/lib/python3.6/site-packages/ete3/tools/ete_annotate.py", line 61, in run
    tree.annotate_ncbi_taxa(args.taxid_attr)
  File "/home/huerta/miniconda3/envs/etecourse/lib/python3.6/site-packages/ete3/phylo/phylotree.py", line 798, in annotate_ncbi_taxa
    return ncbi.annotate_tree(self, taxid_attr=taxid_attr, tax2name=tax2name, tax2track=tax2track, tax2rank=tax2rank)
  File "/home/huerta/miniconda3/envs/etecourse/lib/python3.6/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 545, in annotate_tree
    lineage = tax2track[node_taxid],

You can also load the species tree in Python and work with it:

In [37]:
from ete3 import PhyloTree
t = PhyloTree("data/sptree/cog_all-alg_concat_default-fasttree_full/all_prots.faa.final_tree.nw")

# Phylogenetic distance from our strain to the reference genome:
print(t.get_distance("224324999", "224324"))

# Pairwise phylogenetic distances among all taxa
print("Cophenetic distance matrix:\n",  t.cophenetic_matrix())

0.0
Cophenetic distance matrix:
 ([[0, 0.9269901, 0.9766983999999999, 1.0198972, 0.992823, 0.992823, 1.776664, 1.8348280000000001, 0.9181431, 1.0569824, 0.9746603, 1.7396190000000002, 1.0144897, 1.0255883, 1.8264490000000002], [0.9269901, 0, 0.8590059, 0.9022047000000001, 1.0524091, 1.0524091, 1.8362501, 1.8944141, 0.6481650000000001, 0.9392898999999999, 0.8569678000000001, 1.7992051, 0.8967972000000001, 0.9078958, 1.8860351], [0.9766983999999999, 0.8590059, 0, 0.2902328, 1.1021174, 1.1021174, 1.8859584000000003, 1.9441224, 0.8501588999999999, 0.56747, 0.6050761, 1.8489134000000003, 0.2848253, 0.6560041, 1.9357434], [1.0198972, 0.9022047000000001, 0.2902328, 0, 1.1453162, 1.1453162, 1.9291572000000001, 1.9873212000000002, 0.8933577, 0.6106688000000001, 0.6482749000000001, 1.8921122000000001, 0.0933881, 0.6992029000000001, 1.9789422], [0.992823, 1.0524091, 1.1021174, 1.1453162, 0, 0.0, 1.351439, 1.4096030000000002, 1.0435621, 1.1824013999999998, 1.1000793, 1.314394, 1.1399086999999999, 
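
The printed result above is a tuple, which suggests that `cophenetic_matrix()` returns the distance matrix together with the leaf names in matching order. Under that assumption, a minimal sketch for turning such a pair into a sorted list of labelled leaf pairs could look like this. `rank_pairs` is a hypothetical helper, and the three-leaf matrix is toy data, not output from the tree above.

```python
from itertools import combinations

def rank_pairs(matrix, names):
    """Return (distance, name_a, name_b) for every leaf pair, closest first."""
    pairs = [
        (matrix[i][j], names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    ]
    return sorted(pairs)

# Toy data for illustration only
toy_matrix = [[0, 0.5, 0.9],
              [0.5, 0, 0.2],
              [0.9, 0.2, 0]]
toy_names = ["A", "B", "C"]

for dist, a, b in rank_pairs(toy_matrix, toy_names):
    print(a, b, dist)
```

With the real tree, one would first unpack the result, for example `matrix, names = t.cophenetic_matrix()`, and pass both values to the function.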