# Building species trees using concatenated alignments

This recipe shows how to automate the reconstruction of species trees based on several multiple sequence alignments. 


## Requirements
- ete3
- ete3_external_apps
- [basic concepts about ete-build](ete_build_basics.ipynb) 
- [composing custom workflows](ete_build_workflows.ipynb)

## Recipe

Reconstructing a species tree from several concatenated alignments requires two types of workflow:
- a gene-tree workflow used to align the sequences of each gene family (`-w`) 
- a workflow to concatenate and build a tree based on the supermatrix alignment (`-m`)

Typically, all sequences used in a concatenated alignment are grouped into Orthologous Groups (OGs). Within each OG, exactly one sequence is expected to represent a given species.
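To make the idea of a supermatrix concrete, here is a minimal Python sketch of how per-OG alignments could be joined into one row per species. It is an illustration only, not the code `ete-build` runs: the function name `concatenate_alignments` is hypothetical, and padding species that are absent from an OG with gap characters is an assumption.

```python
def concatenate_alignments(og_alignments, delimiter="_"):
    """Join per-OG alignments into one concatenated sequence per species.

    og_alignments: list of dicts mapping sequence ID -> aligned sequence.
    IDs are assumed to look like SpeciesCode<delimiter>SequenceName.
    """
    species = sorted({seq_id.split(delimiter, 1)[0]
                      for aln in og_alignments for seq_id in aln})
    parts = {sp: [] for sp in species}
    for aln in og_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an OG alignment must have the same length")
        width = lengths.pop()
        by_species = {}
        for seq_id, seq in aln.items():
            sp = seq_id.split(delimiter, 1)[0]
            if sp in by_species:
                # One and only one sequence per species is expected in an OG.
                raise ValueError(f"species {sp!r} appears more than once in an OG")
            by_species[sp] = seq
        for sp in species:
            # Assumption: a species missing from this OG is padded with gaps.
            parts[sp].append(by_species.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Two toy OG alignments (made-up sequences, for illustration only).
og_alignments = [
    {"sp1_seqA": "MK-L", "sp2_seqA": "MKAL", "sp3_seqA": "MRAL"},
    {"sp1_seqB": "GT", "sp2_seqB": "GS"},
]
supermatrix = concatenate_alignments(og_alignments)
```

Every row of the resulting supermatrix has the same length, which is what a tree-inference program expects from a single alignment.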


### 1. Prepare data: sequences and orthologous groups 

- The COGs file must be a plain-text file whose sequence IDs match those in the input FASTA file. Each tab-delimited line is treated as one COG.

   For instance, the following example defines three COGs containing 3, 2 and 4 sequences, respectively:
```
sp1_seqA   sp2_seqA    sp3_seqA
sp1_seqB   sp2_seqB    
sp1_seqC   sp3_seqC    sp4_seqC    sp5_seqC
```          

- By default, the expected format for sequence identifiers is *`SpeciesCode_SequenceName`*. The species code must always precede the sequence name, but you can change the default underscore delimiter using `--spname-delimiter`.

- All sequences must be provided in a single FASTA file. 
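Before launching a run, it can help to check the inputs against the rules above. The following Python sketch is one way to do so, under stated assumptions: `read_cogs` and `check_cogs` are hypothetical helper names, not part of ETE, and the set of FASTA identifiers is assumed to have been collected separately.

```python
from collections import Counter


def read_cogs(lines):
    """Parse tab-delimited lines into COGs (lists of sequence IDs)."""
    cogs = []
    for line in lines:
        ids = [field.strip() for field in line.split("\t") if field.strip()]
        if ids:  # skip blank lines
            cogs.append(ids)
    return cogs


def check_cogs(cogs, fasta_ids, delimiter="_"):
    """Return a list of problems found in the COG definitions."""
    problems = []
    for index, cog in enumerate(cogs, start=1):
        for seq_id in cog:
            if seq_id not in fasta_ids:
                problems.append(f"COG {index}: {seq_id} not found in FASTA file")
        # The species code is the part of the ID before the first delimiter.
        counts = Counter(seq_id.split(delimiter, 1)[0] for seq_id in cog)
        for sp, count in counts.items():
            if count > 1:
                problems.append(f"COG {index}: species {sp} has {count} sequences")
    return problems


# The three example COGs from above, written with explicit tab characters.
cog_lines = [
    "sp1_seqA\tsp2_seqA\tsp3_seqA\n",
    "sp1_seqB\tsp2_seqB\t\n",
    "sp1_seqC\tsp3_seqC\tsp4_seqC\tsp5_seqC\n",
]
cogs = read_cogs(cog_lines)

# Pretend one ID is missing from the FASTA file to trigger a report.
fasta_ids = {seq_id for cog in cogs for seq_id in cog} - {"sp5_seqC"}
problems = check_cogs(cogs, fasta_ids)
```

An empty `problems` list would mean that every COG member exists in the FASTA file and that no species is represented twice within a COG.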


### 2. Choose a gene-tree workflow for aligning sequences within each OG

Sequences belonging to each OG must be aligned prior to concatenation. Any ete-build gene-tree workflow can be used for this; pass it with the `-w` option.

If an alignment trimming step is present in the gene-tree workflow, the trimmed version of the alignment will be used for concatenation. 
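As a rough illustration, the sketch below splits a gene-tree workflow name such as `clustalo_default-trimal01-none-none` into its components and checks whether a trimming step is present. It assumes the four-part `aligner-trimmer-tester-builder` naming convention covered in the ete-build basics recipe, and handles only names written out in that form; the class and function names are hypothetical.

```python
from typing import NamedTuple


class GeneTreeWorkflow(NamedTuple):
    aligner: str
    trimmer: str
    tester: str
    builder: str


def parse_genetree_workflow(name):
    """Split a four-part, dash-separated workflow name into its slots."""
    parts = name.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated parts, got {len(parts)}: {name!r}")
    return GeneTreeWorkflow(*parts)


workflow = parse_genetree_workflow("clustalo_default-trimal01-none-none")

# A trimmer slot other than "none" means the trimmed alignment
# is the one that goes into the concatenation.
uses_trimmed_alignment = workflow.trimmer != "none"
```

In this example the last two slots are `none`, since only the alignment (and its trimmed version) is needed for concatenation.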


### 3. Choose a workflow to select OGs from the set and infer the final tree

Supermatrix (concatenated) workflow names are defined in much the same way as gene-tree workflow names.

There are three master tasks in a supermatrix workflow: 

- OG selection: Used to define the set of OGs that will be used to build the concatenated alignment. Although this can be done manually, `ete-build` offers several automatic options to discard OGs missing a given percentage of species.

- Alignment concatenation: This task accepts no options at the moment, as it simply calls the gene-tree workflow of choice to align the OG sequences.

- Tree inference: Used to infer the tree based on the concatenated alignment. 

In most cases, the _OG selection_ step will default to `cog_all`.  
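The OG selection idea can be sketched in a few lines of Python. This is not `ete-build`'s implementation: the function `select_ogs` and its `max_missing_fraction` parameter are hypothetical, and the sketch assumes that the full species set is simply every species seen across all OGs.

```python
def select_ogs(cogs, max_missing_fraction=0.0, delimiter="_"):
    """Keep OGs that miss at most the given fraction of all species."""
    def species_of(cog):
        return {seq_id.split(delimiter, 1)[0] for seq_id in cog}

    # Assumption: the target species are all species seen in any OG.
    all_species = set().union(*map(species_of, cogs))
    selected = []
    for cog in cogs:
        missing = len(all_species - species_of(cog)) / len(all_species)
        if missing <= max_missing_fraction:
            selected.append(cog)
    return selected


cogs = [
    ["sp1_seqA", "sp2_seqA", "sp3_seqA"],              # misses 2 of 5 species
    ["sp1_seqB", "sp2_seqB"],                          # misses 3 of 5 species
    ["sp1_seqC", "sp3_seqC", "sp4_seqC", "sp5_seqC"],  # misses 1 of 5 species
]

strict = select_ogs(cogs)                              # no missing species allowed
relaxed = select_ogs(cogs, max_missing_fraction=0.5)   # up to half may be missing
```

With these toy OGs, the strict setting keeps nothing because no OG covers all five species, while the relaxed setting keeps the first and third OGs. A stricter threshold gives a more complete supermatrix at the cost of fewer genes.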




```bash
ete3 build -w clustalo_default-trimal01-none-none -m sptree_fasttree_100 -o basic_sptree/ --clearall -a data/proteome_seqs.fa.gz --cogs data/cogs.txt
```

Toolchain path: /Users/jhc/anaconda/bin/ete3_apps 
Toolchain version: 2.0.3


      --------------------------------------------------------------------------------
                  ETE build - reproducible phylogenetic workflows 
                                    unknown, unknown.

      If you use ETE in a published work, please cite:

        Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python
        Environment for Tree Exploration. BMC Bioinformatics 2010,
        11:24. doi:10.1186/1471-2105-11-24

      (Note that a list of the external programs used to complete all necessary
      computations will be also shown after execution. Those programs should
      also be cited.)
      --------------------------------------------------------------------------------

    
INFO -  Testing x86-64  portable applications...
       clustalo: OK - 1.2.1
Dialign-tx not supported in OS X
       fasttree: OK - FastTree Version 2.1.8 Double 