# Automatic switching from amino-acid to codon-based alignments

This recipe shows how to use `ete-build` to build nucleotide-based gene trees starting from amino-acid alignments.


Specifically, if the average sequence identity in the amino-acid alignment reaches or exceeds a given threshold, `ete-build` automatically back-translates the amino-acid alignment into a codon-based alignment and infers the tree with a nucleotide model (codon models are not supported).

This approach is useful when protein alignments do not provide enough phylogenetic resolution (synonymous substitutions are invisible at the amino-acid level), but you still want the better accuracy of amino-acid based alignments over nucleotide-based ones.


## Requirements
- ete3
- ete3_external_apps
- [basic concepts about ete-build](ete_build_basics.ipynb)



## Recipe

### 1. Prepare amino-acid and nucleotide multi-sequence FASTA files

Both FASTA files should contain exactly the same set of sequence names. Each nucleotide sequence should be the coding sequence of the corresponding amino-acid sequence (i.e. the lengths must be consistent, with three nucleotides per residue).



In [6]:
%%bash 
head -n5 data/NUP62.aa.fa data/NUP62.nt.fa

==> data/NUP62.aa.fa <==
>Phy003I7ZJ_CHICK
TMSQFNFSSAPAGGGFSFSTPKTAASTTAATGFSFTPAPSSGFTFGGAAPTPASSQPVTP
FSFSTPASSALPTAFSFGTPATATTAAPAASVFPLGGNAPKLNFGGTSTTQATGITGGFG
FGTSAPTSVPSSQAAAPSGFMFGTAATTTTTTTAAQPGTTGGFTFSSGTTTQAGTTGFNI
GATSTAAPQAVPTGLTFGAAPAAAATTTASLGSTTQPAATPFSLGGQSSATLTASTSQGP

==> data/NUP62.nt.fa <==
>Phy003I7ZJ_CHICK
ACCATGAGCCAGTTCAACTTCAGCTCGGCCCCGGCGGGAGGCGGCTTCTCCTTCAGCACGCCGAAAACGGCCGCCAGCAC
CACCGCGGCCACCGGCTTCTCCTTCACGCCCGCTCCCTCCTCGGGATTCACGTTCGGCGGCGCTGCTCCGACACCCGCCA
GCAGCCAGCCCGTCACGCCCTTCTCCTTCAGCACGCCGGCCAGCAGCGCGCTGCCCACCGCCTTCAGCTTCGGGACGCCC
GCAACAGCCACCACCGCCGCCCCGGCTGCCAGCGTGTTCCCGTTAGGGGGAAACGCACCAAAGCTCAACTTTGGAGGCAC
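The two requirements above can be checked before launching a workflow. The following is a minimal sketch, not part of `ete-build`: the helper names `read_fasta` and `check_pair` are hypothetical, and the check assumes that nucleotide sequences carry no trailing stop codon, so each one must be exactly three times as long as its protein.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def check_pair(aa_seqs, nt_seqs):
    """Return a list of problems; the list is empty if both sets are consistent."""
    problems = []
    # Names found in only one of the two files.
    for name in sorted(set(aa_seqs) ^ set(nt_seqs)):
        problems.append(f"{name}: present in only one file")
    # Assumption: no stop codon, so nt length must equal 3 x aa length.
    for name in sorted(set(aa_seqs) & set(nt_seqs)):
        n_aa, n_nt = len(aa_seqs[name]), len(nt_seqs[name])
        if n_nt != 3 * n_aa:
            problems.append(f"{name}: {n_nt} nt vs {n_aa} aa")
    return problems


# Toy data: seqB is deliberately inconsistent (7 nt for 2 residues).
aa = read_fasta(">seqA\nMSQ\n>seqB\nMS\n")
nt = read_fasta(">seqA\nATGAGCCAG\n>seqB\nATGAGCC\n")
print(check_pair(aa, nt))  # ['seqB: 7 nt vs 2 aa']
```

For real data, the file contents can be passed in with `read_fasta(open("data/NUP62.aa.fa").read())` and likewise for the nucleotide file.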


### 2. Enable mixed mode in ete-build workflows 
To do so:
- pass both FASTA files as arguments to `ete-build` (`-a` for proteins, `-n` for nucleotide sequences)
- specify a threshold for the aa->nt switch with `--nt-switch-threshold`. This is the average protein sequence identity at which `ete-build` stops building protein-based trees and switches to nucleotides.

If the average sequence identity in a protein alignment reaches or exceeds the threshold provided, `ete-build` will convert the alignment into a codon-based alignment and continue to infer the tree using a nucleotide model.
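To make the switching rule concrete, here is a minimal sketch of one way to compute an average identity and apply the threshold. It is an illustration only: how `ete-build` computes its identity value internally is not documented in this recipe, so the gap handling (columns with a gap in either sequence are skipped) and the function names are assumptions.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def mean_identity(aligned):
    """Average pairwise identity over all sequence pairs in an alignment dict."""
    scores = [pairwise_identity(a, b) for a, b in combinations(aligned.values(), 2)]
    return sum(scores) / len(scores) if scores else 0.0


def choose_alignment_type(aligned, threshold):
    """Return 'nt' when the mean identity reaches the threshold, else 'aa'."""
    return "nt" if mean_identity(aligned) >= threshold else "aa"


# Toy protein alignment (hypothetical sequences, not taken from NUP62).
toy = {"s1": "MSQFNF", "s2": "MSQFNY", "s3": "MSQ-NF"}
print(round(mean_identity(toy), 3))
print(choose_alignment_type(toy, 0.9))
```

Highly similar proteins leave few informative columns, which is why a high identity value is used as the trigger for moving to the nucleotide level.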

In the following example, we configure the workflow to switch to the nucleotide alignment if the average protein identity is 90% or higher.

In [19]:
%%bash 
ete3 build -a data/NUP62.aa.fa -n data/NUP62.nt.fa -o mixed_types/ -w standard_fasttree --clearall --nt-switch-threshold 0.9


Toolchain path: /Users/jhc/anaconda/bin/ete3_apps 
Toolchain version: 2.0.3


      --------------------------------------------------------------------------------
                  ETE build - reproducible phylogenetic workflows 
                                    unknown, unknown.

      If you use ETE in a published work, please cite:

        Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python
        Environment for Tree Exploration. BMC Bioinformatics 2010,
        11:24. doi:10.1186/1471-2105-11-24

      (Note that a list of the external programs used to complete all necessary
      computations will be also shown after execution. Those programs should
      also be cited.)
      --------------------------------------------------------------------------------

    
[32mINFO[0m -  Testing x86-64  portable applications...
       clustalo: [32mOK[0m - 1.2.1
[33mDialign-tx not supported in OS X[0m
       fasttree: [32mOK[0m - FastTree Version 2.1.8 Double 

During execution, a warning like this was raised: 

**```Switching to codon alignment! amino-acid sequence similarity: 0.91 >= 0.90```**
 
In the results folder, both the amino-acid and the nucleotide alignments are reported. The codon-based alignment is named `*.used_alg.fa`, as this is the alignment actually used to build the reported tree.
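The relationship between the two alignments can be sketched as threading each coding sequence onto its aligned protein: every residue becomes its codon and every gap becomes three gap characters. The function below is a hypothetical illustration of that idea, not the code `ete-build` runs. The example uses a short stretch of residues and codons from the `Phy00535AU_PYGAD` rows in this recipe, with the leading gap run shortened to two columns.

```python
def backtranslate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto an aligned protein sequence."""
    ungapped = aligned_protein.replace("-", "")
    if len(cds) != 3 * len(ungapped):
        raise ValueError("CDS length must be three times the protein length")
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == "-":
            # One gap column in the protein is three gap columns in the codons.
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


aligned = "--AAQTPA-SSQ"
cds = "GCGGCCCAGACGCCTGCCAGCAGCCAG"
print(backtranslate(aligned, cds))  # ------GCGGCCCAGACGCCTGCC---AGCAGCCAG
```

Because the codon alignment inherits its columns from the protein alignment, it is always exactly three times as long.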


In [17]:
%%bash
head -n2 mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.fa

>Phy00535AU_PYGAD
------------------------------------------------AAQTPA-SSQPAGLFSFSTPGAAA-QPASFSFGTPATAA-AAPAANVFPLGANAPKLNFGGSAATQATGITGGFGFGSSVPTSVPSSQAAAPSGFVFGCAGTTTTTT---TTSAQSGTTGTFTFSSGTATQAGTPSFNIGAAA---PQAAPTGLTFGTAPAAA-ATTAATLGAATQS-TTPFCLGGQSA-------ATLTTSTSQGPTLSFGAKLGGRNTAPAAPPAAATTTTSILGSAGPTLFASIASSSAPTSA-TTTGLSLGAP---STGTASLGTLGFGLKVPGTTAAAT-STATSTT--SASGFALNLKPLTTTGAIGAGTSTAAITTATTA-SAPPVMTYAQLESLINKWSLELEDQEKHFLHQATQVNAWDRTLIENGEKITSLHREVEKVKLDQKRLDQELDFILSQQKELEDLLTPLEESVKEQSGTIYLQHADEERERT---------------------------------------------------------------------------------------------


In [18]:
%%bash
head -n2 mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.used_alg.fa

>Phy00535AU_PYGAD
------------------------------------------------------------------------------------------------------------------------------------------------GCGGCCCAGACGCCTGCC---AGCAGCCAGCCCGCCGGGCTCTTCTCCTTCAGCACGCCGGGCGCTGCCGCG---CAGCCTGCCAGCTTCAGCTTCGGGACGCCGGCCACGGCCGCC---GCGGCTCCGGCAGCAAACGTGTTCCCGCTGGGGGCAAATGCACCAAAATTAAACTTTGGAGGCAGCGCTGCAACTCAAGCTACTGGAATCACAGGGGGCTTTGGATTTGGTAGCTCTGTACCGACCAGCGTGCCCTCAAGTCAAGCAGCAGCCCCTTCTGGCTTTGTGTTTGGATGTGCTGGCACCACCACCACCACCACC---------ACCACCTCCGCTCAGTCTGGGACAACTGGAACGTTTACTTTCTCCAGTGGTACCGCAACTCAGGCCGGAACGCCCAGCTTCAACATTGGCGCTGCAGCT---------CCGCAGGCAGCGCCCACCGGGTTGACCTTTGGAACAGCACCTGCAGCTGCT---GCCACCACTGCTGCCACCTTAGGGGCCGCAACCCAGTCG---ACAACCCCCTTCTGCCTTGGGGGGCAGTCTGCC---------------------GCAACGCTGACCACTAGTACCAGCCAGGGACCCACTCTGTCCTTTGGAGCCAAACTTGGAGGTAGGAACACCGCACCCGCCGCTCCCCCGGCTGCCGCTACCACCACAACCTCCATTCTTGGTTCAGCGGGGCCTACGTTGTTTGCATCTATAGCGAGTTCTTCAGCACCGACGTCGGCT---ACCACCACGGGCCTCTCACTTGGTGCCCCT---------TCCACTGGGACAGCAAGTCTTGGAACGC