A script that maps similar reads to a reference sequence file and extracts genes by annotation (such as the CEGMA output) with the advantage that each individual gene sequence remains in-frame. This allows for a wide range of subsequent analyses; for example, splitting the data by codon position or selecting individual genes. Although we used and tested the program specifically for using the CEGMA output, any gff annotation and reference fasta sequence should work.
The script offers an option to build RAxML phylogenies from each individual alignment. However, individual alignments can also be merged into a supermatrix for further analysis.
If you use this code please cite:
Leavitt, S.D., Grewe, F., Widhelm, T., Muggia, L., Wray, B., & Lumbsch, H.T. (2016). Resolving evolutionary relationships in lichen-forming fungi using diverse phylogenomic datasets and analytical approaches. Scientific Reports, 6:22262. link
Installed globally (executable must be in one of the directories listed in the PATH environment variable, as shown by echo $PATH
)
- Perl (with BioPerl)
- bowtie2
- bcftools (with vcfutils.pl) and samtools
- muscle
- raxmlHPC (optional)
additional Perl modules
- File::Which
- Getopt::Long
- Pod::Usage
You can use either single-end or paired-end reads in this analysis, but all reads will be handled as single-ended. We recommend that you trim your reads before use; we used Trimmomatic.
You need several files:
- A comma separated table that points to your sequence read files (see seq_loc.csv as an example)
- Reference fasta file; we used the CEGMA standard output file output.cegma.dna
- Annotation of the reference .fasta file in .gff format; we used the CEGMA standard output file output.cegma.local.gff
perl map_n_extract.pl -t [-f -g -r -p -y -h]
-table -t <.csv> Comma seperated table with NGS sequence read location (name,se_or_pe1_reads[,se_or_pe2_reads])
[-fasta -f <.fasta>] reference sequence file (default=output.cegma.dna)
[-gff -g <.gff>] annotation of reference sequence file (default=output.cegma.local.gff)
[-reference -r <STRING>] Name of reference sequence (default=Ref)
[-threads -p <INT>] Number of available threads (default=8)
[-run_phylo -y] If you want to run raxML phylogeny for individual genes
[-help -h] Displays this basic usage information
Please note that you can chose to run RAxML analyses for each individual gene alignment with option -y.
The script will produce two output folders. The OUTPUT_alignments folder holds all alignments of mapping across whole fasta reference sequences and the OUTPUT_good_CDS_alignments folder contains only gene alignments which are in frame with the open reading frame of the reference annotation.
If you chose to analyze phylogenetic inferences for each individual gene alignment (option -y), you'll also see a OUTPUT_phylogenies folder that contains all RAxML trees.
You can use the program FASconCAT (Kueck & Meusemann, 2010) to build a supermatrix. Place the FASconCAT perl scrip into the OUTPUT_good_CDS_alignment folder, execute with
perl FASconCAT_v1.0.pl
and follow the instructions.
If you chose the -y option before and see the OUTPUT_phylogenies folder, you can combine all individual gene trees into one newick file (all_tree.tre) with:
cat ./OUTPUT_phylogenies/*bipartitions\.*RAXML >> all_tree.tre