map_n_extract

A script that maps similar reads to a reference sequence file and extracts genes by annotation (such as the CEGMA output) with the advantage that each individual gene sequence remains in-frame. This allows for a wide range of subsequent analyses; for example, splitting the data by codon position or selecting individual genes. Although we used and tested the program specifically for using the CEGMA output, any gff annotation and reference fasta sequence should work.

The script offers an option to build RAxML phylogenies from each individual alignment. However, individual alignments can also be merged into a supermatrix for further analysis.

If you use this code please cite:
Leavitt, S.D., Grewe, F., Widhelm, T., Muggia, L., Wray, B., & Lumbsch, H.T. (2016). Resolving evolutionary relationships in lichen-forming fungi using diverse phylogenomic datasets and analytical approaches. Scientific Reports, 6:22262. link

Requirements:

Installed globally (executable must be in one of the directories listed in the PATH environment variable, as shown by echo $PATH)

Perl (with BioPerl)
bowtie2
bcftools (with vcfutils.pl) and samtools
muscle
raxmlHPC (optional)

additional Perl modules

File::Which
Getopt::Long
Pod::Usage

To run the script:

Illumina sequence data

You can use either single-end or paired-end reads in this analysis, but all reads will be handled as single-ended. We recommend that you trim your reads before use; we used Trimmomatic.

Have your files ready

You need several files:

A comma separated table that points to your sequence read files (see seq_loc.csv as an example)
Reference fasta file; we used the CEGMA standard output file output.cegma.dna
Annotation of the reference .fasta file in .gff format; we used the CEGMA standard output file output.cegma.local.gff

Run the script

perl map_n_extract.pl  -t [-f -g -r -p -y -h]

 -table -t <.csv>			Comma seperated table with NGS sequence read location (name,se_or_pe1_reads[,se_or_pe2_reads])
 
 [-fasta -f <.fasta>]		reference sequence file (default=output.cegma.dna)
 [-gff -g <.gff>]			annotation of reference sequence file (default=output.cegma.local.gff)
 [-reference -r <STRING>]	Name of reference sequence (default=Ref)
 [-threads -p <INT>]		Number of available threads (default=8)
 [-run_phylo -y]			If you want to run raxML phylogeny for individual genes
 [-help -h]					Displays this basic usage information

Please note that you can chose to run RAxML analyses for each individual gene alignment with option -y.

Output

The script will produce two output folders. The OUTPUT_alignments folder holds all alignments of mapping across whole fasta reference sequences and the OUTPUT_good_CDS_alignments folder contains only gene alignments which are in frame with the open reading frame of the reference annotation.
If you chose to analyze phylogenetic inferences for each individual gene alignment (option -y), you'll also see a OUTPUT_phylogenies folder that contains all RAxML trees.

Further Processing

Supermatrix

You can use the program FASconCAT (Kueck & Meusemann, 2010) to build a supermatrix. Place the FASconCAT perl scrip into the OUTPUT_good_CDS_alignment folder, execute with

perl FASconCAT_v1.0.pl

and follow the instructions.

Combining all trees

If you chose the -y option before and see the OUTPUT_phylogenies folder, you can combine all individual gene trees into one newick file (all_tree.tre) with:

cat ./OUTPUT_phylogenies/*bipartitions\.*RAXML >> all_tree.tre

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
test_files_for_local_use		test_files_for_local_use
FASconCAT_v1.0.pl		FASconCAT_v1.0.pl
LICENSE		LICENSE
README.md		README.md
map_n_extract.pl		map_n_extract.pl
seq_loc.csv		seq_loc.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_files_for_local_use

test_files_for_local_use

FASconCAT_v1.0.pl

FASconCAT_v1.0.pl

LICENSE

LICENSE

README.md

README.md

map_n_extract.pl

map_n_extract.pl

seq_loc.csv

seq_loc.csv

Repository files navigation

map_n_extract

Requirements:

To run the script:

Illumina sequence data

Have your files ready

Run the script

Output

Further Processing

Supermatrix

Combining all trees

About

Releases

Packages

Contributors 3

Languages

License

felixgrewe/map_n_extract

Folders and files

Latest commit

History

Repository files navigation

map_n_extract

Requirements:

To run the script:

Illumina sequence data

Have your files ready

Run the script

Output

Further Processing

Supermatrix

Combining all trees

About

Resources

License

Stars

Watchers

Forks

Languages