* Under Reconstruction *

ECPP

Exon Capture for Phylogenetics

Contributors:

Adnan Moussalli, Tim O'Hara, Mark Phuong, Luisa Teasdale, and Jose Grau

This workflow is essentially a BASH wrapper around a selection of established programs and in-house scripts that take you from raw sequence data to final alignments ready for downstream phylogenetic analyses.

This pipeline has many many dependencies, hence I have containerised it. You will need APPTAINER installed on you system. See the following for instructions:

https://apptainer.org/docs/admin/1.0/installation.html

https://github.com/apptainer/apptainer/blob/main/INSTALL.md

Things you need:

Raw sequences, demultiplex. It is a good idea to rename them to something meaningful, e.g. EC1_FAMILY_GENUS_SPECIES_REGNO_LOCALITY. This name will carry through all the way to the first tree you produce. Create a new folder for each projects, and in that folder place the raw sequences in a subfolder named "0_raw". Note, all sequence files must end in *_R1.fastq.gz or *_R2.fastq.gz.
A reference fasta file containing a single representative sequence for each exon, translated. This is typically your bait design before you chopped them up into baits.
A list of the samples you want to process as a text file. These will be the names of the raw sequence files, but without "_R1.fastq.gz".
Please ensure that the reference fasta file and the sample list text file are in Unix format. "DOS uses carriage return and line feed ("\r\n") as a line ending, while Unix uses just line feed ("\n")." It is the number one reason why the pipeline might fail to execute. Simply install "dos2unix" and convert all text based files you bring over from windows before running the pipelines. One day I will automate this step.

ECPP has 5 modules and are run sequentially. Please follow the syntax outlined in the file ../Test

CLEAN - this module remove duplicates, subsets the raw reads to only those having a hit to the reference (though the threshold is somewhat relaxed) and quality trims and removes adapters using trimmomatic.
TRINITY - this module assembles using the CLEAN reads. I have tried and tested many assemblers. Based on my experience, Trinity remains the best options. I plan to benchmark it against MEGAHITS shortly, but for now TRINITY it is. It then tblastn the reference (as query) against the sample specific assembly (as subject). The best hit per reference exon is identified and the local hit coordinates used to extract the corresponding exon match from the corresponding assembled contig. This creates a new reference which is "sample specific".
MAP - this module maps the CLEAN reads onto the new "sample specific" reference using BBMAP, followed by variant calling using VARSCAN.
ALIGN - this module will produce a separate fasta alignment for each targeted exon, and 'all exons combined' supermatrix files in fasta, nexus (with CHAR SET defined) and phylip format.
SUBSET – based on summary file from the last module, you identify either exons or taxa that should be removed. Here is you opportunity to do so. Realignment are done on these subsets with some additional polishing. Final alignments are ready for phylogenetic analyses.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Legacy/1.2.1		Legacy/1.2.1
Test		Test
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
README_1.3.2.docx		README_1.3.2.docx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Legacy/1.2.1

Legacy/1.2.1

Test

Test

.gitattributes

.gitattributes

.gitignore

.gitignore

README.md

README.md

README_1.3.2.docx

README_1.3.2.docx

Repository files navigation

* Under Reconstruction *

ECPP

About

Releases

Packages

Languages

Victaphanta/ECPP

Folders and files

Latest commit

History

Repository files navigation

*** Under Reconstruction ***

ECPP

About

Resources

Stars

Watchers

Forks

Languages

* Under Reconstruction *