The pipeline compares multiple fasta files using nucmer and extracts aligned fragments that meet user-defined parameters.
- assembly pseudomolecule in fasta format.
- annotation files in gff and bed format. Requires both.
- Making a pseudomolecule for input fasta files
Stitch all contigs in a de novo assembly into a pseudomolecule using a linker sequence, or order the contigs against a reference genome using abacas.
Stitch contigs using the linker sequence "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN", or use the pseudomolecule generated by abacas contig ordering.
Pending: contig stitching bash script
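The linker-based stitching described above can be sketched in Python as follows. This is only an illustration, not the pipeline's own script: the function names (`read_fasta`, `stitch_contigs`, `write_pseudomolecule`) and the 60-column line width are assumptions, and contigs are joined in the order they appear in the input file.

```python
# Linker sequence quoted in this README.
LINKER = "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN"


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stitch_contigs(contigs, linker=LINKER):
    """Join contig sequences with the linker placed between neighbours."""
    return linker.join(contigs)


def write_pseudomolecule(in_fasta, out_fasta, name, width=60):
    """Write all contigs of in_fasta as one record; return its length."""
    sequence = stitch_contigs(seq for _, seq in read_fasta(in_fasta))
    with open(out_fasta, "w") as out:
        out.write(f">{name}\n")
        # Wrap the sequence at a fixed width (hypothetical default of 60).
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
    return len(sequence)
```

For example, `write_pseudomolecule("sample_contigs.fasta", "sample.fasta", "sample")` would produce a single-record fasta file; both file names here are placeholders.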
- Annotations:
Annotate pseudomolecules using prokka or any other tool. The script expects an individual annotation folder for each sample, containing both gff and bed files.
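Before starting a run, it may help to confirm that every sample has its annotation folder in place. The sketch below is a hypothetical pre-flight check and not part of the pipeline. It assumes that the fasta file prefix is the file name without its last extension, and that the annotation files end in `.gff` and `.bed`.

```python
import os


def fasta_prefix(fasta_name):
    """Assumed prefix rule: file name without its last extension."""
    return os.path.splitext(os.path.basename(fasta_name))[0]


def missing_annotations(filename_list, prokka_dir):
    """Map each fasta name to what is missing from its annotation folder.

    Samples with a folder holding both a gff and a bed file are omitted.
    """
    problems = {}
    with open(filename_list) as handle:
        fasta_names = [line.strip() for line in handle if line.strip()]
    for fasta_name in fasta_names:
        sample_dir = os.path.join(prokka_dir, fasta_prefix(fasta_name))
        if not os.path.isdir(sample_dir):
            problems[fasta_name] = ["<sample folder>"]
            continue
        # Collect the extensions present in the sample folder.
        present = {os.path.splitext(f)[1].lower() for f in os.listdir(sample_dir)}
        missing = [ext for ext in (".gff", ".bed") if ext not in present]
        if missing:
            problems[fasta_name] = missing
    return problems
```

An empty dictionary from `missing_annotations("filenames", "/path-to/fasta_file_annotations/")` would mean that every listed sample has a folder with both file types.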
usage: recombination_analysis.py [-h] -filename FILENAME -out OUT -prokka_dir
PROKKA_DIR [-jobrun JOBRUN] -dir DIR
-analysis ANALYSIS_NAME
[-remove_temp REMOVE_TEMP] -steps STEPS
[-pbs PBS]
Recombination/HGT Analysis.
The pipeline takes a list of fasta files and aligns All-vs-All using Nucmer.
Extracts aligned region by parsing nucmer coordinate and gff/bed annotations to extract regions that matches user defined percent identity and minimum aligned length parameters.
Generates a preliminary reference database out of extracted aligned regions by deduplicating and removing containments.
Removes containment fragments from preliminary database using nucmer.
Performs nucmer alignment between pseudomolecule fasta file and final containment removed aligned fragments to generate an alignment matrix score.
optional arguments:
-h, --help show this help message and exit
Required arguments:
-filename FILENAME This file should contain a list of fasta filenames(one per line) that the user wants to use from argument -dir folder. For Genome coordinate consistency, make sure the fasta files are in a pseudomolecule format
-out OUT Output directory to save the results
-prokka_dir PROKKA_DIR
Directory containing results of Prokka annotation pipeline or individual sample folders consisting gff and bed file. The folder name should match the fasta file prefix.
-jobrun JOBRUN Type of job to run. Run script on a compute cluster, parallelly on local or on local system(default): cluster, parallel-local, local
-dir DIR Directory containing fasta files specified in -filename list
-analysis ANALYSIS_NAME
Unique Analysis Name to save results with this prefix
Optional arguments:
-remove_temp REMOVE_TEMP
Remove Temporary directories from /tmp/ folder: yes/no
-steps STEPS Analysis Steps to be performed. Use All or 1,2,3,4,5 to run all steps of pipeline.
1: Align all assembly fasta input file against each other using Nucmer.
2: Parses the Nucmer generated aligned coordinates files, extract individual aligned fragments and their respective annotation for metadata.
3: Generate a database of these extracted aligned regions by deduplicating and removing containments using BBmaps dedupe tool.
4: Remove containments from preliminary database by running nucmer
5: Performs nucmer alignment between input fasta file and final containment removed extracted fragments to generate an alignment score matrix.
-pbs PBS Provide PBS memory resources for individual nucmer jobs. Default: nodes=1:ppn=1,pmem=4000mb,walltime=6:00:00
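Step 2 in the help text above keeps only alignments that meet the percent identity and minimum aligned length parameters. A minimal sketch of that kind of filter is shown below. It is not the pipeline's own parser. It assumes tab-delimited coordinates as produced by `show-coords -T` (start/end in the reference, start/end in the query, both alignment lengths, percent identity, then the reference and query names). The threshold names `min_identity` and `min_length` are hypothetical, and applying the length cutoff to the reference-side length is also an assumption.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    ref_len: int
    qry_len: int
    identity: float
    ref_name: str
    qry_name: str


def parse_coords(lines: Iterable[str]) -> Iterator[Alignment]:
    """Parse tab-delimited coordinate lines, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        try:
            numbers = [int(value) for value in fields[:6]]
            identity = float(fields[6])
        except ValueError:
            continue  # non-numeric line, e.g. a column header
        yield Alignment(*numbers, identity, fields[-2], fields[-1])


def filter_alignments(alignments: Iterable[Alignment],
                      min_identity: float,
                      min_length: int) -> List[Alignment]:
    """Keep alignments at or above both thresholds."""
    return [a for a in alignments
            if a.identity >= min_identity and a.ref_len >= min_length]
```

With thresholds of, say, 99.0 percent identity and 1000 bp, `filter_alignments(parse_coords(open("sample.coords")), 99.0, 1000)` would return only the long, near-identical alignments; the file name and threshold values are placeholders.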
python recombination_analysis.py -filename filenames -out /path-to-out-dir/ -prokka_dir /path-to/fasta_file_annotations/ -jobrun parallel-local -dir /path-to-pseudomolecule/fasta_files/ -analysis 2018_07_18_analysis_name -steps All
or
python recombination_analysis.py -filename filenames -out /path-to-out-dir/ -prokka_dir /path-to/fasta_file_annotations/ -jobrun parallel-local -dir /path-to-pseudomolecule/fasta_files/ -analysis 2018_07_18_analysis_name -steps 1,2,3,4,5
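The `filenames` file passed to -filename in the commands above lists one fasta file name per line. One way to generate it from the -dir folder is sketched below. The helper name `write_filename_list` and the set of recognised extensions are assumptions, as is the choice to write bare file names rather than full paths.

```python
import os

# Hypothetical set of extensions treated as fasta files.
FASTA_EXTENSIONS = (".fasta", ".fa", ".fna")


def write_filename_list(fasta_dir, list_path, extensions=FASTA_EXTENSIONS):
    """Write the fasta file names found in fasta_dir, one per line."""
    names = sorted(
        name for name in os.listdir(fasta_dir)
        if name.lower().endswith(extensions)
    )
    with open(list_path, "w") as out:
        for name in names:
            out.write(name + "\n")
    return names
```

Calling `write_filename_list("/path-to-pseudomolecule/fasta_files/", "filenames")` would then produce the list used in the example commands. Edit the file by hand afterwards to drop any samples that should not be part of the analysis.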
- Final_HGT_score_matrix.csv: This file contains the final score matrix computed from nucmer alignments between the uniquely extracted fragments and the input assembly fasta files.
- Final_HGT_score_matrix_meta.tsv: This file contains gene annotations for each uniquely extracted fragment.