Improving draft genome assembly

This is a repository for scripts I used in improving and annotating the Purple Martin (Progne subis) draft genome assembly. These scripts were executed using the Texas A&M High Preformance Research Computing resources (https://hprc.tamu.edu/).

Improving draft genome assembly

Starting input files (with my file sizes as example)

Assembled reference genome (1 fasta file 1.16GB)
PacBio reads (5 bam files totalling 83GB)
Illumina reads (3 fastq files totalling 75GB)

The main steps in this assembly improvement are polishing the genome with Arrow and Pilon, and scaffolding it with ARKS+LINKS. There are some intermediate steps in between these programs to create necessary input files. In addition, I also checked for duplicate scaffolds, contaminants, and assessed genome quality. All steps are listed below (approximate run time in parentheses) with matching numbers to the script file names in the assembly folder. 📁

Polish genome using Arrow, with PacBio subreads.bam files listed in input.fofn file (run time 2 days)
Align illumina reads to the arrow-corrected genome using BWA (run time 1.5 days)
Sort, index, and look at stats of aligned bam file with samtools (run time 2.5 hrs)
Polish genome using Pilon (run time 12 hrs)
Create interleaved linked reads file from illumina reads using LongRanger (run time ~2 days)
Scaffold genome using the ARKS+LINKS pipline (run time 13 hrs)
Identify and remove duplicate scaffolds with bbmap (run time ~2 min)
Shorten scaffold header names and calculate scaffold lengths with sed, cat, & awk (run time ~2 min)
Evaluate genome metrics with QUAST (run time 4 min)
Assess genome assembly completeness with BUSCO (run time 3-4 days)
Scan full genome for any contaminants (step ii. run time ~3.5 days, with full genome split over 5 jobs)
1. Download taxonomy database from NCBI, splitting reference genome file into chunks, and making lists ready for BLAST step
2. Use BLAST+ for each list of scaffolds, using script largely based from kdelmore
3. Check outputs for non-eukaryota using grep

Chromosomal synteny

Synteny using chicken chromosomal information was done with the chicken genome accession# GCA_000002315.5. This was used to order and orient the purple martin scaffolds in chromosomal positions and was completed through the Satsuma Synteny program (example script with different species here: https://github.com/edegreef/NBW-resequencing/blob/main/reference_genome/03-satsuma.sh). The data file containing the output information ("purple_scafs_ordered_by_chicken.csv") is uploaded in this repository (https://github.com/edegreef/PUMA-reference-genome/blob/master/purple_scafs_ordered_by_chicken.csv)

Genome annotation

This annotation pipeline uses MAKER, running multiple rounds and using programs such as repeatmasker, exonerate, snap, and augustus. All steps are listed below (approximate run time in parentheses) with matching numbers to the script file names in the annotation folder. 📁 Note: Normally people use both RNA & protein data for annotation, however, I am only using protein evidence. These following instructions are using protein evidence only (and for bird data).

Prepare input files for running the first round of MAKER, including optionally splitting genome file into multiple chunks (to save time), downloading model data from ensembl (chicken, zebra finch, flycatcher), and obtaining maker control files
First round of MAKER
1. Prepare maker_opts.ctl file with adjusted parameters to run RepeatMasker and Exonerate, and a few other parameter changes (protein=protein fasta files, model_org=aves, protein2genome=1, cpus=20, min_contig=5000). I added the "02-" in the file name to organize opts file for this step since we will be editing and using this file again later. Should remove the "02-" in the name when running maker.
2. Run maker_run.lsf. This script is largely based off from TAMU's GCATemplates, creating a temporary directory for the large number of files MAKER will create. I ran 5 of these scripts (maker_run0.lsf, maker_run1.lsf, maker_run2.lsf....), adjusted for each genome chunk created in Step 1 so I can still use the same maker_opts.ctl file for each job. (run time 2 days - with 5 genome chunks running simultaneously (total 1.14GB))
Create HMM model using .gff files from maker's outputs (run time ~10 min)
Second round of MAKER
1. Prepare maker_opts.ctl file with updated parameters to train SNAP (maker_gff=merged gff file from round1, protein_pass=1, rm_pass=1, snaphmm=hmm file created in step 3, protein=#remove, model_org=#remove, repeat_protein=#remove, protein2genome=0, pred_stats=1)
2. Run same job script files used in first round of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 4 hours - with 5 chunks running simultaneously)
Create 2nd HMM model using .gff files from 2nd round of maker (run time ~10 min)
Third round of MAKER
1. Prepare maker_opts.ctl file with updated parameters to re-train SNAP and include Augustus chicken model (maker_gff=merged gff file from round2, snaphmm=hmm file created in step 5, augustus_species=chicken)
2. Run same job script files used in previous rounds of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 15 hours - with 5 chunks running simultaneously)
Create 3rd HMM model using .gff files from 3rd round of maker (run time ~10 min)
Fourth round of MAKER
1. Prepare maker_opts.ctl file with updated parameters to re-train SNAP and include Augsutus chicken model, and filter AED values (maker_gff=merged gff file from round3, snaphmm=hmm file created in step 7, AED_threshold=0.5)
2. Run same job script files used in previous rounds of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 15 hours - with 5 chunks running simultaneously)
Merge outputs from 4th round of maker (run time ~few min)
Run InterProScan with the proteins fasta file from maker outputs. Two code scripts are listed here in this step, the .lsf file is for the TAMU HPRC where -goterms was not available, and the .slurm file was used on Compute Canada which included -goterms. (run time 7 hours). I continued next steps with interproscan output including -goterms.
Blast proteins fasta file with uniprot protein data (chicken, zebra finch, flycatcher) using BLAST+ (run time 7.5 hrs)
Prep files for integrating maker, interproscan, and blastp outputs (run time 5 min)
Add functional info to blastp outputs (run time 2 min)
Update interproscan outputs and additionally create gff file containing only genes (run time 3 min)

Name		Name	Last commit message	Last commit date
Latest commit History 206 Commits
annotation		annotation
assembly		assembly
PUMA-logo.JPG		PUMA-logo.JPG
README.md		README.md
purple_scafs_ordered_by_chicken.csv		purple_scafs_ordered_by_chicken.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

annotation

annotation

assembly

assembly

PUMA-logo.JPG

PUMA-logo.JPG

README.md

README.md

purple_scafs_ordered_by_chicken.csv

purple_scafs_ordered_by_chicken.csv

Repository files navigation

Improving draft genome assembly

Chromosomal synteny

Genome annotation

About

Releases

Packages

Languages

edegreef/PUMA-reference-genome

Folders and files

Latest commit

History

Repository files navigation

Improving draft genome assembly

Chromosomal synteny

Genome annotation

About

Resources

Stars

Watchers

Forks

Languages