Skip to content

edegreef/PUMA-reference-genome

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Logo

This is a repository for scripts I used in improving and annotating the Purple Martin (Progne subis) draft genome assembly. These scripts were executed using the Texas A&M High Preformance Research Computing resources (https://hprc.tamu.edu/).

Improving draft genome assembly

Starting input files (with my file sizes as example)

  • Assembled reference genome (1 fasta file 1.16GB)
  • PacBio reads (5 bam files totalling 83GB)
  • Illumina reads (3 fastq files totalling 75GB)

The main steps in this assembly improvement are polishing the genome with Arrow and Pilon, and scaffolding it with ARKS+LINKS. There are some intermediate steps in between these programs to create necessary input files. In addition, I also checked for duplicate scaffolds, contaminants, and assessed genome quality. All steps are listed below (approximate run time in parentheses) with matching numbers to the script file names in the assembly folder. 📁

  1. Polish genome using Arrow, with PacBio subreads.bam files listed in input.fofn file (run time 2 days)
  2. Align illumina reads to the arrow-corrected genome using BWA (run time 1.5 days)
  3. Sort, index, and look at stats of aligned bam file with samtools (run time 2.5 hrs)
  4. Polish genome using Pilon (run time 12 hrs)
  5. Create interleaved linked reads file from illumina reads using LongRanger (run time ~2 days)
  6. Scaffold genome using the ARKS+LINKS pipline (run time 13 hrs)
  7. Identify and remove duplicate scaffolds with bbmap (run time ~2 min)
  8. Shorten scaffold header names and calculate scaffold lengths with sed, cat, & awk (run time ~2 min)
  9. Evaluate genome metrics with QUAST (run time 4 min)
  10. Assess genome assembly completeness with BUSCO (run time 3-4 days)
  11. Scan full genome for any contaminants (step ii. run time ~3.5 days, with full genome split over 5 jobs)
    1. Download taxonomy database from NCBI, splitting reference genome file into chunks, and making lists ready for BLAST step
    2. Use BLAST+ for each list of scaffolds, using script largely based from kdelmore
    3. Check outputs for non-eukaryota using grep

Chromosomal synteny

Synteny using chicken chromosomal information was done with the chicken genome accession# GCA_000002315.5. This was used to order and orient the purple martin scaffolds in chromosomal positions and was completed through the Satsuma Synteny program (example script with different species here: https://github.com/edegreef/NBW-resequencing/blob/main/reference_genome/03-satsuma.sh). The data file containing the output information ("purple_scafs_ordered_by_chicken.csv") is uploaded in this repository (https://github.com/edegreef/PUMA-reference-genome/blob/master/purple_scafs_ordered_by_chicken.csv)

Genome annotation

This annotation pipeline uses MAKER, running multiple rounds and using programs such as repeatmasker, exonerate, snap, and augustus. All steps are listed below (approximate run time in parentheses) with matching numbers to the script file names in the annotation folder. 📁 Note: Normally people use both RNA & protein data for annotation, however, I am only using protein evidence. These following instructions are using protein evidence only (and for bird data).

  1. Prepare input files for running the first round of MAKER, including optionally splitting genome file into multiple chunks (to save time), downloading model data from ensembl (chicken, zebra finch, flycatcher), and obtaining maker control files
  2. First round of MAKER
    1. Prepare maker_opts.ctl file with adjusted parameters to run RepeatMasker and Exonerate, and a few other parameter changes (protein=protein fasta files, model_org=aves, protein2genome=1, cpus=20, min_contig=5000). I added the "02-" in the file name to organize opts file for this step since we will be editing and using this file again later. Should remove the "02-" in the name when running maker.
    2. Run maker_run.lsf. This script is largely based off from TAMU's GCATemplates, creating a temporary directory for the large number of files MAKER will create. I ran 5 of these scripts (maker_run0.lsf, maker_run1.lsf, maker_run2.lsf....), adjusted for each genome chunk created in Step 1 so I can still use the same maker_opts.ctl file for each job. (run time 2 days - with 5 genome chunks running simultaneously (total 1.14GB))
  3. Create HMM model using .gff files from maker's outputs (run time ~10 min)
  4. Second round of MAKER
    1. Prepare maker_opts.ctl file with updated parameters to train SNAP (maker_gff=merged gff file from round1, protein_pass=1, rm_pass=1, snaphmm=hmm file created in step 3, protein=#remove, model_org=#remove, repeat_protein=#remove, protein2genome=0, pred_stats=1)
    2. Run same job script files used in first round of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 4 hours - with 5 chunks running simultaneously)
  5. Create 2nd HMM model using .gff files from 2nd round of maker (run time ~10 min)
  6. Third round of MAKER
    1. Prepare maker_opts.ctl file with updated parameters to re-train SNAP and include Augustus chicken model (maker_gff=merged gff file from round2, snaphmm=hmm file created in step 5, augustus_species=chicken)
    2. Run same job script files used in previous rounds of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 15 hours - with 5 chunks running simultaneously)
  7. Create 3rd HMM model using .gff files from 3rd round of maker (run time ~10 min)
  8. Fourth round of MAKER
    1. Prepare maker_opts.ctl file with updated parameters to re-train SNAP and include Augsutus chicken model, and filter AED values (maker_gff=merged gff file from round3, snaphmm=hmm file created in step 7, AED_threshold=0.5)
    2. Run same job script files used in previous rounds of maker: maker_run0.lsf, maker_run1.lsf, maker_run2.lsf... (run time 15 hours - with 5 chunks running simultaneously)
  9. Merge outputs from 4th round of maker (run time ~few min)
  10. Run InterProScan with the proteins fasta file from maker outputs. Two code scripts are listed here in this step, the .lsf file is for the TAMU HPRC where -goterms was not available, and the .slurm file was used on Compute Canada which included -goterms. (run time 7 hours). I continued next steps with interproscan output including -goterms.
  11. Blast proteins fasta file with uniprot protein data (chicken, zebra finch, flycatcher) using BLAST+ (run time 7.5 hrs)
  12. Prep files for integrating maker, interproscan, and blastp outputs (run time 5 min)
  13. Add functional info to blastp outputs (run time 2 min)
  14. Update interproscan outputs and additionally create gff file containing only genes (run time 3 min)

About

Improving the Purple Martin reference genome assembly with Arrow, Pilon, and ARKS, and then annotating with MAKER

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published