CCGP Repository for the genome assembly working group.

ccgproject/ccgp_assembly

California Conservation Genomics Project (CCGP)

California Conservation Genomics Project (CCGP) repository for the genome assembly working group.

Content

Overview

This repository contains scripts used for the reference genome assembly efforts of the CCGP.

CCGP reference genomes are assembled following a protocol adapted from Rhie et al. (2021). Assemblies are built from PacBio HiFi long-read data and scaffolded using proximity ligation/chromatin conformation capture data (HiC or OmniC; Dovetail Genomics). Our minimum target reference genome quality is 6.7.Q40, and in most cases we expect to reach 7.C.Q50 or better (see Table 1 in Rhie et al. (2021)).
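To make the quality targets above concrete, here is a minimal Python sketch that decodes codes such as 6.7.Q40. It assumes the usual reading of this notation: the first field is the log10 of the contig NG50, the second is the log10 of the scaffold NG50 (or C for chromosome-scale scaffolds), and Q is the Phred-scaled consensus quality. The function name and the dictionary keys are hypothetical and are not part of this repository.

```python
import re


def parse_quality_target(code: str) -> dict:
    """Decode a quality target such as '6.7.Q40' or '7.C.Q50'.

    Assumed meaning of the fields (hypothetical helper, not a CCGP script):
      1st field: log10 of the minimum contig NG50 in bp
      2nd field: log10 of the minimum scaffold NG50 in bp, or 'C' for
                 chromosome-scale scaffolds
      Q field:   minimum Phred-scaled consensus quality value (QV)
    """
    match = re.fullmatch(r"(\d+)\.(\d+|C)\.Q(\d+)", code)
    if match is None:
        raise ValueError(f"unrecognized quality code: {code!r}")
    contig_exp, scaffold, qv = match.groups()
    return {
        "min_contig_ng50_bp": 10 ** int(contig_exp),
        "min_scaffold_ng50_bp": None if scaffold == "C" else 10 ** int(scaffold),
        "chromosome_scale": scaffold == "C",
        "min_qv": int(qv),
        # Phred scale: QV = -10 * log10(error rate)
        "max_error_rate": 10 ** (-int(qv) / 10),
    }


if __name__ == "__main__":
    for target in ("6.7.Q40", "7.C.Q50"):
        print(target, parse_quality_target(target))
```

Under these assumptions, 6.7.Q40 corresponds to a contig NG50 of at least 1 Mb, a scaffold NG50 of at least 10 Mb, and at most one error per 10,000 bases.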

Here is an overview of our current pipeline:

CCGP: Overview of our current pipeline

Pipeline overview

The pipeline has gone through multiple versions since the beginning of the project. This is an overview of how it has evolved.

CCGP: Evolution of the assembly pipeline

Color blocks:

  • Yellow: Sequencing data types
  • Dark gray: Fixed processes
  • Light gray: Optional processes
  • Blue: Iterative step

Workflows

  • PacBio HiFi
    • PacBio Adapter filtering
    • K-mer counting with meryl
    • Genome size, heterozygosity and repeat content estimation
    • Coverage validation (calculation of expected coverage given the sequencing data)
  • HiC/OmniC
    • Library QC with Dovetail Genomics tools
  • Contig assembly with HiFiasm
    • Depending on the available datasets and the ploidy, we run HiFiasm in either single or HiC mode.
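The coverage validation step listed above can be sketched as a simple calculation: expected coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes the genome size comes from the k-mer based estimate and that read lengths are already available as integers; the function name is hypothetical and does not correspond to a script in this repository.

```python
from typing import Iterable


def expected_coverage(read_lengths: Iterable[int], genome_size_bp: int) -> float:
    """Return the expected sequencing depth (x-fold coverage).

    read_lengths:   length in bp of every read in the dataset
    genome_size_bp: estimated haploid genome size in bp (assumed here to
                    come from the k-mer based estimate)
    """
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of bases")
    total_bases = sum(read_lengths)
    return total_bases / genome_size_bp


if __name__ == "__main__":
    # Toy example: 2,000,000 reads of 15 kb each against a 1 Gb genome.
    lengths = [15_000] * 2_000_000
    print(f"{expected_coverage(lengths, 1_000_000_000):.1f}x")
```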

Purge haplotigs: haplotypic duplications and contig overlaps

  • Alignment of HiFi data with minimap2 and purging with purge_dups
  • Alignments with Arima Genomics Mapping Pipeline
  • Scaffolding with SALSA
  • Generation and visualization of contact maps
    • HiGlass
    • Generation of tracks
      • HiFi coverage
      • HiC/OmniC coverage
      • Genome assembly mappability
      • Gap description
    • PretextSuite

Gap closing

  • Using YAGCloser, which closes gaps based on long reads that span them
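To illustrate what "gap spanning" means, the following Python sketch selects the read alignments that cover a gap together with some flanking sequence on both sides. It is not YAGCloser's implementation: the function name, the half-open coordinate convention, and the `min_flank` parameter are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the scaffold, half-open, in bp


def spanning_reads(gap: Interval, alignments: List[Interval],
                   min_flank: int = 1000) -> List[Interval]:
    """Return the alignments that span the gap with enough flank on each side.

    An alignment spans the gap when it starts at least `min_flank` bp before
    the gap start and ends at least `min_flank` bp after the gap end.
    `min_flank` is a hypothetical threshold chosen for illustration.
    """
    gap_start, gap_end = gap
    return [
        (start, end)
        for start, end in alignments
        if start <= gap_start - min_flank and end >= gap_end + min_flank
    ]


if __name__ == "__main__":
    gap = (2000, 2500)
    reads = [(0, 5000), (1500, 2600), (500, 4000)]
    # The second read starts too close to the gap to count as spanning.
    print(spanning_reads(gap, reads))
```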

Mitochondrial assembly

  • Mitogenome assembly pipeline or MitoHiFi

Contamination screening

  • Organelle filtering from nuclear assemblies
  • Contamination screening with Blobtools
  • Contiguity metrics (contig and scaffold N50)
  • BUSCO scores
  • Per-base quality / k-mer completeness
  • Frameshift errors
  • Gap description
  • Genome mappability
  • Mapping quality
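As an illustration of the contiguity metrics listed above, here is a minimal Python sketch of the standard N50 definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name is hypothetical, and the sketch takes a plain list of contig or scaffold lengths rather than reading an assembly file.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths (bp)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to at
        # least half of the total assembly length.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The same function gives contig N50 or scaffold N50 depending on which set of lengths is passed in.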

Versioning

Learn more

  • For further information about our project and efforts, please visit the CCGP website.
  • For more information about the project, see also:

Shaffer HB, Toffelmier E, Corbett-Detig RB, Escalona M, Erickson B, Fiedler P, Gold M, Harrigan RJ, Hodges S, Luckau TK, Miller C, Oliveira DR, Shaffer KE, Shapiro B, Sork VL, Wang IJ (2022) Landscape genomics to enable conservation actions: the California Conservation Genomics Project. Journal of Heredity, 113 (6): 577–588, https://doi.org/10.1093/jhered/esac020

References
