hic-scaffolding-nf

Nextflow pipeline for scaffolding genome assemblies with Hi-C reads

Introduction

This pipeline requires the following inputs:

  1. A fasta file containing assembled contigs (--contigs)
  2. Hi-C reads in paired-end fastq(.gz) format (--r1Reads and --r2Reads)

It then performs the following tasks:

  1. Aligns the Hi-C reads to the contigs using chromap
  2. Scaffolds the contigs using yahs
  3. Prepares all the files you need to do manual curation in Juicebox

and produces the following outputs:

  • Alignments in bam format (out/chromap/aligned.bam)
  • A scaffolded assembly in both agp and fasta formats (out/scaffolds/yahs.out_scaffolds_final.[agp,fa])
  • .hic and .assembly files for loading in Juicebox Assembly Tools (out/juicebox_input/out_JBAT.[hic,assembly])
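After a run finishes, it can be handy to confirm that every output listed above was actually written. The following is a minimal sketch, not part of the pipeline: `missing_outputs` is a hypothetical helper name, and it assumes the output directory is `out` (as in the paths above) unless you pass another one.

```shell
#!/usr/bin/env bash
# Print every expected pipeline output that is absent or empty.
# The relative paths are the ones listed in this README; the helper
# itself is an illustration, not something the pipeline provides.
missing_outputs() {
    local outdir="${1:-out}"
    local f
    for f in \
        chromap/aligned.bam \
        scaffolds/yahs.out_scaffolds_final.agp \
        scaffolds/yahs.out_scaffolds_final.fa \
        juicebox_input/out_JBAT.hic \
        juicebox_input/out_JBAT.assembly
    do
        # -s is true only for files that exist and are non-empty
        [ -s "$outdir/$f" ] || echo "$outdir/$f"
    done
}
```

Calling `missing_outputs out` prints nothing when all five files are present.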

Configuration

Running on Lewis

If you're running this on the Lewis cluster, I've already got a profile set up with everything you need, so just add -profile lewis to the command and you're good to go.

Running on another cluster/cloud/locally

This pipeline has the following dependencies:

  • nextflow
  • chromap
  • yahs
  • JuicerTools

Nextflow must be in your path. You can have nextflow create a conda environment containing chromap and yahs for you with -profile conda (note: one dash). JuicerTools is distributed as a jar file, so you need to tell the pipeline where it is by adding the argument --juicer-tools-jar /path/to/jar (note: two dashes). You can also put these settings in a config file called nextflow.config in the directory from which you run the pipeline (see the nextflow documentation).

Running

nextflow run WarrenLab/hic-scaffolding-nf \
    --contigs contigs.fa \
    --r1Reads hic_reads_R1.fastq.gz \
    --r2Reads hic_reads_R2.fastq.gz

N.B. WarrenLab/hic-scaffolding-nf is the name of this GitHub repository, not a local path on your machine. You do not need to download any files from this repository; just tell nextflow to run WarrenLab/hic-scaffolding-nf and it will download the pipeline for you.

You'll need to add a couple of options depending on your configuration (see the Configuration section above).

If you want to pass a restriction enzyme cut site to YAHS, you can add, e.g.,

--extra-yahs-args "-e GATC"
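Putting the options from this section and the Configuration section together, a run outside Lewis might look like the sketch below. The `preflight` function is a hypothetical helper (not part of the pipeline) that fails early if an input is unreadable. The file names and /path/to/jar are the placeholders used in this README, and the nextflow call is left commented out so you can adapt it first.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: report every argument that is not a
# readable file and return non-zero if any are missing.
preflight() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example invocation using only options described in this README:
# preflight contigs.fa hic_reads_R1.fastq.gz hic_reads_R2.fastq.gz /path/to/jar &&
# nextflow run WarrenLab/hic-scaffolding-nf \
#     -profile conda \
#     --juicer-tools-jar /path/to/jar \
#     --contigs contigs.fa \
#     --r1Reads hic_reads_R1.fastq.gz \
#     --r2Reads hic_reads_R2.fastq.gz \
#     --extra-yahs-args "-e GATC"
```

Checking the inputs first avoids launching the workflow only to have it stop on a mistyped path.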

Curation

In addition to the scaffolded assembly, this pipeline creates files you can use to manually curate the assembly in Juicebox Assembly Tools. When you are done with the curation, follow these instructions from the YAHS documentation:

Once completed editing, there should be a file named something like out_JBAT.review.assembly generated by Juicebox, which can be fed into juicer post command to generate AGP and FASTA files for the final genome assembly. You also need the out_JBAT.liftover.agp coordinate file previously generated with juicer pre command.

juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa

This will end up with two files out_JBAT.FINAL.agp and out_JBAT.FINAL.fa. Together with hic-to-contigs.bin or the original BED/BAM file, you can regenerate a HiC contact map for the final assembly as described in the previous section.
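The post-curation step above can be wrapped in a small function that checks its inputs before calling juicer post. This is only a sketch: `finalize_curation` is a hypothetical name, and it assumes the `juicer` tool that ships with YAHS is on your path. Where the three input files live depends on your run, so they are passed as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the "juicer post" command quoted above.
# Arguments: review .assembly file, liftover .agp file, contigs fasta,
# and an optional output prefix (defaults to out_JBAT as in the quote).
finalize_curation() {
    local review="$1" liftover="$2" contigs="$3" prefix="${4:-out_JBAT}"
    local f
    for f in "$review" "$liftover" "$contigs"; do
        # Stop before running juicer if any input cannot be read
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    done
    juicer post -o "$prefix" "$review" "$liftover" "$contigs"
}
```

For example, `finalize_curation out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa` runs the same command as shown in the YAHS instructions.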
