Skip to content
Permalink
2021-05-01
Switch branches/tags
Go to file
 
 
Cannot retrieve contributors at this time
layout title zenodo_link level tags questions objectives follow_up_training time_estimation key_points contributors
tutorial_hands_on
Unicycler Assembly
Introductory
prokaryote
I have short reads and long reads. How do I assemble a genome?
Perform Quality Control on your reads
Perform a Small genome Assembly with Unicycler
Evaluate the Quality of the Assembly with Quast
Annotate the assembly with Prokka
type topic_name tutorials
internal
assembly
ecoli_comparison
4h
We learned about the strategies used by assemblers for hybrid assemblies
We performed an hybrid assembly of a bacterial genome and its annotation
Unicycler is a pipeline bases on Spades and Pilon dedicated to hybrid assembly of Small genomes
Combination of short and long reads helped us produce an almost perfect assembly
nekrut
delphine-l
slugger70

The goal: E. coli C-1 assembly

In this tutorial we assemble and annotate the genome of E. coli strain C-1. This strain is routinely used in experimental evolution studies involving bacteriophages. For instance, now classic works by Holly Wichman and Jim Bull ({% cite Bull1997 %}, {% cite Bull1998 %}, {% cite Wichman1999 %}) have been performed using this strain and bacteriophage phiX174.

To sequence the genome we have obtained the strain from the Yale E. coli Stock Center. The stock center sent us a filter paper disk infused with cells. The disk was placed in the center of an LB-agar plate. A single colony was picked and resuspended in a liquid LB medium, grown overnight, and genomic DNA was isolated. The DNA was then sequenced using two methods. To obtain high coverage, high accuracy data we used Illumina miSEQ to generated 250-bp paired end reads. To generate high length reads we used the Oxford Nanopore MinION machine.

Our goal is to reconstruct and annotate the full genome of E. coli C-1. As you will see in this tutorial a combination of many short, high accuracy reads with long, error-prone reads helps us produce an almost perfect assembly.

Outline step-by-step

In this tutorial, we will deal with:

  1. TOC {:toc}

{: .agenda}

Background on data and tools

The data

In this tutorial we will assemble a genome using two types of input data: (1) Illumina 250 bp paired-end reads and (2) Oxford Nanopore reads.

Illumina data

We generated 9,345,897 250 bp read pairs (library preparation performed on genomic DNA fragmented to mean size of 600 bp). However, to make sure that you can complete this tutorial in a finite amount of time we have downsampled (reduced in size) to 1,000,000 paired end reads - just enough to produce an accurate assembly.

Oxford Nanopore Data

There are 12,738 2d-reads. Maximum read length is 27,518 bp. The distribution of reads lengths looks like this:

Nanopore read length distribution

You can see that there many reads under the second peak with median of approximately 7.5 kb.

{% icon warning %} Oxford Nanopore Data Format

Oxford Nanopore machines output data in fast5 format that contains additional information besides sequence data. In this tutorial we assume that this data is already converted into fastq. An additional tutorial dedicated to handling fast5 datasets will be developed shortly. {: .warning}

The tools

In this analysis we will perform two tasks: (1) assembly and (2) annotation. Below we will briefly outline the main ideas behind these two procedures and will describe the tools we will be using.

Assembly

{% icon comment %} Knowing your assembly

Here we assume that you know a thing or two about assembly process. If you don't: look at the slides accompanying this tutorial as well as other tutorials is this section. {: .comment}

Logo unicycler

For assembly we will be using Unicycler (also see publication {% cite Wick2017 %}). Unicycler is designed specifically for hybrid assembly (that is, using both short- and long-read sequencing data) of small (e.g., bacterial, viral, organellar) genomes. In our hands it has produced complete high quality assemblies. Unicycler employs a multi-step process that utilizes a number of software tools:

Unicycler process

As you can see Unicycler relies heavily on SPAdes ({% cite Bankevich2012 %}) and Pilon. We will briefly describe these two tools.

Spades

Multisized deBruijn graph

Assemblers usually construct graphs for k-mers of a fixed size. We have noted that when k is small it is difficult to resolve the repeats. If k is too large a corresponding graph may be fragmented (especially if read coverage is low). SPAdes uses several values for k (that are either manually set or inferred automatically) to create a multisized graph that minimized tangledness and fragmentation by combining various k-mers ({% cite Bankevich2012 %})):

Multigraph approach implemented in SPAdes

Read pair utilization

While the use of paired reads and mate pairs is key to genome assembly, and not new, SPAdes utilizes so called paired DeBruin graphs to take the advantage of the paired end data. One of the key issues with paired DeBruin graphs is that the resulting genome assemblies do not tolerate variability in insert sizes: The initial formulation of paired DeBruijn graphs assumed constant distance between pairs of reads. In practice this distance is always variable. SPAdes performs k-bimer (these are k-mers derived from paired reads) adjustment to identify exact or nearly-exact distances for each k-bimer pair.

Error correction

Sequencing data contains a substantial number of sequencing errors that manifest themselves as deviations (bulges and non-connected components) within the assembly graph. One way to improve the graph before assembly it is to minimize the number of sequencing errors by performing error correction. SPAdes uses BayesHammer ({% cite Nikolenko2013 %}) to correct the reads. Here is a brief summary of what it does:

  1. SPAdes (or rather BayesHammer) counts k-mers in reads and computes k-mer statistics that take into account base quality values.
  2. A Hamming graph is constructed in which k-mers are nodes. In this graph edges connect nodes (k-mers) if they differ from each other by a number of nucleotides up to a certain threshold (the Hamming distance). The graph is central to the error correction algorithm.
  3. Then Bayesian subclustering is done on the graph from the previous step. For each k-mer we now know the center of its subcluster.
  4. Solid k-mers are derived from cluster centers and are assumed to be error free.
  5. Solid k-mers are mapped back to the reads.
  6. Reads are corrected using solid k-mers:

Read correction with BayesHammer

In the case of the full dataset, SPAdes error correction changed 14,013,757 bases in 3,382,337 reads - a substantial fraction of the full ~18 million read dataset.

Pilon

Pilon improves draft assemblies by using the information from the original reads aligned to the draft assembly. The following image from a publication by {% cite Walker2014 %} highlights the steps of this process:

Pilon workflow

Annotation

For annotation we are using Prokka (also see {% cite Seemann2014 %}). It scans the assembly generated with Unicycler with a set of feature prediction tools and compiles a list of genome annotation. It predicts the following features (Table from {% cite Seemann2014 %}):

Feature Tool used by Prokka
Protein-coding sequences (CDS) Prodigal
Ribosomal RNA genes RNAmmer
Transfer RNA genes Aragorn
Signal leader peptides SignalP
Non-coding RNA genes Infernal

Prokka predicts protein-coding regions using a two step process. It first identifies coordinates of putative genes using Prodigal and then compares the gene sequence against databases of known sequences at protein level using Blast+ and HMMer.

Let's try it

Load data and assess quality

In this example we will use a downsampled version of E. coli C-1 Illumina and ONT sequencing data. These include 3 files: forward and reverse reads for Illumina, and Long read file produced by ONT. All data are in fastq format.

{% icon hands_on %} Hands-on: Obtaining our data

  1. Make sure you have an empty analysis history. Give it a name.

    {% snippet faqs/galaxy/histories_create_new.md %}

  2. Import the following file from Zenodo

    https://zenodo.org/record/940733/files/illumina_f.fq
    https://zenodo.org/record/940733/files/illumina_r.fq
    https://zenodo.org/record/940733/files/minion_2d.fq
    

    {% snippet faqs/galaxy/datasets_import_via_link.md %} {% snippet faqs/galaxy/datasets_import_from_data_library.md %}

{: .hands_on}

If all goes well you will see datasets uploading and changing states from gray to green as shown below. The figure below also shows how datasets can be tagged.

Datasets in History

Assess Read Quality

To assess quality we will use two tools: FastQC ({% cite FastQC %}) to generate quality statistics and multiQC ({% cite Ewels2016 %}) to summarize these statistics.

{% icon hands_on %} Hands-on: Quality Control

  1. FastQC {% icon tool %}:

    • {% icon param-files %} "Short read data from your current history": Select all three FastQ datasets simultaneously
  2. MultiQC {% icon tool %}: to generate a summary of the FastQC reports with

  • "Which tool was used generate logs?": FastQC
  • "Type of FastQC output": Raw data
  • {% icon param-files %} "FastQC Output": RawData outputs of FastQC

{: .hands_on}

A quick look at quality score distribution will show a confusing picture:

QC reported zoomed out

So let's zoom in into Illumina data:

QC reported zoomed in

Assembly with Unicycler

Now it is time to perform assembly.

{% icon hands_on %} Hands-on: Unicycler Assembly

  1. Unicycler {% icon tool %} with the following parameters :
  • "Paired or Single end data?" to Paired
  • "First Set of reads" to the forward reads file f
  • "Second Set of reads" to the reverse reads file r
  • "Long reads" to the MinION file
  • Use default parameters

{: .hands_on}

Assembly takes time!

There is no such thing as Assembly in real time. It takes time so it is a good time to have lunch or at least coffee. This Unicycler run will take anywhere between 90 minutes and two hours. {: .warning}

Assess Assembly quality with Quast

Quast ({% cite Gurevich2013 %}) is a tool providing quality metrics for assemblies, and can also be used to compare multiple assemblies. The tool can also take an optional reference file as input, and will provide complementary metrics.

{% icon hands_on %} Hands-on: Assembly Quality

  1. Quast {% icon tool %}: with the following parameters
  • "Contigs/scaffolds output file": Select the fasta file resulting from the Unicycler assembly.

{: .hands_on}

The Quast tool outputs assembly metrics as an html file with metrics and graphs. The image below looks exceptionally boring. This is a good thing:

Quast Interface

One can see that there are two (!) contigs. The largest contig is 4,576,290 bp (for comparison E. coli K12 MG1655 strain genome length is 4,656,144 bp) and the smallest is 4,581,676 (total length) - 4,576,290 (length of the largest) = 5,386 bp. When we analyzed this dataset for the first time we were initially puzzled by this second contig. But we quickly realized that this is simply the genome of bacteriophage phiX174 which is routinely used as a spike-in in Illumina sequencing. Thus we have two genomes: the one of E.coli C-1 and phiX174! We can now use Prokka to annotate our two genomes.

Annotation with Prokka

{% icon hands_on %} Hands-on: Annotation

  1. Prokka {% icon tool %}:
  • "Contigs to annotate": Select the assembly ouput of Unicycler
  • "Genus name": Escherichia
  • "Species name": coli
  • "Strain name": C-1
  • "Use genus-specific BLAST database": yes

{: .hands_on}

Prokka outputs 10 datasets (including two log files). These are in various formats:

  • txt : Provides Statistics on the annotation : number of CDS predicted, number of rRNA etc.
  • tbl : Provides a tabulated list of annotated features.
  • fsa : Nucleotide fasta file of the input contig sequence.
  • sqn : ASN.1 format file for submission to GenBank.
  • ffn : Nucleotide FASTA file of all the prediction transcripts.
  • faa : Protein FASTA file of the translated CDS sequences.
  • fna : Nucleotide fasta file of the input contig sequence.
  • gbk : GenBank file.
  • gff : gff3 file.

Visualize the results in IGV

Let's look at the entire assembly and its annotation in the genome browser. We can do this using Integrated Genome Browser (IGV).

Visualization requires a local installation of IGV. If you have IGV installed - just start it. If you don't - read on.

Starting IGV

Go to IGV download page and select one of the options. The one I would try first would be Java Web Start. Simply click the Launch button for 10 GB distribution.

{% icon hands_on %} Hands-on: Visualization in IGV

  1. Start IGV. It will look something like this:

    IGV just started

  2. Locate the output of Unicycler and expand it :

    Unicycler result Visualization

  3. Click on the local link highlighted with orange outline. The browser will change:

    IGV with Unicycler assembly

  4. Let's add Prokka annotations to the browser image. For this simply expand Prokka's GFF3 dataset and click on the local link:

    Expanded GFF3 dataset representing Prokka annotations

{: .hands_on}

You will now see the annotations within the browser window:

Prokka result Visualization