# Genome Assembly

## Introduction
Genome assembly is the process of taking a large number of fragments of DNA (sequencing reads) and putting them back together to create a representation of the original DNA sequence from which they originated.

Many genomes contain large numbers of repeat sequences. Often these repeats are thousands of nucleotides long, and some occur in many different locations in the genome. This makes genome assembly a very difficult computational problem. Nevertheless, many genome assembly tools can produce long contiguous sequences (contigs) from sequencing reads. The choice of assembly tool depends on several factors, chiefly the length of the sequencing reads and the sequencing technology used to produce them.

In this practical we will assemble one chromosome of a malaria parasite: Plasmodium falciparum, the IT clone. We have sequenced the genome with both PacBio and Illumina and pre-filtered the reads to select only those reads from a single chromosome.

## Learning outcomes
On completion of the tutorial, you can expect to be able to:

* Describe the different approaches to genome assembly
* Generate a genome assembly from Illumina data
* Generate a genome assembly from PacBio data
* Generate statistics to evaluate the quality of a genome assembly 

## Tutorial sections
This tutorial comprises the following sections:   
 1. [PacBio genome assembly](pacbio_assembly.ipynb)   
 2. [Assembly algorithms](assembly_algorithms.ipynb)   
 3. [Illumina genome assembly](illumina_assembly.ipynb) 
 4. [Genome assembly estimation](assembly_estimation.ipynb)   
 5. [PacBio genome assembly again](pacbio_assembly_again.ipynb) 

## Authors
This tutorial was written by [Jacqui Keane](https://github.com/jacquikeane) based on material from [Shane McCarthy](https://github.com/mcshane) and Thomas Otto.

## Running the commands from this tutorial
You can follow this tutorial by typing all the commands you see into a terminal window. This is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, open a new terminal window and type the command below:

In [1]:
cd ~/course_data/assembly/data

## Let’s get started!
This tutorial requires that you have canu, jellyfish, velvet, assembly-stats and wtdbg2 installed on your computer. These are already installed on the virtual machine you are using. To check that they are installed, run the following commands:

In [None]:
canu

In [None]:
jellyfish

In [2]:
velvetg

In [None]:
velveth

In [None]:
assembly-stats

In [None]:
wtdbg2

Each command should return the help message for the corresponding software: canu, jellyfish, velvet (velvetg and velveth), assembly-stats and wtdbg2 respectively.
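If you would rather check all of the tools in one go, the individual commands above can be replaced with a small shell function. This is only a sketch: the function name `check_tools` is illustrative and not part of the course material, and it only tests whether each command is on your `PATH`, not whether it runs correctly.

```shell
# Report whether each required command can be found on the PATH.
# check_tools is a hypothetical helper name used for illustration only.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found
    return "$missing"
}

# Check the tools used in this tutorial
check_tools canu jellyfish velvetg velveth assembly-stats wtdbg2 \
    || echo "One or more tools were not found on the PATH."
```

Any tool reported as `MISSING` needs to be installed before you continue.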

If you would like to download and install this software after the course, the instructions can be found at the links below. Alternatively, we recommend [bioconda](https://bioconda.github.io/) for installing and managing your bioinformatics software.

* The [canu](https://canu.readthedocs.io/en/latest/) website
* The [jellyfish](https://github.com/gmarcais/Jellyfish) GitHub page
* The [velvet](https://www.ebi.ac.uk/~zerbino/velvet/) website
* The [wtdbg2](https://github.com/ruanjue/wtdbg2) GitHub page

To get started with the tutorial, go to the first section: [PacBio genome assembly](pacbio_assembly.ipynb)