# Workshop 6 (Week beginning May 4)
# *De novo* assembly

This workshop focusses on *de novo* genome assembly. Specifically, you will be assembling the genome of SARS-CoV-2 (the virus that causes COVID-19) from real samples. Make sure you have watched the *de novo* assembly lecture before attempting this workshop.

**Forewarning:**
As we are using real data (not smaller demo datasets), the computation will take several hours (at least). It might be a good idea to time your work on this workshop so the workflow can run overnight. The workflow won't be run on your personal device, so you don't need to worry about computing resources or keeping your device awake while it is working. Long run-times are a reality of computational genomics!


## Part 1: Introduction to Galaxy

If you have been introduced to Galaxy before (perhaps in another subject) and you have an account for Galaxy Australia, move on to Part 2.

Galaxy is a web-based bioinformatics analysis and workflow platform. It has a suite of popular bioinformatics tools installed and ready to use on genomics data. There are several Galaxy servers around the world, each with specific toolsets and reference data for various analyses.

We are using Galaxy for this workshop as it has a sophisticated job-scheduling system that shares computing resources fairly for all users.

We will be using Galaxy Australia (link below).

> https://usegalaxy.org.au/

First, create an account by selecting "Login or Register" from the toolbar at the top of the page.

Once you are logged in you will return to the main landing page. It has 3 sections:
1. On the left-hand side is a library of tools, some of which we have already used in this subject.
2. On the right-hand side is your analysis history. This is where you will see the inputs and outputs of workflow steps.
3. The centre section is where you will visualise your data, configure tools and edit workflows (among other things).

If you'd like to explore Galaxy further, try this introduction written by Simon Gladman (Melbourne Bioinformatics).

> https://www.melbournebioinformatics.org.au/tutorials/tutorials/galaxy_101/galaxy_101/

## Part 2: Get some data

Create a new history in the panel on the right-hand side by using the plus button at the top. Select the title "Unknown History" and rename it "SARS-CoV-2 assembly" (press enter/return to confirm the change).

The data we will be using is hosted on Galaxy Australia.

Select "shared data" from the toolbar at the top and pick "data libraries" from the drop-down menu.

Use the search bar to search for and select "COVID-19 Data".

Use the checkbox to select "Assembly input data". Select "Export to History" in the toolbar at the top and pick "as datasets" from the drop-down menu.

Return to the main landing page by clicking the green-tick pop-over box or by selecting "Analyze data" from the toolbar at the top of the page.

You will now see three files in your history (on the right-hand side). All three are FASTQ files: two contain paired-end Illumina reads (forward F and reverse R) and the third contains Nanopore reads. You can view the contents of each file with the eye symbol.
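
If you download one of these files, a short script can summarise it. The sketch below assumes the common four-line FASTQ layout (header, sequence, `+` separator, quality string). The function names `read_fastq` and `summarise` are made up for illustration, and the demo reads are invented, not taken from the workshop data.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        # Keep only the first word of the header as the read ID
        read_id = (header[1:].split() or [""])[0]
        yield read_id, sequence, quality


def summarise(handle: TextIO) -> Dict[str, float]:
    """Count reads and bases, and report the mean and longest read length."""
    reads = total_bases = longest = 0
    for _read_id, sequence, _quality in read_fastq(handle):
        reads += 1
        total_bases += len(sequence)
        longest = max(longest, len(sequence))
    mean_length = total_bases / reads if reads else 0.0
    return {
        "reads": reads,
        "total_bases": total_bases,
        "mean_length": mean_length,
        "longest": longest,
    }


# Invented two-read example standing in for a real FASTQ file
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nACGTAC\n+\nIIIIII\n")
print(summarise(demo))
```

Running this on an Illumina file and on the Nanopore file should make the difference in read length between the two technologies easy to see. For a real file, replace the `io.StringIO` object with `open("your_file.fastq")`.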

**What is the advantage of using both Nanopore and Illumina reads for *de novo* assembly?**

The data were generated from bronchoalveolar lavage fluid (BALF) from COVID-19 patients. The raw data sequenced from these fluid samples contained SARS-CoV-2 reads as well as host (human) reads and reads from potential bacterial co-infections.

These FASTQ files have already been through quality control and pre-processing. Adapter sequences have been trimmed, human sequences have been removed and multiple samples have been merged into each file. The Nanopore FASTQ file consists of reads from 3 separate patients. Similarly, the Illumina FASTQ files contain reads from another 3 patients.

**How would you remove human reads from a raw read set that contains reads originating from different species? More on this in Workshop 8.**

## Part 3: Import a workflow

Download the Galaxy-Workflow-COVID-19__Assembly.ga file from the LMS. This file tells Galaxy which tools to run with which settings and in what order. Workflows allow users to chain together a set of tools once and use it on many datasets or share it with colleagues around the world.
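
A `.ga` file is plain JSON, so you can inspect it outside Galaxy. The sketch below assumes the usual layout of a Galaxy workflow export: a top-level `steps` object keyed by step number, where each step has `name`, `type` and `input_connections` fields. The `demo_workflow` dictionary is a made-up miniature, not the contents of the workshop file, and `list_workflow_steps` is a hypothetical helper name.

```python
import json
from typing import List, Tuple


def list_workflow_steps(workflow: dict) -> List[Tuple[int, str, str, List[int]]]:
    """Return (step id, type, name, upstream step ids) for each workflow step."""
    rows = []
    steps = workflow.get("steps", {})
    for step_id, step in sorted(steps.items(), key=lambda item: int(item[0])):
        upstream = set()
        for connection in step.get("input_connections", {}).values():
            # A connection can be a single object or a list of objects
            links = connection if isinstance(connection, list) else [connection]
            for link in links:
                upstream.add(int(link["id"]))
        rows.append((int(step_id), step.get("type"), step.get("name"), sorted(upstream)))
    return rows


# Made-up miniature workflow: two inputs feeding one tool step
demo_workflow = json.loads("""
{
  "name": "Demo workflow",
  "steps": {
    "0": {"name": "Input dataset", "type": "data_input", "input_connections": {}},
    "1": {"name": "Input dataset", "type": "data_input", "input_connections": {}},
    "2": {"name": "Example assembler", "type": "tool",
          "input_connections": {
            "forward": {"id": 0, "output_name": "output"},
            "reverse": {"id": 1, "output_name": "output"}}}
  }
}
""")

for step_id, step_type, name, upstream in list_workflow_steps(demo_workflow):
    print(step_id, step_type, name, "<-", upstream)

# To inspect the downloaded file instead (path is an assumption):
# with open("Galaxy-Workflow-COVID-19__Assembly.ga") as fh:
#     print(list_workflow_steps(json.load(fh)))
```

The printed list mirrors what the workflow editor shows graphically: which steps are inputs, which are tools, and which steps feed into which.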

Navigate to the "Workflow" page from the toolbar at the top of the page, then import the workflow file you downloaded. Select the imported workflow and pick "edit" from the drop-down menu. This lets you view the steps in the workflow and how they connect with each other.

The leftmost boxes are the input FASTQ files (one file for Nanopore reads and two for Illumina reads, forward and reverse). The input reads are used by two different assembly tools, SPAdes and Unicycler. Each of these tools can use both long and short reads to produce an assembly. We will compare the results from each of these tools. The assemblies generated by SPAdes and Unicycler are then used as the input to Bandage (boxes on the right-hand side), a tool that provides statistics and visualisations for genome assemblies.

In each workflow step, the tool output is shown below the horizontal line. Click the * next to every output in the workflow; each one will turn green. This means the outputs will not be hidden in your history. Now you can select the save icon in the top right.

## Part 4: Run the workflow

Select the run (play button) icon. In the centre pane you will select the workflow inputs and options. From the drop-down menus, select the correct FASTQ files in the forward, reverse and ONT fields. None of the other settings need to be changed. Now you can hit the "run workflow" button!

The workflow steps will appear in your history on the right-hand side. If the box is grey, it is waiting for resources to be allocated to it. It may also be waiting for another step to finish. If a box is yellow, it's running. If it's green, it's done. 

The assembly steps will likely take several hours. You can refresh your history at the top of the history pane. The progress of the job will continue even if you close your browser tab. 

## Part 5: Analyse the results

Once all the jobs have finished, all of the files in your history will be green. 

Select the Unicycler final assembly graph (use the eye icon to view the data in the main pane). The contig sequences above a certain length are listed with their lengths and depths (you will need to scroll all the way to the right to see these). The Unicycler final assembly FASTA file contains only the contigs above a certain size.
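
Rather than scrolling, you can pull the lengths and depths out of the graph file with a script. The sketch below assumes the graph is in GFA format, where each contig is a tab-separated `S` (segment) line, and that depth is stored in a `dp:f:` tag. Other assemblers may use a different tag, so check your own file. `parse_gfa_segments` is a hypothetical helper and the demo lines are invented.

```python
from typing import Dict, Iterable, List, Optional, Union

Segment = Dict[str, Union[str, int, Optional[float]]]


def parse_gfa_segments(lines: Iterable[str]) -> List[Segment]:
    """Return segment name, length and depth, sorted from longest to shortest."""
    segments: List[Segment] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip header, link and path lines
        name, sequence = fields[1], fields[2]
        # Optional tags look like NAME:TYPE:VALUE, e.g. LN:i:10
        tags = {}
        for tag in fields[3:]:
            parts = tag.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        if sequence == "*":  # sequence omitted: fall back to the LN tag
            length = int(tags.get("LN", 0))
        else:
            length = len(sequence)
        depth = float(tags["dp"]) if "dp" in tags else None  # assumed tag name
        segments.append({"name": name, "length": length, "depth": depth})
    return sorted(segments, key=lambda seg: seg["length"], reverse=True)


# Invented miniature graph: two segments joined by one link
demo_gfa = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGTACGTAC\tLN:i:10\tdp:f:1.00\n",
    "S\t2\tACGT\tLN:i:4\tdp:f:2.50\n",
    "L\t1\t+\t2\t+\t0M\n",
]

for segment in parse_gfa_segments(demo_gfa):
    print(segment["name"], segment["length"], segment["depth"])
```

For a downloaded graph, pass an open file handle in place of `demo_gfa`.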

**What is the length of the longest Unicycler contig?**

View the SPAdes contigs FASTA file. 

**What is the length of the longest SPAdes contig?**
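
For FASTA files with many contigs, counting bases by eye is impractical. The sketch below shows one way to compute contig lengths from a downloaded FASTA file, assuming standard FASTA layout (a `>` header line followed by one or more sequence lines). The function names and the demo contigs are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each contig name (first word of its header) to its sequence length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            # Sequences may be wrapped over several lines, so accumulate
            lengths[name] += len(line)
    return lengths


def longest_contig(lengths: Dict[str, int]) -> Tuple[str, int]:
    """Return the (name, length) pair of the longest contig."""
    return max(lengths.items(), key=lambda item: item[1])


# Invented example with one wrapped contig and one short contig
demo_fasta = [
    ">contig_1 demo\n",
    "ACGTACGT\n",
    "ACGT\n",
    ">contig_2\n",
    "ACG\n",
]

lengths = fasta_lengths(demo_fasta)
print(lengths)
print(longest_contig(lengths))
```

The same function works for both the SPAdes and the Unicycler FASTA outputs, which makes it easy to compare the two assemblies side by side.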

**How long is the SARS-CoV-2 genome? Use an online resource to find out. Is it a similar length to the largest contigs found by Unicycler and SPAdes? How could you check that these contigs are the same sequence found in online sequence repositories?**

View the Bandage graph images for the SPAdes and Unicycler graphs.

**There are many short contigs in both assemblies. Suggest a reason for this.**

**Would more reads help create a more connected assembly? Can increasing the number of input reads ever be a bad thing for assembly?**

**Some contigs in the graph are tangled, with many short sequences (each with a different colour) connected. What is one potential cause of loops and tangles in assemblies?**

**If we took the largest, unbranched contig from each of these assemblies, could we say we have assembled the complete SARS-CoV-2 genome? What kind of virus is SARS-CoV-2 and what kind of genome does it have? Do our contigs match this?**

**Is an unconnected assembly useful? What is one potential use of this assembly workflow in the COVID-19 pandemic?**

## Part 6: Further reading

Now that you have assembled the SARS-CoV-2 genome, you might want to know what else you can do with COVID-19 data on Galaxy. The SARS-CoV-2 Galaxy project has lots of resources to help you extend this analysis. It's the source of the assembly workflow we used. Follow the link below.

> https://covid19.galaxyproject.org/

You might also want to learn more about the tools we used: SPAdes, Unicycler and Bandage. Unicycler and Bandage were both written by Ryan Wick, who works in Melbourne. You might like to try to work out the differences between the two assemblers (SPAdes and Unicycler) and why they produced different assemblies from the same input data.

> https://github.com/ablab/spades

> https://github.com/rrwick/Unicycler

> https://github.com/rrwick/Bandage

Workshop material developed by Steven Morgan. Thanks to Simon Gladman and the COVID-19 Galaxy project.