# Module 3: Genome assembly

Welcome to the notebook! There are some very important instructions for you to follow:

1.) Click on "File" in the top left corner and select "Save a copy in Drive".

**Your changes will not be saved if you do not do this step**

2.) Click on the name of the workbook in the top left corner and replace "Copy of" with your full name.

**You will be submitting the downloaded notebook file as your proof of completion for this module**


Please type:
```
print("Yes, I have done step 1")
print("Yes, I have done step 2")

```
into the code block below, then run it by clicking the triangle ("Play") icon on that block.



In [None]:
#this block will be checked


Yes, I have done step 1
Yes, I have done step 2


# Installing Conda
Conda is an open-source tool for installing and managing software tools and libraries. More information on the library used to install Conda on Google Colab is available at this [website](https://inside-machinelearning.com/en/how-to-install-use-conda-on-google-colab/).

Note: your runtime will restart and reconnect after running this cell. Colab may report that the runtime crashed; this is expected. Wait for the session to reconnect before continuing.


You can check out this repo for how this tool works:
https://github.com/conda-incubator/condacolab



In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:17
🔁 Restarting kernel...


In [None]:
!conda init

In [None]:
#Add any conda or software installs here
!conda install -c bioconda spades
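
Before moving on, it can help to confirm that the install actually put the tool on the `PATH`. The sketch below is one way to do that from Python; the helper name `missing_tools` is hypothetical and not part of Conda or SPAdes. The same helper could be reused later for other tools installed in this notebook.

```python
import shutil

def missing_tools(names):
    """Return the names that cannot be found as executables on PATH."""
    return [name for name in names if shutil.which(name) is None]

# spades.py is the executable installed by the cell above
not_found = missing_tools(["spades.py"])
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
else:
    print("All required tools were found")
```

If a tool is reported as not found, rerunning the install cell after the runtime has reconnected is a reasonable first thing to try.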

# Module Three : Part One - Short read assembly



#### Genome assembly
Now we will learn to assemble the sequence reads using the command line. This method is convenient when handling a large number of isolates. Many tools are available, such as SPAdes, Velvet and Shovill. Here, we will use SPAdes to assemble the sequence reads of the isolate ERR2093269.

# Retrieve data files for this practical
Colab launches a fresh virtual computing environment each time you start a notebook, so you will need to download the data for the following steps using the code blocks below.

### STEP 1: Download raw sequencing reads from the database

In [None]:
!mkdir short_read_assembly

In [None]:
%cd short_read_assembly/

In [None]:
!pwd

**Forward Reads**

In [None]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR209/009/ERR2093269/ERR2093269_1.fastq.gz

**Reverse reads**

In [None]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR209/009/ERR2093269/ERR2093269_2.fastq.gz
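
A quick sanity check after downloading paired-end data is that both files contain the same number of reads. The sketch below counts reads in each gzipped FASTQ file; it assumes the standard four-lines-per-record FASTQ layout, and the helper name `count_fastq_reads` is ours, not part of any tool used here.

```python
import gzip
import os

def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

# File names as downloaded by the two wget cells above
for fastq in ["ERR2093269_1.fastq.gz", "ERR2093269_2.fastq.gz"]:
    if os.path.exists(fastq):
        print(fastq, count_fastq_reads(fastq))
    else:
        print(fastq, "not found; run the download cells first")
```

If the two counts differ, one of the downloads was probably interrupted and is worth repeating.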

### STEP 2: *Check* options for SPAdes


In [None]:
!spades.py -h

### STEP 3: Run SPAdes from the command line
*Run the following command to start your assembly using SPAdes:*



In [None]:
!spades.py -o SPADES_OUT -1 ERR2093269_1.fastq.gz -2 ERR2093269_2.fastq.gz -t 20

In this command, the -o option sets the name of the output folder, -1 and -2 specify the read 1 (forward) and read 2 (reverse) files, and -t sets the number of threads. The process will take a while to run; once it has finished, all the output files will be in the SPADES_OUT folder.

#### NOTE: The assembly process takes a considerable amount of time to complete. To proceed with the rest of the practical, we have provided the results. Please stop the previous command for now and download the SPAdes output from the link below.
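
The command above asks for 20 threads, and requesting more threads than the machine has cores is unlikely to make the run faster. As a minimal sketch, the same command can be assembled in Python with the thread count taken from the machine. The helper name `build_spades_command` is hypothetical, and only the options already used above (-o, -1, -2, -t) appear in it.

```python
import os

def build_spades_command(read1, read2, out_dir, threads=None):
    """Build the SPAdes command line as a list of arguments.

    If threads is not given, use the number of CPUs reported by the system.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return ["spades.py", "-o", out_dir,
            "-1", read1, "-2", read2,
            "-t", str(threads)]

cmd = build_spades_command("ERR2093269_1.fastq.gz",
                           "ERR2093269_2.fastq.gz",
                           "SPADES_OUT")
# Print the command so it can be copied into a "!" cell
print(" ".join(cmd))
```

The printed line can be pasted into a code cell after a `!`, exactly like the command shown earlier.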

In [None]:
!wget https://wcs_data_transfer.cog.sanger.ac.uk/ERR2093245_SPADES_OUT.zip

In [None]:
!unzip ERR2093245_SPADES_OUT.zip

### STEP 4: View the Results

The output is in the SPADES_OUT directory. You can list the contents of the folder to view the files.

In [None]:
%cd SPADES_OUT/

In [None]:
!pwd

In [None]:
!ls -l

For downstream analysis, the file you are likely to be interested in is "contigs.fasta".

In [None]:
# check how many contigs were generated for your sample
# (we are already inside SPADES_OUT, so the path is just contigs.fasta)
!grep -c '>' contigs.fasta
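
Counting the header lines gives the number of contigs. A few more numbers can be read from the same file with a short Python sketch. It assumes the current directory is SPADES_OUT and that "contigs.fasta" is a plain-text FASTA file; the helper name `contig_lengths` is ours.

```python
import os

def contig_lengths(fasta_path):
    """Return a list with the length of every sequence in a FASTA file."""
    lengths = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # a header line starts a new contig
                lengths.append(0)
            elif line and lengths:
                # sequence lines add to the current contig
                lengths[-1] += len(line)
    return lengths

if os.path.exists("contigs.fasta"):
    lengths = contig_lengths("contigs.fasta")
    print("contigs:", len(lengths))
    print("total length:", sum(lengths))
    print("largest contig:", max(lengths, default=0))
else:
    print("contigs.fasta not found in the current directory")
```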

## Assessing quality after assembly using QUAST

In [None]:
!conda install -c bioconda quast

We can also generate statistics for the assembled contigs, namely the number of contigs, N50 and total assembled size, using another tool, “QUAST”. It can be run using the following command:


In [None]:
!quast.py contigs.fasta

The tool will create a folder named “quast_results”, and the results will be in a subfolder whose name starts with “results”. To view the results, open the “report.pdf” file in that subfolder.
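
N50, one of the statistics mentioned above, is easier to understand with a worked example. Sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running total first reaches half of the total assembly size. The sketch below applies this definition to a made-up list of contig lengths; it is an illustration, not QUAST's own code.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # stop at the first contig where the running sum covers half the total
        if running * 2 >= total:
            return length
    return 0

# Made-up example: total is 300, so half is 150.
# 100 alone is below 150; 100 + 80 = 180 reaches it, so N50 is 80.
example = [100, 80, 50, 30, 20, 20]
print(n50(example))  # prints 80
```

A larger N50 means that more of the assembly sits in long contigs.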


In [None]:
!ls -l

In [None]:
%cd quast_results/
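
Besides the PDF, QUAST also writes the report as a tab-separated text file, which is convenient to read from code. The sketch below is written under assumptions: that the most recent run is available under "latest/report.tsv" inside quast_results, and that the metric names match those used in the list below. Check the file on your own run if a value comes back as "not found". The helper name `read_quast_report` is ours.

```python
import csv
import os

def read_quast_report(tsv_path):
    """Map each metric name in the report to its value for the first assembly."""
    metrics = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                metrics[row[0]] = row[1]
    return metrics

# Assumed location, relative to the quast_results folder
report_path = os.path.join("latest", "report.tsv")
if os.path.exists(report_path):
    report = read_quast_report(report_path)
    # Metric names below are assumptions about the report layout
    for name in ["# contigs (>= 10000 bp)", "GC (%)", "N50", "Largest contig"]:
        print(name, "=", report.get(name, "not found"))
else:
    print(report_path, "not found; check the folder listing")
```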

### Question 1: How many contigs were greater than or equal to 10,000 bp in the assembly?

### Question 2: What was the GC content in the sample?

### Question 3: What was the N50?

### Question 4: What was the size of the largest contig?

### Question 5: Was this a good quality sequence? Support your answer.

### Long-read genome assembly


Long-read genome assembly involves reconstructing a complete DNA sequence from long-read sequencing data. Unlike short-read sequencing, which produces shorter fragments of DNA, long-read sequencing generates much longer fragments, often spanning thousands of base pairs. This method provides several advantages:


1.   Improved Contiguity: Long reads can span repetitive regions and complex genomic structures, resulting in more continuous and accurate assemblies.
2.   Better Resolution of Complex Regions: Long reads are particularly useful for assembling regions with high GC content, structural variations, and repetitive elements that are challenging for short-read technologies.






In [None]:
%cd /content

In [None]:
!mkdir long_read_assembly

In [None]:
%cd long_read_assembly

In [None]:
!pwd


/content


#### Retrieve long read data

In [None]:
!wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/046/SRR17645346/SRR17645346_1.fastq.gz
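
To see how long these reads actually are, the read lengths can be summarised directly from the downloaded file. This is a minimal sketch that assumes four lines per FASTQ record, with the sequence on the second line of each record. The helper name `read_lengths` and the `max_reads` limit are ours, added so that only a sample of a large file needs to be read.

```python
import gzip
import os

def read_lengths(fastq_gz_path, max_reads=None):
    """Return sequence lengths from a gzipped FASTQ file (4 lines per record)."""
    lengths = []
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                lengths.append(len(line.strip()))
                if max_reads is not None and len(lengths) >= max_reads:
                    break
    return lengths

fastq = "SRR17645346_1.fastq.gz"
if os.path.exists(fastq):
    lengths = read_lengths(fastq, max_reads=10000)
    if lengths:
        print("reads sampled:", len(lengths))
        print("mean length:", round(sum(lengths) / len(lengths)))
        print("longest read:", max(lengths))
else:
    print(fastq, "not found; run the download cell first")
```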


#### Install Flye

Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs.

https://github.com/fenderglass/Flye

In [None]:
!conda install -c bioconda flye

##### Look at the Flye help page

In [None]:
!flye -h

In [None]:
!flye -t 16 --nano-raw SRR17645346_1.fastq.gz -o assembly

In this command, the -t flag specifies the number of threads, --nano-raw tells the assembler that we are providing raw nanopore reads, and -o specifies the output directory.

In [None]:
%cd assembly

In [None]:
!ls

The main output files are:


1. assembly.fasta - Final assembly. Contains contigs and possibly scaffolds.
2. assembly_graph.gfa - Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges.
3. assembly_info.txt - Extra information about contigs (such as length or coverage).
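
The assembly_info.txt file is a tab-separated table, so it can be read with a few lines of Python. The sketch below assumes that the first line is a header and that it includes columns named "#seq_name", "length", "cov." and "circ."; these names are assumptions, so check them against your own file. The helper name `read_assembly_info` is ours.

```python
import csv
import os

def read_assembly_info(path):
    """Read a tab-separated table with a header line into a list of dicts."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]

if os.path.exists("assembly_info.txt"):
    for contig in read_assembly_info("assembly_info.txt"):
        # Column names here are assumed; .get() returns None if one is missing
        print(contig.get("#seq_name"),
              contig.get("length"),
              contig.get("cov."),
              contig.get("circ."))
else:
    print("assembly_info.txt not found in the current directory")
```

For a bacterial genome, a single long contig marked as circular would suggest that the chromosome was assembled in one piece.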


#### Question 6. Run QUAST QC on the long-read assembly (assembly.fasta) and determine the N50.