# Part 3. Hands-on genome assembly

Welcome to the hands-on tutorial! Here we will assemble the genome of *Vibrio alginolyticus*, a bacterium that plays a role in mangrove ecosystems; the sample used here was originally isolated in the Colombian Pacific (doi: 10.1128/spectrum.02928-23). It can form symbiotic relationships with other organisms, and it is used as an indicator species for monitoring pollution and eutrophication in coastal and estuarine waters, including mangroves.

Before you start, keep your files organized: create a Google Drive folder to store all the material for this course and upload all the .ipynb files downloaded from GitHub.

### Install condacolab

In [None]:
# install condacolab, a version of conda that runs in Google Colab

!pip install -q condacolab
import condacolab
condacolab.install()


### Install software

In [None]:
#Takes around 30 seconds
!pip install quast

In [None]:
#This takes about 4 minutes
!conda install bioconda::busco

In [None]:
!conda install bioconda::prodigal

### Download data

In [None]:
!wget https://zenodo.org/records/14969215/files/long_reads_assembly.tar.gz

### Extract the .tar.gz file

In [None]:
!tar -xvf long_reads_assembly.tar.gz

In [None]:
#Go to the tutorial data folder
#In Google Colab, you must use the '%cd' magic command to change directories.
%cd long_reads_assembly

In [None]:
#We can list the files in that folder
!ls
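As an optional check before running the tools, the sketch below reports which of the input files used later in this tutorial are not present in the current folder. The helper name `missing_files` is hypothetical, and the file list is only an assumption taken from the commands in the following sections, not an official manifest of the archive.

```python
from pathlib import Path

# File names taken from the commands used later in this tutorial
# (assumed to be extracted into the current folder).
EXPECTED_FILES = [
    "NGSEPcore_5.0.0.jar",
    "SRR31094202_m10k_q15_Valginolyticus_nanopore.fastq.gz",
    "Vibrioalginolyticus_ASM2365091v1.fna",
    "Vibrioalginolyticus_ASM2365091v1.gff3",
]


def missing_files(names, directory="."):
    """Return the names that do not exist as regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# An empty list means every expected file was found.
print(missing_files(EXPECTED_FILES))
```

If the list is not empty, check that the archive was extracted and that you are inside the `long_reads_assembly` folder.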

## Genome assembly using NGSEP

To run the JAR file, Java must be available in your environment (on systems that use environment modules, load the Java module first). You can then execute the following command, which displays the different algorithms included in NGSEP.

In [None]:
#This takes about 3 seconds
!java -jar NGSEPcore_5.0.0.jar

You will get the following output:

![image](./images/ngsep.png)

To visualize the assembler options, run:

In [None]:
!java -jar NGSEPcore_5.0.0.jar Assembler

You will get the following output:

![image](./images/ngsep_2.png)

## *Vibrio alginolyticus* Assembly - Nanopore sequencing from a Colombian Sample

In [None]:
#Runs in about 35 minutes
!java -XX:+UseSerialGC -Xmx12g -jar NGSEPcore_5.0.0.jar Assembler -i SRR31094202_m10k_q15_Valginolyticus_nanopore.fastq.gz -o Valginolyticus_nanopore_ngsep

After about 35 minutes, you will get the following output:

![image](./images/ngsep_3.png)

## Quality Evaluation

Assemblers can introduce errors that undermine the quality of an assembly, so the results must always be reviewed. In this section we will use QUAST and BUSCO to analyze the quality of the genome assembly.

In [None]:
!quast.py -t 4 Valginolyticus_nanopore_ngsep.fa Vibrioalginolyticus_ASM2365091v1.fna

You will get the following output:

![image](./images/quast.png)

Now, let's run BUSCO: 

In [None]:
#Takes about 20 seconds
!busco -i Valginolyticus_nanopore_ngsep.fa -m genome -l bacteria_odb10 -o valginolyticus_ngsep_busco

If it was not possible to run BUSCO in Google Colab, the output files can be found in the `valginolyticus_ngsep_busco` folder.

In [None]:
#Go to the folder
%cd valginolyticus_ngsep_busco

In [None]:
#List the files
!ls

In [None]:
#Open the file short_summary.specific.bacteria_odb10.valginolyticus_ngsep_busco.txt
!cat short_summary.specific.bacteria_odb10.valginolyticus_ngsep_busco.txt

You will get the following output:

![image](./images/BUSCO.png)
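The summary file contains a one-line score string with the percentages of complete (C), single-copy (S), duplicated (D), fragmented (F) and missing (M) BUSCOs, plus the number of genes searched (n). A minimal sketch for extracting those values is shown below. The function name `parse_busco_summary` is hypothetical, the example string uses made-up numbers, and the regular expression assumes the score-line layout written by recent BUSCO versions.

```python
import re

# Pattern for a BUSCO score line such as:
#   C:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Extract the BUSCO percentages and gene count from summary text."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["total"] = int(result["total"])  # number of BUSCO genes searched
    return result


# Made-up example values, for illustration only.
example = "\tC:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124"
print(parse_busco_summary(example))
```

To apply it to your own run, read the summary file into a string (for example with `open(...).read()`) and pass that string to the function.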

In this report you can find the assembly statistics:

- Number of scaffolds: Total number of scaffolds in the assembly. More scaffolds indicate a more fragmented assembly.
- Number of contigs: Total number of contigs in the assembly. A higher number indicates fragmentation.
- Total length: Sum of the lengths of all sequences in the assembly.
- Percent gaps: Percentage of the assembly made up of gap characters (N).
- Scaffold N50: Length of the shortest scaffold among the largest scaffolds that together make up at least 50% of the total assembly length.
- Contigs N50: Length of the shortest contig among the largest contigs that together make up at least 50% of the total assembly length. 
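The N50 definition above can be sketched in a few lines of Python. This is only an illustration on toy data, not how BUSCO or QUAST compute it internally; the helper names `fasta_lengths` and `n50` are hypothetical.

```python
def fasta_lengths(lines):
    """Return the length of each sequence from an iterable of FASTA lines."""
    lengths = []
    current = 0
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                lengths.append(current)  # close the previous record
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)  # close the last record
    return lengths


def n50(lengths):
    """Length of the shortest sequence among the largest ones that
    together cover at least 50% of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths: total = 300, and 80 + 70 = 150 reaches half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```

For a real assembly you could pass an open file handle to `fasta_lengths` and the resulting list to `n50`, then compare the value with the one in the report.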

## Gene annotation
We will use Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm), a gene prediction tool for prokaryotic genomes. It identifies genes and translation initiation sites and writes the predictions to a GFF3 file.

In [None]:
#Go back to the data folder
%cd ..

In [None]:
#takes about 15 seconds
!prodigal -f gff -i Valginolyticus_nanopore_ngsep.fa -o Valginolyticus_nanopore_ngsep.gff3

In [None]:
#List the files
!ls
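To get a quick idea of what the annotation contains, you can count the feature types in the GFF3 file. The sketch below is a minimal example: the function name `count_gff_features` is hypothetical, and the two toy lines only imitate the nine tab-separated columns of a GFF record.

```python
from collections import Counter


def count_gff_features(lines):
    """Count GFF records by feature type (third tab-separated column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[fields[2]] += 1
    return counts


# Toy records for illustration only (not real predictions).
toy_gff = [
    "##gff-version  3\n",
    "contig_1\tProdigal\tCDS\t3\t98\t5.1\t+\t0\tID=1_1;\n",
    "contig_1\tProdigal\tCDS\t120\t560\t40.2\t-\t0\tID=1_2;\n",
]
print(count_gff_features(toy_gff)["CDS"])  # prints 2

# With the file produced in this tutorial (assumes it is in the current folder):
# with open("Valginolyticus_nanopore_ngsep.gff3") as handle:
#     print(count_gff_features(handle))
```

The number of CDS records gives the number of genes predicted in the assembly.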

## Alignment and genome comparison
We will use NGSEP GenomesAligner, which aligns genomes based on gene synteny.

In [None]:
#Takes about 1 minute
!java -XX:+UseSerialGC -Xmx12g -jar NGSEPcore_5.0.0.jar GenomesAligner -o galn Valginolyticus_nanopore_ngsep.fa Valginolyticus_nanopore_ngsep.gff3 Vibrioalginolyticus_ASM2365091v1.fna Vibrioalginolyticus_ASM2365091v1.gff3

In [None]:
#List the files
!ls

_____________________________

# Results interpretation  

In this module we are going to evaluate all the output files created. You may need to consult the manuals:

- QUAST: https://quast.sourceforge.net/docs/manual.html
- BUSCO: https://busco.ezlab.org/busco_userguide.html
- GFF3 format: https://www.ensembl.org/info/website/upload/gff3.html

As output files of **NGSEP**, you will find:  

- The presence/absence matrix for each of the orthogroups found.
- The frequency of each orthogroup and its classification between core_genome and accessory_genome.
- The orthogroups with the list of genes for each one
- The files for visualization with SynVisio
- The alignment of the genomes in html format
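To illustrate the idea behind the core_genome and accessory_genome labels, here is a minimal sketch that classifies orthogroups from presence/absence information. It assumes the common convention that an orthogroup is core when it is present in every genome compared; the exact criterion and file layout used by NGSEP may differ, and the dictionary, genome names and function name `classify_orthogroups` are hypothetical.

```python
def classify_orthogroups(presence, n_genomes):
    """Label each orthogroup as core (present in all genomes) or accessory.

    `presence` maps an orthogroup id to the genomes in which it was found.
    """
    labels = {}
    for group, genomes in presence.items():
        if len(set(genomes)) == n_genomes:
            labels[group] = "core_genome"
        else:
            labels[group] = "accessory_genome"
    return labels


# Toy presence/absence data for two genomes (illustration only).
toy_presence = {
    "OG1": ["assembly", "reference"],
    "OG2": ["assembly"],
    "OG3": ["reference"],
}
print(classify_orthogroups(toy_presence, n_genomes=2))
```

In this toy example only `OG1` is found in both genomes, so it is the only orthogroup labeled as core.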

### Now let’s visualize our assembly. 

#### SynVisio

[SynVisio](https://synvisio.github.io/#/) is a web-based synteny viewer where you can view aligned genomes (the genome you aligned and the reference genome).

1. Open the link https://synvisio.github.io/#/ 

![image](./images/synvisio.png)

2. In the “Upload own data to Dashboard” tab, load your output files galn_SynvisioCollinearity.txt and galn_SynvisioAnnot.txt.

![image](./images/synvisio2.png)

3. Click on upload.

![image](./images/synvisio3.png)

4. Go back to the “Synteny Dashboard” tab.

5. Select the source and target chromosomes you want to compare.

![image](./images/synvisio4.png)

6. Explore your data visualization.

![image](./images/synvisio5.png)


#### D-GENIES
D-GENIES is a web-based tool for aligning genomes based on sequence. Go to the web portal https://dgenies.toulouse.inra.fr/ and, in the “Run” tab, load the two genomes you used in the previous step. Then explore the visualizations of your data.