# Annotation of genomes

## Introduction

Once we have analysis-ready genomes (verified for quality and absence of contamination), we can annotate them. The main goal of annotation is to identify the genomic features that are present in the contigs.

There are many available tools to find these relevant sequences, and they mostly use predictive algorithms (recognition of common sequences in the DNA) or alignment against a reference database. 

Two of the most commonly used tools are [BAKTA](https://github.com/oschwengers/bakta)  (https://doi.org/10.1099/mgen.0.000685) and the [NCBI Prokaryotic Genome Annotation Pipeline](https://github.com/ncbi/pgap) (https://doi.org/10.1093/nar/gkaa1105). 

Bakta produces a comprehensive annotation of Open Reading Frames (sequences that code for amino acids over a length threshold), different types of RNA (ncRNA, tRNA, rRNA), hypothetical proteins, genes and CRISPR arrays. 

Finally, we will do a quick phylogenetic analysis by measuring differences among the genomes and produce a phylogenetic tree with the results.

### Preparation and required tools

As in every tutorial, we make the pre-packaged tools of the HPC cluster available by running the script below:

In [None]:
# source PATH to use module function
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

The tools used in this tutorial are:

- `bakta        v1.7.0`
- `snippy       v4.6.0`
- `gubbins      v3.2.1`
- `FastTree     v2.1.11`

Bakta uses a reference database that is too large for a tutorial (~30 GB), so we are using a reduced database (~3.0 GB) that may produce slightly less reliable feature annotation.

The database was downloaded and decompressed into our software folder using the command below:

```sh
singularity exec -B /scratch,/project,/etc </PATH/TO/BAKTA/IMAGE/> bakta_db download \
--output </OUTPUT/DIR/> \
--type light
```

Finally, let's do a quick recap of the results we have produced and the software employed in the tutorials.

In [None]:
echo -e "This is how the project folder looks for now:\n"
tree -dL 2 ~/tutorials

echo -e "\nAnd remember the structure of the software directory:\n"
tree -dL 2 /mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/ 

## Annotation of draft genomes

We run **bakta** from a Singularity image, and it needs the **reference database** that was downloaded beforehand.

To avoid typing these paths over and over again, we store both locations in
[environment variables](https://linuxize.com/post/how-to-set-and-list-environment-variables-in-linux/).

In [None]:
BAKTA_IMG="/mnt/cidgoh-object-storage/images/bakta_1.7.sif"
BAKTA_DB="/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/baktadb-light"
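Because every later command depends on these paths, it can help to confirm that they exist before starting a long run. The helper below is a minimal sketch; `require_path` is a hypothetical name introduced here for illustration and is not part of bakta or Singularity.

```sh
# require_path PATH...: report every path that does not exist.
# Returns a non-zero status if at least one path is missing.
# (hypothetical helper, not provided by bakta or Singularity)
require_path() {
    local missing=0 p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "Missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the variables defined above (uncomment inside the tutorial environment):
# require_path "$BAKTA_IMG" "$BAKTA_DB" && echo "Bakta image and database found"
```

The check only tests for existence; it does not verify that the image or database is complete.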

Now we run `bakta` on a single draft genome. Because each annotation takes 15 to 20 minutes, the results for the remaining genomes are already provided.

The option `-B /etc,/mnt` works like `export SINGULARITY_BIND="/mnt,/etc"`: it makes the listed host directories accessible to commands running inside the container.

In [None]:
singularity exec -B /etc,/mnt "$BAKTA_IMG" bakta \
    --db $BAKTA_DB                                                                  `# path to bakta database` \
    ~/tutorials/results/contigs/ERR10479518_contigs.fa                              `# file to annotate`\
    --output ~/tutorials/results/annotation/                                        `# output directory` \
    --genus Pseudomonas                                                             `# specify genus of isolate` \
    --prefix "ERR10479518"                                                          `# prefix of sample name only for output`

Now, we copy the rest of the results from our precomputed analyses and explore the output of **bakta**. As the results are fairly large (about 700 MB), the copy may take a few minutes.

In [None]:
cp -r /mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/precomputed_results/annotation/* \
      ~/tutorials/results/annotation/

This is how our directory structure looks now; it includes a new folder with all bakta results.

In [None]:
tree -dL 2 $HOME/tutorials

The output of the **Bakta** annotation is in `tutorials/results/annotation` and consists of several files.

- Circular plots `[.svg, .png]` can be viewed directly in Jupyter by clicking on them, or downloaded to your local computer.
- Annotation results for this dataset are not ideal because we are working with draft genomes instead of complete chromosome assemblies or scaffolds.
- More refined assemblies require long reads to polish the contigs and eliminate gaps.
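Bakta also writes a GFF3 file for each genome, which lists one feature per line with the feature type in the third column. As a rough sketch, the function below tallies the feature types in such a file. `count_feature_types` is a hypothetical helper, and the example path assumes the output directory and prefix used in the `bakta` command above.

```sh
# count_feature_types FILE: tally the feature types (GFF3 column 3) in FILE
# (hypothetical helper, not part of bakta)
count_feature_types() {
    awk -F'\t' '
        /^##FASTA/     { exit }    # stop at the embedded sequence section, if present
        /^#/ || NF < 9 { next }    # skip comments and incomplete lines
        { n[$3]++ }
        END { for (t in n) print n[t] "\t" t }
    ' "$1" | sort -k1,1nr -k2,2
}

# Assumed location of the GFF3 file for the genome annotated above
GFF="$HOME/tutorials/results/annotation/ERR10479518.gff3"
if [ -f "$GFF" ]; then
    count_feature_types "$GFF"
fi
```

The output has one line per feature type, with the most frequent type first.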

Take a look at this paper for more information about assembly approaches: https://doi.org/10.1093/bib/bbw096. 

# Phylogenetic analysis

## Core genome SNV

To produce a phylogenetic tree we must compare the genomes using some kind of metric. Here, we will use **Snippy** to identify single nucleotide variants (SNV) in our genomes (https://github.com/tseemann/snippy). 

**Gubbins** identifies regions of the alignment that were likely acquired through recombination, such as horizontal gene transfer, and that may obscure the interpretation of the phylogenetic tree (https://github.com/nickjcroucher/gubbins). We use it to exclude these regions from our comparison.

In [None]:
# define environment variables
SNIPPY_IMG="/mnt/cidgoh-object-storage/images/snippy_4.6.0.sif"
GUBBINS_IMG="/mnt/cidgoh-object-storage/images/gubbins-3.2.1.img"
REF_GENOME="/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/reference_data/GCF_000006765.1_ASM676v1_genomic.fna"

**Snippy** is a relatively fast tool, so we can analyze all the genomes we have available without too much wait. We apply the tool to every available draft genome. 

- Snippy is designed primarily for raw reads; here we pass the assembled contigs with the `--ctgs` option instead.

To avoid typing the same command for every draft genome, we use a [**_for loop_**](https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php) to apply the same command to every contig file. 

In [None]:
# runs for approx. 10 min
# iterate over the glob directly instead of parsing the output of ls
for CONTIG in "$HOME"/tutorials/results/contigs/*contigs.fa
do
    
    # define isolate name
    SAMPLE_PREFIX=$(basename "$CONTIG" '_contigs.fa')
    echo "Processing $SAMPLE_PREFIX"

    # run snippy for each isolate
    singularity exec -B /mnt,/etc $SNIPPY_IMG snippy \
    --cpus 8 \
    --force \
    --cleanup \
    --quiet \
    --outdir $HOME/tutorials/results/snippy/$SAMPLE_PREFIX  `# save in a subdirectory named after sample ID` \
    --ctgs "$CONTIG" \
    --ref $REF_GENOME  

done
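The loop relies on two pieces: a glob (`*contigs.fa`) that expands to every contig file, and `basename`, which strips the directory and the `_contigs.fa` suffix to leave the sample ID. A small sketch with empty placeholder files shows the pattern in isolation; the file names and the helper `list_sample_prefixes` are made up for illustration.

```sh
# list_sample_prefixes DIR: print one sample ID per *_contigs.fa file in DIR
# (hypothetical helper used only to illustrate the loop pattern)
list_sample_prefixes() {
    local contig
    for contig in "$1"/*_contigs.fa; do
        [ -e "$contig" ] || continue        # the glob matched nothing
        basename "$contig" _contigs.fa      # drop directory and suffix
    done
}

# Demo with placeholder files in a temporary directory
demo_dir=$(mktemp -d)
touch "$demo_dir/ERR0000001_contigs.fa" "$demo_dir/ERR0000002_contigs.fa"
list_sample_prefixes "$demo_dir"
rm -r "$demo_dir"
```

Iterating over the glob rather than over the output of `ls` keeps file names intact even if they contain spaces.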

Once we have the data from the previous command, we perform a few additional steps to clean our results and have them ready for phylogenetic analysis. 

1. We run **snippy-core** to summarize the **SNVs** found at positions that are present in all isolates. These shared positions can be considered the core genome, that is, the part of the genome common to a group of isolates.
    - We also run `snippy-clean_full_aln` on the resulting alignment to replace unusual characters with `N`, so that downstream tools such as Gubbins can read it.

In [None]:
cd $HOME/tutorials/results/snippy/

singularity exec -B /mnt,/etc "$SNIPPY_IMG" snippy-core \
    --ref $REF_GENOME \
    $HOME/tutorials/results/snippy/ERR*
    
singularity exec -B /mnt,/etc $SNIPPY_IMG snippy-clean_full_aln \
    core.full.aln > clean.full.aln


2. Then, **Gubbins** filters out the regions that were likely acquired through horizontal gene transfer.

3. Once we have the cleaned alignment, we use **snp-sites** to extract the sites in the alignment with a **SNV**.

In [None]:
# move into the snippy output folder
cd $HOME/tutorials/results/snippy

# Gubbins: filter out regions likely acquired by horizontal gene transfer
singularity exec -B /mnt,/etc $GUBBINS_IMG run_gubbins.py \
    -p gubbins \
    clean.full.aln

# snp-sites: extract the variable sites (-c keeps only columns made up of A, C, G and T)
singularity exec -B /mnt,/etc $SNIPPY_IMG snp-sites \
    -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln
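Before building a tree, it can be worth confirming that every isolate is present in the alignment and that all sequences have the same length. The following sketch prints the name and length of each sequence in a FASTA alignment; `aln_summary` is a hypothetical helper, not part of Snippy, Gubbins or snp-sites.

```sh
# aln_summary FILE: print "<sequence name><TAB><length>" for each record in a FASTA file
# (hypothetical helper for a quick sanity check)
aln_summary() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # flush the previous record
               name = substr($1, 2); len = 0; next }
        { len += length($0) }                        # sequences may span several lines
        END  { if (name != "") print name "\t" len }
    ' "$1"
}

# Run it on the alignment produced above, if it exists in the current folder
if [ -f clean.core.aln ]; then
    aln_summary clean.core.aln
fi
```

In a valid alignment the second column is identical on every line.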

## Phylogenetic tree

Once all of the above steps are complete, we have a file that contains the **SNVs** present in regions common to all isolates (the core genome) and not attributed to horizontal gene transfer.

This file can be used to create a phylogenetic tree. To make it quick, we are using the tool **FastTree**, which was already prepared in our software folder.


In [None]:
/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/FastTree \
    -gamma                  `# employ a gamma model` \
    -gtr                    `# use generalized time reversible model` \
    -nt                     `# alignment is nucleotide based` \
    -quiet                  `# reduce printed output` \
    clean.core.aln > clean.core.tree

Finally, we can visualize the tree we just produced by pasting the contents of the file `clean.core.tree` into a visualization tool such as [phylo.io](https://phylo.io/) or [iTOL](https://itol.embl.de/upload.cgi).
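A Newick file is plain text, so a quick check that every isolate made it into the tree is possible before uploading it. The sketch below lists the leaf names; `tree_leaves` is a hypothetical helper and assumes that leaf names contain no parentheses, commas, colons or semicolons.

```sh
# tree_leaves FILE: print the leaf names of a Newick tree, one per line.
# Leaves directly follow "(" or ","; labels that follow ")" belong to
# internal nodes (for example support values) and are skipped.
# (hypothetical helper, not part of FastTree)
tree_leaves() {
    grep -oE '[(,][^(),:;]+' "$1" | tr -d '(,'
}

# Run it on the tree produced above, if it exists in the current folder
if [ -f clean.core.tree ]; then
    tree_leaves clean.core.tree
fi
```

The number of lines printed should match the number of sequences in the alignment.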

The data we used may not be the best example for showing differences in a phylogenetic tree, as it comes from a single-center outbreak and the samples are highly similar to one another.
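To put a number on how similar the isolates are, the pairwise SNV distances can be counted directly from the alignment. This is a minimal sketch in `awk`, assuming a FASTA alignment in which all sequences have the same length; `pairwise_snv_distances` is a hypothetical helper, and dedicated tools would be preferable for large datasets.

```sh
# pairwise_snv_distances FILE: for every pair of sequences in a FASTA alignment,
# print "<name1><TAB><name2><TAB><number of differing positions>".
# Only positions where both sequences carry A, C, G or T are compared.
# (hypothetical helper written for illustration)
pairwise_snv_distances() {
    awk '
        /^>/ { n++; name[n] = substr($1, 2); next }
        { seq[n] = seq[n] toupper($0) }
        END {
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    len = length(seq[i])
                    for (k = 1; k <= len; k++) {
                        a = substr(seq[i], k, 1)
                        b = substr(seq[j], k, 1)
                        if (a != b && a ~ /[ACGT]/ && b ~ /[ACGT]/) d++
                    }
                    print name[i] "\t" name[j] "\t" d
                }
        }
    ' "$1"
}

# Run it on the core SNV alignment, if it exists in the current folder
if [ -f clean.core.aln ]; then
    pairwise_snv_distances clean.core.aln
fi
```

Small distances between most pairs would be consistent with samples drawn from a single outbreak.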