# Annotation of genomes

## Introduction to annotation tools

Once we have analysis-ready genomes (quality verified and free of contamination), we can annotate them. The main goal of annotation is to identify the genomic features present in the contigs. Many tools are available to find these features; most rely on predictive algorithms (recognition of common sequence patterns in the DNA) or on alignment against a reference database.

Two of the most commonly used tools are [Bakta](https://github.com/oschwengers/bakta) (https://doi.org/10.1099/mgen.0.000685) and the [NCBI Prokaryotic Genome Annotation Pipeline](https://github.com/ncbi/pgap) (https://doi.org/10.1093/nar/gkaa1105).

Bakta produces a comprehensive annotation that includes open reading frames (sequences above a length threshold that could code for a protein), several types of RNA (ncRNA, tRNA, rRNA), hypothetical proteins, genes, and CRISPR arrays.

In [3]:
# load the shell profile that makes the module function available
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

Bakta uses a reference database that is too large for a tutorial (~30 GB), so we are using a reduced database (~3.0 GB) that may produce less reliable feature annotation.

The database was downloaded and decompressed into `~/tutorials/tools/db-light` using the command below:
```
singularity exec -B /scratch,/project,/etc </PATH/TO/BAKTA/IMAGE/> \
    bakta_db download \
    --output </OUTPUT/DIR/> \
    --type light
```
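
Before running an annotation, it can help to confirm that the download finished and left a populated directory. The sketch below uses a hypothetical helper, `check_db_dir` (not part of Bakta), that only tests whether the directory exists and is not empty. It assumes the database lives in `~/tutorials/tools/db-light`, the path passed to `--db` later in this tutorial.

```bash
# Hypothetical helper: succeed only if the directory exists and contains files
check_db_dir() {
    db_dir="$1"
    if [ ! -d "$db_dir" ]; then
        echo "missing: $db_dir"
        return 1
    fi
    # ls -A prints nothing for an empty directory
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "empty: $db_dir"
        return 1
    fi
    echo "ok: $db_dir"
}

# Assumed location of the light database in this tutorial
check_db_dir "$HOME/tutorials/tools/db-light" || echo "Download the database before running bakta"
```

This does not validate the database contents; it only catches the common case of a missing or interrupted download.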

## Annotation of draft genomes

First, we can review the structure of our working directory. 

### **Note:** Make sure the **bash_kernel** is selected for this Jupyter notebook.

In [4]:
# review the project structure
tree -dL 2 ~/tutorials

/home/jupyter-mdprieto/tutorials
|-- raw_reads
|-- results
|   |-- annotation
|   |-- assembly_checkm
|   |-- assembly_quast
|   |-- contigs
|   |-- reads_qc
|   `-- snippy
|-- tools
|   |-- db-light
|   |-- reference_data
|   `-- sing_imgs
`-- trimmed_reads

13 directories


We save the path to the Bakta Singularity image in the following [environment variable](https://linuxize.com/post/how-to-set-and-list-environment-variables-in-linux/) so that we can reuse it in our commands.

In [5]:
BAKTA_IMG="/mnt/cidgoh-object-storage/images/bakta_1.7.sif"
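
A plain assignment like the one above creates a shell variable: it is available to commands typed in the same session, which is all the tutorial needs, because the shell expands `$BAKTA_IMG` before launching `singularity`. The sketch below illustrates the difference from an exported variable. `DEMO_IMG`, its path, and `child_sees` are hypothetical names used only for this illustration.

```bash
# Hypothetical helper: print the value a child process sees for DEMO_IMG
child_sees() {
    sh -c 'echo "child sees: [$DEMO_IMG]"'
}

# Plain assignment: visible in this shell, not inherited by child processes
unset DEMO_IMG
DEMO_IMG="/path/to/image.sif"   # placeholder path, for illustration only
echo "this shell sees: [$DEMO_IMG]"
child_sees                      # prints: child sees: []

# After export, child processes inherit the variable as well
export DEMO_IMG
child_sees                      # prints: child sees: [/path/to/image.sif]
```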

Now, we run `bakta` on a single draft genome. Each annotation takes about 7 minutes, so the results for the remaining genomes are already provided.

In [16]:
# remove previously produced results (-f: no error if there are none)
rm -f ~/tutorials/results/annotation/ERR10479518*

singularity exec -B /etc "$BAKTA_IMG" bakta \
    --db ~/tutorials/tools/db-light                                    `# path to bakta database` \
    --output ~/tutorials/results/annotation/                           `# output directory` \
    --genus Pseudomonas                                                `# specify genus of isolate` \
    --prefix "$(echo "ERR10479518_contigs.fa" | grep -Eo "ERR[0-9]+")" `# sample name only, used as output prefix` \
    ~/tutorials/results/contigs/ERR10479518_contigs.fa                 `# file to annotate`


parse genome sequences...
	imported: 142
	filtered & revised: 142
	contigs: 142

start annotation...
predict tRNAs...
	found: 63
predict tmRNAs...
	found: 1
predict rRNAs...
	found: 3
predict ncRNAs...
	found: 53
predict ncRNA regions...
	found: 39
predict CRISPR arrays...
	found: 0
predict & annotate CDSs...
	predicted: 6713 
	discarded spurious: 4
	revised translational exceptions: 1
	detected IPSs: 0
	found PSCCs: 6107
	lookup annotations...
	conduct expert systems...
		amrfinder: 26
		protein sequences: 625
	combine annotations and mark hypotheticals...
analyze hypothetical proteins: 588
	detected Pfam hits: 136 
	calculated proteins statistics
	revise special cases...
extract sORF...
	potential: 44894
	discarded due to overlaps: 35701
	discarded spurious: 0
	detected IPSs: 0
	found PSCCs: 0
	lookup annotations...
	filter and combine annotations...
	filtered sORFs: 0
detect gaps...
	found: 0
detect oriCs/oriVs...
	found: 2
detect oriTs...
	found: 0
apply feature overlap filters...
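
The same command could be applied to the remaining draft genomes with a loop. The sketch below is a dry run: it only prints one command per contig file instead of executing it, since each annotation takes several minutes. `print_bakta_cmds` is a hypothetical helper, and the flags and paths are the ones used in the cell above.

```bash
# Hypothetical helper: print one bakta command per contig file in a directory
print_bakta_cmds() {
    contigs_dir="$1"
    for contig in "$contigs_dir"/*_contigs.fa; do
        [ -e "$contig" ] || continue          # skip if the glob matched nothing
        file_name=${contig##*/}               # strip the directory
        sample_id=${file_name%_contigs.fa}    # strip the suffix
        echo "singularity exec -B /etc \$BAKTA_IMG bakta --db ~/tutorials/tools/db-light --output ~/tutorials/results/annotation/ --genus Pseudomonas --prefix $sample_id $contig"
    done
}

# Dry run over the tutorial's contigs directory
print_bakta_cmds "$HOME/tutorials/results/contigs"
```

Reviewing the printed commands before running them makes it easier to spot a wrong prefix or path.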

After the run, the project looks like this: the `results/annotation` directory holds the annotation output, and the `tools` directory holds the reference database used by Bakta.

In [6]:
# project structure
tree -dL 2 ~/tutorials

/home/jupyter-mdprieto/tutorials
|-- raw_reads
|-- results
|   |-- annotation
|   |-- assembly_checkm
|   |-- assembly_quast
|   |-- contigs
|   |-- reads_qc
|   `-- snippy
|-- tools
|   |-- db-light
|   |-- reference_data
|   `-- sing_imgs
`-- trimmed_reads

13 directories


The output of **Bakta** annotation can be found in `tutorials/results/annotation`. 

- Circular plots can be visualized directly on Jupyter by clicking them or downloaded to your local computer and opened.
- Annotation results in this dataset are not ideal because we are working with draft genomes instead of complete chromosome assemblies or scaffolds. Those more refined assemblies usually require long reads to join the contigs.

Take a look at this paper for more information about assembly approaches: https://doi.org/10.1093/bib/bbw096. 
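
For a quick overview of what was annotated, the feature types in the tabular output can be tallied from the command line. This is a minimal sketch that assumes the `.tsv` file written by Bakta marks header lines with `#` and stores the feature type in the second column; `count_feature_types` is a hypothetical helper, and the file name follows the `--prefix` used above.

```bash
# Hypothetical helper: tally feature types from a Bakta-style TSV
# Assumes '#' marks header lines and column 2 holds the feature type
count_feature_types() {
    grep -v '^#' "$1" | cut -f 2 | sort | uniq -c | sort -rn
}

tsv="$HOME/tutorials/results/annotation/ERR10479518.tsv"
if [ -f "$tsv" ]; then
    count_feature_types "$tsv"
else
    echo "No annotation table found at $tsv"
fi
```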

# Phylogenetic tree

To produce a phylogenetic tree, we must compare the genomes using some measure of how different they are. Here, we will use **Snippy** to identify single nucleotide variants (SNVs) in our genomes relative to a reference genome (https://github.com/tseemann/snippy).

**Gubbins** is a tool that identifies and filters out regions acquired through horizontal gene transfer, which may otherwise obscure the interpretation of the phylogenetic tree (https://github.com/nickjcroucher/gubbins).

In [2]:
SNIPPY_IMG="$HOME/tutorials/tools/sing_imgs/snippy_4.6.0.sif"
GUBBINS_IMG="$HOME/tutorials/tools/sing_imgs/gubbins-3.2.1.img"
REF_GENOME="$HOME/tutorials/tools/reference_data/GCF_000006765.1_ASM676v1_genomic.fna"

**Snippy** is relatively fast, so we can analyze all the available genomes without much of a wait. We apply the tool to every available draft genome.

- Snippy is intended to work from raw reads; here we pass the assembled contigs with `--ctgs` instead.

To avoid typing the same command for every draft genome, we use a [**_for loop_**](https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php) to quickly reproduce the same command for a list of contig files. 
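
As a minimal illustration of the pattern, the sketch below loops over placeholder files in a temporary directory. The file names `sampleA_contigs.fa` and `sampleB_contigs.fa` are made up for this example and are not part of the tutorial data.

```bash
# Create two empty placeholder files in a temporary directory (illustration only)
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_contigs.fa" "$demo_dir/sampleB_contigs.fa"

# The glob expands to one path per matching file; the loop body runs once per path
for contig in "$demo_dir"/*_contigs.fa
do
    # basename drops the directory and, optionally, a trailing suffix
    sample_id=$(basename "$contig" '_contigs.fa')
    echo "Processing $sample_id"
done

# Clean up the placeholder files
rm -r "$demo_dir"
```

The loop prints `Processing sampleA` and then `Processing sampleB`, one line per file matched by the glob.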

In [3]:
for contig in "$HOME"/tutorials/results/contigs/*contigs.fa
do
    # define isolate name
    sample_id=$(basename "$contig" '_contigs.fa')
    echo "Processing $sample_id"

    # run snippy for each isolate
    singularity exec "$SNIPPY_IMG" snippy \
        --outdir "$HOME/tutorials/results/snippy/$sample_id" `# save in a subdirectory` \
        --ctgs "$contig" \
        --ref "$REF_GENOME" \
        --cpus 8 \
        --force \
        --cleanup \
        --quiet
done

# produce the core genome analysis once, after every isolate has been processed
cd "$HOME/tutorials/results/snippy/"
singularity exec "$SNIPPY_IMG" snippy-core --ref "$REF_GENOME" \
    "$HOME"/tutorials/results/snippy/ERR*

Processing ERR10479510
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[samclip] samclip 0.4.0 by Torsten Seemann (@torstenseemann)
[samclip] Loading: reference/ref.fa.fai
[samclip] Found 1 sequences in reference/ref.fa.fai
[M::process] read 320000 sequences (80000000 bp)...
[M::process] read 253774 sequences (63420526 bp)...
[M::mem_process_seqs] Processed 320000 reads in 33.671 CPU sec, 4.140 real sec
[samclip] Processed 100000 records...
[samclip] Processed 200000 records...
[samclip] Processed 300000 records...
[M::mem_process_seqs] Processed 253774 reads in 29.408 CPU sec, 3.590 real sec
[samclip] Processed 400000 records...
[samclip] Processed 500000 records...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -Y -M -R @RG\tID:ERR10479510\tSM:ERR10479510 -t 8 reference/ref.fa fake_reads.fq
[main] Real time: 9.783 sec; CPU: 64.325 sec
[samclip] Total SAM records 574968, removed 9431, allowed 1610, passed 565537
[samclip] Header contained 3 lines
[samclip] Done.
[bam_sort_core] mergi

Once we have the data from the snippy call, we run a few additional steps to clean our results and to have them ready for phylogenetic analysis. 

1. We run **snippy-core** to summarize the SNVs found in the regions of the genome shared by all the isolates analyzed (the core genome).
2. We run another tool, **Gubbins**, which identifies and filters out regions that were likely acquired through horizontal gene transfer. As the variability in these regions may come from other bacteria in the environment, they are not useful for phylogenetic analysis.
3. Once we have the cleaned alignment, we use **snp-sites** to extract the variable sites from it, and **FastTree** to build the tree from those sites.

In [None]:
cd "$HOME/tutorials/results/snippy"

# clean file
singularity exec "$SNIPPY_IMG" snippy-clean_full_aln core.full.aln > clean.full.aln

# gubbins: filter out regions acquired through horizontal gene transfer
singularity exec "$GUBBINS_IMG" run_gubbins.py -p gubbins clean.full.aln

# SNP-sites: keep the variable sites (-c: only columns made up exclusively of A, C, G, T)
singularity exec "$SNIPPY_IMG" snp-sites -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln

# produce tree
"$HOME/tools/FastTree" -gtr -nt clean.core.aln > clean.core.tree
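
Before building or interpreting the tree, a quick sanity check on the alignment can catch problems early: every isolate should be present, and all sequences in an alignment should have the same length. The sketch below uses a hypothetical helper, `aln_summary`, and assumes it is run from the directory that contains `clean.core.aln`.

```bash
# Hypothetical helper: report the number of sequences and of distinct lengths in a FASTA alignment
aln_summary() {
    awk '
        /^>/ { if (n) lens[len]++; n++; len = 0; next }   # new record: store the previous length
             { len += length($0) }                        # accumulate sequence length
        END  {
            if (n) lens[len]++                            # store the last record
            k = 0
            for (l in lens) k++
            print "sequences=" n, "distinct_lengths=" k
        }
    ' "$1"
}

if [ -f clean.core.aln ]; then
    aln_summary clean.core.aln
else
    echo "clean.core.aln not found in $(pwd)"
fi
```

A valid alignment should report `distinct_lengths=1`; any other value suggests that a file was truncated or is not an alignment.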

Finally, we can visualize the tree we just produced by pasting the contents of the file `clean.core.tree` into a visualization tool such as [phylo.io](https://phylo.io/) or [iTOL](https://itol.embl.de/upload.cgi).
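
Before pasting the tree into a viewer, one simple check is to count its tips and compare the number with the isolates expected. The sketch below relies on the fact that a Newick tree has one more tip than it has commas, assuming that no label contains a comma. `count_tips` is a hypothetical helper, and the check assumes `clean.core.tree` is in the current directory.

```bash
# Hypothetical helper: count the tips of a Newick tree as (number of commas + 1)
# Assumes that no tip label contains a comma
count_tips() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

if [ -f clean.core.tree ]; then
    echo "tips: $(count_tips clean.core.tree)"
else
    echo "clean.core.tree not found in $(pwd)"
fi
```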

The data we used may not show clear differences in a phylogenetic tree, as it comes from a single-center outbreak and the samples were collected close to each other. Another possible issue is that the reference genome is not a good match for these isolates, which would also hide meaningful differences.

In [23]:
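# assumes CONTIGS_DIR holds the path to the contigs directory
cd "$CONTIGS_DIR"

# absolute path of every contig file
readlink -f *contigs.fa > path.txt

# derive one output location per isolate from each path
sed "s/_contigs.fa//" path.txt | sed 's/contigs/results_snippy/' > filenames.txt

# join them and remove temp files
paste filenames.txt path.txt > ~/scripts/snippy-input.tab
rm filenames.txt path.txt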
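
The cell above builds a two-column, tab-separated table with one row per contig file. A similar table can be sketched in a single loop without temporary files. In this variant the first column is the isolate ID rather than a path, which is the layout `snippy-multi` is understood to expect for contig input; treat that as an assumption and check it against the Snippy documentation. `make_snippy_tab` is a hypothetical helper.

```bash
# Hypothetical helper: print "<isolate ID><TAB><path to contigs>" for each contig file
make_snippy_tab() {
    contigs_dir="$1"
    for contig in "$contigs_dir"/*_contigs.fa; do
        [ -e "$contig" ] || continue                 # skip if the glob matched nothing
        file_name=${contig##*/}                      # strip the directory
        printf '%s\t%s\n' "${file_name%_contigs.fa}" "$contig"
    done
}

# Print the table for the tutorial's contigs directory; redirect to a file to save it
make_snippy_tab "$HOME/tutorials/results/contigs"
```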