# Annotation of genomes

Once we have analysis-ready genomes (quality verified and free of contamination), we can annotate them. The main goal of annotation is to identify the genomic features present in the assembled contigs. Annotation tools usually rely on predictive algorithms (recognition of common sequence patterns in the DNA) or on alignment against a reference database to identify the different features of a genome.

Bakta is a tool that produces a comprehensive annotation that includes open reading frames (stretches of amino acid codons of at least a certain length), different types of RNA (ncRNA, tRNA, rRNA), hypothetical proteins, genes, and CRISPR arrays.

In [2]:
# source PATH to use module function
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

Bakta uses a reference database that can be very large (~30 GB). A reduced (light) database of approximately 3 GB can be used instead, at the cost of less specific feature annotations.

We already downloaded the light database using the command below; it is available in `/tools/reference_data/`.
```
singularity exec -B /scratch,/project,/etc </PATH/TO/BAKTA/IMAGE/> \
    bakta_db download \
    --output </OUTPUT/DIR/> \
    --type light
```

We save the paths to the Bakta Singularity image and the reference database in variables so that we can provide them in the main pipeline.

In [2]:
BAKTA_IMG="/mnt/cidgoh-object-storage/images/bakta_1.7.sif"
export BAKTA_DB="/mnt/cidgoh-object-storage/database/bakta/db.tar.gz"
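
Before launching a long annotation run, it can help to confirm that both paths actually exist. The following is a minimal sketch; the helper name `require_path` is an assumption made for illustration and is not part of Bakta or Singularity.

```bash
# Hypothetical helper: report whether a required file or directory exists.
require_path() {
    local label="$1" path="$2"
    if [ ! -e "$path" ]; then
        echo "Missing $label: $path" >&2
        return 1
    fi
    echo "Found $label: $path"
}

# BAKTA_IMG and BAKTA_DB are the variables defined in the cell above.
# "|| true" keeps the notebook cell going so that both paths are reported.
require_path "Bakta image" "${BAKTA_IMG:-}" || true
require_path "Bakta database" "${BAKTA_DB:-}" || true
```

A missing path is reported on standard error, so a typo in a variable shows up before Singularity starts.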

In [3]:
singularity exec -B /etc "$BAKTA_IMG" bakta \
    --db "$BAKTA_DB"                                                        `# path to bakta database` \
    /home/jupyter-mdprieto/tutorials/contigs/ERR10479518_contigs.fa         `# file to annotate` \
    --output /home/jupyter-mdprieto/tutorials/annotation/                   `# output directory` \
    --genus Pseudomonas                                                     `# specify genus of isolate` \
    --prefix "$(echo "ERR10479518_contigs.fa" | grep -Eo "ERR[0-9]+")"      `# prefix of sample name only for output`
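
The `--prefix` value above is derived by extracting the run accession from the file name. As a small sketch, the same extraction can be wrapped in a function and reused for any contigs file; the function name `accession_from_filename` is hypothetical.

```bash
# Hypothetical helper: pull the run accession (ERR followed by digits)
# out of a file name, ignoring any leading directories.
accession_from_filename() {
    basename "$1" | grep -Eo 'ERR[0-9]+' | head -n 1
}

accession_from_filename "/some/dir/ERR10479518_contigs.fa"   # prints ERR10479518
```

A file name without an `ERR` accession produces an empty string, so it is worth checking the result before passing it to `--prefix`.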




Now, our analysis directory looks something like this. It includes the results of the annotation, and the tools directory contains the reference database used to run it.

```
/home/jupyter-mdprieto
|-- scripts
|-- tools
|   |-- reference_data
|   `-- sing_imgs
`-- tutorials
    |-- annotation
    |-- assembly_checkm
    |-- assembly_quast
    |-- contigs
    |-- raw_reads
    |-- results_qc
    `-- trimmed_reads
```

The output of the **Bakta** annotation can be found in `tutorials/annotation`. Circular plots can be viewed in Jupyter by clicking on them, or exported to a local system. They may be partially incomplete because we are working with draft genomes rather than complete chromosome assemblies or scaffolds. Such refined assemblies usually require long reads to resolve the gaps between contigs.
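
One quick way to summarize an annotation is to count how many features of each type were reported. The sketch below reads the feature type from column 3 of a GFF3 file. The function name `count_feature_types` and the toy records are assumptions for illustration only; with real results it would be pointed at the GFF3 file that Bakta writes in the output directory.

```bash
# Hypothetical helper: count features per type (column 3) in a GFF3 file.
count_feature_types() {
    awk -F '\t' '
        /^##FASTA/ { exit }          # stop at the optional embedded sequences
        /^#/ || NF < 9 { next }      # skip comments and malformed lines
        { n[$3]++ }
        END { for (t in n) print n[t] "\t" t }
    ' "$1" | sort -k2,2
}

# Toy GFF3 records (made up for this example, not real annotation output).
toy=$(mktemp)
printf '##gff-version 3\n' > "$toy"
printf 'contig_1\ttoy\tCDS\t1\t300\t.\t+\t0\tID=toy_1\n' >> "$toy"
printf 'contig_1\ttoy\tCDS\t400\t900\t.\t-\t0\tID=toy_2\n' >> "$toy"
printf 'contig_1\ttoy\ttRNA\t1000\t1075\t.\t+\t.\tID=toy_3\n' >> "$toy"

count_feature_types "$toy"   # prints "2 CDS" then "1 tRNA" (tab-separated)
rm -f "$toy"
```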


# Phylogenetic tree

To produce a phylogenetic tree, we must compare the genomes using some kind of metric. Here, we will use **Snippy** to identify single nucleotide variants (SNVs) in our genomes relative to a reference genome and compare how much the isolates differ from one another.

In [28]:
SNIPPY_IMG="/home/jupyter-mdprieto/tools/sing_imgs/snippy_4.6.0.sif"
GUBBINS_IMG="/home/jupyter-mdprieto/tools/sing_imgs/gubbins-3.2.1.img"
CONTIGS_DIR="/home/jupyter-mdprieto/tutorials/contigs"
REF_GENOME="/home/jupyter-mdprieto/tools/reference_data/GCF_000006765.1_ASM676v1_genomic.fna"

Snippy is a relatively fast tool, so we can analyze all the available genomes without much waiting. We apply the tool to every available draft genome. **Snippy** can also be run on raw reads, which is actually the recommended input.
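
Applying a tool to every draft genome comes down to looping over the contig files and deriving an isolate name from each one. The sketch below shows only that iteration, without running Snippy. The function name `list_isolates` is hypothetical, and the `_contigs.fa` suffix is the naming convention used in this tutorial.

```bash
# Hypothetical helper: print the isolate name for every *_contigs.fa file
# in a directory. A glob is used instead of parsing the output of ls.
list_isolates() {
    local dir="$1" contig
    for contig in "$dir"/*_contigs.fa; do
        [ -e "$contig" ] || continue          # the glob matched nothing
        basename "$contig" _contigs.fa        # strip directory and suffix
    done
}

list_isolates "${CONTIGS_DIR:-$HOME/tutorials/contigs}"
```

Checking the list of names first makes it easy to confirm that every expected isolate will be processed.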

In [27]:
for contig in "$HOME"/tutorials/contigs/*contigs.fa
do
    # define isolate name
    meta=$(basename "$contig" '_contigs.fa')

    # run snippy for each isolate
    singularity exec "$SNIPPY_IMG" snippy \
    --outdir "$HOME/tutorials/results_snippy/$meta"  `# save in a subdirectory` \
    --ctgs "$contig" \
    --ref "$REF_GENOME" \
    --cpus 8 \
    --force \
    --cleanup \
    --quiet
done

# produce core genome analysis in results_snippy folder (once, after all isolates are done)
cd "$HOME/tutorials/results_snippy/"
singularity exec "$SNIPPY_IMG" snippy-core --ref "$REF_GENOME" \
    /home/jupyter-mdprieto/tutorials/results_snippy/ERR*

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[samclip] samclip 0.4.0 by Torsten Seemann (@torstenseemann)
[samclip] Loading: reference/ref.fa.fai
[samclip] Found 1 sequences in reference/ref.fa.fai
[M::process] read 320000 sequences (80000000 bp)...
[M::process] read 253774 sequences (63420526 bp)...
[M::mem_process_seqs] Processed 320000 reads in 33.732 CPU sec, 4.157 real sec
[samclip] Processed 100000 records...
[samclip] Processed 200000 records...
[samclip] Processed 300000 records...
[M::mem_process_seqs] Processed 253774 reads in 29.132 CPU sec, 3.567 real sec
[samclip] Processed 400000 records...
[samclip] Processed 500000 records...
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -Y -M -R @RG\tID:ERR10479510\tSM:ERR10479510 -t 8 reference/ref.fa fake_reads.fq
[main] Real time: 9.843 sec; CPU: 64.386 sec
[samclip] Total SAM records 574968, removed 9431, allowed 1610, passed 565537
[samclip] Header contained 3 lines
[samclip] Done.
[bam_sort_core] merging from 0 files and 3 i

Once we have the variant calls from Snippy, we run a few additional steps to clean the results and prepare them for phylogenetic analysis.

1. We run **snippy-core** to summarize the SNVs found at positions of the reference genome that are present in all the isolates analyzed (the core genome).
2. We run another tool, **Gubbins**, which identifies regions that were likely acquired through recombination (horizontal gene transfer) and filters them out. As the variability in these regions may come from other bacteria in the environment, they are not useful for phylogenetic analysis.
3. Once we have the cleaned alignment, we use **snp-sites** to extract the variable sites, and then **FastTree** to build the tree from them.
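
To make the idea of extracting variable sites from an alignment concrete, here is a toy sketch in `awk`. It keeps only the alignment columns where at least one isolate differs and every base is A, C, G, or T. This is an illustration of the concept, not how snp-sites is implemented; the function name `variable_sites` and the sequences are made up.

```bash
# Hypothetical helper: keep only variable, unambiguous columns of a FASTA alignment.
variable_sites() {
    awk '
        /^>/ { name[++n] = $0; next }            # header line starts a new sequence
        { seq[n] = seq[n] toupper($0) }          # sequences may span several lines
        END {
            len = length(seq[1])
            for (i = 1; i <= len; i++) {
                first = substr(seq[1], i, 1)
                variable = 0; clean = 1
                for (j = 1; j <= n; j++) {
                    base = substr(seq[j], i, 1)
                    if (base !~ /^[ACGT]$/) clean = 0
                    if (base != first) variable = 1
                }
                if (variable && clean)
                    for (j = 1; j <= n; j++) out[j] = out[j] substr(seq[j], i, 1)
            }
            for (j = 1; j <= n; j++) print name[j] "\n" out[j]
        }
    ' "$1"
}

# Toy alignment: only column 2 is variable and free of ambiguous bases.
aln=$(mktemp)
printf '>iso1\nACGTAC\n>iso2\nACGTAT\n>iso3\nATGTAN\n' > "$aln"
variable_sites "$aln"    # prints C, C and T under the three headers
rm -f "$aln"
```

Column 6 also differs between isolates, but it is dropped because one isolate has an `N` at that position.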

In [None]:
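cd "$HOME/tutorials/results_snippy"

# clean the full alignment: replace non-standard characters so downstream tools can read it
singularity exec "$SNIPPY_IMG" snippy-clean_full_aln core.full.aln > clean.full.aln

# Gubbins: detect and filter recombinant regions (output files are prefixed with "gubbins")
singularity exec "$GUBBINS_IMG" run_gubbins.py -p gubbins clean.full.aln

# snp-sites: keep variable sites, using only columns made up exclusively of A, C, G, T (-c)
singularity exec "$SNIPPY_IMG" snp-sites -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln

# FastTree: build the tree from the nucleotide alignment with the GTR model
"$HOME/tools/FastTree" -gtr -nt clean.core.aln > clean.core.tree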

Finally, we can visualize the tree we just produced by pasting the contents of the file `clean.core.tree` into a visualization tool such as [phylo.io](https://phylo.io/) or [iTOL](https://itol.embl.de/upload.cgi).
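
Before pasting the file into a viewer, a light sanity check can catch a truncated or empty tree. The sketch below only verifies that the text ends with a semicolon and that the parentheses are balanced, as expected for a Newick tree. It is not a full Newick parser, and the function name `check_newick` is hypothetical.

```bash
# Hypothetical helper: succeed if the file looks like a complete Newick tree.
check_newick() {
    local tree open close
    [ -r "$1" ] || return 1
    tree=$(tr -d '[:space:]' < "$1")                     # ignore line breaks and spaces
    open=$(printf '%s' "$tree" | tr -cd '(' | wc -c)     # number of "("
    close=$(printf '%s' "$tree" | tr -cd ')' | wc -c)    # number of ")"
    [ -n "$tree" ] && [ "${tree: -1}" = ";" ] && [ "$open" -eq "$close" ]
}

if check_newick clean.core.tree; then
    echo "clean.core.tree looks like a complete Newick tree"
else
    echo "clean.core.tree is missing, empty, or incomplete" >&2
fi
```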

The data we used may not show clear differences in a phylogenetic tree, as it comes from a single-center outbreak and the samples were collected close to each other. Another possible issue is that the reference genome we used was not an adequate choice, which would also keep the tree from showing significant differences.
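
One way to judge how similar the isolates are is to count the positions at which each pair of aligned sequences differs. The sketch below does this for a toy alignment, ignoring positions where either sequence has a gap or an ambiguous base. The function name `pairwise_snp_distances` and the sequences are made up for illustration; on real data it would be pointed at an alignment such as `clean.core.aln`.

```bash
# Hypothetical helper: print "isolateA<TAB>isolateB<TAB>differences" for every pair.
pairwise_snp_distances() {
    awk '
        /^>/ { name[++n] = substr($0, 2); next }
        { seq[n] = seq[n] toupper($0) }
        END {
            for (a = 1; a < n; a++)
                for (b = a + 1; b <= n; b++) {
                    d = 0
                    len = length(seq[a])
                    for (i = 1; i <= len; i++) {
                        x = substr(seq[a], i, 1)
                        y = substr(seq[b], i, 1)
                        # count only positions where both bases are unambiguous
                        if (x ~ /[ACGT]/ && y ~ /[ACGT]/ && x != y) d++
                    }
                    print name[a] "\t" name[b] "\t" d
                }
        }
    ' "$1"
}

# Toy alignment of three isolates.
aln=$(mktemp)
printf '>iso1\nACGT\n>iso2\nACGA\n>iso3\nTCGA\n' > "$aln"
pairwise_snp_distances "$aln"   # iso1/iso2: 1, iso1/iso3: 2, iso2/iso3: 1
rm -f "$aln"
```

Very small distances between all pairs would be consistent with isolates from a single outbreak.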

In [23]:
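# build a two-column table: results_snippy path for each isolate, then the path to its contigs
cd "$CONTIGS_DIR"
readlink -f *contigs.fa > path.txt

# strip the "_contigs.fa" suffix and point the path to results_snippy instead of contigs
sed 's/_contigs\.fa$//; s/contigs/results_snippy/' path.txt > filenames.txt

# join them and remove temp files
paste filenames.txt path.txt > ~/scripts/snippy-input.tab
rm filenames.txt path.txt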