### Anvio - Contig profiling

https://merenlab.org/2016/06/22/anvio-tutorial-v2/

Contig profiling starts by creating a contigs database from your assembled contigs. While building it, anvi'o calculates k-mer frequencies for each contig (the default k is 4; you can change it with the --kmer-size parameter, but DON'T unless you have a good reason), soft-splits long contigs, and identifies open reading frames (gene calling can be skipped with --skip-gene-calling). The database itself is generated by the `anvi-gen-contigs-database` call in the job script further down.
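
A minimal sketch of the database-generation call on its own, assuming placeholder file and project names (`contigs.fa`, `contigs.db` and `my_project` are hypothetical, not the paths used in this project). The guard only runs anvi'o when the program and the FASTA file are actually present:

```shell
# Hypothetical placeholder names; substitute your own FASTA, project name and output path.
CONTIGS_FASTA="contigs.fa"
CONTIGS_DB="contigs.db"
PROJECT="my_project"

if command -v anvi-gen-contigs-database >/dev/null 2>&1 && [ -s "$CONTIGS_FASTA" ]; then
    # --kmer-size 4 is the default and is written out only to make it visible.
    # Add --skip-gene-calling to leave out open reading frame prediction.
    anvi-gen-contigs-database --contigs-fasta "$CONTIGS_FASTA" \
                              --project-name "$PROJECT" \
                              --output-db-path "$CONTIGS_DB" \
                              --kmer-size 4
    STATUS="ran"
else
    echo "skipped: anvi'o is not on PATH or $CONTIGS_FASTA is missing or empty"
    STATUS="skipped"
fi
```

The real run for this dataset uses the SLURM script further down, with the project paths set at the top of that script.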

Subsequently, you can add further layers of analysis to your contigs database. The following are available:

Augustus + Prodigal gene calls: adds open reading frames to your dataset from gene predictions by Augustus (eukaryotes) and Prodigal (bacteria + archaea). WORKING ON THIS: NOT SURE IF IT WORKS

Hidden Markov Models (HMMs): a widely used type of probabilistic model in bioinformatics software, which is particularly good at detecting homologous genes.

NCBI's Clusters of Orthologous Genes (NCBI COG): annotates your database with gene functions from NCBI COG. Current version: 2020

KOfam metabolism calls: uses the KEGG database to call metabolic genes and estimate the metabolic pathways of your community. The KEGG version currently used is KEGG_build_2020-12-23.

Kaiju Taxonomy calls:

Each of these adds a new layer of information to your dataset, so each may be worth exploring.

In [None]:
conda activate anvio-8

#set up SCG taxonomy (check the installed DIAMOND version first, then pin it to 0.9.14)
diamond --version
conda install diamond=0.9.14
anvi-setup-scg-taxonomy

#download ncbi-cogs 
anvi-setup-ncbi-cogs --num-threads 11

#metabolism database
anvi-setup-kegg-data

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-8

#The contigs database is built from the assembled contigs. It stores information related to your sequences: positions of open reading frames, k-mer frequencies for each contig, functional and taxonomic annotation of genes, etc.
#set parameters:
SAMPLENAME='healthy_2019_mcav'
CONTIGPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping'
CONTIGFILE="healthy_2019_mcav_filtered.contigs-fixed.fsa"
OUTDIR='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/profiles'
BAMPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping/index'


#execute from the healthy_2019_mcav dir

#generate the contigs database

#input is the merged, fixed contig FASTA file created in the previous step; the database is needed for downstream analysis
#default k-mer size is 4
anvi-gen-contigs-database -f $CONTIGPATH/$CONTIGFILE --project-name $SAMPLENAME -o ./anvio/$SAMPLENAME.contigs.db  

#integrate HMMs into the database:
anvi-run-hmms -c ./anvio/$SAMPLENAME.contigs.db --num-threads 6

#this runs NCBI COGs against your contigs.db, integrating gene functions.
anvi-run-ncbi-cogs -c ./anvio/$SAMPLENAME.contigs.db -T 4

#ADD KEGG-KOFAM
anvi-run-kegg-kofams -c ./anvio/$SAMPLENAME.contigs.db \
                     -T 4 #these are the threads that Anvi'O is allowed to use
#ADD CONTIG STATS
anvi-display-contigs-stats ./anvio/$SAMPLENAME.contigs.db --report-as-text --as-markdown -o ./anvio/anvio_stats.txt


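#create sample profiles
#use the contigs database and each sample's BAM file to create single profiles
#--min-percent-identity 95 keeps only reads that match the contigs with at least 95% identity; a minimum contig length can be set with --min-contig-length
#keep parameters consistent across samples so the profiles can be merged into a larger profile
#contigs shorter than 1000 bp were already filtered out with 'anvi-script-reformat-fasta', so no length cutoff is specified here
mkdir -p $OUTDIR
for f in $(cat healthy_2019_MCAV_sampleids);
do
anvi-profile -c ./anvio/$SAMPLENAME.contigs.db  \
            -i $BAMPATH/"$f".bam \
            --min-percent-identity 95 \
            --sample-name "healthy"_"$f" \
            --output-dir $OUTDIR/"$f"
#note: with a bare --output-dir $OUTDIR the output did not seem to land in OUTDIR (it went to BAMPATH). Each single profile needs its own directory, so a per-sample subdirectory is used here; this also matches the $OUTDIR/*/PROFILE.db pattern in anvi-merge below
done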

#merge single sample profiles to one profile
anvi-merge -c ./anvio/"$SAMPLENAME".contigs.db \
            $OUTDIR/*/PROFILE.db \
            -o $OUTDIR/"$SAMPLENAME"_profile_merged



#JOB ID: 21599196
#sample profiles: 21634556
#merge profiles: 21649620
#bash script: anvio_db_profiles & profiles_script & profiles
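
Since the profiling loop reads sample IDs from a list and expects one BAM per ID, a quick pre-flight check can catch missing inputs before the job is submitted. The following is a minimal sketch, assuming each sample has `<id>.bam` plus a `<id>.bam.bai` index in the same directory (the index naming is an assumption; adjust it if your indexes are named differently). The function name `check_bams` is made up for this example:

```shell
# check_bams LIST DIR: report samples in LIST that lack DIR/<id>.bam or its index.
# Returns 0 when every sample has both files, 1 otherwise.
check_bams() {
    list=$1
    dir=$2
    missing=0
    while IFS= read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue            # skip blank lines
        for f in "$dir/$id.bam" "$dir/$id.bam.bai"; do
            if [ ! -s "$f" ]; then
                echo "missing: $f"
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

In the job script this could be called as `check_bams healthy_2019_MCAV_sampleids "$BAMPATH" || exit 1` just before the `anvi-profile` loop.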

#### Results - contig db 
|contigs_db|healthy_2019_mcav|
|:--|:--:|
|Total Length|1186762994|
|Num Contigs|448229|
|Num Contigs > 100 kb|19|
|Num Contigs > 50 kb|47|
|Num Contigs > 20 kb|1501|
|Num Contigs > 10 kb|9799|
|Num Contigs > 5 kb|42749|
|Num Contigs > 2.5 kb|144061|
|Longest Contig|437957|
|Shortest Contig|1000|
|Num Genes (prodigal)|1058453|
|L50|98859|
|L75|224211|
|L90|342407|
|N50|3176|
|N75|1806|
|N90|1265|
|Archaea_76|466|
|Bacteria_71|607|
|Protista_83|184|
|Ribosomal_RNA_12S|0|
|Ribosomal_RNA_16S|7|
|Ribosomal_RNA_18S|9|
|Ribosomal_RNA_23S|13|
|Ribosomal_RNA_28S|11|
|Ribosomal_RNA_5S|0|
|archaea (Archaea_76)|0|
|eukarya (Protista_83)|1|
|bacteria (Bacteria_71)|6|
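
For reading the N50 and L50 rows in the table above: N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. A minimal sketch of that calculation, using toy lengths rather than this dataset (anvi'o's own tie-breaking may differ slightly; the helper names are made up for this example):

```shell
# fasta_lengths FILE: print one sequence length per line for a FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# n50_l50: read contig lengths (one per line) on stdin, print "N50<TAB>L50".
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (2 * cum >= total) { printf "%d\t%d\n", len[i], i; exit }
            }
        }'
}

# Toy lengths (not from this dataset): total 30, so the half-way point is 15.
printf '%s\n' 2 10 4 8 6 | n50_l50    # prints 8 and 2, tab-separated
```

On a real assembly the two helpers chain together as `fasta_lengths contigs.fa | n50_l50`, where `contigs.fa` stands in for the contig FASTA file.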


##### Last time's results for comparison: improvement :)
|contigs_db|mcav_contigs_fixed|
|:--|:--:|
|Total Length|57097694|
|Num Contigs|39414|
|Num Contigs > 100 kb|0|
|Num Contigs > 50 kb|0|
|Num Contigs > 20 kb|2|
|Num Contigs > 10 kb|35|
|Num Contigs > 5 kb|370|
|Num Contigs > 2.5 kb|2156|
|Longest Contig|24857|
|Shortest Contig|1000|
|Num Genes (prodigal)|58029|
|L50|14211|
|L75|25911|
|L90|33821|
|N50|1339|
|N75|1127|
|N90|1044|
|Archaea_76|9|
|Bacteria_71|8|
|Protista_83|5|
|Ribosomal_RNA_12S|0|
|Ribosomal_RNA_16S|0|
|Ribosomal_RNA_18S|2|
|Ribosomal_RNA_23S|0|
|Ribosomal_RNA_28S|2|
|Ribosomal_RNA_5S|0|
|bacteria (Bacteria_71)|0|
|eukarya (Protista_83)|0|
|archaea (Archaea_76)|0|

**Do not use the code below: still troubleshooting**

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 6:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-8

SAMPLENAME='healthy_2019_mcav'
#DIR=//project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/MetaBAT_mcav_bins
BINFILE="/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/das_tool_DASTool_bins/das_tool_DASTool_contig2bin.tsv"
PROFILEDIR='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/profiles'


#convert metabin results to anvio format

#FILES=$(find $DIR/*.fa)
#for f in $FILES; do
# NAME=$(basename $f .fa)
# grep ">" $f | sed 's/>//' | sed -e "s/$/\t$NAME/" | sed 's/\./_/' >> metabins4anvio.txt
#done
#metabin produces fasta files containing contigs of each bin 
#collection artifact requires a txt file that contains list of contigs with their associated bins (2 columns) 

# Import bin results as anvio collection 
# run from anvio dir 
anvi-import-collection $BINFILE \
                       -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db \
                       -c $SAMPLENAME.contigs.db \
                       --contigs-mode \
                       -C "$SAMPLENAME"_collection
#imports the DAS Tool binning results from the 3Binning step as a collection artifact in anvi'o
#--contigs-mode specifies that the input text file lists contig names, not split names
  

#bash script: import_collection
#job ID: 21663018
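
One thing worth checking while troubleshooting the import: the commented-out conversion above already replaces dots in bin names, which suggests anvi'o is picky about the characters it accepts there. The sketch below is a guess at a fix, not a confirmed one. It assumes that bin names should contain only letters, digits and underscores, and rewrites the second column of a two-column contig-to-bin table accordingly. The function name and the toy rows are made up for this example:

```shell
# sanitize_bins IN OUT: copy a two-column contig<TAB>bin table, replacing every
# character in the bin name that is not a letter, digit or underscore with "_".
sanitize_bins() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 2 { gsub(/[^A-Za-z0-9_]/, "_", $2); print $1, $2 }' "$1" > "$2"
}

# Toy input (not real bins) to show the effect.
demo_dir=$(mktemp -d)
printf 'c_000000000001\tbin.1.fa-x\nc_000000000002\tbin_2\n' > "$demo_dir/contig2bin.tsv"
sanitize_bins "$demo_dir/contig2bin.tsv" "$demo_dir/contig2bin_clean.tsv"
cat "$demo_dir/contig2bin_clean.tsv"
```

With the project files this would be `sanitize_bins "$BINFILE" contig2bin_clean.tsv` (output name chosen here), and the cleaned table is then passed to `anvi-import-collection` in place of `$BINFILE`.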

**Troubleshooting code below**
- Trying to figure out clustering in anvio itself
- Skip for now

In [None]:
anvi-cluster-contigs -p $OUTDIR/"$SAMPLENAME"_contigs_merged \
                    -c ./working/$SAMPLENAME.contigs.db \
                    --log-file //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/cluster_log \
                    -C "$SAMPLENAME"_collection \
                    -T 3 \
                    --driver 'metabat2'
#binning: clusters contigs. available drivers: concoct, metabat2, maxbin2, dastool, or binsanity
#note: -p expects the path to a PROFILE.db file

#the error output says that the clustering module isn't fully developed and that it is better to use your own binning software. use the flag '--just-do-it' if you want to try it anyway

#### Results from importing collection:

In [None]:
anvi-migrate --migrate-safely ./working/"$SAMPLENAME".contigs.db
#migrate anvio artifact after updating anvio to version 8 - only do once if updated

#### Interactive mode
https://merenlab.org/2016/02/27/the-anvio-interactive-interface/
- read more into features and go through tutorial 
https://merenlab.org/2015/11/28/visualizing-from-a-server/

In [None]:
# ssh to unity as usual
# launch an interactive job
salloc -p cpu --mem 150G
# in a new local terminal tab, reconnect to unity with a tunnel to the compute node that salloc assigned (cpu022 here)
ssh -L 8080:cpu022:8080 unity

# run this command on the compute node, then open http://localhost:8080 in Chrome on your local machine
anvi-interactive -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db -c $SAMPLENAME.contigs.db -C "$SAMPLENAME"_collection
#the collection info is needed because clustering was skipped in the merging step
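
The node name in the tunnel changes with every `salloc`, so it helps to build the `ssh` command from the node name instead of editing it by hand. A small sketch, with a made-up helper name; the defaults (port 8080, login host alias `unity`) are taken from the cell above:

```shell
# tunnel_cmd NODE [PORT] [LOGIN_HOST]: print the ssh command to run on the local
# machine so that localhost:PORT is forwarded to NODE:PORT through the login host.
tunnel_cmd() {
    node=$1
    port=${2:-8080}
    login=${3:-unity}
    echo "ssh -L ${port}:${node}:${port} ${login}"
}

tunnel_cmd cpu022    # prints: ssh -L 8080:cpu022:8080 unity
```

Running `hostname` inside the interactive job gives the node name to pass in.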


In [None]:
ssh -L 8080:localhost:8080 unity
#connect to unity this way from local terminal if you want to use the interactive browser below
# ex: ssh -L 8080:localhost:8080 meren@server.university.edu

# now on public server: https://anvi-server.org/

In [None]:
# summarize bins - creates html output that summarizes completion/coverage of contigs in bins 
# takes long time to run 
anvi-summarize -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db -c $SAMPLENAME.contigs.db -o "$SAMPLENAME"-SUMMARY -C "$SAMPLENAME"_collection

#### Taxonomy 
https://merenlab.org/2019/10/08/anvio-scg-taxonomy/
- uses single-copy core genes (SCGs) together with the taxonomy, as defined by the GTDB, of the genomes those genes come from

In [None]:
module load miniconda/22.11.1-1
conda activate anvio-8

In [None]:
SAMPLENAME='healthy_2019_mcav'
PROFILEDIR='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/profiles'
OUTDIR='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/anvio/taxonomy'

In [None]:
anvi-run-scg-taxonomy -c "$SAMPLENAME".contigs.db
# adding this to the contig db creation step, but here it was done separately 
# default percent identity: 90

In [None]:
anvi-estimate-scg-taxonomy -c "$SAMPLENAME".contigs.db \
                           -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db \
                           --num-threads 12 \
                           --metagenome-mode \
                           --compute-scg-coverages \
                           --output-file $OUTDIR/"$SAMPLENAME"-taxa-abundance.tsv
#taxa matching from the assembled metagenome: each SCG sequence found in the contigs is matched against the SCG taxonomy database and reported with its percent identity
#with a profile db and --compute-scg-coverages, the table also reports the coverage of each matched SCG in every sample, which can be read like a rough abundance table (similar to an OTU table)
#--compute-scg-coverages looks up the coverage of those SCGs in the profile database given with -p; it is SCG coverage, not bin coverage
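
To get a quick overview of the resulting table, the hits can be tallied by one taxonomic column. The sketch below works on any tab-separated file with a header row. The column name `t_genus` and the toy rows are assumptions made for this example, so check the header of your own output file first:

```shell
# tally_column FILE COLUMN: count how often each value occurs in the named
# column of a tab-separated file with a header row; most frequent first.
tally_column() {
    awk -v col="$2" 'BEGIN { FS = "\t" }
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "no column named " col > "/dev/stderr"; exit 1 }
            next
        }
        { n[$c]++ }
        END { for (v in n) print n[v] "\t" v }' "$1" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Toy table (made-up rows, not results from this project).
toy=$(mktemp)
printf 'scg_name\tt_genus\nRibosomal_S2\tGenusA\nRibosomal_S3\tGenusA\nRibosomal_S6\tGenusB\n' > "$toy"
tally_column "$toy" t_genus
```

For the real output this would be `tally_column $OUTDIR/"$SAMPLENAME"-taxa-abundance.tsv <column>`, with `<column>` replaced by whichever taxonomy column the file actually contains.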

In [None]:
anvi-show-collections-and-bins -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db

In [None]:
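#calculate taxonomy of MAGs
#--metagenome-mode is left out here: with -C, anvi'o estimates one taxonomy per bin instead of treating the contigs db as a single metagenome
#the output goes to a separate file (name chosen here) so the metagenome-mode table above is not overwritten
anvi-estimate-scg-taxonomy -c "$SAMPLENAME".contigs.db \
                           -p $PROFILEDIR/"$SAMPLENAME"_profile_merged/PROFILE.db \
                           -C "$SAMPLENAME"_collection \
                           --num-threads 12 \
                           --compute-scg-coverages \
                           --output-file $OUTDIR/"$SAMPLENAME"-bin-taxonomy.tsv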


### Results

In [None]:
SAMPLENAME=mcav
anvi-db-info working/mcav_assembly_redo/$SAMPLENAME.contigs.db

In [None]:
DB Info (no touch)
===============================================
Database Path ................................: working/mcav_assembly_redo/mcav.contigs.db
description ..................................: [Not found, but it's OK]
db_type ......................................: contigs (variant: unknown)
version ......................................: 21


DB Info (no touch also)
===============================================
project_name .................................: mcav
contigs_db_hash ..............................: hash2fac8c34
split_length .................................: 20000
kmer_size ....................................: 4
num_contigs ..................................: 839850
total_length .................................: 435829081
num_splits ...................................: 839850
gene_level_taxonomy_source ...................: None
genes_are_called .............................: 1
external_gene_calls ..........................: 0
external_gene_amino_acid_seqs ................: 0
skip_predict_frame ...........................: 0
splits_consider_gene_calls ...................: 1
scg_taxonomy_was_run .........................: 0
scg_taxonomy_database_version ................: None
trna_taxonomy_was_run ........................: 0
trna_taxonomy_database_version ...............: None
creation_date ................................: 1701991494.31667
gene_function_sources ........................: KEGG_Class,KEGG_Module,COG20_CATEGORY,COG20_FUNCTION,COG20_PATHWAY,KOfam,KEGG_BRITE
modules_db_hash ..............................: a2b5bde358bb

* Please remember that it is never a good idea to change these values. But in some
  cases it may be absolutely necessary to update something here, and a
  programmer may ask you to run this program and do it. But even then, you
  should be extremely careful.


AVAILABLE GENE CALLERS
===============================================
* 'prodigal' (707,308 gene calls)
* 'Ribosomal_RNA_28S' (2 gene calls)
* 'Ribosomal_RNA_18S' (2 gene calls)


AVAILABLE FUNCTIONAL ANNOTATION SOURCES
===============================================
* COG20_CATEGORY (4,796 annotations)
* COG20_FUNCTION (4,796 annotations)
* COG20_PATHWAY (1,268 annotations)
* KEGG_BRITE (2,346 annotations)
* KEGG_Class (217 annotations)
* KEGG_Module (217 annotations)
* KOfam (2,362 annotations)


AVAILABLE HMM SOURCES
===============================================
* 'Archaea_76' (76 models with 71 hits)
* 'Bacteria_71' (71 models with 62 hits)
* 'Protista_83' (83 models with 14 hits)
* 'Ribosomal_RNA_12S' (1 model with 0 hits)
* 'Ribosomal_RNA_16S' (3 models with 0 hits)
* 'Ribosomal_RNA_18S' (1 model with 2 hits)
* 'Ribosomal_RNA_23S' (2 models with 0 hits)
* 'Ribosomal_RNA_28S' (1 model with 2 hits)
* 'Ribosomal_RNA_5S' (5 models with 0 hits)
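
The report above shows `scg_taxonomy_was_run` as 0, which means `anvi-run-scg-taxonomy` still has to be run on this database before taxonomy can be estimated from it. A small sketch for pulling a single value out of the `anvi-db-info` text, so this kind of check can be scripted. The function name is made up for this example, and it only handles single-word keys:

```shell
# db_info_value KEY: read anvi-db-info text on stdin and print the value from
# the first "KEY ......: value" line. Works for single-word keys only.
db_info_value() {
    awk -v key="$1" '
        $1 == key {
            sub(/^[^:]*: /, "")
            print
            exit
        }'
}

# Two lines copied from the report above serve as sample input.
was_run=$(db_info_value scg_taxonomy_was_run <<'EOF'
kmer_size ....................................: 4
scg_taxonomy_was_run .........................: 0
EOF
)
if [ "$was_run" = "0" ]; then
    echo "run anvi-run-scg-taxonomy before anvi-estimate-scg-taxonomy"
fi
```

On a live database the input would come from `anvi-db-info <contigs db> | db_info_value scg_taxonomy_was_run`. Depending on the anvi'o version the report may be written to stderr, in which case `2>&1` is needed before the pipe.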

In [None]:
# DBPATH must be set to the directory that holds $SAMPLENAME.contigs.db (working/mcav_assembly_redo in the anvi-db-info call above)
OUTDIR=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/mcav_assembly_redo/profiles
TAXPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/mcav_assembly_redo/taxonomy
mkdir -p $TAXPATH
#run this first: it adds the SCG taxonomy hits to the contigs db (scg_taxonomy_was_run is still 0 in the db info above)
anvi-run-scg-taxonomy -c $DBPATH/"$SAMPLENAME".contigs.db
#taxa matching from the assembled metagenome: matches the SCGs found in the contigs to SCG taxa and gives percent identity
anvi-estimate-scg-taxonomy -c $DBPATH/$SAMPLENAME.contigs.db -p $OUTDIR/"$SAMPLENAME"_profile_merged/PROFILE.db -T 3 --metagenome-mode --compute-scg-coverages -o $TAXPATH/"$SAMPLENAME"-scg-taxonomy.tsv

In [None]:
#estimate metabolism using the KEGG KOfam annotations in the contigs db
anvi-estimate-metabolism -c $DBPATH/$SAMPLENAME.contigs.db --metagenome-mode