### Anvi'o - Contig profiling

Contig profiling creates a database of your contigs. It calculates k-mer frequencies for each contig (the default k is 4; you can change it with the --kmer-size parameter, but don't unless you have a good reason), soft-splits long contigs, and identifies open reading frames (gene calling can be skipped with --skip-gene-calling). The first job script below generates the database.

Subsequently, you can add various layers of analysis to your contigs database. The following options are available:

Augustus + Prodigal gene calls: adds open reading frames to your dataset from gene calls made by Augustus (eukaryotes) and Prodigal (bacteria and archaea). Work in progress: not yet confirmed to work.

Hidden Markov Models (HMMs): a widely used class of probabilistic models in bioinformatics that is well suited to homology detection. anvi-run-hmms uses them to find known gene families, such as single-copy core genes, in your contigs.

NCBI's Clusters of Orthologous Genes (NCBI COGs): annotates your database with gene functions from NCBI COGs. Current version: 2020.

KoFAM metabolism calls: uses the KEGG database to call metabolic genes and estimate the metabolic pathways of your community. The KEGG version currently used is KEGG_build_2020-12-23.

Kaiju Taxonomy calls:

Each of these adds a new layer of information to your dataset, so each may be worth exploring.
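
To make the k-mer step concrete, the sketch below counts overlapping 4-mers in a toy sequence using awk. It only illustrates the idea of a tetranucleotide frequency table; it is not how anvi'o computes or stores these values, and the function name count_kmers and the example sequence are made up for this illustration.

```shell
# Hypothetical helper (not part of anvi'o): count overlapping k-mers in one sequence.
# Usage: count_kmers SEQUENCE [K]   (K defaults to 4, the anvi'o default k-mer size)
count_kmers() {
    printf '%s\n' "$1" | awk -v k="${2:-4}" '
        {
            seq = toupper($0)
            # slide a window of width k along the sequence, one base at a time
            for (i = 1; i + k - 1 <= length(seq); i++)
                counts[substr(seq, i, k)]++
        }
        END {
            for (kmer in counts)
                printf "%s\t%d\n", kmer, counts[kmer]
        }
    ' | sort
}

# Toy sequence, invented for this example
count_kmers "ACGTACGTAC"
```

For the toy sequence this prints four distinct 4-mers, three of which occur twice.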

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-7.1

#Contig database from assembled genomes. stores information related to your sequences: positions of open reading frames, k-mer frequencies for each contig, functional and taxonomic annotation of genes, etc.
#set parameters:
SAMPLENAME=mcav
CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/mcav_assembly3
CONTIGFILE=mcav.contigs-fixed.fa

#generate the contigs database:
#default k-mer size is 4
anvi-gen-contigs-database -f "$CONTIGPATH/$CONTIGFILE" --project-name "$SAMPLENAME" -o "./working/$SAMPLENAME.contigs.db"

#integrate HMMs into the database:
anvi-run-hmms -c ./working/$SAMPLENAME.contigs.db --num-threads 6

#this runs NCBI COGs against your contigs.db, integrating gene functions.
anvi-run-ncbi-cogs -c ./working/$SAMPLENAME.contigs.db 

#ADD KEGG-KOFAM
anvi-run-kegg-kofams -c ./working/$SAMPLENAME.contigs.db \
                     -T 4 #these are the threads that Anvi'O is allowed to use
#ADD CONTIG STATS
anvi-display-contigs-stats ./working/$SAMPLENAME.contigs.db --report-as-text --as-markdown -o ./results/anvio_stats

#generates contig database from the merged, fixed contig fasta file created in the previous step; needed for downstream analysis
#JOB ID: 13297970
#bash script: mcav_db.txt
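
Before submitting a long job it can help to confirm that the input FASTA exists and that the output folders are in place, since the job script writes to ./working and ./results. A minimal sketch, run here against a throwaway directory; require_file is a hypothetical helper, not part of anvi'o:

```shell
# Hypothetical pre-flight helper: fail early if an input file is missing or empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "ERROR: missing or empty file: $1" >&2
        return 1
    fi
}

# Demo in a temporary directory so the sketch runs anywhere.
demo_dir=$(mktemp -d)
printf '>c_000000000001\nACGTACGT\n' > "$demo_dir/mcav.contigs-fixed.fa"

require_file "$demo_dir/mcav.contigs-fixed.fa" && echo "input OK"

# create the output locations the job script expects (no error if they already exist)
mkdir -p "$demo_dir/working" "$demo_dir/results"

# a missing file is reported before any anvi'o command would run
require_file "$demo_dir/does-not-exist.fa" || echo "would stop before calling anvi'o"

rm -rf "$demo_dir"
```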

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 6:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-7.1
SAMPLENAME=mcav
#directory holding the per-sample profiles
OUTDIR=//project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/index

#merge single sample profiles to one profile
#OUTDIR is an absolute path, so it is used directly instead of being appended to ./working
anvi-merge -c "./working/$SAMPLENAME.contigs.db" \
           "$OUTDIR"/*/PROFILE.db \
           -o "$OUTDIR/${SAMPLENAME}_contigs_merged"

#binning: clusters contigs. can use various drivers: 'concoct, metabat2, maxbin2, dastool, or binsanity.'
#-p points at the PROFILE.db inside the merged directory; -C is a collection name, not a path
anvi-cluster-contigs -p "$OUTDIR/${SAMPLENAME}_contigs_merged/PROFILE.db" \
                     -c "./working/$SAMPLENAME.contigs.db" \
                     -C collection \
                     --driver concoct
#separated individual sample profiling and merging to different bash scripts just for troubleshooting


#splitting code up to make sure it works 
#bash script: binning
#job ID: 14815575
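
When a variable is followed directly by an underscore, as in the merged output directory name, bash needs braces to know where the variable name ends. A minimal demonstration (nothing here calls anvi'o; the ./working prefix is only for illustration):

```shell
set +u   # the demo relies on an unset variable expanding to an empty string
unset SAMPLENAME_contigs_merged
SAMPLENAME=mcav

# Without braces, the shell looks up a variable named SAMPLENAME_contigs_merged,
# which is unset, so the expansion is empty.
unbraced="./working/$SAMPLENAME_contigs_merged"

# With braces, the variable name ends at the closing brace.
braced="./working/${SAMPLENAME}_contigs_merged"

echo "unbraced: $unbraced"
echo "braced:   $braced"
```

The first line shows only the ./working/ prefix, while the second shows the intended directory name.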

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 6:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o //project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-7.1

#set parameters
SAMPLENAME=mcav
#mkdir=//project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/$SAMPLENAME_profiles
samplepath=//project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts

OUTDIR=//project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/index

for f in T1_12_2022 T1_13_2022 T1_16_2019 T1_20_2019 T1_24_2019 T1_40_2022 T1_57_2022 T1_70_2022 T2_10_2022 T2_16_2019 T3_13_2022 T3_14_2019 T3_15_2019 T3_19_2022 T3_1_2019 T3_40_2022 T3_48_2022 T3_49_2022 T3_51_2022 T3_60_2022 T3_8_2019 T3_9_2019
do
#each sample gets its own output directory so that anvi-merge can find $OUTDIR/*/PROFILE.db
anvi-profile -c "./working/$SAMPLENAME.contigs.db" \
             -i "./working/index/$f.bam" \
             --min-percent-identity 95 \
             --sample-name "$f" \
             --output-dir "$OUTDIR/$f"
#use contig database and sample bam files to create single profiles. --min-contig-length sets a minimum contig length; --min-percent-identity 95 ignores reads with less than 95% identity to the contigs they map to
#keep parameters consistent in order to merge to larger profile
#already filtered out contigs shorter than 1000bp in 'anvi-script-reformat-fasta', so not adding specification.
done

#binning: clusters contigs. can use various drivers: 'concoct, metabat2, maxbin2, dastool, or binsanity.'
#separated individual sample profiling and merging to different bash scripts just for troubleshooting
#bash script: profiles.txt
#JOB ID: 13298522
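
The anvi-merge script earlier in this section collects profiles with a */PROFILE.db glob, which only works if every sample was profiled into its own directory. The dry-run sketch below prints the anvi-profile command for each sample with a per-sample output directory, without running anything. The helper name profile_cmd and the relative OUTDIR are assumptions for illustration; the flags are the ones used in the profiling script.

```shell
SAMPLENAME=mcav
OUTDIR=./working/index   # assumption: a relative path, used only for this illustration

# Hypothetical helper: print (do not run) the anvi-profile command for one sample.
profile_cmd() {
    printf 'anvi-profile -c ./working/%s.contigs.db -i ./working/index/%s.bam --min-percent-identity 95 --sample-name %s --output-dir %s/%s\n' \
        "$SAMPLENAME" "$1" "$1" "$OUTDIR" "$1"
}

# Two of the sample names from the profiling loop
for f in T1_12_2022 T1_13_2022; do
    profile_cmd "$f"
done
```

Reviewing the printed commands before submitting the job makes it easy to spot a missing line continuation or a shared output directory.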

### Taxonomy
https://merenlab.org/2019/10/08/anvio-scg-taxonomy/
- uses single-copy core genes (SCGs) and the taxonomy, as defined by the GTDB, of the genomes these genes come from

In [None]:
conda activate anvio-7.1

diamond --version
conda install diamond=0.9.14
anvi-setup-scg-taxonomy

#download ncbi-cogs 
anvi-setup-ncbi-cogs --num-threads 11
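
If you want to avoid reinstalling diamond unnecessarily, a small guard can compare the installed version first. This sketch assumes diamond --version prints a line ending in the version number (for example "diamond version 0.9.14"); diamond_version_is is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the last word of the reported
# version string equals the wanted version number.
# Usage: diamond_version_is WANTED "REPORTED STRING"
diamond_version_is() {
    [ "${2##* }" = "$1" ]
}

if command -v diamond >/dev/null 2>&1; then
    if diamond_version_is 0.9.14 "$(diamond --version)"; then
        echo "diamond 0.9.14 already installed"
    else
        echo "different diamond version found; consider: conda install diamond=0.9.14"
    fi
else
    echo "diamond not found on PATH"
fi
```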

In [None]:
mkdir -p ./results/taxonomy
SAMPLENAME=mcav

In [None]:
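#SCG taxonomy has to be added to the contigs database before it can be estimated
anvi-run-scg-taxonomy -c ./working/$SAMPLENAME.contigs.db --num-parallel-processes 3 --num-threads 3
anvi-estimate-scg-taxonomy -c ./working/$SAMPLENAME.contigs.db --num-threads 3 --metagenome-mode --output-file ./results/taxonomy/scg-taxonomy.tsv
#taxa matching from assembled metagenome - matches SCGs found in the contigs to GTDB taxa and gives percent identity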

In [None]:
anvi-estimate-scg-taxonomy -c ./working/$SAMPLENAME.contigs.db \
                           -p PROFILE.db \
                           --num-threads 3 \
                           --metagenome-mode \
                           --compute-scg-coverages \
                           --output-file ./results/taxonomy/taxa-abundance.tsv
#difference here is that it will calculate relative abundances of the matched SCG taxa across all samples as well as percent identity (like an OTU table)