MetaQUAST
- quality assessment of the metagenomic assembly (contigs); no reference genome was supplied here

In [None]:
metaquast mcav.contigs.fa -o quast_output

https://quast.sourceforge.net/docs/manual.html#sec1
How to interpret the quality results?
- check how many large contigs you have (>1000 bp)
- the contigs were not mapped to a reference genome, so no reference-based metrics are reported
- right now this is mainly helpful for seeing the length and quality of the contigs; maybe reassess after mapping the reads back to the metagenome assembly?
Cite metaquast: https://quast.sourceforge.net/publications.html
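
As a quick cross-check on the "large contigs" number in the report, the count can also be taken straight from the FASTA file. This is a minimal sketch using only awk; the function name `count_long_contigs` is made up for illustration, and it counts sequences at or above the cutoff.

```shell
# count_long_contigs FASTA [MIN_LEN]
# Hypothetical helper (not part of QUAST): prints how many sequences in a
# FASTA file are at least MIN_LEN bp long (default 1000).
count_long_contigs() {
    awk -v min="${2:-1000}" '
        # a new defline closes the previous record
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
        # sequence lines: strip whitespace, add up the length
        { gsub(/[ \t\r]/, ""); len += length($0) }
        # close the last record and print the total
        END { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}

# usage (file name taken from the metaquast call above):
# count_long_contigs mcav.contigs.fa 1000
```

Multi-line sequences are handled, because lengths are summed until the next defline.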

Mapping
- aligning the reads back to the assembled metagenome contigs

Anvio: used in various ways throughout this pipeline

In [None]:
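conda create -n anvio-7.1
#dir=/home/brooke_sienkiewicz_student_uml_edu/.conda/envs/anvio-7.1
conda activate anvio-7.1
#anvi'o itself has to be installed into this environment (see the anvi'o installation docs) before the next command will work
anvi-setup-ncbi-cogs --num-threads 11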

Anvio
https://merenlab.org/2016/06/22/anvio-tutorial-v2/#anvi-profile
- reformats the fasta file and drops contigs shorter than 1000 bp
- aligns the reads to the contigs, then sorts and indexes the alignments and stores them in bam files
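
Before submitting a long mapping job, it can help to confirm that every sample has both of its read files. The sketch below assumes the file naming used in this pipeline (`<sample>_MCAV_R1_001_val_1.fq` and `<sample>_MCAV_R2_001_val_2.fq`); `check_pairs` is a hypothetical helper name.

```shell
# check_pairs READSPATH SAMPLE...
# Hypothetical pre-flight check: prints the number of missing or empty read
# files and returns non-zero if any are missing.
check_pairs() {
    local reads=$1 missing=0 f r
    shift
    for f in "$@"; do
        # file name pattern assumed from the mapping script in this notebook
        for r in "${f}_MCAV_R1_001_val_1.fq" "${f}_MCAV_R2_001_val_2.fq"; do
            if [ ! -s "$reads/$r" ]; then
                echo "missing or empty: $reads/$r" >&2
                missing=$((missing + 1))
            fi
        done
    done
    echo "$missing"
    [ "$missing" -eq 0 ]
}

# usage:
# check_pairs "$READSPATH" T1_12_2022 T1_13_2022 || exit 1
```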

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-7.1

SAMPLENAME=mcav
READSPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/trimmed
CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/mcav_assembly3
CONTIGFILE=mcav.contigs.fa

anvi-script-reformat-fasta "$CONTIGPATH/$CONTIGFILE" -o "$CONTIGPATH/$SAMPLENAME.contigs-fixed.fa" -l 1000 --simplify-names
#simplifies the deflines for later steps and drops contigs shorter than 1000 bp

FIXEDCON="$SAMPLENAME".contigs-fixed.fa

mkdir -p ./working ./results/index
#makes sure the output directories used below exist

bowtie2-build "$CONTIGPATH/$FIXEDCON" ./working/"$SAMPLENAME"_contigs
#this builds an index of your contigs, which only needs to happen once

for f in T1_12_2022 T1_13_2022 T1_16_2019 T1_20_2019 T1_24_2019 T1_40_2022 T1_57_2022 T1_70_2022 T2_10_2022 T2_16_2019 T3_13_2022 T3_14_2019 T3_15_2019 T3_19_2022 T3_1_2019 T3_40_2022 T3_48_2022 T3_49_2022 T3_51_2022 T3_60_2022 T3_8_2019 T3_9_2019
do
bowtie2 --threads 11 -x ./working/"$SAMPLENAME"_contigs -1 $READSPATH/"$f"_MCAV_R1_001_val_1.fq -2 $READSPATH/"$f"_MCAV_R2_001_val_2.fq -S ./working/"$f".sam
#this creates an alignment of your reads to your contigs and collects that in a .sam file

samtools view -F 4 -b -S ./working/"$f".sam -o ./working/"$f"-RAW.bam
#this converts your sam file to a bam file (-F 4 drops the unmapped reads), but it is neither sorted nor indexed, so we use an Anvi'O script to do so:

anvi-init-bam ./working/"$f"-RAW.bam -o ./results/index/"$f".bam
#sorts and indexes your bam file

#rm ./working/"$f"-RAW.bam
#the original rm pointed to ../working (wrong directory), so the removal failed; going to keep the raw bam files anyway
done
#aligns each sample's reads to the contigs and generates a sorted, indexed BAM file per sample; these BAM files are needed for downstream analysis

#bash script: mapping.txt
#JOB ID: 13289175
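
To get a quick feel for how well each sample mapped, the alignment summaries can be pulled out of the job log. This sketch assumes that bowtie2 wrote its usual "overall alignment rate" summary line for each sample and that SLURM captured it in the `slurm-%j.out` file; `alignment_rates` is a hypothetical helper name. The rates come out in the order the samples were processed in the loop.

```shell
# alignment_rates LOGFILE
# Hypothetical helper: prints the percentage from every
# "NN.NN% overall alignment rate" line found in a bowtie2 log.
alignment_rates() {
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# usage (log name assumed from the job ID and the -o pattern above):
# alignment_rates slurm-13289175.out
```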

Anvio - Contig profiling
Contig profiling creates a database of your contigs. It calculates k-mer frequencies for each contig (the default k is 4; you can change it with the --kmer-size parameter, but DON'T unless you have a good reason), soft-splits long contigs, and identifies open reading frames (this can be skipped using --skip-gene-calling). The job script below, after the list of annotations, generates the database.
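
To make the k-mer idea concrete, here is a toy sketch that counts overlapping k-mers in a single sequence. It only illustrates what a k-mer count is; it is not how anvi'o computes its frequencies internally (for example, reverse complements may be handled differently), and `kmer_counts` is a made-up name.

```shell
# kmer_counts SEQ [K]
# Toy illustration: counts every overlapping k-mer in one sequence
# (default K=4) and prints "KMER COUNT" lines sorted by k-mer.
kmer_counts() {
    printf '%s\n' "$1" | awk -v k="${2:-4}" '
        {
            seq = toupper($0)
            # slide a window of width k along the sequence
            for (i = 1; i <= length(seq) - k + 1; i++) count[substr(seq, i, k)]++
        }
        END { for (m in count) print m, count[m] }
    ' | sort
}

# usage:
# kmer_counts ACGTACGT 4
```

For the 8 bp sequence in the usage line there are five windows of width 4, and ACGT is the only k-mer that occurs twice.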

Subsequently, you can add various layers of analysis to your contigs database. The following are available:

Augustus + Prodigal gene calls: adds open reading frames to your database using gene calls from Augustus (eukaryotes) and Prodigal (bacteria + archaea). WORKING ON THIS: NOT SURE IF IT WORKS

Hidden Markov Models (HMMs): a widely used type of prediction model in bioinformatics software, which can offer great advantages in homology detection.

NCBI's Clusters of Orthologous Genes (NCBI COGs): this allows you to annotate your database with gene functions from NCBI COGs. Current version: 2020

KoFAM Metabolism calls: uses the KEGG database to call metabolic genes and estimate the metabolic pathways of your community. Currently used KEGG version is KEGG_build_2020-12-23.

Kaiju Taxonomy calls:

Each of these adds a new layer of information to your dataset, so they may be very interesting to explore.

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-7.1

#Contigs database built from the assembled contigs. It stores information related to your sequences: positions of open reading frames, k-mer frequencies for each contig, functional and taxonomic annotation of genes, etc.
#set parameters:
SAMPLENAME=mcav
CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/mcav_assembly3
CONTIGFILE=mcav.contigs-fixed.fa

#generate the contigs database:
#default k-mer size is 4
anvi-gen-contigs-database -f "$CONTIGPATH/$CONTIGFILE" --project-name "$SAMPLENAME" -o ./working/"$SAMPLENAME".contigs.db

#integrate HMMs into the database:
anvi-run-hmms -c ./working/$SAMPLENAME.contigs.db --num-threads 6

#this runs NCBI COGs against your contigs.db, integrating gene functions.
anvi-run-ncbi-cogs -c ./working/$SAMPLENAME.contigs.db 

#ADD KEGG-KOFAM (the KOfam database has to be set up once beforehand with anvi-setup-kegg-kofams)
anvi-run-kegg-kofams -c ./working/$SAMPLENAME.contigs.db \
                     -T 4 #number of threads that Anvi'O is allowed to use
#ADD CONTIG STATS
anvi-display-contigs-stats ./working/$SAMPLENAME.contigs.db --report-as-text --as-markdown -o ./results/anvio_stats

#generates the contigs database from the merged, fixed contig fasta file created in the previous step; needed for downstream analysis
#JOB ID: 13297970
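
In the script above, each annotation command still runs even if an earlier one failed (for example, if the contigs database was never written). One way to stop at the first failure and leave a readable trail in the SLURM log is a small wrapper like the sketch below. `run_step` is a hypothetical helper, not an anvi'o command.

```shell
# run_step LABEL CMD [ARGS...]
# Hypothetical wrapper: logs the start and end of one pipeline step to stderr
# and returns the command's exit status so the caller can stop on failure.
run_step() {
    local label=$1 status
    shift
    echo "[$(date '+%F %T')] START $label" >&2
    if "$@"; then
        echo "[$(date '+%F %T')] DONE  $label" >&2
    else
        status=$?
        echo "[$(date '+%F %T')] FAIL  $label (exit $status)" >&2
        return "$status"
    fi
}

# usage inside the job script (commands as in the script above):
# run_step "contigs db" anvi-gen-contigs-database -f "$CONTIGPATH/$CONTIGFILE" --project-name "$SAMPLENAME" -o ./working/"$SAMPLENAME".contigs.db || exit 1
# run_step "hmms" anvi-run-hmms -c ./working/"$SAMPLENAME".contigs.db --num-threads 6 || exit 1
```

The `|| exit 1` after each call is what actually stops the job; the wrapper itself only reports and passes the status along.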