# WTS sequencing data-processing

***

**Index**

1. QC, trimming, and filtering
   - Quality trimming (Trimmomatic)
   - Re-check quality (FastQC; MultiQC)
   - Remove rRNA with SortMeRNA
   - Host filtering
   - Check library sizes for each step
2. Assembly
   - Co-assembly
   - Mini co-assemblies
   - Per-sample assemblies
   - Assessing assemblies
3. Identify putative viral contigs
   - VIBRANT
   - VirSorter
   - BLAST
4. Summary table of putative viral contig methods
5. Generate per-sample dereplicated fasta file of putative viral hits
6. Quality assessment with checkV
7. Dereplication of contigs across all samples
8. Recreate summary table for dereplicated vOTUs
   - BLAST alignments
   - checkV
   - Filtering of vOTUs
9. Read mapping to obtain vOTU coverages


**Data**

RNA transcriptomic data were obtained from 13 birds, comprising 9 faecal samples and 4 swab samples.

Of the faecal samples:
- S1-S7 are cloacitis birds Bravo, Alice, Taeatanga, Cyndy, Merv, Bella and Sinbad
- S8 and S9 are control birds Nora and Scratch

Of the swab samples:
- S10 and S11 are cloacitis birds Uri and Hikoi
- S12 and S13 are control birds Mukeke and Atareta

## Trimmomatic

Raw data were concatenated in the first step of the WGS pipeline. RNA samples S1 - S13 were then moved to the WTS folder under `0.Raw_concat`.

**Trimmomatic: run**

`sbatch scripts/wts_1_trimmomatic.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_1_trimmomatic
#SBATCH --time 00:05:00
#SBATCH --mem=12GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH -e slurm_output/wts_1_trimmomatic_%a_%j.err
#SBATCH -o slurm_output/wts_1_trimmomatic_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/

# Load module(s)
module purge
module load Trimmomatic/0.39-Java-1.8.0_144                

# Make output directory 
mkdir -p 1.Qual_filtered_trimmomatic/

# Set up variables for input path and output path
inpath=0.Raw_concat
outpath=1.Qual_filtered_trimmomatic

# Make adapter file if not already created
if [ ! -f iua.fna ]; then
    echo ">FastQC_adapter" > iua.fna
    echo "AGATCGGAAGAG" >> iua.fna
fi
           
# Quality filter and trim 
srun trimmomatic PE -threads 10 -phred33 \
${inpath}/S${SLURM_ARRAY_TASK_ID}_R1.fastq.gz ${inpath}/S${SLURM_ARRAY_TASK_ID}_R2.fastq.gz \
${outpath}/S${SLURM_ARRAY_TASK_ID}_R1.fastq S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq \
${outpath}/S${SLURM_ARRAY_TASK_ID}_R2.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq \
ILLUMINACLIP:iua.fna:1:25:7 CROP:115 SLIDINGWINDOW:4:30 MINLEN:50

# Tidy up the singleton reads
cat S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq \
> ${outpath}/S${SLURM_ARRAY_TASK_ID}_single.fastq

rm S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq

***

## FastQC analysis of trimmed reads

`sbatch scripts/wts_1_qc_fastqc.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_1_qc_fastqc
#SBATCH --time 01:00:00
#SBATCH --mem 1GB
#SBATCH --array=1-13
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 2
#SBATCH -e slurm_output/wts_1_qc_fastqc_%a_%j.err
#SBATCH -o slurm_output/wts_1_qc_fastqc_%a_%j.out

# Set up working directories
cd <working_directory>/WTS/
mkdir -p 1.Qual_filtered_trimmomatic/fastqc/

# load modules
module purge
module load FastQC/0.11.9
module load MultiQC/1.9-gimkl-2020a-Python-3.8.2

# Run fastqc on each sample
srun fastqc \
-o 1.Qual_filtered_trimmomatic/fastqc/ \
1.Qual_filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R1.fastq 1.Qual_filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R2.fastq

**MultiQC**

`sbatch scripts/wts_1_qc_multiqc.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_1_qc_multiqc
#SBATCH --time 00:10:00
#SBATCH --mem 1GB
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 2
#SBATCH -e slurm_output/wts_1_qc_multiqc_%j.err
#SBATCH -o slurm_output/wts_1_qc_multiqc_%j.out

# Set up working directories
cd <working_directory>/WTS/

# load modules
module purge
module load FastQC/0.11.9
module load MultiQC/1.9-gimkl-2020a-Python-3.8.2

# Run multiqc to generate report for all samples
srun multiqc -f \
-o 1.Qual_filtered_trimmomatic/fastqc/ \
1.Qual_filtered_trimmomatic/fastqc/

***

## SortMeRNA: filter out rRNA

Remove ribosomal RNA from transcripts.

**sortmerna v2.1 - Interleave paired files**

In [None]:
# Change to working directory
cd <working_directory>/WTS/

# Load BBMap
module purge
module load BBMap/38.81-gimkl-2020a

# Interleave paired files
for i in {1..13}; do
    reformat.sh \
    in=1.Qual_filtered_trimmomatic/S${i}_R#.fastq \
    out=1.Qual_filtered_trimmomatic/S${i}_interleaved.fastq
done

**SortMeRNA v2.1: Run**

Paired data (interleaved files)

`sbatch scripts/wts_1_sortmerna_interleaved.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_1_sortmerna_interleaved
#SBATCH --time 04:00:00
#SBATCH --mem=2GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e slurm_output/wts_1_sortmerna_interleaved_%a_%j.err
#SBATCH -o slurm_output/wts_1_sortmerna_interleaved_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/
mkdir -p 1.rRNA_filtered/{aligned,unaligned}
mkdir -p 1.rRNA_filtered/S${SLURM_ARRAY_TASK_ID}_interleaved

cd 1.rRNA_filtered/S${SLURM_ARRAY_TASK_ID}_interleaved

# Load modules
module purge
module load SortMeRNA/2.1-gimkl-2017a

DATABASES_PATH='/nesi/project/uoa02469/Software/sortmerna-2.1'
INPUT_DIR='<working_directory>'

# Run sortmerna - paired files
sortmerna \
--ref ${DATABASES_PATH}/rRNA_databases/silva-bac-16s-id90.fasta,${DATABASES_PATH}/index/silva-bac-16s-db:\
${DATABASES_PATH}/rRNA_databases/silva-bac-23s-id98.fasta,${DATABASES_PATH}/index/silva-bac-23s-db:\
${DATABASES_PATH}/rRNA_databases/silva-arc-16s-id95.fasta,${DATABASES_PATH}/index/silva-arc-16s-db:\
${DATABASES_PATH}/rRNA_databases/silva-arc-23s-id98.fasta,${DATABASES_PATH}/index/silva-arc-23s-db:\
${DATABASES_PATH}/rRNA_databases/silva-euk-18s-id95.fasta,${DATABASES_PATH}/index/silva-euk-18s-db:\
${DATABASES_PATH}/rRNA_databases/silva-euk-28s-id98.fasta,${DATABASES_PATH}/index/silva-euk-28s:\
${DATABASES_PATH}/rRNA_databases/rfam-5s-database-id98.fasta,${DATABASES_PATH}/index/rfam-5s-db:\
${DATABASES_PATH}/rRNA_databases/rfam-5.8s-database-id98.fasta,${DATABASES_PATH}/index/rfam-5.8s-db \
--reads $INPUT_DIR/1.Qual_filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_interleaved.fastq \
--aligned $INPUT_DIR/1.rRNA_filtered/aligned/S${SLURM_ARRAY_TASK_ID}_paired_rRNA \
--other $INPUT_DIR/1.rRNA_filtered/unaligned/S${SLURM_ARRAY_TASK_ID}_paired_non_rRNA \
--paired_in --num_alignments 1 --fastx --log -v -a 32

Single files (orphan reads output from trimmomatic)

`sbatch scripts/wts_1_sortmerna_single.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_1_sortmerna_single
#SBATCH --time 00:45:00
#SBATCH --mem=2GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e slurm_output/wts_1_sortmerna_single_%a_%j.err
#SBATCH -o slurm_output/wts_1_sortmerna_single_%a_%j.out

# Change to working directory
cd <working_directory>/WTS/
mkdir -p 1.rRNA_filtered/aligned
mkdir -p 1.rRNA_filtered/unaligned
mkdir -p 1.rRNA_filtered/S${SLURM_ARRAY_TASK_ID}_single

cd 1.rRNA_filtered/S${SLURM_ARRAY_TASK_ID}_single

# Load modules
module purge
module load SortMeRNA/2.1-gimkl-2017a

DATABASES_PATH='/nesi/project/uoa02469/Software/sortmerna-2.1'
INPUT_DIR='<working_directory>'

# Run sortmerna - single files
sortmerna \
--ref ${DATABASES_PATH}/rRNA_databases/silva-bac-16s-id90.fasta,${DATABASES_PATH}/index/silva-bac-16s-db:\
${DATABASES_PATH}/rRNA_databases/silva-bac-23s-id98.fasta,${DATABASES_PATH}/index/silva-bac-23s-db:\
${DATABASES_PATH}/rRNA_databases/silva-arc-16s-id95.fasta,${DATABASES_PATH}/index/silva-arc-16s-db:\
${DATABASES_PATH}/rRNA_databases/silva-arc-23s-id98.fasta,${DATABASES_PATH}/index/silva-arc-23s-db:\
${DATABASES_PATH}/rRNA_databases/silva-euk-18s-id95.fasta,${DATABASES_PATH}/index/silva-euk-18s-db:\
${DATABASES_PATH}/rRNA_databases/silva-euk-28s-id98.fasta,${DATABASES_PATH}/index/silva-euk-28s:\
${DATABASES_PATH}/rRNA_databases/rfam-5s-database-id98.fasta,${DATABASES_PATH}/index/rfam-5s-db:\
${DATABASES_PATH}/rRNA_databases/rfam-5.8s-database-id98.fasta,${DATABASES_PATH}/index/rfam-5.8s-db \
--reads $INPUT_DIR/1.Qual_filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_single.fastq \
--aligned $INPUT_DIR/1.rRNA_filtered/aligned/S${SLURM_ARRAY_TASK_ID}_single_rRNA \
--other $INPUT_DIR/1.rRNA_filtered/unaligned/S${SLURM_ARRAY_TASK_ID}_single_non_rRNA \
--num_alignments 1 --fastx --log -v -a 32

**SortMeRNA: Remove introduced empty lines**

NOTE: SortMeRNA can sometimes introduce blank lines into the middle of files. These break deinterleaving: it stops at the first empty line, so the downstream files are missing every read after that point.

Delete empty lines (or those containing only whitespace) from the files:

In [None]:
cd <working_directory>/WTS/

# "Paired" files
for file in 1.rRNA_filtered/unaligned/*paired_non_rRNA.fastq; do
    filename=$(basename ${file} .fastq)
    echo ${filename}
    grep '\S' ${file} > 1.rRNA_filtered/unaligned/${filename}_filt.fastq
done

# "Single" files
for file in 1.rRNA_filtered/unaligned/*single_non_rRNA.fastq; do
    filename=$(basename ${file} .fastq)
    echo ${filename}
    grep '\S' ${file} > 1.rRNA_filtered/unaligned/${filename}_filt.fastq
done
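
Before deinterleaving, it can be worth confirming that the cleaned files are structurally intact. The following is a minimal sketch, assuming unwrapped FASTQ with exactly four lines per read; the function name `check_fastq` is hypothetical and not part of the original workflow.

```shell
# Hypothetical helper: report whether a FASTQ file has intact four-line records.
# Assumes unwrapped FASTQ (exactly four lines per read).
check_fastq() {
    awk '
        BEGIN                 { bad = 0 }
        /^[ \t]*$/            { printf "blank line at %d\n", NR; bad = 1; exit }
        NR % 4 == 1 && !/^@/  { printf "expected @ header at line %d\n", NR; bad = 1; exit }
        NR % 4 == 3 && !/^\+/ { printf "expected + separator at line %d\n", NR; bad = 1; exit }
        END {
            if (!bad && NR % 4 != 0) { printf "truncated record (%d lines)\n", NR; bad = 1 }
            if (!bad) print "OK"
            exit bad
        }
    ' "$1"
}
```

For example, `check_fastq 1.rRNA_filtered/unaligned/S1_paired_non_rRNA_filt.fastq` prints `OK` and returns zero for a clean file, or names the first offending line and returns non-zero.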

**Deinterleave paired files**

In [None]:
# Change to working directory
cd <working_directory>/WTS/

# Load BBMap
module purge
module load BBMap/38.81-gimkl-2020a

# Deinterleave paired files
for i in {1..13}; do
    reformat.sh \
    in=1.rRNA_filtered/unaligned/S${i}_paired_non_rRNA_filt.fastq \
    out1=1.rRNA_filtered/unaligned/S${i}_R1_non_rRNA.fastq \
    out2=1.rRNA_filtered/unaligned/S${i}_R2_non_rRNA.fastq
done

**Downstream files**

Filtered files for use downstream:
- paired reads: `1.rRNA_filtered/unaligned/*_R1_non_rRNA.fastq; 1.rRNA_filtered/unaligned/*_R2_non_rRNA.fastq`
- single reads: `1.rRNA_filtered/unaligned/*single_non_rRNA_filt.fastq`

***

## Filter out host sequences

**Preamble**

Metagenome data derived from host-associated microbial communities should ideally be filtered to remove any reads originating from the host. This can improve the quality and efficiency of downstream data processing. It is also an important consideration when the data may be of a sensitive nature and host sequences need to be removed before the data are made publicly available (e.g. kākāpō). This is especially important for studies involving human subjects or samples derived from taonga species.

There are several approaches that can be used to achieve this. The general principle is to map your reads to a reference genome (e.g. human genome) and remove those reads that map to the reference from the dataset.


Here, we map our reads against a masked kākāpō genome, which has been processed to hide sections that: are presumed microbial contamination in the reference; have high homology to microbial genes/genomes (e.g. ribosomes); or are of low complexity. This ensures that reads that would normally map to these sections of the kākāpō genome are not removed from the dataset (as genuine microbial reads that we wish to retain might also map to these regions), while all reads mapping to the rest of the genome are removed.

The masked kākāpō genome was created in the WGS pipeline.

JobID: 28538992
Runtime: 5min
Mem: 9GB

**Host filtering: per-sample BBMap read mapping, slurm array**

Note:
- This step outputs fastq files where reads that map to the reference genome have been filtered out.
- The output from `outu` is the filtered file for downstream use.
- Host filtering here is run as a two-step process for each sample: first on the paired reads (R1 and R2), and then again on the unpaired (single) reads file.
- The parameters are set based on the recommendations for host filtering outlined [here](https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/37175-introducing-removehuman-human-contaminant-removal?t=42552)

`sbatch scripts/wts_1_hostfilt_mapping.sl`

In [None]:
#!/bin/bash
#SBATCH -A uoa03068
#SBATCH -J wts_1_hostfilt_mapping
#SBATCH --time 00:10:00
#SBATCH --mem 28GB
#SBATCH --ntasks 1
#SBATCH --array=1-13
#SBATCH --cpus-per-task 32
#SBATCH -e slurm_output/wts_1_hostfilt_mapping_%a_%j.err
#SBATCH -o slurm_output/wts_1_hostfilt_mapping_%a_%j.out

# Set up working directories
mkdir -p <working_directory>/WTS/1.host_filtered/
cd <working_directory>/WTS/1.host_filtered/

# Load BBMap module
module purge
module load BBMap/38.81-gimkl-2020a

## Run bbmap

# Copy over indexed masked kākāpō genome from WGS directory
cp <working_directory>/WGS/1.host_filtered/<indexed_genome_files> <working_directory>/WTS/1.host_filtered/

# Paired reads (R1 and R2)
srun bbmap.sh -Xmx26g -t=32 \
minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim \
in1=../1.rRNA_filtered/unaligned/S${SLURM_ARRAY_TASK_ID}_R1_non_rRNA.fastq \
in2=../1.rRNA_filtered/unaligned/S${SLURM_ARRAY_TASK_ID}_R2_non_rRNA.fastq \
outu1=S${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
outu2=S${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq

# Unpaired (single) reads
srun bbmap.sh -Xmx26g -t=32 \
minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim \
in=../1.rRNA_filtered/unaligned/S${SLURM_ARRAY_TASK_ID}_single_non_rRNA_filt.fastq \
outu=S${SLURM_ARRAY_TASK_ID}_single_hostFilt.fastq

**Downstream files**

Filtered files for use downstream:

- paired reads: `1.host_filtered/*_R1_hostFilt.fastq; 1.host_filtered/*_R2_hostFilt.fastq`
- single reads: `1.host_filtered/*single_hostFilt.fastq`

***

## Checking library sizes

In [None]:
cd <working_directory>/WTS/

# Note: reads are counted as (number of lines / 4). Counting lines that contain '@'
# overcounts, because '@' is also a valid quality-score character.

# All raw files
for file in 0.Raw_concat/*; do
    echo ${file}
done
for file in 0.Raw_concat/*; do
    echo $(( $(zcat ${file} | wc -l) / 4 ))
done

# Trimmed files (concatenated parts 1 and 2 together)
for file in 1.Qual_filtered_trimmomatic/*.fastq; do
    echo ${file}
done
for file in 1.Qual_filtered_trimmomatic/*.fastq; do
    echo $(( $(wc -l < ${file}) / 4 ))
done

# SortMeRNA-processed files
for file in 1.rRNA_filtered/unaligned/*non_rRNA*; do
    echo ${file}
done
for file in 1.rRNA_filtered/unaligned/*non_rRNA*; do
    echo $(( $(wc -l < ${file}) / 4 ))
done

# Host filtered files
for file in 1.host_filtered/*hostFilt*; do
    echo ${file}
done
for file in 1.host_filtered/*hostFilt*; do
    echo $(( $(wc -l < ${file}) / 4 ))
done
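
Read counts are most reliably taken as the line count divided by four, since `@` is also a valid Phred+33 quality character and can appear in quality lines. The sketch below wraps this in a reusable helper and adds a second helper for the percentage of reads retained between steps. Both function names are hypothetical, and unwrapped four-line FASTQ records are assumed.

```shell
# Hypothetical helper: count reads in a FASTQ file (plain or gzipped) as lines / 4.
# Assumes unwrapped four-line FASTQ records.
count_reads() {
    case "$1" in
        *.gz) n_lines=$(gzip -dc "$1" | wc -l) ;;
        *)    n_lines=$(wc -l < "$1") ;;
    esac
    echo $(( n_lines / 4 ))
}

# Hypothetical helper: percentage of reads retained between two processing steps.
# Arguments: read count before, read count after.
percent_retained() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before > 0 ? 100 * after / before : 0) }'
}
```

For example, `percent_retained "$(count_reads 0.Raw_concat/S1_R1.fastq.gz)" "$(count_reads 1.host_filtered/S1_R1_hostFilt.fastq)"` gives the share of S1 forward reads that survive all filtering steps.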

***

## Downstream processing options

The following analyses cover virus detection from the RNA metatranscriptome data. Gene transcription profiles are created in the WGS pipeline.

***

## Assembly


Assembly via SPAdes

- rnaSPAdes is not well suited to assembling viral RNA (it is designed more specifically for transcripts), while metaviralSPAdes is geared towards DNA viruses. We therefore use metaSPAdes for RNA viral contig assembly.

**Concatenate reads for assembly**

In [None]:
cd <working_directory>/WTS/

mkdir -p 2.assembly_spades/infiles_concat

cat 1.host_filtered/*_R1_hostFilt.fastq > 2.assembly_spades/infiles_concat/filtered_reads_R1.fastq
cat 1.host_filtered/*_R2_hostFilt.fastq > 2.assembly_spades/infiles_concat/filtered_reads_R2.fastq
cat 1.host_filtered/*_single_hostFilt.fastq > 2.assembly_spades/infiles_concat/filtered_reads_single.fastq

# Mini co-assemblies
## Healthy = S8, S9, S12, S13
## Diseased = S1 - S7, S10, S11

mkdir -p 1.host_filtered/Healthy
mkdir -p 1.host_filtered/Diseased

# Note: the trailing underscore in S${i}_* stops S1 from also matching S10-S13
for i in {1..7} 10 11; do
    cp 1.host_filtered/S${i}_* 1.host_filtered/Diseased/
done

for i in {8,9,12,13}; do
    cp 1.host_filtered/S${i}_* 1.host_filtered/Healthy/
done

cat 1.host_filtered/Healthy/*_R1_hostFilt.fastq > 2.assembly_spades/infiles_concat/Healthy_reads_R1.fastq
cat 1.host_filtered/Healthy/*_R2_hostFilt.fastq > 2.assembly_spades/infiles_concat/Healthy_reads_R2.fastq
cat 1.host_filtered/Healthy/*_single_hostFilt.fastq > 2.assembly_spades/infiles_concat/Healthy_reads_single.fastq

cat 1.host_filtered/Diseased/*_R1_hostFilt.fastq > 2.assembly_spades/infiles_concat/Diseased_reads_R1.fastq
cat 1.host_filtered/Diseased/*_R2_hostFilt.fastq > 2.assembly_spades/infiles_concat/Diseased_reads_R2.fastq
cat 1.host_filtered/Diseased/*_single_hostFilt.fastq > 2.assembly_spades/infiles_concat/Diseased_reads_single.fastq
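
The sample-to-group assignment can also be written down once and checked before any files are copied. The sketch below is a dry run, assuming the Healthy/Diseased grouping listed in the comments above; the function name `sample_group` is hypothetical.

```shell
# Hypothetical helper: map a sample number (1-13) to its mini co-assembly group,
# following the grouping used in this workflow (Healthy = S8, S9, S12, S13).
sample_group() {
    case "$1" in
        8|9|12|13)            echo "Healthy" ;;
        1|2|3|4|5|6|7|10|11)  echo "Diseased" ;;
        *)                    echo "unknown sample: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the glob that would be copied for each sample. The trailing
# underscore in the pattern stops S1 from also matching S10-S13.
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13; do
    echo "S${i} -> $(sample_group "$i"): 1.host_filtered/S${i}_*"
done
```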

**Run co-assembly**

NOTE: when changing the memory allocation, make sure to change it in both the SBATCH header and the actual spades call (the `-m` flag)
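
One way to keep the two values from drifting apart is to derive the SPAdes limit from the Slurm request inside the job script. This is a minimal sketch, assuming Slurm exports the `--mem` request (in MB) as `SLURM_MEM_PER_NODE`; the function name `spades_mem_gb` and the fallback value are hypothetical.

```shell
# Hypothetical helper: convert a Slurm memory request in MB to whole GB for `spades.py -m`.
# Falls back to a caller-supplied default when the value is missing, non-numeric, or below 1 GB.
spades_mem_gb() {
    mem_mb="${1:-}"
    default_gb="${2:-10}"
    if [ -n "$mem_mb" ] && [ "$mem_mb" -ge 1024 ] 2>/dev/null; then
        echo $(( mem_mb / 1024 ))
    else
        echo "$default_gb"
    fi
}

# Inside a job script, SLURM_MEM_PER_NODE is assumed to hold the --mem request in MB
mem_gb=$(spades_mem_gb "${SLURM_MEM_PER_NODE:-}" 40)
echo "spades.py --meta -t 16 -m ${mem_gb} ..."
```

With this pattern only the `#SBATCH --mem` line needs editing when the allocation changes.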

`sbatch scripts/wts_2.coassembly_spades.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_2.coassembly_spades
#SBATCH --time 7:30:00
#SBATCH --mem=40GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_2.coassembly_spades_%j.err
#SBATCH -o slurm_output/wts_2.coassembly_spades_%j.out

# Set up working directory
cd <working_directory>/WTS/
mkdir -p 2.assembly_spades/coassembly

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Set up variables for input path and output path
inpath=2.assembly_spades/infiles_concat
outpath=2.assembly_spades

# Run metaSPAdes
srun spades.py --meta -t 16 -m 40 \
-1 2.assembly_spades/infiles_concat/filtered_reads_R1.fastq \
-2 2.assembly_spades/infiles_concat/filtered_reads_R2.fastq \
-s 2.assembly_spades/infiles_concat/filtered_reads_single.fastq \
-o 2.assembly_spades/coassembly/

**Mini co-assemblies**

`sbatch scripts/wts_2.healthy_coassembly_spades.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_2.healthy_coassembly_spades
#SBATCH --time 3:30:00
#SBATCH --mem=30GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_2.healthy_coassembly_spades_%j.err
#SBATCH -o slurm_output/wts_2.healthy_coassembly_spades_%j.out

# Set up working directory
cd <working_directory>/WTS/
mkdir -p 2.assembly_spades/mini_co_assemblies/Healthy

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Set up variables for input path and output path
inpath=2.assembly_spades/infiles_concat
outpath=2.assembly_spades

# Run metaSPAdes
srun spades.py --meta -t 16 -m 30 \
-1 2.assembly_spades/infiles_concat/Healthy_reads_R1.fastq \
-2 2.assembly_spades/infiles_concat/Healthy_reads_R2.fastq \
-s 2.assembly_spades/infiles_concat/Healthy_reads_single.fastq \
-o 2.assembly_spades/mini_co_assemblies/Healthy/

JobID: 28540090
Runtime:1hr 10min
Mem: 9GB

`sbatch scripts/wts_2.diseased_coassembly_spades.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_2.diseased_coassembly_spades
#SBATCH --time 2:30:00
#SBATCH --mem=15GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_2.diseased_coassembly_spades_%j.err
#SBATCH -o slurm_output/wts_2.diseased_coassembly_spades_%j.out

# Set up working directory
cd <working_directory>/WTS/
mkdir -p 2.assembly_spades/mini_co_assemblies/Diseased

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Set up variables for input path and output path
inpath=2.assembly_spades/infiles_concat
outpath=2.assembly_spades

# Run metaSPAdes
srun spades.py --meta -t 16 -m 15 \
-1 2.assembly_spades/infiles_concat/Diseased_reads_R1.fastq \
-2 2.assembly_spades/infiles_concat/Diseased_reads_R2.fastq \
-s 2.assembly_spades/infiles_concat/Diseased_reads_single.fastq \
-o 2.assembly_spades/mini_co_assemblies/Diseased/

**Run individual assemblies as slurm array**

`sbatch scripts/wts_2.assembly_spades.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_2.assembly_spades
#SBATCH --time 04:00:00
#SBATCH --mem=10GB
#SBATCH --ntasks=1
#SBATCH --array=1-13
#SBATCH --cpus-per-task=6
#SBATCH -e slurm_output/wts_2.assembly_spades_%a_%j.err
#SBATCH -o slurm_output/wts_2.assembly_spades_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/
mkdir -p 2.assembly_spades/indv_assembly

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Run metaSPAdes
srun spades.py --meta -t 6 -m 10 \
-1 1.host_filtered/S${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
-2 1.host_filtered/S${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq \
-s 1.host_filtered/S${SLURM_ARRAY_TASK_ID}_single_hostFilt.fastq \
-o 2.assembly_spades/indv_assembly/S${SLURM_ARRAY_TASK_ID}_indv_assembly/

**Optional: Filtering out short contigs**
    
Note: Jemma Geoghegan recommended not filtering by length here for RNA viruses (as they can be quite short and/or you might still get useful information out of short fragments of viral genomes).

If you wish to filter out short contigs, you can do so via `seqmagick`.
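
As an alternative that needs no extra module, a length filter can be sketched in `awk`. This is not the `seqmagick` command itself; the function name `filter_fasta_by_length` is hypothetical, and the output is written with each sequence on a single line.

```shell
# Hypothetical helper: keep only FASTA records whose sequence is at least min_len bases.
# Arguments: input fasta, minimum length. Wrapped sequences are joined before measuring.
filter_fasta_by_length() {
    awk -v min_len="$2" '
        function emit() { if (header != "" && length(seq) >= min_len) print header "\n" seq }
        /^>/ { emit(); header = $0; seq = ""; next }
             { seq = seq $0 }
        END  { emit() }
    ' "$1"
}
```

For example, `filter_fasta_by_length 2.assembly_spades/coassembly/scaffolds.fasta 1000 > scaffolds.m1000.fasta` would keep contigs of 1 kb or longer (the output file name here is only illustrative).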

**Assessing assemblies**

***Counting the number of contigs in each of the assemblies***

In [None]:
cd <working_directory>/WTS/

## Count contigs
# All contigs
grep -c '>' 2.assembly_spades/coassembly/scaffolds.fasta
grep -c '>' 2.assembly_spades/mini_co_assemblies/Healthy/scaffolds.fasta
grep -c '>' 2.assembly_spades/mini_co_assemblies/Diseased/scaffolds.fasta

for file in 2.assembly_spades/indv_assembly/*_indv_assembly/scaffolds.fasta; do
    echo ${file}
    grep -c '>' ${file} 
done

**Assembly statistics via BBMap's stats.sh script**

In [None]:
module purge
module load BBMap/38.73-gimkl-2018b

cd <working_directory>/WTS/

# Run stats.sh
stats.sh in=2.assembly_spades/coassembly/scaffolds.fasta
stats.sh in=2.assembly_spades/mini_co_assemblies/Healthy/scaffolds.fasta
stats.sh in=2.assembly_spades/mini_co_assemblies/Diseased/scaffolds.fasta

for file in 2.assembly_spades/indv_assembly/*_indv_assembly/scaffolds.fasta; do
    echo ${file}
    stats.sh in=${file} 
done

***

## Identify putative viral contigs

Below will cover:

- VIBRANT
- VirSorter2
- BLAST searches (against NCBI-nr, NCBI-nt, custom RdRp database)
    - Note: the BLAST searches require a modified version of the fasta file (line wrapping removed and sequence headers shortened). You can use the script below for this, and then use the modified output files for all of the downstream processes.

### reformat_fa2oneline.py

Python script created by Michael Hoggard to reformat fasta output files for input into BLAST (shortens contig headers, removes unnecessary information, and removes line breaks from within sequences).

Example code:

`reformat_fa2oneline.py \`
`-i 2.assembly_spades/coassembly/scaffolds.fasta \`
`-o 3.virus_prediction/0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa`

Output:

`3.virus_prediction/0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa`: assembly fasta file reformatted for use downstream
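
The Python script itself is not reproduced here. For illustration only, the sketch below shows the general idea in `awk`: it joins wrapped sequence lines and truncates each header at the first whitespace. The truncation rule and the function name `fasta_to_oneline` are assumptions, and the original script may shorten headers differently.

```shell
# Hypothetical stand-in for the reformatting step (NOT the original Python script):
# joins wrapped sequence lines and truncates each header at the first whitespace.
fasta_to_oneline() {
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; print $1; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}
```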

**reformat_fa2oneline.py: all assemblies**

In [None]:
# Set up working directories
cd <working_directory>/WTS/
mkdir -p 3.virus_prediction/0.reformatted_spades_fasta/

# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a

script_path='<working_directory>/WTS/3.virus_prediction'

# run reformat_fa2oneline.py on co-assemblies
3.virus_prediction/reformat_fa2oneline.py -i 2.assembly_spades/coassembly/scaffolds.fasta -o 3.virus_prediction/0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa
3.virus_prediction/reformat_fa2oneline.py -i 2.assembly_spades/mini_co_assemblies/Healthy/scaffolds.fasta -o 3.virus_prediction/0.reformatted_spades_fasta/healthy_coassembly_scaffolds_reformated.fa
3.virus_prediction/reformat_fa2oneline.py -i 2.assembly_spades/mini_co_assemblies/Diseased/scaffolds.fasta -o 3.virus_prediction/0.reformatted_spades_fasta/diseased_coassembly_scaffolds_reformated.fa

# run on individual samples
for i in {1..13}; do
    ${script_path}/reformat_fa2oneline.py \
      -i 2.assembly_spades/indv_assembly/S${i}_indv_assembly/scaffolds.fasta \
      -o 3.virus_prediction/0.reformatted_spades_fasta/S${i}_scaffolds_reformated.fa
done

### VIBRANT

See [VIBRANT github](https://github.com/AnantharamanLab/VIBRANT) for more information.

NOTE: `$DB_PATH` variable is already set up with the VIBRANT module in NeSI




***VIBRANT: individual assemblies***

`sbatch scripts/wts_3_vibrant.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vibrant
#SBATCH --time 00:30:00
#SBATCH --mem=6GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_3_vibrant_%a_%j.err
#SBATCH -o slurm_output/wts_3_vibrant_%a_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 1.vibrant/individual_assemblies

# Load dependencies
module purge
module load VIBRANT/1.2.1-gimkl-2020a

# Run main analyses 
VIBRANT_run.py -t 16 \
-i 0.reformatted_spades_fasta/S${SLURM_ARRAY_TASK_ID}_scaffolds_reformated.fa \
-d $DB_PATH \
-folder 1.vibrant/individual_assemblies

***VIBRANT: Co-assembly***

`sbatch scripts/wts_3_vibrant_coassembly.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vibrant_coassembly
#SBATCH --time 00:30:00
#SBATCH --mem=6GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_3_vibrant_coassembly_%j.err
#SBATCH -o slurm_output/wts_3_vibrant_coassembly_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 1.vibrant/coassembly  

# Load dependencies
module purge
module load VIBRANT/1.2.1-gimkl-2020a

# Run main analyses 
VIBRANT_run.py -t 16 \
-i 0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa \
-d $DB_PATH \
-folder 1.vibrant/coassembly

***VIBRANT: Healthy co-assembly***

`sbatch scripts/wts_3_vibrant_healthy.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vibrant_healthy
#SBATCH --time 00:30:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH -e slurm_output/wts_3_vibrant_healthy_%j.err
#SBATCH -o slurm_output/wts_3_vibrant_healthy_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 1.vibrant/healthy 

# Load dependencies
module purge
module load VIBRANT/1.2.1-gimkl-2020a

# Run main analyses 
VIBRANT_run.py -t 8 \
-i 0.reformatted_spades_fasta/healthy_coassembly_scaffolds_reformated.fa \
-d $DB_PATH \
-folder 1.vibrant/healthy

***VIBRANT: Diseased co-assembly***

`sbatch scripts/wts_3_vibrant_diseased.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vibrant_diseased
#SBATCH --time 00:30:00
#SBATCH --mem=5GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH -e slurm_output/wts_3_vibrant_diseased_%j.err
#SBATCH -o slurm_output/wts_3_vibrant_diseased_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 1.vibrant/diseased  

# Load dependencies
module purge
module load VIBRANT/1.2.1-gimkl-2020a

# Run main analyses 
VIBRANT_run.py -t 8 \
-i 0.reformatted_spades_fasta/diseased_coassembly_scaffolds_reformated.fa \
-d $DB_PATH \
-folder 1.vibrant/diseased

### VirSorter2

See [VirSorter2 github](https://github.com/jiarong/VirSorter2) for more information and for a link to a tutorial/SOP, including recommended score threshold settings.

**Running VirSorter2: parameters**

Key options:

- `-d` database path (not part of the NeSI VirSorter2 module. Databases loaded locally from /nesi/project/uoa02469/Databases/virsorter2_20210909/)
- `-i` input file (fa or fq format)
- `-j` threads (jobs)
- `--tmpdir` directory for temp files
- `--rm-tmpdir` remove intermediate/temp files
- `-l` label (prefix for output files; useful when re-running classify with different filtering)
- `-w` working (output) directory


n.b. to re-run with a different filter threshold, run only the `classify` step, i.e.

`virsorter run [options] classify`

Filtering options include:

- `--min-score FLOAT` min score to identify as viral (default 0.5)
- `--min-length INTEGER` min seq length (default 0)
- `--high-confidence-only` only output high confidence viral sequences; this is equivalent to screening final-viral-score.tsv with the following criteria: (max_score >= 0.9) OR (max_score >=0.7 AND hallmark >= 1) (default: False)
- `--hallmark-required-on-short` require hallmark gene on short viral seqs (default 'short' length < 3kb) (default False)
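
The high-confidence criteria above can also be applied after the fact to an existing score table. The following is a minimal sketch, assuming a tab-separated file with a header row containing columns named `max_score` and `hallmark` (as expected for `final-viral-score.tsv`); the function name `high_confidence_hits` is hypothetical.

```shell
# Hypothetical helper: apply the high-confidence criteria to a VirSorter2 score table:
# (max_score >= 0.9) OR (max_score >= 0.7 AND hallmark >= 1).
# Column positions are looked up from the header row, so column order does not matter.
high_confidence_hits() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("max_score" in col) || !("hallmark" in col)) {
                print "missing max_score or hallmark column" > "/dev/stderr"
                exit 1
            }
            ms = col["max_score"]; hm = col["hallmark"]
            print; next
        }
        $ms >= 0.9 || ($ms >= 0.7 && $hm >= 1)
    ' "$1"
}
```

The header row is passed through, so the output can be used as a drop-in filtered table.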

NOTE: `module unload XALT` also needs to be included after `module purge` to avoid a conflict when running the current VirSorter2 module in NeSI

***VirSorter2: individual assemblies***

`sbatch scripts/wts_3_vsort2.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vsort2
#SBATCH --time 08:00:00
#SBATCH --mem=8GB
#SBATCH --ntasks=1
#SBATCH --array=1-13
#SBATCH --cpus-per-task=32
#SBATCH -e slurm_output/wts_3_vsort2_%a_%j.err
#SBATCH -o slurm_output/wts_3_vsort2_%a_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 2.virsorter2/individual_assemblies

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Run virsorter2 - need access to Handley lab NeSI database to run this specific script 
srun virsorter run -j 32 \
-d /nesi/project/uoa02469/Databases/virsorter2_20210909/ \
-i 0.reformatted_spades_fasta/S${SLURM_ARRAY_TASK_ID}_scaffolds_reformated.fa \
--min-score 0.5 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l S${SLURM_ARRAY_TASK_ID} \
-w 2.virsorter2/individual_assemblies/S${SLURM_ARRAY_TASK_ID} \
--tmpdir ${SLURM_JOB_ID}.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}

JobID: 23526354 
Mem: 2GB
Time: 1.5 hr 

***VirSorter2: co-assembly***

`sbatch scripts/wts_3_vsort2_coassembly.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vsort2_coassembly
#SBATCH --time 08:00:00
#SBATCH --mem=8GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e slurm_output/wts_3_vsort2_coassembly_%j.err
#SBATCH -o slurm_output/wts_3_vsort2_coassembly_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 2.virsorter2/coassembly

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2
 
# Run virsorter2
srun virsorter run -j 32 \
-d /nesi/project/uoa02469/Databases/virsorter2_20210909/ \
-i 0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa \
--min-score 0.5 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l coassembly \
-w 2.virsorter2/coassembly \
--tmpdir coassembly.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}

***VirSorter2: Healthy co-assembly***

`sbatch scripts/wts_3_vsort2_healthy.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vsort2_healthy
#SBATCH --time 03:00:00
#SBATCH --mem=2GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_3_vsort2_healthy_%j.err
#SBATCH -o slurm_output/wts_3_vsort2_healthy_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 2.virsorter2/healthy

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2
 
# Run virsorter2
srun virsorter run -j 16 \
-d /nesi/project/uoa02469/Databases/virsorter2_20210909/ \
-i 0.reformatted_spades_fasta/healthy_coassembly_scaffolds_reformated.fa \
--min-score 0.5 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l healthy \
-w 2.virsorter2/healthy \
--tmpdir healthy.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}

***VirSorter2: Diseased co-assembly***

`sbatch scripts/wts_3_vsort2_diseased.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_vsort2_diseased
#SBATCH --time 04:00:00
#SBATCH --mem=2GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH -e slurm_output/wts_3_vsort2_diseased_%j.err
#SBATCH -o slurm_output/wts_3_vsort2_diseased_%j.out

# Set up working directories
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 2.virsorter2/diseased
 
# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Run virsorter2
srun virsorter run -j 20 \
-d /nesi/project/uoa02469/Databases/virsorter2_20210909/ \
-i 0.reformatted_spades_fasta/diseased_coassembly_scaffolds_reformated.fa \
--min-score 0.5 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l diseased \
-w 2.virsorter2/diseased \
--tmpdir diseased.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}

### BLAST searches

BLAST searches against NCBI nt, NCBI nr, and a custom RdRp database (manually generated/curated by Jemma Geoghegan).

Notes:

- The nt database is loaded via `module load BLASTDB/2021-01` and works as is (and already includes taxonomy, so this can be called via e.g. `sskingdoms`)
- The nr database was downloaded from NCBI (23/08/2021) and a DIAMOND (`.dmnd`) database was generated from it, including taxonomy so that `sskingdoms` can be reported. This database is available in the Handley group databases directory: `/nesi/project/uoa02469/Databases/NCBI_20210823_dmnd/nr_wTax_FULL.dmnd`.
- The RdRp database was custom generated by Jemma Geoghegan.
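
The `sskingdoms` column mentioned in the notes above is what later allows viral hits to be separated from the rest. As a minimal sketch (not part of the original workflow), the tabular output of the nr searches could be screened in Python as follows. The column order is taken from the `-f 6 ...` list used in the DIAMOND scripts in this section; the function name `viral_hits`, the example rows, and the assumption that multiple kingdoms are separated by `;` are illustrative only.

```python
import csv
import io

# Column order matches the `-f 6 ...` list used for the nr searches in this workflow
NR_COLUMNS = ["qseqid", "qlen", "sseqid", "stitle", "pident", "length", "evalue", "sskingdoms"]

def viral_hits(handle, columns=NR_COLUMNS):
    """Return rows (as dicts) whose sskingdoms field lists 'Viruses'."""
    # QUOTE_NONE so that quote characters inside subject titles are read literally
    reader = csv.DictReader(handle, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        # Assumption: several kingdoms for one subject are separated by ';'
        kingdoms = (row["sskingdoms"] or "").split(";")
        if "Viruses" in kingdoms:
            hits.append(row)
    return hits

# Hypothetical two-line example standing in for a real file such as 3.blast/nr/S1_nr.txt
example = io.StringIO(
    "contig_1\t1500\tYP_000001.1\thypothetical RdRp [Example virus]\t62.1\t310\t1e-40\tViruses\n"
    "contig_2\t900\tWP_000002.1\thypothetical protein [Example bacterium]\t88.0\t120\t1e-20\tBacteria\n"
)
for hit in viral_hits(example):
    print(hit["qseqid"], hit["stitle"], hit["evalue"])
```

For the nt searches the same function could be reused by passing the shorter column list from the `-outfmt` string of the blastn scripts.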

***BLAST search: nr (non-redundant protein database) sequences: individual assemblies***

`sbatch scripts/wts_3_blast_nr.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nr
#SBATCH --time 30:00:00
#SBATCH --mem=36GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nr_%a_%j.err
#SBATCH -o slurm_output/wts_3_blast_nr_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 3.blast/nr

# Load module(s)
module purge
module load DIAMOND/2.0.6-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/NCBI_20210823_dmnd/nr_wTax_FULL.dmnd'

# Run diamond blastx search
srun diamond blastx \
-q 0.reformatted_fasta/S${SLURM_ARRAY_TASK_ID}.scaffolds_reformated.fa \
-d ${search_db} \
-o 3.blast/nr/S${SLURM_ARRAY_TASK_ID}_nr.txt \
-e 1E-5 -k 3 -p 24 -f 6 qseqid qlen sseqid stitle pident length evalue sskingdoms --more-sensitive

***BLAST-nr: coassembly***

`sbatch scripts/wts_3_blast_nr_coassembly.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nr_coassembly
#SBATCH --time 30:00:00
#SBATCH --mem=36GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nr_coassembly_%j.err
#SBATCH -o slurm_output/wts_3_blast_nr_coassembly_%j.out
#SBATCH --profile=task

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load DIAMOND/2.0.6-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/NCBI_20210823_dmnd/nr_wTax_FULL.dmnd'

# Run diamond blastx search
srun diamond blastx \
-q 0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa \
-d ${search_db} \
-o 3.blast/nr/coassembly_nr.txt \
-e 1E-5 -k 3 -p 24 -f 6 qseqid qlen sseqid stitle pident length evalue --more-sensitive

***BLAST-nr: mini-coassemblies***

`sbatch scripts/wts_3_blast_nr_mini.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nr_mini
#SBATCH --time 30:00:00
#SBATCH --mem=36GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nr_mini_%j.err
#SBATCH -o slurm_output/wts_3_blast_nr_mini_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load DIAMOND/2.0.6-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/NCBI_20210823_dmnd/nr_wTax_FULL.dmnd'

# Run diamond blastx search
for j in healthy diseased; do
    srun diamond blastx \
    -q 0.reformatted_fasta/${j}_coassembly_scaffolds_reformated.fa \
    -d ${search_db} \
    -o 3.blast/nr/${j}_coassembly_nr.txt \
    -e 1E-5 -k 3 -p 24 -f 6 qseqid qlen sseqid stitle pident length evalue sskingdoms --more-sensitive
done

***BLAST search: nt (nucleotide database) sequences: individual assemblies***

`sbatch scripts/wts_3_blast_nt.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nt
#SBATCH --time 06:00:00
#SBATCH --mem=20GB
#SBATCH --array=1-13
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nt_%a_%j.err
#SBATCH -o slurm_output/wts_3_blast_nt_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 3.blast/nt

# Load module(s)
module purge
module load BLAST/2.10.0-GCC-9.2.0
module load BLASTDB/2021-01

# Database variable
search_db=nt

# Run blastn search
srun blastn \
-query 0.reformatted_fasta/S${SLURM_ARRAY_TASK_ID}.scaffolds_reformated.fa \
-db ${search_db} \
-out 3.blast/nt/S${SLURM_ARRAY_TASK_ID}_nt.txt \
-max_target_seqs 5 -num_threads 24 -evalue 1E-10 \
-outfmt "6 qseqid sacc salltitles pident length evalue sskingdoms"

***BLAST-nt: coassembly***

`sbatch scripts/wts_3_blast_nt_coassembly.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nt_coassembly
#SBATCH --time 06:00:00
#SBATCH --mem=10GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nt_coassembly_%j.err
#SBATCH -o slurm_output/wts_3_blast_nt_coassembly_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load BLAST/2.10.0-GCC-9.2.0
module load BLASTDB/2021-01

# Database variable
search_db=nt

# Run blastn search
srun blastn \
-query 0.reformatted_fasta/coassembly_scaffolds_reformated.fa \
-db ${search_db} \
-out 3.blast/nt/coassembly_nt.txt \
-max_target_seqs 5 -num_threads 24 -evalue 1E-10 \
-outfmt "6 qseqid sacc salltitles pident length evalue sskingdoms"

***BLAST-nt: mini-coassemblies***

`sbatch scripts/wts_3_blast_nt_mini.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_nt_mini
#SBATCH --time 06:00:00
#SBATCH --mem=12GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_3_blast_nt_mini_%j.err
#SBATCH -o slurm_output/wts_3_blast_nt_mini_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load BLAST/2.10.0-GCC-9.2.0
module load BLASTDB/2021-01

# Database variable
search_db=nt

# Run blastn search
for j in healthy diseased; do
    srun blastn \
    -query 0.reformatted_fasta/${j}_coassembly_scaffolds_reformated.fa \
    -db ${search_db} \
    -out 3.blast/nt/${j}_coassembly_nt.txt \
    -max_target_seqs 5 -num_threads 24 -evalue 1E-10 \
    -outfmt "6 qseqid sacc salltitles pident length evalue sskingdoms"
done

***BLAST search: RdRp (polymerase) viral sequences: individual assemblies***

Notes:

- This searches against a custom database of RdRp sequences generated by Jemma Geoghegan.
- A DIAMOND formatted version of this database is available in the Handley group directory: `/nesi/project/uoa02469/Databases/viral_rdrp_protein_database/viral_rdrp.dmnd`


`sbatch scripts/wts_3_blast_rdrp.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_rdrp
#SBATCH --time 00:05:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH -e slurm_output/wts_3_blast_rdrp_%j.err
#SBATCH -o slurm_output/wts_3_blast_rdrp_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/
mkdir -p 3.blast/rdrp

# Load module(s)
module purge
module load DIAMOND/0.9.32-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/viral_rdrp_protein_database/viral_rdrp.dmnd'

# Run diamond blastx search
for i in {1..13}; do
    srun diamond blastx \
    -q 0.reformatted_fasta/S${i}.scaffolds_reformated.fa \
    -d ${search_db} \
    -o 3.blast/rdrp/S${i}_RdRp.txt \
    -e 1E-5 -k 3 -p 12 -f 6 qseqid qlen sseqid stitle pident length evalue --more-sensitive
done

***BLAST-RdRp: coassembly***

`sbatch scripts/wts_3_blast_rdrp_coassembly.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_rdrp_coassembly
#SBATCH --time 00:05:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH -e slurm_output/wts_3_blast_rdrp_coassembly_%j.err
#SBATCH -o slurm_output/wts_3_blast_rdrp_coassembly_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load DIAMOND/0.9.32-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/viral_rdrp_protein_database/viral_rdrp.dmnd'

# Run diamond blastx search
srun diamond blastx \
-q 0.reformatted_fasta/coassembly_scaffolds_reformated.fa \
-d ${search_db} \
-o 3.blast/rdrp/coassembly_RdRp.txt \
-e 1E-5 -k 3 -p 12 -f 6 qseqid qlen sseqid stitle pident length evalue --more-sensitive

***BLAST-RdRp: mini-coassemblies***

`sbatch scripts/wts_3_blast_rdrp_mini.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_3_blast_rdrp_mini
#SBATCH --time 00:05:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH -e slurm_output/wts_3_blast_rdrp_mini_%j.err
#SBATCH -o slurm_output/wts_3_blast_rdrp_mini_%j.out

# Set up working directory
cd <working_directory>/WTS/3.virus_prediction/

# Load module(s)
module purge
module load DIAMOND/0.9.32-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/viral_rdrp_protein_database/viral_rdrp.dmnd'

# Run diamond blastx search
for j in healthy diseased; do
    srun diamond blastx \
    -q 0.reformatted_fasta/${j}_coassembly_scaffolds_reformated.fa \
    -d ${search_db} \
    -o 3.blast/rdrp/${j}_coassembly_RdRp.txt \
    -e 1E-5 -k 3 -p 12 -f 6 qseqid qlen sseqid stitle pident length evalue --more-sensitive
done

JobID: 23540744

***

## Summary table of all methods

- `summarise_viral_contigs.py` generated by Michael Hoggard
- For each sample, generate summary table including putative virus results for each method

Note:

- In output summary table, individual contigs will generally be repeated over multiple lines due to multiple blast database hits for the contig.
- `summary_table_VIRUSES.txt` output file is filtered to retain any hits made by VIBRANT, VirSorter2, RdRp database hits, or nt or nr hits with taxonomy matching viruses.
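
The retention rule described in the note above can be sketched in a few lines of Python. This is only a reading of that note, not the code of `summarise_viral_contigs.py` (which is not reproduced here). The column names are those that appear in the summary tables later in this workflow; the helper names and the example rows are hypothetical.

```python
# Columns that are only filled when the corresponding tool or database reported the contig
TOOL_COLUMNS = ("vibrant_contig_ID", "vsort2_contig_ID", "rdrp_contig_ID")
# Taxonomy columns derived from the nt and nr searches
KINGDOM_COLUMNS = ("nt_Kingdom", "nr_Kingdom")

def _filled(value):
    """True when a table cell holds a real value rather than a blank or NaN placeholder."""
    return value is not None and str(value).strip().lower() not in ("", "nan", "na")

def is_putative_virus(row):
    """Keep a row if any tool flagged the contig, or if nt/nr taxonomy matches viruses."""
    tool_hit = any(_filled(row.get(col)) for col in TOOL_COLUMNS)
    viral_taxonomy = any("Viruses" in str(row.get(col) or "") for col in KINGDOM_COLUMNS)
    return tool_hit or viral_taxonomy

# Hypothetical rows standing in for lines of a summary table
rows = [
    {"contig_ID": "contig_1", "nr_Kingdom": "Viruses"},
    {"contig_ID": "contig_2", "nt_Kingdom": "Bacteria", "nr_Kingdom": "Bacteria"},
    {"contig_ID": "contig_3", "nt_Kingdom": "Eukaryota", "vsort2_contig_ID": "contig_3"},
]
kept = [row["contig_ID"] for row in rows if is_putative_virus(row)]
print(kept)  # ['contig_1', 'contig_3']
```

Because a contig can appear on several lines (one per database hit), a rule like this is applied per row, so a contig is retained as soon as any one of its rows qualifies.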

***summarise_viral_contigs.py: All samples***

In [None]:
# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a

# Script path
script_path='/nesi/project/uoa02469/custom-scripts/MikeH'

## Output filepath
cd <working_directory>/WTS/
mkdir -p 3.virus_prediction/4.summary_tables

# Run script: individual samples
for i in {1..13}; do
    ${script_path}/summarise_viral_contigs.py \
    --blastn_nt 3.virus_prediction/3.blast/nt/S${i}_nt.txt \
    --blastx_nr 3.virus_prediction/3.blast/nr/S${i}_nr.txt \
    --blastx_rdrp 3.virus_prediction/3.blast/rdrp/S${i}_RdRp.txt \
    --vibrant 3.virus_prediction/1.vibrant/individual_assemblies/VIBRANT_S${i}.scaffolds_reformated/VIBRANT_results_S${i}.scaffolds_reformated/VIBRANT_summary_results_S${i}.scaffolds_reformated.tsv \
    --virsorter2 3.virus_prediction/2.virsorter2/individual_assemblies/S${i}/S${i}-final-viral-score.tsv \
    --out_prefix 3.virus_prediction/4.summary_tables/S${i}_summary_table
done

# Coassembly
${script_path}/summarise_viral_contigs.py \
    --blastn_nt 3.virus_prediction/3.blast/nt/coassembly_nt.txt \
    --blastx_nr 3.virus_prediction/3.blast/nr/coassembly_nr.txt \
    --blastx_rdrp 3.virus_prediction/3.blast/rdrp/coassembly_RdRp.txt \
    --vibrant 3.virus_prediction/1.vibrant/coassembly/VIBRANT_coassembly.scaffolds_reformated/VIBRANT_results_coassembly.scaffolds_reformated/VIBRANT_summary_results_coassembly.scaffolds_reformated.tsv \
    --virsorter2 3.virus_prediction/2.virsorter2/coassembly/coassembly-final-viral-score.tsv \
    --out_prefix 3.virus_prediction/4.summary_tables/coassembly_summary_table

# Mini assemblies
${script_path}/summarise_viral_contigs.py \
    --blastn_nt 3.virus_prediction/3.blast/nt/healthy_coassembly_nt.txt \
    --blastx_nr 3.virus_prediction/3.blast/nr/healthy_coassembly_nr.txt \
    --blastx_rdrp 3.virus_prediction/3.blast/rdrp/healthy_coassembly_RdRp.txt \
    --vibrant 3.virus_prediction/1.vibrant/healthy/VIBRANT_healthy_coassembly.scaffolds_reformated/VIBRANT_results_healthy_coassembly.scaffolds_reformated/VIBRANT_summary_results_healthy_coassembly.scaffolds_reformated.tsv \
    --virsorter2 3.virus_prediction/2.virsorter2/healthy/healthy-final-viral-score.tsv \
    --out_prefix 3.virus_prediction/4.summary_tables/healthy_coassembly_summary_table

${script_path}/summarise_viral_contigs.py \
    --blastn_nt 3.virus_prediction/3.blast/nt/diseased_coassembly_nt.txt \
    --blastx_nr 3.virus_prediction/3.blast/nr/diseased_coassembly_nr.txt \
    --blastx_rdrp 3.virus_prediction/3.blast/rdrp/diseased_coassembly_RdRp.txt \
    --vibrant 3.virus_prediction/1.vibrant/diseased/VIBRANT_diseased_coassembly.scaffolds_reformated/VIBRANT_results_diseased_coassembly.scaffolds_reformated/VIBRANT_summary_results_diseased_coassembly.scaffolds_reformated.tsv \
    --virsorter2 3.virus_prediction/2.virsorter2/diseased/diseased-final-viral-score.tsv \
    --out_prefix 3.virus_prediction/4.summary_tables/diseased_coassembly_summary_table

***

## Generate per-sample dereplicated fasta file of putative viral hits

`virome_per_sample_derep.py` generated by Michael Hoggard

- Creates fasta file combining results from all tools (based on `summary_table_VIRUSES.txt` generated above)
    - Reads in assembly fasta file and writes out any found in summary table contig_ID to sampleX_derep.fasta
    - (If provided) Reads in VIBRANT fasta output file and appends any *prophage* (i.e. IDs containing 'fragment') found in summary table contig_ID to sampleX_derep.fasta
    - (If provided) Reads in VirSorter2 fasta output file and appends any *prophage* (i.e. IDs containing 'partial') found in summary table contig_ID to sampleX_derep.fasta
    - Filters out any included full contigs that were also identified as (excised) prophage via VIBRANT or VirSorter2
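
The selection logic listed above can be illustrated with plain dictionaries of `{contig_ID: sequence}`. This is a sketch of the described behaviour, not the code of `virome_per_sample_derep.py`. In particular, the way a prophage ID is mapped back to its parent contig (`<contig>_fragment_<n>` for VIBRANT, `<contig>||<n>_partial` for VirSorter2) is an assumption about the ID formats, and all function names are hypothetical.

```python
def parent_id(prophage_id):
    """Recover the full-contig ID from an excised prophage ID (assumed ID formats)."""
    # Assumption: VIBRANT fragments are named '<contig>_fragment_<n>'
    if "_fragment_" in prophage_id:
        return prophage_id.split("_fragment_")[0]
    # Assumption: VirSorter2 partial hits are named '<contig>||<n>_partial'
    return prophage_id.split("||")[0]

def per_sample_derep(assembly, summary_ids, vibrant=None, virsorter2=None):
    """Combine full contigs and excised prophage, dropping full contigs that were excised."""
    # 1. Full contigs from the assembly that are listed in the summary table
    selected = {cid: seq for cid, seq in assembly.items() if cid in summary_ids}
    # 2. Prophage sequences from each tool that are listed in the summary table
    prophage = {}
    for records, marker in ((vibrant or {}, "fragment"), (virsorter2 or {}, "partial")):
        for cid, seq in records.items():
            if marker in cid and cid in summary_ids:
                prophage[cid] = seq
    # 3. Remove full contigs that are already represented by an excised prophage
    excised_parents = {parent_id(cid) for cid in prophage}
    result = {cid: seq for cid, seq in selected.items() if cid not in excised_parents}
    result.update(prophage)
    return result

# Hypothetical toy data
assembly = {"c1": "AAAA", "c2": "CCCCGGGG", "c3": "TTTT"}
summary_ids = {"c1", "c2", "c2_fragment_1"}
vibrant = {"c2_fragment_1": "CCGG"}
print(sorted(per_sample_derep(assembly, summary_ids, vibrant=vibrant)))
```

In this toy case `c2` is dropped in favour of its excised fragment, `c3` is absent because it never appeared in the summary table, and `c1` is kept as a full contig.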

***Run virome_per_sample_derep.py***

In [None]:
# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a

# Script path
script_path='/nesi/project/uoa02469/custom-scripts/MikeH'

# Directories
cd <working_directory>/WTS/
mkdir -p 3.virus_prediction/5.perSample_derep

# Run for individual samples
for i in {1..13}; do
    ${script_path}/virome_per_sample_derep.py \
    --assembly_fasta 3.virus_prediction/0.reformatted_fasta/S${i}.scaffolds_reformated.fa \
    --summary_table 3.virus_prediction/4.summary_tables/S${i}_summary_table_VIRUSES.txt \
    --vibrant 3.virus_prediction/1.vibrant/VIBRANT_S${i}.scaffolds_reformated/VIBRANT_phages_S${i}.scaffolds_reformated/S${i}.scaffolds_reformated.phages_combined.fna \
    --virsorter2 3.virus_prediction/2.virsorter2/S${i}/S${i}-final-viral-combined.fa \
    --output 3.virus_prediction/5.perSample_derep/S${i}_perSample_derep.fasta
done


# Coassembly
${script_path}/virome_per_sample_derep.py \
    --assembly_fasta 3.virus_prediction/0.reformatted_spades_fasta/coassembly_scaffolds_reformated.fa \
    --summary_table 3.virus_prediction/4.summary_tables/coassembly_summary_table_VIRUSES.txt \
    --vibrant 3.virus_prediction/1.vibrant/coassembly/VIBRANT_coassembly_scaffolds_reformated/VIBRANT_phages_coassembly_scaffolds_reformated/coassembly_scaffolds_reformated.phages_combined.fna \
    --virsorter2 3.virus_prediction/2.virsorter2/coassembly/coassembly-final-viral-combined.fa \
    --output 3.virus_prediction/5.perSample_derep/coassembly_perSample_derep.fasta

# Mini assemblies
${script_path}/virome_per_sample_derep.py \
    --assembly_fasta 3.virus_prediction/0.reformatted_spades_fasta/healthy_coassembly_scaffolds_reformated.fa \
    --summary_table 3.virus_prediction/4.summary_tables/healthy_coassembly_summary_table_VIRUSES.txt \
    --vibrant 3.virus_prediction/1.vibrant/healthy/VIBRANT_healthy_coassembly_scaffolds_reformated/VIBRANT_phages_healthy_coassembly_scaffolds_reformated/healthy_coassembly_scaffolds_reformated.phages_combined.fna \
    --virsorter2 3.virus_prediction/2.virsorter2/healthy/healthy-final-viral-combined.fa \
    --output 3.virus_prediction/5.perSample_derep/healthy_coassembly_perSample_derep.fasta

${script_path}/virome_per_sample_derep.py \
    --assembly_fasta 3.virus_prediction/0.reformatted_spades_fasta/diseased_coassembly_scaffolds_reformated.fa \
    --summary_table 3.virus_prediction/4.summary_tables/diseased_coassembly_summary_table_VIRUSES.txt \
    --vibrant 3.virus_prediction/1.vibrant/diseased/VIBRANT_diseased_coassembly_scaffolds_reformated/VIBRANT_phages_diseased_coassembly_scaffolds_reformated/diseased_coassembly_scaffolds_reformated.phages_combined.fna \
    --virsorter2 3.virus_prediction/2.virsorter2/diseased/diseased-final-viral-combined.fa \
    --output 3.virus_prediction/5.perSample_derep/diseased_coassembly_perSample_derep.fasta

***

## CheckV per sample

More info about checkV [here](https://bitbucket.org/berkeleylab/checkv/src/master/)

Run CheckV on the output from the per-sample dereplication (contigs identified by any of the methods) to obtain quality information.


***CheckV: individual assemblies***

`sbatch scripts/wts_4_checkv_perSample.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_4_checkv_perSample
#SBATCH --time 00:10:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --array=1-13
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_4_checkv_perSample_%a_%j.err
#SBATCH -o slurm_output/wts_4_checkv_perSample_%a_%j.out

# Set up working directories
cd <working_directory>/WTS/
mkdir -p 3.virus_prediction/6.perSample_checkv

# Load module(s)
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Run checkv
checkv_in="3.virus_prediction/5.perSample_derep/S${SLURM_ARRAY_TASK_ID}_perSample_derep.fasta"
checkv_out="3.virus_prediction/6.perSample_checkv/S${SLURM_ARRAY_TASK_ID}_perSample_derep_checkv_out"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet

***CheckV: coassemblies***

`sbatch scripts/wts_4_checkv_perSample_coassemblies.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_4_checkv_perSample_coassemblies
#SBATCH --time 01:00:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_4_checkv_perSample_coassemblies_%j.err
#SBATCH -o slurm_output/wts_4_checkv_perSample_coassemblies_%j.out

# Set up working directories
cd <working_directory>/WTS/

# Load module(s)
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Run checkv
checkv_in="3.virus_prediction/5.perSample_derep/coassembly_perSample_derep.fasta"
checkv_out="3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet

checkv_in="3.virus_prediction/5.perSample_derep/healthy_coassembly_perSample_derep.fasta"
checkv_out="3.virus_prediction/6.perSample_checkv/healthy_coassembly_perSample_derep_checkv_out"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet

checkv_in="3.virus_prediction/5.perSample_derep/diseased_coassembly_perSample_derep.fasta"
checkv_out="3.virus_prediction/6.perSample_checkv/diseased_coassembly_perSample_derep_checkv_out"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet

***CheckV perSample: Concatenate output fasta files for downstream use (viruses.fna and proviruses.fna)***

Also modify prophage contig headers for downstream use

In [None]:
cd <working_directory>/WTS/

# Individual assemblies
for i in {1..13}; do
    # concatenate viruses and prophage files
    cat 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/viruses.fna 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/proviruses.fna > 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/cat_checkv_out.fasta 
    # modify checkv prophage contig headers
    sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/cat_checkv_out.fasta
done

# Coassembly
cat 3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/viruses.fna 3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/proviruses.fna > 3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta
# modify checkv prophage contig headers
sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta

# Mini assemblies
for j in healthy diseased; do
    cat 3.virus_prediction/6.perSample_checkv/${j}_coassembly_perSample_derep_checkv_out/viruses.fna 3.virus_prediction/6.perSample_checkv/${j}_coassembly_perSample_derep_checkv_out/proviruses.fna > 3.virus_prediction/6.perSample_checkv/${j}_coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta 
    # modify checkv prophage contig headers
    sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 3.virus_prediction/6.perSample_checkv/${j}_coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta
done
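
To make the effect of the five `sed` expressions above easier to see, the same substitutions can be mirrored in Python and applied to a single header. This is an illustration only: the example header is hypothetical and assumes a provirus header of the form `<contig>_<n> <start>-<end>/<length>`. The function expects a line without its trailing newline, as `sed` sees it.

```python
import re

def rewrite_header(line):
    """Apply the same substitutions as the sed command, in the same order."""
    line = re.sub(r"\s", "__excised_start_", line)  # s/\s/__excised_start_/g
    line = line.replace("-", "_end_")               # s/-/_end_/g
    line = line.replace("/", "_len_")               # s/\//_len_/g
    line = line.replace("|", "_", 1)                # s/|/_/   (first occurrence only)
    line = line.replace("|", "")                    # s/|//g   (any remaining pipes)
    return line

# Hypothetical provirus header
print(rewrite_header(">contig_7_1 100-2000/5000"))
# >contig_7_1__excised_start_100_end_2000_len_5000
```

Note that the substitutions are not restricted to provirus headers: any hyphen, slash, pipe, or whitespace in any header of `cat_checkv_out.fasta` is rewritten in the same way.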

***Add checkv results to summary_table_VIRUSES.txt***

In [None]:
cd <working_directory>/WTS/

# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a
python3
import pandas as pd
import numpy as np

# Loop through all individual sample summary_tables, add key checkv results columns and write out.
for i in range(1, 14):
    # Import summary table (manually set dtypes)
    summary_table = pd.read_csv('3.virus_prediction/4.summary_tables/S'+str(i)+'_summary_table_VIRUSES.txt', sep='\t', 
                                dtype={'contig_ID': str, 'nt_contig_ID': str, 'nt_sacc': str, 'nt_stitle': str, 'nt_pident': float, 
                                       'nt_hit_length': float, 'nt_evalue': float, 'nt_Kingdom': str, 
                                       'nr_contig_ID': str, 'length': float, 'nr_stitle': str, 'nr_pident': float, 
                                       'nr_hit_length': float, 'nr_evalue': float, 'nr_Kingdom': str, 
                                       'rdrp_contig_ID': str, 'rdrp_stitle':str, 'rdrp_pident': float, 
                                       'rdrp_hit_length': float, 'rdrp_evalue': float, 
                                       'vibrant_contig_ID': str, 'vibrant_total_genes': float, 
                                       'vibrant_KEGG_genes': float, 'vibrant_KEGG_v_score': float, 'vibrant_Pfam_genes': float, 
                                       'vibrant_Pfam_v_score': float, 'vibrant_VOG_genes': float, 'vibrant_VOG_v_score': float, 
                                       'vsort2_contig_ID': str, 'vsort2_max_score': float, 'vsort2_max_score_group': str, 
                                       'vsort2_hallmark_genes': float, 'vsort2_viral_component': float, 'vsort2_cellular_component': float})        
    # Import checkv summary results
    checkv = pd.read_csv('3.virus_prediction/6.perSample_checkv/S'+str(i)+'_perSample_derep_checkv_out/quality_summary.tsv', sep='\t', 
                        dtype={'contig_id': str, 'contig_length': float, 'provirus': str, 'proviral_length': float, 
                               'gene_count': float, 'viral_genes': float, 'host_genes': float, 'checkv_quality': str, 
                               'miuvig_quality': str, 'completeness': float, 'completeness_method': str, 
                               'contamination': float, 'kmer_freq': float, 'warnings': str})
    checkv = checkv.add_prefix('checkv_')
    checkv['contig_ID'] = checkv['checkv_contig_id']
    # Merge with summary_table
    summary_table = pd.merge(summary_table, checkv, left_on="contig_ID", right_on="contig_ID", how='outer')
    # Output summary table
    summary_table.to_csv('3.virus_prediction/4.summary_tables/S'+str(i)+'_summary_table_VIRUSES_checkv.txt', sep='\t', index=False)



# Coassembly
summary_table = pd.read_csv('3.virus_prediction/4.summary_tables/coassembly_summary_table_VIRUSES.txt', sep='\t', 
                            dtype={'contig_ID': str, 'nt_contig_ID': str, 'nt_sacc': str, 'nt_stitle': str, 'nt_pident': float, 
                                   'nt_hit_length': float, 'nt_evalue': float, 'nt_Kingdom': str, 
                                   'nr_contig_ID': str, 'length': float, 'nr_stitle': str, 'nr_pident': float, 
                                   'nr_hit_length': float, 'nr_evalue': float, 'nr_Kingdom': str, 
                                   'rdrp_contig_ID': str, 'rdrp_stitle':str, 'rdrp_pident': float, 
                                   'rdrp_hit_length': float, 'rdrp_evalue': float, 
                                   'vibrant_contig_ID': str, 'vibrant_total_genes': float, 
                                   'vibrant_KEGG_genes': float, 'vibrant_KEGG_v_score': float, 'vibrant_Pfam_genes': float, 
                                   'vibrant_Pfam_v_score': float, 'vibrant_VOG_genes': float, 'vibrant_VOG_v_score': float, 
                                   'vsort2_contig_ID': str, 'vsort2_max_score': float, 'vsort2_max_score_group': str, 
                                   'vsort2_hallmark_genes': float, 'vsort2_viral_component': float, 'vsort2_cellular_component': float})        

# Import checkv summary results
checkv = pd.read_csv('3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/quality_summary.tsv', sep='\t', 
                    dtype={'contig_id': str, 'contig_length': float, 'provirus': str, 'proviral_length': float, 
                           'gene_count': float, 'viral_genes': float, 'host_genes': float, 'checkv_quality': str, 
                           'miuvig_quality': str, 'completeness': float, 'completeness_method': str, 
                           'contamination': float, 'kmer_freq': float, 'warnings': str})

# Add prefix to checkv column headers
checkv = checkv.add_prefix('checkv_')

# Add 'contig_ID' column to checkv
checkv['contig_ID'] = checkv['checkv_contig_id']

# Merge with summary_table
summary_table = pd.merge(summary_table, checkv, left_on="contig_ID", right_on="contig_ID", how='outer')

# Output summary table
summary_table.to_csv('3.virus_prediction/4.summary_tables/coassembly_summary_table_VIRUSES_checkv.txt', sep='\t', index=False)




# Mini assemblies
for j in ['healthy', 'diseased']:
    summary_table = pd.read_csv('3.virus_prediction/4.summary_tables/'+str(j)+'_coassembly_summary_table_VIRUSES.txt', sep='\t', 
                                dtype={'contig_ID': str, 'nt_contig_ID': str, 'nt_sacc': str, 'nt_stitle': str, 'nt_pident': float, 
                                    'nt_hit_length': float, 'nt_evalue': float, 'nt_Kingdom': str, 
                                    'nr_contig_ID': str, 'length': float, 'nr_stitle': str, 'nr_pident': float, 
                                    'nr_hit_length': float, 'nr_evalue': float, 'nr_Kingdom': str, 
                                    'rdrp_contig_ID': str, 'rdrp_stitle':str, 'rdrp_pident': float, 
                                    'rdrp_hit_length': float, 'rdrp_evalue': float, 
                                    'vibrant_contig_ID': str, 'vibrant_total_genes': float, 
                                    'vibrant_KEGG_genes': float, 'vibrant_KEGG_v_score': float, 'vibrant_Pfam_genes': float, 
                                    'vibrant_Pfam_v_score': float, 'vibrant_VOG_genes': float, 'vibrant_VOG_v_score': float, 
                                    'vsort2_contig_ID': str, 'vsort2_max_score': float, 'vsort2_max_score_group': str, 
                                    'vsort2_hallmark_genes': float, 'vsort2_viral_component': float, 'vsort2_cellular_component': float})        

    # Import checkv summary results
    checkv = pd.read_csv('3.virus_prediction/6.perSample_checkv/'+str(j)+'_coassembly_perSample_derep_checkv_out/quality_summary.tsv', sep='\t', 
                        dtype={'contig_id': str, 'contig_length': float, 'provirus': str, 'proviral_length': float, 
                            'gene_count': float, 'viral_genes': float, 'host_genes': float, 'checkv_quality': str, 
                            'miuvig_quality': str, 'completeness': float, 'completeness_method': str, 
                            'contamination': float, 'kmer_freq': float, 'warnings': str})

    # Add prefix to checkv column headers
    checkv = checkv.add_prefix('checkv_')

    # Add 'contig_ID' column to checkv
    checkv['contig_ID'] = checkv['checkv_contig_id']

    # Merge with summary_table
    summary_table = pd.merge(summary_table, checkv, left_on="contig_ID", right_on="contig_ID", how='outer')

    # Output summary table
    summary_table.to_csv('3.virus_prediction/4.summary_tables/'+str(j)+'_coassembly_summary_table_VIRUSES_checkv.txt', sep='\t', index=False)


***OPTIONAL: Add sampleIDs to contig IDs***

Add sample IDs to all contig IDs so that each contig making up a dereplicated contig can be traced back to the sample it was derived from during downstream analyses.

In [None]:
cd <working_directory>/WTS/

for i in {1..13}; do
    sampleID="S${i}"
    sed -i -e "s/>/>${sampleID}_/g" 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/cat_checkv_out.fasta
done

***Find retroviral contigs***

In [None]:
cd <working_directory>/WTS/

for file in 3.virus_prediction/4.summary_tables/*_summary_table_VIRUSES_checkv.txt; do
    echo ${file}
    # -i so that capitalised names (e.g. 'Retroviridae') are also matched
    grep -in 'retro' ${file} >> retro.txt
done

***

## Dereplication of contigs across all samples

- Dereplicate contigs via Sullivan group's `Cluster_genomes_5.1.pl` approach (mummer) to generate 'viral OTUs' (vOTUs)
    - From Roux et al. (2017) Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity: "Contigs from all samples were clustered with nucmer (Delcher, Salzberg & Phillippy, 2003) at >95% ANI across >80% of their lengths, as in (Brum et al., 2015; Gregory et al., 2016), to generate a pool of non-redundant population contigs."
    - **Note that the above quote was for DNA viruses**
- Requires Mummer v4.x. 

For more info on Cluster_genomes_5.1.pl see [here](https://github.com/simroux/ClusterGenomes)
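
The clustering criterion described above (95% identity over 80% of the shorter contig) can be written as a small predicate. This is a sketch of the threshold test only, under the assumption that coverage is measured against the shorter of the two contigs; it is not how `Cluster_genomes_5.1.pl` is implemented, and the function name and numbers are hypothetical. The comparison is written as "at least", whereas the quoted paper states ">95%" and ">80%".

```python
def same_votu(identity, aligned_len, len_a, len_b, min_identity=95.0, min_cov=80.0):
    """True when an alignment meets both thresholds, relative to the shorter contig."""
    # Percentage of the shorter contig that is covered by the alignment
    coverage = 100.0 * aligned_len / min(len_a, len_b)
    return identity >= min_identity and coverage >= min_cov

# 4,300 bp aligned at 97.2% identity between a 5 kb and a 12 kb contig: 86% coverage
print(same_votu(97.2, 4300, 5000, 12000))  # True
# Only 3,500 bp aligned: 70% coverage of the shorter contig
print(same_votu(97.2, 3500, 5000, 12000))  # False
```

The defaults correspond to the `-i 95` and `-c 80` options passed to the clustering script in the job below.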

***Prep: download Cluster_genomes_5.1.pl***

In [None]:
mkdir -p <working_directory>/WTS/Software
cd <working_directory>/WTS/Software

# Download the script and set executable permissions
wget https://raw.githubusercontent.com/simroux/ClusterGenomes/master/Cluster_genomes_5.1.pl
chmod 755 <working_directory>/WTS/Software/Cluster_genomes_5.1.pl

# Install autodie perl module (dependency for Cluster_genomes.pl)
cpan autodie

***Prep: Concatenate fasta files together for cluster_genomes.pl***

In [None]:
cd <working_directory>/WTS/
mkdir -p 4a.cluster_genomes/

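# Create (or empty) the output files so that re-running this cell does not append duplicates
> 4a.cluster_genomes/cat_All_indv_Samples.fasta
> 4a.cluster_genomes/cat_HD_Samples.fasta
> 4a.cluster_genomes/cat_All_coassemblies.fasta
> 4a.cluster_genomes/cat_AllSamples.fasta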

# Concatenate all samples into output file
for i in {1..13}; do
    cat 3.virus_prediction/6.perSample_checkv/S${i}_perSample_derep_checkv_out/cat_checkv_out.fasta >> 4a.cluster_genomes/cat_All_indv_Samples.fasta
done

for j in healthy diseased; do
    cat 3.virus_prediction/6.perSample_checkv/${j}_coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta >> 4a.cluster_genomes/cat_HD_Samples.fasta
done

cat 3.virus_prediction/6.perSample_checkv/coassembly_perSample_derep_checkv_out/cat_checkv_out.fasta 4a.cluster_genomes/cat_HD_Samples.fasta >> 4a.cluster_genomes/cat_All_coassemblies.fasta

cat 4a.cluster_genomes/cat_All_coassemblies.fasta 4a.cluster_genomes/cat_All_indv_Samples.fasta >> 4a.cluster_genomes/cat_AllSamples.fasta


***Optional: filter out short contigs (e.g. contigs < 500 bp)***

In [None]:
cd <working_directory>/WTS/

# load seqmagick
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

# Filter out short contigs 
seqmagick convert --min-length 500 4a.cluster_genomes/cat_AllSamples.fasta 4a.cluster_genomes/cat_AllSamples.m500.fasta

# Compare total number of contigs for each
grep -c ">" 4a.cluster_genomes/cat_AllSamples.fasta

grep -c ">" 4a.cluster_genomes/cat_AllSamples.m500.fasta


***Run cluster_genomes.pl***

Run at a minimum identity of 95% over at least 80% of the shorter contig.

`sbatch scripts/wts_5_derep_clusterGenomes.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_5_derep_clusterGenomes
#SBATCH --time 00:10:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e slurm_output/wts_5_derep_clusterGenomes_%j.err
#SBATCH -o slurm_output/wts_5_derep_clusterGenomes_%j.out

# Set up working directories
cd <working_directory>/WTS/4a.cluster_genomes

# Set variables
module purge
cluster_genomes_path='/nesi/nobackup/uoa03068/Annie/RNA_seq/Software'
mummer_path='/nesi/project/uoa02469/Software/mummer_v4.0.0/bin/'

# Run for all contigs
${cluster_genomes_path}/Cluster_genomes_5.1.pl \
-f cat_AllSamples.fasta \
-d ${mummer_path} \
-t 32 \
-c 80 \
-i 95

***Check total number of clustered contigs (vOTUs) remaining***

In [None]:
cd <working_directory>/WTS/4a.cluster_genomes

grep -c ">" cat_AllSamples_95-80.fna

***

## Modify derep contig headers to be `vOTU_n`

- CheckV (run downstream) fails with contig headers longer than 10,000 characters. The previous dereplication steps can produce headers that exceed this limit, causing CheckV to fail.
- Replace all headers with `vOTU_n` and create a lookup table file of `vOTU_n` IDs against the full contig headers (for later reference).

**Python script to do the above:**

In [None]:
cd <working_directory>/WTS/4a.cluster_genomes

# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a
python3

from Bio.SeqIO.FastaIO import SimpleFastaParser

fasta_in = 'cat_AllSamples_95-80.fna'
fasta_out = 'vOTUs.fa'
lookup_table_out = 'vOTUs_virID_lookup_table.txt'

# Loop through each contig in the input fasta file:
# - replace the contig header with an incrementing vOTU_n ID
# - write the renamed contig to vOTUs.fa
# - record the vOTU_n ID against the original header in a tab-delimited lookup table
with open(fasta_in, 'r') as read_fasta, \
     open(fasta_out, 'w') as write_fasta, \
     open(lookup_table_out, 'w') as write_table:
    write_table.write("vOTU_ID\tcluster_contigID\n")
    for i, (name, seq) in enumerate(SimpleFastaParser(read_fasta), start=1):
        write_table.write(f"vOTU_{i}\t{name}\n")
        write_fasta.write(f">vOTU_{i}\n{seq}\n")

quit()
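
To recover the original contig header for a given vOTU later on, the lookup table can be read back into a dictionary. A minimal sketch, assuming the two-column, tab-delimited layout with a header row written above; `parse_lookup` is a hypothetical helper name.

```python
def parse_lookup(lines):
    """Map each vOTU ID to its original (full) contig header."""
    mapping = {}
    for n, line in enumerate(lines):
        line = line.rstrip("\n")
        if n == 0 or not line:
            continue  # skip the header row and any blank lines
        votu_id, contig_id = line.split("\t", 1)
        mapping[votu_id] = contig_id
    return mapping


# Made-up example rows in the same layout as vOTUs_virID_lookup_table.txt
example = [
    "vOTU_ID\tcluster_contigID\n",
    "vOTU_1\tS1_contig_12 length=2310\n",
    "vOTU_2\tS4_contig_7\n",
]
lookup = parse_lookup(example)
print(lookup["vOTU_1"])
```

For the real file, pass `open('vOTUs_virID_lookup_table.txt')` in place of `example`.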

***

**Downstream vOTU processing:**

Re-run vOTUs through the blast searches, checkv, and generate new summary table (seemed like the easiest way to get the blast search info back in the mix after generating the clustered vOTU contig set).

***

## vOTUs: 3 x blast searches

Re-run 3 x blast searches (nr, nt, rdrp) for the dereplicated viral contig sets (vOTUs).

***RdRP blastx search: vOTUs***

In [None]:
# Set up working directory
cd <working_directory>/WTS/4a.cluster_genomes
mkdir -p blast

# Load module(s)
module purge
module load DIAMOND/0.9.32-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/viral_rdrp_protein_database/viral_rdrp.dmnd'

# Run diamond blastx search
diamond blastx \
-q vOTUs.fa \
-d ${search_db} \
-o blast/vOTUs_RdRp.txt \
-e 1E-5 -k 3 -p 6 -f 6 qseqid qlen sseqid stitle pident length evalue --more-sensitive

***nr blastx search: vOTUs***

`sbatch scripts/wts_5_vOTUs_blast_nr.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_5_vOTUs_blast_nr
#SBATCH --time 07:00:00
#SBATCH --mem=20GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_5_vOTUs_blast_nr_%j.err
#SBATCH -o slurm_output/wts_5_vOTUs_blast_nr_%j.out

# Change to working directory
cd <working_directory>/WTS/4a.cluster_genomes

# Load module(s)
module purge
module load DIAMOND/2.0.6-GCC-9.2.0

# Database variable
search_db='/nesi/project/uoa02469/Databases/NCBI_20210823_dmnd/nr_wTax_FULL.dmnd'

# Run diamond blastx search
srun diamond blastx \
-q vOTUs.fa \
-d ${search_db} \
-o blast/vOTUs_nr.txt \
-e 1E-5 -k 3 -p 24 -f 6 qseqid qlen sseqid stitle pident length evalue sskingdoms --more-sensitive

***nt blastn search: vOTUs***

`sbatch scripts/wts_5_vOTUs_blast_nt.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_5_vOTUs_blast_nt
#SBATCH --time 01:00:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e slurm_output/wts_5_vOTUs_blast_nt_%j.err
#SBATCH -o slurm_output/wts_5_vOTUs_blast_nt_%j.out

# Change to working directory
cd <working_directory>/WTS/4a.cluster_genomes

# Load module(s)
module purge
module load BLAST/2.10.0-GCC-9.2.0
module load BLASTDB/2021-01

# Database variable
search_db=nt

# Run blastn search
blastn \
-query vOTUs.fa \
-db ${search_db} \
-out blast/vOTUs_nt.txt \
-max_target_seqs 5 -num_threads 24 -evalue 1E-10 \
-outfmt "6 qseqid sacc salltitles pident length evalue sskingdoms"
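
All three searches write headerless, tab-delimited tables whose columns follow the field list given on the command line. A minimal sketch for reducing such a table to one hit per vOTU (lowest e-value, first hit kept on ties) is shown below; the selection rule and the name `best_hits` are assumptions for illustration, not part of the original workflow.

```python
# Column order as requested in the RdRp search above
RDRP_COLUMNS = ["qseqid", "qlen", "sseqid", "stitle", "pident", "length", "evalue"]


def best_hits(lines, columns):
    """Return {qseqid: row dict} keeping the lowest e-value hit per query."""
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        row = dict(zip(columns, line.split("\t")))
        current = best.get(row["qseqid"])
        if current is None or float(row["evalue"]) < float(current["evalue"]):
            best[row["qseqid"]] = row
    return best


# Made-up example rows
example = [
    "vOTU_1\t2300\tprotA\tRdRp [virus A]\t45.2\t310\t1e-20\n",
    "vOTU_1\t2300\tprotB\tRdRp [virus B]\t61.0\t420\t3e-55\n",
    "vOTU_2\t900\tprotC\tRdRp [virus C]\t38.9\t150\t2e-08\n",
]
top = best_hits(example, RDRP_COLUMNS)
print(top["vOTU_1"]["sseqid"], top["vOTU_2"]["sseqid"])
```

For the nr and nt outputs, pass the matching column list from the corresponding `-f 6 ...` or `-outfmt "6 ..."` string.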

***

## vOTUs: checkV

Re-run CheckV on vOTUs to output checkv stats on the final clustered/dereplicated contig set.

`sbatch scripts/wts_5_vOTUs_checkv.sl`

In [None]:
#!/bin/bash -e
#SBATCH -A uoa03068
#SBATCH -J wts_5_vOTUs_checkv
#SBATCH --time 00:10:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e slurm_output/wts_5_vOTUs_checkv_%j.err
#SBATCH -o slurm_output/wts_5_vOTUs_checkv_%j.out

# Set up working directories
cd <working_directory>/WTS/
mkdir -p 4b.checkv_analyses_vOTUs

# Load module(s)
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Run main analyses 
checkv_in="4a.cluster_genomes/vOTUs.fa"
checkv_out="4b.checkv_analyses_vOTUs/checkv_out_vOTUs"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet

**CheckV: Concatenate output fasta files (viruses.fna and proviruses.fna)**

In [None]:
cd <working_directory>/WTS/

# Concatenate viruses and prophage files
cat 4b.checkv_analyses_vOTUs/checkv_out_vOTUs/viruses.fna  4b.checkv_analyses_vOTUs/checkv_out_vOTUs/proviruses.fna >  4b.checkv_analyses_vOTUs/cat_checkv_out_vOTUs.fasta 
# Modify checkv prophage contig headers
sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 4b.checkv_analyses_vOTUs/cat_checkv_out_vOTUs.fasta
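
To make the effect of the `sed` expressions easier to follow, here is a line-by-line Python equivalent applied to a single header. The example header (`ID start-end/length`) is an assumed illustration of a CheckV provirus header, and `rewrite_header` is a hypothetical name.

```python
import re


def rewrite_header(line):
    """Apply the same five substitutions as the sed command, in order."""
    line = re.sub(r"\s", "__excised_start_", line)  # s/\s/__excised_start_/g
    line = line.replace("-", "_end_")               # s/-/_end_/g
    line = line.replace("/", "_len_")               # s/\//_len_/g
    line = line.replace("|", "_", 1)                # s/|/_/   (first pipe only)
    line = line.replace("|", "")                    # s/|//g   (remaining pipes)
    return line


# Assumed example header (without the trailing newline)
print(rewrite_header(">vOTU_12_1 1-5000/12000"))
# prints ">vOTU_12_1__excised_start_1_end_5000_len_12000"
```

Sequence lines contain none of these characters, so they pass through unchanged. Note that the hyphen substitution applies to every hyphen on a header line, which is harmless here because `vOTU_n` IDs contain none.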

***

## Optional: Further filtering of clustered contigs (vOTU)

Filter (loosely) based on Sullivan/Roux groups' curation thresholds.

Based on info on [this protocols page](https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v3?step=4) (as of 24 Aug 2021)

- n.b. omit inclusion of VirSorter2 score and hallmark genes thresholds for filtering
- keep any with: (viral gene > 0 OR (viral gene == 0 AND host gene == 0))
- Downstream (e.g. in R) add manual checks for annotations of concern from [this list](https://bitbucket.org/MAVERICLab/virsorter2-sop/raw/03b8f28bee979e2b7fd99d7375d915c29c938339/resource/suspicious-gene.list), based on vibrant and/or DRAM-v annotations.
- Downstream, you can also choose whether to further quality filter (e.g. to remove "short" contigs, or those categorised by checkv as 'Low-quality')
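
The keep rule above can be sketched as follows. This is an illustration only, not the code of `checkv_filter_contigs.py`; it assumes the `contig_id`, `viral_genes` and `host_genes` columns of CheckV's `quality_summary.tsv`, and the function names are hypothetical.

```python
def keep_contig(viral_genes, host_genes):
    """Keep if viral gene > 0 OR (viral gene == 0 AND host gene == 0)."""
    return viral_genes > 0 or (viral_genes == 0 and host_genes == 0)


def filter_quality_summary(lines):
    """Return the contig IDs that pass keep_contig, given tab-delimited lines."""
    header = lines[0].rstrip("\n").split("\t")
    kept = []
    for line in lines[1:]:
        row = dict(zip(header, line.rstrip("\n").split("\t")))
        if keep_contig(int(row["viral_genes"]), int(row["host_genes"])):
            kept.append(row["contig_id"])
    return kept


# Made-up example with only the columns needed for the rule
example = [
    "contig_id\tviral_genes\thost_genes\n",
    "vOTU_1\t2\t1\n",   # viral genes present: keep
    "vOTU_2\t0\t0\n",   # no viral and no host genes: keep
    "vOTU_3\t0\t3\n",   # host genes only: drop
]
print(filter_quality_summary(example))
```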

**checkv_filter_contigs.py: run on checkv output from vOTUs**

Outputs:

- `..._filtered.fna` : filtered fasta file
- `..._filtered_quality_summary.tsv` : filtered checkv quality_summary output file

In [None]:
# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a

# Set up working directory
cd <working_directory>/WTS/
script_path='/nesi/project/uoa02469/custom-scripts/MikeH/'

# Run for vOTUs
${script_path}/checkv_filter_contigs.py \
    --checkv_dir_input 4b.checkv_analyses_vOTUs/checkv_out_vOTUs/ \
    --output_prefix 4b.checkv_analyses_vOTUs/vOTUs

***

## WTS vOTUs read mapping: vOTU read coverage

- Optional: combine the vOTUs with the dereplicated **WGS** contigs before read mapping, to ensure reads aren't erroneously mapped to genes in viral genomes that are similar to transcripts from hosts (e.g. AMGs).
    - In this case, ignore the counts for the WGS contigs: they are included only so that transcript reads aren't erroneously mapped to putative RNA viruses. (Transcript read mapping for WGS contigs is covered in the WGS data processing workflow.)
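
When the combined reference is used, the WGS contigs need to be dropped from the coverage tables afterwards. One simple way, sketched here, is to split reference names on the `vOTU_` prefix; this assumes that no WGS contig name starts with `vOTU_`, and `split_by_prefix` is a hypothetical helper name.

```python
def split_by_prefix(names, prefix="vOTU_"):
    """Split reference names into (RNA vOTUs, everything else)."""
    rna = [name for name in names if name.startswith(prefix)]
    other = [name for name in names if not name.startswith(prefix)]
    return rna, other


# Made-up reference names as they might appear in a coverage table
names = ["vOTU_1", "wgs_contig_0001", "vOTU_2__excised_start_1_end_900_len_2000", "wgs_contig_0002"]
rna_votus, wgs_contigs = split_by_prefix(names)
print(rna_votus)
```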

In [None]:
cd <working_directory>/WTS/
mkdir -p 5.wts_read_mapping_vOTUs

# Concatenate WGS dereplicated contigs with contigs of putative RNA viruses
cat <working_directory>/WGS/2.assembly_spades/dedupe/dereplicated_contigs.fna <working_directory>/WTS/4b.checkv_analyses_vOTUs/vOTUs_filtered.fna > 5.wts_read_mapping_vOTUs/cat_RNAviralContigs_and_DNAderepContigs.fna

***WTS vOTUs: BBMap, build reference index***

`sbatch scripts/wts_6_read_mapping_vOTUs_index.sl`

In [None]:
#!/bin/bash
#SBATCH -A uoa03068
#SBATCH -J wts_6_read_mapping_vOTUs_index
#SBATCH --time 00:10:00
#SBATCH --mem 4GB
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 24
#SBATCH -e slurm_output/wts_6_read_mapping_vOTUs_index_%j.err
#SBATCH -o slurm_output/wts_6_read_mapping_vOTUs_index_%j.out

# Set up working directory
cd <working_directory>/WTS/5.wts_read_mapping_vOTUs/

# Load module(s)
module purge
module load BBMap/38.90-gimkl-2020a

# Build index
bbmap.sh -Xmx4g ref=cat_RNAviralContigs_and_DNAderepContigs.fna

***WTS vOTUs: per-sample WTS read mapping, slurm array***

`sbatch scripts/wts_6_read_mapping_vOTUs.sl`

In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J wts_6_read_mapping_vOTUs
#SBATCH --time 01:00:00
#SBATCH --mem 15GB
#SBATCH --ntasks 1
#SBATCH --array=1-13
#SBATCH --cpus-per-task 30
#SBATCH -e slurm_output/wts_6_read_mapping_vOTUs_%a_%j.err
#SBATCH -o slurm_output/wts_6_read_mapping_vOTUs_%a_%j.out

# Set up working directory
cd <working_directory>/WTS/5.wts_read_mapping_vOTUs/
mkdir -p wts/

# Load module(s)
module purge
module load BBMap/38.90-gimkl-2020a
module load SAMtools/1.10-GCC-9.2.0

# in/out variables
WTS_READ_DIR='<working_directory>/WTS/1.host_filtered'
OUTPATH='wts'

# Run read mapping
srun bbmap.sh \
t=30 -Xmx15g ambiguous=best minid=0.95 \
in1=${WTS_READ_DIR}/S${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
in2=${WTS_READ_DIR}/S${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq \
covstats=${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.covstats.txt \
statsfile=${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.statsfile.txt \
out=${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.sam

# Sort the sam file and write it out as bam
samtools sort -@ 10 -o ${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.bam ${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.sam

# Run pileup to extract read counts
pileup.sh \
in=${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.sam \
rpkm=${OUTPATH}/S${SLURM_ARRAY_TASK_ID}.wts.covstats_pileup.txt 

***

## WTS vOTUs: Per-sample coverage calculations

Per-sample coverages can be calculated at:

- genome level (e.g. across multiple binned contigs)
- contig level (e.g. for viral genomes represented by a single contig)
- gene level

`summarise_counts.py` (and associated script `summarise_counts.R`) generated by Michael Hoggard can output each of these sets of summaries depending on the input data (pileup or featurecounts) and output options.

As we're interested in RNA viruses here, we will generate contig-level coverage summaries based on the output from *pileup*.

Below, contig-level coverage summary outputs will be generated for:

- WTS read mapping (read coverage of the WTS-derived contigs by the *RNA* data)

**Preamble: summarise_counts.py arguments**

Required

- `--input (-i) counts_files.txt`: Input file(s) of counts data returned by featurecounts (gene_counts file) or pileup.sh (rpkm output files passed using wildcard).
    - Note: for pileup.sh output files, this takes a wildcard to capture all the relevant input files (e.g. `--input *rpkm.txt`)
- `--format (-f) ['featurecounts', 'pileup']` : Format of the input counts data. Must be 'featurecounts' or 'pileup'


Optional
- `--sample_mapping_file (-m) sample_mapping_file.txt` : Mapping file of unique sample identifier strings matching to sample groups (and library size (optional)).
    - Columns must be named as: `sampleID`, `group`, `lib.size` (lib.size is an optional column).
    - sampleID strings must also be present in the input files (in the counts column headers in output featurecounts file, or in the filenames of pileup.sh output files)
- `--lib_norm (-n) ['total', 'mapped']` : Set whether to (re)calculate RPKM and FPKM based on total library size or mapped reads per sample. Generally (often based on single organism transcription analyses) total mapped reads is used. However, in some instances, such as metatranscriptome data from mixed organisms, you may wish to override this with total library size (total read counts in fastq files used for read mapping). (Default = 'mapped')
- `--count_threshold (-t) numeric_threshold_value` : Set threshold to zero out low count values below the threshold. (Default = 1)
- `--read_counts (-r) summary_read_counts.tsv` : Output file name for table of read counts (per-sample total library, mapped read, and filtered mapped read (based on count_threshold) counts). (Default = 'summary_read_counts.tsv')
- `--edger_out (-e) summary_edgeR_glmQLFTest.tsv` : Output file name for summary table of EdgeR analysis of "differentially expressed genes" (DEG) across sample groups
- `--output (-o) normalised_summary_count_table.tsv` : Output file name for summary count table (default = 'normalised_summary_count_table.tsv')
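
The difference between the two `--lib_norm` settings, and the effect of `--count_threshold`, can be illustrated with the standard RPKM definition (reads per kilobase of contig per million reads). This is a sketch of the general formula under that assumption; `summarise_counts.py` may differ in detail, and the function names are hypothetical.

```python
def rpkm(read_count, contig_length_bp, library_reads):
    """Reads per kilobase of contig per million reads in the library."""
    return read_count * 1e9 / (contig_length_bp * library_reads)


def apply_count_threshold(counts, threshold=1):
    """Set counts below the threshold to zero."""
    return [count if count >= threshold else 0 for count in counts]


# Made-up numbers: 10 reads on a 1,000 bp contig,
# from a library of 1,000,000 reads of which 250,000 mapped.
print(rpkm(10, 1000, 1_000_000))  # normalised by total library size
print(rpkm(10, 1000, 250_000))    # normalised by mapped reads
print(apply_count_threshold([0, 3, 10, 25], threshold=10))
```

With the same read count, normalising by mapped reads gives a larger value than normalising by total library size whenever only part of the library maps.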


**Required prep: generate sample mapping files**

A (tab-delimited) sample mapping file is required if you wish to include TMM normalisation (as this is generated via EdgeR and requires sample grouping information).

Note:

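- For read mapping WTS data to putative RNA viruses here, you will want to include the library size of the filtered **WTS** read files (i.e. those in `<working_directory>/WTS/1.host_filtered`).
- You can count the number of reads in each fastq file by dividing its line count by four, e.g. `echo $(( $(wc -l < ${file}) / 4 ))`, then summing the R1 and R2 counts. (Avoid `grep -c '@' ${file}`: quality strings can also contain `@`, which inflates the count.)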
- Required columns (spelling of column headers matters)
    - `sampleID`: unique strings that identify each sample (these unique strings must also be present in the counts files file names)
    - `group`: treatment groups for each sample (e.g. 'treatment', 'control'; or 'groupA', 'groupB', 'groupC' etc.)
- Optional column
    - `lib.size`: total library size per sample (total read counts in the quality trimmed and filtered fastq files used for read mapping). (NOTE: this column is required if `--format featurecounts` and `--lib_norm total` (for `--format pileup`, this data is present in the input files output by pileup's `rpkm=...` option))
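
Library sizes can also be computed in Python. The sketch below counts reads as the number of lines divided by four, which assumes standard four-line, uncompressed FASTQ records; counting `@` characters instead can over-count because quality strings may contain `@`. The function names are hypothetical.

```python
def count_reads_in_lines(lines):
    """Number of reads in an iterable of FASTQ lines (four lines per read)."""
    return sum(1 for _ in lines) // 4


def count_fastq_reads(path):
    """Number of reads in an uncompressed FASTQ file."""
    with open(path) as handle:
        return count_reads_in_lines(handle)


def paired_library_size(r1_path, r2_path):
    """Total library size: R1 reads plus R2 reads."""
    return count_fastq_reads(r1_path) + count_fastq_reads(r2_path)


# Made-up example: two reads, one with a quality string starting with '@'
example = [
    "@read1\n", "ACGT\n", "+\n", "@III\n",
    "@read2\n", "GGCC\n", "+\n", "IIII\n",
]
print(count_reads_in_lines(example))
```

For a real sample, `paired_library_size` would be called with the R1 and R2 host-filtered fastq paths for that sample.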

Example mapping file:

| sampleID | group | lib.size |
| :- | -: | -: |
| Sample_1 | A | 1054234 |
| Sample_2 | A | 1543619 |
| Sample_3 | B | 1246519 |
| Sample_4 | B | 1349855 |
| Sample_5 | C | 1644875 |
| Sample_6 | C | 1422537 |
Save this sample mapping file as, e.g.:

`<working_directory>/WTS/5.wts_read_mapping_vOTUs/wts/sample_mapping_file.txt`
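
Because the column headers must be spelled exactly, it can be safer to generate the file programmatically than to type it by hand. A minimal sketch using the standard `csv` module; `write_sample_mapping` is a hypothetical helper name, and the rows shown are the first two rows of the example table above.

```python
import csv
import io


def write_sample_mapping(rows, handle):
    """Write (sampleID, group, lib.size) rows as a tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sampleID", "group", "lib.size"])
    for sample_id, group, lib_size in rows:
        writer.writerow([sample_id, group, lib_size])


# Write to an in-memory buffer here; use open(path, "w", newline="") for a real file
rows = [("Sample_1", "A", 1054234), ("Sample_2", "A", 1543619)]
buffer = io.StringIO()
write_sample_mapping(rows, buffer)
print(buffer.getvalue(), end="")
```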


***WTS vOTUs: summarise_counts.py***


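Sample mapping file used for this dataset (group A = cloacitis birds, group B = control birds):

| sampleID | group | lib.size |
| :- | -: | -: |
| S1 | A | 16260228 |
| S2 | A | 949022 |
| S3 | A | 8190158 |
| S4 | A | 628904 |
| S5 | A | 1480372 |
| S6 | A | 852728 |
| S7 | A | 3638000 |
| S8 | B | 5801764 |
| S9 | B | 5119314 |
| S10 | A | 664664 |
| S11 | A | 1485674 |
| S12 | B | 889510 |
| S13 | B | 570700 |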

In [None]:
cd <working_directory>/WTS/

# Load module(s)
module purge
module load Python/3.8.2-gimkl-2020a
module load R/3.6.2-gimkl-2020a

# Add location of summarise_counts.py and summarise_counts.R to PATH
export PATH="/nesi/project/uoa02469/custom-scripts/MikeH:$PATH"

# Run summarise_counts.py
summarise_counts.py \
    --input '5.wts_read_mapping_vOTUs/wts/*.covstats_pileup.txt' --format pileup \
    --sample_map 5.wts_read_mapping_vOTUs/wts/sample_mapping_file.txt \
    --lib_norm total \
    --count_threshold 10 \
    --read_counts 5.wts_read_mapping_vOTUs/wts/wts_summary_read_counts_RNA_vOTUs.tsv \
    --output 5.wts_read_mapping_vOTUs/wts/wts_summary_count_table_RNA_vOTUs.tsv

***