# Metafly: a whitefly metagenomics project
By Cyrielle Ndougonna \
Supervision: Ezechiel B. Tibiri, Romaric Nanema & Fidèle Tiendrébéogo

Project aims: \
O1: establish the diversity of viruses associated with whiteflies originating from two locations in Côte d'Ivoire (Bonoua and N'Djem) \
O2: catalogue the endosymbiotic bacteria associated with whiteflies originating from the two sites \
O3: characterise whitefly (_Bemisia tabaci_) genotypes circulating in the two areas

This notebook describes the steps in the bioinformatics pipeline used for the analysis of Oxford Nanopore reads generated from whitefly samples collected in Bonoua and N'Djem.
The analysis was executed on the iTrop HPC.

# A. Getting started

In [None]:
# connect to the remote server
ssh bioinfo-master1.ird.fr -l ndougonna

# check available partitions
sinfo

# launch an interactive session
srun -c12 --pty bash -i

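# connect to a specific node, e.g. node 23
# (-N sets the number of nodes, so --nodelist is used instead; the node name node23 is assumed, check it with sinfo -N)
srun --nodelist=node23 -c12 --pty bash -i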

In [None]:
# create project directory in /scratch
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq

In [None]:
# raw data is located in the following directory
ls -lh /projects/medium/whitefly_ont/FASTQ

Barcodes of interest are BC92, BC93, BC94, BC95, BC96 (Bonoua) and BC41, BC42, BC43, BC44, BC45, BC46, BC47 (N'Djem).
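
Since the same commands are repeated for each barcode, it can help to keep the barcode lists in one place. The snippet below is a minimal sketch: the variable names (`FASTQ_DIR`, `BONOUA`, `NDJEM`) and the helper `fastq_for_barcode` are hypothetical, and every file is assumed to follow the `SQK-NBD114-96_barcode<NN>.fastq` naming used for barcode 92 in this notebook.

```shell
# Hypothetical helper variables: one list of barcode numbers per site
FASTQ_DIR=/projects/medium/whitefly_ont/FASTQ
BONOUA="92 93 94 95 96"
NDJEM="41 42 43 44 45 46 47"

# Print the expected FASTQ path for a barcode number
# (assumes every file is named like the barcode92 file used in this notebook)
fastq_for_barcode() {
    printf '%s/SQK-NBD114-96_barcode%s.fastq\n' "$FASTQ_DIR" "$1"
}

# List the expected input file for every barcode of interest
for bc in $BONOUA $NDJEM; do
    fastq_for_barcode "$bc"
done
```

The printed paths can be checked against the contents of the FASTQ directory before launching any analysis.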

# B. Quality control with NanoPlot

## 1. Create working directory qc

In [None]:
# create qc directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/qc
cd /scratch/whitefly_ont_sequencing/from_fastq/qc
pwd

In [None]:
# check how many reads were generated
awk '{s++}END{print s/4}' /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq

In [None]:
# check how many bases were sequenced
# (newlines are removed before counting, otherwise wc -m adds one character per read)
seqtk seq -A /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq | grep -v ">" | tr -d '\n' | wc -m
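
Reads and bases can also be counted together in a single pass with awk. This is a minimal sketch, assuming 4-line FASTQ records (no wrapped sequence lines), which is the same assumption made by the read-count command in this section; the function name `fastq_counts` and the toy file are hypothetical.

```shell
# Count reads and bases in one pass over a FASTQ file.
# Assumes 4-line FASTQ records: the sequence is always the 2nd line of each record.
fastq_counts() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "reads=%d bases=%d\n", reads, bases }' "$1"
}

# Example on a two-read toy file (made-up data)
tmp_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmp_fastq"
fastq_counts "$tmp_fastq"    # prints: reads=2 bases=10
rm -f "$tmp_fastq"
```

On real data, the function would be called as `fastq_counts /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq`.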

## 2. Run NanoPlot

In [None]:
# load NanoPlot
module load nanoplot/1.43.0
module list

In [None]:
# print NanoPlot help menu
NanoPlot --help

In [None]:
# run NanoPlot
NanoPlot -t 8 -o /scratch/whitefly_ont_sequencing/from_fastq/qc \
            --fastq /projects/medium/whitefly_ont/FASTQ/*.fastq \
            --plots kde hex dot
# note: NanoPlot reported that hex is deprecated and must be requested with --legacy hex, which requires additional dependencies to be installed
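
As far as the NanoPlot documentation indicates, reads from all files passed to one call are pooled into a single report. One possible way to obtain a separate report per barcode (for example in a `qc/barcode92` directory) is to run NanoPlot once per file. The sketch below assumes the `SQK-NBD114-96_barcode<NN>.fastq` naming; `run`, `nanoplot_per_barcode` and the `DRY_RUN` switch are hypothetical helpers, and by default the commands are only printed.

```shell
# Sketch: one NanoPlot run per barcode, each in its own output directory.
# DRY_RUN=1 (default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
FASTQ_DIR=/projects/medium/whitefly_ont/FASTQ
QC_DIR=/scratch/whitefly_ont_sequencing/from_fastq/qc

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run (or print) one NanoPlot command per barcode number given as argument
nanoplot_per_barcode() {
    for bc in "$@"; do
        run NanoPlot -t 8 \
            --fastq "$FASTQ_DIR/SQK-NBD114-96_barcode$bc.fastq" \
            -o "$QC_DIR/barcode$bc" \
            --plots kde dot
    done
}

nanoplot_per_barcode 92 93 94 95 96
```

Printing the commands first makes it easy to check paths before spending compute time on the node.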

In [None]:
# examine QC reports
cd /scratch/whitefly_ont_sequencing/from_fastq/qc/barcode92
cat NanoStats.txt

In [None]:
# copy qc reports from node to master (-r is required to copy a directory)
cp -r /scratch/whitefly_ont_sequencing/from_fastq/qc/ /projects/medium/whitefly_ont

# download qc reports on local disk
scp -r ndougonna@bioinfo-san.ird.fr:/projects/medium/whitefly_ont/qc /Users/cyrielle_ndougonna/Desktop/WAVE/lab_management/training/minION/ont_workshop_2024_09/personal_project/results/from_fastq/

# C. Mapping

Here, the .fastq reads are mapped against the _Bemisia tabaci_ reference genome. Mapped and unmapped reads are then assembled separately: \
mapped reads: _de novo_ assembly, then taxonomic assignment using the _Bemisia tabaci_ database \
unmapped reads: _de novo_ assembly, then taxonomic assignment using the virus, bacteria and fungi databases

In [None]:
# download Bemisia tabaci reference genome
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/refseq
cd /scratch/whitefly_ont_sequencing/from_fastq/refseq
pwd

wget -r --no-parent https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_genomic.fna.gz

In [None]:
# unzip .fna.gz
gunzip /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_genomic.fna.gz

In [None]:
# move and rename reference genome to refseq directory
mv /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_genomic.fna ./b_tabaci.genomic.fna

# delete download directory
rm -r /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/

In [None]:
# load minimap2
module load minimap2/2.24
module list

In [None]:
# print minimap2 help menu
minimap2 --help

In [None]:
# create mapping directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/mapping
cd /scratch/whitefly_ont_sequencing/from_fastq/mapping
pwd

# run minimap2
minimap2 -t 8 -ax map-ont /projects/medium/whitefly_ont/refseq/b_tabaci.genomic.fna /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq -o reads_vs_bemisia_bc92.sam

In [None]:
# print mapping statistics
module load samtools/1.19.2
module list

samtools flagstat reads_vs_bemisia_bc92.sam

In [None]:
# convert .sam to .bam
samtools view -b -o reads_vs_bemisia_bc92.bam reads_vs_bemisia_bc92.sam

In [None]:
# check file sizes (the .bam should be much smaller than the .sam)
ls -alhrt reads_vs_bemisia_bc92.bam reads_vs_bemisia_bc92.sam

In [None]:
# extract mapped reads; these will be used for de novo assembly with B. tabaci database
samtools view -@ 8 -bh -F 4 reads_vs_bemisia_bc92.sam > reads_vs_bemisia_bc92_mapped.bam

In [None]:
# extract unmapped reads; these will be used for de novo assembly with viruses, fungi and bacteria databases
samtools view -@ 8 -bh -f 4 reads_vs_bemisia_bc92.sam > reads_vs_bemisia_bc92_unmapped.bam
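
Both extraction commands test the same bit of the SAM FLAG field: 0x4 marks a read as unmapped, so `-F 4` keeps records where the bit is unset and `-f 4` keeps records where it is set. The sketch below reproduces this test in plain shell and awk on a toy SAM. The function names and the records are made up; for real files, `samtools view -c -f 4` is the practical way to count.

```shell
# Return success if bit 0x4 (read unmapped) is set in a SAM FLAG value
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Count mapped and unmapped records in SAM text read from stdin
# (header lines start with "@"; FLAG is the second tab-separated column;
#  int(FLAG / 4) % 2 extracts bit 0x4 without bitwise operators)
count_mapped_unmapped() {
    awk -F '\t' '!/^@/ { if (int($2 / 4) % 2) unmapped++; else mapped++ }
                 END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}

# Toy records (made up): FLAG 0 and 16 are mapped, FLAG 4 is unmapped
printf '@HD\tVN:1.6\nr1\t0\tchr1\nr2\t16\tchr1\nr3\t4\t*\n' | count_mapped_unmapped
# prints: mapped=2 unmapped=1
```

Because the two filters are complementary, the record counts of the mapped and unmapped .bam files should add up to the record count of the original .sam.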

In [None]:
# convert mapped.bam to .fastq using `samtools fastq`
samtools fastq reads_vs_bemisia_bc92_mapped.bam > reads_vs_bemisia_bc92_mapped.fastq

# convert unmapped.bam to .fastq using `samtools fastq`
samtools fastq reads_vs_bemisia_bc92_unmapped.bam > reads_vs_bemisia_bc92_unmapped.fastq

In [None]:
# copy .bam and .fastq from node to master (-r is required to copy a directory)
cp -r /scratch/whitefly_ont_sequencing/from_fastq/mapping/ /projects/medium/whitefly_ont/mapping

# download mapping files on local disk
scp -r ndougonna@bioinfo-san.ird.fr:/projects/medium/whitefly_ont/mapping /Users/cyrielle_ndougonna/Desktop/WAVE/lab_management/training/minION/ont_workshop_2024_09/personal_project/results/from_fastq/

# D. _de novo_ assembly using Flye

## 1. Create working directory assembly

In [None]:
# create assembly directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/assembly
cd /scratch/whitefly_ont_sequencing/from_fastq/assembly
pwd

## 2. Run Flye

In [None]:
# load Flye
module load flye/2.9.2
module list

In [None]:
# print Flye help menu
flye --help

In [None]:
# run Flye on mapped reads
flye --threads 8 --nano-hq /scratch/whitefly_ont_sequencing/from_fastq/mapping/reads_vs_bemisia_bc92_mapped.fastq -o ./flye_output_bc92

In [None]:
# run Flye on unmapped reads
## add time flag to record running time
flye --threads 8 --meta --nano-hq /scratch/whitefly_ont_sequencing/from_fastq/mapping/reads_vs_bemisia_bc92_unmapped.fastq -o flye_output_bc92_meta

In [None]:
# run Flye on raw reads
## add time flag to record running time
flye --threads 8 --meta --nano-hq /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq -o flye_output_bc92_raw

## 3. Estimate quality of raw assemblies (QUAST & MetaQUAST)

In [None]:
# create quast_raw directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/quast/raw
cd /scratch/whitefly_ont_sequencing/from_fastq/quast/raw
pwd

In [None]:
# load QUAST
module load quast/5.2.0
module list

In [None]:
# print Metaquast help menu
metaquast.py --help

In [None]:
###### help menu output
###### Usage: python /usr/local/quast-5.2.0/metaquast.py [options] <files_with_contigs>

In [None]:
# run QUAST on raw assembly
quast.py /projects/medium/whitefly_ont/assembly/flye_output_bc92/assembly.fasta -o quast_raw_bc92 --silent

In [None]:
# run MetaQUAST on raw meta assembly
metaquast.py /projects/medium/whitefly_ont/assembly/flye_output_bc92_meta/assembly.fasta -o quast_raw_bc92_meta --silent

In [None]:
# explore QUAST and MetaQUAST outputs
head -25 /scratch/whitefly_ont_sequencing/from_fastq/quast/raw/quast_raw_bc92/report.txt
head -25 /scratch/whitefly_ont_sequencing/from_fastq/quast/raw/quast_raw_bc92_meta/report.txt

To do: compare assemblies for all Bonoua samples \
To do: compare assemblies for all N'Djem samples
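
QUAST accepts several assemblies in one run and reports them side by side, so a per-site comparison can be expressed as a single command. The sketch below only builds and prints such a command. It assumes that every barcode has been assembled into `flye_output_bc<NN>/assembly.fasta` under the assembly directory (only barcode 92 is shown in this notebook); `quast_compare_cmd` and the output directory names are hypothetical.

```shell
# Sketch: build one QUAST command comparing the Flye assemblies of several barcodes.
# Assumes each barcode was assembled into $ASSEMBLY_DIR/flye_output_bc<NN>/assembly.fasta
ASSEMBLY_DIR=/projects/medium/whitefly_ont/assembly

# Usage: quast_compare_cmd <output directory> <barcode> [<barcode> ...]
quast_compare_cmd() {
    out_dir=$1; shift
    files=""; labels=""
    for bc in "$@"; do
        files="$files $ASSEMBLY_DIR/flye_output_bc$bc/assembly.fasta"
        labels="${labels:+$labels,}bc$bc"    # comma-separated labels, one per assembly
    done
    echo "quast.py$files --labels $labels -o $out_dir --silent"
}

quast_compare_cmd quast_raw_bonoua 92 93 94 95 96
quast_compare_cmd quast_raw_ndjem 41 42 43 44 45 46 47
```

The printed commands can be reviewed and then pasted into a cell once all assemblies exist.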

## 4. Estimate quality of assembly with CheckV (viral genomes only)

In [None]:
# create checkv directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/checkv/unpolished
cd /scratch/whitefly_ont_sequencing/from_fastq/checkv/unpolished
pwd

In [None]:
# load checkv
## checkv not installed on the cluster
module load checkv/xxxxxxxxxx
module list

In [None]:
# download checkV database
checkv download_database /scratch/whitefly_ont_sequencing/from_fastq/checkv/

In [None]:
export CHECKVDB=/scratch/whitefly_ont_sequencing/from_fastq/checkv/checkv-db-xxxxxxxxxxxxxxxxx

In [None]:
# run checkv on assemblies obtained from unmapped reads (_meta)
checkv end_to_end /scratch/whitefly_ont_sequencing/from_fastq/assembly/flye_output_bc92_meta/assembly.fasta output_checkv_unpolished_bc92_meta

## 5. Perform preliminary taxonomic assignment
This first assignment is used to identify and discard samples whose assemblies yielded chimeras.
I will use the viral database for this test, as it is the smallest one.

In [None]:
# load Diamond
module load diamond/2.0.11
module list

In [None]:
# create diamond directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/diamond_test
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond_test
pwd

In [None]:
# create reference database for viruses
diamond makedb --in /scratch/whitefly_ont_sequencing/from_fastq/refseq/virus.protein.faa -d virusdb

In [None]:
# print Diamond help menu
diamond --help

In [None]:
# launch Diamond
diamond blastx --very-sensitive --threads 8 --db virusdb.dmnd --query /projects/medium/whitefly_ont/assembly/flye_output_bc41_meta/assembly.fasta --outfmt 6 evalue score length pident mismatch gapopen stitle qtitle -o diamond_blastx_bc41_meta.csv

This first round of assignment yielded conclusive results, so I can move on to polishing the assemblies and performing taxonomic assignment on the polished assemblies.

## 6. Polish assembly with Medaka

In [None]:
# create polishing directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/polishing
cd /scratch/whitefly_ont_sequencing/from_fastq/polishing
pwd

In [None]:
# load medaka
module load medaka/1.5
module list

In [None]:
# print medaka options
medaka_consensus -h

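Use the `-m` flag to select the model that matches the basecaller. \
The .pod5 files were basecalled with the dorado model dna_r10.4.1_e8.2_400bps_sup@v5.0.0. \
I will use the r104_e81_sup_g5015 model for polishing. Its name refers to R10.4 / E8.1 chemistry, so it does not exactly match the basecalling model; `medaka tools list_models` lists the models available in the installed version.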

In [None]:
# run medaka on meta assembly
medaka_consensus -t 8 -m r104_e81_sup_g5015 -i /projects/medium/whitefly_ont/mapping/fastq/reads_vs_bemisia_bc92_unmapped.fastq -d /projects/medium/whitefly_ont/assembly/flye_output_bc92_meta/assembly.fasta -o medaka_bc92_meta

In [None]:
# run medaka on assembly
medaka_consensus -t 8 -m r104_e81_sup_g5015 -i /projects/medium/whitefly_ont/mapping/fastq/reads_vs_bemisia_bc92_mapped.fastq -d /projects/medium/whitefly_ont/assembly/flye_output_bc92/assembly.fasta -o medaka_bc92

In [None]:
# medaka logged the following messages: without a GPU, threads are capped at 2, so -t 2 is sufficient
# [03:01:48 - Predict] Reducing threads to 2, anymore is a waste.
# [03:01:48 - Predict] It looks like you are running medaka without a GPU and attempted to set a high number of threads. We have scaled this down to an optimal number. If you wish to improve performance please see https://nanoporetech.github.io/medaka/installation.html#improving-parallelism.

In [None]:
# polish assemblies obtained from raw reads (i.e. without mapping .fastq against the whitefly genome)

medaka_consensus -t 8 -m r104_e81_sup_g5015 -i /projects/medium/whitefly_ont/FASTQ/SQK-NBD114-96_barcode92.fastq -d /projects/medium/whitefly_ont/assembly/flye_output_bc92_raw/assembly.fasta -o medaka_bc92_raw

## 7. Compare quality of polished assemblies and unpolished ones (MetaQUAST)

In [None]:
# create quast_polished directory
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/quast/polished
cd /scratch/whitefly_ont_sequencing/from_fastq/quast/polished
pwd

In [None]:
# load QUAST
module load quast/5.2.0
module list

In [None]:
# run QUAST on polished assembly
quast.py /projects/medium/whitefly_ont/polishing/medaka_bc92/consensus.fasta -o quast_polished_bc92 --silent

In [None]:
# run MetaQUAST on polished meta assembly
metaquast.py /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta -o quast_polished_bc92_meta --silent

In [None]:
# create directory for comparisons
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/quast/compared
cd /scratch/whitefly_ont_sequencing/from_fastq/quast/compared
pwd

In [None]:
# compare outputs for polished and unpolished assemblies
quast.py /projects/medium/whitefly_ont/polishing/medaka_bc92/consensus.fasta /projects/medium/whitefly_ont/assembly/flye_output_bc92/assembly.fasta -o quast_compared_bc92 --silent

In [None]:
# compare outputs for polished and unpolished meta assemblies
metaquast.py /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta /projects/medium/whitefly_ont/assembly/flye_output_bc92_meta/assembly.fasta -o quast_compared_bc92_meta --silent

# E. Taxonomic assignment

## 1. Download relevant databases

In [None]:
# it can take several hours to download some of the large databases (e.g. bacteria)

In [None]:
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/refseq
cd /scratch/whitefly_ont_sequencing/from_fastq/refseq
pwd

### 1.1. Bacteria database

In [None]:
# download bacteria database
wget -r --no-parent -A 'bacteria.*.protein.faa.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

In [None]:
# merge .faa.gz files into one
cat /projects/medium/whitefly_ont/refseq/ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria.*.protein.faa.gz > bacteria.protein.faa.gz
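
Merging `.gz` files with `cat` works because a gzip file may consist of several compressed members, which `gunzip` decompresses one after the other. The toy example below (made-up sequences and file names) illustrates this and shows a quick way to count the sequences in a merged file.

```shell
# Toy demonstration: concatenated gzip files decompress to the concatenated contents
workdir=$(mktemp -d)
printf '>seq1\nMKT\n' | gzip -c > "$workdir/part1.faa.gz"
printf '>seq2\nMAL\n' | gzip -c > "$workdir/part2.faa.gz"

# Merge the compressed parts without decompressing them first
cat "$workdir"/part*.faa.gz > "$workdir/merged.faa.gz"

# Count FASTA records in the merged file
n_seqs=$(gunzip -c "$workdir/merged.faa.gz" | grep -c '^>')
echo "sequences in merged file: $n_seqs"    # prints: sequences in merged file: 2
rm -r "$workdir"
```

The same `gunzip -c ... | grep -c '^>'` check can be run on the merged database to confirm that no part was lost.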

In [None]:
# unzip .faa.gz (although Diamond online manual states that input protein reference database file may be gzip compressed)
gunzip /projects/medium/whitefly_ont/refseq/bacteria.protein.faa.gz

In [None]:
# delete download directory
rm -r /projects/medium/whitefly_ont/refseq/ftp.ncbi.nlm.nih.gov/

### 1.2.1. Fungi database (DNA sequence)

In [None]:
# download fungi genomic database
wget -r --no-parent -A 'fungi.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/

In [None]:
# merge .fna.gz files into one
cat /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.genomic.fna.gz > fungi.genomic.fna.gz

In [None]:
# unzip .fna.gz (although Diamond online manual states that input protein reference database file may be gzip compressed)
gunzip /scratch/whitefly_ont_sequencing/from_fastq/refseq/fungi.genomic.fna.gz

In [None]:
# delete download directory
rm -r /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/

### 1.2.2. Fungi database (protein sequence)

In [None]:
# download fungi protein database
wget -r --no-parent -A 'fungi.*.protein.faa.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/

In [None]:
# merge .faa.gz files into one
cat /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.protein.faa.gz > fungi.protein.faa.gz

In [None]:
# unzip .faa.gz (although Diamond online manual states that input protein reference database file may be gzip compressed)
gunzip /scratch/whitefly_ont_sequencing/from_fastq/refseq/fungi.protein.faa.gz

In [None]:
# delete download directory
rm -r /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/

### 1.3. Virus database

In [None]:
# download virus database 
## download protein database, as there are a lot of recombinants for viruses
wget -r --no-parent https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz

In [None]:
# unzip .faa.gz (although Diamond online manual states that input protein reference database file may be gzip compressed)
gunzip /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz

### 1.4. Clean-up

In [None]:
# move and rename virus protein database to refseq directory
mv /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa ./virus.protein.faa

# delete download directory
rm -r /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/

## 2. Create Diamond databases

In [None]:
# create diamond directory
mkdir -p /projects/medium/whitefly_ont/diamond
cd /projects/medium/whitefly_ont/diamond
pwd

In [None]:
# load Diamond
module load diamond/2.0.11
module list

In [None]:
# for bacteria, use diamond database available on cluster
ls -lh /share/banks/diamond/nr/21-09/diamond_nr.dmnd

In [None]:
# create reference database for bacteria using protein sequences from NCBI
diamond makedb --in /projects/medium/whitefly_ont/refseq/bacteria.protein.faa -d bacteriadb

In [None]:
# create reference database for fungi using DNA sequences from NCBI
diamond makedb --in /projects/medium/whitefly_ont/refseq/fungi.genomic.fna -d fungidb_dna
# this returned the following error message:
# Error: The sequences are expected to be proteins but only contain DNA letters. Use the option --ignore-warnings to proceed.
# the command was rerun with --ignore-warnings:
diamond makedb --in /projects/medium/whitefly_ont/refseq/fungi.genomic.fna -d fungidb_dna --ignore-warnings
# note: Diamond aligns queries against protein reference databases; the blastx searches in this notebook use the protein database (fungidb), not fungidb_dna

In [None]:
# create reference database for fungi using protein sequences from NCBI
diamond makedb --in /projects/medium/whitefly_ont/refseq/fungi.protein.faa -d fungidb

In [None]:
# create reference database for viruses using protein sequences from NCBI
diamond makedb --in /projects/medium/whitefly_ont/refseq/virus.protein.faa -d virusdb

## 3. Perform taxonomic assignment

In [None]:
# create diamond directories
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/diamond/virus
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/diamond/fungi
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/diamond/bacteria
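
The Diamond commands in the following subsections differ only in the database, the query assembly and the output name. A small wrapper such as the one below keeps the options identical across runs. This is a sketch: `run_blastx`, the directory variables and the `DRY_RUN` switch are hypothetical, and by default the commands are only printed.

```shell
# Hypothetical wrapper around the diamond blastx call used in this section.
# DRY_RUN=1 (default here) prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"
DB_DIR=/projects/medium/whitefly_ont/diamond
POLISH_DIR=/projects/medium/whitefly_ont/polishing
OUT_DIR=/scratch/whitefly_ont_sequencing/from_fastq/diamond

# Usage: run_blastx <db name> <output subdirectory> <assembly label>
run_blastx() {
    db=$1; subdir=$2; label=$3
    # Build the full command as positional parameters so it can be printed or executed
    set -- diamond blastx --very-sensitive --threads 8 \
        --db "$DB_DIR/$db.dmnd" --header \
        --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen \
        --query "$POLISH_DIR/medaka_$label/consensus.fasta" \
        -o "$OUT_DIR/$subdir/diamond_blastx_${subdir}_$label.csv"
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Print the three virus searches (mapped, unmapped and raw-read assemblies of barcode 92)
for label in bc92 bc92_meta bc92_raw; do
    run_blastx virusdb virus "$label"
done
```

The fungi searches would use `run_blastx fungidb fungi "$label"` in the same loop.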

### 3.1. Against virus database

In [None]:
# blast polished assemblies against virusdb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/virus
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/virusdb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92/consensus.fasta -o diamond_blastx_virus_bc92.csv

In [None]:
# blast polished meta assemblies against virusdb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/virus
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/virusdb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta -o diamond_blastx_virus_bc92_meta.csv

In [None]:
# blast polished raw assemblies against virusdb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/virus
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/virusdb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_raw/consensus.fasta -o diamond_blastx_virus_bc92_raw.csv
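
The `--outfmt 6` reports are tab-separated with one line per hit, so a contig can appear many times. One possible way to reduce a report to the best hit per contig is sketched below. It assumes the column order used in these commands (column 1 `qtitle`, column 7 `evalue`, column 8 `bitscore`) and that any header lines start with `#`; the function name `best_hits` and the default e-value cut-off of 1e-5 are illustrative choices, not part of the original analysis.

```shell
# Keep, for each query (column 1), the hit with the highest bitscore (column 8)
# among hits whose e-value (column 7) is at or below a cut-off.
# Usage: best_hits <report> [max_evalue]
best_hits() {
    awk -F '\t' -v max_e="${2:-1e-5}" '
        /^#/ { next }                          # skip header/comment lines
        ($7 + 0) <= (max_e + 0) {
            if (!($1 in score) || ($8 + 0) > score[$1]) {
                score[$1] = $8 + 0             # best bitscore seen so far for this query
                line[$1] = $0                  # full record of that hit
            }
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

For example, `best_hits diamond_blastx_virus_bc92_meta.csv 1e-10 > best_hits_virus_bc92_meta.tsv` would keep one line per contig using a stricter cut-off.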

### 3.2. Against fungi database

In [None]:
# blast polished assemblies against fungidb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/fungi
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/fungidb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92/consensus.fasta -o diamond_blastx_fungi_bc92.csv

In [None]:
# blast polished meta assemblies against fungidb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/fungi
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/fungidb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta -o diamond_blastx_fungi_bc92_meta.csv

In [None]:
# blast polished raw assemblies against fungidb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/fungi
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/fungidb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_raw/consensus.fasta -o diamond_blastx_fungi_bc92_raw.csv

### 3.3. Against bacteria database

In [None]:
# blast polished meta assemblies against diamond_nr.dmnd
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/bacteria
pwd
diamond blastx --very-sensitive --threads 8 --db /share/banks/diamond/nr/21-09/diamond_nr.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta -o diamond_blastx_nr_bc92_meta.csv

In [None]:
# blast polished meta assemblies against bacteriadb
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/bacteria
pwd
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/bacteriadb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92_meta/consensus.fasta -o diamond_blastx_bacteria_bc92_meta.csv

## 4. Perform assignment of mapped-read assemblies against the _B. tabaci_ reference

In [None]:
# create diamond directory
mkdir -p /projects/medium/whitefly_ont/diamond
cd /projects/medium/whitefly_ont/diamond
pwd

In [None]:
# load Diamond
module load diamond/2.0.11
module list

In [None]:
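# create reference database for Bemisia tabaci using DNA sequence
# note: makedb expects protein sequences, so this is expected to stop with the same "expected to be proteins" error as the fungi genomic database; the protein database (bemisiadb) is the one used for blastx
diamond makedb --in /projects/medium/whitefly_ont/refseq/b_tabaci.genomic.fna -d bemisiadb_dna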

In [None]:
# download Bemisia tabaci reference protein sequence
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/refseq
cd /scratch/whitefly_ont_sequencing/from_fastq/refseq
pwd

wget -r --no-parent https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_protein.faa.gz

In [None]:
# unzip .faa.gz
gunzip /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_protein.faa.gz

In [None]:
# move and rename reference protein file to refseq directory
mv /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/GCF_001854935.1_ASM185493v1_protein.faa ./b_tabaci.protein.faa

# delete download directory
rm -r /scratch/whitefly_ont_sequencing/from_fastq/refseq/ftp.ncbi.nlm.nih.gov/

mv /scratch/whitefly_ont_sequencing/from_fastq/refseq/b_tabaci.protein.faa /projects/medium/whitefly_ont/refseq/

In [None]:
# create reference database for Bemisia tabaci using protein sequence
cd /projects/medium/whitefly_ont/diamond
pwd
diamond makedb --in /projects/medium/whitefly_ont/refseq/b_tabaci.protein.faa -d bemisiadb

In [None]:
mkdir -p /scratch/whitefly_ont_sequencing/from_fastq/diamond/bemisia
cd /scratch/whitefly_ont_sequencing/from_fastq/diamond/bemisia
pwd

In [None]:
diamond blastx --very-sensitive --threads 8 --db /projects/medium/whitefly_ont/diamond/bemisiadb.dmnd --header --outfmt 6 qtitle stitle qstart qend sstart send evalue bitscore length pident mismatch gapopen --query /projects/medium/whitefly_ont/polishing/medaka_bc92/consensus.fasta -o diamond_blastx_bemisia_bc92.csv