# Metafly: a whitefly metagenomics project
By Cyrielle Ndougonna \
Supervision: Ezéchiel B. Tibiri & Fidèle Tiendrébéogo

Project aims: \
O1: characterise whitefly (_Bemisia tabaci_) genotypes circulating in the two study areas (Bonoua and N'Djem) \
O2: establish the diversity of viruses associated with whiteflies originating from Bonoua and N'Djem \
O3: catalogue the endosymbiotic bacteria associated with whiteflies originating from Bonoua and N'Djem

This notebook describes the steps of the bioinformatics pipeline used to analyse whitefly Oxford Nanopore reads. \
The analysis was run on the UJKZ HPC.

Conventions: \
directory names in all caps \
file names with underscore and no caps

# A. Basecalling ONT reads with Dorado

## 1. Create working directories and import raw data files

In [None]:
# connect to the remote server and list the available partitions
ssh cndougonna@102.216.123.67
scontrol show partitions

In [None]:
# reserve a node and create personal folder in /scratch
srun --ntasks=1 --cpus-per-task=8 --mem=32G --time=03:00:00 --pty bash -i
mkdir -p /scratch/whitefly_ont_sequencing

In [None]:
# create data and basecalling directories
mkdir -p /scratch/whitefly_ont_sequencing/raw_data
mkdir -p /scratch/whitefly_ont_sequencing/basecalling

In [None]:
# copy raw data from /home folder to /raw_data
cd /scratch/whitefly_ont_sequencing/raw_data
cp -r /home/cndougonna/whitefly/FAV02519 ./
ls
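
Before basecalling, it may be worth confirming that the copy is complete. The following is a minimal sketch, not part of the original run: the helper name `count_pod5` is hypothetical, and it simply counts `.pod5` files under a folder so that the source and the copy can be compared (42 files are expected for this run).

```shell
# count_pod5 DIR (hypothetical helper): print the number of .pod5 files found under DIR
count_pod5() {
    find "$1" -type f -name '*.pod5' | wc -l | tr -d ' '
}

# the two counts should match; uncomment on the HPC
# count_pod5 /home/cndougonna/whitefly/FAV02519
# count_pod5 /scratch/whitefly_ont_sequencing/raw_data/FAV02519
```

If the counts differ, repeating the `cp -r` step before moving on is probably the simplest fix.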

## 2. Basecalling

In [None]:
# load Dorado
module load doradoxxxxxxxxxx
module list

In [None]:
# print Dorado options
dorado basecaller --help

In [None]:
# list models available for download
dorado download --list

In [None]:
# download appropriate model
dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.0.0

In [None]:
# there are 42 .pod5 files in total; loop over them and basecall each one
## dorado basecaller detects and trims adapter/primer sequences by default; barcodes are only handled when --kit-name is given
## run from the directory holding the downloaded model folder; adjust the glob if the .pod5 files sit in a subfolder such as raw_data/FAV02519
mkdir -p ./fastq
for FILE in /scratch/whitefly_ont_sequencing/raw_data/*.pod5; do FILENAME=$(basename "$FILE" .pod5); \
dorado basecaller --recursive dna_r10.4.1_e8.2_400bps_sup@v5.0.0 --emit-fastq "$FILE" > ./fastq/"${FILENAME}".fastq; done
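
Once the loop has finished, a quick check can flag inputs that produced no reads. This is a sketch under one assumption: each `.pod5` file yields a `.fastq` with the same base name, as the loop above intends. The helper name `list_missing_fastq` is hypothetical.

```shell
# list_missing_fastq POD5_DIR FASTQ_DIR (hypothetical helper):
# print the base name of every .pod5 file that has no non-empty .fastq counterpart
list_missing_fastq() {
    local pod5_dir=$1 fastq_dir=$2 file name
    for file in "$pod5_dir"/*.pod5; do
        [ -e "$file" ] || continue              # the glob matched nothing
        name=$(basename "$file" .pod5)          # strip the directory and the extension
        [ -s "$fastq_dir/$name.fastq" ] || echo "$name"
    done
}

# example call on the HPC (paths taken from the cells above)
# list_missing_fastq /scratch/whitefly_ont_sequencing/raw_data ./fastq
```

An empty output means every input file has a non-empty FASTQ; any name printed is a candidate for re-running.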

# B. Quality control with NanoPlot

## 1. Create working directory qc

In [None]:
# create qc directory
mkdir -p /scratch/whitefly_ont_sequencing/qc

## 2. Run NanoPlot

In [None]:
# print NanoPlot help menu
NanoPlot --help

In [None]:
# run NanoPlot (output goes to qc/barcode41, the folder examined in the next cell)
NanoPlot -t 8 -o /scratch/whitefly_ont_sequencing/qc/barcode41 \
            --fastq /scratch/whitefly_ont_sequencing/basecalling/SQK-NBD114-96_barcode41.fastq \
            --plots kde hex dot
## note: NanoPlot reported that hex is deprecated and must be requested with --legacy hex, which requires additional dependencies to be installed

In [None]:
# examine QC reports
cd /scratch/whitefly_ont_sequencing/qc/barcode41
cat NanoStats.txt
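
To pull a single metric out of the report rather than reading the whole file, a small helper can be used. This is a sketch that assumes `NanoStats.txt` keeps its usual `label: value` layout; the helper name `nanostats_value` is hypothetical, and labels such as `Number of reads` should be checked against the actual file.

```shell
# nanostats_value FILE LABEL (hypothetical helper):
# print the value that follows "LABEL:" on the first matching line of FILE
nanostats_value() {
    awk -v key="$2" -F':' '
        $1 == key {
            sub(/^[ \t]+/, "", $2)   # drop the padding between the colon and the value
            print $2
            exit
        }' "$1"
}

# example call on the HPC
# nanostats_value NanoStats.txt "Mean read length"
```

The same function can be called in a loop over several barcodes to build a one-line-per-sample summary.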

# C. _De novo_ assembly using Flye

## 1. Create working directory assembly

In [None]:
# create assembly directory
mkdir -p /scratch/whitefly_ont_sequencing/assembly
cd /scratch/whitefly_ont_sequencing/assembly

## 2. Run Flye

In [None]:
# load Flye
module load flye/2.9.3
module list

In [None]:
# print Flye help menu
flye --help

In [None]:
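# run Flye in metagenome mode on high-quality ONT reads
## --resume is meant for continuing an interrupted run in an existing flye_output folder; leave it out the first time
time flye --threads 8 --resume --meta --nano-hq /scratch/whitefly_ont_sequencing/basecalling/xxxxxxxx.fastq -o flye_output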
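
Flye writes a tab-separated summary, `assembly_info.txt`, to the output folder. The sketch below lists the contigs that Flye flagged as circular, which can be a useful starting point when looking for complete viral or bacterial replicons. It assumes the column order documented for Flye 2.9 (sequence name, length, coverage, circular flag first); the helper name `circular_contigs` is hypothetical.

```shell
# circular_contigs FILE (hypothetical helper):
# print name, length and coverage of every contig whose "circ." column is Y
circular_contigs() {
    awk -F'\t' '!/^#/ && $4 == "Y" { print $1 "\t" $2 "\t" $3 }' "$1"
}

# example call on the HPC
# circular_contigs flye_output/assembly_info.txt
```

A circular flag only reflects the assembly graph, so such contigs still need to be confirmed during taxonomic assignment.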

## 3. Polish assembly

# D. Taxonomic assignment

## 1. Download relevant databases

In [None]:
# it can take several hours to download some of the large databases (e.g. bacteria)

In [None]:
# download eukaryote database (Bemisia tabaci reference genome)
# source: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/
wget -r --no-parent -A 'GCF_*_genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Bemisia_tabaci/all_assembly_versions/GCF_001854935.1_ASM185493v1/

In [None]:
# download bacteria database
wget -r --no-parent -A 'bacteria.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

In [None]:
# download fungi database
wget -r --no-parent -A 'fungi.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/

In [None]:
# download virus database
wget -r --no-parent -A 'viral.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
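
Long downloads are occasionally interrupted, leaving truncated archives behind. The sketch below tests every `.gz` file under a folder with `gzip -t` and prints the ones that fail; the helper name `list_corrupt_gz` is hypothetical, and the folder passed to it is whatever directory `wget` was run from.

```shell
# list_corrupt_gz DIR (hypothetical helper):
# print the path of every .gz file under DIR that fails gzip's integrity test
list_corrupt_gz() {
    find "$1" -type f -name '*.gz' | while IFS= read -r archive; do
        gzip -t "$archive" 2>/dev/null || echo "$archive"
    done
}

# example call on the HPC, from the directory where wget was run
# list_corrupt_gz ftp.ncbi.nlm.nih.gov
```

Any file listed can be deleted and fetched again by re-running the corresponding `wget` command.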