# Parasite genome assembly
In this notebook you can find the commands and scripts that were used for the genome assembly.
Input and output directories are specified under `### INPUT ###` in every script. The path to the executable file of the program is specified in the `### SOFT ###` section of the script. More details are given in the comments of each script.

# Library preparation
## ID correction of reads
BBMap/BBTools (v. 39.01) was used for ID correction of reads and filtering of rnaSPAdes results. 
Script for BBTools:

In [None]:
#!/bin/bash

### INPUT ###

INDIR=/home/LVP/raw_data/dna # Full path to the input directory
OUTDIR=/home/LVP/raw_data/dna/bbtools # Full path to the output directory
SPTAG=pret # Species tag
OUTTAG=renamed_raw # Output files tag

### SOFT ###

BBTools_reformat=/home/al/Documents/Programs/BBMap_39.01/bbmap/reformat.sh

### MAIN ###

for dir in $(find $INDIR -mindepth 1 -type d); do
    r1="$(find $dir -type f -name '*_R1.fq.gz')"
    r2="$(find $dir -type f -name '*_R2.fq.gz')"
    tag=$(basename $dir)
    echo "BBtools/reformat.sh starts to work with: "
    echo "R1: $r1"
    echo "R2: $r2"
    echo "TAG: $tag"
    mkdir -p $OUTDIR/$tag # -p also creates $OUTDIR if it does not exist yet
    cd $OUTDIR/$tag
    nohup $BBTools_reformat in=$r1 in2=$r2 out=${SPTAG}.${tag}.${OUTTAG}.R1.fq.gz out2=${SPTAG}.${tag}.${OUTTAG}.R2.fq.gz trimreaddescription=t addslash=t spaceslash=f
    wait
    cd ..
done
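
With `trimreaddescription=t addslash=t spaceslash=f`, `reformat.sh` trims each read name at the first whitespace and appends `/1` or `/2` directly to it. The sketch below is one possible sanity check on the renamed files; `check_pair_ids` and the toy reads are hypothetical and were not part of the original workflow.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original pipeline: counts read pairs
# whose IDs are not "<name>/1" in R1 and "<name>/2" in R2.
check_pair_ids() {
    local r1=$1 r2=$2
    # FASTQ headers are lines 1, 5, 9, ...; paste puts R1 and R2 headers side by side.
    paste <(gzip -cd "$r1" | awk 'NR % 4 == 1') \
          <(gzip -cd "$r2" | awk 'NR % 4 == 1') |
    awk -F'\t' '
        {
            id1 = $1; id2 = $2
            if (id1 !~ /\/1$/ || id2 !~ /\/2$/) { bad++; next }
            sub(/\/1$/, "", id1)
            sub(/\/2$/, "", id2)
            if (id1 != id2) bad++
        }
        END { print bad + 0 }'
}

# Toy input (synthetic reads) to demonstrate the check.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n' | gzip > "$tmp/demo_R1.fq.gz"
printf '@read1/2\nACGT\n+\nIIII\n@read2/2\nTCAA\n+\nIIII\n' | gzip > "$tmp/demo_R2.fq.gz"

mismatches=$(check_pair_ids "$tmp/demo_R1.fq.gz" "$tmp/demo_R2.fq.gz")
echo "Mismatched or malformed pairs: $mismatches"
```

A count of 0 means that every R1 header has a matching R2 header; files with different numbers of reads are reported as mismatches as well.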

## Quality control
FastQC (v. 0.12.1) was utilized to spot potential problems in sequencing datasets.

FastQC was launched with the following command, which specifies the input files and the output directory:

In [None]:
fastqc -o genome_fq_out *.fq.gz

Low-quality and adapter sequences were removed with the FastP utility (v. 0.23.2). Scripts are available in the repository. The trimming parameters were as follows: `--cut_window_size 4 --cut_mean_quality 20 --qualified_quality_phred 20 --length_required 25`.

FastP utility launching script:

In [None]:
#!/bin/bash

### INPUT ###

INDIR=/home/LVP/raw_data/dna/bbtools # Full path to the input directory
OUTDIR=/home/LVP/raw_data/dna/fastp_out # Full path to the output directory

### SOFT ###

fastp=/home/LVP/Soft/fastp

### PARAMS ###

WIN_SIZE=4
QUAL_QUAL=20
MEAN_QUAL=20
MIN_LEN=25
THREADS=5

### MAIN ###

mkdir -p $OUTDIR # Create the output directory if it does not exist yet
cd $OUTDIR

for dir in $(find $INDIR -mindepth 1 -type d); do
    r1="$(find $dir -type f -name '*R1.fq.gz')"
    r2="$(find $dir -type f -name '*R2.fq.gz')"
    tag=$(basename $dir)
    echo "FastP starts to work with: "
    echo "R1: $r1"
    echo "R2: $r2"
    echo "TAG: $tag"
    echo "Window size: $WIN_SIZE"
    echo "Mean quality: $MEAN_QUAL"
    echo "Qualified quality: $QUAL_QUAL"
    echo "Min length: $MIN_LEN"
    echo "Threads: $THREADS"
    mkdir ${tag}_fastP_output
    cd ${tag}_fastP_output
    nohup $fastp --in1 $r1 --out1 ${tag}.fastp_tmm.R1.fastq.gz --in2 $r2 --out2 ${tag}.fastp_tmm.R2.fastq.gz --unpaired1 ${tag}.fastp_unpaired1.fastq.gz --unpaired2 ${tag}.fastp_unpaired2.fastq.gz$
    wait
    cd ..
done

echo "##### Job is complete #####"
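
The `fastp` call in the script is cut off after the `--unpaired2` argument, so the exact option list is not recoverable from it. The sketch below shows one plausible way the `### PARAMS ###` variables map onto the trimming parameters quoted in the text; the mapping, the sample tag and the input file names are assumptions. The sliding-window options only take effect together with one of `--cut_front`, `--cut_tail` or `--cut_right`; which of these was used is not stated, so none is included here.

```bash
#!/bin/bash
# Values copied from the "### PARAMS ###" section of the script.
WIN_SIZE=4
QUAL_QUAL=20
MEAN_QUAL=20
MIN_LEN=25
THREADS=5

# Hypothetical sample tag and input names, for illustration only.
tag=demo
r1=${tag}.R1.fq.gz
r2=${tag}.R2.fq.gz

# Assumed mapping of the variables to the fastp options named in the text.
fastp_args=(
    --in1 "$r1" --out1 "${tag}.fastp_tmm.R1.fastq.gz"
    --in2 "$r2" --out2 "${tag}.fastp_tmm.R2.fastq.gz"
    --unpaired1 "${tag}.fastp_unpaired1.fastq.gz"
    --unpaired2 "${tag}.fastp_unpaired2.fastq.gz"
    --cut_window_size "$WIN_SIZE"
    --cut_mean_quality "$MEAN_QUAL"
    --qualified_quality_phred "$QUAL_QUAL"
    --length_required "$MIN_LEN"
    --thread "$THREADS"
)

# Print the command instead of running it, so the mapping can be inspected.
echo "fastp ${fastp_args[*]}"
```

Keeping the arguments in an array makes it easy to log the exact command next to the results before launching it.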


## Decontamination via Kraken2
Kraken2 (v. 2.1.2) is a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. In our project, we used it to build a database of possible contaminants (such as bacteria and viruses).
For the genomic short paired-end read libraries, we used the standard database, which includes sequence libraries for archaea, viruses, bacteria, plasmids, fungi, human and protozoa.

Database download:

In [None]:
#!/bin/bash

### INPUT ###

OUTDIR=/home/LVP/Kraken2_DB_reticulata/ # Full path to the directory where the database will be stored
DBTAG=Kraken2_plus_db_reticulata # Some name of the database to be created

### SOFT ###

KRAKEN2DIR=/home/LVP/Soft/kraken2 # Full path to the directory where the Kraken2 executable files are stored

### Processing ###

mkdir -p $OUTDIR
cd $OUTDIR

nohup $KRAKEN2DIR/kraken2-build --download-taxonomy --db $DBTAG --threads 5 --use-ftp
wait
nohup $KRAKEN2DIR/kraken2-build --download-library archaea --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library viral --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library bacteria --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library plasmid --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library human --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library fungi --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library protozoa --db $DBTAG --threads 5
wait
nohup $KRAKEN2DIR/kraken2-build --download-library UniVec_Core --db $DBTAG --threads 5
wait

echo "##### Job is complete #####"

Build database:

In [None]:
#!/bin/bash

### INPUT ###

OUTDIR=/home/LVP/Kraken2_DB_reticulata/ # Full path to the directory where the database will be stored
DBTAG=Kraken2_plus_db_TEST # Some name of the database to be created

### SOFT ###

KRAKEN2DIR=/home/LVP/Soft/kraken2 # Full path to the directory where the Kraken2 executable files are stored

### Processing ###

cd $OUTDIR

nohup $KRAKEN2DIR/kraken2-build --build --db $DBTAG --threads 12
wait

echo "##### Job is complete #####"

Decontamination:

In [None]:
#!/bin/bash

### INPUT ###

FASTPDIR=/home/LVP/fastp/dna/ # Full path to the directory where the fastP results are stored
OUTDIR=/home/LVP/kraken2_outputs/genome_kraken2_output/ # Full path to the directory where the classification results (Kraken2 output) will be stored
DBDIR=/home/LVP/Kraken2_DB_reticulata # Full path to the directory where the database is stored
DBTAG=Kraken2_plus_db_TEST # Some name of the database that will be used
SUFFIX='#' # Kraken2 replaces "#" in the output file names with _1 and _2 for paired reads

### SOFT ###

KRAKEN2DIR=/home/LVP/Soft/kraken2 # Full path to the directory where the Kraken2 executable files are stored

### Processing ###

mkdir -p $OUTDIR
cd $OUTDIR

for dir in $(find $FASTPDIR -mindepth 1 -type d); do
        r1="$(find $dir -type f -name '*fastp_tmm.R1.fastq.gz')"
        r2="$(find $dir -type f -name '*fastp_tmm.R2.fastq.gz')"
        tag=$(basename $dir)
        echo "Kraken2 starts to work with: "
        echo "R1: $r1"
        echo "R2: $r2"
        echo "TAG: $tag"
        mkdir ${tag}_vs_${DBTAG}
        cd ${tag}_vs_${DBTAG}
        nohup $KRAKEN2DIR/kraken2 --db ${DBDIR}/${DBTAG}/ --threads 12 --paired --gzip-compressed --unclassified-out ${tag}.unclass.R${SUFFIX}.fastq --classified-out ${tag}.class.R${SUFFIX}.fastq --output ${tag}_vs_${DBTAG}.tab --report ${tag}_vs_${DBTAG}.report ${r1} ${r2}
        wait
        cd ..
done

echo "##### Job is complete #####"

The results were visualized with Pavian (v. 1.0) using the Kraken2 `.report` output file of each sample.
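
Before loading the reports into Pavian, the share of unclassified fragments per sample can be read directly from the `.report` files, whose tab-separated columns are: percentage, fragments in clade, fragments assigned directly, rank code, taxonomy ID and name. The following is a minimal sketch; `summarize_reports` and the toy report are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: prints sample name, % unclassified and number of
# unclassified fragments for every *.report file under a directory.
summarize_reports() {
    local report_dir=$1 report
    while IFS= read -r report; do
        awk -F'\t' -v name="$(basename "$report" .report)" '
            $4 == "U" {                    # rank code "U" marks the unclassified line
                gsub(/ /, "", $1)          # the percentage column is space-padded
                printf "%s\t%s\t%s\n", name, $1, $2
            }' "$report"
    done < <(find "$report_dir" -type f -name '*.report' | sort)
}

# Toy report with synthetic numbers, laid out like a Kraken2 report.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/A_vs_db"
printf ' 85.00\t8500\t8500\tU\t0\tunclassified\n 15.00\t1500\t10\tR\t1\troot\n 12.00\t1200\t1200\tD\t2\t  Bacteria\n' \
    > "$tmp/A_vs_db/A_vs_db.report"

summary=$(summarize_reports "$tmp")
echo "$summary"
```

A report without a `U` line (all fragments classified) produces no output for that sample.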

For the following steps all decontaminated short paired-end libraries were merged together. 
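
The merging command itself is not shown in this notebook. A minimal sketch of how it could be done is given below, assuming the per-sample files are named `<tag>.unclass.R_1.fastq` and `<tag>.unclass.R_2.fastq` (Kraken2 substitutes `_1` and `_2` for the `#` in the output name); `merge_unclassified` and the toy directory layout are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: concatenates per-sample Kraken2 unclassified reads.
merge_unclassified() {
    local kraken_dir=$1 out_prefix=$2 r1 r2
    : > "${out_prefix}.R1.fastq"
    : > "${out_prefix}.R2.fastq"
    # Sorting the R1 paths fixes the sample order; each R2 is appended together
    # with its R1, so mates stay at the same position in both merged files.
    while IFS= read -r r1; do
        r2="${r1%_1.fastq}_2.fastq"
        if [ ! -f "$r2" ]; then
            echo "Missing mate file for $r1" >&2
            return 1
        fi
        cat "$r1" >> "${out_prefix}.R1.fastq"
        cat "$r2" >> "${out_prefix}.R2.fastq"
    done < <(find "$kraken_dir" -type f -name '*.unclass.R_1.fastq' | sort)
}

# Toy directory layout with synthetic reads.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/kraken/A_vs_db" "$tmp/kraken/B_vs_db"
printf '@a1/1\nACGT\n+\nIIII\n' > "$tmp/kraken/A_vs_db/A.unclass.R_1.fastq"
printf '@a1/2\nTGCA\n+\nIIII\n' > "$tmp/kraken/A_vs_db/A.unclass.R_2.fastq"
printf '@b1/1\nGGCC\n+\nIIII\n@b2/1\nAATT\n+\nIIII\n' > "$tmp/kraken/B_vs_db/B.unclass.R_1.fastq"
printf '@b1/2\nCCGG\n+\nIIII\n@b2/2\nTTAA\n+\nIIII\n' > "$tmp/kraken/B_vs_db/B.unclass.R_2.fastq"

merge_unclassified "$tmp/kraken" "$tmp/merged"
n1=$(awk 'END { print NR }' "$tmp/merged.R1.fastq")
n2=$(awk 'END { print NR }' "$tmp/merged.R2.fastq")
echo "R1 lines: $n1, R2 lines: $n2"
```

Equal line counts in the two merged files are a quick indication that no mate file was skipped.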

## Genome size assessment
K-mers were counted with Jellyfish for a k-mer length of 25 bp (parameter `--mer-len=25`). The `--size=265M` parameter was chosen based on the total size of the [_Sacculina_ genome](https://www.ebi.ac.uk/ena/browser/view/GCA_916048095) (264,490,643 bp).

The outputs were analyzed using [GenomeScope](http://qb.cshl.edu/genomescope/) (v. 1.0). 

Jellyfish was run using the script:

In [None]:
#!/bin/bash

### INPUT ###
R1=/home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R1.fastq # Full path to the merged decontaminated (after Kraken2) R1 fastq file
R2=/home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R2.fastq # Full path to the merged decontaminated (after Kraken2) R2 fastq file
OUTTAG=Pret.genome.unclass.k25_counts.jf # Some name of Jellyfish result file
HISTOTAG=Pret.genome.unclass.k25_counts.histo
OUTDIR=/home/LVP/genome_assembly/jellyfish_results # Full path to the directory where the Jellyfish output will be stored

### SOFT ###

JELLYFISH=/home/LVP/Soft/Trinity/jellyfish-2.3.0/bin/jellyfish # Full path to the Jellyfish

### MAIN ###
echo "***** Jellyfish began to work with params: *****"
echo "R1: $R1"
echo "R2: $R2"
echo "OUTTAG: $OUTTAG"
echo "OUTDIR: $OUTDIR"

mkdir -p $OUTDIR
cd $OUTDIR

nohup $JELLYFISH count -m 25 -s 265M -t 10 -o $OUTTAG -C $R1 $R2
wait

echo "***** Histo jellyfish began to work with param: *****"
echo "HISTOTAG: $HISTOTAG"

nohup $JELLYFISH histo -t 10 $OUTTAG > $HISTOTAG
wait

echo "##### Job is complete #####"
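
As a rough cross-check of the GenomeScope result, the genome size can be approximated from the `.histo` file as the total number of k-mers above an error cutoff divided by the depth of the main coverage peak. The sketch below is a simplification rather than the model that GenomeScope fits; `estimate_genome_size`, the cutoff value and the toy histogram are assumptions.

```bash
#!/bin/bash
# Hypothetical helper: rough genome size estimate from a Jellyfish histogram
# (two columns: k-mer depth, number of distinct k-mers with that depth).
# min_depth is an assumed cutoff separating error k-mers from genomic ones.
estimate_genome_size() {
    local histo=$1 min_depth=$2
    awk -v min="$min_depth" '
        $1 >= min {
            total += $1 * $2                                   # k-mers above the cutoff
            if ($2 > peak_freq) { peak_freq = $2; peak = $1 }  # main coverage peak
        }
        END {
            if (peak > 0)
                printf "peak_depth=%d est_genome_size=%d\n", peak, total / peak
        }' "$histo"
}

# Synthetic toy histogram: an error tail at depths 1-4 and a peak at depth 10.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' '1 1000' '2 200' '3 50' '4 10' '5 20' '6 40' '7 80' '8 160' \
    '9 300' '10 400' '11 300' '12 160' '13 80' '14 40' '15 20' > "$tmp/demo.histo"

estimate=$(estimate_genome_size "$tmp/demo.histo" 5)
echo "$estimate"   # prints: peak_depth=10 est_genome_size=1600
```

For a strongly heterozygous genome the highest peak may be the heterozygous one, in which case this shortcut overestimates the genome size.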

# Generation of *in silico* mate-pair libraries using the *S. carcini* genome as a reference
*In silico* mate-pair libraries were generated with the [`Cross-species scaffolding`](https://github.com/thackl/cross-species-scaffolding) pipeline using the `-l 141` parameter (the smallest of the average read lengths of the libraries). The other parameters were left at their defaults.

A dedicated conda environment should be created for the `cross-species-scaffolding` pipeline from its `.yaml` file. `Seq-scripts`, `Perl5lib-Fastq` and `Perl5lib-Fasta` should be downloaded from GitHub and installed.


In [None]:
nohup cross-mates -o /home/LVP/Pret_In_silico_MP -t 20 -l 141 /home/LVP/Pret_In_silico_MP/Sacculina_genome.fasta \
/home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R1.fastq \
/home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R2.fastq

# Genome assembly

SPAdes (v. 3.15.4) was used for *Peltogaster reticulata* genome assembly. The paths to the merged decontaminated fastq files and previously obtained *in silico* mate pair libraries were specified, and the assembly was done in careful mode (parameter `--careful`). 

SPAdes was run using the following command, in which all input libraries are listed:

In [None]:
nohup /home/LVP/Soft/SPAdes-3.15.4-Linux/bin/spades.py \
-1 /home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R1.fastq \
-2 /home/LVP/genome_assembly/genome_all_libs_fastP_output.unclass_merged.R2.fastq \
--mp-1 1 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-100000_1.fq \
--mp-2 1 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-100000_2.fq \
--mp-1 2 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-10000_1.fq \
--mp-2 2 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-10000_2.fq \
--mp-1 3 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-1000_1.fq \
--mp-2 3 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-1000_2.fq \
--mp-1 4 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-1500_1.fq \
--mp-2 4 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-1500_2.fq \
--mp-1 5 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-200000_1.fq \
--mp-2 5 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-200000_2.fq \
--mp-1 6 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-20000_1.fq \
--mp-2 6 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-20000_2.fq \
--mp-1 7 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-2000_1.fq \
--mp-2 7 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-2000_2.fq \
--mp-1 8 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-50000_1.fq \
--mp-2 8 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-50000_2.fq \
--mp-1 9 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-5000_1.fq \
--mp-2 9 /home/LVP/Pret_In_silico_MP2/cross-mates-2023-04-29/*mp-5000_2.fq \
--careful --threads 20 --memory 450 -o /home/LVP/SPAdes_genome/results
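
The command above relies on each glob (for example `*mp-1000_1.fq`) matching exactly one file. A possible pre-flight check that also builds the `--mp-1`/`--mp-2` arguments is sketched below; `build_mp_args`, the `demo.` file prefix and the two insert sizes are hypothetical and only illustrate the idea.

```bash
#!/bin/bash
# Hypothetical helper: fills the global array mp_args with numbered
# "--mp-1 <#> <file> --mp-2 <#> <file>" arguments, one library per insert size.
build_mp_args() {
    local mp_dir=$1; shift
    local idx=0 size
    local -a m1 m2
    mp_args=()
    for size in "$@"; do
        idx=$((idx + 1))
        m1=( "$mp_dir"/*mp-"${size}"_1.fq )
        m2=( "$mp_dir"/*mp-"${size}"_2.fq )
        # An unmatched glob stays literal, so the -f test catches missing files.
        if [ "${#m1[@]}" -ne 1 ] || [ ! -f "${m1[0]}" ] ||
           [ "${#m2[@]}" -ne 1 ] || [ ! -f "${m2[0]}" ]; then
            echo "Expected exactly one file pair for insert size $size" >&2
            return 1
        fi
        mp_args+=( --mp-1 "$idx" "${m1[0]}" --mp-2 "$idx" "${m2[0]}" )
    done
}

# Toy directory with empty placeholder files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
for size in 1000 10000; do
    touch "$tmp/demo.mp-${size}_1.fq" "$tmp/demo.mp-${size}_2.fq"
done

build_mp_args "$tmp" 1000 10000 && echo "spades.py ${mp_args[*]}"
```

The resulting array can be appended to the `spades.py` call as `"${mp_args[@]}"` once all libraries have been found.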

## Quality control of the assembly

The quality of the parasite genome assembly was assessed with `QUAST` using the following command:

In [None]:
quast -o genome_quality scaffolds.fasta
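
The N50 value reported by QUAST can be cross-checked independently with a few lines of shell. The sketch below is a hypothetical helper, not part of the original workflow; note that QUAST ignores contigs shorter than 500 bp by default, so its numbers may differ from an unfiltered calculation.

```bash
#!/bin/bash
# Hypothetical helper: N50 of all sequences in a FASTA file (no length filter).
n50() {
    # Stage 1: print the length of every sequence (multi-line records supported).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Stage 2: walk the lengths from longest to shortest until half of the
    # total assembly length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= total / 2) { print lens[i]; exit }
             }
         }'
}

# Synthetic toy assembly with sequence lengths 10, 8, 6, 4 and 2.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '>s1\nACGTA\nCGTAC\n>s2\nACGTACGT\n>s3\nACGTAC\n>s4\nACGT\n>s5\nAC\n' > "$tmp/demo.fasta"

result=$(n50 "$tmp/demo.fasta")
echo "N50: $result"   # prints: N50: 8
```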