@Author: Arnaud N'Guessan
This repository contains a suite of tools for analyzing the within-sample genetic diversity of SARS-CoV-2 in wastewater (WW) samples. It is designed to run on a Compute Canada server, but the scripts can be run on any Linux cluster that uses the Slurm workload manager.
Linux packages: nixpkgs/16.09 gcc/7.3.0 intel/2018.3 blast+/2.9.0 prinseq/0.20.4 fastp/0.20.0 bwa/0.7.17 picard/2.20.6 samtools/1.10 varscan/2.4.1 python/3.6
Python 3.6 modules: "sys", "time", "multiprocessing" and "os" (all part of the standard library)
R (version 3.5.2+) packages: "ggplot2", "seqinr", "grid", "RColorBrewer", "cowplot", "randomcoloR", "gplots", "lmPerm", "ggpubr", "gridExtra", "tidyr", "dendextend", "VennDiagram", "Cairo", "UpSetR", "parallel", "foreach", "doParallel", "infotheo", "glmnet", "FD", "vegan", "ConsReg", "MASS", "leaps", "caret", "mgcv" and "session"
The pipeline is split into two main modules that should be run in the following order:
- Variant calling module
To run this module, submit a job to your cluster's Slurm queue with the following command: 'sbatch submit_samples_to_iPMVC.sh'.
a) Inputs (files you need to copy into the Variant_calling workspace):
-->Two paired-end .fastq files for each sample (file names should end in _R1.fastq and _R2.fastq). You can use single-end fastq files instead, but you then have to modify lines 15-25 of the run_iPMVC_in_parallel.py script accordingly. The fastq files should be in the Variant_calling workspace, and you need to specify the absolute path of the Variant_calling workspace in the 2 *.sh files where 'WORKSPACE' is indicated.
-->A text file with the list of samples named "lst_samples.txt".
b) Output (in the directory WORKSPACE/Variant_calling/out/): VarScan .tab files (tab-delimited) and common_depth_report.csv
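The sample list from a) can be generated from the fastq files themselves. The sketch below is not part of the repository: the function names are hypothetical, and it assumes that "lst_samples.txt" holds one sample name per line, where the sample name is the fastq file name without the _R1.fastq/_R2.fastq suffix. Check these assumptions against run_iPMVC_in_parallel.py before relying on it.

```python
import os

R1_SUFFIX = "_R1.fastq"
R2_SUFFIX = "_R2.fastq"


def list_paired_samples(workspace):
    """Return the sorted names of samples that have both an R1 and an R2 fastq file."""
    files = set(os.listdir(workspace))
    samples = []
    for name in sorted(files):
        if name.endswith(R1_SUFFIX):
            sample = name[:-len(R1_SUFFIX)]
            # Keep the sample only if its mate file is present as well.
            if sample + R2_SUFFIX in files:
                samples.append(sample)
    return samples


def write_sample_list(workspace, filename="lst_samples.txt"):
    """Write one sample name per line (assumed format) and return the names."""
    samples = list_paired_samples(workspace)
    with open(os.path.join(workspace, filename), "w") as handle:
        for sample in samples:
            handle.write(sample + "\n")
    return samples
```

For example, write_sample_list("/absolute/path/to/Variant_calling") would write the list next to the fastq files. Samples with only one of the two mate files are skipped rather than reported.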
- Post-variant-calling analysis module
This module runs various analyses of SARS-CoV-2 within-host diversity (sample coverage, lineage detection, estimation of the within-sample frequency of lineages, etc.).
a) Inputs (files you need to copy into the Post_variant_calling_analysis workspace):
*In the following lines, WORKSPACE refers to the absolute path of the Wastewater_surveillance_pipeline/ folder in your system.
-->A text file with the list of samples named "lst_samples.txt"
-->VarScan .tab files located at "WORKSPACE/Variant_calling/out/"
-->Fasta file of the reference sequence ("MN908947_3.fasta"), which is already included; replace it if you want to use another reference genome.
-->The sample depth report file, which is located at "WORKSPACE/Variant_calling/out/common_depth_report.csv" as it is generated by the Variant calling module
-->A table named "Table_Sample_stratifications_of_interest.csv" for the data stratification analysis. An example is available at "WORKSPACE/Post_variant_calling_analysis/Table_Sample_stratifications_of_interest.csv" (TO UPDATE with new samples)
-->A table named "Metadata_samples.csv", which includes other important metadata about WW samples. An example is available at "WORKSPACE/Post_variant_calling_analysis/Metadata_samples.csv"
-->A table named "df_signature_muts_prevalence_VOCs_VUIs.csv" which indicates the signature mutations of the variants of interest (To update with the most recent data from cov-spectrum or your favourite SARS-CoV-2 lineage tracker website)
-->The tables from LSPQ named "LATEST_REPORT.csv" and "ListeCH_prefix.csv" (ask Héctor Gálvez López or someone in the covseq consortium for access). I added an older version of these files as examples in "WORKSPACE/Post_variant_calling_analysis/".
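Before launching the analysis, it can help to confirm that the inputs listed above are in place. The sketch below is not part of the repository: the function names are hypothetical, and it assumes that every file other than the Variant calling outputs sits directly in "WORKSPACE/Post_variant_calling_analysis/". Adjust the paths if your layout differs.

```python
import os

POST_DIR = "Post_variant_calling_analysis"
VARIANT_OUT_DIR = os.path.join("Variant_calling", "out")

# Paths relative to WORKSPACE. The locations inside POST_DIR are assumptions.
REQUIRED_INPUTS = [
    os.path.join(POST_DIR, "lst_samples.txt"),
    os.path.join(POST_DIR, "MN908947_3.fasta"),
    os.path.join(POST_DIR, "Table_Sample_stratifications_of_interest.csv"),
    os.path.join(POST_DIR, "Metadata_samples.csv"),
    os.path.join(POST_DIR, "df_signature_muts_prevalence_VOCs_VUIs.csv"),
    os.path.join(POST_DIR, "LATEST_REPORT.csv"),
    os.path.join(POST_DIR, "ListeCH_prefix.csv"),
    os.path.join(VARIANT_OUT_DIR, "common_depth_report.csv"),
]


def missing_inputs(workspace):
    """Return the required input paths (relative to workspace) that do not exist."""
    return [
        rel for rel in REQUIRED_INPUTS
        if not os.path.isfile(os.path.join(workspace, rel))
    ]


def count_varscan_tables(workspace):
    """Count the VarScan .tab files produced by the Variant calling module."""
    out_dir = os.path.join(workspace, VARIANT_OUT_DIR)
    if not os.path.isdir(out_dir):
        return 0
    return sum(1 for name in os.listdir(out_dir) if name.endswith(".tab"))
```

An empty list from missing_inputs(WORKSPACE) and a non-zero count from count_varscan_tables(WORKSPACE) suggest that the inputs are where this sketch expects them. It does not validate the contents of the files.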
*You can execute the script in RStudio, but you have to manually set the "output_workspace" variable in the script. To execute the script from the console, go to the directory "WORKSPACE/Post_variant_calling_analysis/" and run the command "Rscript Wastewater_Illumina.r WORKSPACE". Also note that some analyses are implemented so that they can be run with parallel/distributed processing. You can change the variable "nb_cores", which is set at the start of the script, according to your computer's capacity. Keep in mind that more cores consume more memory and do not necessarily decrease run time (see https://docs.computecanada.ca/wiki/Scalability).
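The console invocation above can also be wrapped in a small Python launcher, for instance to call it from another script. This is only a sketch: the function names are hypothetical, and WORKSPACE is passed to the R script exactly as given, since the source does not state whether a trailing slash is expected.

```python
import os
import subprocess


def build_rscript_command(workspace):
    """Return the Rscript command and the directory it should be run from."""
    if not os.path.isabs(workspace):
        raise ValueError("WORKSPACE must be an absolute path")
    script_dir = os.path.join(workspace, "Post_variant_calling_analysis")
    # Same command as in the README: Rscript Wastewater_Illumina.r WORKSPACE
    command = ["Rscript", "Wastewater_Illumina.r", workspace]
    return command, script_dir


def run_post_variant_calling(workspace):
    """Run the analysis script from its own directory; raise if Rscript fails."""
    command, script_dir = build_rscript_command(workspace)
    subprocess.run(command, cwd=script_dir, check=True)
```

Calling run_post_variant_calling with the absolute path of your Wastewater_surveillance_pipeline/ folder is equivalent to changing into the analysis directory and running the Rscript command by hand. The "nb_cores" variable still has to be edited in the R script itself.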