garcia-nacho/FHI_SC2_Pipeline_Illumina

FHI's SARS-CoV-2 Illumina Pipeline

Bioinformatic pipeline for SARS-CoV-2 sequence analysis used at Folkehelseinstituttet (FHI, the Norwegian Institute of Public Health)

Description

Docker-based solution for sequence analysis of SARS-CoV-2 Illumina samples

Primer schemes supported

ArticV3
ArticV4

Installation

git clone https://github.com/garcia-nacho/FHI_SC2_Pipeline_Illumina
cd FHI_SC2_Pipeline_Illumina
docker build -t garcianacho/fhisc2:Illumina .

Note that building the image for the first time can take up to two hours.

Alternatively, it is possible to pull updated builds from Docker Hub:

docker pull garcianacho/fhisc2:Illumina

Running the pipeline

ArticV4:
docker run -it --rm -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V12.sh ArticV4

ArticV3:
docker run -it --rm -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V12.sh ArticV3

Note that older versions of Docker might require the --privileged flag, and that multi-user systems might require the -u 1000 flag, for the container to run.
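
As an illustration of how the optional flags fit into the commands above, here is a minimal sketch of a helper that prints the command for a given primer scheme. The function name fhi_run_cmd and the option keywords privileged and user are hypothetical and not part of the pipeline. Only the docker flags, the image name and the script name come from this README.

```shell
# Hypothetical helper (not part of the pipeline): prints the docker command
# for a primer scheme, optionally adding the flags mentioned above.
# Usage: fhi_run_cmd <ArticV3|ArticV4> [privileged] [user]
fhi_run_cmd() {
  scheme="$1"
  shift
  # Only the primer schemes listed in this README are accepted.
  case "$scheme" in
    ArticV3|ArticV4) ;;
    *) echo "unsupported primer scheme: $scheme" >&2; return 1 ;;
  esac
  flags=""
  for opt in "$@"; do
    case "$opt" in
      privileged) flags="$flags --privileged" ;;  # older Docker versions
      user)       flags="$flags -u 1000" ;;       # multi-user systems
      *) echo "unknown option: $opt" >&2; return 1 ;;
    esac
  done
  # The current directory is mounted as the Fastq folder inside the container.
  echo "docker run -it --rm$flags -v $(pwd):/home/docker/Fastq garcianacho/fhisc2:Illumina SARS-CoV-2_Illumina_Docker_V12.sh $scheme"
}
```

Calling fhi_run_cmd ArticV4 privileged user from the experiment folder prints the full command with both flags so it can be reviewed before running. If the current path contains spaces, the -v argument needs quoting when the command is executed.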

The script expects the following folder structure, in which the fastq.gz files of each sample are placed in their own folder:

./ExpXX    
  |-ExperimentXX.xlsx      
  |-Sample1     
      |-Sample1_SX_LXXXX_R1.fastq.gz       
      |-Sample1_SX_LXXXX_R2.fastq.gz      
  |-Sample2      
      |-Sample2_SX_LXXXX_R1.fastq.gz   
      |-Sample2_SX_LXXXX_R2.fastq.gz   
  |-Sample3
      |-Sample3_SX_LXXXX_R1.fastq.gz
      |-Sample3_SX_LXXXX_R2.fastq.gz
  |-...   
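
If the sequencer output arrives as a flat folder of fastq.gz files, the layout above can be produced with a small script. The following is a minimal sketch, not part of the pipeline: the function name sort_fastq_into_samples is hypothetical, and it assumes that every file follows the <Sample>_SX_LXXXX_R1/R2.fastq.gz naming shown in the tree, so that the sample name is everything before the _S..._L..._R suffix.

```shell
# Hypothetical helper (not part of the pipeline): moves FASTQ files from a
# flat experiment folder into one subfolder per sample.
# Usage: sort_fastq_into_samples <experiment_dir>
sort_fastq_into_samples() {
  exp_dir="$1"
  for f in "$exp_dir"/*_R[12].fastq.gz; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    base=$(basename "$f")
    # Strip the _SX_LXXXX_R1.fastq.gz (or R2) suffix to get the sample name.
    sample="${base%_S*_L*_R[12].fastq.gz}"
    if [ "$sample" = "$base" ]; then
      echo "skipping (unexpected file name): $base" >&2
      continue
    fi
    mkdir -p "$exp_dir/$sample"
    mv "$f" "$exp_dir/$sample/"
  done
}
```

Running sort_fastq_into_samples ./ExpXX leaves the .xlsx file in place and creates one folder per sample. Files whose names do not match the assumed pattern are reported and left untouched.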

The script also expects an .xlsx file in the experiment folder. This file contains the position of each sample on a 96-well plate and its DNA concentration (alternatively, this column can hold the Ct values). If the file is not properly formatted, the script will still run without errors, but the quality-control plot will either not be generated or will contain errors. Note that the script takes the name of the experiment from the name of the .xlsx file. If the file is not found, the names of the output files might be incorrect. It is possible to download a template of the xlsx file here
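
Because a missing or misplaced file does not stop the script, it can help to check the experiment folder before starting a run. The following is a minimal sketch, not part of the pipeline: the function names count_matches and check_experiment_dir are hypothetical, and the check assumes exactly one .xlsx file per experiment and exactly one R1/R2 pair per sample folder, as in the folder structure described in this README. The pipeline itself may be more permissive. The check does not inspect the contents of the .xlsx file.

```shell
# Hypothetical helpers (not part of the pipeline).

# Prints how many of the given paths exist (an unmatched glob counts as 0).
count_matches() {
  n=0
  for p in "$@"; do
    [ -e "$p" ] && n=$((n + 1))
  done
  echo "$n"
}

# Usage: check_experiment_dir <experiment_dir>
# Returns 0 if the layout matches the assumptions above, 1 otherwise.
check_experiment_dir() {
  exp_dir="$1"
  rc=0
  # Assumption: exactly one .xlsx file, since it names the experiment.
  n_xlsx=$(count_matches "$exp_dir"/*.xlsx)
  if [ "$n_xlsx" -ne 1 ]; then
    echo "expected exactly one .xlsx file, found $n_xlsx" >&2
    rc=1
  fi
  # Assumption: each sample folder holds one R1 and one R2 file.
  for d in "$exp_dir"/*/; do
    [ -d "$d" ] || continue
    n_r1=$(count_matches "$d"*_R1.fastq.gz)
    n_r2=$(count_matches "$d"*_R2.fastq.gz)
    if [ "$n_r1" -ne 1 ] || [ "$n_r2" -ne 1 ]; then
      echo "$d: found $n_r1 R1 and $n_r2 R2 files, expected one of each" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, check_experiment_dir ./ExpXX && echo "layout OK" prints a confirmation only when every sample folder and the .xlsx file are in place.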

Outputs

-Summary including mutations found, pangolin lineage, number of reads, coverage, depth, etc.
-Bam files
-Consensus sequences
-Aligned consensus sequences
-Consensus nucleotide sequence for gene S
-Indels and frameshift identification
-Quality-control plot for the plate to detect possible contaminations
-Phylogenetic-tree plot of the samples
-Noise during variant calling across the genome
-Quality-control for contaminations/low-quality samples
-Amplicon efficacy of the selected primer-set for all the samples