WGS analysis pipeline

WGS analysis pipeline. Can handle both WGS and WES data.

The whole pipeline uses Singularity images and pulls them from the Singularity library when needed. The Singularity recipes used are provided in the singularity folder for reference.

How to run

The pipeline can be run directly using Nextflow >= v20.10.

nextflow WGS_analysis.nf -profile cluster --operation align --input input_file.txt --mode WGS --ped ped_file.ped --ref genome.fa --cohort_id cohort_name --outdir results

The pipeline automatically infers the number of samples in the cohort from your input file and adjusts the filtering accordingly. When more than one sample is present, small variants and structural variants from all samples are merged into cohort-wide VCF files.
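To see what cohort size to expect before launching, the sample count can be estimated by counting distinct sample IDs in the first column of the input file. This is a minimal sketch and not the pipeline's own code: the function names `count_samples` and `is_multi_sample` are hypothetical, and it assumes the cohort size equals the number of unique IDs in column 1.

```python
from typing import Iterable


def count_samples(lines: Iterable[str]) -> int:
    """Count distinct sample IDs found in the first tab-separated column."""
    sample_ids = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sample_ids.add(line.rstrip("\r\n").split("\t")[0])
    return len(sample_ids)


def is_multi_sample(lines: Iterable[str]) -> bool:
    """True when more than one sample is present (assumed cohort criterion)."""
    return count_samples(lines) > 1
```

Both functions accept any iterable of lines, so an open file handle for input_file.txt can be passed directly.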

If needed, update the singularity_cachedir variable in nextflow.config to point to the folder where Singularity images are stored or will be downloaded.

Arguments

operation   :   align or call_variants
mode        :   only WGS is supported at the moment
ref         :   FASTA file for the genome. Note that the .fai and BWA index files are expected in the same location
input       :   tab-separated file describing input files. 
                The exact format depends on operation requested (see below)
ped         :   standard PED file containing all samples
cohort_id   :   an arbitrary name for the cohort files generated
outdir      :   output folder for results

Use --operation align/call_variants --help for more details.
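When the pipeline is launched from a script, the arguments listed above can be assembled programmatically. The sketch below only builds the command line shown in "How to run"; the helper name `build_command` is hypothetical, and the defaults (`WGS_analysis.nf`, `-profile cluster`, `--mode WGS`) are copied from that example rather than from the pipeline code.

```python
import shlex
from typing import List

VALID_OPERATIONS = ("align", "call_variants")


def build_command(operation: str, input_file: str, ped: str, ref: str,
                  cohort_id: str, outdir: str, mode: str = "WGS",
                  profile: str = "cluster",
                  script: str = "WGS_analysis.nf") -> List[str]:
    """Return the nextflow command as an argument list for subprocess.run."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {VALID_OPERATIONS}")
    return [
        "nextflow", script,
        "-profile", profile,
        "--operation", operation,
        "--input", input_file,
        "--mode", mode,
        "--ped", ped,
        "--ref", ref,
        "--cohort_id", cohort_id,
        "--outdir", outdir,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own absolute paths.
    cmd = build_command("align", "/data/input_file.txt", "/data/ped_file.ped",
                        "/data/genome.fa", "cohort_name", "/data/results")
    print(shlex.join(cmd))  # inspect the command before running it
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting issues.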

Resources

Various supporting files are needed and expected in the resources folder. This path can be configured by changing the parameters in config/resources_GRCh37/38.conf. All files needed are provided in a Zenodo repository. Please refer to the README file in the resources folder.

NB. The available resources are based on GRCh37 with standard chromosomes 1..22 X Y MT and GRCh38 using chr1..22 chrX chrY chrM. Be sure the genome reference file passed with --ref matches the expected nomenclature for your genome build.
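A quick way to check which nomenclature a reference uses is to read the contig names from its .fai index, where the first tab-separated column holds the sequence name. The following is a pre-flight sketch under that assumption; `detect_naming` and the two contig lists are illustrative names, and the pipeline itself may perform different checks or none.

```python
from typing import Iterable, Optional, Set

# Contig names as listed in the note above.
GRCH37_CONTIGS = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
GRCH38_CONTIGS = ["chr" + str(i) for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def contig_names(fai_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from the first column of a .fai index."""
    return {line.split("\t")[0].strip() for line in fai_lines if line.strip()}


def detect_naming(fai_lines: Iterable[str]) -> Optional[str]:
    """Return 'GRCh37' or 'GRCh38' by naming style, or None if neither fits."""
    names = contig_names(fai_lines)
    if set(GRCH37_CONTIGS) <= names:
        return "GRCh37"
    if set(GRCH38_CONTIGS) <= names:
        return "GRCh38"
    return None
```

The result reflects only the naming style of the contigs, not the actual sequence content, so it cannot tell apart two builds that share the same names.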

Input files format

PED file

A standard tab-separated PED file without header, describing all samples provided in the input file. All sample IDs must match between ped and input file. All samples must have sex defined.

family_ID   individual_ID   father_ID   mother_ID   sex(1=M,2=F)    status(1=unaff,2=aff,0=unknown)
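The PED requirements above (six tab-separated columns, sex defined, sample IDs consistent with the input file) can be checked before a run. This is a minimal sketch rather than the pipeline's own validation; `parse_ped` and `compare_sample_ids` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Tuple

PED_COLUMNS = ["family_ID", "individual_ID", "father_ID", "mother_ID",
               "sex", "status"]


def parse_ped(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse a header-less PED file and require a defined sex (1 or 2)."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(PED_COLUMNS):
            raise ValueError(
                f"line {number}: expected 6 columns, found {len(fields)}")
        record = dict(zip(PED_COLUMNS, fields))
        if record["sex"] not in ("1", "2"):
            raise ValueError(
                f"line {number}: sex must be 1 (M) or 2 (F), found {record['sex']!r}")
        records.append(record)
    return records


def compare_sample_ids(records: List[Dict[str, str]],
                       input_ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (IDs missing from the PED, IDs missing from the input file)."""
    ped_ids = {record["individual_ID"] for record in records}
    wanted = set(input_ids)
    return sorted(wanted - ped_ids), sorted(ped_ids - wanted)
```

Both returned lists should be empty when the PED file and the input file describe the same samples.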

input file

Note that all files need to be specified using absolute paths

Operation: align

A 3-column tab-separated file without header

sampleID1   s1_lane1_R1.fastq.gz    s1_lane1_R2.fastq.gz
sampleID1   s1_lane2_R1.fastq.gz    s1_lane2_R2.fastq.gz
sampleID2   s2_lane2_R1.fastq.gz    s2_lane2_R2.fastq.gz

If a sample has been sequenced across multiple pairs of FASTQ files, add one line per pair, using the same sampleID on each line. The pipeline takes care of merging them.
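The layout of the align input file can be checked by grouping FASTQ pairs per sample and verifying that every path is absolute. This sketch only illustrates the file format: `group_fastq_pairs` is a hypothetical helper, POSIX-style paths are assumed, and the actual merge is done by the pipeline.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def group_fastq_pairs(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each sampleID to its list of (R1, R2) FASTQ pairs."""
    pairs = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 columns, found {len(fields)}")
        sample_id, read1, read2 = fields
        for path in (read1, read2):
            # The input file must use absolute paths (POSIX style assumed).
            if not PurePosixPath(path).is_absolute():
                raise ValueError(f"line {number}: not an absolute path: {path}")
        pairs[sample_id].append((read1, read2))
    return dict(pairs)
```

A sample sequenced on two lanes then appears once in the result, with two FASTQ pairs attached.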

Operation: call_variants

A 5-column tab-separated file without header. This file (bam_files.txt) is automatically generated in the output folder when using --operation align.

sampleID1   main_bam.bam    disc.bam    split.bam
sampleID2   main_bam.bam    disc.bam    split.bam
sampleID3   main_bam.bam    disc.bam    split.bam

The disc and split BAM files contain only discordant read pairs and split reads, respectively, like the ones that can be obtained using Samblaster.
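If the BAM list is edited by hand instead of using the generated bam_files.txt, a pre-flight check can catch relative or missing paths. The sketch below treats every column after the sample ID as a file path and makes no assumption about the exact number of columns; `find_path_problems` is a hypothetical helper and POSIX-style paths are assumed.

```python
import os
from pathlib import PurePosixPath
from typing import Callable, Iterable, List


def find_path_problems(lines: Iterable[str],
                       exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Report relative or missing paths in a tab-separated sample file."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        # First column is the sample ID, the remaining ones are file paths.
        sample_id, *paths = line.rstrip("\r\n").split("\t")
        for path in paths:
            if not PurePosixPath(path).is_absolute():
                problems.append(
                    f"line {number} ({sample_id}): not an absolute path: {path}")
            elif not exists(path):
                problems.append(
                    f"line {number} ({sample_id}): file not found: {path}")
    return problems
```

An empty list means every referenced path is absolute and present on disk. The `exists` parameter is there only so the existence check can be replaced, for example in tests.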

Output

The pipeline generates a rich set of outputs, including:

aligned deduplicated BAM files
disc/split BAM files
Extensive QC of alignments, which includes mapping stats, coverage, relatedness, ancestry
Multi sample and single sample VCFs of small variants and structural variants (variants are provided as raw calls and filtered calls)
Variants QC report for small variants
ROH regions
Repeat expansions by Expansion Hunter

Pipeline components

Alignment and duplicate marking
- BWA-MEM + samblaster + samtools
QC and coverage from BAM files
- fastqc: reads stats
- mosdepth: coverage
- samtools flagstat / mapstat: alignment stats
- somalier: ancestry, relatedness, sex check reports
- multiqc: interactive report
small variants
- deepvariant: single sample calls
- glnexus: gvcf merge
structural variants
- lumpy: structural variants events
- CNVnator: CNV estimation
- svtools: combine, merge and classify
repeat expansion detection
- expansion hunter
ROH regions
- bcftools ROH
