WGLab/PipeVar

PipeVar

PipeVar is a pathogenic variant prioritization workflow for undiagnosed rare diseases. It combines tools developed by WGLab with other software to call structural variants, single-nucleotide variants, indels, and repeat expansions, and to prioritize both candidate and known pathogenic variants.

PipeVar is implemented in Nextflow and can be run using Docker or Singularity. Conda support is under development; for the most consistent results, run PipeVar with Docker or Singularity. Slurm is currently the only supported cluster scheduler, and support for other cluster systems is in progress.

[Figure: Mermaid diagram]

Requirements

PipeVar requires either Docker or Singularity to run. If your system does not have Singularity installed as a module, you can try installing it with conda by creating a dedicated environment:

conda create -n singularity singularity

or install Singularity into your existing conda environment with

conda install conda-forge::singularity
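
Before launching the pipeline, it can help to confirm that one of the two container runtimes is actually on your PATH. The snippet below is a minimal sketch of such a check; `first_available` is a helper name made up for this example and is not part of PipeVar.

```shell
#!/bin/sh
# Print the first command from the argument list that is found on PATH.
# Returns 1 (and prints nothing) if none of them is available.
# NOTE: first_available is an illustrative helper, not part of PipeVar.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Example (not executed here):
#   runtime=$(first_available docker singularity) \
#       || echo "install Docker or Singularity first" >&2
```

Listing `docker` first is an arbitrary choice; swap the order if you prefer Singularity on a shared cluster.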

Set up

Before using PipeVar, two software packages need to be set up. The first is ANNOVAR.

Download ANNOVAR from https://www.openbioinformatics.org/annovar/annovar_download_form.php

Once ANNOVAR is downloaded, run setup.sh in the ANNOVAR folder to download the files ANNOVAR needs.

setup.sh also downloads the necessary PhenoSV model files.

After the PhenoSV model files are downloaded, run setup_config.sh in your folder. It modifies nextflow.config to include the required PhenoSV settings and downloads the datasets needed to run ANNOVAR.

Usage

nextflow run main.nf --bam
nextflow run main.nf --vcf

REQUIRED PARAMETERS:

--bam Path to input BAM file. Cannot be used with VCF option. Must be full path. Requires .bai index file.

--vcf Path to input VCF file. Cannot be used with BAM option. Must be full path.

--ref_fa Reference genome in FASTA format. Must be full path.

--out_prefix Prefix for output files

--note Clinical note text file, used for HPO term extraction. Only needed if HPO terms are not available.

--hpo HPO ID file; note file can be used instead.

Example hpo.txt

HP:0001250
HP:0000750
HP:0001257
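
HPO identifiers have the form `HP:` followed by seven digits, as in the example above. The sketch below checks an HPO file against that pattern before it is passed to `--hpo`. It assumes one ID per line with no blank lines or trailing whitespace, which is stricter than PipeVar may actually require; `check_hpo_file` is a name invented for this example.

```shell
# Succeed only if the file is non-empty and every line is exactly an HPO ID
# of the form "HP:" followed by seven digits (e.g. HP:0001250).
# NOTE: check_hpo_file is an illustrative helper, not part of PipeVar.
check_hpo_file() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "HPO file is missing or empty: $file" >&2
        return 1
    fi
    # grep -v selects the lines that do NOT match the pattern; if it finds
    # any, they are printed with line numbers and the check fails.
    if grep -Evn '^HP:[0-9]{7}$' "$file" >&2; then
        echo "Invalid HPO ID(s) listed above in $file" >&2
        return 1
    fi
    return 0
}

# Example (not executed here):
#   check_hpo_file /data/hpo.txt && echo "hpo.txt looks fine"
```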

OPTIONAL PARAMETERS:

--input_directory Directory containing input files

--output_directory Path to output directory (default: current dir)

--mode <sv|snp> Variant type to analyze (required with --vcf or --bam)

--type <short|long> Input data type: short or long reads (required with --bam)

--light <yes|no> Use lightweight PhenoSV model and NanoCaller (faster, lower memory)

--gq Minimum genotype quality [default: 20]

--ad Minimum allelic depth [default: 15]

--gnomad Max gnomAD allele frequency [default: 0.0001]

--help Print this help message and exit
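
To make the three numeric thresholds concrete, the sketch below tests a single variant's values against the documented defaults for `--gq`, `--ad` and `--gnomad`. It is only an illustration of what the thresholds mean: the function name `passes_filters`, the environment variables `MIN_GQ`, `MIN_AD` and `MAX_GNOMAD`, and the choice of inclusive comparisons are all assumptions, and PipeVar's own filtering step may behave differently.

```shell
# Return 0 if a variant passes all three thresholds, 1 otherwise.
# Arguments: genotype quality, allelic depth, gnomAD allele frequency.
# Thresholds default to the documented values and can be overridden through
# the hypothetical variables MIN_GQ, MIN_AD and MAX_GNOMAD.
# ASSUMPTION: comparisons are inclusive; PipeVar's own filter may differ.
# A non-numeric frequency such as "." evaluates to 0 in awk, so a variant
# with no gnomAD entry passes the frequency check in this sketch.
passes_filters() {
    awk -v gq="$1" -v ad="$2" -v af="$3" \
        -v min_gq="${MIN_GQ:-20}" \
        -v min_ad="${MIN_AD:-15}" \
        -v max_af="${MAX_GNOMAD:-0.0001}" \
        'BEGIN {
            ok = (gq + 0 >= min_gq + 0) &&
                 (ad + 0 >= min_ad + 0) &&
                 (af + 0 <= max_af + 0)
            exit !ok
        }'
}

# Example (not executed here):
#   passes_filters 35 22 0.00002 && echo "keep variant"
```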

EXAMPLES:

1. Long-read full pipeline (SV + SNP + STR):
    nextflow run main.nf \
      --bam /data/sample.bam \
      --ref_fa /refs/hg38.fa \
      --out_prefix patient1 \
      --hpo /data/hpo.txt \
      --type long

2. Short-read full pipeline:
    nextflow run main.nf \
      --bam /data/sample.bam \
      --ref_fa /refs/hg38.fa \
      --out_prefix patient1 \
      --hpo /data/hpo.txt \
      --type short

3. Short-read with lightweight model:
    nextflow run main.nf \
      --bam /data/sample.bam \
      --ref_fa /refs/hg38.fa \
      --out_prefix patient1 \
      --hpo /data/hpo.txt \
      --type short \
      --light yes

4. Variant re-annotation using VCF (SV mode):
    nextflow run main.nf \
      --vcf /data/sample.vcf \
      --ref_fa /refs/hg38.fa \
      --out_prefix patient_sv \
      --hpo /data/hpo.txt \
      --mode sv

5. Auto-extract HPO from clinical notes:
    nextflow run main.nf \
      --bam /data/sample.bam \
      --ref_fa /refs/hg38.fa \
      --out_prefix patient1 \
      --note /data/note.txt \
      --type long
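
The BAM examples above rely on two requirements from the parameter list: the path must be a full path, and a .bai index must be present. The following sketch checks both before a run. It assumes the index sits next to the BAM under one of the two common names (`sample.bam.bai` or `sample.bai`); which of these PipeVar looks for is not stated here, and `check_bam_input` is a name invented for this example.

```shell
# Check that a BAM path is absolute and that an index sits next to it.
# Both common index names are accepted: sample.bam.bai and sample.bai.
# NOTE: check_bam_input is an illustrative helper, not part of PipeVar.
check_bam_input() {
    bam=$1
    case $bam in
        /*) ;;  # absolute path, continue
        *)  echo "BAM path must be absolute: $bam" >&2
            return 1 ;;
    esac
    if [ ! -f "$bam" ]; then
        echo "BAM file not found: $bam" >&2
        return 1
    fi
    if [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ]; then
        return 0
    fi
    echo "No .bai index found for $bam (try: samtools index $bam)" >&2
    return 1
}

# Example (not executed here):
#   check_bam_input /data/sample.bam && nextflow run main.nf --bam /data/sample.bam ...
```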

NOTES:

- At least one of --hpo or --note must be provided.
- If --note is used, --hpo is auto-generated via phenotagger.
- --mode must be specified for VCF input, and helps direct SNV vs SV flow.
- --type is required for BAM input to specify sequencing technology.
- All file paths must be absolute or relative to --input_directory.
- --light yes uses faster, resource-friendly software such as haplotypecaller, NanoCaller and PhenoSV-light.
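
The option rules in these notes can be checked before submitting a job. The sketch below encodes them as a small shell function; it is a hypothetical wrapper-side check (the name `check_pipevar_args` and its positional interface are invented here), not the validation that main.nf itself performs.

```shell
# Validate a planned PipeVar invocation against the option rules above.
# Arguments (pass "" for an option that is not set):
#   $1 bam  $2 vcf  $3 hpo  $4 note  $5 mode  $6 read type
# NOTE: check_pipevar_args is an illustrative helper, not part of PipeVar.
check_pipevar_args() {
    bam=$1 vcf=$2 hpo=$3 note=$4 mode=$5 read_type=$6
    if [ -n "$bam" ] && [ -n "$vcf" ]; then
        echo "--bam and --vcf cannot be used together" >&2; return 1
    fi
    if [ -z "$bam" ] && [ -z "$vcf" ]; then
        echo "one of --bam or --vcf is required" >&2; return 1
    fi
    if [ -z "$hpo" ] && [ -z "$note" ]; then
        echo "at least one of --hpo or --note is required" >&2; return 1
    fi
    if [ -n "$vcf" ] && [ -z "$mode" ]; then
        echo "--mode is required with --vcf" >&2; return 1
    fi
    if [ -n "$bam" ] && [ -z "$read_type" ]; then
        echo "--type is required with --bam" >&2; return 1
    fi
    return 0
}

# Example (not executed here):
#   check_pipevar_args "" /data/sample.vcf /data/hpo.txt "" sv "" && echo "ok"
```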

Software used

PIPELINE MODULES:

SNV CALLING
  - clair3         : Deep learning SNP caller (long-read)
  - nanocaller     : Lightweight long-read SNP caller
  - haplotypecaller: Short-read SNP caller (GATK)
  - deepvariant    : Deep learning short-read SNP caller

SV CALLING
  - cuteSV         : Long-read SV caller
  - sniffles       : Long-read SV caller
  - Manta          : Short-read SV caller
  - truvari        : SV comparison/benchmarking
  - SURVIVOR       : SV merging

STR DETECTION
  - NanoRepeat     : Long-read STR caller
  - ExpansionHunter: Short-read STR detection

PHENOTYPING
  - Phen2gene      : HPO-to-gene mapping
  - phenotagger    : NLP-based clinical note to HPO term conversion
  - PhenoGPT2      : To be implemented

VARIANT RANKING
  - ANNOVAR        : SNV/SV annotation
  - RankVar        : Final SNV ranking
  - Rankscore_analysis: Additional ranking analysis
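
As a reading aid for the SNV CALLING entries, the sketch below maps a read type and the `--light` setting to a caller name. The pairing is inferred from the module descriptions and the `--light` note only; the actual dispatch logic in main.nf is not shown in this README and may differ. The function name `pick_snv_caller` and the default of `no` for the second argument are assumptions.

```shell
# Map read type and the --light setting to an SNV caller name.
# ASSUMPTION: this pairing is inferred from the module descriptions;
# the real selection logic in main.nf may differ.
pick_snv_caller() {
    read_type=$1
    light=${2:-no}   # assumed default when --light is not given
    case "$read_type:$light" in
        long:yes)  echo nanocaller ;;
        long:no)   echo clair3 ;;
        short:yes) echo haplotypecaller ;;
        short:no)  echo deepvariant ;;
        *)  echo "unknown combination: type=$read_type light=$light" >&2
            return 1 ;;
    esac
}

# Example (not executed here):
#   pick_snv_caller short yes
```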

Output

All output is stored in the output directory (--output_directory), or in the launch directory if none is given. The outputs are as follows:

Planned updates

Add a LongPhase process, ACMG guidelines, and PhenoGPT2 for note direction.

About

Pipeline to call phenotype variant
