gridss_filter

gridss_filter is a framework designed to assist in the prioritization of germline structural variants called by the GRIDSS software.

Prerequisites

This pipeline has been developed using the following software versions:

  • GRIDSS v2.13.2 and RepeatMasker v4.1.5 were used to generate and annotate the input VCF files.
    ⚠️ If your input data comes from different versions, output formats may differ and could require adjustments.

  • R v4.4.2: used to develop and run all the scripts in this repository.
    You will need this version (or a compatible one) to execute the filtering and processing steps.

  • R packages required (they will be installed while running the scripts if missing):

    • CRAN packages: assertthat, DiagrammeR, DiagrammeRsvg, dplyr, optparse, rsvg, stringr, yaml
    • Bioconductor packages: GenomicRanges

Input BAM files must first be processed with GRIDSS, and the resulting VCF files must be annotated using gridss_annotate_vcf_repeatmasker.


Installation

To get started, clone this repository:

git clone https://github.com/emunte/gridss_filter.git
cd gridss_filter

Running

merge_gridss_vcfs.R

This script processes VCF files annotated with gridss_annotate_vcf_repeatmasker and extracts all variants whose FILTER field is PASS. It generates two data frames:

  • Variants with a single breakend: entries that represent an isolated breakend.
  • Variants with two breakends: paired breakends that are linked via ID and MATEID fields and belong to the same EVENT. These are merged into a single row representing the full structural variant.

The goal is to simplify downstream analysis by consolidating paired breakends into unified events and separating them from single breakend calls.
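The pairing logic described above can be illustrated with a minimal Python sketch. It is not the repository's R implementation: the record keys, the merged column names (CHROM_1, POS_1, ...) and the example IDs are hypothetical, and treating a breakend whose mate did not pass the filter as a single breakend is an assumption.

```python
def split_breakends(records):
    """Split PASS records into single breakends and merged breakend pairs.

    Each record is a dict with hypothetical keys: ID, MATEID (None for a
    single breakend), EVENT, CHROM, POS and FILTER.
    """
    passed = [r for r in records if r["FILTER"] == "PASS"]
    by_id = {r["ID"]: r for r in passed}
    singles, pairs, seen = [], [], set()
    for rec in passed:
        if rec["ID"] in seen:
            continue  # already merged as the mate of an earlier record
        mate = by_id.get(rec.get("MATEID"))
        # A pair needs a reciprocal ID/MATEID link and a shared EVENT.
        if mate and mate.get("MATEID") == rec["ID"] and mate["EVENT"] == rec["EVENT"]:
            seen.update({rec["ID"], mate["ID"]})
            pairs.append({
                "EVENT": rec["EVENT"],
                "CHROM_1": rec["CHROM"], "POS_1": rec["POS"],
                "CHROM_2": mate["CHROM"], "POS_2": mate["POS"],
            })
        else:
            # Assumption: a breakend without a PASS mate is kept as a single.
            singles.append(rec)
    return singles, pairs


# Made-up records for illustration only.
records = [
    {"ID": "ev1o", "MATEID": "ev1h", "EVENT": "ev1", "CHROM": "chr1", "POS": 1000, "FILTER": "PASS"},
    {"ID": "ev1h", "MATEID": "ev1o", "EVENT": "ev1", "CHROM": "chr1", "POS": 5000, "FILTER": "PASS"},
    {"ID": "ev2b", "MATEID": None, "EVENT": "ev2", "CHROM": "chr2", "POS": 300, "FILTER": "PASS"},
    {"ID": "ev3b", "MATEID": None, "EVENT": "ev3", "CHROM": "chr3", "POS": 700, "FILTER": "LOW_QUAL"},
]
singles, pairs = split_breakends(records)
```

Here the two linked records collapse into one row, the unpaired PASS record stays on its own, and the non-PASS record is dropped.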

Rscript merge_gridss_vcfs.R --txt <paths_vcfs.txt> --output <output_folder> [--cores N]

Arguments

--txt (-t)

Path to a tab-delimited text file containing VCF paths and sample metadata. The file must have the following format:

path sample run genes.interest
/path/to/sample1_vcf_repeatmasker.vcf sample1 run1 gene1, gene2, gene3, gene4
/path/to/sample2_vcf_repeatmasker.vcf sample2 run1 gene1, gene3, gene5
/path/to/sampleN_vcf_repeatmasker.vcf sampleN runX geneY

  • The path field must contain the full absolute path to each VCF file.
  • The sample field should match the exact sample name.
  • The genes.interest field should list the genes of interest as comma-separated names.
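Before running the script, it can help to check that the sample sheet parses as expected. The following is a small Python sketch, independent of the repository's code, that reads the tab-delimited format shown above and splits genes.interest into a list. The function name read_sample_sheet is hypothetical.

```python
import csv
import io


def read_sample_sheet(handle):
    """Parse the tab-delimited sample sheet into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = {"path", "sample", "run", "genes.interest"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Split the comma-separated gene list and trim surrounding spaces.
        genes = [g.strip() for g in row["genes.interest"].split(",")]
        row["genes.interest"] = [g for g in genes if g]
        rows.append(row)
    return rows


# In-memory copy of the example above; a real file would be opened instead.
example = (
    "path\tsample\trun\tgenes.interest\n"
    "/path/to/sample1_vcf_repeatmasker.vcf\tsample1\trun1\tgene1, gene2, gene3, gene4\n"
    "/path/to/sample2_vcf_repeatmasker.vcf\tsample2\trun1\tgene1, gene3, gene5\n"
)
rows = read_sample_sheet(io.StringIO(example))
```

A sheet that uses spaces instead of tabs would fail the column check, which is the most common formatting mistake this kind of validation catches.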

--output (-o)

Path to the output folder where result files will be stored.

--cores (-c)

Number of CPU cores to use. Optional.

Output

A merge_gridss_vcfs folder will be created inside the specified output directory. Two tab-separated files will be generated:

  • output_folder/merge_gridss_vcfs/merge_gridss_two_break.txt: Variants involving two breakends that are linked through ID and MATEID.
  • output_folder/merge_gridss_vcfs/merge_gridss_one_break.txt: Variants with a single breakend (unpaired).

filter_gridssR.R

This script reads the two data frames generated by merge_gridss_vcfs.R and applies a series of filters to prioritize germline variants.

Rscript filter_gridss.R --input <merge_gridss_output_folder> --bedFile <genes.bed> --params <params.yaml> --output <results_folder> --name <project_name> [--mergeVariants N]

Arguments

--input (-i)

Path to the folder containing the merged VCF results generated by merge_gridss_vcfs.R.

--bedFile (-b)

Path to a BED file containing the genomic regions of interest.
The file must be tab-delimited and must not contain a header. It should include the following columns:

  1. Chromosome
  2. Start position
  3. End position
  4. Gene name (must match the gene names listed in the genes.interest column of the input .txt file)

Only variants that overlap regions associated with the genes of interest in each sample will be retained.
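As an illustration of the region filter, here is a minimal Python sketch that loads a four-column BED file and tests whether a breakend position falls inside a region belonging to one of the sample's genes of interest. It is a sketch under stated assumptions, not the script's actual logic: the function names are hypothetical, it checks a single position only, and it assumes standard BED coordinates (0-based, half-open) against 1-based VCF positions.

```python
def load_bed(lines):
    """Parse headerless, tab-delimited BED lines into (chrom, start, end, gene)."""
    regions = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, gene = line.rstrip("\n").split("\t")[:4]
        regions.append((chrom, int(start), int(end), gene))
    return regions


def overlaps_gene_of_interest(chrom, pos, genes, regions):
    """Return True if a 1-based position lies in a region of one of `genes`."""
    # BED intervals are 0-based and half-open, so a 1-based position p is
    # inside [start, end) when start < p <= end.
    return any(
        c == chrom and g in genes and s < pos <= e
        for c, s, e, g in regions
    )


# Made-up regions for illustration only.
regions = load_bed(["chr17\t100\t200\tgene1", "chr13\t500\t900\tgene2"])
inside = overlaps_gene_of_interest("chr17", 150, {"gene1"}, regions)
other_gene = overlaps_gene_of_interest("chr17", 150, {"gene2"}, regions)
at_start = overlaps_gene_of_interest("chr17", 100, {"gene1"}, regions)
at_end = overlaps_gene_of_interest("chr17", 200, {"gene1"}, regions)
```

The gene check is what makes the filter sample-specific: the same position is retained for a sample that lists gene1 and discarded for a sample that does not.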

--params (-p)

Path to a YAML file with filtering parameters. Default: params.yaml.

The params.yaml file contains threshold values used during the filtering process. Below is a description of each parameter:

  • frequency:
    Variants found in more than this number of samples will be excluded. The filtering is done based on the combination of CHROM, POS, and ALT.

  • gt_AF:
    Minimum variant allele frequency (genotype-level). The value is expressed as a fraction between 0 and 1, not as a percentage.

  • similar:
    Maximum number of samples in which a variant with the same breakend is allowed before being excluded.
    Similarity is based only on the CHROM and POS fields (not ALT), as structural variant notation can differ even for biologically identical events.
    This filter helps remove recurrent breakpoints that likely represent the same or a very similar underlying event.

  • minimumLength:
    For variants with two breakends on the same chromosome, only structural variants equal to or larger than this value (in base pairs) will be considered.
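The four thresholds can be read as the following Python sketch. It only illustrates the parameter descriptions above: the variant keys, the order in which the checks run, and the threshold values in the example are assumptions and are not taken from the repository or its default params.yaml.

```python
from collections import defaultdict


def apply_thresholds(variants, params):
    """Keep the variants that pass the four thresholds described above.

    Each variant is a dict with hypothetical keys: sample, CHROM, POS, ALT,
    AF, plus CHROM_2 and POS_2 for two-breakend variants.
    """
    exact = defaultdict(set)     # samples per (CHROM, POS, ALT)
    position = defaultdict(set)  # samples per (CHROM, POS)
    for v in variants:
        exact[(v["CHROM"], v["POS"], v["ALT"])].add(v["sample"])
        position[(v["CHROM"], v["POS"])].add(v["sample"])

    kept = []
    for v in variants:
        if len(exact[(v["CHROM"], v["POS"], v["ALT"])]) > params["frequency"]:
            continue  # seen in too many samples with the same ALT
        if len(position[(v["CHROM"], v["POS"])]) > params["similar"]:
            continue  # recurrent breakend position, regardless of ALT
        if v["AF"] < params["gt_AF"]:
            continue  # allele frequency below the minimum fraction
        same_chrom = v.get("CHROM_2") == v["CHROM"]
        if same_chrom and abs(v["POS_2"] - v["POS"]) < params["minimumLength"]:
            continue  # intrachromosomal event shorter than the minimum
        kept.append(v)
    return kept


# Illustrative thresholds and made-up variants.
params = {"frequency": 2, "gt_AF": 0.1, "similar": 3, "minimumLength": 50}
shared = {"CHROM": "chr1", "POS": 100, "ALT": "A[chr1:900[", "AF": 0.5,
          "CHROM_2": "chr1", "POS_2": 900}
variants = [
    {"sample": "s1", **shared},
    {"sample": "s2", **shared},
    {"sample": "s3", **shared},
    {"sample": "s1", "CHROM": "chr2", "POS": 500, "ALT": "T[chr2:520[",
     "AF": 0.4, "CHROM_2": "chr2", "POS_2": 520},
    {"sample": "s1", "CHROM": "chr3", "POS": 700, "ALT": "G.", "AF": 0.05},
    {"sample": "s2", "CHROM": "chr4", "POS": 50, "ALT": "C.", "AF": 0.3},
]
kept = apply_thresholds(variants, params)
```

In this example the chr1 variant is dropped because it appears in three samples, the chr2 variant because it spans only 20 bp, and the chr3 variant because its allele frequency is below 0.1, leaving the chr4 single breakend.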

--mergeVariants (-m)

Defines a window (in base pairs) around the POS of single breakend variants annotated by RepeatMasker.
Variants within this window and sharing the same RepeatMasker annotation are considered the same event when calculating recurrence (similar threshold).
This does not merge variants in the output, only in how recurrence is counted.
Set to 0 to disable. Default: 4.
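A rough Python sketch of how such a window could affect the recurrence count is shown below. It is an interpretation of the description above, not the script's code: the record keys are hypothetical, and "within the window" is assumed to mean an absolute POS difference of at most the window size.

```python
def count_recurrence(target, variants, window=4):
    """Count the samples carrying a single breakend treated as `target`'s event.

    Hypothetical keys: sample, CHROM, POS, repeat (RepeatMasker annotation,
    or None when the breakend has no annotation).
    """
    samples = set()
    for v in variants:
        if v["CHROM"] != target["CHROM"]:
            continue
        same_pos = v["POS"] == target["POS"]
        # Assumption: nearby calls count only when both share the same
        # RepeatMasker annotation and lie within `window` bp of each other.
        near = (
            window > 0
            and v["repeat"] is not None
            and v["repeat"] == target["repeat"]
            and abs(v["POS"] - target["POS"]) <= window
        )
        if same_pos or near:
            samples.add(v["sample"])
    return len(samples)


# Made-up single breakend calls for illustration only.
calls = [
    {"sample": "s1", "CHROM": "chr5", "POS": 1000, "repeat": "AluY"},
    {"sample": "s2", "CHROM": "chr5", "POS": 1003, "repeat": "AluY"},
    {"sample": "s3", "CHROM": "chr5", "POS": 1002, "repeat": "L1HS"},
    {"sample": "s4", "CHROM": "chr5", "POS": 1010, "repeat": "AluY"},
]
with_window = count_recurrence(calls[0], calls, window=4)
without_window = count_recurrence(calls[0], calls, window=0)
```

With a 4 bp window the calls from s1 and s2 are counted as one recurrent event, while the differently annotated call and the more distant call are not; with the window disabled only exact position matches are counted.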

--output (-o)

Path to the output directory. Two subfolders will be created inside:

  • filtered_variants/: contains the final variants to be visually inspected.
    Two Excel files will be generated:

    • filtered_variants_one_break.xlsx
    • filtered_variants_two_break.xlsx
  • plots/: contains diagrams illustrating which variants were removed at each filtering step.

--name (-n)

Name to identify the output files. This string will be used as a prefix for the final variant files, e.g.:

  • <name>_filtered_variants_one_break.xlsx
  • <name>_filtered_variants_two_break.xlsx
