gridss_filter

gridss_filter is a framework designed to assist in the prioritization of germline structural variants called by the GRIDSS software.

Prerequisites

This pipeline has been developed using the following software versions:

GRIDSS v2.13.2 and RepeatMasker v4.1.5 were used to generate and annotate the input VCF files.
⚠️ If your input data comes from different versions, output formats may differ and could require adjustments.
R v4.4.2: used to develop and run all the scripts in this repository.
You will need this version (or a compatible one) to execute the filtering and processing steps.
R packages required (they will be installed while running the scripts if missing):
- CRAN packages: assertthat, DiagrammeR, DiagrammeRsvg, dplyr, optparse, rsvg, stringr, yaml
- Bioconductor packages: GenomicRanges

Input BAM files must be processed with GRIDSS, and the resulting VCF files must be annotated using gridss_annotate_vcf_repeatmasker.

Installation

To get started, clone this repository:

git clone https://github.com/emunte/gridss_filter.git
cd gridss_filter

Running

`merge_gridss_vcfs.R`

This script processes VCF files annotated with gridss_annotate_vcf_repeatmasker and extracts all variants whose FILTER value is PASS. It generates two data frames:

Variants with a single breakend: entries that represent an isolated breakend.
Variants with two breakends: paired breakends that are linked via ID and MATEID fields and belong to the same EVENT. These are merged into a single row representing the full structural variant.

The goal is to simplify downstream analysis by consolidating paired breakends into unified events and separating them from single breakend calls.
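
The pairing logic described above can be illustrated with a minimal Python sketch. It is not the repository's R code: the dictionary keys, the example IDs, and the choice to treat a record whose mate did not pass the filter as a single breakend are all assumptions made for illustration.

```python
# Sketch: pair PASS breakend records via ID/MATEID; unpaired records stay single.
# Field names and IDs are illustrative, not the script's actual columns.

def split_breakends(records):
    """Return (paired_events, single_breakends) built from PASS records."""
    by_id = {r["ID"]: r for r in records if r["FILTER"] == "PASS"}
    paired, single, seen = [], [], set()
    for rec_id, rec in by_id.items():
        if rec_id in seen:
            continue
        mate = by_id.get(rec.get("MATEID"))
        # A pair needs a mate that points back and shares the same EVENT.
        if (mate is not None and mate.get("MATEID") == rec_id
                and mate["EVENT"] == rec["EVENT"]):
            seen.update({rec_id, mate["ID"]})
            # Both breakends collapse into one row describing the whole event.
            paired.append({
                "EVENT": rec["EVENT"],
                "CHROM_1": rec["CHROM"], "POS_1": rec["POS"],
                "CHROM_2": mate["CHROM"], "POS_2": mate["POS"],
            })
        else:
            # Assumption: a record without a usable mate is reported as single.
            seen.add(rec_id)
            single.append(rec)
    return paired, single

records = [
    {"ID": "ev1o", "MATEID": "ev1h", "EVENT": "ev1", "CHROM": "chr1", "POS": 100, "FILTER": "PASS"},
    {"ID": "ev1h", "MATEID": "ev1o", "EVENT": "ev1", "CHROM": "chr1", "POS": 5000, "FILTER": "PASS"},
    {"ID": "ev2b", "MATEID": None, "EVENT": "ev2", "CHROM": "chr2", "POS": 700, "FILTER": "PASS"},
    {"ID": "ev3b", "MATEID": None, "EVENT": "ev3", "CHROM": "chr3", "POS": 900, "FILTER": "LOW_QUAL"},
]
paired, single = split_breakends(records)
```

Here the two `ev1` records become one row in `paired`, `ev2b` ends up in `single`, and `ev3b` is dropped because it is not PASS.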

Rscript merge_gridss_vcfs.R --txt <paths_vcfs.txt> --output <output_folder> [--cores N]

Arguments

`--txt` (`-t`)

Path to a tab-delimited text file containing VCF paths and sample metadata. The file must have the following format:

path	sample	run	genes.interest
/path/to/sample1_vcf_repeatmasker.vcf	sample1	run1	gene1, gene2, gene3, gene4
/path/to/sample2_vcf_repeatmasker.vcf	sample2	run1	gene1, gene3, gene5
/path/to/sampleN_vcf_repeatmasker.vcf	sampleN	runX	geneY

The path field must contain the full absolute path to each VCF file.
The sample field must match the sample name exactly.
The genes.interest field must list the gene names separated by commas.
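
As a quick sanity check before running the script, the file format above could be validated with a short Python sketch like the one below. The function name `read_sample_sheet` is hypothetical and the checks are only those implied by the format description. The script's own validation may differ.

```python
import csv
import io
import os

REQUIRED = ["path", "sample", "run", "genes.interest"]

def read_sample_sheet(handle):
    """Parse the tab-delimited sample sheet and run basic format checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        if not os.path.isabs(row["path"]):
            raise ValueError(f"path is not absolute: {row['path']}")
        # Split on commas and trim the optional space after each comma.
        genes = [g.strip() for g in row["genes.interest"].split(",") if g.strip()]
        rows.append({"path": row["path"], "sample": row["sample"],
                     "run": row["run"], "genes": genes})
    return rows

example = (
    "path\tsample\trun\tgenes.interest\n"
    "/path/to/sample1_vcf_repeatmasker.vcf\tsample1\trun1\tgene1, gene2, gene3, gene4\n"
    "/path/to/sample2_vcf_repeatmasker.vcf\tsample2\trun1\tgene1, gene3, gene5\n"
)
sheet = read_sample_sheet(io.StringIO(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` example.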

`--output` (`-o`)

Path to the output folder where result files will be stored.

`--cores` (`-c`)

Number of CPU cores to use (optional).

Output

A merge_gridss_vcfs folder will be created inside the specified output directory. Two tab-separated files will be generated:

output_folder/merge_gridss_vcfs/merge_gridss_two_break.txt: Variants involving two breakends that are linked through ID and MATEID.
output_folder/merge_gridss_vcfs/merge_gridss_one_break.txt: Variants with a single breakend (unpaired).

`filter_gridssR.R`

This script reads the two data frames generated by merge_gridss_vcfs.R and applies a series of filters to prioritize germline variants.

Rscript filter_gridssR.R --input <merge_gridss_output_folder> --bedFile <genes.bed> --params <params.yaml> --output <results_folder> --name <project_name> [--mergeVariants N]

Arguments

`--input` (`-i`)

Path to the folder containing the merged VCF results generated by merge_gridss_vcfs.R.

`--bedFile` (`-b`)

Path to a BED file containing the genomic regions of interest.
The file must be tab-delimited and must not contain a header. It should include the following columns:

Chromosome
Start position
End position
Gene name (must match the gene names listed in the genes.interest column of the input .txt file)

Only variants that overlap regions associated with the genes of interest in each sample will be retained.
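
The gene-level overlap check can be pictured with the Python sketch below. It assumes standard BED coordinates (0-based start, end-exclusive) and a 1-based VCF POS, and it tests a single breakend position. How the script itself defines overlap, in particular for variants with two breakends, is not stated here, so treat the helper names and the logic as illustrative only.

```python
def parse_bed(lines):
    """Read headerless, tab-delimited BED lines into (chrom, start, end, gene)."""
    regions = []
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end, gene = line.rstrip("\n").split("\t")[:4]
        regions.append((chrom, int(start), int(end), gene))
    return regions

def overlapping_genes(chrom, pos, regions, genes_of_interest):
    """Genes of interest whose BED interval contains the 1-based position."""
    wanted = set(genes_of_interest)
    return sorted({
        gene for r_chrom, start, end, gene in regions
        # BED [start, end) is 0-based, so 1-based pos must satisfy start < pos <= end.
        if r_chrom == chrom and start < pos <= end and gene in wanted
    })

bed = [
    "chr1\t1000\t2000\tgene1\n",
    "chr1\t5000\t6000\tgene2\n",
    "chr2\t100\t900\tgene3\n",
]
regions = parse_bed(bed)
hits = overlapping_genes("chr1", 1500, regions, ["gene1", "gene3"])
miss = overlapping_genes("chr1", 5500, regions, ["gene1", "gene3"])
```

The second call returns an empty list: the position falls inside the gene2 region, but gene2 is not among that sample's genes of interest.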

`--params` (`-p`)

Path to a YAML file with filtering parameters. Default: params.yaml.

The params.yaml file contains threshold values used during the filtering process. Below is a description of each parameter:

frequency:
Variants found in more than this number of samples will be excluded. The filtering is done based on the combination of CHROM, POS, and ALT.
gt_AF:
Minimum variant allele frequency (genotype-level). The value is expressed as a fraction between 0 and 1, not as a percentage.
similar:
Maximum number of samples in which a variant with the same breakend is allowed before being excluded.
Similarity is based only on the CHROM and POS fields (not ALT), as structural variant notation can differ even for biologically identical events.
This filter helps remove recurrent breakpoints that likely represent the same or a very similar underlying event.
minimumLength:
For variants with two breakends on the same chromosome, only structural variants equal to or larger than this value (in base pairs) will be considered.
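
To make the difference between the thresholds concrete, here is a hedged Python sketch of how they could interact. The threshold values are hypothetical (they are not the values shipped in params.yaml), the field names are illustrative, and both the order of the filters and the definition of length as the distance between the two breakend positions are assumptions.

```python
from collections import defaultdict

# Hypothetical thresholds for illustration only.
params = {"frequency": 2, "gt_AF": 0.1, "similar": 2, "minimumLength": 50}

def recurrent_keys(variants, key_fields, max_samples):
    """Keys observed in more than max_samples distinct samples."""
    samples_by_key = defaultdict(set)
    for v in variants:
        samples_by_key[tuple(v[f] for f in key_fields)].add(v["sample"])
    return {k for k, s in samples_by_key.items() if len(s) > max_samples}

def long_enough(chrom_1, pos_1, chrom_2, pos_2, minimum_length):
    """minimumLength only constrains two-breakend variants on one chromosome."""
    if chrom_1 != chrom_2:
        return True
    # Assumption: length is the distance between the two breakend positions.
    return abs(pos_2 - pos_1) >= minimum_length

def apply_filters(variants, params):
    # frequency uses CHROM, POS and ALT; similar ignores ALT.
    too_frequent = recurrent_keys(variants, ("CHROM", "POS", "ALT"), params["frequency"])
    too_similar = recurrent_keys(variants, ("CHROM", "POS"), params["similar"])
    return [
        v for v in variants
        if (v["CHROM"], v["POS"], v["ALT"]) not in too_frequent
        and (v["CHROM"], v["POS"]) not in too_similar
        and v["AF"] >= params["gt_AF"]
    ]

def make(sample, chrom, pos, alt, af):
    return {"sample": sample, "CHROM": chrom, "POS": pos, "ALT": alt, "AF": af}

variants = [
    # Same CHROM, POS and ALT in three samples: removed by frequency.
    make("s1", "chr1", 100, "A.", 0.4), make("s2", "chr1", 100, "A.", 0.5),
    make("s3", "chr1", 100, "A.", 0.45),
    # Same CHROM and POS but different ALT in three samples: removed by similar.
    make("s1", "chr2", 200, "C.", 0.3), make("s2", "chr2", 200, "CT.", 0.3),
    make("s3", "chr2", 200, "CTT.", 0.3),
    # Allele fraction below gt_AF: removed.
    make("s1", "chr3", 300, "G.", 0.05),
    # Unique and well supported: kept.
    make("s2", "chr3", 400, "T.", 0.5),
]
kept = apply_filters(variants, params)
```

Only the last variant survives. The chr2 group shows why `similar` exists: each ALT string is unique, so `frequency` alone would not catch the shared breakpoint.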

`--mergeVariants` (`-m`)

Defines a window (in base pairs) around the POS of single breakend variants annotated by RepeatMasker.
Variants within this window and sharing the same RepeatMasker annotation are considered the same event when calculating recurrence (similar threshold).
This does not merge variants in the output, only in how recurrence is counted.
Set to 0 to disable. Default: 4.
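
The effect of the window on recurrence counting can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's algorithm: the window is taken as a distance of up to N bp on either side of POS, the `repeat` key is a hypothetical stand-in for the RepeatMasker annotation, and each variant is simply compared against all others.

```python
def same_event(a, b, window):
    """Decide whether two single breakends count as the same event."""
    if a["CHROM"] != b["CHROM"]:
        return False
    if a["POS"] == b["POS"]:
        return True  # identical CHROM and POS always match
    # Assumption: "within the window" means at most `window` bp apart.
    return (window > 0 and a["repeat"] == b["repeat"]
            and abs(a["POS"] - b["POS"]) <= window)

def recurrence_counts(variants, window):
    """For each variant, count the distinct samples carrying the same event."""
    return [
        len({o["sample"] for o in variants if same_event(v, o, window)})
        for v in variants
    ]

variants = [
    {"sample": "s1", "CHROM": "chr1", "POS": 1000, "repeat": "AluYa5"},
    {"sample": "s2", "CHROM": "chr1", "POS": 1003, "repeat": "AluYa5"},
    {"sample": "s3", "CHROM": "chr1", "POS": 1010, "repeat": "AluYa5"},
    {"sample": "s4", "CHROM": "chr1", "POS": 1002, "repeat": "L1HS"},
]
with_window = recurrence_counts(variants, window=4)
disabled = recurrence_counts(variants, window=0)
```

With a window of 4, the first two variants are counted as recurring in two samples even though their positions differ by 3 bp. With the window set to 0, every variant is counted once. The list of variants itself is unchanged in both cases, which mirrors the point that merging affects only the counting.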

`--output` (`-o`)

Path to the output directory. Two subfolders will be created inside:

filtered_variants/: contains the final variants to be visually inspected.
Two Excel files will be generated:
- filtered_variants_one_break.xlsx
- filtered_variants_two_break.xlsx
plots/: contains diagrams illustrating which variants were removed at each filtering step.

`--name` (`-n`)

Name to identify the output files. This string will be used as a prefix for the final variant files, e.g.:

<name>_filtered_variants_one_break.xlsx
<name>_filtered_variants_two_break.xlsx
