gridss_filter is a framework designed to assist in the prioritization of germline structural variants called by the GRIDSS software.
This pipeline has been developed using the following software versions:
- GRIDSS v2.13.2 and RepeatMasker v4.1.5 were used to generate and annotate the input VCF files.
  ⚠️ If your input data comes from different versions, output formats may differ and could require adjustments.
- R v4.4.2: used to develop and run all the scripts in this repository.
  You will need this version (or a compatible one) to execute the filtering and processing steps.
- R packages required (they will be installed while running the scripts if missing):
  - CRAN packages: assertthat, DiagrammeR, DiagrammeRsvg, dplyr, optparse, rsvg, stringr, yaml
  - Bioconductor packages: GenomicRanges
Input BAM files must first be processed with GRIDSS, and the resulting VCF files annotated using gridss_annotate_vcf_repeatmasker.
To get started, clone this repository:

    git clone https://github.com/emunte/gridss_filter.git
    cd gridss_filter

The `merge_gridss_vcfs.R` script processes VCF files annotated with `gridss_annotate_vcf_repeatmasker` and extracts all variants with a `PASS` filter. It generates two data frames:
- Variants with a single breakend: entries that represent an isolated breakend.
- Variants with two breakends: paired breakends that are linked via the `ID` and `MATEID` fields and belong to the same `EVENT`. These are merged into a single row representing the full structural variant.
The goal is to simplify downstream analysis by consolidating paired breakends into unified events and separating them from single breakend calls.
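
The pairing logic described above can be illustrated with a minimal Python sketch. It is not the implementation used by `merge_gridss_vcfs.R`: the dictionary-based record layout, the `split_breakends` helper, and the example IDs are assumptions made for illustration. In this sketch, a breakend whose mate is missing or did not pass the filter is treated as a single breakend, which may differ from what the script does.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_breakends(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split PASS records into single breakends and merged breakend pairs."""
    passed = [r for r in records if r["FILTER"] == "PASS"]
    by_id = {r["ID"]: r for r in passed}
    singles: List[Record] = []
    pairs: List[Record] = []
    seen = set()
    for r in passed:
        if r["ID"] in seen:
            continue  # already merged as the mate of an earlier record
        mate = by_id.get(r.get("MATEID", ""))
        linked = (
            mate is not None
            and mate is not r
            and mate.get("MATEID") == r["ID"]
            and mate.get("EVENT") == r.get("EVENT")
        )
        if linked:
            seen.update({r["ID"], mate["ID"]})
            # One row per structural variant, holding both breakends.
            pairs.append({
                "EVENT": r["EVENT"],
                "CHROM1": r["CHROM"], "POS1": r["POS"], "ALT1": r["ALT"],
                "CHROM2": mate["CHROM"], "POS2": mate["POS"], "ALT2": mate["ALT"],
            })
        else:
            singles.append(r)
    return singles, pairs


# Hypothetical records; IDs and coordinates are made up for the example.
records = [
    {"ID": "bnd_1o", "MATEID": "bnd_1h", "EVENT": "bnd_1", "FILTER": "PASS",
     "CHROM": "chr1", "POS": "1000", "ALT": "A[chr1:5000["},
    {"ID": "bnd_1h", "MATEID": "bnd_1o", "EVENT": "bnd_1", "FILTER": "PASS",
     "CHROM": "chr1", "POS": "5000", "ALT": "]chr1:1000]T"},
    {"ID": "bnd_2b", "EVENT": "bnd_2", "FILTER": "PASS",
     "CHROM": "chr2", "POS": "300", "ALT": "G."},
    {"ID": "bnd_3b", "EVENT": "bnd_3", "FILTER": "LOW_QUAL",
     "CHROM": "chr3", "POS": "42", "ALT": "C."},
]
singles, pairs = split_breakends(records)
```

Here `pairs` holds one merged row for the `bnd_1` event, `singles` holds the unpaired `bnd_2b` breakend, and the non-PASS record is dropped.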

    Rscript merge_gridss_vcfs.R --txt <paths_vcfs.txt> --output <output_folder> [--cores N]

`--txt`: Path to a tab-delimited text file containing VCF paths and sample metadata. The file must have the following format:
| path | sample | run | genes.interest |
|---|---|---|---|
| /path/to/sample1_vcf_repeatmasker.vcf | sample1 | run1 | gene1, gene2, gene3, gene4 |
| /path/to/sample2_vcf_repeatmasker.vcf | sample2 | run1 | gene1, gene3, gene5 |
| /path/to/sampleN_vcf_repeatmasker.vcf | sampleN | runX | geneY |
- The `path` field must contain the full absolute path to each VCF file.
- The `sample` field (`sample1` to `sampleN` above) must match the exact sample name.
- The `genes.interest` field must list gene names separated by commas.
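
As a rough illustration of these format rules, the following Python sketch parses such a file and checks it. It assumes the file starts with a header row using the column names shown in the table; the `read_sample_sheet` helper and the example contents are hypothetical and not part of this repository.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical file contents, mirroring the table above (tab-delimited).
EXAMPLE = (
    "path\tsample\trun\tgenes.interest\n"
    "/path/to/sample1_vcf_repeatmasker.vcf\tsample1\trun1\tgene1, gene2, gene3, gene4\n"
    "/path/to/sample2_vcf_repeatmasker.vcf\tsample2\trun1\tgene1, gene3, gene5\n"
)


def read_sample_sheet(handle):
    """Parse the tab-delimited sheet and validate each row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if not PurePosixPath(row["path"]).is_absolute():
            raise ValueError(f"path must be absolute: {row['path']}")
        # Split the comma-separated gene list and drop surrounding spaces.
        genes = [g.strip() for g in row["genes.interest"].split(",") if g.strip()]
        rows.append({
            "path": row["path"],
            "sample": row["sample"],
            "run": row["run"],
            "genes": genes,
        })
    return rows


rows = read_sample_sheet(io.StringIO(EXAMPLE))
```

Running a check like this before launching the pipeline can catch relative paths or malformed gene lists early.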
`--output`: Path to the output folder where result files will be stored.

`--cores`: Number of cores to use (optional).
A `merge_gridss_vcfs` folder will be created inside the specified output directory. Two tab-separated files will be generated:
- `output_folder/merge_gridss_vcfs/merge_gridss_two_break.txt`: Variants involving two breakends that are linked through `ID` and `MATEID`.
- `output_folder/merge_gridss_vcfs/merge_gridss_one_break.txt`: Variants with a single breakend (unpaired).
The `filter_gridss.R` script reads the two data frames generated by `merge_gridss_vcfs.R` and applies a series of filters to prioritize germline variants.

    Rscript filter_gridss.R --input <merge_gridss_output_folder> --bedFile <genes.bed> --params <params.yaml> --output <results_folder> --name <project_name> [--mergeVariants N]

`--input`: Path to the folder containing the merged VCF results generated by `merge_gridss_vcfs.R`.

`--bedFile`: Path to a BED file containing the genomic regions of interest.
The file must be tab-delimited and must not contain a header. It should include the following columns:
- Chromosome
- Start position
- End position
- Gene name (must match the gene names listed in the `genes.interest` column of the input `.txt` file)
Only variants that overlap regions associated with the genes of interest in each sample will be retained.
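
The per-sample overlap rule can be sketched in Python as follows. This is an illustration only: the repository lists GenomicRanges as a dependency, and its exact coordinate handling is not documented here. The sketch assumes standard BED conventions (0-based, half-open intervals) against 1-based VCF positions, and the `parse_bed` and `overlaps_gene_of_interest` helpers are hypothetical names.

```python
def parse_bed(text):
    """Parse headerless, tab-delimited BED text into (chrom, start, end, gene)."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, gene = line.split("\t")[:4]
        regions.append((chrom, int(start), int(end), gene))
    return regions


def overlaps_gene_of_interest(chrom, pos, genes, regions):
    """True if a 1-based position falls in a region of one of the sample's genes."""
    return any(
        c == chrom and start < pos <= end and gene in genes
        for c, start, end, gene in regions
    )


# Hypothetical BED contents and sample gene list.
BED_TEXT = "chr1\t100\t200\tgene1\nchr2\t500\t900\tgene2\n"
regions = parse_bed(BED_TEXT)
sample_genes = {"gene1"}

kept = overlaps_gene_of_interest("chr1", 150, sample_genes, regions)
dropped = overlaps_gene_of_interest("chr2", 600, sample_genes, regions)
```

The second call returns `False` even though the position lies inside a BED region, because `gene2` is not among the genes of interest for that sample.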

`--params`: Path to a YAML file with filtering parameters. Default: `params.yaml`.
The `params.yaml` file contains the threshold values used during filtering. Each parameter is described below:
- `frequency`: Variants found in more than this number of samples will be excluded. Filtering is based on the combination of `CHROM`, `POS`, and `ALT`.
- `gt_AF`: Minimum variant allele frequency (genotype level). This value is a fraction, not a percentage.
- `similar`: Maximum number of samples in which a variant with the same breakend is allowed before being excluded.
  Similarity is based only on the `CHROM` and `POS` fields (not `ALT`), as structural variant notation can differ even for biologically identical events.
  This filter helps remove recurrent breakpoints that likely represent the same or a very similar underlying event.
- `minimumLength`: For variants with two breakends on the same chromosome, only structural variants equal to or larger than this value (in base pairs) will be considered.
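
To make the difference between these thresholds concrete, here is a minimal Python sketch of how they could be applied. It is an interpretation of the descriptions above, not the code in `filter_gridss.R`. The threshold values, the `AF` key, the helper names, and the use of the distance between the two `POS` values as the variant length are all assumptions.

```python
from collections import defaultdict

# Hypothetical thresholds; these are not the repository defaults.
params = {"frequency": 2, "gt_AF": 0.2, "similar": 3, "minimumLength": 50}


def count_samples(variants, key_fields):
    """Count distinct samples carrying each combination of key fields."""
    samples = defaultdict(set)
    for v in variants:
        samples[tuple(v[f] for f in key_fields)].add(v["sample"])
    return {key: len(names) for key, names in samples.items()}


def apply_filters(variants, params):
    """Keep variants that pass the frequency, similar, and gt_AF thresholds."""
    exact = count_samples(variants, ("CHROM", "POS", "ALT"))
    position = count_samples(variants, ("CHROM", "POS"))
    kept = []
    for v in variants:
        if exact[(v["CHROM"], v["POS"], v["ALT"])] > params["frequency"]:
            continue  # same CHROM, POS, and ALT in too many samples
        if position[(v["CHROM"], v["POS"])] > params["similar"]:
            continue  # same breakend position in too many samples
        if v["AF"] < params["gt_AF"]:
            continue  # allele frequency below the minimum
        kept.append(v)
    return kept


def long_enough(chrom1, pos1, chrom2, pos2, minimum_length):
    """Length check, applied only when both breakends share a chromosome."""
    if chrom1 != chrom2:
        return True
    return abs(pos2 - pos1) >= minimum_length


recurrent = {"CHROM": "chr1", "POS": 1000, "ALT": "A[chr1:5000[", "AF": 0.5}
variants = [
    {"sample": "s1", **recurrent},
    {"sample": "s2", **recurrent},
    {"sample": "s3", **recurrent},
    {"sample": "s1", "CHROM": "chr2", "POS": 300, "ALT": "G.", "AF": 0.1},
    {"sample": "s2", "CHROM": "chr3", "POS": 700, "ALT": "T.", "AF": 0.4},
]
kept = apply_filters(variants, params)
```

In this example only the `chr3` variant survives: the `chr1` variant is seen in three samples (above `frequency`), and the `chr2` variant falls below `gt_AF`.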

`--mergeVariants`: Defines a window (in base pairs) around the `POS` of single breakend variants annotated by RepeatMasker.
Variants within this window that share the same RepeatMasker annotation are counted as the same event when calculating recurrence (the `similar` threshold).
This affects only how recurrence is counted; variants are not merged in the output.
Set to 0 to disable. Default: 4.
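
One possible reading of this windowed recurrence count is sketched below in Python. The exact grouping strategy used by the script is not described here, so the pairwise distance comparison, the `repeat` key (standing in for the RepeatMasker annotation), and the `recurrence_with_window` helper are assumptions.

```python
def recurrence_with_window(breakends, window):
    """For each breakend, count distinct samples with a matching nearby breakend.

    Two breakends match when they share chromosome and repeat annotation
    and their positions differ by at most `window` base pairs.
    """
    counts = []
    for b in breakends:
        samples = {
            other["sample"]
            for other in breakends
            if other["CHROM"] == b["CHROM"]
            and other["repeat"] == b["repeat"]
            and abs(other["POS"] - b["POS"]) <= window
        }
        counts.append(len(samples))
    return counts


# Hypothetical single breakends from four samples.
breakends = [
    {"sample": "s1", "CHROM": "chr1", "POS": 1000, "repeat": "AluY"},
    {"sample": "s2", "CHROM": "chr1", "POS": 1003, "repeat": "AluY"},
    {"sample": "s3", "CHROM": "chr1", "POS": 1004, "repeat": "AluY"},
    {"sample": "s4", "CHROM": "chr1", "POS": 1003, "repeat": "L1"},
]
with_window = recurrence_with_window(breakends, window=4)
without_window = recurrence_with_window(breakends, window=0)
```

With a window of 4, the three `AluY` breakends are counted as one recurrent event seen in three samples. With a window of 0, only breakends at an identical position are grouped in this sketch, so each one is counted once.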

`--output`: Path to the output directory. Two subfolders will be created inside:
- `filtered_variants/`: contains the final variants to be visually inspected. Two Excel files will be generated:
  - `filtered_variants_one_break.xlsx`
  - `filtered_variants_two_break.xlsx`
- `plots/`: contains diagrams illustrating which variants were removed at each filtering step.

`--name`: Name to identify the output files. This string is used as a prefix for the final variant files, e.g.:
- `<name>_filtered_variants_one_break.xlsx`
- `<name>_filtered_variants_two_break.xlsx`