Python scripts to perform somatic variant calling on unpaired samples with a Panel Of Normals (PoN) using GATK4 Mutect2
St George's, University of London
- Dionysios Grigoriadis, Bioinformatician, SGUL
✉ dgrigori@sgul.ac.uk - SGUL Bioinformatics Unit
- SGUL Genetics Centre Bioinformatics
This pipeline has been ran and tested on a Linux server (centOS).
- Python3
- GATK tool suite (4.0.4.0 or above)
- htslib-1.9 or above
- java installed
-
dependencies.py: This is the configuration python script to set-up the neccessary settings, file locations and software locations. You need to change the file locations accordingly:
INPUTS:
- maindir: Full Path. This is the main directory where all the analysis directories will be found in. You need to create this directory.
- inbamsdir: Full path. maindir subdirectory. This is the directory that contains the BAM files which will be used to generate a Panel of Normals (PoN). Each BAM file should be included in a separate (and separately-named) subdirectory inside the inbamsdir. You need to create this directory and move/link the BAM files inside it (each into a separate directory).
- call_bams: Full path. maindir subdirectory. This is the directory that contains the BAM files which you want to perform the somatic calling on. Each BAM file should be included in a separate (and separately-named) subdirectory inside the call_bams. You need to create this directory and move/link the BAM files inside it (each into a separate directory).
OUTPUTS:
- ponvcf: Full path. maindir subdirectory. This is the directory where the PoN VCFs will be created.
- ponjointvcf: Full path. maindir subdirectory. This is the directory where the collated PoN VCF will be created.
- mutect2out: Full path. maindir subdirectory. This is the directory where the VCF file with the raw mutect calls will be created.
- filtmutect2out: Full path. maindir subdirectory. This is the directory where the VCF file with the filtered mutect calls will be created.
SOFTWARE: Please specify the paths to the required software.
REFERENCES: Please specify the paths to the required reference files.
-
somatic_calling.py: This is the core script of the analysis. It performs:
- Directory handling.
- Identifies the BAM files specified for PanelOfNormals (PON) and creates one VCF for each BAM file.
- Collate sites present in the previously created individual PoN VCFs into a sites-only VCF file.
- Run mutect2 on the files specified in the call_bams directory.
- Filter the mutect2 calls.
You don't need to change anything to run it
These are supplementary scripts containing definitions of functions utilised by the main somatic_calling.py script.
- utils.py