Analysis Code for the GDC-QC Somatic Mutation Section
This repository contains the scripts to generate the mutation matching and all the results in the publication.
Folder structure
figures: Figures in the publication
notebooks: R notebooks of the analysis and results
processed_data: Output generated by the pipeline
scripts: Various scripts required by the pipeline
Setup
The mutation matching (overlap) is done inside a SQLite database. Building the database requires a large amount of disk space (~100GB). Snakemake orchestrates both the database construction and the mutation matching. We recommend conda to manage the Python dependencies.
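To illustrate what matching mutation calls inside SQLite can look like, here is a minimal sketch using Python's built-in sqlite3 module on a toy in-memory database. The table names (gdc_calls, mc3_calls), the column names, and the exact-match join condition are assumptions made for illustration; the schema and matching logic used by the actual pipeline are defined in the scripts folder and may differ.

```python
import sqlite3

# NOTE: table and column names are illustrative assumptions,
# not the schema the pipeline actually creates.
SCHEMA = """
CREATE TABLE gdc_calls (sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
CREATE TABLE mc3_calls (sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
"""

# Calls are treated as shared when sample, position, and alleles all agree.
OVERLAP_QUERY = """
SELECT g.sample, g.chrom, g.pos, g.ref, g.alt
FROM gdc_calls AS g
JOIN mc3_calls AS m
  ON g.sample = m.sample
 AND g.chrom = m.chrom
 AND g.pos = m.pos
 AND g.ref = m.ref
 AND g.alt = m.alt
ORDER BY g.sample, g.chrom, g.pos
"""


def build_toy_db():
    """Create an in-memory database holding a few made-up calls."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO gdc_calls VALUES (?, ?, ?, ?, ?)",
        [("S1", "chr1", 100, "A", "T"), ("S1", "chr2", 200, "G", "C")],
    )
    conn.executemany(
        "INSERT INTO mc3_calls VALUES (?, ?, ?, ?, ?)",
        [("S1", "chr1", 100, "A", "T"), ("S2", "chr2", 200, "G", "C")],
    )
    conn.commit()
    return conn


def shared_calls(conn):
    """Return the calls present in both toy tables."""
    return conn.execute(OVERLAP_QUERY).fetchall()


if __name__ == "__main__":
    print(shared_calls(build_toy_db()))
```

In this toy data only the chr1 call of sample S1 appears in both tables; the chr2 calls belong to different samples and therefore do not match.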
Conda environment setup:
conda create -n gdc_qc python=3.6 notebook sqlalchemy pandas snakemake crossmap
The config file config.yaml contains file paths that should be modified before running the pipeline:
MC3_MAF_PTHS:
  public: Path to the public MC3 MAF
  controlled: Path to the controlled MAF
GDC_DATA_ROOT: Path to the folder containing all the GDC MAFs. The folder structure is the default one created by the official GDC Data Transfer Tool. That is, the GDC MAFs are under <GDC_DATA_ROOT>/<file UUID>/<file name>.maf.gz
CHAIN_PTH: Path to the liftover chain file (GRCh37 to GRCh38)
CROSS_MAP_BIN: Path to the CrossMap.py script
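Before starting a long run, it can help to sanity-check the configuration. The following is a minimal sketch, using only the Python standard library, that checks a config mapping for the keys listed above and lists the MAFs found under the GDC_DATA_ROOT layout. The helper names (missing_keys, find_gdc_mafs) are hypothetical and not part of the pipeline, and the sketch assumes the config has already been loaded into a plain dict.

```python
from pathlib import Path

# Top-level keys described in this README.
REQUIRED_KEYS = ("MC3_MAF_PTHS", "GDC_DATA_ROOT", "CHAIN_PTH", "CROSS_MAP_BIN")


def missing_keys(config):
    """Return the required config keys that are absent from the mapping."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    # MC3_MAF_PTHS is expected to hold 'public' and 'controlled' entries.
    mc3 = config.get("MC3_MAF_PTHS")
    if isinstance(mc3, dict):
        missing += [
            f"MC3_MAF_PTHS.{key}"
            for key in ("public", "controlled")
            if key not in mc3
        ]
    return missing


def find_gdc_mafs(gdc_data_root):
    """List files matching <root>/<file UUID>/<file name>.maf.gz, sorted."""
    return sorted(Path(gdc_data_root).glob("*/*.maf.gz"))
```

For example, calling missing_keys on the loaded config should return an empty list when every path is set, and find_gdc_mafs(config["GDC_DATA_ROOT"]) should return one path per downloaded GDC MAF.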
Build the database and generate the mutation overlap tables
The pipeline is managed by Snakemake:
snakemake -l # List all the possible rules
snakemake all # Generate both the database and overlap files
For example, to run the full pipeline:
conda activate gdc_qc
snakemake all
The pipeline will generate the following files under processed_data:
mc3.public.converted.GRCh38.maf.gz: Public MC3 MAF with genomic coordinates lifted over to GRCh38
mc3.controlled.converted.GRCh38.maf.gz: Controlled MC3 MAF with genomic coordinates lifted over to GRCh38
all_variants.sqlite: SQLite database containing all the mutation calls and overlap tables
{gdc,mc3}_recoverable_unique_variants.tsv.gz: Recoverable unique mutation calls
{gdc,mc3}_recoverable_unique_variants.filter_cols.tsv.gz: Indicator-style filters matching the rows of the recoverable unique mutation calls
{gdc,mc3}_not_recoverable_unique_variants.tsv.gz: Unrecoverable unique mutation calls
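Once the pipeline has finished, the outputs can be inspected quickly from Python. The sketch below uses only the standard library; the helper names (list_tables, peek_tsv_gz) are hypothetical, and it assumes the first line of each TSV file is a header row. No table or column names are assumed, since those are defined by the pipeline.

```python
import csv
import gzip
import sqlite3
from itertools import islice


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted."""
    # Caution: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def peek_tsv_gz(path, n_rows=5):
    """Return (header, first n_rows records) of a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # assumes the first line is the header
        rows = list(islice(reader, n_rows))
    return header, rows
```

For instance, list_tables("processed_data/all_variants.sqlite") shows which tables the database contains, and peek_tsv_gz("processed_data/gdc_recoverable_unique_variants.tsv.gz") returns the column names together with the first few records.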
Analysis and figures
Use R to run the notebooks under notebooks to generate all the results and figures. The R packages tidyverse and RSQLite are required. R Docker images such as rocker/tidyverse and lbwang/rocker-genome contain all the dependencies needed to run the analysis.
Internal information for Ding Lab
Refer to the lab wiki for details: https://confluence.ris.wustl.edu/display/DL/GDC-QC+AWG.