Somatic mutation pipeline comparison of TCGA samples between Genomic Data Commons (GDC) and MC3

Analysis Code for the GDC-QC Somatic Mutation Section

This repository contains the scripts that perform the mutation matching and generate all the results in the publication.

Folder structure

  • figures: Figures in the publication
  • notebooks: R notebooks of the analysis and results
  • processed_data: Output generated by the pipeline
  • scripts: Various scripts required by the pipeline

Setup

The mutation matching (overlap) is done inside a SQLite database. Note that building the database requires a large amount of disk space (~100 GB). The database construction and mutation matching are orchestrated by Snakemake. We recommend conda to manage the Python dependencies.
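Because the database build needs roughly 100 GB, it can be worth checking the free space on the target filesystem before starting. The following is a minimal sketch using only the Python standard library; the function names and the `REQUIRED_GB` constant are illustrative and not part of this repository.

```python
import shutil

# Approximate figure quoted in this README, not a measured requirement.
REQUIRED_GB = 100


def free_space_gb(path):
    """Return the free space (in units of 10**9 bytes) on the filesystem holding `path`."""
    usage = shutil.disk_usage(str(path))
    return usage.free / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb
```

For example, `has_enough_space("processed_data")` run from the repository root reports whether the output folder sits on a filesystem with enough room.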

Conda environment setup:

conda create -n gdc_qc python=3.6 notebook sqlalchemy pandas snakemake crossmap

The config file config.yaml contains file paths that should be modified before running the pipeline:

  • MC3_MAF_PTHS:
    • public: Path to the public MC3 MAF
    • controlled: Path to the controlled MAF
  • GDC_DATA_ROOT: Path to the folder containing all the GDC MAFs. The folder is expected to have the default structure created by the official GDC Data Transfer Tool. That is, the GDC MAFs are under <GDC_DATA_ROOT>/<file UUID>/<file name>.maf.gz
  • CHAIN_PTH: Path to the lift over chain file (GRCh37 to GRCh38)
  • CROSS_MAP_BIN: Path to the CrossMap.py script
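Before running the pipeline, it may help to confirm that GDC_DATA_ROOT really follows the `<GDC_DATA_ROOT>/<file UUID>/<file name>.maf.gz` layout described above. The sketch below is one way to do that with the standard library; `find_gdc_mafs` is a hypothetical helper, not a function provided by this repository, and the pipeline itself may discover the files differently.

```python
from pathlib import Path


def find_gdc_mafs(gdc_data_root):
    """Map each file UUID folder name to the .maf.gz files found directly inside it.

    Only files exactly one folder below `gdc_data_root` are matched, which
    mirrors the <GDC_DATA_ROOT>/<file UUID>/<file name>.maf.gz layout.
    """
    root = Path(gdc_data_root)
    mafs = {}
    for maf_path in sorted(root.glob("*/*.maf.gz")):
        mafs.setdefault(maf_path.parent.name, []).append(maf_path)
    return mafs
```

An empty result from `find_gdc_mafs("/path/to/gdc_data")` suggests that the path in config.yaml points at the wrong folder or that the MAFs were downloaded with a different layout.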

Build the database and generate the mutation overlap tables

The pipeline is managed by Snakemake:

snakemake -l            # List all the possible rules
snakemake all           # Generate both the database and overlap files

For example, to run the full pipeline:

conda activate gdc_qc
snakemake all

The pipeline will generate the following files under processed_data:

  • mc3.public.converted.GRCh38.maf.gz: Public MC3 MAF with genomic coordinates lifted over to GRCh38
  • mc3.controlled.converted.GRCh38.maf.gz: Controlled MC3 MAF with genomic coordinates lifted over to GRCh38
  • all_variants.sqlite: SQLite database containing all the mutation calls and overlap tables
  • {gdc,mc3}_recoverable_unique_variants.tsv.gz: Recoverable unique mutation calls
  • {gdc,mc3}_recoverable_unique_variants.filter_cols.tsv.gz: Indicator-style filters matching the rows of the recoverable unique mutation calls
  • {gdc,mc3}_not_recoverable_unique_variants.tsv.gz: Unrecoverable unique mutation calls
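As a quick check of the generated database, the tables in all_variants.sqlite can be listed without knowing their names in advance. The sketch below uses Python's built-in `sqlite3` module and queries SQLite's own `sqlite_master` catalog; `list_tables` is an illustrative helper and is not part of the pipeline. The table names themselves are not documented here, so none are assumed.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at `db_path`.

    The database is opened read-only through a file: URI, so a mistyped path
    raises an error instead of silently creating an empty database.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("processed_data/all_variants.sqlite")` from the repository root, after the pipeline has finished, should show the mutation call and overlap tables.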

Analysis and figures

Use R to run the notebooks under notebooks to generate all the results and figures. The R packages tidyverse and RSQLite are required. R Docker images such as rocker/tidyverse and lbwang/rocker-genome contain all the dependencies to run the analysis.

Internal information for Ding Lab

Refer to the lab wiki for details: https://confluence.ris.wustl.edu/display/DL/GDC-QC+AWG.
