GDC Dicer

Gordon Saksena edited this page Jun 14, 2017 · 15 revisions

gdc_dice

gdc_dice transforms the raw data files from a gdc_mirror into an analysis-ready form that can be fed directly to analysis code without further preparation or interpretation, which can save substantial time. During this process, the dicer performs several preprocessing steps: splitting multi-sample files (such as MAFs) into one file per sample, making file formats consistent within each data type, and maintaining persistent metadata so that files which have already been diced locally need not be diced again. One practical benefit is that new superset or subset cohorts can be formed very easily, often with a single line in a GDCtools config file such as

[aggregates]
TCGA-COADREAD: TCGA-COAD,TCGA-READ 

which builds a colorectal aggregate cohort from the individual TCGA colon and rectal cohorts. Another practical benefit of gdc_dice is that it can restore project-specific IDs to the data. For example, in a file of segmented TCGA SNP6 data at the GDC, the Sample identifiers for each segment (the first column)

BUBBY_p_TCGA_b89_105_SNP_N_GenomeWideSNP_6_B10_777548	1	3301765	49960870	25127	-0.0257

have essentially no meaning outside the center which originally generated the data. After dicing, however, these segments are now identifiable by their TCGA project case ID

TCGA-CM-6171-01A-11D-1649-01                            1       3301765 49960870        25127   -0.0257

which is much more useful for tracking and analysis (especially associations between data types and/or clinical phenotypes).
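The ID restoration described above can be sketched as a first-column substitution. This is only an illustration, not the GDCtools implementation: the function name restore_sample_ids and the id_map lookup table are hypothetical, and in practice the mapping would come from GDC metadata.

```python
import csv
import io

def restore_sample_ids(seg_text, id_map):
    """Rewrite the first column of tab-separated segment rows using id_map.

    id_map is a hypothetical dict of center-specific name -> TCGA ID.
    Rows whose identifier is not in the map are passed through unchanged.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(seg_text), delimiter="\t"):
        if row:  # skip blank lines
            row[0] = id_map.get(row[0], row[0])
            writer.writerow(row)
    return out.getvalue()

# The segment row and IDs shown above, used as sample input.
raw = ("BUBBY_p_TCGA_b89_105_SNP_N_GenomeWideSNP_6_B10_777548"
       "\t1\t3301765\t49960870\t25127\t-0.0257\n")
id_map = {
    "BUBBY_p_TCGA_b89_105_SNP_N_GenomeWideSNP_6_B10_777548":
        "TCGA-CM-6171-01A-11D-1649-01",
}
diced = restore_sample_ids(raw, id_map)
```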

Usage

gdc_dice [-c CONFIG [CONFIG ...]] [OPTIONS] [datestamp]

options:
  datestamp             Use GDC data for a specific date. If omitted, the
                        latest available data will be used.

  -h, --help            show this help message and exit
  --verbose             set verbosity level [None]
  --version             show program's version number and exit
  -l LOG_DIR, --log-dir LOG_DIR
                        Folder to store logfiles
  -c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]
                        One or more configuration files
  -g program [program ...], --programs program [program ...]
                        Process data ONLY from these GDC programs
  -p project [project ...], --projects project [project ...]
                        Process data ONLY from these GDC projects
  --cases case_id [case_id ...]
                        Process data ONLY from these GDC cases
  -m MIRROR_DIR, --mirror-dir MIRROR_DIR
                        Root folder of mirrored GDC data
  -d DICE_DIR, --dice-dir DICE_DIR
                        Root of diced data tree
  --dry-run             Show expected operations, but don't perform dicing
  -f, --force-dice      Skip detection of already diced files, and redice
                        everything

Many of the configuration options are shared with the mirror, and using the same configuration file for both is highly recommended.

For more details on the config file format, see this wiki page.
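As a rough sketch of how a section such as the [aggregates] example earlier on this page might be read, the snippet below uses Python's standard configparser module. Whether GDCtools parses its config this way is an assumption, and read_aggregates is a hypothetical helper name.

```python
import configparser

CONFIG_TEXT = """
[aggregates]
TCGA-COADREAD: TCGA-COAD,TCGA-READ
"""

def read_aggregates(config_text):
    """Return {aggregate cohort name: [member project names]}."""
    parser = configparser.ConfigParser()
    # configparser lowercases option names by default; keep them as written.
    parser.optionxform = str
    parser.read_string(config_text)
    if not parser.has_section("aggregates"):
        return {}
    return {
        name: [p.strip() for p in members.split(",") if p.strip()]
        for name, members in parser.items("aggregates")
    }

aggregates = read_aggregates(CONFIG_TEXT)
```

Each key is the new aggregate cohort and each value lists the projects it combines.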

Dicer directory structure

The dicer's directory structure is considerably flatter than the mirror's, organizing diced data into folders named for the annotation assigned to each data file. These names carry semantic meaning relevant to analysis, rather than reflecting the attributes assigned by the GDC. The resulting layout is:

dice/
└── program
    ├── project
    │   ├── annot
    │   │   ├── samp_id.uuid.txt
    │   │   └── samp_id2.uuid2.txt
    │   ├── annot2
    │   │   └── samp_id.uuid.txt
    │   └── metadata
    │       ├── diced_metadata.tsv
    │       ├── heatmap.png
    │       └── sample_counts.tsv
    └── project2
        ├── annot
        ├── annot2
        └── metadata

Note that in addition to the diced metadata, a heatmap and a sample counts table are created, showing how many samples are available for each data type. Each diced file name also includes the TCGA ID and the UUID of the file it was diced from.
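Because the layout is a fixed program/project/annot/file hierarchy, it can be walked with a simple glob. The sketch below counts diced files per annotation folder. It is an illustration only: count_diced_files is a hypothetical helper, it counts files rather than distinct samples, and it is not how the dicer produces sample_counts.tsv.

```python
from collections import Counter
from pathlib import Path
import tempfile

def count_diced_files(dice_root):
    """Count files per (program, project, annot), skipping metadata folders."""
    counts = Counter()
    root = Path(dice_root)
    # Exactly four levels below the root: program/project/annot/file
    for path in root.glob("*/*/*/*"):
        program, project, annot = path.relative_to(root).parts[:3]
        if path.is_file() and annot != "metadata":
            counts[(program, project, annot)] += 1
    return counts

# Build a throwaway tree mirroring the schematic layout above.
with tempfile.TemporaryDirectory() as tmp:
    for rel in (
        "program/project/annot/samp_id.uuid.txt",
        "program/project/annot/samp_id2.uuid2.txt",
        "program/project/annot2/samp_id.uuid.txt",
        "program/project/metadata/diced_metadata.tsv",
    ):
        f = Path(tmp, rel)
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    counts = count_diced_files(tmp)
```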

Example: TCGA smoketest

Here is an abbreviated version of the diced directory structure produced by the test_dice target:

dice
└── TCGA
    ├── TCGA-ACC
    │   ├── CNV__snp6
    │   │   ├── TCGA-OR-A5K2-01A-11D-A29H-01.dacf172e-89eb-4afa-9613-f683a558b088.txt
    │   │   ├── TCGA-OR-A5K2-10B-01D-A29K-01.e5012a07-cbb0-4bb7-903c-5b706f2ea874.txt
    │   │   ├── TCGA-OR-A5L1-01A-11D-A309-01.05a4633f-f012-43c0-90fd-268ad47f85b0.txt
    │   │   └── TCGA-OR-A5L1-10A-01D-A309-01.818e8b0a-04a7-42e5-929c-b9d11c64d1a9.txt
    │   ├── CNV__unfiltered__snp6
    │   │   ├── TCGA-OR-A5K2-01A-11D-A29H-01.b0ef98ef-b83b-4f51-9099-a9602f3c7e32.txt
    │   │   ├── TCGA-OR-A5K2-10B-01D-A29K-01.a774d500-8c68-4071-a3b1-328f346b417c.txt
    │   │   ├── TCGA-OR-A5L1-01A-11D-A309-01.477185ad-aafa-4da0-be72-780bd66fb6cd.txt
    │   │   └── TCGA-OR-A5L1-10A-01D-A309-01.7c263070-5b38-4d8c-9a31-899e1ba9d91d.txt
    │   ├── SNV__mutect
    │   │   ├── TCGA-OR-A5J1-01A-11D-A29I-10.ff872fc4-bd1c-4975-85c8-3655ccd199a2.maf.txt
    │   │   ├── TCGA-OR-A5J2-01A-11D-A29I-10.ff872fc4-bd1c-4975-85c8-3655ccd199a2.maf.txt
    ...
    └── TCGA-SKCM
        ├── CNV__snp6
        ...
        ├── CNV__unfiltered__snp6
        ...
        ├── SNV__mutect
        ...
        ├── clinical__biospecimen
        │   ├── TCGA-D3-A3C7.192bbf1c-4f69-463c-b9ec-8d7827c8312a.txt
        │   └── TCGA-EE-A3J8.b6b16c1b-e0c3-4e43-83dd-3d450ed53c33.txt
        ├── clinical__primary
        │   ├── TCGA-D3-A3C7.51c48680-6b80-4722-a06d-bdcc6e84c087.txt
        │   └── TCGA-EE-A3J8.151de1d5-6dc4-437e-be76-a076101939f5.txt
        ├── mRNA__counts__FPKM
        │   ├── TCGA-D3-A3C7-06A-11R-A18U-07.778eed3d-02ad-43c3-8308-0c4823922c39.txt
        │   └── TCGA-EE-A3J8-06A-11R-A20F-07.9c488510-a229-45e9-b708-0b666c257dbc.txt
        ├── mRNA__geneExpNormed__FPKM
        │   ├── TCGA-D3-A3C7-06A-11R-A18U-07.c7992525-cdee-49fe-9b24-78db03a6c58d.txt
        │   └── TCGA-EE-A3J8-06A-11R-A20F-07.abc1a5a6-cd0b-450c-a650-dfe14fdb356b.txt
        ├── mRNA__geneExp__FPKM
        │   ├── TCGA-D3-A3C7-06A-11R-A18U-07.28cec425-f067-4008-9aa0-1a5cd689ff4f.txt
        │   └── TCGA-EE-A3J8-06A-11R-A20F-07.662e5a74-9217-413e-8321-07951d802b8a.txt
        ├── metadata
        │   └── 2017_03_07
        │       ├── TCGA-SKCM.2017_03_07.diced_metadata.tsv
        │       ├── TCGA-SKCM.2017_03_07.high_res.heatmap.png
        │       ├── TCGA-SKCM.2017_03_07.low_res.heatmap.png
        │       └── TCGA-SKCM.2017_03_07.sample_counts.tsv
        ├── methylation__HM450
        │   ├── TCGA-D3-A3C7-06A-11D-A19B-05.2db5c7cc-25f8-4d93-991f-173b8704cb14.data.txt
        │   └── TCGA-EE-A3J8-06A-11D-A211-05.342836f4-b506-4bf8-a1f6-949ce9cb17dc.data.txt
        ├── miR__geneExp
        │   ├── TCGA-D3-A3C7-06A-11R-A18X-13.2bb82c6a-0573-4424-b5f9-261b1383ae76.txt
        │   └── TCGA-EE-A3J8-06A-11R-A20E-13.8840f36a-0c23-4fb1-9810-9a243c78cf9d.txt
        └── miR__isoformExp
            ├── TCGA-D3-A3C7-06A-11R-A18X-13.2ed535ab-cc10-4947-aa63-aa707d995e53.txt
            └── TCGA-EE-A3J8-06A-11R-A20E-13.1f884b90-738e-4c86-8574-d9372d0bd070.txt
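The file names in this listing can be split back into their parts with a regular expression. The pattern below is inferred from the names shown here (an ID, then a UUID, then one or more extensions) and is an assumption rather than a documented naming contract; parse_diced_name is a hypothetical helper.

```python
import re

# Assumed pattern: <TCGA ID>.<uuid of source file>.<extension(s)>
DICED_NAME = re.compile(
    r"^(?P<entity_id>[^.]+)\."
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\."
    r"(?P<ext>.+)$"
)

def parse_diced_name(filename):
    """Return (entity_id, uuid, ext) for a diced file name."""
    match = DICED_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognized diced file name: {filename!r}")
    return match.group("entity_id"), match.group("uuid"), match.group("ext")

clinical = parse_diced_name(
    "TCGA-D3-A3C7.192bbf1c-4f69-463c-b9ec-8d7827c8312a.txt")
methylation = parse_diced_name(
    "TCGA-D3-A3C7-06A-11D-A19B-05.2db5c7cc-25f8-4d93-991f-173b8704cb14.data.txt")
```

Clinical files are keyed by a shorter case-level ID, while the other data types use a longer sample-level ID, so the first group is deliberately permissive.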

Normalization methods

The dicer emits files in one of two TSV formats: (1) MAGE-Tab data matrix format and (2) a row-oriented TSV format. The actual data values are not altered. Normalizing the format lets a common set of analysis tools work across data types without special-case code, and without the risk of a complex algorithm crashing on unexpected formatting.

The MAGE-Tab data matrix format is used in cases where there is a uniform number of data values for each sample, as is the case for array data. The file contains one or more header columns, followed by one or more data columns for the sample. The file has two header rows, the first containing sample names and the second giving the data type. When the sample files are later merged, there is only one set of header columns in the file, followed by the data columns for each sample; the samples are distinguished by the sample names given in the first row. While this format is less general than the row-oriented TSV format, it is more compact for data with a uniform number of data points for each sample.

Example of MAGE-Tab data matrix format:

Hybridization REF                               TCGA-01-2345-678-901-2345-67
Composite Element REF   Chromosome   Position   myDataType
probe123                1            99999      3.45
probe456                2            88888      2.34
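The merge behavior described for this format (one shared set of header columns, then one data column per sample) can be sketched as a column-wise join. This is a minimal illustration under stated assumptions: rows are already split into cells, the first header row leaves the extra header cells blank, and merge_data_matrices plus the SAMPLE-A/SAMPLE-B names are hypothetical.

```python
def merge_data_matrices(matrices, n_header_cols):
    """Merge per-sample data matrices (lists of rows) column-wise.

    All matrices must have the same rows in the same order and identical
    header columns (the first n_header_cols cells of every row).
    """
    merged = [list(row) for row in matrices[0]]
    for matrix in matrices[1:]:
        if len(matrix) != len(merged):
            raise ValueError("matrices have different numbers of rows")
        for out_row, row in zip(merged, matrix):
            if row[:n_header_cols] != out_row[:n_header_cols]:
                raise ValueError("header columns differ between matrices")
            # Append only this sample's data columns.
            out_row.extend(row[n_header_cols:])
    return merged

header = ["Composite Element REF", "Chromosome", "Position", "myDataType"]
sample_a = [
    ["Hybridization REF", "", "", "SAMPLE-A"],
    header,
    ["probe123", "1", "99999", "3.45"],
    ["probe456", "2", "88888", "2.34"],
]
sample_b = [
    ["Hybridization REF", "", "", "SAMPLE-B"],
    header,
    ["probe123", "1", "99999", "1.23"],
    ["probe456", "2", "88888", "0.12"],
]
merged = merge_data_matrices([sample_a, sample_b], n_header_cols=3)
```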

The row-oriented TSV format is used where there is a variable number of data values for each sample, such as copy-number seg files or mRNA isoform splice sites. There is one row per data point, which includes the sample name in addition to other metadata and data. The first row is a header row, giving the field names. When the sample files are later merged, there is only one header row followed by rows from all of the data files; the samples are distinguished by the sample name field on each line. This format is more general than the MAGE-Tab data matrix format, and more compact when there are wide variations in the number of data points per sample; this format is similar to the schema used by BigQuery.

Example of row-oriented TSV format:

Samplename                     Probe      Chromosome   Position   myDataType
TCGA-01-2345-678-901-2345-67   probe123   1            99999      3.45
TCGA-01-2345-678-901-2345-67   probe456   2            88888      2.34
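Merging row-oriented files, as described above, amounts to concatenating the data rows under a single header row. The sketch below assumes every input shares an identical header; merge_row_oriented and the SAMPLE-A/SAMPLE-B names are hypothetical and not part of GDCtools.

```python
def merge_row_oriented(file_texts):
    """Concatenate row-oriented TSV texts, keeping only one header row."""
    merged = []
    header = None
    for text in file_texts:
        lines = text.splitlines()
        if not lines:
            continue  # ignore empty inputs
        if header is None:
            header = lines[0]
            merged.append(header)
        elif lines[0] != header:
            raise ValueError("header rows differ between files")
        merged.extend(line for line in lines[1:] if line)
    return "\n".join(merged) + "\n" if merged else ""

HEADER = "Samplename\tProbe\tChromosome\tPosition\tmyDataType\n"
file_a = HEADER + "SAMPLE-A\tprobe123\t1\t99999\t3.45\n"
file_b = HEADER + "SAMPLE-B\tprobe456\t2\t88888\t2.34\n"
merged_rows = merge_row_oriented([file_a, file_b])
```

Because each row carries its own sample name, no column alignment between samples is needed, which is what makes this format suitable for variable-length data.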

Methylation

The HM27 and HM450 methylation array data are converted to MAGE-Tab data matrix format. A header row is added to specify the sample names. In addition, the one column that contains the actual data is moved to be the final column.
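The two steps above (add a sample-name header row, move the data column to the end) can be sketched as follows. The column name Beta_value, the input column order, the blank cells in the added header row, and the helper name to_data_matrix are all assumptions made for illustration; the values are made up.

```python
def to_data_matrix(rows, sample_name, data_col):
    """Move data_col to the last position and prepend a sample-name row.

    rows[0] is the existing column-name row; the remaining rows are data.
    """
    names = rows[0]
    if len(names) < 2:
        raise ValueError("need at least one header column plus the data column")
    idx = names.index(data_col)
    # Same reordering for the column-name row and every data row.
    reordered = [r[:idx] + r[idx + 1:] + [r[idx]] for r in rows]
    n_header_cols = len(names) - 1
    top = ["Hybridization REF"] + [""] * (n_header_cols - 1) + [sample_name]
    return [top] + reordered

raw_rows = [
    ["Composite Element REF", "Beta_value", "Chromosome", "Position"],
    ["probe123", "3.45", "1", "99999"],
    ["probe456", "2.34", "2", "88888"],
]
matrix = to_data_matrix(raw_rows, "SAMPLE-A", "Beta_value")
```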
