Data file descriptions

This document describes all data files associated with this project. Each file has the following associated information:

  • File type will be one of:
    • Reference file: Obtained from an external source/database. When known, the obtained data and a link to the external source are included.
    • Modified reference file: Obtained from an external source/database but modified for OpenPBTA use.
    • Processed data file: Data that are processed upstream of the analysis project, e.g., the output of a somatic single nucleotide variant method. Links to the relevant D3B Center or Kids First workflow (and version where applicable) are included in Origin.
    • Analysis file: Any file created by a script in analyses/*.
  • Origin
    • For Processed data files, a link to the relevant D3B Center or Kids First workflow (and version where applicable).
    • When applicable, a link to the specific script that produced (or modified, for Modified reference file types) the data.
  • File description
    • A brief one-sentence description of what the file contains (e.g., bed files contain coordinates for features XYZ).
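
The three pieces of information listed above can be pictured as one record per file. The following is only a minimal sketch: `DataFileEntry`, `DOCUMENTED_FILE_TYPES`, and `uses_documented_type` are hypothetical names introduced here for illustration and are not part of the project. The release table also uses type labels beyond the four defined above (for example "Reference mapping file"), so the check below reports whether a label is one of the documented four and does not treat other labels as errors.

```python
from dataclasses import dataclass

# The four file types defined in the list above. This set is an
# illustrative assumption, not an exhaustive vocabulary: the release
# table also contains labels such as "Reference mapping file".
DOCUMENTED_FILE_TYPES = {
    "Reference file",
    "Modified reference file",
    "Processed data file",
    "Analysis file",
}


@dataclass(frozen=True)
class DataFileEntry:
    """One row of the data file table (hypothetical helper type)."""

    file_name: str
    file_type: str
    origin: str
    description: str

    def uses_documented_type(self) -> bool:
        """Return True when file_type is one of the four documented types."""
        return self.file_type in DOCUMENTED_FILE_TYPES


# Example values copied from one row of the release table.
entry = DataFileEntry(
    file_name="fusion-putative-oncogenic.tsv",
    file_type="Analysis file",
    origin="fusion_filtering",
    description="Filtered and prioritized fusions",
)
print(entry.file_name, entry.uses_documented_type())
```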

current release (v14)

| File name | File type | Origin | File description |
|---|---|---|---|
| histologies-base.tsv | Data file | Cohort-specific data files and databases | Clinical and sequencing metadata for each biospecimen |
| histologies.tsv | Modified data file | molecular-subtyping-integrate | histologies-base.tsv plus molecular_subtype, cancer_group, integrated_diagnosis, and harmonized_diagnosis |
| intersect_cds_lancet_strelka_mutect_WGS.bed | Analysis file | snv-callers | Intersection of gencode.v39.primary_assembly.annotation.gtf.gz CDS with Lancet, Strelka2, and Mutect2 regions |
| intersect_strelka_mutect_WGS.bed | Analysis file | snv-callers | Intersection of gencode.v39.primary_assembly.annotation.gtf.gz CDS with Strelka2 and Mutect2 regions |
| efo-mondo-map.tsv | Reference mapping file | Manual collation | Mapping of EFO and MONDO codes to cancer groups |
| efo-mondo-map-prefill.tsv | Modified reference mapping file | Analysis file generated in molecular-subtyping-integrate | Mapping of EFO and MONDO codes to cancer groups |
| ensg-hugo-pmtl-mapping.tsv | Reference mapping file | Manual curation of PMTLv1.1 by FNL; RNA-Seq pipeline GTF mapping | File that maps Hugo symbols to Ensembl gene IDs and each ENSG to the RMTL curated by FNL |
| *.bed | Reference file | Manual collation | BED files used for variant calling and for TMB calculation |
| uberon-map-gtex-group.tsv | Reference mapping file | Manual collation | Mapping of UBERON codes to tissue types in GTEx broad groups |
| uberon-map-gtex-subgroup.tsv | Reference mapping file | Manual collation | Mapping of UBERON codes to tissue types in GTEx subgroups |
| methyl-beta-values.rds | Processed data file | methylation beta values | Methylation beta values |
| methyl-m-values.rds | Processed data file | methylation m values | Methylation m values |
| rna-isoform-expression-rsem-tpm.rds | Processed data file | RNA isoform TPM files | RNA isoform TPM files |
| fusion-dgd.tsv | Processed data file | DGD merged fusion results | DGD merged fusion results |
| fusion-arriba.tsv.gz | Processed data file | Gene fusion detection; Workflow | Fusion - Arriba TSV, annotated with FusionAnnotator |
| fusion-starfusion.tsv.gz | Processed data file | Gene fusion detection; Workflow | Fusion - STARFusion TSV |
| fusion-annoFuse.tsv.gz | Processed data file | AnnoFuse QC filtered fusion file; Workflow | Filter out normal and non-expressed fusions |
| fusion_summary_embryonal_foi.tsv | Analysis file | fusion-summary | Summary file for presence of embryonal tumor fusions of interest |
| fusion_summary_ependymoma_foi.tsv | Analysis file | fusion-summary | Summary file for presence of ependymal tumor fusions of interest |
| fusion_summary_ewings_foi.tsv | Analysis file | fusion-summary | Summary file for presence of Ewing's sarcoma fusions of interest |
| fusion_summary_lgg_hgg_foi.tsv | Analysis file | fusion-summary | Summary file for presence of LGG and HGG fusions of interest |
| fusion-putative-oncogenic.tsv | Analysis file | fusion_filtering | Filtered and prioritized fusions |
| gene-counts-rsem-expected_count-collapsed.rds | Analysis file | PBTA+GMKF+TARGET collapse-rnaseq | Gene expression - RSEM expected_count for each sample, collapsed to gene symbol (gene-level) |
| gene-expression-rsem-tpm-collapsed.rds | Analysis file | PBTA+GMKF+TARGET collapse-rnaseq | Gene expression - RSEM TPM for each sample, collapsed to gene symbol (gene-level) |
| tcga_gene-counts-rsem-expected_count-collapsed.rds | Modified reference file | TCGA samples lifted from GENCODE v27 to v39 | Gene expression - RSEM counts for each sample, collapsed to gene symbol (gene-level) |
| tcga_gene-expression-rsem-tpm-collapsed.rds | Modified reference file | TCGA samples lifted from GENCODE v27 to v39 | Gene expression - RSEM TPM for each sample, collapsed to gene symbol (gene-level) |
| gtex_gene-expression-rsem-tpm-collapsed.rds | Modified reference file | GTEX v8 release lifted to GENCODE v39 | Gene expression - RSEM TPM for each sample, collapsed to gene symbol (gene-level) |
| gtex_gene-counts-rsem-expected_count-collapsed.rds | Modified reference file | GTEX v8 release lifted to GENCODE v39 | Gene expression - RSEM counts for each sample, collapsed to gene symbol (gene-level) |
| WGS.hg38.lancet.300bp_padded.bed | Reference Target/Baits File | SNV and INDEL calling | WGS.hg38.lancet.unpadded.bed file with each region padded by 300 bp |
| WGS.hg38.lancet.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 WGS regions created using UTR, exome, and start/stop codon features of the GENCODE 31 reference, augmented with PASS variant calls from Strelka2 and Mutect2 |
| WGS.hg38.mutect2.vardict.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 Broad Institute interval calling list (restricted to Chr1-22, X, Y, M and non-N regions) used for the Mutect2 and VarDict variant callers |
| WGS.hg38.strelka2.unpadded.bed | Reference Regions File | SNV and INDEL calling | hg38 Broad Institute interval calling list (restricted to Chr1-22, X, Y, M) used for the Strelka2 variant caller |
| WGS.hg38.vardict.100bp_padded.bed | Reference Regions File | SNV and INDEL calling | WGS.hg38.mutect2.vardict.unpadded.bed with each region padded by 100 bp, used for the VarDict variant caller |
| snv-consensus-plus-hotspots.maf.tsv.gz | Analysis file | Kids First somatic workflow consensus calls | Consensus (2 of 4) MAF + 1/4 hotspots |
| snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz | Analysis file | Kids First Tumor Only workflow | Mutect2 tumor only with additional filters to remove t_alt_count < 5 |
| cnv-cnvkit.seg.gz | Processed data file | Copy number variant calling; Workflow | Somatic Copy Number Variant - CNVkit SEG file |
| cnv-consensus.seg.gz | Analysis file | [copy_number_consensus_call](https://github.com/d3b-center/OpenPedCan-analysis/tree/dev/analyses/copy_number_consensus_call) | Somatic Copy Number Variant - WGS samples only |
| cnvkit_with_status.tsv<br>consensus_seg_with_status.tsv | Analysis files | copy_number_consensus_call | CNVkit calls for WXS or CNV consensus calls for WGS with gain/loss status |
| cnv-consensus-gistic.gz | Analysis file | run-gistic | GISTIC results - WGS samples only |
| cnv-controlfreec.tsv.gz | Processed data file | Copy number variant calling; Workflow | Somatic Copy Number Variant - TSV file that is a merge of ControlFreeC *_CNVs files |
| cnv-controlfreec-tumor-only.tsv.gz | Processed data file | Copy number variant calling Workflow - tumor only | Somatic Copy Number Variant - TSV file that is a merge of ControlFreeC *_CNVs files |
| cnv-gatk.seg.gz | Processed data file | Copy number variant calling | Somatic Copy Number Variant - TSV SEG file produced by GATK CNV |
| consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only.tsv.gz | Analysis file | focal-cn-file-preparation | TSV file containing genes with copy number changes per biospecimen; all chromosomes |
| consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only_x_and_y.tsv.gz | Analysis file | focal-cn-file-preparation | TSV file containing genes with copy number changes per biospecimen; sex chromosomes only |
| consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only_autosomes.tsv.gz | Analysis file | focal-cn-file-preparation | TSV file containing genes with copy number changes per biospecimen; autosomes only |
| snv-mutation-tmb-all.tsv | Processed data file | tmb-calculation | TSV file with sample names and their tumor mutation burden, counting all variants |
| snv-mutation-tmb-coding.tsv | Processed data file | tmb-calculation | TSV file with sample names and their tumor mutation burden, counting variants in coding regions only |
| sv-manta.tsv.gz | Processed data file | Structural variant calling; Workflow | Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only) |
| splice-events-rmats.tsv.gz | Processed data file | Kids First splice variant workflow; Workflow | rMATS single sample workflow |
| cptac-protein-imputed-phospho-expression-log2-ratio.tsv.gz | Processed data file | CPTAC pediatric brain tumor phospho-proteomics expression | Imputed phospho-protein expression, log2 TMT ratio |
| cptac-protein-imputed-prot-expression-abundance.tsv.gz | Processed data file | CPTAC pediatric brain tumor protein expression | Imputed whole cell protein expression, total abundance |
| cptac-protein-imputed-prot-expression-log2-ratio.tsv.gz | Processed data file | CPTAC pediatric brain tumor protein expression | Imputed whole cell protein expression, log2 TMT ratio |
| gbm-protein-imputed-phospho-expression-abundance.tsv.gz | Processed data file | CPTAC adult GBM brain tumor phospho-proteomics expression | Imputed phospho-protein expression, total abundance |
| gbm-protein-imputed-prot-expression-abundance.tsv.gz | Processed data file | CPTAC adult GBM brain tumor protein expression | Imputed whole cell protein expression, total abundance |
| hope-protein-imputed-phospho-expression-abundance.tsv.gz | Processed data file | Adolescent and Young Adult (AYA) brain tumor phospho-proteomics expression (Project HOPE) | Imputed phospho-protein expression, total abundance |
| hope-protein-imputed-prot-expression-abundance.tsv.gz | Processed data file | Adolescent and Young Adult (AYA) brain tumor protein expression (Project HOPE) | Imputed whole cell protein expression, total abundance |
| rna-dna-qc-stats.tsv | Reference QC file | Quality control metrics for WGS, WXS, DNA panel, and RNA-Seq samples | Used to filter samples for data release |
| mirna-expression-counts.rds | Processed data file | miRNA expression counts | Generated from HTG-Seq |
| independent-specimens.methyl.primary.tsv<br>independent-specimens.methyl.relapse.tsv<br>independent-specimens.rnaseq.primary.eachcohort.tsv<br>independent-specimens.rnaseq.primary.tsv<br>independent-specimens.rnaseq.relapse-pre-release.tsv<br>independent-specimens.rnaseq.relapse.eachcohort.tsv<br>independent-specimens.rnaseq.relapse.tsv<br>independent-specimens.rnaseq.primary-plus-pre-release.tsv<br>independent-specimens.rnaseqpanel.primary-plus.pre-release.tsv<br>independent-specimens.rnaseqpanel.primary-plus.tsv<br>independent-specimens.rnaseqpanel.primary.eachcohort.tsv<br>independent-specimens.rnaseqpanel.primary.tsv<br>independent-specimens.rnaseqpanel.relapse.eachcohort.tsv<br>independent-specimens.rnaseqpanel.relapse.tsv<br>independent-specimens.wgs.primary-plus.eachcohort.tsv<br>independent-specimens.wgs.primary-plus.tsv<br>independent-specimens.wgs.primary.eachcohort.tsv<br>independent-specimens.wgs.primary.tsv<br>independent-specimens.wgs.relapse.eachcohort.tsv<br>independent-specimens.wgs.relapse.tsv<br>independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.primary-plus.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.primary-plus.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.primary.eachcohort.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.primary.eachcohort.tsv<br>independent-specimens.wgswxspanel.primary.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.primary.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.primary.tsv<br>independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.relapse.eachcohort.tsv<br>independent-specimens.wgswxspanel.relapse.prefer.wgs.tsv<br>independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv<br>independent-specimens.wgswxspanel.relapse.tsv<br>independent-specimens.methyl.primary-plus.eachcohort.tsv<br>independent-specimens.methyl.primary.eachcohort.tsv<br>independent-specimens.methyl.relapse.eachcohort.tsv | Analysis files | independent-samples | Independent (non-redundant) lists of DNA, RNA, or methylation samples across all sequencing methods, from primary, primary-plus, or relapse tumors, within each cohort or across all cohorts |
| independent-specimens.rnaseq.primary-plus-pre-release.tsv<br>independent-specimens.rnaseq.primary-pre-release.tsv<br>independent-specimens.rnaseq.relapse-pre-release.tsv | Analysis files | independent-samples | Independent (non-redundant) lists of RNA samples across all sequencing methods, from primary, primary-plus, or relapse tumors across all cohorts, used to run fusion_filtering before a release |
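
As a usage sketch for the independent-specimens lists, the snippet below keeps only the rows of a histologies-style table whose biospecimen appears in an independent list, so that each participant contributes one sample. It is an illustration under stated assumptions. The column names `Kids_First_Biospecimen_ID` and `Kids_First_Participant_ID` are assumed and should be checked against the actual file headers. The helper names `read_independent_ids` and `keep_independent_rows` are hypothetical. The in-memory TSV content is made-up toy data, not project data.

```python
import csv
import io

# Assumed identifier column; verify against the real file header.
ID_COLUMN = "Kids_First_Biospecimen_ID"


def read_independent_ids(handle, id_column=ID_COLUMN):
    """Collect the biospecimen IDs listed in an independent-specimens TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column] for row in reader}


def keep_independent_rows(handle, keep_ids, id_column=ID_COLUMN):
    """Return the rows of a TSV whose ID is in keep_ids, in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[id_column] in keep_ids]


# Toy stand-ins for an independent-specimens list and for histologies.tsv.
independent_tsv = (
    "Kids_First_Participant_ID\tKids_First_Biospecimen_ID\n"
    "PT_1\tBS_A\n"
    "PT_2\tBS_C\n"
)
histologies_tsv = (
    "Kids_First_Biospecimen_ID\tcancer_group\n"
    "BS_A\tToy group 1\n"
    "BS_B\tToy group 1\n"
    "BS_C\tToy group 2\n"
)

keep_ids = read_independent_ids(io.StringIO(independent_tsv))
kept_rows = keep_independent_rows(io.StringIO(histologies_tsv), keep_ids)
for row in kept_rows:
    print(row[ID_COLUMN], row["cancer_group"])
```

With real files, the `io.StringIO` objects would be replaced by file handles opened with `open(path, newline="")`.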