Output file descriptions

Anthony Soltis edited this page Oct 29, 2019 · 12 revisions

MutEnricher Output File Descriptions

Last updated: October 29, 2019

Coding analysis output files

1. [prefix]_gene_enrichments.txt

This text file contains the overall gene enrichment results determined by MutEnricher.

Columns:

  1. Gene: Gene name from GTF.
  2. coordinates: Genomic coordinates of gene, from first to last annotated exon.
  3. num_nonsilent: Total non-silent mutations in gene across samples.
  4. num_bg: Total silent mutations identified within gene coordinates in samples.
  5. full_length: Total gene length in basepairs (corresponding to (2)).
  6. coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
  7. bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
  8. bg_prob: Gene background mutation rate used in negative binomial tests.
  9. gene_pval: Raw p-value of negative binomial test for gene.
  10. FDR_BH: Benjamini-Hochberg FDR-corrected p-value for gene.
  11. num_samples: Number of samples possessing a non-silent somatic mutation in gene.
  12. nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
  13. nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  14. samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
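
As a minimal sketch of working with this table, the snippet below reads rows with Python's csv module and keeps genes at or below an FDR threshold. It assumes the file is tab-delimited with a header row whose names match the columns listed above; the helper name `significant_genes`, the 0.1 cutoff, and the inline sample rows are illustrative and are not real MutEnricher output.

```python
import csv
import io

def significant_genes(handle, fdr_cutoff=0.1):
    """Return (gene, FDR_BH) pairs with FDR_BH at or below the cutoff, sorted by FDR."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [(row["Gene"], float(row["FDR_BH"]))
            for row in reader
            if float(row["FDR_BH"]) <= fdr_cutoff]
    return sorted(hits, key=lambda pair: pair[1])

# Made-up rows holding only the columns used here; a real file has all 14.
example = io.StringIO(
    "Gene\tgene_pval\tFDR_BH\n"
    "GENE_A\t1e-8\t2e-6\n"
    "GENE_B\t0.4\t0.8\n"
)
hits = significant_genes(example)
print(hits)
```

For a real run, pass an open file handle for [prefix]_gene_enrichments.txt in place of the in-memory example.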

2. [prefix]_hotspot.txt

This text file contains the results of the hotspot enrichment procedure.

Columns:

  1. Gene: Gene name from GTF.
  2. hotspot: Genomic coordinates of tested hotspot.
  3. num_mutations: Number of non-silent somatic mutations considered in hotspot test.
  4. hotspot_length: Length of hotspot window.
  5. effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
  6. bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
  7. bg_prob: Background mutation rate used in negative binomial test for hotspot.
  8. pval: Raw p-value of negative binomial test for hotspot.
  9. FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
  10. num_samples: Number of samples possessing a non-silent somatic mutation in hotspot window.
  11. position_counts: Semi-colon-separated list of genomic positions in hotspot containing non-silent mutations, including counts; in format [position]_[count].
  12. mutation_counts: Semi-colon-separated list of genomic positions in hotspot, including base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  13. samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
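
The position_counts and mutation_counts fields pack several values into one cell. A small sketch of unpacking them is shown below, assuming entries are separated by semicolons and fields by underscores as described above. The function names and the example strings are hypothetical; positions are kept as strings because this page does not state whether they carry a chromosome prefix.

```python
def parse_position_counts(field):
    """Parse '[position]_[count];...' into a {position: count} dict."""
    counts = {}
    for entry in field.split(";"):
        if not entry.strip():
            continue
        # Split on the last underscore so the position text is left intact.
        position, count = (part.strip() for part in entry.rsplit("_", 1))
        counts[position] = int(count)
    return counts

def parse_mutation_counts(field):
    """Parse '[position]_[ref]_[alt]_[count];...' into a list of tuples."""
    records = []
    for entry in field.split(";"):
        if not entry.strip():
            continue
        position, ref, alt, count = (part.strip() for part in entry.rsplit("_", 3))
        records.append((position, ref, alt, int(count)))
    return records

# Made-up cell values for illustration.
pos_counts = parse_position_counts("25398284_3;25398285_1")
mut_counts = parse_mutation_counts("25398284_C_T_2;25398284_C_A_1")
print(pos_counts)
print(mut_counts)
```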

3. [prefix]_gene_hotspot_Fisher_enrichments.txt

This text file contains combined significance results for the overall gene region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.

Columns:

  1. Gene: Gene name from GTF.
  2. coordinates: Genomic coordinates of gene, from first to last annotated exon.
  3. num_nonsilent: Total non-silent mutations in gene across samples.
  4. num_bg: Total silent mutations identified within gene coordinates in samples.
  5. full_length: Total gene length in basepairs (corresponding to (2)).
  6. coding_length: Total length of gene coding domains (e.g. sum of CDS regions in GTF).
  7. bg_type: String indicating method used to estimate gene's background rate; one of global, local, or clustered_regions.
  8. bg_prob: Gene background mutation rate used in negative binomial tests.
  9. gene_pval: Raw p-value of negative binomial test for gene.
  10. hotspot_pvals: Semi-colon-separated list of p-values associated with identified gene hotspots (NA if no hotspots found).
  11. Fisher_pval: Fisher combined p-value of (9) and (10) values.
  12. Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
  13. num_samples: Number of samples possessing a non-silent somatic mutation in gene.
  14. nonsilent_position_counts: Semi-colon-separated list of genomic positions containing non-silent mutations along with counts; in format [position]_[count].
  15. nonsilent_mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  16. samples: Semi-colon-separated list of sample IDs containing non-silent somatic mutations in gene.
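
For readers unfamiliar with Fisher's method, the sketch below shows the general calculation: the statistic is -2 times the sum of the log p-values, compared against a chi-square distribution with 2k degrees of freedom for k p-values. It uses the closed-form chi-square tail for even degrees of freedom so that only the standard library is needed. This illustrates the textbook method only; how MutEnricher itself handles NA entries or multiple hotspot p-values when computing Fisher_pval is not described on this page, and the function name is hypothetical.

```python
import math

def fisher_combined_pvalue(pvals):
    """Combine independent p-values with Fisher's method (general form)."""
    if not pvals:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvals):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # Fisher statistic divided by 2
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

combined = fisher_combined_pvalue([0.05, 0.05])
print(combined)
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the formula.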

4. [prefix]_gene_data.pkl

This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Gene objects (instances of the Gene class defined in the coding analysis code). To inspect this information, load the file in Python with:

# In Python 2
import sys, cPickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_gene_data.pkl', 'rb') as fh:
    genes = cPickle.load(fh) # Load gene data pickle file

# In Python 3
import sys
import pickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_gene_data.pkl', 'rb') as fh:
    genes = pickle.load(fh) # Load gene data pickle file

To find the information for a particular gene, search the list by name:

gene_of_interest = 'KRAS'
kras = None
for g in genes:
    if g.name == gene_of_interest:
        kras = g # Keep the matching Gene object itself
        break
if kras is None:
    raise ValueError('Gene not found: ' + gene_of_interest)

The above code extracts the Gene object for the gene KRAS. The user can now observe internal information associated with this gene.
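
One generic way to see what such an object holds is Python's built-in `vars()`, sketched below. The `_StandInGene` class is a hypothetical placeholder so that the snippet runs without MutEnricher installed; only the `name` and `index` attributes appear on this page, and the real Gene class carries more. The sketch also assumes instances store their data as ordinary instance attributes.

```python
class _StandInGene:
    """Hypothetical placeholder for MutEnricher's Gene class."""
    def __init__(self, name, index):
        self.name = name
        self.index = index

# With real data, `genes` would come from the pickle file instead.
genes = [_StandInGene("TP53", 0), _StandInGene("KRAS", 1)]

# A name-to-object dictionary avoids rescanning the list for each lookup.
by_name = {g.name: g for g in genes}
kras = by_name.get("KRAS")  # None if the gene is absent

# List the attribute names stored on the object.
attribute_names = sorted(vars(kras))
print(attribute_names)
```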

5. [prefix].log

Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of genes/hotspots tested.

Non-coding analysis output files

1. [prefix]_region_WAP_enrichments.txt

This text file contains the combined enrichment results for the overall region (from the negative binomial enrichment procedure) and the weighted average proximity (WAP) clustering procedure. P-values are combined with Fisher's method.

Columns:

  1. Region: Genomic coordinates of region (from input BED file).
  2. region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
  3. num_mutations: Total number of somatic mutations in region across samples.
  4. length: Length of region in basepairs.
  5. effective_length: Length of region multiplied by number of samples.
  6. bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
  7. bg_prob: Region background mutation rate used in negative binomial tests.
  8. region_pval: Raw p-value from negative binomial test of region.
  9. WAP: Statistic from weighted average proximity procedure performed on region.
  10. WAP_pval: Permutation p-value of WAP procedure.
  11. Fisher_pval: Fisher combined p-value of (8) and (10) values.
  12. FDR_BH: Benjamini-Hochberg FDR-corrected Fisher p-value.
  13. num_samples: Number of samples possessing a somatic mutation in region.
  14. position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
  15. mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  16. samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.
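
The sketch below splits a region coordinate string into its parts and reproduces the effective_length arithmetic described above (region length multiplied by number of samples). It assumes coordinates are written as chrom:start-end, as in the example region used elsewhere on this page; the function names and the cohort size of 100 are made up for illustration.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)

def effective_length(length, num_samples):
    """Region length scaled by cohort size."""
    return length * num_samples

chrom, start, end = parse_region("chr5:1295773-1296014")
print(chrom, start, end)

# Use the length column from the file rather than recomputing it from the
# coordinates, since the coordinate convention is not stated on this page.
print(effective_length(241, 100))
```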

2. [prefix]_hotspot.txt

This text file contains the results of the hotspot enrichment procedure using negative binomial tests.

Columns:

  1. Hotspot: Genomic coordinates of hotspot.
  2. region: Genomic coordinates of full region associated with hotspot.
  3. region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
  4. num_mutations: Total number of somatic mutations in region across samples.
  5. hotspot_length: Length of hotspot window.
  6. effective_length: Length of hotspot window adjusted for cohort size (i.e. hotspot length times number of samples).
  7. bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
  8. bg_prob: Hotspot background mutation rate used in negative binomial tests.
  9. pval: Raw p-value of negative binomial test for hotspot.
  10. FDR_BH: Benjamini-Hochberg FDR-corrected p-value for hotspot.
  11. num_samples: Number of samples possessing a somatic mutation in hotspot.
  12. position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
  13. mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  14. samples: Semi-colon-separated list of sample IDs containing somatic mutations in hotspot.
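
To make the FDR_BH column concrete, here is a minimal sketch of the standard Benjamini-Hochberg adjustment applied to a list of raw p-values. It shows the textbook procedure; whether MutEnricher's implementation matches it step for step is not stated on this page, and the function name and input values are illustrative.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```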

3. [prefix]_region_WAP_hotspot_Fisher_enrichments.txt

This text file contains combined significance results for the overall region (1 above) and candidate hotspots (if found, 2 above) using Fisher's method.

Columns:

  1. Region: Genomic coordinates of region (from input BED file).
  2. region_name: Name assigned to region (from 4th column of BED file if present; assigned internally otherwise).
  3. num_mutations: Total number of somatic mutations in region across samples.
  4. length: Length of region in basepairs.
  5. effective_length: Length of region multiplied by number of samples.
  6. bg_type: String indicating method used to estimate region's background rate; one of global, local, or clustered_regions.
  7. bg_prob: Region background mutation rate used in negative binomial tests.
  8. region_pval: Raw p-value from negative binomial test of region.
  9. WAP: Statistic from weighted average proximity procedure performed on region.
  10. WAP_pval: Permutation p-value of WAP procedure.
  11. hotspot_pvals: Semi-colon-separated list of p-values associated with identified hotspots (NA if no hotspots found).
  12. Fisher_pval: Fisher combined p-value of values (8), (10), and (11).
  13. Fisher_FDR: Benjamini-Hochberg FDR-corrected Fisher p-value.
  14. num_samples: Number of samples possessing a somatic mutation in region.
  15. position_counts: Semi-colon-separated list of genomic positions containing somatic mutations along with counts; in format [position]_[count].
  16. mutation_counts: Semi-colon-separated list of genomic positions with base alterations and counts; in format [position]_[reference base(s)]_[alternate base(s)]_[count].
  17. samples: Semi-colon-separated list of sample IDs containing somatic mutations in region.
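
Because hotspot_pvals holds either NA or a semicolon-separated list, downstream scripts need to handle both cases. A small sketch, with a hypothetical function name and made-up values:

```python
def parse_hotspot_pvals(field):
    """Return hotspot p-values as floats; an 'NA' or empty cell gives an empty list."""
    field = field.strip()
    if not field or field == "NA":
        return []
    return [float(p) for p in field.split(";")]

print(parse_hotspot_pvals("NA"))
print(parse_hotspot_pvals("0.001;0.02"))
```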

4. [prefix]_region_data.pkl

This is a Python pickle file containing the mutation data and calculations used in the enrichment analysis. The file contains a Python list of Region objects (instances of the Region class defined in the non-coding analysis code). To inspect this information, load the file in Python with:

# In Python 2
import sys, cPickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_region_data.pkl', 'rb') as fh:
    regions = cPickle.load(fh) # Load region data pickle file

# In Python 3
import sys
import pickle
sys.path.insert(0, '/MutEnricher/install/path/') # Make MutEnricher install path available
sys.path.insert(0, '/MutEnricher/install/path/math_funcs/') # Add path to math functions as well

with open('/path/to/output/example_region_data.pkl', 'rb') as fh:
    regions = pickle.load(fh) # Load region data pickle file

To find the information for a particular region, search the list by name:

region_of_interest = 'chr5:1295773-1296014'
reg = None
for r in regions:
    if r.name == region_of_interest:
        reg = r # Keep the matching Region object itself
        break
if reg is None:
    raise ValueError('Region not found: ' + region_of_interest)

The above code extracts the Region object for the defined region. The user can now observe internal information associated with this non-coding region.

5. [prefix].log

Text file containing run information, including MutEnricher version, input files, optional parameter values, and notes about the number of regions/hotspots tested.