This script calculates coverage statistics, using a bigwig as input. It has different execution modes.
--panel
: This mode will calculate the coverage metrics for one panel.--gene-list
: This mode will calculate the coverage metrics for a list of genes.--coding-regions
: This mode will calculate the coverage metrics for a set of genes defined in the BED file provided, while avoiding any connection to CellBase.none of above
: This version will calculate the coverage metrics for all genes.
It will output statistics at exon, transcript, gene (by creating a union transcript), chromosome, analysis coding region (this is panel, gene list or whole coding region) and whole genome. The output format is JSON.
This script is executed in the following way:
bigwig_analyser --bw <bigwig.bw> --output <output.json> --config <configuration.config> --wg-regions <non_n_region.bed> --disable-exon-stats
NOTE: The configuration file contain the basic configuration to annotate, get panels information, filter transcripts and calculate stats.
This file will be found in /genomes/resources/genomeref/...
.
#Create a dictionary with the configuration
config = {
"bw" : '/path/to/bigwig.bw',
"panel" : None,
"panel_version": None,
"gene_list": None,
"coding_regions": None,
"coverage_threshold": 15,
'configuration_file': '-',
"cellbase_species": 'hsapiens',
"cellbase_version": 'latest',
"cellbase_assembly": 'GRCh37/GRCh38',
"cellbase_host": '10.5.8.201:8080/cellbase-4.5.0-rc',
"cellbase_retries": -1,
"panelapp_host": 'bioinfo.extge.co.uk/crowdsourcing/WebServices',
"panelapp_gene_confidence": 'HighEvidence',
"panelapp_retries": -1,
"panelapp_assembly": "GRCh37",
"transcript_filtering_flags": 'basic',
"transcript_filtering_biotypes": 'IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_V_gene,protein_coding,nonsense_mediated_decay,non_stop_decay,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene',
"exon_padding": 15,
"wg_stats_enabled": True,
"wg_regions": '/path/to/non_n_regions.bed',
"exon_stats_enabled": False,
"coding_region_stats_enabled": True
}
gel_coverage_engine = GelCoverageRunner(config)
(results, bed) = gel_coverage_engine.run()
# Prints output to stdout
with codecs.open(args.output, 'w', 'utf8') as output_file:
output_file.write(
ujson.dumps(
results,
ensure_ascii=False
)
)
# Saves the analysed region as a BED file
bed.saveas(args.output + ".bed")
NOTE: Please note that this process is highly dependent on the reference genome, use a different assembly o version assembly will produce wrong results.
NOTE: When running an analysis over all genes the resulting JSON will be around 1.5GB, unless you add the flag --disable-exon-stats, but in this case you will be missing the exon level statistics and the coverage gaps.
NOTE: When running an analysis in panel or gene list mode it might be useful to disable the whole genome statistics to improve performance, by using the flag --disable-wg-stats.
NOTE: Beware that the reference genome and chromosome notation (i.e.: chr prefix or not) should be the same in the input bigwig file and the bed file in wg-regions.
The program iterates through the bigwig file twice: the first for the analysis of the coding region (panel, gene list or full) and the second for the analysis of the whole genome.
- To run statistics only for a panel from exon level up to panel level, provide a panel (
panel
) and panel version (panel_version
) and disable the whole genome statistics ("wg_stats_enabled": False
), while making sure that the coding region and the exon level statistics are enabled ("coding_region_stats_enabled": True
and"exon_stats_enabled": True
).- Execution time is of some seconds or under a second for small panels. The panel of intellectual disability v1.23 having 1232 genes took 46s.
- To run statistics only for a gene list from exon level up to gene list level, provide a gene list (
gene_list
) instead of panel and panel version and use the same configuration as above.- Execution time similar as the previous
- To run statistics only for all genes in the coding region do not provide panel (
panel
) or gene list (gene_list
), disable the whole genome statistics ("wg_stats_enabled": False
) and the exon level statistics ("exon_stats_enabled": False
) (the output JSON will be around 4 GB if exon stats are enabled for all genes, otherwise it is around 250 MB), while making sure that the coding region is enabled ("coding_region_stats_enabled": True
).- Execution time is over 3 hours
- To run only whole genome statistics enable
"wg_stats_enabled": True
and disable the coding region statistics ("coding_region_stats_enabled": False
). The whole genome analysis might be used in combination with a bed file defining the region to analyse (e.g.: non N regions) that is to be passed in parameter"wg_regions": '/path/to/non_n_regions.bed'
. Thiswg_regions
can be used to calculate coverage over very specific regions, for instance Cosmic variants if they are set in a BED file.- Execution time is around 1 hour
Any combination, of the previous should generate a single JSON with all the information.
The coverage module depends on two external systems: CellBase and PanelApp. CellBase is used to retrieve the genes to be analysed and the precise coordinates of each genomic region. PanelApp is used only in the panel mode to retrieve those genes belonging to a given panel. If any of these systems is down the analysis cannot run.
An exponential backoff policy will work whenever the connection to any of these two systems fails. The maximum number of retries can be configured by using the parameters cellbase_retries
and panelapp_retries
. If the value is -1 infinite retries will apply. These parameters are not available from the command line, but from the configuration file.
The connection to CellBase can be avoided by using the parameter --coding-regions
. The BED file provided must follow the format:
17 67323193 67323242 ABCA5|ENST00000392676|exon1 0.68 -
17 67310455 67310571 ABCA5|ENST00000392676|exon2 0.37607 -
17 67309233 67309437 ABCA5|ENST00000392676|exon3 0.30732 -
17 67305403 67305564 ABCA5|ENST00000392676|exon4 0.33333 -
17 67304421 67304509 ABCA5|ENST00000392676|exon5 0.44944 -
IMPORTANT: the BED file must be sorted by transcript (not by gene!), otherwise coverage statistics will be erroneous!
The results of the coverage analysis are in JSON format which is intended to be machine readable, but not human readable. To explore the results the jq
tool may ne useful (https://stedolan.github.io/jq/).
- To just print the whole JSON in a nice tabulated format use:
jq '' your.json
- To retrieve a specific field from the json query for it like this:
jq '.results.whole_genome.stats.uneveness' your.json
Much more complex queries can be done. See https://robots.thoughtbot.com/jq-is-sed-for-json for further insight.
NOTE 1: Beware that the JSONs for the coding region with coverage detail at exon level might be quite big and jq will be slow as it loads all data in memory.
NOTE 2: jq is installed in bio-pp9-01
at /genomes/software/apps/jq-1.5
The Bed Maker builds from a list of genes the BED file defining the coding regions that the Bigwig Analyser expects.
bed_maker --config resources/bed_maker.config --output resources/test/deleteme.bed --gene-list SCN2A,SPTAN1,PLCB1,SLC25A22,SCN8A,STXBP1,PNKP
#Create a dictionary with the configuration
config = {
# Sets parameters from CLI
"gene_list": None,
"chr_prefix": False,
"log_level": 10,
"transcript_filtering_flags": "basic",
"transcript_filtering_biotypes": "IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_V_gene,protein_coding,"
"nonsense_mediated_decay,non_stop_decay,TR_C_gene,TR_D_gene,TR_J_gene,"
"TR_V_gene",
"cellbase_species": "hsapiens",
"cellbase_version": "latest",
"cellbase_assembly": "grch37",
"cellbase_host": "your.cellbase.url/cellbase",
"cellbase_retries": -1,
}
bed_maker = BedMaker(config)
bed = bed_maker.run()
# Saves the analysed region as a BED file
bed.saveas("my.bed")