
How Chippy works

cameron-jack edited this page Jul 10, 2017 · 1 revision

ChipPy operation is centred around an Ensembl-based annotation database, either downloaded pre-generated from the ChipPy web site or created as needed via an Internet connection to a live Ensembl DB. This annotation database collects information from a specific Ensembl release of a chosen organism’s genome. Gene expression data may then be added: the script add_expression_db.py uploads the user-supplied expression data into the ChipPy database. This expression data can take any of three forms: absolute measured expression; differential expression (the pre-measured difference between two expression experiments); or a target gene set (a list of genes to be targeted for inclusion in, or exclusion from, the final data output).

After all relevant expression data or gene target lists have been loaded into the database, count data can be ‘exported’ from the appropriate regions of NGS sequence read files. ChipPy can read WIG, BED and bedGraph files, including those compressed with Gzip. It can also use the Samtools software suite to read exactly the regions required from BAM files. This stage may take five minutes or more to complete with gigabyte-sized datasets.
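As an illustration of the export step, the following minimal sketch reads a bedGraph file (optionally Gzip-compressed) and accumulates per-base scores over one region. The function names `read_bedgraph` and `window_counts` are hypothetical and are not ChipPy's own API; the sketch assumes the standard four-column bedGraph layout with 0-based, half-open intervals.

```python
import gzip


def read_bedgraph(path):
    """Yield (chrom, start, end, score) tuples from a bedGraph file.

    Files whose names end in '.gz' are opened with gzip; others as plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments and track/browser header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end, score = line.split()[:4]
            yield chrom, int(start), int(end), float(score)


def window_counts(records, chrom, win_start, win_end):
    """Return per-base scores for the half-open window [win_start, win_end)."""
    counts = [0.0] * (win_end - win_start)
    for rec_chrom, start, end, score in records:
        if rec_chrom != chrom:
            continue
        # Clip each interval to the window before adding its score.
        for pos in range(max(start, win_start), min(end, win_end)):
            counts[pos - win_start] += score
    return counts
```

Scanning the whole file for every region is slow on large inputs, which is one reason indexed access to BAM files through Samtools is attractive.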

Many potentially interesting genomic features may be extracted, with full user control over the range and direction of the extraction. Studies have been published highlighting features seen in symmetric searches either side of gene transcription start sites (TSS), but user control equally allows upstream-only (promoter) or downstream-only (coding region) searches. Likewise, ChipPy is not limited to the TSS: it also has annotations for visualising the untranslated regions (5’ and 3’ UTR), intron-exon and exon-intron boundaries, and the transcription termination site (TTS), making it the most capable functional genomics toolkit of its kind.
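The effect of range and direction on an extraction can be sketched as follows. This is a simplified illustration, not ChipPy's implementation: `tss_window` and `orient` are hypothetical helpers, strand is assumed to be encoded as +1 or -1, and off-by-one coordinate conventions are ignored.

```python
def tss_window(tss, strand, upstream, downstream):
    """Return (start, end) genomic coordinates of a window around a TSS.

    'upstream' and 'downstream' are distances in bases relative to the
    direction of transcription, so they swap sides on the minus strand.
    """
    if strand == 1:
        return tss - upstream, tss + downstream
    if strand == -1:
        return tss - downstream, tss + upstream
    raise ValueError("strand must be +1 or -1")


def orient(counts, strand):
    """Reverse minus-strand counts so index 0 is always the upstream end."""
    return list(counts) if strand == 1 else list(counts)[::-1]
```

For example, `tss_window(1000, 1, 500, 0)` describes a promoter-only search on the plus strand, while equal `upstream` and `downstream` values give a symmetric search.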

To avoid issues of non-linearity or limited dynamic range in expression data, all gene expression values are converted to rank order for plotting. To improve signal-to-noise, genes are generally grouped. Grouping can also help highlight trends in highly, moderately or lowly expressed genes.
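A minimal sketch of rank conversion and grouping is shown below. The names `rank_genes` and `group_by_rank` are hypothetical, and the tie-breaking rule (by gene identifier) is an assumption made so that the output is deterministic; ChipPy may handle ties differently.

```python
def rank_genes(expression):
    """Map each gene to its rank, where rank 0 is the highest expression.

    'expression' is a dict of gene id -> expression score. Ties are broken
    by gene id (an assumption made for this sketch).
    """
    ordered = sorted(expression, key=lambda gene: (-expression[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered)}


def group_by_rank(ranks, group_size):
    """Split genes into consecutive groups of 'group_size' by ascending rank."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    ordered = sorted(ranks, key=ranks.get)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
```

Because only the ordering is kept, a gene with a score of 100 and one with a score of 5 end up one rank apart if no other gene lies between them, which is what removes the dependence on dynamic range.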

Design decisions were made to aid reproducibility and correctness, both when new functionality is first built and as the software develops over its lifespan. ChipPy was developed around a relational database (SQLite) core to take advantage of relational calculus rules for organising annotation and expression count information. For all database queries and other critical code sections we use unit tests to ensure that expected behaviour is maintained under both success and failure conditions. To track program execution, ChipPy writes to a central log file, recording all chosen parameters, program behaviour and outputs. ChipPy itself, along with all the software modules it makes use of, is freely available, and the code is released under the GNU General Public License (GPL). The flexible structure of the software makes customisation easy, as core functions are separated from user interface and display modules. In particular, user interfaces are generated dynamically from command-line interface code, so they are simple to design and only need to be coded once.
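To illustrate what a relational core offers, here is a minimal sketch using Python's standard `sqlite3` module. The schema, table names, column names and gene identifiers are all hypothetical and are not ChipPy's actual schema; the point is only that foreign keys tie expression scores to annotated genes, and that a failed load leaves the database unchanged.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (
            gene_id   INTEGER PRIMARY KEY,
            stable_id TEXT UNIQUE NOT NULL,
            chrom     TEXT NOT NULL,
            strand    INTEGER NOT NULL
        );
        CREATE TABLE expression (
            gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
            sample  TEXT NOT NULL,
            score   REAL NOT NULL,
            PRIMARY KEY (gene_id, sample)
        );
    """)
    return conn


def add_expression(conn, sample, scores):
    """Load scores (stable_id -> score) for one sample, all or nothing."""
    with conn:  # commits on success, rolls back if an exception escapes
        for stable_id, score in scores.items():
            row = conn.execute(
                "SELECT gene_id FROM gene WHERE stable_id = ?", (stable_id,)
            ).fetchone()
            if row is None:
                raise KeyError("unknown gene: " + stable_id)
            conn.execute(
                "INSERT INTO expression (gene_id, sample, score) VALUES (?, ?, ?)",
                (row[0], sample, score),
            )


def expression_for_sample(conn, sample):
    """Return (stable_id, score) rows for a sample, highest score first."""
    return conn.execute(
        "SELECT g.stable_id, e.score FROM expression e "
        "JOIN gene g ON g.gene_id = e.gene_id "
        "WHERE e.sample = ? ORDER BY e.score DESC",
        (sample,),
    ).fetchall()
```

A unit test in the spirit described above would check both paths: that a valid load is returned by `expression_for_sample`, and that loading a score for an unknown gene raises `KeyError` and stores nothing.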

By separating the step of combining data from specific regions from the actual plotting of data, ChipPy allows a user to quickly generate visualisations of their data without re-reading large data files (a step which may take over five minutes). The user can choose whether to restrict their study to specific genes or chromosomes at the export stage, which saves time, or at the plotting stage, which allows greater flexibility.

After counts and expression data have been exported from the chosen annotated gene regions, a wide range of plots can be generated. Per-gene mapped chromatin counts may be combined for any number of genes to improve signal-to-noise. Multiple datasets may be shown in the one plot, restricted to any of the ‘top’, ‘middle’ or ‘bottom’ expressed genes, to highlight trends. Datasets may also be divided by another dataset for the purposes of normalisation, as was done in Nekrasov et al., where H3 histone signal was used to show the change in H2A.Z levels at varying cell cycle stages.
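The combining, selection and normalisation steps described above can be sketched in a few lines. The helper names are hypothetical, and the pseudocount used to avoid division by zero is an assumption of this sketch, not something the text specifies.

```python
def mean_profile(profiles):
    """Position-wise mean of equal-length per-gene count profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def select_group(ranked_genes, which, size):
    """Pick 'size' genes from the top, middle or bottom of a ranked list.

    'ranked_genes' is ordered from highest to lowest expression.
    """
    if size < 1 or size > len(ranked_genes):
        raise ValueError("size must be between 1 and the number of genes")
    if which == "top":
        return ranked_genes[:size]
    if which == "bottom":
        return ranked_genes[-size:]
    if which == "middle":
        start = (len(ranked_genes) - size) // 2
        return ranked_genes[start:start + size]
    raise ValueError("which must be 'top', 'middle' or 'bottom'")


def normalise(signal, control, pseudocount=1.0):
    """Divide one profile by another, position by position.

    The pseudocount (an assumption here) guards against division by zero.
    """
    return [(s + pseudocount) / (c + pseudocount) for s, c in zip(signal, control)]
```

Averaging over a group smooths out per-gene noise, and dividing by a control profile removes signal that is shared between the two datasets.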

All aspects of the presentation of plots can be controlled, including colour schemes, fonts, font sizes, figure size, legend creation and output format, so that publication-ready plots can be generated as part of a fully pipelined analysis.

To aid in data exploration and verification of experiment choices, several additional analyses are made available through scripts. These include ChrmVsExpr.py, which plots the amount of mapped chromatin versus its expression score for each gene, for any number of experiment samples; RankVsCounts.py, which displays how the absolute expression or mapped chromatin count increases versus its given rank – i.e. dynamic range and linearity; distribution_plots.py, which creates histograms or box-plots of the spread of expression scores or mapped-counts; and Diff_Abs_Plots.py, which produces dot-plots of the difference in expression of two expression experiments versus the absolute expression components. Scripts for summarising the state of the database (db_summary.py) and dropping unwanted data sets from it (drop_expr_db.py) are also included.

It is important to note that the annotations used by ChipPy are provided by Ensembl by way of PyCogent. Where a gene has more than one transcript, ChipPy considers the longest complete transcript available to be canonical.
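The canonical-transcript rule can be expressed as a short selection function. This sketch does not use the PyCogent API; the record layout (dictionaries with `id`, `length` and `complete` keys) and the function name are hypothetical stand-ins for whatever the annotation source provides.

```python
def canonical_transcript(transcripts):
    """Return the longest complete transcript, or None if none is complete.

    Each transcript is a dict with hypothetical keys 'id', 'length' and
    'complete'. If several share the maximum length, the first one wins.
    """
    complete = [t for t in transcripts if t["complete"]]
    if not complete:
        return None
    return max(complete, key=lambda t: t["length"])
```

Note that an incomplete transcript is never chosen, even when it is longer than every complete one.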
