Skip to content

Commit

Permalink
DOC: intro, install, tutorial, tips
Browse files Browse the repository at this point in the history
  • Loading branch information
arnavm committed Aug 7, 2020
1 parent 82bf1a0 commit 69bcad7
Show file tree
Hide file tree
Showing 12 changed files with 275 additions and 7 deletions.
3 changes: 2 additions & 1 deletion docs/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
blockify==0.2.0
sphinx-argparse
sphinx_rtd_theme
sphinx_rtd_theme
sphinxcontrib-bibtex
1 change: 1 addition & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
'sphinx.ext.viewcode',
'sphinx.ext.intersphinx',
'sphinxarg.ext',
'sphinxcontrib.bibtex',
'sphinx_rtd_theme',]

# Add any paths that contain templates here, relative to this directory.
Expand Down
9 changes: 5 additions & 4 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
blockify
==================================================
========

.. image:: https://img.shields.io/pypi/v/blockify.svg
:target: https://pypi.python.org/pypi/blockify
Expand All @@ -11,18 +11,19 @@ blockify
:target: https://travis-ci.org/arnavm/blockify
:alt: Build Status
.. image:: https://readthedocs.org/projects/blockify/badge/?version=latest
:target: http://blockify.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
:target: http://blockify.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

Blockify is a command line program and Python library for genome segmentation and peak calling using Bayesian blocks.

.. toctree::
:maxdepth: 1
:caption: Contents:
:caption: Contents

pages/introduction
pages/installation
pages/tutorial
pages/tips
pages/commands/blockify
pages/commands/segment
pages/commands/call
Expand Down
2 changes: 2 additions & 0 deletions docs/source/pages/commands/call.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _call:

Command: call
=============

Expand Down
2 changes: 2 additions & 0 deletions docs/source/pages/commands/downsample.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _downsample:

Command: downsample
===================

Expand Down
2 changes: 2 additions & 0 deletions docs/source/pages/commands/normalize.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _normalize:

Command: normalize
==================

Expand Down
2 changes: 2 additions & 0 deletions docs/source/pages/commands/segment.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _segment:

Command: segment
================

Expand Down
37 changes: 36 additions & 1 deletion docs/source/pages/installation.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,39 @@
.. _installation:

Installation
============

blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).
blockify runs on Python (>= 3.4) and is installable via ``pip``

.. code-block:: python
pip install blockify
Development
-----------

To actively develop blockify, clone from GitHub and switch to the
development branch:

.. code-block:: bash
git clone https://github.com/arnavm/blockify.git
cd blockify
git checkout dev
Unit tests are available from the top-level directory:

.. code-block:: python
python -m unittest tests.test_basic
Two batteries of tests are provided: ``tests.test_basic`` and
``tests.test_advanced``. For routine development, the basic set of tests
should be sufficient. The advanced suite takes much more time and
fetches several large datasets. It is best used when making major
changes to the code.

Disclaimer
----------

Not to be confused with the `similarly-named Spotify plugin <https://github.com/serialoverflow/blockify>`_.
19 changes: 19 additions & 0 deletions docs/source/pages/introduction.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,23 @@
.. _introduction:

Introduction
============

blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).

The package is built around the Bayesian blocks algorithm :cite:`scargle_studies_2013`, which finds the optimal change points in time series data assuming a Poisson counting process. We also implement a dynamic pruning strategy which achieves linear runtime performance :cite:`killick_optimal_2012`. An interactive notebook demonstrating Bayesian blocks can be found `here <https://observablehq.com/d/d2cafaa7d8c1e018>`_.

While Bayesian blocks was originally developed in the astrophysics community for photon-counting experiments, we find that it has applications in genomics. In particular, we use it to analyze transposon calling cards experiments. Calling cards uses a transposase fused to a transcription factor (TF) to deposit transposons near TF binding sites. Bayesian blocks partitions the genome based on the local density of insertions, which in turn are used to identify peaks and candidate TF binding sites. We have also had success using this algorithm to perfom general-purpose genome segmentation.

.. note::
Recent papers using calling cards include :cite:`shively_homotypic_2019` and :cite:`liu_quantitative_2020`. For examples of Bayesian blocks in practice, see :cite:`cammack_viral_2020` and :cite:`moudgil_self-reporting_2020`.

blockify is best designed to process qBED files :cite:`moudgil_qbed_2020`, although it will work with BED files.

To get started, please see our :ref:`installation` guide and :ref:`tutorial`.

References
----------

.. bibliography:: ../refs.bib
:cited:
30 changes: 30 additions & 0 deletions docs/source/pages/tips.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
.. _tips:

Tips and Tricks
===============

Here are some helpful tips and tricks to get the most out of blockify.

Resolution
----------

Transposition data can be sparse, particularly if the transposase is constricted to specific motifs (e.g. *piggyBac*). Sparse data lead to broader peaks, which can be harder to interpret. Here are some strategies to increase resolution:

* Omit or decrease the ``-d/--distance`` flag to minimize merging of significant blocks
* Specify ``-t/--tight``, which will pull in peak boundaries so they overlap qBED entries
* For the sharpest intervals, use the ``-s/--summit`` flag to return each peak's maximum
* Finally, increasing the value of ``--p0`` (default: 0.05) can lead to more peaks being called, at the risk of returning more false positives.

Miscellaneous
-------------

* Although the :ref:`tutorial` demonstrated first generating a list of blocks for input into ``blockify call``, this step is not strictly necessary. If a regions file is not supplied, blockify will generate one behind the scenes using the default settings in ``blockify segment``. However, this can result in considerable memory usage. Pre-computing the blocks file is one way to minimize memory consumption and improve performance.
* Similarly, the regions over which to run ``blockify call`` need not be Bayesian blocks. The program can operate on any set of intervals provided in BED format. This flexibility can be useful if there are a set of features that are biologically meaningful to your analysis. For example, this could be a file of promoter regions or accessible loci where a TF might be bound.
* Peaks are output in BED6 format with a generic annotation, like ``peak_1743``. The program does not re-calculate p-values on these *post hoc* by default. If you want to further calculate significance or density of these peaks, simply re-run ``blockify call`` with the ``--intermediate`` flag set and supply the peaks file to ``-r/--regions``. Then inspect the intermediate file for these details. Picking up with the BRD4 example from the :ref:`tutorial`:

.. code-block::
> blockify call -i HCT-116_PBase.ccf -r HCT-116_PBase_peaks.bed -bg hg38_TTAA.bed -c 0 -p 1e-30 -d 12500 --intermediate HCT-116_PBase_peaks_annotated.csv > /dev/null
> head -n 2 HCT-116_PBase_peaks_annotated.csv
,chrom,start,end,name,score,strand,Input,Background,Normed_bg,Net_density,pValue,negLog10pValue,rejected
0,chr1,7298597,7304456,peak_0,1,.,130.0,32,2.533130940448789,0.021755738020063357,3.74243334103237e-169,168.42684592642368,True
62 changes: 61 additions & 1 deletion docs/source/pages/tutorial.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,64 @@
.. _tutorial:

Tutorial
========

blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).
This tutorial illustrates basic blockify usage. We will analyze some previously published data :cite:`moudgil_self-reporting_2020`. Specifically, we will segment bulk SP1-*piggyBac* calling cards data and call SP1-directed peaks. We will also segment wild-type *piggyBac* calling cards data and call BRD4-directed peaks.

Getting Started
---------------

We need to download and decompress the processed data files, as well as a reference set of TTAA tetramers.

.. code-block:: bash
> wget -O HCT-116_PBase.ccf.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4471636&format=file&file=GSM4471636%5FHCT%2D116%5FPBase%2Eccf%2Etxt%2Egz"
> gunzip HCT-116_PBase.ccf.gz
> wget -O HCT-116_SP1-PBase.ccf.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4471637&format=file&file=GSM4471637%5FHCT%2D116%5FSP1%2DPBase%2Eccf%2Etxt%2Egz"
> gunzip HCT-116_SP1-PBase.ccf.gz
> wget https://gitlab.com/arnavm/calling_cards/-/raw/master/Ref/TTAA/hg38_TTAA.bed.xz
> xz -d hg38_TTAA.bed.xz
> ls
HCT-116_PBase.ccf
HCT-116_SP1-PBase.ccf
hg38_TTAA.bed
Here, HCT-116_SP1-PBase.ccf are the insertions from the SP1-directed experiment, HCT-116_PBase.ccf are the insertions from the wild-type transposase, and hg38_TTAA.bed are the set of potential *piggyBac* insertion sites.

Calling SP1 Peaks
-----------------

We first segment the experiment file as this gives us a candidate set of regions with piecewise-constant densities. These contiguous, bookended intervals are known as blocks.

.. code-block:: bash
> blockify segment -i HCT-116_SP1-PBase.ccf -o HCT-116_SP1-PBase.blocks
> wc -l HCT-116_SP1-PBase.blocks
21375
> head -n 3 HCT-116_SP1-PBase.blocks
chr1 54672 758707
chr1 758707 906326
chr1 906326 906925
We now use this set of blocks along with the insertions themselves to call peaks. Since the blocks have piecewise-constant density, we model the number of insertions in each block as a Poisson process. For each block, we parameterize the background Poisson process using the control, undirected insertions, scaled for library size. A set of peaks might be called like this:

.. code-block:: bash
> blockify call -i HCT-116_SP1-PBase.ccf -r HCT-116_SP1-PBase.blocks -bg HCT-116_PBase.ccf -a 0.05 --correction fdr_bh -d 250 -t -o HCT-116_SP1-PBase_peaks.bed
> wc -l HCT-116_SP1-PBase_peaks.bed
8356
Here, we specified the input file with ``-i``, the input regions with ``-r``, and the background data with ``-bg``. We used Benjamini-Hochberg correction at a false-discovery rate of 5% (``-a 0.05 --correction fdr_bh``), merging significant blocks within 250 bp of each other (``-d 250``), and tightening the final peaks (``-t``) to improve peak resolution. For a full list of options, see :ref:`call`.

Calling BRD4 Peaks
------------------

To identify BRD4-bound peaks from undirected insertions, we follow a similar set of commands for calling TF-directed peaks. There are two key differences: first, we use hg38_TTAA.bed as the background file, as our null model assumes insertions would be uniformly distributed across the genome; and second, we set the pseudocount to 0 (``-c 0``). When calling TF-directed peaks, both the input and background sets of insertions are random variables. Thus, in any given block, it is possible that there are zero insertions in the background file. To account for undersampling, we added a pseudocount (default: 1). This is not necessary when calling BRD4 peaks because there will always be a TTAA in each block, as *piggyBac* virtually always inserts into this motif. Retaining a pseudocount, while not technically wrong, will decrease sensitivity.

.. code-block:: bash
> blockify call -i HCT-116_PBase.ccf -r HCT-116_PBase.blocks -bg hg38_TTAA.bed -c 0 -p 1e-30 -d 12500 -o HCT-116_PBase_peaks.bed
> wc -l HCT-116_PBase_peaks.bed
1935
Here, we've optimized peak calling to detect BRD4-bound super-enhancers. We've set a very strict threshold, using an unadjusted p-value cutoff of 1e-30 (``-p 1e-30``) and merging significant blocks within 12,500 bp (``-d 12500``).

0 comments on commit 69bcad7

Please sign in to comment.