DOC: intro, install, tutorial, tips

arnavm · Aug 7, 2020 · 69bcad7 · 69bcad7
1 parent 82bf1a0
commit 69bcad7
Show file tree

Hide file tree

Showing 12 changed files with 275 additions and 7 deletions.
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,3 +1,4 @@
 blockify==0.2.0
 sphinx-argparse
-sphinx_rtd_theme
+sphinx_rtd_theme
+sphinxcontrib-bibtex
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -36,6 +36,7 @@
               'sphinx.ext.viewcode',
               'sphinx.ext.intersphinx',
               'sphinxarg.ext',
+              'sphinxcontrib.bibtex',
               'sphinx_rtd_theme',]
 
 # Add any paths that contain templates here, relative to this directory.

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -1,5 +1,5 @@
 blockify
-==================================================
+========
 
 .. image:: https://img.shields.io/pypi/v/blockify.svg
    :target: https://pypi.python.org/pypi/blockify
@@ -11,18 +11,19 @@ blockify
    :target: https://travis-ci.org/arnavm/blockify
    :alt: Build Status
 .. image:: https://readthedocs.org/projects/blockify/badge/?version=latest
-  :target: http://blockify.readthedocs.io/en/latest/?badge=latest
-  :alt: Documentation Status
+   :target: http://blockify.readthedocs.io/en/latest/?badge=latest
+   :alt: Documentation Status
 
 Blockify is a command line program and Python library for genome segmentation and peak calling using Bayesian blocks.
 
 .. toctree::
    :maxdepth: 1
-   :caption: Contents:
+   :caption: Contents
 
    pages/introduction
    pages/installation
    pages/tutorial
+   pages/tips
    pages/commands/blockify
    pages/commands/segment
    pages/commands/call

diff --git a/docs/source/pages/commands/call.rst b/docs/source/pages/commands/call.rst
@@ -1,3 +1,5 @@
+.. _call:
+
 Command: call
 =============
 

diff --git a/docs/source/pages/commands/downsample.rst b/docs/source/pages/commands/downsample.rst
@@ -1,3 +1,5 @@
+.. _downsample:
+
 Command: downsample
 ===================
 

diff --git a/docs/source/pages/commands/normalize.rst b/docs/source/pages/commands/normalize.rst
@@ -1,3 +1,5 @@
+.. _normalize:
+
 Command: normalize
 ==================
 

diff --git a/docs/source/pages/commands/segment.rst b/docs/source/pages/commands/segment.rst
@@ -1,3 +1,5 @@
+.. _segment:
+
 Command: segment
 ================
 

diff --git a/docs/source/pages/installation.rst b/docs/source/pages/installation.rst
@@ -1,4 +1,39 @@
+.. _installation:
+
 Installation
 ============
 
-blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).
+blockify runs on Python (>= 3.4) and is installable via ``pip``
+
+.. code-block:: python
+
+   pip install blockify
+
+Development
+-----------
+
+To actively develop blockify, clone from GitHub and switch to the
+development branch:
+
+.. code-block:: bash
+
+   git clone https://github.com/arnavm/blockify.git
+   cd blockify
+   git checkout dev
+
+Unit tests are available from the top-level directory:
+
+.. code-block:: python
+
+   python -m unittest tests.test_basic
+
+Two batteries of tests are provided: ``tests.test_basic`` and
+``tests.test_advanced``. For routine development, the basic set of tests
+should be sufficient. The advanced suite takes much more time and
+fetches several large datasets. It is best used when making major
+changes to the code.
+
+Disclaimer
+----------
+
+Not to be confused with the `similarly-named Spotify plugin <https://github.com/serialoverflow/blockify>`_.
diff --git a/docs/source/pages/introduction.rst b/docs/source/pages/introduction.rst
@@ -1,4 +1,23 @@
+.. _introduction:
+
 Introduction
 ============
 
 blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).
+
+The package is built around the Bayesian blocks algorithm :cite:`scargle_studies_2013`, which finds the optimal change points in time series data assuming a Poisson counting process. We also implement a dynamic pruning strategy which achieves linear runtime performance :cite:`killick_optimal_2012`. An interactive notebook demonstrating Bayesian blocks can be found `here <https://observablehq.com/d/d2cafaa7d8c1e018>`_.
+
+While Bayesian blocks was originally developed in the astrophysics community for photon-counting experiments, we find that it has applications in genomics. In particular, we use it to analyze transposon calling cards experiments. Calling cards uses a transposase fused to a transcription factor (TF) to deposit transposons near TF binding sites. Bayesian blocks partitions the genome based on the local density of insertions, which in turn are used to identify peaks and candidate TF binding sites. We have also had success using this algorithm to perfom general-purpose genome segmentation.
+
+.. note::
+   Recent papers using calling cards include :cite:`shively_homotypic_2019` and :cite:`liu_quantitative_2020`. For examples of Bayesian blocks in practice, see :cite:`cammack_viral_2020` and :cite:`moudgil_self-reporting_2020`.
+
+blockify is best designed to process qBED files :cite:`moudgil_qbed_2020`, although it will work with BED files.
+
+To get started, please see our :ref:`installation` guide and :ref:`tutorial`.
+
+References
+----------
+
+.. bibliography:: ../refs.bib
+   :cited:
diff --git a/docs/source/pages/tips.rst b/docs/source/pages/tips.rst
@@ -0,0 +1,30 @@
+.. _tips:
+
+Tips and Tricks
+===============
+
+Here are some helpful tips and tricks to get the most out of blockify.
+
+Resolution
+----------
+
+Transposition data can be sparse, particularly if the transposase is constricted to specific motifs (e.g. *piggyBac*). Sparse data lead to broader peaks, which can be harder to interpret. Here are some strategies to increase resolution:
+
+* Omit or decrease the ``-d/--distance`` flag to minimize merging of significant blocks
+* Specify ``-t/--tight``, which will pull in peak boundaries so they overlap qBED entries
+* For the sharpest intervals, use the ``-s/--summit`` flag to return each peak's maximum
+* Finally, increasing the value of ``--p0`` (default: 0.05) can lead to more peaks being called, at the risk of returning more false positives.
+
+Miscellaneous
+-------------
+
+* Although the :ref:`tutorial` demonstrated first generating a list of blocks for input into ``blockify call``, this step is not strictly necessary. If a regions file is not supplied, blockify will generate one behind the scenes using the default settings in ``blockify segment``. However, this can result in considerable memory usage. Pre-computing the blocks file is one way to minimize memory consumption and improve performance.
+* Similarly, the regions over which to run ``blockify call`` need not be Bayesian blocks. The program can operate on any set of intervals provided in BED format. This flexibility can be useful if there are a set of features that are biologically meaningful to your analysis. For example, this could be a file of promoter regions or accessible loci where a TF might be bound.
+* Peaks are output in BED6 format with a generic annotation, like ``peak_1743``. The program does not re-calculate p-values on these *post hoc* by default. If you want to further calculate significance or density of these peaks, simply re-run ``blockify call`` with the ``--intermediate`` flag set and supply the peaks file to ``-r/--regions``. Then inspect the intermediate file for these details. Picking up with the BRD4 example from the :ref:`tutorial`:
+
+.. code-block::
+
+   > blockify call -i HCT-116_PBase.ccf -r HCT-116_PBase_peaks.bed -bg hg38_TTAA.bed -c 0 -p 1e-30 -d 12500 --intermediate HCT-116_PBase_peaks_annotated.csv > /dev/null
+   > head -n 2 HCT-116_PBase_peaks_annotated.csv
+   ,chrom,start,end,name,score,strand,Input,Background,Normed_bg,Net_density,pValue,negLog10pValue,rejected
+   0,chr1,7298597,7304456,peak_0,1,.,130.0,32,2.533130940448789,0.021755738020063357,3.74243334103237e-169,168.42684592642368,True
diff --git a/docs/source/pages/tutorial.rst b/docs/source/pages/tutorial.rst
@@ -1,4 +1,64 @@
+.. _tutorial:
+
 Tutorial
 ========
 
-blockify is a fast and optimal genomic peak caller for one-dimensional data (e.g. BED, qBED, CCF).
+This tutorial illustrates basic blockify usage. We will analyze some previously published data :cite:`moudgil_self-reporting_2020`. Specifically, we will segment bulk SP1-*piggyBac* calling cards data and call SP1-directed peaks. We will also segment wild-type *piggyBac* calling cards data and call BRD4-directed peaks.
+
+Getting Started
+---------------
+
+We need to download and decompress the processed data files, as well as a reference set of TTAA tetramers.
+
+.. code-block:: bash
+
+   > wget -O HCT-116_PBase.ccf.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4471636&format=file&file=GSM4471636%5FHCT%2D116%5FPBase%2Eccf%2Etxt%2Egz"
+   > gunzip HCT-116_PBase.ccf.gz
+   > wget -O HCT-116_SP1-PBase.ccf.gz "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4471637&format=file&file=GSM4471637%5FHCT%2D116%5FSP1%2DPBase%2Eccf%2Etxt%2Egz"
+   > gunzip HCT-116_SP1-PBase.ccf.gz
+   > wget https://gitlab.com/arnavm/calling_cards/-/raw/master/Ref/TTAA/hg38_TTAA.bed.xz
+   > xz -d hg38_TTAA.bed.xz
+   > ls
+   HCT-116_PBase.ccf
+   HCT-116_SP1-PBase.ccf
+   hg38_TTAA.bed
+
+Here, HCT-116_SP1-PBase.ccf are the insertions from the SP1-directed experiment, HCT-116_PBase.ccf are the insertions from the wild-type transposase, and hg38_TTAA.bed are the set of potential *piggyBac* insertion sites.
+
+Calling SP1 Peaks
+-----------------
+
+We first segment the experiment file as this gives us a candidate set of regions with piecewise-constant densities. These contiguous, bookended intervals are known as blocks.
+
+.. code-block:: bash
+
+   > blockify segment -i HCT-116_SP1-PBase.ccf -o HCT-116_SP1-PBase.blocks
+   > wc -l HCT-116_SP1-PBase.blocks
+   21375
+   > head -n 3 HCT-116_SP1-PBase.blocks
+   chr1	54672	758707
+   chr1	758707	906326
+   chr1	906326	906925
+
+We now use this set of blocks along with the insertions themselves to call peaks. Since the blocks have piecewise-constant density, we model the number of insertions in each block as a Poisson process. For each block, we parameterize the background Poisson process using the control, undirected insertions, scaled for library size. A set of peaks might be called like this:
+
+.. code-block:: bash
+
+   > blockify call -i HCT-116_SP1-PBase.ccf -r HCT-116_SP1-PBase.blocks -bg HCT-116_PBase.ccf -a 0.05 --correction fdr_bh -d 250 -t -o HCT-116_SP1-PBase_peaks.bed
+   > wc -l HCT-116_SP1-PBase_peaks.bed
+   8356
+
+Here, we specified the input file with ``-i``, the input regions with ``-r``, and the background data with ``-bg``. We used Benjamini-Hochberg correction at a false-discovery rate of 5% (``-a 0.05 --correction fdr_bh``), merging significant blocks within 250 bp of each other (``-d 250``), and tightening the final peaks (``-t``) to improve peak resolution. For a full list of options, see :ref:`call`.
+
+Calling BRD4 Peaks
+------------------
+
+To identify BRD4-bound peaks from undirected insertions, we follow a similar set of commands for calling TF-directed peaks. There are two key differences: first, we use hg38_TTAA.bed as the background file, as our null model assumes insertions would be uniformly distributed across the genome; and second, we set the pseudocount to 0 (``-c 0``). When calling TF-directed peaks, both the input and background sets of insertions are random variables. Thus, in any given block, it is possible that there are zero insertions in the background file. To account for undersampling, we added a pseudocount (default: 1). This is not necessary when calling BRD4 peaks because there will always be a TTAA in each block, as *piggyBac* virtually always inserts into this motif. Retaining a pseudocount, while not technically wrong, will decrease sensitivity.
+
+.. code-block:: bash
+
+   > blockify call -i HCT-116_PBase.ccf -r HCT-116_PBase.blocks -bg hg38_TTAA.bed -c 0 -p 1e-30 -d 12500 -o HCT-116_PBase_peaks.bed
+   > wc -l HCT-116_PBase_peaks.bed
+   1935
+
+Here, we've optimized peak calling to detect BRD4-bound super-enhancers. We've set a very strict threshold, using an unadjusted p-value cutoff of 1e-30 (``-p 1e-30``) and merging significant blocks within 12,500 bp (``-d 12500``).