Skip to content

Commit

Permalink
merge
Browse files Browse the repository at this point in the history
  • Loading branch information
bgruening committed Mar 31, 2017
2 parents 568fc85 + 7b65de3 commit ede12d5
Show file tree
Hide file tree
Showing 6 changed files with 91 additions and 86 deletions.
31 changes: 19 additions & 12 deletions docs/content/example_usage.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
.. _example_usage:

Example usage
=============

Expand Down Expand Up @@ -52,10 +54,10 @@ We have used the HiCExplorer sucessfuly with `bwa`, `bowtie2` and `hisat2`. Howe
# -L INT[,INT] penalty for 5'- and 3'-end clipping [5,5] # this is set to no penalty.
$ bwa mem -A1 -B4 -E50 -L0 index_path \
-U mate_R1.fastq.gz ) 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam
-U mate_R1.fastq.gz 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam
$ bwa mem -A1 -B4 -E50 -L0 index_path \
-U mate_R2.fastq.gz ) 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
-U mate_R2.fastq.gz 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
Creation of a Hi-C matrix
Expand All @@ -64,10 +66,12 @@ Creation of a Hi-C matrix
Once the reads have been mapped the Hi-C matrix can be built. For this, the minimal
extra information required is the `binSize` used for the matrix. Is it best
to enter a low number like 10.000 because lower
resolution matrices (larget bins) can be easily constructed using `hicMergeMatrixBins`. Matrices
resolution matrices (larger bins) can be easily constructed using :ref:`hicMergeMatrixBins`. Matrices
at restriction fragment resolution can be created by providing a file
containing the restriction sites, this file can be created with the tool
`findRestSite` that is part of HiCExplorer.
containing the restriction sites, this file can be created with the tool :ref:`findRestSite`

:ref:`findRestSite`
that is part of HiCExplorer.

.. code-block:: bash
Expand All @@ -78,26 +82,29 @@ containing the restriction sites, this file can be created with the tool
--binSize 10000 \
--restrictionSequence GATC \
--outBam hic.bam \
-o hic_matrix.npz > build_matrix.log
-o hic_matrix.npz
--QCfolder ./hicQC
`hicBuildMatrix` creates two files, a bam file containing only the valid Hi-C read pairs and a matrix containing the
Hi-C contacts at the given resolution. The bam file is useful to check the quality of the
Hi-C library on the genome browser. A good Hi-C library should contain piles of reads near
the restriction fragment sites. The log output contains the number of valid pairs, duplicated pairs and
other useful information. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly
the restriction fragment sites. In the `QCfolder` a html file is saved with plots containing
useful information for the quality control of the Hi-C sample like the number of valid pairs, duplicated pairs,
self-ligations etc. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly
because of the reads that are on repetitive regions that need to be discarded.

An important quality control measurement to check is the `inter chromosomal` fraction of reads as this is an indirect
measure of random Hi-C contacts. Good Hi-C libraries have lower than 10% inter chromosomal contacts.
measure of random Hi-C contacts. Good Hi-C libraries have lower than 10% inter chromosomal contacts. The `hicQC`
module can be used to compare the QC measures from different samples.

Correction of Hi-C matrix
^^^^^^^^^^^^^^^^^^^^^^^^^

The Hi-C matrix has to be corrected to remove GC, open chromatin biases and, most importantly,
to normalize the number of restriction sites per bin. Because a fraction of bins from repetitive regions
contain few contacts it is necessary to filter those regions first. Also, in mammalian genomes some regions
enriched by reads should be discarded. To aid in the filtering of regions `hicCorrectMatrix` generates a
enriched by reads should be discarded. To aid in the filtering of regions :ref:`hicCorrectMatrix` generates a
diagnostic plot as follows:

.. code-block:: bash
Expand Down Expand Up @@ -131,8 +138,8 @@ Once the thresholds have been decided, the matrix can be corrected
Visualization of results
^^^^^^^^^^^^^^^^^^^^^^^^

There are two ways to see the resulting matrix, one using `hicPlotMatrix` and the
other is using `hicPlotTADs`. The first one allows the visualization over large regions
There are two ways to see the resulting matrix, one using :ref:`hicPlotMatrix` and the
other is using :ref:`hicPlotTADs`. The first one allows the visualization over large regions
while the second one is preferred to see specific parts together with other information,
for example genes or bigwig tracks.

Expand Down
2 changes: 1 addition & 1 deletion docs/content/list-of-tools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ HiCExplorer tools
+--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+
| tool | type | input files | main output file(s) | application |
+================================+==================+===================================+=============================================+===================================================================================+
|:ref:`findRestSites` | preprocessing | 1 genome FASTA file | bed file with restriction site coordinates | Identifies the genomic locations of restriction sites |
|:ref:`findRestSite` | preprocessing | 1 genome FASTA file | bed file with restriction site coordinates | Identifies the genomic locations of restriction sites |
+--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+
|:ref:`hicBuildMatrix` | preprocessing | 2 BAM/SAM files | hicMatrix object | Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads |
+--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+
Expand Down
137 changes: 68 additions & 69 deletions docs/content/mES-HiC_analysis.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,24 +17,6 @@ Lieberman-Aiden et al. and sequenced on the NextSeq 500 platform using 2
Prepare for analysis
--------------------

Install HiC explorer
~~~~~~~~~~~~~~~~~~~~

The following commands install HiCExplorar globally. In case you don't
have admin privileges, you can first install and setup a `virtual
environment <https://virtualenv.pypa.io/en/latest/>`__ for python, and
then install HiCExplorer inside the virtual environment.

.. code:: bash
# To Clone and install in the ~/programs directory
cd ~/programs
git clone https://github.com/maxplanck-ie/HiCExplorer.git
cd HiCExplorer.git
python setup.py install
## now add HiCExplorer in $PATH
export PATH=:$PATH
Download Raw fastq files
~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -57,115 +39,127 @@ Fastq files can be downloaded from the EBI archive (or NCBI archive).
Map Raw fastq files
~~~~~~~~~~~~~~~~~~~

The raw fastq files are now aligned to mouse genome. Here we will map
them to *GRCm37* using
`HISAT <https://ccb.jhu.edu/software/hisat/index.shtml>`__, a fast and
memory efficient read aligner.
Mates have to be mapped individually to avoid mapper specific heuristics designed
for standard paired-end libraries.

The option : **--sensitive-local** is used, since we want local
alignment. In HiC, many reads in the read-pair overlap with their mates.
Using *local* alignment avoids the alignment of the end of the reads,
where the sequence might correspond to the read mate.
We have used the HiCExplorer sucessfuly with `bwa`, `bowtie2` and `hisat2`. However, it is important to:

**--reorder** is used to get the reads in the sam file in the order they
are aligned in. HiCExplorer reads the alignment in this order.
* for either `bowtie2` or `hisat2` use the `--reorder` parameter which tells bowtie2 or hisat2 to output
the *sam* files in the **exact** same order as in the *.fastq* files.
* use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end
thus, local mapping is important to increase the number of mapped reads.
* Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with
gaps if they happen to be chimeric.


The raw fastq files are aligned to mouse genome. Here we will map
them to *GRCm37* using
`BWA mem <http://bio-bwa.sourceforge.net/bwa.shtml>`__. See :ref:`example_usage` for an explanation of
the bwa mem parameter used.

Same options can be used with bowtie2 too.

.. code:: bash
mkdir mapped_files
for fq in $(ls ${inputdir} | grep '.fastq.gz')
do fqbase=$(basename ${fq} .fastq.gz)
/path/to/hisat/bin/hisat -x /path/to/hisat_index/prefix --sensitive-local --reorder -p 40 -U inputdir/${fq} -S mapped_files/${fqbase}.sam
bwa mem -A1 -B4 -E50 -L0 /path/to/bwa_index -U inputdir/${fq} | samtools view -Shb - > mapped_files/${fqbase}.sam
done
Build, visualize and correct Hi-C matrix
----------------------------------------

Create a Hi-C matrix using the aligned files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Build restriction-site bed files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before creating the HiC matrix, we need to know the restriction cut
site. The program *findRestSite* creates it for us. It needs the
restriction sequence as input. Here `DpnII
sequence <https://www.neb.com/products/r0543-dpnii>`__ is used.

.. code:: bash
findRestSite --fasta /path/to/GRCm37/genome_fasta/genome.fa --searchPattern GATC -o dpnII_positions_GRCm37.bed
## Sort the file
sort -k1,1 -k2,2n dpnII_positions_GRCm37.bed > dpnII_positions_GRCm37-sorted.bed
the restriction site bed file [-rs],
restriction sequence [-seq]
#-rs dpnII_positions_GRCm37-sorted.bed

Build Hi-C matrix
^^^^^^^^^^^^^^^^^

**hicBuildMatrix** builds the matrix of read counts over the bins in the
:ref:`hicBuildMatrix` builds the matrix of read counts over the bins in the
genome, considering the sites around the given restriction site. We need
to provide the input bams, the restriction site bed file [-rs],
restriction sequence [-seq] , binsize [-bs], name of output matrix file
to provide the input BAM/SAM files, binsize [-bs], restriction sequence [-seq],
name of output matrix file
[-o] and the name of output bam file (which contains the accepted
alignments) [-b] .
alignments) [-b].

.. code:: bash
mkdir hiCmatrix
for SRR in SRR1956527 SRR1956528 SRR1956529;
do hicBuildMatrix \
-s mapped_files/${SRR}_1.bam mapped_files/${SRR}_2.bam \
-bs 10000 \#-rs dpnII_positions_GRCm37-sorted.bed -seq GATC
-b ${SRR}_ref.bam -o hiCmatrix/${SRR}.matrix & done
-s mapped_files/${SRR}_1.bam mapped_files/${SRR}_2.bam \
-bs 10000 -seq GATC \
-b ${SRR}_ref.bam -o hiCmatrix/${SRR}.matrix --QCfolder hiCmatrix/${SRR}_QC & done
The output bam files show that we have around 34M, 54M and 58M selected
reads for SRR1956527, SRR1956528 & SRR1956529, respectively. Normally
25% of the total reads et selected.
25% of the total reads are selected.

The output matrices have counts for the genomic regions. The extention
of output matrix files is *.npz*.
The output matrices have counts for the genomic regions. The extension
of output matrix files is *.h5*.

Merge Matrices from Replicates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Merge (sum) matrices from replicates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To increase the depth of reads we merge the counts from these three
replicates.

.. code:: bash
hicSumMatrices -m hiCmatrix/SRR1956527.matrix.npz hiCmatrix/SRR1956528.matrix.npz hiCmatrix/SRR1956529.matrix.npz -o hiCmatrix/replicateMerged.matrix
hicSumMatrices -m hiCmatrix/SRR1956527.matrix.h5 hiCmatrix/SRR1956528.matrix.h5 \
hiCmatrix/SRR1956529.matrix.h5 -o hiCmatrix/replicateMerged.matrix.h5
Correct Hi-C Matrix
^^^^^^^^^^^^^^^^^^^

**hiCorrectMatrix** corrects the matrix counts in an iterative manner.
:ref:`hicCorrectMatrix` corrects the matrix counts in an iterative manner.
For correcting the matrix, it's important to remove the unassembled
scaffolds (eg NT\_) and keep only chromosomes, as scaffolds create
problems with marix correction. Therefore we use the chromosome names
(1-19, X, Y) here.
problems with matrix correction. Therefore we use the chromosome names
(1-19, X, Y) here. **Important** use 'chr1 chr2 chr3 etc.' if your genome index uses
chromosome names with the 'chr' prefix.

Matrix correction works in two steps: first a histogram containing the sum of contact per bin (row sum) is
produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads. The
second steps removes the low scoring bins and does the correction.

.. code:: bash
hicCorrectMatrix diagnostic_plot \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
-m hiCmatrix/replicateMerged.matrix.npz -o hiCmatrix/diagnostic_plot.png
The output of the program prints a threshold suggestion that is usually accurate but is better to
revise the histogram plot. See :ref:`example_usage` for an example and for more info.

Next we do the correction using the best filter threshold.

.. code:: bash
hicCorrectMatrix correct \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--filterThreshold -1.5 10 \
-m hiCmatrix/replicateMerged.matrix.npz -o hiCmatrix/replicateMerged.Corrected.npz
Plot Hi-C matrix
~~~~~~~~~~~~~~~~

since a big matrix takes too longs to plot, we merge the small bins into
larger one.
A 10kb bin matrix is quite large to plot and is better to reduce the resolution (to know the size
of a Hi-C matrix use the tool :ref:`hicInfo`). For this we use the tool :ref:`hicMergeMatrixBins`

Merge matrix bins for plotting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**hicMergeMatrixBins** merges the bins into larger bins of given number
(specified by -nb). We will merge the original (uncorrected) matrix and
then correct it.
:ref:`hicMergeMatrixBins` merges the bins into larger bins of given number
(specified by -nb). We will merge 100 bins in the original (uncorrected) matrix and
then correct it. The new bin size is going to be 10.000 bp * 100 = 1.000.000 bp

.. code:: bash
Expand All @@ -176,12 +170,17 @@ then correct it.
Correct the merged matrix
^^^^^^^^^^^^^^^^^^^^^^^^^

We will now correct the merged matrix befor plotting.
We will now correct the merged matrix before plotting.

.. code:: bash
hicCorrectMatrix diagnostic_plot \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
-m hiCmatrix/replicateMerged.matrix-100bins.npz -o hiCmatrix/diagnostic_plot_100bins.png
hicCorrectMatrix correct \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--filterThreshold 0.9 10 \
-m hiCmatrix/replicateMerged.matrix-100bins.npz -o hiCmatrix/replicateMerged.Corrected-100bins.npz
Plot the corrected Hi-C Matrix
Expand All @@ -197,7 +196,7 @@ resolution
hicPlotMatrix \
--log1p --dpi 300 \
-m hiCmatrix/replicateMerged.Corrected-100bins.npz \
--chromosomeOrder 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--clearMaskedBins \
-o plots/replicateMerged_Corrected-100bins_plot.png
.. figure:: ./plots/replicateMerged_Corrected-100bins_plot.png
Expand All @@ -209,7 +208,7 @@ Remove outliers from hic-Matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Outliers can be removed by a cutoff after looking at the diagnostic plot
for **hicCorrectMatrix** (using **diagnostic\_plot** option). Here we
for :ref:`hicCorrectMatrix` (using **diagnostic\_plot** option). Here we
are using a matrix with 20kb bins (produced by *hicMergeMatrixBins -nb
2*), since 20kb seems to be decent resolution.

Expand Down
2 changes: 1 addition & 1 deletion docs/content/tools/findRestSite.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _findRestSites:
.. _findRestSite:

findRestSites
=============
Expand Down
2 changes: 1 addition & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ The following is the list of tools available in HiCExplorer
=============================== ===========================================================================================
tool description
=============================== ===========================================================================================
:ref:`findRestSites` Identifies the genomic locations of restriction sites
:ref:`findRestSite` Identifies the genomic locations of restriction sites
:ref:`hicBuildMatrix` Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads
:ref:`hicQC` Plots QC measures from the output of hicBuildMatrix
:ref:`hicCorrectMatrix` Uses iterative correction to remove biases from a Hi-C matrix
Expand Down
3 changes: 1 addition & 2 deletions hicexplorer/trackPlot.py
Original file line number Diff line number Diff line change
Expand Up @@ -1174,8 +1174,7 @@ def __init__(self, *args, **kwarg):
self.colormap = None
# check if the color given is a color map
color_options = [m for m in matplotlib.cm.datad]
if self.properties['color'] in color_options and 'min_value' in self.properties and \
'max_value' in self.properties:
if self.properties['color'] in color_options and 'min_value' in self.properties and 'max_value' in self.properties:
norm = matplotlib.colors.Normalize(vmin=self.properties['min_value'],
vmax=self.properties['max_value'])
cmap = matplotlib.cm.get_cmap(self.properties['color'])
Expand Down

0 comments on commit ede12d5

Please sign in to comment.