merge

deeptools · Mar 31, 2017 · ede12d5 · ede12d5
2 parents 568fc85 + 7b65de3
commit ede12d5
Show file tree

Hide file tree

Showing 6 changed files with 91 additions and 86 deletions.
diff --git a/docs/content/example_usage.rst b/docs/content/example_usage.rst
@@ -1,3 +1,5 @@
+.. _example_usage:
+
 Example usage
 =============
 
@@ -52,10 +54,10 @@ We have used the HiCExplorer sucessfuly with `bwa`, `bowtie2` and `hisat2`. Howe
    #       -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5] # this is set to no penalty.
 
    $ bwa mem -A1 -B4  -E50 -L0  index_path \
-       -U mate_R1.fastq.gz ) 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam
+       -U mate_R1.fastq.gz 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam
 
    $ bwa mem -A1 -B4  -E50 -L0  index_path \
-       -U mate_R2.fastq.gz ) 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
+       -U mate_R2.fastq.gz 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
 
 
 Creation of a Hi-C matrix
@@ -64,10 +66,12 @@ Creation of a Hi-C matrix
 Once the reads have been mapped the Hi-C matrix can be built. For this, the minimal
 extra information required is the `binSize` used for the matrix. Is it best
 to enter a low number like 10.000 because lower
-resolution matrices (larget bins) can be easily constructed using `hicMergeMatrixBins`. Matrices
+resolution matrices (larger bins) can be easily constructed using :ref:`hicMergeMatrixBins`. Matrices
 at restriction fragment resolution can be created by providing a file
-containing the restriction sites, this file can be created with the tool
-`findRestSite` that is part of HiCExplorer.
+containing the restriction sites, this file can be created with the tool :ref:`findRestSite`
+
+:ref:`findRestSite`
+that is part of HiCExplorer.
 
 .. code-block:: bash
 
@@ -78,26 +82,29 @@ containing the restriction sites, this file can be created with the tool
                     --binSize 10000 \
                     --restrictionSequence GATC \
                     --outBam hic.bam \
-                    -o hic_matrix.npz > build_matrix.log
+                    -o hic_matrix.npz
+                    --QCfolder ./hicQC
 
 
 `hicBuildMatrix` creates two files, a bam file containing only the valid Hi-C read pairs and a matrix containing the
 Hi-C contacts at the given resolution. The bam file is useful to check the quality of the
 Hi-C library on the genome browser. A good Hi-C library should contain piles of reads near
-the restriction fragment sites. The log output contains the number of valid pairs, duplicated pairs and
-other useful information. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly
+the restriction fragment sites. In the `QCfolder` a html file is saved with plots containing
+useful information for the quality control of the Hi-C sample like the number of valid pairs, duplicated pairs,
+self-ligations etc. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly
 because of the reads that are on repetitive regions that need to be discarded.
 
 An important quality control measurement to check is the `inter chromosomal` fraction of reads as this is an indirect
- measure of random Hi-C contacts. Good Hi-C libraries have lower than 10%  inter chromosomal contacts.
+measure of random Hi-C contacts. Good Hi-C libraries have lower than 10%  inter chromosomal contacts. The `hicQC`
+module can be used to compare the QC measures from different samples.
 
 Correction of Hi-C matrix
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The Hi-C matrix has to be corrected to remove GC, open chromatin biases and, most importantly,
 to normalize the number of restriction sites per bin. Because a fraction of bins from repetitive regions
 contain few contacts it is necessary to filter those regions first. Also, in mammalian genomes some regions
-enriched by reads should be discarded. To aid in the filtering of regions `hicCorrectMatrix` generates a
+enriched by reads should be discarded. To aid in the filtering of regions :ref:`hicCorrectMatrix` generates a
 diagnostic plot as follows:
 
 .. code-block:: bash
@@ -131,8 +138,8 @@ Once the thresholds have been decided, the matrix can be corrected
 Visualization of results
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
-There are two ways to see the resulting matrix, one using `hicPlotMatrix` and the
-other is using `hicPlotTADs`. The first one allows the visualization over large regions
+There are two ways to see the resulting matrix, one using :ref:`hicPlotMatrix` and the
+other is using :ref:`hicPlotTADs`. The first one allows the visualization over large regions
 while the second one is preferred to see specific parts together with other information,
 for example genes or bigwig tracks.
 

diff --git a/docs/content/list-of-tools.rst b/docs/content/list-of-tools.rst
@@ -5,7 +5,7 @@ HiCExplorer tools
    +--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+
    | tool                           | type             | input files                       | main output file(s)                         | application                                                                       |
    +================================+==================+===================================+=============================================+===================================================================================+
-   |:ref:`findRestSites`            | preprocessing    | 1 genome FASTA file               | bed file with restriction site coordinates  | Identifies the genomic locations of restriction sites                             |
+   |:ref:`findRestSite`             | preprocessing    | 1 genome FASTA file               | bed file with restriction site coordinates  | Identifies the genomic locations of restriction sites                             |
    +--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+
    |:ref:`hicBuildMatrix`           | preprocessing    | 2 BAM/SAM files                   | hicMatrix object                            | Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads    |
    +--------------------------------+------------------+-----------------------------------+---------------------------------------------+-----------------------------------------------------------------------------------+

diff --git a/docs/content/mES-HiC_analysis.rst b/docs/content/mES-HiC_analysis.rst
@@ -17,24 +17,6 @@ Lieberman-Aiden et al. and sequenced on the NextSeq 500 platform using 2
 Prepare for analysis
 --------------------
 
-Install HiC explorer
-~~~~~~~~~~~~~~~~~~~~
-
-The following commands install HiCExplorar globally. In case you don't
-have admin privileges, you can first install and setup a `virtual
-environment <https://virtualenv.pypa.io/en/latest/>`__ for python, and
-then install HiCExplorer inside the virtual environment.
-
-.. code:: bash
-
-    # To Clone and install in the ~/programs directory
-    cd ~/programs
-    git clone https://github.com/maxplanck-ie/HiCExplorer.git
-    cd HiCExplorer.git
-    python setup.py install
-    ## now add HiCExplorer in $PATH
-    export PATH=:$PATH
-
 Download Raw fastq files
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -57,115 +39,127 @@ Fastq files can be downloaded from the EBI archive (or NCBI archive).
 Map Raw fastq files
 ~~~~~~~~~~~~~~~~~~~
 
-The raw fastq files are now aligned to mouse genome. Here we will map
-them to *GRCm37* using
-`HISAT <https://ccb.jhu.edu/software/hisat/index.shtml>`__, a fast and
-memory efficient read aligner.
+Mates have to be mapped individually to avoid mapper specific heuristics designed
+for standard paired-end libraries.
 
-The option : **--sensitive-local** is used, since we want local
-alignment. In HiC, many reads in the read-pair overlap with their mates.
-Using *local* alignment avoids the alignment of the end of the reads,
-where the sequence might correspond to the read mate.
+We have used the HiCExplorer sucessfuly with `bwa`, `bowtie2` and `hisat2`. However, it is important to:
 
-**--reorder** is used to get the reads in the sam file in the order they
-are aligned in. HiCExplorer reads the alignment in this order.
+ * for either `bowtie2` or `hisat2` use the `--reorder` parameter which tells bowtie2 or hisat2 to output
+   the *sam* files in the **exact** same order as in the *.fastq* files.
+ * use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end
+   thus, local mapping is important to increase the number of mapped reads.
+ * Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with
+   gaps if they happen to be chimeric.
+
+
+The raw fastq files are aligned to mouse genome. Here we will map
+them to *GRCm37* using
+`BWA mem <http://bio-bwa.sourceforge.net/bwa.shtml>`__. See :ref:`example_usage` for an explanation of
+the bwa mem parameter used.
 
-Same options can be used with bowtie2 too.
 
 .. code:: bash
 
     mkdir mapped_files
 
     for fq in $(ls ${inputdir} | grep '.fastq.gz')
     do fqbase=$(basename ${fq} .fastq.gz)
-    /path/to/hisat/bin/hisat -x /path/to/hisat_index/prefix --sensitive-local --reorder -p 40 -U inputdir/${fq} -S mapped_files/${fqbase}.sam
+     bwa mem -A1 -B4 -E50 -L0  /path/to/bwa_index -U inputdir/${fq} | samtools view -Shb - > mapped_files/${fqbase}.sam
     done
 
 Build, visualize and correct Hi-C matrix
 ----------------------------------------
 
 Create a Hi-C matrix using the aligned files
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Build restriction-site bed files
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Before creating the HiC matrix, we need to know the restriction cut
-site. The program *findRestSite* creates it for us. It needs the
-restriction sequence as input. Here `DpnII
-sequence <https://www.neb.com/products/r0543-dpnii>`__ is used.
-
-.. code:: bash
-
-    findRestSite --fasta /path/to/GRCm37/genome_fasta/genome.fa --searchPattern GATC -o dpnII_positions_GRCm37.bed
-
-    ## Sort the file
-    sort -k1,1 -k2,2n dpnII_positions_GRCm37.bed > dpnII_positions_GRCm37-sorted.bed
+the restriction site bed file [-rs],
+restriction sequence [-seq]
+#-rs dpnII_positions_GRCm37-sorted.bed
 
 Build Hi-C matrix
 ^^^^^^^^^^^^^^^^^
 
-**hicBuildMatrix** builds the matrix of read counts over the bins in the
+:ref:`hicBuildMatrix` builds the matrix of read counts over the bins in the
 genome, considering the sites around the given restriction site. We need
-to provide the input bams, the restriction site bed file [-rs],
-restriction sequence [-seq] , binsize [-bs], name of output matrix file
+to provide the input BAM/SAM files,  binsize [-bs], restriction sequence [-seq],
+name of output matrix file
 [-o] and the name of output bam file (which contains the accepted
-alignments) [-b] .
+alignments) [-b].
 
 .. code:: bash
 
     mkdir hiCmatrix
 
     for SRR in SRR1956527 SRR1956528 SRR1956529;
     do hicBuildMatrix \
-    -s mapped_files/${SRR}_1.bam mapped_files/${SRR}_2.bam \
-    -bs 10000 \#-rs dpnII_positions_GRCm37-sorted.bed -seq GATC
-    -b ${SRR}_ref.bam -o hiCmatrix/${SRR}.matrix & done
+       -s mapped_files/${SRR}_1.bam mapped_files/${SRR}_2.bam \
+       -bs 10000 -seq GATC \
+       -b ${SRR}_ref.bam -o hiCmatrix/${SRR}.matrix --QCfolder hiCmatrix/${SRR}_QC  & done
 
 The output bam files show that we have around 34M, 54M and 58M selected
 reads for SRR1956527, SRR1956528 & SRR1956529, respectively. Normally
-25% of the total reads et selected.
+25% of the total reads are selected.
 
-The output matrices have counts for the genomic regions. The extention
-of output matrix files is *.npz*.
+The output matrices have counts for the genomic regions. The extension
+of output matrix files is *.h5*.
 
-Merge Matrices from Replicates
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Merge (sum) matrices from replicates
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 To increase the depth of reads we merge the counts from these three
 replicates.
 
 .. code:: bash
 
-    hicSumMatrices -m hiCmatrix/SRR1956527.matrix.npz hiCmatrix/SRR1956528.matrix.npz hiCmatrix/SRR1956529.matrix.npz -o hiCmatrix/replicateMerged.matrix
+    hicSumMatrices -m hiCmatrix/SRR1956527.matrix.h5 hiCmatrix/SRR1956528.matrix.h5 \
+                      hiCmatrix/SRR1956529.matrix.h5 -o hiCmatrix/replicateMerged.matrix.h5
 
 Correct Hi-C Matrix
 ^^^^^^^^^^^^^^^^^^^
 
-**hiCorrectMatrix** corrects the matrix counts in an iterative manner.
+:ref:`hicCorrectMatrix` corrects the matrix counts in an iterative manner.
 For correcting the matrix, it's important to remove the unassembled
 scaffolds (eg NT\_) and keep only chromosomes, as scaffolds create
-problems with marix correction. Therefore we use the chromosome names
-(1-19, X, Y) here.
+problems with matrix correction. Therefore we use the chromosome names
+(1-19, X, Y) here. **Important** use 'chr1 chr2 chr3 etc.' if your genome index uses
+chromosome names with the 'chr' prefix.
+
+Matrix correction works in two steps: first a histogram containing the sum of contact per bin (row sum) is
+produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads. The
+second steps removes the low scoring bins and does the correction.
+
+.. code:: bash
+
+    hicCorrectMatrix diagnostic_plot \
+    --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
+    -m hiCmatrix/replicateMerged.matrix.npz -o hiCmatrix/diagnostic_plot.png
+
+The output of the program prints a threshold suggestion that is usually accurate but is better to
+revise the histogram plot. See :ref:`example_usage` for an example and for more info.
+
+Next we do the correction using the best filter threshold.
 
 .. code:: bash
 
     hicCorrectMatrix correct \
     --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
+    --filterThreshold -1.5 10 \
     -m hiCmatrix/replicateMerged.matrix.npz -o hiCmatrix/replicateMerged.Corrected.npz
 
+
 Plot Hi-C matrix
 ~~~~~~~~~~~~~~~~
 
-since a big matrix takes too longs to plot, we merge the small bins into
-larger one.
+A 10kb bin matrix is quite large to plot and is better to reduce the resolution (to know the size
+of a Hi-C matrix use the tool :ref:`hicInfo`). For this we use the tool :ref:`hicMergeMatrixBins`
 
 Merge matrix bins for plotting
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-**hicMergeMatrixBins** merges the bins into larger bins of given number
-(specified by -nb). We will merge the original (uncorrected) matrix and
-then correct it.
+:ref:`hicMergeMatrixBins` merges the bins into larger bins of given number
+(specified by -nb). We will merge 100 bins in the original (uncorrected) matrix and
+then correct it. The new bin size is going to be 10.000 bp * 100 = 1.000.000 bp
 
 .. code:: bash
 
@@ -176,12 +170,17 @@ then correct it.
 Correct the merged matrix
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
-We will now correct the merged matrix befor plotting.
+We will now correct the merged matrix before plotting.
 
 .. code:: bash
 
+    hicCorrectMatrix diagnostic_plot \
+    --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
+    -m hiCmatrix/replicateMerged.matrix-100bins.npz -o hiCmatrix/diagnostic_plot_100bins.png
+
     hicCorrectMatrix correct \
     --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
+    --filterThreshold 0.9 10 \
     -m hiCmatrix/replicateMerged.matrix-100bins.npz -o hiCmatrix/replicateMerged.Corrected-100bins.npz
 
 Plot the corrected Hi-C Matrix
@@ -197,7 +196,7 @@ resolution
     hicPlotMatrix \
     --log1p --dpi 300 \
     -m hiCmatrix/replicateMerged.Corrected-100bins.npz \
-    --chromosomeOrder 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
+    --clearMaskedBins \
     -o plots/replicateMerged_Corrected-100bins_plot.png
 
 .. figure:: ./plots/replicateMerged_Corrected-100bins_plot.png
@@ -209,7 +208,7 @@ Remove outliers from hic-Matrix
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Outliers can be removed by a cutoff after looking at the diagnostic plot
-for **hicCorrectMatrix** (using **diagnostic\_plot** option). Here we
+for :ref:`hicCorrectMatrix` (using **diagnostic\_plot** option). Here we
 are using a matrix with 20kb bins (produced by *hicMergeMatrixBins -nb
 2*), since 20kb seems to be decent resolution.
 

diff --git a/docs/content/tools/findRestSite.rst b/docs/content/tools/findRestSite.rst
@@ -1,4 +1,4 @@
-.. _findRestSites:
+.. _findRestSite:
 
 findRestSites
 =============

diff --git a/docs/index.rst b/docs/index.rst
@@ -16,7 +16,7 @@ The following is the list of tools available in HiCExplorer
 =============================== ===========================================================================================
 tool                            description
 =============================== ===========================================================================================
-:ref:`findRestSites`             Identifies the genomic locations of restriction sites
+:ref:`findRestSite`              Identifies the genomic locations of restriction sites
 :ref:`hicBuildMatrix`            Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads
 :ref:`hicQC`                     Plots QC measures from the output of hicBuildMatrix
 :ref:`hicCorrectMatrix`          Uses iterative correction to remove biases from a Hi-C matrix

diff --git a/hicexplorer/trackPlot.py b/hicexplorer/trackPlot.py
@@ -1174,8 +1174,7 @@ def __init__(self, *args, **kwarg):
         self.colormap = None
         # check if the color given is a color map
         color_options = [m for m in matplotlib.cm.datad]
-        if self.properties['color'] in color_options and 'min_value' in self.properties and \
-                        'max_value' in self.properties:
+        if self.properties['color'] in color_options and 'min_value' in self.properties and 'max_value' in self.properties:
             norm = matplotlib.colors.Normalize(vmin=self.properties['min_value'],
                                                vmax=self.properties['max_value'])
             cmap = matplotlib.cm.get_cmap(self.properties['color'])