dphansti edited this page Apr 15, 2016 · 20 revisions

What does Mango do?

Mango processes ChIA-PET sequencing data (in fastq format) and determines statistically significant DNA-DNA interactions (loops) between pairs of genomic loci bound by the target protein. Mango performs the following five steps, which can be run all at once or one at a time:

  • Linker parsing: finds and removes linker sequences.
  • Read alignment: aligns reads to the genome using bowtie.
  • PET filtering: removes reads potentially due to PCR duplication and organizes reads into bedpe format.
  • Peak calling: calls DNA binding peaks using MACS2.
  • Interaction calling: expected interaction frequencies are modeled based on both peak depths and distances between peaks, and statistically significant deviations are reported as interactions.

Linker parsing

During ChIA-PET library preparation, DNA linker sequences are ligated to the ends of genomic DNA fragments. These are then ligated together under dilute conditions to give rise to DNA fragments that contain genomic DNA followed by two linker sequences and another region of genomic DNA. These fragments are sequenced via paired-end tag sequencing.

The first step in the Mango pipeline involves finding the location of linker sequences. Using default settings (keepempty = FALSE), PETs are only retained if a linker sequence can be found in both reads. The user can specify two linker sequences to look for. Only PETs that contain the same linker sequence at both ends are retained. If only one linker sequence was used during library construction, enter the same sequence into linkerA and linkerB. After identifying the location of linker sequences, Mango trims the reads to remove the linker sequence (and any sequence after it). PETs are further filtered to retain only those whose linkers were found within a certain range (by default 15-25 bp into the read). These limits can be adjusted using 'minlength' and 'maxlength'. All retained PETs are written to files named ..._1.same.fastq and ..._2.same.fastq.
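The retention rules above can be illustrated with a minimal Python sketch. This is not Mango's own implementation: the function names are hypothetical, linkers are matched exactly, the length limits are treated as inclusive and applied to the trimmed sequence of each read, and a PET with a linker in only one read is kept when keepempty is enabled. All of these are assumptions made for illustration, and the linker sequences are placeholders.

```python
from typing import Optional, Tuple

# Placeholder linker sequences for illustration only (not Mango defaults).
LINKER_A = "GGGGGTTTTT"
LINKER_B = "CCCCCAAAAA"


def trim_at_linker(read: str, linkers: Tuple[str, str]) -> Tuple[str, Optional[str]]:
    """Trim a read at the earliest exact linker match.

    Returns (sequence_before_linker, linker_found). When neither linker
    occurs, the read is returned unchanged and linker_found is None.
    """
    best_pos, best_linker = None, None
    for linker in linkers:
        pos = read.find(linker)
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_pos, best_linker = pos, linker
    if best_linker is None:
        return read, None
    return read[:best_pos], best_linker


def parse_pet(read1, read2, linker_a, linker_b,
              minlength=15, maxlength=25, keepempty=False):
    """Return the trimmed (read1, read2) pair if the PET is retained, else None."""
    linkers = (linker_a, linker_b)
    seq1, found1 = trim_at_linker(read1, linkers)
    seq2, found2 = trim_at_linker(read2, linkers)

    # Default behaviour: a linker must be found in both reads.
    if (found1 is None or found2 is None) and not keepempty:
        return None
    # Both ends must carry the same linker sequence.
    if found1 is not None and found2 is not None and found1 != found2:
        return None
    # Assumption: the length window applies to each trimmed sequence.
    for seq in (seq1, seq2):
        if not minlength <= len(seq) <= maxlength:
            return None
    return seq1, seq2


genomic = "ACGT" * 5  # 20 bp of made-up genomic sequence
print(parse_pet(genomic + LINKER_A + "AAA", genomic + LINKER_A, LINKER_A, LINKER_B))
```

Under these assumptions, raising maxlength above the read length is what allows untrimmed reads to pass the length window when keepempty is TRUE.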

Some ChIA-PET protocols now use a transposase-based method to generate sequencing libraries ('tagmentation'). Libraries prepared in this fashion require several important changes to the Mango default parameters.

  • --keepempty TRUE. Because many of the resulting reads are now quite long, it is common for the linker sequence not to be present in the read. Therefore 'keepempty' should be set to TRUE.
  • --maxlength 1000. Tagmentation-generated libraries have variable-length stretches of genomic DNA (as opposed to the ~20 bp stretches generated using the original MmeI-based protocol). Therefore 'maxlength' should be set to a value longer than the sequenced read length.
  • --shortreads FALSE. The longer reads resulting from the tagmentation-generated libraries require different alignment parameters. Mango determines which parameters to use based on the 'shortreads' parameter. Therefore 'shortreads' should be set to FALSE.
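As a small illustration, the three overrides listed above can be collected into an argument list and checked against the read length. The helper name `tagmentation_args` is hypothetical and not part of Mango; only the three flags and their values come from the list above, and the rest of the Mango command line is deliberately left out.

```python
def tagmentation_args(read_length: int, maxlength: int = 1000) -> list:
    """Build the Mango overrides suggested for tagmentation libraries.

    Hypothetical helper: it only assembles the three flags described above
    and checks that maxlength exceeds the sequenced read length.
    """
    if maxlength <= read_length:
        raise ValueError(
            f"maxlength ({maxlength}) should be longer than the read length ({read_length})"
        )
    return [
        "--keepempty", "TRUE",
        "--maxlength", str(maxlength),
        "--shortreads", "FALSE",
    ]


# Example: 150 bp reads with the suggested maxlength of 1000.
print(" ".join(tagmentation_args(150)))
```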

Read alignment

In the second step reads are aligned to the genome (separately for each end of the PET) using bowtie. If 'shortreads' is TRUE the following bowtie settings are used:

-S -v 0 -k 1 --chunkmbs 500 --sam-nohead --mapq 40 -m 1

If 'shortreads' is FALSE the following bowtie settings are used:

-S -n 2 -l 50 -k 1 --chunkmbs 500 --sam-nohead --mapq 40 -m 1 --best

The value of the 'shortreads' parameter should be TRUE if the original ChIA-PET protocol was used and FALSE if tagmentation was used to generate the libraries.
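The choice between the two option sets can be sketched as follows. The option strings are copied from the settings above; the function names, index name, and file names are hypothetical, and the command layout assumes the usual bowtie 1 ordering of options, index, reads file, and output file. Mango builds and runs its own command internally, so this is only an illustration.

```python
# Option sets copied from the settings listed above.
SHORT_READ_ARGS = "-S -v 0 -k 1 --chunkmbs 500 --sam-nohead --mapq 40 -m 1".split()
LONG_READ_ARGS = "-S -n 2 -l 50 -k 1 --chunkmbs 500 --sam-nohead --mapq 40 -m 1 --best".split()


def bowtie_args(shortreads: bool) -> list:
    """Return the bowtie options that correspond to the 'shortreads' setting."""
    return list(SHORT_READ_ARGS if shortreads else LONG_READ_ARGS)


def bowtie_command(index: str, fastq: str, sam: str, shortreads: bool) -> list:
    """Assemble a bowtie command for one end of the PETs (not executed here)."""
    return ["bowtie", *bowtie_args(shortreads), index, fastq, sam]


# Hypothetical index and file names; each end of the PET is aligned separately.
for end in ("1", "2"):
    cmd = bowtie_command("hg19", f"sample_{end}.same.fastq", f"sample_{end}.same.sam",
                         shortreads=True)
    print(" ".join(cmd))
```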

PET filtering

The SAM files generated in step 2 are organized into BEDPE format and filtered to remove reads that could be due to PCR duplication. As of version 1.1.6, this step also generates a plot of the distribution of PET distances (the distance between the reads on either end of a PET). In our experience these plots are the single best predictor of ChIA-PET library quality. PET distances of less than 10 kb are often the result of self-ligation. PET distances of greater than 1000 kb are often the result of non-specific interactions (either in the nucleus or during the proximity ligation step). PET distances ranging from roughly 10 kb to 1000 kb are the reads that are useful for identifying looping interactions. An example of a plot generated in step 3 of Mango is shown below. That library would be considered of average quality: the 10-100 kb peak is a decent size. In an exceptionally good data set the height of the 10-100 kb peak would exceed that of the >1000 kb peak.

PET distance distribution plot
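The distance ranges described above can also be tallied numerically. The sketch below is not part of Mango: the function and category names are hypothetical, the thresholds are the approximate 10 kb and 1000 kb cutoffs from the text, and the input is assumed to be a list of distances in bp for intra-chromosomal PETs only.

```python
from collections import Counter

SELF_LIGATION_MAX = 10_000      # below 10 kb: often self-ligation
NONSPECIFIC_MIN = 1_000_000     # above 1000 kb: often non-specific

CATEGORIES = ("self_ligation", "informative", "nonspecific")


def classify_distance(distance_bp: int) -> str:
    """Assign a PET distance (in bp) to one of the three ranges."""
    if distance_bp < SELF_LIGATION_MAX:
        return "self_ligation"
    if distance_bp > NONSPECIFIC_MIN:
        return "nonspecific"
    return "informative"


def summarize(distances) -> dict:
    """Return the fraction of PETs in each range (all zeros for empty input)."""
    counts = Counter(classify_distance(d) for d in distances)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in CATEGORIES}
    return {name: counts.get(name, 0) / total for name in CATEGORIES}


# Made-up distances for illustration.
print(summarize([500, 50_000, 60_000, 5_000_000]))
```

A higher informative fraction relative to the nonspecific fraction points in the same direction as a tall 10-100 kb peak in the plot, although the plot remains the more detailed diagnostic.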

Peak calling
