MOtif inference with advanced BInding site selection.

Methods overview


As shown in this figure, we postulate that a TF can bind in several modes. In a typical ChIP-seq experiment targeting a given TF (shown here in red), an identified binding site may either contain the target motif (fig. A) or not (fig. B). The ideal binding sites for motif inference are those from scenario one.

Notice that among all these cases:

  • The binding site in scenario one is the least "crowded", i.e. it has the fewest potentially interacting proteins within the binding site window.
  • The target binding motif should be located at or close to the peak summit (the center in this figure).

Therefore, we select binding sites with a low "crowdness" score and trim them to a shorter length. Using such binding sites results in more accurate motif inference.
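The selection and trimming idea can be sketched in a few lines of Python. This is only an illustration, not the code used by MOBI: the `Peak` record, the `crowdness` field, and the `select_and_trim` helper are hypothetical names, and the actual score definition is given in the paper.

```python
from typing import List, NamedTuple, Tuple


class Peak(NamedTuple):
    chrom: str
    summit: int        # 0-based summit position (assumed layout)
    crowdness: float   # hypothetical per-site score; lower means less crowded


def select_and_trim(peaks: List[Peak], top_n: int, half_width: int) -> List[Tuple[str, int, int]]:
    """Keep the top_n least crowded peaks and trim each to +-half_width around its summit."""
    ranked = sorted(peaks, key=lambda p: p.crowdness)
    windows = []
    for p in ranked[:top_n]:
        start = max(0, p.summit - half_width)  # do not run off the chromosome start
        end = p.summit + half_width
        windows.append((p.chrom, start, end))
    return windows


peaks = [
    Peak("chr2L", 1000, 5.0),
    Peak("chr2L", 5000, 1.0),
    Peak("chrX", 50, 2.0),
]
windows = select_and_trim(peaks, top_n=2, half_width=100)
```

Here the most crowded site is dropped and the two remaining sites are reduced to short windows centered on their summits.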

For more details, see our paper:
Discovering a less-is-more effect to select transcription factor binding sites informative for motif inference
Jinrui Xu, Jiahao Gao, Mark Gerstein
bioRxiv 2020.11.29.402941; doi:


We applied MOBI to each of the 4 ENCODE samples with its best parameters and inferred the motifs. You can download all of our predicted motifs here.

Drosophila melanogaster 454 Download Download Download Download 52 TFs
Caenorhabditis elegans 283 Download Download Download Download 33 TFs
GM12878 136 Download Download Download Download 56 TFs
K562 336 Download Download Download Download 89 TFs

(Note that the files are in MEME format. Each file is named after the TF, with any "(", ")" and ":" characters in the TF name omitted, e.g. TF name A(B)C will have motif file named
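The naming rule in the note above can be reproduced with a small helper when looking up a motif file from a TF name. This is a sketch: `motif_file_stem` is a hypothetical name, and it returns only the cleaned TF name, since the file extension is not stated here.

```python
def motif_file_stem(tf_name: str) -> str:
    """Drop "(", ")" and ":" from a TF name, as described for the motif file names."""
    return tf_name.translate(str.maketrans("", "", "():"))


stem = motif_file_stem("A(B)C")
```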

Example output motifs

See here for a few examples of the inferred motifs.

Download C-score files

We provide the C-scores used in the paper in bigWig format for download.

Script requirements

  • python3
  • pybedtools
  • numpy
  • pandas

To run the inference step, you need DREME/MEME/STREME/HOMER properly installed. See here for meme-suite installation. You can test the installation with meme -version or an equivalent command.
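A quick way to check which tools are visible on your PATH before starting is sketched below. The executable names in the usage line are assumptions about a typical meme-suite and samtools installation; adjust them to the tools you plan to use. `missing_tools` is a hypothetical helper, not part of MOBI.

```python
import shutil
from typing import Iterable, List


def missing_tools(commands: Iterable[str]) -> List[str]:
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Assumed executable names; change these to match your installation.
not_found = missing_tools(["meme", "dreme", "streme", "samtools"])
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
```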

Run the example

  1. Download the repository from GitHub and unzip it into MOBI/, then go to the folder: cd MOBI/.
  2. Download the example ChIP-seq data: bash
  3. Download the genome into example_data/, then uncompress and index the fasta (requires samtools):
cd example_data/  
gunzip dm6.fa.gz  
samtools faidx dm6.fa  
cd ..
  4. Run the script with python This will generate a file called All intermediate files are in example_result/.
  5. Make sure you have the inference tool (e.g. DREME) installed. Run the script with bash All predicted motifs are now under example_result/inference/.

Infer motifs for your own data

Modify lines 7-14 in file accordingly (these will be replaced by arguments in a future update). Then run steps 4 and 5 of the previous section.

  • data_chip(str): Path to the folder containing all the ENCODE ChIP-seq files. All of these files are used to calculate the C-score (see below).
  • data_meta(str): Path to a tab-separated file. The first column is the basename of the file and the second column is the TF name. The first column should be a subset of the basenames of the files in data_chip.
  • genome_fasta(str): Path to the genome fasta file. The index file should be generated beforehand and located in the same folder.
  • result_main(str): Path to the main result folder.
  • width_list(list of int): A list of binding region widths to try. 100 indicates a binding region of ±100 bp around the ChIP-seq peak summit (200 bp in total).
  • rank_list(list of str): Any of RankSPP, RankCrowdness or RankLinear0.1, where 0.1 can be changed to another float. For details of the ranking methods, see the paper.
  • tool_list(list of str): Any of DREME, MEME, STREME and HOMER.
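Before launching a long run, it can help to sanity-check these settings. The sketch below is not part of MOBI; `parse_rank`, `region_length`, and `unknown_basenames` are hypothetical helpers that only encode the rules stated in the list above (allowed rank names, the ±width convention, and the requirement that data_meta basenames exist in data_chip).

```python
import re
from typing import Iterable, List, Optional, Tuple

# RankLinear followed by a float, e.g. RankLinear0.1
_RANK_LINEAR = re.compile(r"RankLinear(\d*\.?\d+)")


def parse_rank(name: str) -> Tuple[str, Optional[float]]:
    """Split a rank_list entry into (method, weight); weight is None unless RankLinear."""
    if name in ("RankSPP", "RankCrowdness"):
        return name, None
    match = _RANK_LINEAR.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized ranking method: {name!r}")
    return "RankLinear", float(match.group(1))


def region_length(width: int) -> int:
    """Total binding region length for a width_list entry (+-width around the summit)."""
    return 2 * width


def unknown_basenames(meta_rows: Iterable[Tuple[str, str]], chip_basenames: Iterable[str]) -> List[str]:
    """Basenames listed in data_meta (first column) that are absent from data_chip."""
    known = set(chip_basenames)
    return [basename for basename, _tf in meta_rows if basename not in known]


parsed = [parse_rank(r) for r in ["RankSPP", "RankCrowdness", "RankLinear0.1"]]
```

An empty result from `unknown_basenames` indicates that every row of data_meta points at a file that is present in data_chip.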

Finding optimal parameters for your own data

To find the best ranking method (weights) and binding site width, we simply run a brute-force search for the parameters that give the best result:

  • Run the above section with width_list and rank_list covering all the parameters you want to try.
  • Prepare a list of "known" motifs in MEME format. You can download these from Cis-BP or another database.
  • Modify lines 5-12 in, run the script with python This will generate a file called Make sure you have meme-suite installed by verifying tomtom --help. Run the script with bash This compares the inferred motifs to the known motifs.
  • Modify lines 5-12 in, run the script with python The best parameters will be shown in example_result/stats/DREME_idx.txt if you are using DREME. You can find the results for these optimal parameters by the file names in example_result/inference/DREME/.
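The brute-force search described above amounts to scoring every (width, rank) combination and keeping the best one. A minimal sketch follows, assuming a `score` callable that returns a higher value for better agreement with the known motifs; both `best_parameters` and the toy score table are illustrative and not taken from the MOBI scripts.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def best_parameters(
    width_list: Sequence[int],
    rank_list: Sequence[str],
    score: Callable[[int, str], float],
) -> Tuple[int, str]:
    """Try every (width, rank) pair and return the one with the highest score."""
    return max(product(width_list, rank_list), key=lambda pair: score(*pair))


# Toy scores standing in for the comparison against known motifs.
toy_scores = {
    (50, "RankSPP"): 0.2,
    (50, "RankCrowdness"): 0.5,
    (100, "RankSPP"): 0.4,
    (100, "RankCrowdness"): 0.3,
}
best = best_parameters([50, 100], ["RankSPP", "RankCrowdness"], lambda w, r: toy_scores[(w, r)])
```

The cost grows with the product of the list lengths, so keeping width_list and rank_list short keeps the search manageable.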


For any questions, please contact Jiahao Gao(
Gerstein Lab 2022

