MOtif inference with advanced BInding site selection.
As shown in this figure, we postulate that there are several TF binding modes. In the results of a typical ChIP-seq experiment targeting a certain TF (here shown in red), the identified binding sites may either contain the target motif (fig. A) or not (fig. B). The ideal binding sites for motif inference are those from scenario one.
Notice that among all these cases:
- The binding site in scenario one is the least "crowded", i.e. it has the fewest possibly interacting proteins in the binding site window.
- The target binding motif should lie at or close to the peak summit (in this figure, the center).
Therefore, we select binding sites with a low "crowdedness" score and trim them to a shorter length. Using such binding sites results in more accurate motif inference.
For more details, see our paper:
Discovering a less-is-more effect to select transcription factor binding sites informative for motif inference
Jinrui Xu, Jiahao Gao, Mark Gerstein
bioRxiv 2020.11.29.402941; doi: https://doi.org/10.1101/2020.11.29.402941
We applied MOBI to each of the 4 samples from ENCODE with its best parameters and inferred the motifs. You can download all our predicted motifs here.
|Drosophila melanogaster||454||Download||Download||Download||Download||52 TFs|
|Caenorhabditis elegans||283||Download||Download||Download||Download||33 TFs|
(Note that the files are in MEME format. The file name is the TF name with any "(", ")" and ":" characters omitted, e.g. the TF name A(B)C has a motif file named ABC.meme)
Example output motifs
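To look up a motif file from a TF name, the naming rule above can be sketched as follows. This is only an illustration of the stated rule; the helper name `motif_filename` is hypothetical and is not part of MOBI.

```python
def motif_filename(tf_name: str) -> str:
    """Drop '(', ')' and ':' from a TF name and append the .meme extension."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    cleaned = tf_name.translate(str.maketrans("", "", "():"))
    return f"{cleaned}.meme"


print(motif_filename("A(B)C"))  # ABC.meme
```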
See here for a few examples of the inferred motifs.
Download C-score files
Requirement of scripts
To run the inference step, you need DREME/MEME/STREME/HOMER properly installed. See here for meme-suite installation. You can test the installation with
meme -version or an equivalent command.
Run the example
- Download the GitHub repository and unzip it into
MOBI/, then go to the folder
- Download the example ChIP-seq data:
- Download the genome https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz into
example_data/, then uncompress and index the FASTA file (requires samtools):
cd example_data/
gunzip dm6.fa.gz
samtools faidx dm6.fa
cd ..
- Run the script with
python MOBI.py. This will generate a file called
joblist_inference.sh. All intermediate files are in
- Make sure you have the inference tool (e.g. DREME) installed, then run
bash joblist_inference.sh. All predicted motifs are now under
Infer motifs for your own data
Modify lines 7-14 in MOBI.py accordingly (these settings will become command-line arguments in a future update). Then run steps 4 and 5 of the previous section.
data_chip(str): Path to the folder containing all the ENCODE ChIP-seq files. All these files are used to calculate the C-score (see below)
data_meta(str): Path to a tab-separated file. The first column is the basename of the file and the second column is the TF name. Note that the first column should be a subset of the basenames of the files in
genome_fasta(str): Path to the genome fasta file. The index file should be generated beforehand and located in the same folder
result_main(str): Path to the main result folder.
width_list(list of int): A list of binding region widths to try. A value of 100 indicates a binding region of ±100 bp around the ChIP-seq peak summit (200 bp in total)
rank_list(list of str): Choice of
RankLinear0.1 where 0.1 can be changed to another float. For details of the ranking method, see the paper
tool_list(list of str): Choice of
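As an illustration of how a width_list value maps to a trimmed binding region, here is a minimal sketch. It assumes 0-based, half-open (BED-style) coordinates and clamps the window to the chromosome ends; the function `summit_window` and the example coordinates are hypothetical and are not taken from MOBI.py.

```python
def summit_window(chrom, summit, width, chrom_length=None):
    """Return (chrom, start, end) covering +-width bp around a peak summit.

    Assumption: 0-based, half-open coordinates, as in BED files.
    """
    start = max(0, summit - width)       # do not run past the chromosome start
    end = summit + width
    if chrom_length is not None:
        end = min(end, chrom_length)     # do not run past the chromosome end
    return chrom, start, end


# Hypothetical summit position; each width gives a region of 2 * width bp.
for width in [50, 100]:
    chrom, start, end = summit_window("chr2L", 5000, width)
    print(width, chrom, start, end, end - start)
```

With width 100 the region spans 200 bp, matching the description of width_list above.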
Finding optimal parameters for your own data
In order to find the best ranking method (weights) and binding sites width, we simply do a brute force search for the parameters that give the best result:
- Run the above section with
rank_list covering all the parameters you want to try.
- Obtain a list of "known" motifs in MEME format. You can download these from Cis-BP or another database.
- Modify lines 5-12 in
MOBI_tomtom.py, then run the script with
python MOBI_tomtom.py. This will generate a file called
joblist_tomtom.sh. Make sure you have meme-suite installed by verifying
tomtom --help. Run the script with
bash joblist_tomtom.sh. This compares the inferred motifs to the known motifs.
- Modify lines 5-12 in
MOBI_stats.py, then run the script with
python MOBI_stats.py. The best parameters will be shown in
example_result/stats/DREME_idx.txt if you are using DREME. You can find the results for these optimal parameters by the file names in
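The brute-force search described in this section amounts to trying every combination of width, ranking method and tool. The sketch below only enumerates such a grid; it does not reproduce how MOBI.py, MOBI_tomtom.py or MOBI_stats.py work internally. The parameter values and the helper `parameter_grid` are hypothetical, and the rank strings simply follow the RankLinear0.1 pattern mentioned above.

```python
from itertools import product

# Hypothetical search space; replace with the values you want to try.
width_list = [50, 100]
rank_list = ["RankLinear0.1", "RankLinear0.5"]
tool_list = ["DREME"]


def parameter_grid(widths, ranks, tools):
    """List every (width, rank, tool) combination as a dict."""
    return [
        {"width": w, "rank": r, "tool": t}
        for w, r, t in product(widths, ranks, tools)
    ]


grid = parameter_grid(width_list, rank_list, tool_list)
print(len(grid), "parameter combinations to evaluate")
for setting in grid:
    print(setting)
```

The number of inference jobs grows with the product of the list lengths, so it may be worth keeping each list short for a first pass.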