MOtif inference with advanced BInding site selection.
As shown in this figure, we postulate that there are several TF binding modes. In the results of a typical ChIP-seq experiment targeting a certain TF (shown here in red), the identified binding sites may either contain the target motif (fig. A) or not (fig. B). The ideal binding sites for motif inference are those from scenario one.
Notice that among all these cases:
- The binding site in scenario one is the least "crowded", i.e. it has the fewest potentially interacting proteins within the binding site window.
- The target binding motif should be located at or close to the peak summit (in this figure, the center).
Therefore, we select binding sites with a low "crowdness" score and trim them to a shorter length. Using such binding sites results in more accurate motif inference.
For more details, see our paper:
Xu, J., Gao, J., Ni, P., & Gerstein, M. (2024). Less-is-more: selecting transcription factor binding regions informative for motif inference. Nucleic Acids Research, gkad1240.
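The selection-and-trimming idea described above can be illustrated with a minimal sketch. This is not the MOBI implementation: the `Peak` record, its `crowdness` field and the `select_and_trim` helper are hypothetical names introduced only for illustration, and the actual ranking methods are described in the paper.

```python
from typing import List, NamedTuple, Tuple


class Peak(NamedTuple):
    """Hypothetical peak record: chromosome, summit position, crowdness score."""
    chrom: str
    summit: int
    crowdness: float


def select_and_trim(peaks: List[Peak], top_n: int, half_width: int) -> List[Tuple[str, int, int]]:
    """Keep the top_n least crowded peaks and trim each to +-half_width around its summit."""
    # Lower crowdness is assumed to be better, so sort in ascending order.
    ranked = sorted(peaks, key=lambda p: p.crowdness)
    selected = ranked[:top_n]
    # Clamp the start at 0 so that a window never extends before the chromosome start.
    return [(p.chrom, max(0, p.summit - half_width), p.summit + half_width) for p in selected]
```

For example, `select_and_trim(peaks, top_n=500, half_width=100)` would return 200 bp windows centered on the summits of the 500 least crowded peaks (500 is an arbitrary illustrative value).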
We applied MOBI to each of the 4 samples from ENCODE with its best parameters and inferred the motifs. You can download all our predicted motifs here.
Sample | TFs | DREME | MEME | STREME | HOMER | DESSO |
---|---|---|---|---|---|---|
Drosophila melanogaster | 454 | Download | Download | Download | Download | 52 TFs |
Caenorhabditis elegans | 283 | Download | Download | Download | Download | 33 TFs |
GM12878 | 136 | Download | Download | Download | Download | 56 TFs |
K562 | 336 | Download | Download | Download | Download | 89 TFs |
(Note that the files are in MEME format. The file name is the TF name with any "(", ")" and ":" characters omitted, e.g. the TF name A(B)C has a motif file named ABC.meme.)
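The naming rule in the note above can be written as a small helper for locating a TF's motif file. The function name `motif_filename` is our own and is not part of MOBI; the sketch assumes that only the three characters listed above are removed.

```python
import re


def motif_filename(tf_name: str) -> str:
    """Drop "(", ")" and ":" from the TF name and append the .meme extension."""
    return re.sub(r"[():]", "", tf_name) + ".meme"
```

With this helper, `motif_filename("A(B)C")` gives `ABC.meme`, matching the example in the note.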
See here for a few examples of the inferred motifs.
We provide the C-scores used in the paper in bigWig format for download. (If the download does not start, right-click to copy the link and paste it into the browser manually.)
Fly
Worm
GM12878
K562
- python3
- pybedtools
- numpy
- pandas
To run the inference step, you need to have DREME/MEME/STREME/HOMER properly installed. See here for meme-suite installation. You can test the installation with `meme -version` or an equivalent command.
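A quick way to check that the required command-line tools are on your `PATH` before starting is sketched below. The helper `missing_tools` is hypothetical and not part of MOBI, and the executable names you pass in (for example `meme` or `samtools`) depend on your installation.

```python
import shutil
from typing import Iterable, List


def missing_tools(executables: Iterable[str]) -> List[str]:
    """Return the executables that cannot be found on PATH, in the order given."""
    return [name for name in executables if shutil.which(name) is None]
```

Calling `missing_tools(["meme", "samtools"])` returns an empty list when both programs can be found.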
- Download the repository from GitHub and unzip it into `MOBI/`, then go to the folder: `cd MOBI/`.
- Download the example ChIP-seq data: `bash DownloadExampleData.sh`.
- Download the genome https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz into `example_data/`, then uncompress and index the FASTA file (requires samtools):

      cd example_data/
      gunzip dm6.fa.gz
      samtools faidx dm6.fa
      cd ..

- Run the script: `python MOBI.py`. This generates a file called `joblist_inference.sh`. All intermediate files are in `example_result/`.
- Make sure you have the inference tool (e.g. DREME) installed. Run the script: `bash joblist_inference.sh`. All predicted motifs are now under `example_result/inference/`.
Modify lines 7-14 in `MOBI.py` accordingly (these settings will be moved to command-line arguments in a future update). Then run steps 4 and 5 of the previous section.
- `data_chip` (str): Path to the folder containing all the ENCODE ChIP-seq files. All these files are used to calculate the C-score.
- `data_meta` (str): Path to a tab-separated file. The first column is the basename of the file and the second column is the TF name. Note that the first column should be a subset of the basenames of the files in `data_chip`.
- `genome_fasta` (str): Path to the genome FASTA file. The index file should be generated beforehand and located in the same folder.
- `result_main` (str): Path to the main result folder.
- `width_list` (list of int): A list of binding region lengths to try. 100 indicates a binding region of ±100 bp around the ChIP-seq peak summit (a total of 200 bp).
- `rank_list` (list of str): Choice of `RankSPP`, `RankCrowdness` or `RankLinear0.1`, where 0.1 can be changed to another float. For details of the ranking methods, see the paper.
- `tool_list` (list of str): Choice of `DREME`, `MEME`, `STREME` and `HOMER`.
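Before running MOBI on your own data, it can help to confirm that every entry in the `data_meta` file has a matching file in `data_chip`. The sketch below is not part of MOBI: the helper name `check_meta_against_chip` is hypothetical, and it assumes that "basename" means the full file name, including its extension.

```python
import csv
import os
from typing import Dict, List, Tuple


def check_meta_against_chip(meta_path: str, chip_dir: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (basename -> TF name, basenames listed in the meta file but absent from chip_dir)."""
    available = set(os.listdir(chip_dir))
    mapping: Dict[str, str] = {}
    with open(meta_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            # Column 1 is the file basename, column 2 is the TF name.
            mapping[row[0]] = row[1]
    missing = sorted(name for name in mapping if name not in available)
    return mapping, missing
```

An empty `missing` list means that the first column of the meta file is a subset of the files in the ChIP-seq folder, as required above.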
To find the best ranking method (weights) and binding site width, we simply run a brute-force search for the parameters that give the best result:
- Run the above section with `width_list` and `rank_list` covering all the parameters you want to try.
- Prepare a list of "known" motifs in MEME format. You can download these from Cis-BP or another database.
- Modify lines 5-12 in `MOBI_tomtom.py`, then run the script with `python MOBI_tomtom.py`. This generates a file called `joblist_tomtom.sh`. Make sure you have meme-suite installed by verifying that `tomtom --help` runs. Run the script with `bash joblist_tomtom.sh`. This compares the inferred motifs to the known motifs.
- Modify lines 5-12 in `MOBI_stats.py`, then run the script with `python MOBI_stats.py`. The best parameters will be shown in `example_result/stats/DREME_idx.txt` if you are using DREME. You can find the results for these optimal parameters by their file names in `example_result/inference/DREME/`.
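The brute-force search amounts to enumerating every combination of tool, ranking method and width, then keeping the combination with the best score. The sketch below only illustrates that idea. It is not what `MOBI_stats.py` does internally: the helper names are hypothetical, and the scoring function is left to the caller because the selection criterion is not described here.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Setting = Tuple[str, str, int]  # (tool, ranking method, width)


def parameter_grid(rank_list: Sequence[str], width_list: Sequence[int],
                   tool_list: Sequence[str]) -> List[Setting]:
    """Enumerate every (tool, rank, width) combination to try."""
    return list(product(tool_list, rank_list, width_list))


def best_setting(grid: Sequence[Setting], score: Callable[[Setting], float]) -> Setting:
    """Return the setting with the highest score; the score function is supplied by the caller."""
    return max(grid, key=score)
```

For instance, `parameter_grid(["RankSPP", "RankLinear0.1"], [50, 100], ["DREME"])` yields four settings to evaluate.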
For any questions, please contact Jiahao Gao (jiahao.gao@yale.edu).
Gerstein Lab 2024
MIT License