📋 iEnhancer-ELM: Learning Explainable Contextual Information to Improve Enhancer Identification using Enhancer Language Models
In this paper, we propose an enhancer language model (iEnhancer-ELM) for enhancer identification by incorporating a pre-trained BERT-based DNA model. iEnhancer-ELM treats enhancer sequences as natural-language sentences composed of k-mer nucleotide tokens, extracting informative biological features from the raw sequences. Benefiting from the complementary information in different k-mer tokens (k = 3, 4, 5, 6), we ensemble four iEnhancer-ELM models to further improve enhancer identification. The experimental results show that our ensemble model achieves an accuracy of 83.00%, outperforming competing state-of-the-art methods. Moreover, 40% of the motifs found by iEnhancer-ELM exhibit statistical and biological significance, demonstrating that our model is explainable and has the potential to reveal biological mechanisms.
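To make the k-mer view concrete, the snippet below is a minimal sketch (not the repository's exact code) of how a raw enhancer sequence can be converted into a space-separated sentence of overlapping k-mer tokens, the input format used by BERT-style DNA models such as DNABERT.

```python
def seq_to_kmer_sentence(seq: str, k: int = 3) -> str:
    """Convert a DNA sequence into a space-separated sentence of overlapping k-mers."""
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Example: a 6-bp fragment becomes four 3-mer "words".
print(seq_to_kmer_sentence("ATGCGT", k=3))  # -> "ATG TGC GCG CGT"
```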
Experiments on the benchmark dataset show that iEnhancer-ELM with individual k-mers achieves an accuracy of about 80%, outperforming well-known models based on a single feature. By ensembling multiple iEnhancer-ELM models built on different k-mer tokens, we achieve a better performance with an accuracy of 83.00%, outperforming existing state-of-the-art methods. Furthermore, we interpret iEnhancer-ELM by analyzing the patterns in its attention mechanism, and find that the motifs extracted from the attention weights match known motifs at a significant level, demonstrating the ability of iEnhancer-ELM to capture important biological features for enhancer identification. The contributions of this work can be summarized as follows:
- We propose the enhancer language models by incorporating a pre-trained BERT-based DNA model to capture global contextual information from raw enhancer sequences.
- Our iEnhancer-ELM achieves the best performance compared with well-known models based on a single feature, and the ensemble iEnhancer-ELM outperforms existing state-of-the-art methods.
- iEnhancer-ELM is able to capture important biological motifs for enhancer identification, demonstrating its potential for revealing the biological mechanisms of enhancers.
The following figure illustrates our proposed method. The top subfigure is the flowchart of enhancer identification, and the bottom subfigure is our motif analysis via the attention mechanism in iEnhancer-ELM with 3-mer tokens.
The table below compares the ensemble iEnhancer-ELM with state-of-the-art predictors for enhancer identification on the independent dataset.
Method | Acc | Sn | Sp | MCC | Source |
---|---|---|---|---|---|
iEnhancer-2L | 0.7300 | 0.7100 | 0.7500 | 0.4600 | Liu et al. |
iEnhancer-EL | 0.7475 | 0.7100 | 0.7850 | 0.4960 | Liu et al. |
iEnhancer-XG | 0.7575 | 0.7400 | 0.7750 | 0.5140 | Cai et al. |
iEnhancer-Deep | 0.7402 | 0.8150 | 0.6700 | 0.4902 | Kamran et al. |
iEnhancer-GAN | 0.7840 | 0.8110 | 0.7580 | 0.5670 | Yang et al. |
iEnhancer-5Step | 0.7900 | 0.8200 | 0.7600 | 0.5800 | Le et al. |
BERT-Enhancer | 0.7560 | 0.8000 | 0.7750 | 0.5150 | Le et al. |
iEnhancer-ELM | 0.8300 | 0.8000 | 0.8600 | 0.6612 | ours |
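The ensemble row above combines the predictions of the four k-mer models. The exact combination rule is described in the paper; as an illustrative assumption only, the sketch below uses simple probability averaging (soft voting).

```python
import numpy as np

def soft_vote(prob_list, threshold=0.5):
    """Average positive-class probabilities from several models and threshold.

    `prob_list` holds one 1-D array of enhancer probabilities per k-mer model
    (k = 3, 4, 5, 6); this averaging rule is an assumption for illustration,
    not necessarily the exact ensemble strategy used in the paper.
    """
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return (avg >= threshold).astype(int), avg
```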
Following the attention-based motif analysis of DNABERT[1], we extract motifs from the raw sequences in the following steps (a rough sketch of Step 1 is given after the list); more details are provided in the main paper.
- Step 1: calculate the attention weights of single nucleotides in enhancer sequences.
- Step 2: identify potential patterns.
- Step 3: filter significant candidates.
- Step 4: obtain motifs according to sequence similarity.
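As a rough sketch of Step 1 (an assumption about one possible implementation, not the repository code), one can average the last-layer attention from the [CLS] token over all heads and spread each k-mer token's score across the nucleotide positions it covers; the snippet assumes a recent version of the transformers library.

```python
import numpy as np
import torch

def nucleotide_attention(model, tokenizer, kmer_sentence, k=3):
    """Per-nucleotide attention scores for one k-mer sentence (illustrative)."""
    enc = tokenizer(kmer_sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True, return_dict=True)
    # Last layer: (batch, heads, len, len) -> mean over heads, take the [CLS] row.
    cls_att = out.attentions[-1].mean(dim=1)[0, 0, :]
    cls_att = cls_att[1:-1].numpy()                 # drop the [CLS]/[SEP] targets
    n_tokens = cls_att.shape[0]
    scores = np.zeros(n_tokens + k - 1)
    counts = np.zeros(n_tokens + k - 1)
    for i, a in enumerate(cls_att):                 # spread each k-mer score over its k bases
        scores[i:i + k] += a
        counts[i:i + k] += 1
    return scores / counts
```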
Finally, we obtain 30 motifs and compare them against STREME[2] and JASPAR[3] using the TomTom tool[4]. The result is shown in the figure below.
The table below shows the performance comparison on the independent set of the second dataset. Enhancer-IF[5] is a state-of-the-art method whose cell-specific enhancer datasets contain sequences of varying length.
Cell line | Method | BACC | Sn | Sp | MCC | AUC |
---|---|---|---|---|---|---|
HEK293 | Enhancer-IF | 0.8170 | 0.7750 | 0.8580 | 0.6250 | 0.8930 |
 | non-fine-tuning | 0.8278 | 0.8047 | 0.8509 | 0.6409 | 0.9114 |
 | fine-tuning | 0.8732 | 0.8666 | 0.8798 | 0.7283 | 0.9289 |
NHEK | Enhancer-IF | 0.7480 | 0.7340 | 0.7620 | 0.4770 | 0.8100 |
 | non-fine-tuning | 0.7191 | 0.7717 | 0.6665 | 0.4137 | 0.7928 |
 | fine-tuning | 0.7706 | 0.7909 | 0.7501 | 0.5152 | 0.8467 |
K562 | Enhancer-IF | 0.7730 | 0.7730 | 0.7720 | 0.5230 | 0.8520 |
 | non-fine-tuning | 0.7834 | 0.8394 | 0.7275 | 0.5360 | 0.8605 |
 | fine-tuning | 0.7974 | 0.8180 | 0.7767 | 0.5679 | 0.8712 |
GM12878 | Enhancer-IF | 0.8140 | 0.7550 | 0.8730 | 0.6260 | 0.9010 |
 | non-fine-tuning | 0.8186 | 0.7402 | 0.8969 | 0.6463 | 0.9152 |
 | fine-tuning | 0.8222 | 0.7564 | 0.8879 | 0.6475 | 0.9179 |
HMEC | Enhancer-IF | 0.7530 | 0.7400 | 0.7650 | 0.4850 | 0.8390 |
 | non-fine-tuning | 0.7628 | 0.7694 | 0.7563 | 0.5022 | 0.8352 |
 | fine-tuning | 0.7685 | 0.7827 | 0.7543 | 0.5122 | 0.8403 |
HSMM | Enhancer-IF | 0.7180 | 0.7090 | 0.7270 | 0.4170 | 0.7940 |
 | non-fine-tuning | 0.7225 | 0.7592 | 0.6859 | 0.4208 | 0.7987 |
 | fine-tuning | 0.7201 | 0.7388 | 0.7013 | 0.4175 | 0.7909 |
NHLF | Enhancer-IF | 0.7940 | 0.7890 | 0.7980 | 0.5650 | 0.8650 |
 | non-fine-tuning | 0.7551 | 0.8325 | 0.6777 | 0.4810 | 0.8266 |
 | fine-tuning | 0.7884 | 0.8236 | 0.7532 | 0.5479 | 0.8623 |
HUVEC | Enhancer-IF | 0.7370 | 0.7520 | 0.7220 | 0.4500 | 0.8110 |
 | non-fine-tuning | 0.7353 | 0.8085 | 0.6620 | 0.4436 | 0.8061 |
 | fine-tuning | 0.7334 | 0.7691 | 0.6977 | 0.4417 | 0.8045 |
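In the table, "non-fine-tuning" keeps the pre-trained BERT encoder frozen and trains only the classification head, while "fine-tuning" also updates the encoder weights. The sketch below illustrates the difference in a generic way (the optimizer and learning rate are placeholders, not the repository's settings).

```python
import torch

def build_optimizer(encoder: torch.nn.Module, head: torch.nn.Module,
                    fine_tune: bool, lr: float = 2e-5):
    """Freeze or unfreeze the pre-trained encoder depending on the setting."""
    for p in encoder.parameters():
        p.requires_grad = fine_tune          # False -> "non-fine-tuning" setting
    trainable = list(head.parameters())
    if fine_tune:
        trainable += list(encoder.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```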
# Clone this repository
git clone https://github.com/chen-bioinfo/iEnhancer-ELM.git
cd iEnhancer-ELM
# download the pre-trained BERT-based DNA models from the link (https://drive.google.com/drive/folders/1qzvCzYbx0UIZV3HY4pEEeIm3d_mqZRcb?usp=sharing);
# With these pre-trained models and the notebook iEnhancer-ELM/code/DNA_bert_finetuning_average_L2.ipynb,
# we can reproduce the training process.
cd iEnhancer-ELM/code
# download the fine-tuned classification models from the link (https://drive.google.com/drive/folders/1EdOYQ2BLcAUtS_dupWdmJ-v6bkne4xAM?usp=sharing);
# With these fine-tuned models and the notebook iEnhancer-ELM/code/DNA_Bert_finetuning_L2_ensemble.ipynb,
# we can reproduce the best performance on the independent dataset. Our motif analysis is also based on these fine-tuned models.
# The key elements of the iEnhancer-ELM runtime environment are listed below:
# python=3.6.9; torch=1.9.0+cu111
# Numpy=1.19.5; Transformers=3.0.16
# GPU=NVIDIA A100 80GB PCIe
# More details about the code are provided in the 'code' folder.
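As a quick-start illustration (the model directory is a placeholder for one of the downloaded checkpoints, and the snippet assumes a recent transformers version rather than being an exact excerpt from the notebooks), a checkpoint can be loaded and applied to a single sequence as follows:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

model_dir = "path/to/downloaded_dna_bert_3mer_checkpoint"   # placeholder path

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2)
model.eval()

# Turn a raw sequence into overlapping 3-mer tokens and score it.
seq = "AGCTTAGGCA" * 20                                     # toy 200-bp sequence
sentence = " ".join(seq[i:i + 3] for i in range(len(seq) - 2))
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
prob_enhancer = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"predicted enhancer probability: {prob_enhancer:.3f}")
```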
[1] Ji Y, Zhou Z, Liu H, et al. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome[J]. Bioinformatics, 2021, 37(15): 2112-2120.
[2] Bailey T L. STREME: accurate and versatile sequence motif discovery[J]. Bioinformatics, 2021, 37(18): 2834-2840.
[3] Castro-Mondragon J A, Riudavets-Puig R, Rauluseviciute I, et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles[J]. Nucleic acids research, 2022, 50(D1): D165-D173.
[4] Gupta S, Stamatoyannopoulos J A, Bailey T L, et al. Quantifying similarity between motifs[J]. Genome biology, 2007, 8(2): 1-9.
[5] Basith S, Hasan M M, Lee G, et al. Integrative machine learning framework for the identification of cell-specific enhancers from the human genome[J]. Briefings in Bioinformatics, 2021, 22(6): bbab252.