8. Model Comparisons


Here we utilize four datasets: a 50bp E. coli dataset, a 165bp E. coli dataset, an 80bp yeast dataset, and a 1000bp yeast dataset. The 80bp yeast core promoters were obtained by trimming 17bp from the left flank and 13bp from the right flank of each sequence in the 110bp dataset. The 50bp and 165bp E. coli datasets each contain approximately 10,000 sequences, the 80bp yeast dataset contains around 100,000 sequences, and the 1000bp yeast dataset consists of only about 4,000 sequences.
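
The flank trimming above amounts to a simple slice on each 110bp sequence. Below is a minimal sketch, assuming the sequences are stored one per line in plain-text files (the file names are illustrative placeholders, not part of the released datasets):

```python
# Derive the 80bp yeast core promoters by trimming 17bp from the left
# flank and 13bp from the right flank of each 110bp sequence.
# File names are hypothetical placeholders.
with open("yeast_110bp.txt") as fin, open("yeast_80bp.txt", "w") as fout:
    for line in fin:
        seq = line.strip()
        fout.write(seq[17:-13] + "\n")  # 110 - 17 - 13 = 80bp
```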

In general, the different generative models have distinct advantages. GANs show promising results in long-sequence generation, VAEs run fastest per epoch, and the diffusion model produced the best sequences when assessed by the 6-mer frequency PCC criterion, scoring 0.89, 0.98, and 0.98 on the 50bp E. coli, 165bp E. coli, and 80bp yeast datasets, respectively. The diffusion model is also reliably resistant to mode collapse in most cases. We recommend that new users start with the diffusion model and, as they gain experience, consider incorporating the WGAN or VAE models into their workflow.
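
For reference, the 6-mer frequency PCC used throughout this page can be computed along the following lines. This is a minimal sketch of one plausible implementation (GPro ships its own evaluation utilities); `generated_seqs` and `natural_seqs` are placeholder lists standing in for the sampled and natural sequences:

```python
import itertools
import numpy as np
from scipy.stats import pearsonr

def kmer_freq(seqs, k=6):
    """Frequency vector over all 4^k k-mers, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if km in index:  # skip windows containing ambiguous bases
                counts[index[km]] += 1
    return counts / counts.sum()

# Placeholders: replace with your sampled and natural promoter sequences.
generated_seqs = ["ACGTACGTAC" * 5]
natural_seqs = ["ACGTTGCAAC" * 5]

# Pearson correlation between the generated and natural 6-mer profiles
pcc, _ = pearsonr(kmer_freq(generated_seqs), kmer_freq(natural_seqs))
```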

In the predictor section, when evaluated by the regression PCC criterion, CNN_K15 achieved the best results on the Ecoli_50 and Yeast_80 datasets but struggled with longer sequences, especially the Yeast_1000 dataset. AttnBiLSTM, on the other hand, delivered the most robust predictions across all four datasets and performed well on long sequences, with PCCs of 0.17, 0.66, 0.82, and 0.57 on the 50bp E. coli, 165bp E. coli, 80bp yeast, and 1000bp yeast datasets, respectively. However, AttnBiLSTM comes with a relatively long runtime. We recommend selecting a model based on the available time and the sequence length of the dataset: use either CNN_K15 or AttnBiLSTM as a baseline, while exploring alternative models for specific applications and dataset characteristics.

Analyses for Generator

We evaluated the runtime of the generative models (VAE, WGAN, Diffusion), as well as the 6-mer similarity between their final sampling results and the natural data. The training epochs were set to 200 for VAE, 12 for WGAN, and 100 for Diffusion. These epoch counts were experimentally validated to ensure that each model is adequately trained and does not exhibit anomalies such as mode collapse.
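
Per-epoch times of the kind reported below can be reproduced with simple wall-clock measurement along these lines (a sketch; `train_one_epoch` and `n_epochs` are placeholders for whichever training step and epoch budget the chosen generator uses):

```python
import time

def train_one_epoch():
    # Placeholder for the generator's real training step.
    time.sleep(0.01)

n_epochs = 12  # e.g., the WGAN setting above

epoch_times = []
for epoch in range(n_epochs):
    start = time.perf_counter()
    train_one_epoch()
    epoch_times.append(time.perf_counter() - start)

print(f"mean time/epoch: {sum(epoch_times) / len(epoch_times):.2f}")
```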

Model performance is as follows:

| Execution Time/Epoch | Ecoli_50 | Ecoli_165 | Yeast_80 | Yeast_1000 |
| --- | --- | --- | --- | --- |
| VAE | 12.81 | 14.47 | 110.81 | 20.99 |
| WGAN | 25.86 | 56.87 | 303.18 | 88.65 |
| Diffusion | 14.97 | 38.45 | 174.46 | * |

| 6-mer Frequency PCC | Ecoli_50 | Ecoli_165 | Yeast_80 | Yeast_1000 |
| --- | --- | --- | --- | --- |
| VAE | 0.86 | 0.82 | 0.49 | 0.94 |
| WGAN | 0.59 | 0.82 | 0.13 | 0.96 |
| Diffusion | 0.89 | 0.98 | 0.98 | * |

Caution*: At a sequence length of 1kb, the diffusion model raises a CUDA out-of-memory error, indicating that it should not be applied to kb-scale generation.

The results demonstrate that the WGAN model is the most time-saving approach overall: although its per-epoch time is the highest, it requires far fewer epochs (12, versus 200 for VAE and 100 for Diffusion), and it achieves relatively good similarity, consistent with previous conclusions. However, the WGAN model occasionally suffers from mode collapse, and the checkpoint at epoch 12 is not always the appropriate one. The diffusion model consistently achieves the best similarity and remains stable.

Analyses for Predictor

We evaluated the runtime of the predictive models (CNN_K15, CNN_Wangye, DenseNet, DenseLSTM, AttnBiLSTM), as well as the Pearson correlation coefficient between the predicted and measured values. All predictors use their default parameters, and the training epochs are consistently set to 200.
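
The regression PCC itself is simply the Pearson correlation between predicted and measured expression values on the held-out set. A minimal sketch, where `y_pred` and `y_true` are placeholder arrays:

```python
from scipy.stats import pearsonr

y_true = [1.2, 0.8, 3.4, 2.1]  # measured expression values (placeholder)
y_pred = [1.0, 0.9, 3.0, 2.5]  # model predictions (placeholder)

pcc, p_value = pearsonr(y_pred, y_true)
print(f"regression PCC: {pcc:.2f}")
```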

Model performance is as follows:

| Execution Time (Total) | Ecoli_50 | Ecoli_165 | Yeast_80 | Yeast_1000 |
| --- | --- | --- | --- | --- |
| CNN_K15 | 73.13 | 41.07 | 292.67 | 73.85 |
| CNN_Wangye | 209.55 | 703.32 | 1305.26 | 181.31 |
| DenseNet | 35.65 | 66.89 | 973.63 | 41.36 |
| DenseLSTM | 62.33 | 159.97 | 1398.89 | 311.91 |
| AttnBiLSTM | 104.76 | 132.74 | 778.40 | 464.80 |

| Regression PCC | Ecoli_50 | Ecoli_165 | Yeast_80 | Yeast_1000 |
| --- | --- | --- | --- | --- |
| CNN_K15 | 0.22 | 0.64 | 0.83 | 0.37 |
| CNN_Wangye | 0.19 | * | 0.57 | 0.47 |
| DenseNet | -0.07 | 0.59 | 0.74 | 0.33 |
| DenseLSTM | 0.03 | 0.61 | 0.82 | 0.38 |
| AttnBiLSTM | 0.17 | 0.66 | 0.82 | 0.57 |

Caution*: CNN_Wangye cannot be applied to training sets containing negative values.

The results indicate that CNN_K15 is suitable for short sequences (<100bp), while CNN_Wangye and AttnBiLSTM are better suited to long sequences (>200bp). DenseNet and DenseLSTM show relatively balanced performance, and variants of them remain widely used in other studies.