Have you tried to train VNMT from scratch? #2
Comments
This is my result (scores are BLEU on the dev set):
- experiment 1 (VNMT):
- experiment 2 (VNMT w/o KL):
- experiment 3 (VNMT w/o KL && h'_e set to the zero vector, thus equivalent to RNNSearch):
It's easy to see that VNMT is better than VNMT w/o KL, but neither of them can match the performance of RNNSearch. It seems that VNMT hasn't converged; I kept training it to 230k steps and it now reaches a BLEU score of 27 on the dev set (still not good enough, but better than at 75k steps). Did you observe this slow convergence in your experiments? Thanks.
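For reference, here is a minimal NumPy sketch of why experiment 3 reduces to RNNSearch. It assumes h'_e is fed to the decoder alongside the attention context at each step; the names, shapes, and weights are illustrative assumptions, not the repository's code.

```python
# Illustrative sketch only: if the global representation h'_e fed to the decoder
# is the zero vector, its contribution vanishes and only the attention context
# c_t remains, i.e. the decoder sees exactly what an RNNSearch decoder would see.
import numpy as np

hidden, latent = 4, 3
W_c = np.random.randn(hidden, hidden)   # weight for the attention context (assumed)
W_z = np.random.randn(hidden, latent)   # weight for the global representation h'_e (assumed)

def decoder_extra_input(c_t, h_e_prime):
    """Extra input to the decoder state update: W_c c_t + W_z h'_e."""
    return W_c @ c_t + W_z @ h_e_prime

c_t = np.random.randn(hidden)           # attention context at step t
h_e_prime = np.zeros(latent)            # experiment 3: h'_e set to the zero vector

# With h'_e = 0 the second term is identically zero.
assert np.allclose(decoder_extra_input(c_t, h_e_prime), W_c @ c_t)
```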
@soloice Yes, training VNMT from scratch without any other optimization tricks usually results in undesirable convergence and relatively worse performance. We also observed this phenomenon in our empirical experiments, and in the end we chose to train VNMT starting from a well-pretrained NMT model. I'm not sure whether your experimental design is correct, but I believe there is something wrong with your VNMT w/o KL. This is because VNMT w/o KL just extends NMT with a global source representation; generally, you can treat it as a combination of the vanilla seq2seq model and the attentional seq2seq model, so its performance should be close to that of NMT. Based on your results, I guess that after removing the KL term, you may have forgotten to change the reparameterization part, where the noise should be dropped. If you prefer training VNMT from the very beginning, I suggest borrowing the idea of weight annealing on the KL term. However, in my experience, selecting the optimal annealing strategy is, to some extent, also task- and data-dependent. You can find more details in this paper.
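A minimal NumPy sketch of the two points above, assuming a diagonal-Gaussian posterior; the function names and the 20k-step warmup are illustrative assumptions, not the authors' code. It shows (i) dropping the reparameterization noise when the KL term is removed, and (ii) a simple linear KL-weight annealing schedule.

```python
import numpy as np

def sample_latent(mu, log_var, use_kl=True):
    """Reparameterization: z = mu + sigma * eps while the KL term is kept.

    When the KL term is removed (VNMT w/o KL) there is no prior pulling the
    posterior towards N(0, I), so the sampling noise should be dropped and the
    predicted mean used directly.
    """
    if use_kl:
        eps = np.random.randn(*mu.shape)
        return mu + np.exp(0.5 * log_var) * eps
    return mu  # deterministic global source representation

def kl_weight(step, warmup_steps=20000):
    """Linear annealing: the KL weight grows from 0 to 1 over `warmup_steps`."""
    return min(1.0, step / float(warmup_steps))

def kl_term(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

# Sketch of the training objective at step t:
#   loss = cross_entropy + kl_weight(t) * kl_term(mu, log_var)
```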
Your suggestion is very informative!
Hi, I checked my implementation and I don't think my VNMT w/o KL code is wrong. (Btw, XMUNMT is also from XMU, and I don't know whether you know that work, lol~ It has several differences from the original RNNSearch, e.g. initialization, optimizer, etc.) I tuned my hyperparameters (mainly the learning rate and the initializer) and made …
I guess I should try fine-tuning instead of insisting on training from scratch.
Hi @soloice, XMUNMT is an excellent open-source toolkit; it's good to see it getting more users. I glanced over your code for VNMT w/o KL, and I think your implementation is OK. But your training result is different from ours: for this model, we observed performance very similar to that of NMT, and I also don't know where the problem is.
It's very kind of you to review my code. |
I found that training VNMT from scratch leads to low performance. I then monitored the training loss and figured it out: the algorithm just converges too slowly and hasn't fully converged by the time I stop training.
Does this suggest a buggy implementation, or are these training dynamics normal?
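One simple way to make "hasn't fully converged" concrete from logged training losses is to smooth the curve and check the relative improvement over a recent window. A small sketch; the smoothing factor, window, and tolerance are arbitrary assumptions.

```python
def has_converged(losses, alpha=0.01, window=5000, tol=1e-3):
    """Return True if the exponentially smoothed loss improved by less than
    `tol` (relative) over the last `window` logged steps."""
    ema = losses[0]
    smoothed = []
    for x in losses:
        ema = (1 - alpha) * ema + alpha * x
        smoothed.append(ema)
    if len(smoothed) <= window:
        return False
    old, new = smoothed[-window], smoothed[-1]
    return (old - new) / max(abs(old), 1e-12) < tol
```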