
Have you tried to train VNMT from scratch? #2

Closed

soloice opened this issue Dec 6, 2017 · 6 comments

soloice commented Dec 6, 2017

I found that training VNMT from scratch leads to low performance. I monitored the training loss and figured out why: the algorithm converges too slowly and has not fully converged by the time I stop it.
Does this suggest a buggy implementation, or are these training dynamics normal?

soloice commented Dec 7, 2017

These are my results (the score is BLEU on the dev set):

experiment 1 (VNMT):

INFO:tensorflow:Best score at step 5000: 0.041646
INFO:tensorflow:Best score at step 10000: 0.089593
INFO:tensorflow:Best score at step 15000: 0.120238
INFO:tensorflow:Best score at step 20000: 0.157707
INFO:tensorflow:Best score at step 25000: 0.162056
INFO:tensorflow:Best score at step 30000: 0.204992
INFO:tensorflow:Best score at step 35000: 0.206457
INFO:tensorflow:Best score at step 40000: 0.219764
INFO:tensorflow:Best score at step 45000: 0.219764
INFO:tensorflow:Best score at step 50000: 0.219764
INFO:tensorflow:Best score at step 55000: 0.239175
INFO:tensorflow:Best score at step 60000: 0.241595
INFO:tensorflow:Best score at step 65000: 0.241595
INFO:tensorflow:Best score at step 70000: 0.241641
INFO:tensorflow:Best score at step 75000: 0.250475

experiment 2 (VNMT w/o KL):

INFO:tensorflow:Best score at step 5000: 0.064169
INFO:tensorflow:Best score at step 10000: 0.083137
INFO:tensorflow:Best score at step 15000: 0.126807
INFO:tensorflow:Best score at step 20000: 0.145696
INFO:tensorflow:Best score at step 25000: 0.169591
INFO:tensorflow:Best score at step 30000: 0.201930
INFO:tensorflow:Best score at step 35000: 0.201930
INFO:tensorflow:Best score at step 40000: 0.201930
INFO:tensorflow:Best score at step 45000: 0.220347
INFO:tensorflow:Best score at step 50000: 0.220347
INFO:tensorflow:Best score at step 55000: 0.234179
INFO:tensorflow:Best score at step 60000: 0.234179
INFO:tensorflow:Best score at step 65000: 0.235006
INFO:tensorflow:Best score at step 70000: 0.235006
INFO:tensorflow:Best score at step 75000: 0.242985

experiment 3 (VNMT w/o KL, and with h'_e set to the zero vector, thus equivalent to RNNSearch):

INFO:tensorflow:Best score at step 5000: 0.195378
INFO:tensorflow:Best score at step 10000: 0.229090
INFO:tensorflow:Best score at step 15000: 0.245143
INFO:tensorflow:Best score at step 20000: 0.249300
INFO:tensorflow:Best score at step 25000: 0.263898
INFO:tensorflow:Best score at step 30000: 0.282799
INFO:tensorflow:Best score at step 35000: 0.285575
INFO:tensorflow:Best score at step 40000: 0.288144
INFO:tensorflow:Best score at step 45000: 0.288144
INFO:tensorflow:Best score at step 50000: 0.295313
INFO:tensorflow:Best score at step 55000: 0.302703
INFO:tensorflow:Best score at step 60000: 0.306085
INFO:tensorflow:Best score at step 65000: 0.306085
INFO:tensorflow:Best score at step 70000: 0.306247
INFO:tensorflow:Best score at step 75000: 0.308485

It's easy to see that VNMT is better than VNMT w/o KL, but neither can match the performance of RNNSearch. It seems VNMT hasn't converged, so I kept training it up to 230k steps, and it now achieves a BLEU score of 27 on the dev set (still not good enough, but better than at 75k steps).

Did you observe this slow convergence in your experiments? Thanks.

bzhangGo commented Dec 7, 2017

@soloice Yes, training VNMT from scratch without any other optimization tricks usually results in undesirable convergence and relatively worse performance. We also observed this phenomenon in our empirical experiments, and in the end we chose to train VNMT from a well-pretrained NMT model.

I'm not sure whether your experimental design is correct, but I believe there is something wrong with your VNMT w/o KL. VNMT w/o KL just extends NMT with a global source representation; generally, you can treat it as a combination of the vanilla seq2seq model and the attentional seq2seq model, so its performance should be close to that of NMT. Based on your results, I guess that after removing the KL term, you may have forgotten to change the reparameterization part, where the noise should be dropped.
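
To illustrate what I mean, here is a minimal sketch (not the code from any particular repository; `mu`, `log_sigma`, `enable_kl`, and `is_training` are hypothetical names):

```python
import tensorflow as tf

def sample_latent(mu, log_sigma, enable_kl, is_training):
    """Reparameterization for the latent variable z.

    With the KL term, training samples z = mu + sigma * eps, eps ~ N(0, I).
    Without the KL term the model is deterministic, so the noise should be
    dropped and z = mu used directly (both flags are Python bools here).
    """
    if enable_kl and is_training:
        eps = tf.random_normal(tf.shape(mu))
        return mu + tf.exp(log_sigma) * eps
    # VNMT w/o KL (and inference): no sampling noise
    return mu
```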

If you prefer training VNMT from the very beginning, I suggest you borrow some ideas from weight annealing on the KL term. However, in my experience, selecting the optimal annealing strategy is, to some extent, also task- and data-dependent. You can find more details in this paper.
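
For reference, a common linear schedule looks roughly like the sketch below (the warm-up length `kl_annealing_steps` is a hypothetical hyperparameter; other shapes such as sigmoid annealing are also used):

```python
import tensorflow as tf

def kl_weight(global_step, kl_annealing_steps=20000):
    """Linearly anneal the KL weight from 0 to 1 over the first
    `kl_annealing_steps` steps, then hold it at 1."""
    step = tf.to_float(global_step)
    return tf.minimum(1.0, step / float(kl_annealing_steps))

# total_loss = reconstruction_loss + kl_weight(global_step) * kl_divergence
```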

soloice commented Dec 7, 2017

Your suggestions are very informative!
I'll check my implementation to see whether either of these issues exists, especially in my VNMT w/o KL implementation.

soloice closed this as completed Dec 7, 2017

soloice commented Dec 11, 2017

Hi, I checked my implementation, and I don't think my VNMT w/o KL model is wrong.
My code can be found here; it is based on XMUNMT. The relevant parts are in vnmt.py at L94-95, L109-121, L167, and L277-311.
I use params.enable_KL to switch between VNMT and VNMT w/o KL; disabling params.enable_KL and setting h_e_prime to the zero vector gives RNNSearch (sketched below).
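
In simplified form, the three configurations look roughly like this (a sketch, not the exact vnmt.py code; the dense projections and the standard-normal KL are illustrative, since VNMT's actual prior is source-conditioned):

```python
import tensorflow as tf

def latent_path(encoder_states, params, zero_h_e_prime=False):
    # h'_e: mean-pooled source annotations feeding the latent variable
    h_e_prime = tf.reduce_mean(encoder_states, axis=1)
    if zero_h_e_prime:
        # experiment 3: zero the latent input, effectively RNNSearch
        h_e_prime = tf.zeros_like(h_e_prime)
    mu = tf.layers.dense(h_e_prime, params.latent_dim, name="mu")
    log_sigma = tf.layers.dense(h_e_prime, params.latent_dim, name="log_sigma")
    if params.enable_KL:
        # VNMT: sample with noise and pay the KL penalty
        eps = tf.random_normal(tf.shape(mu))
        z = mu + tf.exp(log_sigma) * eps
        kl = 0.5 * tf.reduce_sum(
            tf.exp(2.0 * log_sigma) + tf.square(mu) - 1.0 - 2.0 * log_sigma,
            axis=-1)
    else:
        # VNMT w/o KL: deterministic, no noise, no KL term
        z, kl = mu, tf.constant(0.0)
    return z, kl
```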

(Btw, XMUNMT is also from XMU; I don't know whether you're familiar with that work. It differs from the original RNNSearch in several ways, e.g. initialization and optimizer.)

I tuned my hyperparameters (mainly the learning rate and the initializer) and made VNMT w/o KL converge better (BLEU reaches 0.265 on the dev set at step 75k), but it's still not good enough:

INFO:tensorflow:Best score at step 2500: 0.007113
INFO:tensorflow:Best score at step 5000: 0.028652
INFO:tensorflow:Best score at step 7500: 0.062944
INFO:tensorflow:Best score at step 10000: 0.104961
INFO:tensorflow:Best score at step 12500: 0.122892
INFO:tensorflow:Best score at step 15000: 0.152752
INFO:tensorflow:Best score at step 17500: 0.167090
INFO:tensorflow:Best score at step 20000: 0.183927
INFO:tensorflow:Best score at step 22500: 0.195720
INFO:tensorflow:Best score at step 25000: 0.200146
INFO:tensorflow:Best score at step 27500: 0.209534
INFO:tensorflow:Best score at step 30000: 0.217454
INFO:tensorflow:Best score at step 32500: 0.220276
INFO:tensorflow:Best score at step 35000: 0.227976
INFO:tensorflow:Best score at step 37500: 0.230734
INFO:tensorflow:Best score at step 40000: 0.230734
INFO:tensorflow:Best score at step 42500: 0.238548
INFO:tensorflow:Best score at step 45000: 0.239537
INFO:tensorflow:Best score at step 47500: 0.241912
INFO:tensorflow:Best score at step 50000: 0.247325
INFO:tensorflow:Best score at step 52500: 0.253986
INFO:tensorflow:Best score at step 55000: 0.258078
INFO:tensorflow:Best score at step 57500: 0.258078
INFO:tensorflow:Best score at step 60000: 0.258078
INFO:tensorflow:Best score at step 62500: 0.258346
INFO:tensorflow:Best score at step 65000: 0.259014
INFO:tensorflow:Best score at step 67500: 0.264653
INFO:tensorflow:Best score at step 70000: 0.265454
INFO:tensorflow:Best score at step 72500: 0.265454
INFO:tensorflow:Best score at step 75000: 0.265454

I guess I should try fine-tuning instead of insisting on training VNMT or VNMT w/o KL from scratch.
But I still don't understand why hyperparameters that work well for RNNSearch fail to do so for VNMT w/o KL.

bzhangGo commented:

Hi @soloice, XMUNMT is an excellent open-source toolkit. It's good to see more users.

I glanced over your code for VNMT w/o KL, and I think your implementation is OK. But your training result is different from ours: for this model, we observed performance very similar to that of NMT. I also don't know where the problem is.

soloice commented Dec 13, 2017

It's very kind of you to review my code.
Since the reproduction in XMUNMT differs slightly from the original RNNSearch, a possible explanation is that the AdaDelta optimizer or the different initialization causes this phenomenon.
Now I'm working on fine-tuning VNMT from a well-trained RNNSearch model, and of course I'll release my TF code if I can make it work.
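
For what it's worth, one plausible way to set up that fine-tuning in TF 1.x (a sketch under assumed checkpoint paths and variable names, not my eventual code) is to restore only the variables shared with the pretrained RNNSearch checkpoint and let the new VNMT-specific parameters keep their fresh initialization:

```python
import tensorflow as tf

ckpt_path = "rnnsearch/model.ckpt"  # hypothetical pretrained checkpoint

# Names of the variables stored in the RNNSearch checkpoint.
reader = tf.train.NewCheckpointReader(ckpt_path)
ckpt_vars = set(reader.get_variable_to_shape_map().keys())

# Restore only the overlap; VNMT-only parameters (posterior/prior
# networks, etc.) stay freshly initialized.
shared = [v for v in tf.global_variables() if v.op.name in ckpt_vars]
saver = tf.train.Saver(var_list=shared)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, ckpt_path)
    # ...continue training the full VNMT model from here
```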
