How to reproduce the result on the poly-encoder model for dstc7 #2306

Closed
JiaQiSJTU opened this issue Dec 26, 2019 · 14 comments

@JiaQiSJTU

The README for "Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring" only shows how to reproduce the results on ConvAI2. I changed "-t convai2" to "-t dstc7" and fine-tuned on this task. The results are around 5%-6% lower than those reported in the paper. Are there any other hyper-parameters I need to change to reproduce the results on DSTC7?

@klshuster klshuster self-assigned this Dec 26, 2019
@klshuster
Contributor

Hi,

For the poly-encoder, please try with the following hyperparameters for dstc7:

--init-model zoo:pretrained_transformers/poly_model_huge_reddit/model 
--batchsize 256  
--model transformer/polyencoder 
--warmup_updates 100 
--lr-scheduler-patience 0 
--lr-scheduler-decay 0.4 
-lr 5e-05 
--data-parallel True 
--history-size 20 
--label-truncate 72 
--text-truncate 360 
-vp 5 
-veps 0.5 
--validation-metric accuracy 
--validation-metric-mode max 
--save-after-valid True 
--log_every_n_secs 20 
--candidates batch 
--dict-tokenizer bpe  
--dict-lower True 
--optimizer adamax 
--output-scaling 0.06 
--variant xlm 
--reduction_type mean 
--share-encoders False 
--learn-positional-embeddings True 
--n-layers 12 
--n-heads 12 
--ffn-size 3072 
--attention-dropout 0.1 
--relu-dropout 0.0 
--dropout 0.1 
--n-positions 1024 
--embedding-size 768 
--activation gelu 
--embeddings-scale False 
--n-segments 2 
--learn-embeddings True 
--share-word-embeddings False 
--dict-endtoken __start__ 
--fp16 True 
--polyencoder-type codes 
--codes-attention-type basic 
--poly-n-codes 64 
--poly-attention-type basic 
--polyencoder-attention-keys context 
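
For reference, these flags go to the standard training script together with the DSTC7 task. A minimal sketch of the full command, assuming the examples/train_model.py entry point and a placeholder --model-file path (adapt both to your setup):

python -u examples/train_model.py \
    -t dstc7 \
    --model-file /tmp/dstc7_poly/model \
    --eval-batchsize 10 \
    [all of the flags listed above]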

@JiaQiSJTU
Author

Thanks a lot! What about the bi-encoder and cross-encoder? Are their hyper-parameters the same as these, except for "--model" and "--init-model", and without "--polyencoder-type", "--codes-attention-type", "--poly-n-codes", "--poly-attention-type", and "--polyencoder-attention-keys"?

@JiaQiSJTU
Author

I tried this hyper-parameter setting (with "-pyt dstc7 --eval-batchsize 10 --model-file " added).

The result is still not good:
'hits@1': 0.641, 'hits@10': 0.889, 'mrr': 0.727.

While in the paper, the results are:
hits@1 = 70.9 ± 0.6; hits@10 = 91.5 ± 0.5; mrr = 78.0 ± 0.3.

Hmm, is there anything else I can do to fix this?

@klshuster
Contributor

Would you mind providing your full train log, perhaps in a github gist?

@JiaQiSJTU
Author

Here is the link to two training logs: https://github.com/JiaQiSJTU/poly-encoder

@klshuster
Contributor

So we actually used an augmented training set for our models; once #2314 is merged, you can specify -t dstc7:DSTC7TeacherAugmentedSampled to train on this augmented set.
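
Once that is merged, the only change to the training command should be the task flag, for example (model-file path again just a placeholder):

python -u examples/train_model.py \
    -t dstc7:DSTC7TeacherAugmentedSampled \
    --model-file /tmp/dstc7_poly/model \
    [remaining flags as before]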

@JiaQiSJTU
Author

Thanks a lot! I still have some questions:

  1. How did you do the data augmentation?
  2. Did the results in the poly-encoder paper use the augmented data or just the original data?
  3. It seems that "hits@1" and "mrr" are greatly improved by the data augmentation. I'm not sure whether the improvement comes from the pre-trained model or from the data augmentation.

@klshuster
Contributor

A better way of phrasing this is that the data was not "augmented" but rather presented to the agent on an episodic basis. That is, similar to how the ConvAI2 dataset is presented (utterance by utterance), we "augmented" the DSTC7 data to include intermediate utterance predictions.

For example, suppose we have a dialogue between two speakers with utterances [A, B, C, D], and we are attempting to model speaker 2. In the original dataset, the conversation would be presented to a model as:

text: A,B,C
label: D

In the augmented version, we present the data as:

episode 1:
    text: A
    label: B

episode 2:
    text: A, B, C
    label: D

The results in the paper use this representation. All of our reported results use this dataset, so it is still clear that the pre-training helps.
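
If you want to inspect exactly what the model receives, you can print a few episodes of the task with ParlAI's display_data script once the augmented teacher is available, e.g.:

python -u examples/display_data.py -t dstc7:DSTC7TeacherAugmentedSampled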

@JiaQiSJTU
Author

Got it!

@klshuster
Contributor

I'll close this for now; please re-open if you run into further issues.

@sjkoo1989

sjkoo1989 commented Mar 15, 2020

Hi @klshuster, would you mind if I ask a couple of questions?

  1. We found that interactive.py requires the user to input persona sentences (for ConvAI2). Are they necessary for reproducing the training process? If so, which token did you use to separate the persona sentences? Are the persona sentences considered part of the context input?

  2. In a previous reply, you mentioned that contexts are augmented as
    episode N:
    text: A1, B1, A2, B2, ...
    label: BN
    However, in reproducing your results, we found that choosing an adequate turn-separation scheme is important when we use base BERT weights from google-bert (& hugging-face). How did you manage to separate the sentences in a given context?

@klshuster
Contributor

  1. For training on dstc7, persona information sentences are not necessary; for training on convai2, the persona sentences are already included in the input to the model, so there is no additional work to be done there.

  2. We simply used a newline delimiter to separate the sentences in the given context.

@sjkoo1989

@klshuster Thanks for your kind response!
Would you mind if I ask another question?
How can we set google-BERT as the initial model?
(It seems that not passing --init-model would work, but we are not sure about it.)

@klshuster
Contributor

We offer the google-BERT model as an init model for the bi-encoder and cross-encoder - see https://github.com/facebookresearch/ParlAI/tree/master/parlai/agents/bert_ranker. We have no plans to release a poly-encoder-based agent in this paradigm.

However, if you are using the transformer/polyencoder and specify --init-model zoo:pretrained_transformers/poly_model_huge_wikito/model, you will initialize the model with weights that were pre-trained on the same data as google-BERT (i.e. wikipedia and toronto books) and which obtain similar results as google-BERT on the tasks we considered in the paper.
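
Concretely, that just means swapping the init model in the training command sketched earlier, e.g.:

python -u examples/train_model.py \
    -t dstc7 \
    --init-model zoo:pretrained_transformers/poly_model_huge_wikito/model \
    [remaining flags as before]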
