
How to perform data augmentation? #37

Closed
LorrinWWW opened this issue Jan 13, 2021 · 7 comments

@LorrinWWW

Thanks for the awesome work! I have a question about the examples (not about the framework itself).
In the examples, HotpotQA is used for data augmentation on CoNLL-2003, and NewsQA is used for data augmentation on SQuAD. Can you describe how to do that? If possible, providing the augmented datasets would also help a lot in reproducing the results in the examples.
Thank you again!

@airaria
Owner

airaria commented Jan 14, 2021 via email

@LorrinWWW
Author

Thank you for the quick reply! That answers my question.

@LorrinWWW
Author

Sorry to bother you again, but I am having trouble reproducing the NER results in the paper.

Using the example script (./examples/conll2003_example/), I can successfully reproduce BERT's result in the supervised setting (I got 91.3 vs. 91.1 in the paper), but my distilled results are much worse.

I distill the logits (CE loss) as well as the intermediate hidden states (MSE and MMD losses) as described in the paper.
I use the following hyper-parameters: lr=1e-4, batch size=32, warmup steps=0.1, epochs=100.
I got F1 = 84.6 for T3 and F1 = 89.7 for T6.
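For concreteness, here is a rough numpy sketch of the loss combination described above (my own illustration for this thread, not the repository's actual implementation; the MMD term below uses one common Gram-matrix/NST-style formulation, and all function names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_ce_loss(student_logits, teacher_logits, T=1.0):
    """KD cross-entropy between temperature-softened distributions."""
    p = softmax(teacher_logits / T)
    log_q = np.log(softmax(student_logits / T))
    return float(-(p * log_q).sum(axis=-1).mean())

def hidden_mse_loss(student_h, teacher_h):
    """MSE between (dimension-matched) intermediate hidden states."""
    return float(((student_h - teacher_h) ** 2).mean())

def hidden_mmd_loss(student_h, teacher_h):
    """MMD-style loss: match normalized token-similarity (Gram) matrices,
    which also works when student/teacher hidden sizes differ."""
    def gram(h):
        g = h @ h.T  # (seq_len, seq_len) similarity matrix
        return g / (np.linalg.norm(g) + 1e-12)
    d = gram(student_h) - gram(teacher_h)
    return float((d ** 2).sum())

def total_distill_loss(s_logits, t_logits, s_h, t_h, T=1.0):
    """Sum of the three terms; real setups usually weight each term."""
    return (soft_ce_loss(s_logits, t_logits, T)
            + hidden_mse_loss(s_h, t_h)
            + hidden_mmd_loss(s_h, t_h))
```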

Here is the train script:

export OUTPUT_DIR=outputs-model-distill
export BATCH_SIZE=32
export GRAD_ACCUM_STEPS=1
export NUM_EPOCHS=100
export SAVE_STEPS=750
export SEED=42
export MAX_LENGTH=128
export BERT_MODEL_TEACHER=outputs-model-base
export BERT_MODEL_STUDENT=bert-base-cased

mkdir -p $OUTPUT_DIR

python run_ner_distill.py \
--data_dir data \
--model_type bert \
--model_name_or_path $BERT_MODEL_TEACHER \
--model_name_or_path_student $BERT_MODEL_STUDENT \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--num_hidden_layers 3 \
--save_steps $SAVE_STEPS \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--seed $SEED \
--do_distill \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache

I suspect this is due to an inappropriate experimental setting on my side. Do you have any ideas? @airaria

@airaria
Owner

airaria commented Jan 16, 2021 via email

@LorrinWWW
Author

Hi @airaria, do you have any findings? I tried removing the MMD loss, but it did not improve the F1; I also tried a smaller learning rate, but that did not help either.
When I used data augmentation with T3 and T6, their F1 scores did exceed the reported ones. However, as stated in the documentation, only T4-tiny and T12-nano use data augmentation, so there must be other factors affecting my results.
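To make the augmentation setup concrete, here is a minimal, hypothetical sketch of the usual recipe (my own illustration, not the repository's API; `TeacherStub` and `build_distill_examples` are made-up names): extra in-domain text such as HotpotQA passages is added without gold NER labels, and only the teacher's soft logits supervise the student on those sentences.

```python
class TeacherStub:
    """Stands in for the fine-tuned teacher; returns one logit vector per token."""
    def logits(self, tokens):
        # A real teacher would run the model; here we fake 2-class logits.
        return [[0.0, 1.0] for _ in tokens]

def build_distill_examples(labeled, unlabeled, teacher):
    """Mix gold-labeled task data with unlabeled augmentation text.

    Augmentation sentences carry no gold labels; only the teacher's
    soft logits supervise the student on them.
    """
    examples = []
    for tokens, gold in labeled:
        examples.append({"tokens": tokens, "gold": gold,
                         "teacher_logits": teacher.logits(tokens)})
    for tokens in unlabeled:  # e.g. tokenized HotpotQA passages
        examples.append({"tokens": tokens, "gold": None,
                         "teacher_logits": teacher.logits(tokens)})
    return examples
```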

@airaria
Owner

airaria commented Jan 19, 2021 via email

@LorrinWWW
Author

Thanks a lot! That's very useful!
