How to perform data augmentation? #37
Thanks for the awesome work! I have a question about the examples (not about the framework itself).
In the examples, HotpotQA is used for data augmentation on CoNLL-2003, and NewsQA is used for data augmentation on SQuAD. Could you describe how that is done? If possible, providing the augmented datasets would also help a lot in reproducing the results.
Thank you again!
Comments
For SQuAD, we select all the passage-question pairs that are answerable (i.e., have non-empty answer spans in the passages) from the NewsQA training set and treat them as SQuAD PQ pairs.
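As an illustration, here is a minimal sketch of that selection step, assuming NewsQA's combined JSON layout in which each question carries a "consensus" dict with character offsets "s" and "e" when an answer span exists (the field names are assumptions, not taken from the authors' scripts):
import json

def answerable_pairs(newsqa_json_path):
    # Keep only passage-question pairs whose consensus answer span is non-empty.
    with open(newsqa_json_path) as f:
        stories = json.load(f)["data"]
    pairs = []
    for story in stories:
        for q in story["questions"]:
            span = q.get("consensus", {})
            if "s" in span and "e" in span:  # answerable: has a character span
                pairs.append({
                    "context": story["text"],
                    "question": q["q"],
                    "answer_text": story["text"][span["s"]:span["e"]],
                    "answer_start": span["s"],
                })
    return pairs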
For CoNLL-2003, we randomly select passages from HotpotQA (Wikipedia articles) and split them into sentences. We tokenize the sentences with NLTK and treat each tokenized sentence as an NER example. The augmented dataset contains about 10M tokens.
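A minimal sketch of that splitting and tokenization step with NLTK (the function name is illustrative):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

def passage_to_ner_examples(passage):
    # One tokenized sentence per (unlabeled) NER example.
    return [word_tokenize(sentence) for sentence in sent_tokenize(passage)]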
We feed all the examples (original and augmented) to the teachers and ask the student to learn from the teachers' representations.
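Schematically, the augmented examples need no gold labels, because the student matches the teacher's soft outputs. A generic sketch of one such step in PyTorch (illustrative names, not the repo's API):
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, T=4.0):
    with torch.no_grad():
        t_logits = teacher(**batch).logits   # teacher runs in inference mode
    s_logits = student(**batch).logits
    # KL between temperature-softened distributions, scaled by T^2.
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()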
Hope it helps.
Thank you for the quick reply! That answers my question.
Sorry to bother you again, but I am having trouble reproducing the NER results in the paper.
Using the example script (./examples/conll2003_example/), I can reproduce BERT's result in the supervised setting (I got 91.3 vs. 91.1 in the paper), but my distilled results are much worse.
I distill the logits (CE loss) as well as the intermediate hidden states (MSE and MMD losses), as described in the paper.
I use the following hyperparameters: lr=1e-4, batch size=32, warmup steps=0.1, epochs=100.
And I got: F1 (T3) = 84.6, F1 (T6) = 89.7.
Here is the training script:
export OUTPUT_DIR=outputs-model-distill
export BATCH_SIZE=32
export GRAD_ACCUM_STEPS=1
export NUM_EPOCHS=100
export SAVE_STEPS=750
export SEED=42
export MAX_LENGTH=128
export BERT_MODEL_TEACHER=outputs-model-base
export BERT_MODEL_STUDENT=bert-base-cased
mkdir -p $OUTPUT_DIR
python run_ner_distill.py \
--data_dir data \
--model_type bert \
--model_name_or_path $BERT_MODEL_TEACHER \
--model_name_or_path_student $BERT_MODEL_STUDENT \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--num_hidden_layers 3 \
--save_steps $SAVE_STEPS \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--seed $SEED \
--do_distill \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache
I think it must be due to an inappropriate experimental setting on my part. Do you have any idea what it could be? @airaria
Maybe you could try distilling without the MMD loss for NER.
As for the hyperparameters, I'll check with my colleague next Monday.
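For reference, an MMD-style (NST) match between teacher and student hidden states can be sketched generically in PyTorch as follows; this is a sketch of the idea, not necessarily TextBrewer's exact implementation:
import torch
import torch.nn.functional as F

def mmd_loss(hidden_s, hidden_t):
    # hidden_*: (batch, seq_len, dim). The Gram matrices are seq_len x seq_len,
    # so the student and teacher hidden sizes may differ.
    def normalized_gram(h):
        g = torch.bmm(h, h.transpose(1, 2))       # (batch, seq_len, seq_len)
        return F.normalize(g.flatten(1), dim=1)   # unit Frobenius norm
    diff = normalized_gram(hidden_s) - normalized_gram(hidden_t)
    return diff.pow(2).sum(dim=1).mean()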
Hi @airaria, do you have any findings? I tried removing the MMD loss, but it did not improve the F1; I also tried a smaller learning rate, but that did not help either.
When I used data augmentation on T3 and T6, their F1 scores did exceed the reported ones. However, as stated in this document, only T4-tiny and T12-nano use data augmentation, so there must be other factors affecting my results.
Sorry for the lack of clarity in the experimental settings!
We treated CoNLL-2003 differently: we fine-tune the student model before distillation.
Take T3 for example. Our procedure is:
1. Initialize T3 with the embeddings and the weights of the first 3 layers of BERT-base.
2. Fine-tune T3 on the training set.
3. Perform knowledge distillation on the fine-tuned model.
This strategy improves T3 by about 2 points.
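A minimal sketch of step 1 with Hugging Face transformers, assuming that loading bert-base-cased with a truncated config keeps the embeddings and the first 3 encoder layers because their parameter names match (num_labels=9 is the usual CoNLL-2003 label count; the variable names are illustrative):
from transformers import BertConfig, BertForTokenClassification

config = BertConfig.from_pretrained("bert-base-cased",
                                    num_hidden_layers=3, num_labels=9)
student = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                     config=config)
# Step 2: fine-tune `student` on the CoNLL-2003 training set.
# Step 3: distill from the fine-tuned teacher into this fine-tuned student.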
Thanks a lot! That's very useful!