How to perform data augmentation? #37
Thanks for the awesome work! I have a question about the examples (not about the framework itself).
In the examples, HotpotQA is used for data augmentation on CoNLL-2003, and NewsQA is used for data augmentation on SQuAD. Could you describe how that is done? If possible, providing the augmented datasets would also help a lot in reproducing the results.
Thank you again!
Comments
For SQuAD, we select all the passage-question pairs that are answerable (i.e., have non-empty answer spans in the passages) from the NewsQA training set and treat them as SQuAD PQ pairs.
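As an illustration, here is a minimal sketch of that selection step, assuming NewsQA's combined JSON layout in which each question carries a "consensus" dict with character offsets "s" and "e" when an answer span exists (the field names are assumptions, not taken from the authors' scripts):
import json

def answerable_pairs(newsqa_json_path):
    # Keep only passage-question pairs whose consensus answer span is non-empty.
    with open(newsqa_json_path) as f:
        stories = json.load(f)["data"]
    pairs = []
    for story in stories:
        for q in story["questions"]:
            span = q.get("consensus", {})
            if "s" in span and "e" in span:  # answerable: has a character span
                pairs.append({
                    "context": story["text"],
                    "question": q["q"],
                    "answer_text": story["text"][span["s"]:span["e"]],
                    "answer_start": span["s"],
                })
    return pairs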
For CoNLL-2003, we randomly select passages from HotpotQA (Wikipedia articles) and split them into sentences. We tokenize the sentences with NLTK and treat each tokenized sentence as an NER example. The augmented dataset contains about 10M tokens.
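A minimal sketch of that splitting and tokenization step with NLTK (the function name is illustrative):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

def passage_to_ner_examples(passage):
    # One tokenized sentence per (unlabeled) NER example.
    return [word_tokenize(sentence) for sentence in sent_tokenize(passage)]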
We feed all the examples (original and augmented) to the teachers and ask the student to learn from the teachers' representations.
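Schematically, the augmented examples need no gold labels, because the student matches the teacher's soft outputs. A generic sketch of one such step in PyTorch (illustrative names, not the repo's API):
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, T=4.0):
    with torch.no_grad():
        t_logits = teacher(**batch).logits   # teacher runs in inference mode
    s_logits = student(**batch).logits
    # KL between temperature-softened distributions, scaled by T^2.
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()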
Hope it helps.
Thank you for the quick reply! That answers my question.
Sorry to bother you again, but I am having trouble reproducing the NER results in the paper.
Using the example script (./examples/conll2003_example/), I can reproduce BERT's result in the supervised setting (I got 91.3 vs. 91.1 in the paper), but my distilled results are much worse.
I distill the logits (CE loss) as well as the intermediate hidden states (MSE and MMD losses), as described in the paper.
I use the following hyperparameters: lr=1e-4, batch size=32, warmup steps=0.1, epochs=100.
And I got: F1 (T3) = 84.6, F1 (T6) = 89.7.
Here is the training script:
export OUTPUT_DIR=outputs-model-distill
export BATCH_SIZE=32
export GRAD_ACCUM_STEPS=1
export NUM_EPOCHS=100
export SAVE_STEPS=750
export SEED=42
export MAX_LENGTH=128
export BERT_MODEL_TEACHER=outputs-model-base
export BERT_MODEL_STUDENT=bert-base-cased
mkdir -p $OUTPUT_DIR
python run_ner_distill.py \
--data_dir data \
--model_type bert \
--model_name_or_path $BERT_MODEL_TEACHER \
--model_name_or_path_student $BERT_MODEL_STUDENT \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--num_hidden_layers 3 \
--save_steps $SAVE_STEPS \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--seed $SEED \
--do_distill \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache
I think it must be due to an inappropriate experimental setting on my part. Do you have any idea what it could be? @airaria
Maybe you could try distilling without the MMD loss for NER.
As for the hyperparameters, I'll check with my colleague next Monday.
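For reference, an MMD-style (NST) match between teacher and student hidden states can be sketched generically in PyTorch as follows; this is a sketch of the idea, not necessarily TextBrewer's exact implementation:
import torch
import torch.nn.functional as F

def mmd_loss(hidden_s, hidden_t):
    # hidden_*: (batch, seq_len, dim). The Gram matrices are seq_len x seq_len,
    # so the student and teacher hidden sizes may differ.
    def normalized_gram(h):
        g = torch.bmm(h, h.transpose(1, 2))       # (batch, seq_len, seq_len)
        return F.normalize(g.flatten(1), dim=1)   # unit Frobenius norm
    diff = normalized_gram(hidden_s) - normalized_gram(hidden_t)
    return diff.pow(2).sum(dim=1).mean()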
Hi @airaria, do you have any findings? I tried removing the MMD loss, but it did not improve the F1; I also tried a smaller learning rate, but that did not help either.
When I used data augmentation on T3 and T6, their F1 scores did exceed the reported ones. However, as stated in this document, only T4-tiny and T12-nano use data augmentation, so there must be other factors affecting my results.
Sorry for the lack of clarity in the experimental settings!
We treated CoNLL-2003 differently: we fine-tune the student model before distillation.
Take T3 for example. Our procedure is:
1. Initialize T3 with the embeddings and the weights of the first 3 layers of BERT-base.
2. Fine-tune T3 on the training set.
3. Perform knowledge distillation on the fine-tuned model.
This strategy improves T3 by about 2 points.
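A minimal sketch of step 1 with Hugging Face transformers, assuming that loading bert-base-cased with a truncated config keeps the embeddings and the first 3 encoder layers because their parameter names match (num_labels=9 is the usual CoNLL-2003 label count; the variable names are illustrative):
from transformers import BertConfig, BertForTokenClassification

config = BertConfig.from_pretrained("bert-base-cased",
                                    num_hidden_layers=3, num_labels=9)
student = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                     config=config)
# Step 2: fine-tune `student` on the CoNLL-2003 training set.
# Step 3: distill from the fine-tuned teacher into this fine-tuned student.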
Thanks a lot! That's very useful!