interpreting intermediate matches #103
First, let me explain the distillation configuration above in case it is confusing. BERT-base has 12 layers, which we number 1, 2, ..., 12. Since the hidden dimensions of the teacher and the student differ, we use a linear mapping from 312 (the student's dim) to 768 (the teacher's dim) to project the student's hidden states into the higher-dimensional space. For the above mappings, we take the 'hidden' features (which should be defined in the adaptor by the user; it is the user's responsibility to tell TextBrewer what 'hidden' is) from each layer and compute the 'hidden_mse' loss (defined in losses.py) between the student's and the teacher's features. The following lines use a different loss, 'nst', which requires two similarity matrices.

For a three-layer, thinner BERT (T3-small), you can map the layers 0-0, 4-1, 8-2, 12-3, and use 'proj': ['linear', 384, 768] to match the dimensions.
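As a minimal sketch of the 'nst' matches mentioned above: the specific layer pair (teacher 3, student 1) is an assumption for illustration, but the two-element-list form of `layer_T`/`layer_S` for 'nst' follows TextBrewer's preset configurations:

```python
intermediate_matches = [
    # hidden_mse: teacher layer 3 <-> student layer 1, with a linear
    # projection from the student dim (312) up to the teacher dim (768)
    {'layer_T': 3, 'layer_S': 1, 'feature': 'hidden',
     'loss': 'hidden_mse', 'weight': 1, 'proj': ['linear', 312, 768]},
    # nst: the loss compares similarity matrices built from a pair of teacher
    # features and a pair of student features, so each side is a two-element
    # list; no 'proj' is needed, since the similarity matrices have the same
    # shape on both sides regardless of hidden size
    {'layer_T': [3, 3], 'layer_S': [1, 1], 'feature': 'hidden',
     'loss': 'nst', 'weight': 1},
]
```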
Thank you so much for the detailed explanation. If you could add this to the docs, it would be super useful. I am following up on the conll2003 example. I changed the distill_config as follows (I am using Transformers 4.17.0):

```python
distill_config = DistillationConfig(
temperature = 8,
# intermediate_matches = [{'layer_T':10, 'layer_S':3, 'feature':'hidden','loss': 'hidden_mse', 'weight' : 1}]
intermediate_matches = [
{'layer_T':0, 'layer_S':0, 'feature':'hidden','loss': 'hidden_mse', 'weight' : 1,'proj':['linear',384,768]},
{'layer_T':4, 'layer_S':1, 'feature':'hidden','loss': 'hidden_mse', 'weight' : 1,'proj':['linear',384,768]},
{'layer_T':8, 'layer_S':2, 'feature':'hidden','loss': 'hidden_mse', 'weight' : 1,'proj':['linear',384,768]},
{'layer_T':12, 'layer_S':3, 'feature':'hidden','loss': 'hidden_mse', 'weight' : 1,'proj':['linear',384,768]}]
)
```

The run_conll2003_distill_T3.sh file looks as follows:

```bash
export OUTPUT_DIR="resource/taggers/T3-small-bert-finetuned"
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=42
export MAX_LENGTH=128
export BERT_MODEL_TEACHER="resource/taggers/bert-finetuned"
python run_ner_distill.py \
--data_dir english_dataset \
--model_type bert \
--labels label_prod.txt \
--model_name_or_path $BERT_MODEL_TEACHER \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--num_hidden_layers 3 \
--save_steps $SAVE_STEPS \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--seed $SEED \
--do_distill \
--do_train \
--do_eval \
--do_predict
```

I am getting an index-out-of-range error. Can you please check?

Traceback (most recent call last):
Did you set the model to return hidden states by setting `output_hidden_states=True`? If it is still not working, would you please print the length of the hidden states returned by the model?
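For reference, a minimal sketch of what is being asked here, assuming a Hugging Face Transformers student model (the model name and adaptor shape are illustrative, not from this thread):

```python
from transformers import BertConfig, BertForTokenClassification

# Make the student return all hidden states; without this,
# outputs.hidden_states is None, and indexing into it in the
# adaptor raises an error.
config = BertConfig.from_pretrained('bert-base-cased',
                                    num_hidden_layers=3,
                                    output_hidden_states=True)
model_S = BertForTokenClassification.from_pretrained('bert-base-cased',
                                                     config=config)

# In the adaptor, expose the hidden states under the 'hidden' key.
def simple_adaptor(batch, model_outputs):
    # hidden_states is a tuple of (num_hidden_layers + 1) tensors:
    # the embedding output plus one tensor per layer. A 3-layer student
    # yields 4 entries (indices 0..3); a 12-layer teacher yields 13 (0..12).
    print(len(model_outputs.hidden_states))  # sanity check for the index error
    return {'logits': model_outputs.logits,
            'hidden': model_outputs.hidden_states}
```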
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the issue, since no updates have been observed. Feel free to re-open if you need any further assistance.
Can you please provide more insight into how to construct intermediate matches for a smaller model, for example T3, T3-small, or BiGRU? This is not clearly stated in the paper or in the documentation.
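For readers with the same question: the pattern from the explanation above generalizes. Here is a hedged sketch (the evenly-spaced layer choice and the helper function are illustrative conventions from this thread, not an official recommendation; BiGRU students need a different feature setup and are not covered):

```python
# General recipe: for a K-layer student distilled from a 12-layer teacher,
# map student layer s (0..K) to teacher layer s * (12 // K), and add a
# 'proj' entry whenever the hidden sizes differ.
def make_hidden_matches(teacher_layers=12, student_layers=3,
                        dim_S=384, dim_T=768):
    step = teacher_layers // student_layers
    matches = []
    for s in range(student_layers + 1):   # include the embedding layer (0)
        match = {'layer_T': s * step, 'layer_S': s,
                 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}
        if dim_S != dim_T:                # e.g. T3-small: project 384 -> 768
            match['proj'] = ['linear', dim_S, dim_T]
        matches.append(match)
    return matches

# T3 (same hidden size as the teacher, so no projection is added):
t3_matches = make_hidden_matches(dim_S=768, dim_T=768)
# T3-small (hidden size 384, projected up to 768); this reproduces the
# 0-0, 4-1, 8-2, 12-3 mapping described earlier in the thread:
t3_small_matches = make_hidden_matches(dim_S=384, dim_T=768)
```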