
v0.0.9: Fixed bug of max.-sequence-length mismatch between student and teacher; keep full precision of pseudo labels

@kwang2049 kwang2049 released this 11 Jan 23:53
· 25 commits to main since this release

Fixed bug of max.-sequence-length mismatch between student and teacher

Previously, the teacher (i.e. the cross-encoder) received the concatenation of the query and document texts as input, with no limit on the max. sequence length (cf. here and here). The student models, however, enforce separate max.-sequence-length limits on the query texts and the document texts. This caused a mismatch between the information visible to the student and the teacher models.

In the new release, we fix this by "retokenization": right before pseudo labeling, we let the tokenizer of the teacher model tokenize the query texts and the document texts separately (with truncation) and then decode the results (token IDs) back into texts. The resulting texts meet the same max.-sequence-length requirements as the student model and thus fix this bug.
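The retokenization step can be sketched as follows. This is a minimal illustration: a toy whitespace tokenizer stands in for the teacher's actual Hugging Face tokenizer, and all names and lengths are hypothetical, not the library's real API.

```python
def truncate_by_retokenization(text, tokenize, detokenize, max_length):
    """Tokenize a text, truncate to max_length tokens, and decode back to text.

    In the library this would use the cross-encoder teacher's tokenizer
    (e.g. with truncation=True and the student's max_length); here we use
    a toy whitespace tokenizer so the sketch is self-contained.
    """
    tokens = tokenize(text)[:max_length]
    return detokenize(tokens)


# Toy stand-ins for tokenizer.tokenize / tokenizer.decode:
tokenize = str.split
detokenize = " ".join

query = "what is the capital of france"
doc = "paris is the capital and most populous city of france " * 20

# Truncate query and document *separately*, matching the student's
# per-text limits, before they are concatenated for the teacher:
query_t = truncate_by_retokenization(query, tokenize, detokenize, 64)
doc_t = truncate_by_retokenization(doc, tokenize, detokenize, 50)

teacher_input = (query_t, doc_t)  # what the cross-encoder now scores
```

After this step, the teacher scores exactly the text the student can see, rather than the untruncated concatenation.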

Keep full precision of the pseudo labels

Previously, we dumped the pseudo labels directly from PyTorch tensors, which does not preserve full precision. We have fixed this by calling labels.tolist() right before dumping the data. In practice the impact is small, since the old format already kept six digits of precision, which was high enough.
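The precision difference can be illustrated with a small sketch. It simulates the old six-digit formatting against a plain Python float, which is what tensor.tolist() would return; the label value here is illustrative:

```python
import json

# A hypothetical pseudo-label score (illustrative value):
label = 3.14159265358979

# Old behavior (simulated): dumping via a 6-digit string representation
# rounds the value.
truncated = float(f"{label:.6f}")

# New behavior: convert to plain Python floats first (as tensor.tolist()
# does), so json.dumps serializes the full precision.
full = label

print(json.dumps({"truncated": truncated, "full": full}))
```

The rounding error is on the order of 1e-7, which is why the old labels were already precise enough for training, even though the new format is exact.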