v0.1.0: Updated paper (accepted by NAACL 2022); automatic hyper-parameter; released checkpoints (TAS-B ones)

@kwang2049 kwang2049 released this 19 Apr 17:09

Updated paper, accepted by NAACL 2022

The GPL paper has been accepted by NAACL 2022! Major updates:

  • Improved the setup: down-sampled the corpus if it is too large, and calculated the number of generated queries according to the corpus size;
  • Added more analysis of the influence of the number of generated queries: smaller corpora need more queries;
  • Added results on the full 18 BeIR datasets: the conclusions remain the same, and we also trained GPL on top of the powerful TAS-B model and achieved further improvements.

Automatic hyper-parameter

Previously, we used the whole corpus with queries_per_passage = 3, regardless of the corpus size. This results in very poor training efficiency for large corpora. In the new version, we set these two hyper-parameters automatically so that the total number of generated queries is 250K.

In detail, we first fix queries_per_passage >= 3 and uniformly down-sample the corpus if 3 × |C| > 250K, where |C| is the corpus size; we then calculate queries_per_passage = 250K/|C|. For example, the queries_per_passage values for FiQA (original size = 57.6K) and Robust04 (original size = 528.2K) are 5 and 3, respectively, and the Robust04 corpus is down-sampled to 83.3K.
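The rule above can be sketched as follows. This is a minimal reconstruction of the described logic, not the actual GPL source; it assumes the division 250K/|C| is rounded up, which reproduces the FiQA and Robust04 numbers quoted above.

```python
import math

TOTAL_QUERIES = 250_000  # target total number of generated queries
MIN_QPP = 3              # minimum queries_per_passage


def auto_hparams(corpus_size: int) -> tuple[int, int]:
    """Return (queries_per_passage, effective_corpus_size).

    If even the minimum of 3 queries per passage would exceed the
    250K budget, uniformly down-sample the corpus instead.
    """
    if MIN_QPP * corpus_size > TOTAL_QUERIES:
        # Corpus too large: keep queries_per_passage at 3 and
        # shrink the corpus so that 3 * |C| = 250K.
        return MIN_QPP, TOTAL_QUERIES // MIN_QPP
    # Corpus small enough: generate more queries per passage.
    return math.ceil(TOTAL_QUERIES / corpus_size), corpus_size


# FiQA (57.6K passages): 3 * 57.6K < 250K, so qpp = ceil(250K / 57.6K) = 5
print(auto_hparams(57_600))
# Robust04 (528.2K passages): down-sampled to 250K / 3 ≈ 83.3K, qpp = 3
print(auto_hparams(528_200))
```

With this rule, small corpora such as FiQA get more queries per passage, while large corpora such as Robust04 are down-sampled, keeping the generation budget roughly constant across datasets.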

Released checkpoints (TAS-B ones)

We now release the pre-trained GPL models at https://huggingface.co/GPL. They also include the powerful GPL models trained on top of TAS-B.