KeyError during pseudo labeling #11

sudhanshu-shukla-git · 2022-06-03T13:44:47Z

Hi,

I am hitting a KeyError during pseudo labeling. It looks like the selected pos_pid is not found in the corpus.

INFO [gpl.toolkit.pl.run:60] Begin pseudo labeling
.....
File ~gpl/toolkit/dataset.py:78, in HardNegativeDataset._sample_tuple(self, query_dict)
     75 query_text = self.queries[query_dict['qid']]
     77 pos_pid = random.choice(pos_pids)
---> 78 pos_text = concat_title_and_body(pos_pid, self.corpus, self.sep)
     80 neg_pid = random.choice(list(neg_pids))
     81 neg_text = concat_title_and_body(neg_pid, self.corpus, self.sep)

File ~gpl/toolkit/dataset.py:12, in concat_title_and_body(did, corpus, sep)
     10 def concat_title_and_body(did, corpus, sep):
     11     document = []
---> 12     title = corpus[did]['title'].strip()
     13     body = corpus[did]['text'].strip()
     14     if len(title):

KeyError: '92974'

My corpus has the structure below. Does the order of the _id field or of the numbers matter?

{"text":"This is the domain text","_id":3,"title":"","metadata":{}}
{"text":"This is the domain text 2","_id":4,"title":"","metadata":{}}

Code to train:

gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    mnrl_output_dir="mnrl_output_dir",
    mnrl_evaluation_output="mnrl_evaluation_output",
    base_ckpt="distilbert-base-uncased",  
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',  
    # The starting checkpoint of the experiments in the paper
    gpl_score_function="dot",
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl=64,
    gpl_steps=140000,
    new_size=-1,
    # Resize the corpus to `new_size` (|corpus|) if needed. When set to None (by default), the |corpus| will be the full size. When set to -1, the |corpus| will be set automatically: If QPP * |corpus| <= 250K, |corpus| will be the full size; else QPP will be set 3 and |corpus| will be set to 250K / 3
    queries_per_passage=-1,
    # Number of Queries Per Passage (QPP) in the query generation step. When set to -1 (by default), the QPP will be chosen automatically: If QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else QPP will be set 3 and |corpus| will be set to 250K / 3
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    # Note that these two retriever models work with cosine-similarity
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    # This prefix will appear as part of the (folder/file) names for query-generation results: For example, we will have "qgen-qrels/" and "qgen-queries.jsonl" by default.
    do_evaluation=True,
    # --use_amp   # One can use this flag for enabling the efficient float16 precision
)

Could you help me find what I am missing or doing wrong?

ahadda5 · 2022-06-06T13:32:40Z

I'm exactly here :) Still trying to figure it out.
Some thoughts:

I don't believe the corpus was changed between the pos/neg reading and generation and this point (pseudo labeling). Looking at the file (corpus.jsonl), that _id does exist, so I'm puzzled as to why it cannot recognize that key now.
Also, checking the .tsv, the generated queries have that matching _id.

I wonder what happens to that corpus in between being read from file and getting to that point?!

ahadda5 · 2022-06-06T15:33:31Z

Oh well! Our mistake is that corpus.jsonl has the ids as ints, not strings. The data loader expects them to be strings, so the lookup fails on that key.

Change the corpus.jsonl to have string _ids.
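A minimal sketch of that fix, using only the standard library. It assumes the loader keys the corpus by the parsed _id value as-is, which is why an integer 3 in the file never matches a string "3" read from the .tsv. The names stringify_ids and rewrite_corpus are hypothetical helpers, not part of GPL.

```python
import json


def stringify_ids(lines):
    """Yield corpus records with `_id` coerced to a string."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record["_id"] = str(record["_id"])
        yield record


def rewrite_corpus(src_path, dst_path):
    """Hypothetical helper: copy a corpus.jsonl, writing string ids."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in stringify_ids(src):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")


# The two records from the question, with integer ids.
raw = [
    '{"text":"This is the domain text","_id":3,"title":"","metadata":{}}',
    '{"text":"This is the domain text 2","_id":4,"title":"","metadata":{}}',
]

# Assumption: the loader keys the corpus by the parsed `_id` value as-is.
corpus_int = {rec["_id"]: rec for rec in map(json.loads, raw)}
corpus_str = {rec["_id"]: rec for rec in stringify_ids(raw)}

print("3" in corpus_int)  # False: the key is the int 3, not the string "3"
print("3" in corpus_str)  # True
```

Calling rewrite_corpus with the path of the existing corpus.jsonl and a new output path would produce a file with string ids; writing to a separate file keeps the original intact until the result has been checked.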

sudhanshu-shukla-git · 2022-06-06T15:42:14Z

@ahadda5 Thanks. Yes, I also have the _ids as int. Let me change them to strings and try again.

kwang2049 · 2022-06-08T19:06:07Z

Thanks to both of you for your attention, @ahadda5 @sudhanshu-shukla-git! I will add a type assertion (assert type(did) == str) here.

This setting follows the one in the BeIR repo. I think using strings instead of integers makes the IDs more universal.
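A rough sketch of what such an assertion could look like, based on the lines of concat_title_and_body visible in the traceback. Everything after the `if len(title):` line is an assumption, since the traceback cuts off there, and this is not the code from the actual change. Note that in the traceback the looked-up id was already the string '92974', so a check on the corpus keys may be useful as well; check_corpus_ids below is a hypothetical helper for that.

```python
def concat_title_and_body(did: str, corpus: dict, sep: str) -> str:
    # The assertion proposed above: ids must be strings.
    assert type(did) == str, f"expected a str id, got {type(did).__name__}"
    document = []
    title = corpus[did]["title"].strip()
    body = corpus[did]["text"].strip()
    if len(title):
        document.append(title)
    # Assumption: the rest of the function is not shown in the traceback.
    if len(body):
        document.append(body)
    return sep.join(document)


def check_corpus_ids(corpus: dict) -> None:
    """Hypothetical extra check: every corpus key must be a string."""
    bad = [key for key in corpus if not isinstance(key, str)]
    if bad:
        raise TypeError(f"non-string corpus ids, e.g. {bad[0]!r}")


corpus = {"3": {"title": "A title", "text": "This is the domain text"}}
check_corpus_ids(corpus)  # passes silently: all keys are strings
print(concat_title_and_body("3", corpus, " "))  # A title This is the domain text
```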

kwang2049 · 2022-06-08T19:48:39Z

Have added the type hints and assertion: #12

EvilFreelancer · 2024-05-09T21:17:41Z

Hello! For those who have encountered this issue during dataset generation using pandas, the following data type conversion may be helpful for transforming a column:

df = df.astype({'_id': 'string'})
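For context, a small sketch of how that conversion might fit into building the corpus with pandas. The column values are the sample records from the question, and serializing with to_json(orient="records", lines=True) is one possible way to get the one-object-per-line layout, not necessarily how the corpus was produced here.

```python
import json

import pandas as pd

# Sample records from the question, with integer ids.
df = pd.DataFrame(
    {
        "_id": [3, 4],
        "title": ["", ""],
        "text": ["This is the domain text", "This is the domain text 2"],
    }
)

# Convert the id column to pandas' string dtype before serializing.
df = df.astype({"_id": "string"})

# One JSON object per line, as in corpus.jsonl.
jsonl = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in jsonl.splitlines() if line]
```

Passing a file path as the first argument of to_json writes the same content to disk instead of returning it as a string.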

kwang2049 mentioned this issue Jun 8, 2022

set mnrl_** to None by default; type assertion for concat_title_and_body #12

Merged
