issue while running the training script #5

aqsa27 · 2022-05-09T18:01:02Z

Hi,

I have created a custom corpus.jsonl in the format described in the instructions.
I was able to install the gpl library successfully on a Mac.

I use the following piece of code:
import gpl

dataset = 'fiqa'
gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    base_ckpt="distilbert-base-uncased",
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',
    # The starting checkpoint of the experiments in the paper
    gpl_score_function="dot",
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl=32,
    gpl_steps=140000,
    new_size=-1,
    # Resize the corpus to new_size (|corpus|) if needed. When set to None (by default), the |corpus| will be the full size. When set to -1, the |corpus| will be set automatically: If QPP * |corpus| <= 250K, |corpus| will be the full size; else QPP will be set 3 and |corpus| will be set to 250K / 3
    queries_per_passage=-1,
    # Number of Queries Per Passage (QPP) in the query generation step. When set to -1 (by default), the QPP will be chosen automatically: If QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else QPP will be set 3 and |corpus| will be set to 250K / 3
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    # Note that these two retriever model work with cosine-similarity
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    # This prefix will appear as part of the (folder/file) names for query-generation results: For example, we will have "qgen-qrels/" and "qgen-queries.jsonl" by default.
    do_evaluation=True,
    # --use_amp # One can use this flag for enabling the efficient float16 precision
)

I have changed the following path:

path_to_generated_data=f"generated/{dataset}",
Here I set the path to my custom data, corpus.jsonl.

When I run this file, I get the following error:

train() missing 2 required positional arguments: "mnrl_output_dir" and "mnrl_evaluation_output"

My goal is domain adaptation for a semantic search task in which the questions are phrased as sentences.

Could you please let me know the exact steps to train on custom data?

kwang2049 · 2022-06-01T23:50:38Z

Hi @aqsa27, sorry for the delay. Could you please try this quick toy example:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
unzip fiqa.zip
export dataset="fiqa"
python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 4 \
    --gpl_steps 100 \
    --new_size 10 \
    --queries_per_passage 1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen"

I have just tried this (building the env from scratch) and it works. Please keep the same data format and argument format as in the example (e.g. make sure --path_to_generated_data gets a directory instead of a file).
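
For a custom corpus, the data format point above could be sketched roughly as follows. This is only an illustration, assuming the BeIR-style layout where each line of corpus.jsonl is a JSON object with "_id", "title" and "text"; the helper name write_corpus and the example directory are hypothetical and not part of gpl.

```python
import json
from pathlib import Path


def write_corpus(docs, data_dir):
    """Write docs as BeIR-style JSON lines to <data_dir>/corpus.jsonl.

    Hypothetical helper (not part of gpl). Each doc is a dict with a
    required "text" and optional "_id" and "title".
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    corpus_path = data_dir / "corpus.jsonl"
    with corpus_path.open("w", encoding="utf-8") as f:
        for i, doc in enumerate(docs):
            record = {
                "_id": str(doc.get("_id", i)),  # fall back to the position as id
                "title": doc.get("title", ""),  # empty title if none is given
                "text": doc["text"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return corpus_path
```

Calling write_corpus(docs, "generated/custom") would create generated/custom/corpus.jsonl, and the directory "generated/custom" (not the file itself) is then what goes into --path_to_generated_data.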

ahadda5 · 2022-06-02T15:36:56Z

@aqsa27 your target corpus should be named 'corpus.jsonl' to start with. You also need to have the folder 'generated/CustomDB'.
Whenever I got that "mnrl positional" error, it was either because my data was in BeIR format but the file was not named exactly corpus.jsonl, or because the folder structure was off.
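
A quick pre-flight check along these lines might catch both cases before training starts. It is a minimal sketch, not something gpl provides: the function name check_generated_dir is hypothetical, and it only checks the two conditions described above.

```python
from pathlib import Path


def check_generated_dir(path_to_generated_data):
    """Return a list of layout problems; an empty list means it looks fine.

    Hypothetical helper (not part of gpl). It checks only that the path is
    a directory and that it contains a file named exactly corpus.jsonl.
    """
    problems = []
    path = Path(path_to_generated_data)
    if path.is_file():
        problems.append(f"{path} is a file; pass the directory that contains corpus.jsonl")
    elif not path.is_dir():
        problems.append(f"{path} does not exist; create the directory first")
    elif not (path / "corpus.jsonl").is_file():
        problems.append(f"{path} has no file named exactly corpus.jsonl")
    return problems
```

For example, check_generated_dir("generated/CustomDB") returns an empty list only when generated/CustomDB/corpus.jsonl exists.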

kwang2049 · 2022-06-05T15:26:26Z

Thanks for sharing your experience @ahadda5 and pointing out this misleading behavior! I will add some assertions to give hints and make it clearer.

kwang2049 · 2022-06-08T19:52:02Z

Now mnrl_** are set to None by default, so one will no longer be bothered by MNRL (i.e. the baseline QGen) issues: #12

junefeld · 2022-06-27T19:16:49Z

Does anyone have a working example they could share that shows their folder structure?

kwang2049 · 2022-06-27T22:05:20Z

Hi @christopherfeld, I have created a Google Colab notebook showing how to run this toy example. Please have a look here: https://colab.research.google.com/drive/1Wis4WugIvpnSAc7F7HGBkB38lGvNHTtX?usp=sharing. I hope this helps :)

This was referenced Jun 8, 2022

set mnrl_** to None by default; type assertion for concat_title_and_body #12

Merged

How to create dataset to train in GPL from normal set of domain specific word docs or pdfs. #3

Open
