Allow trec_eval to take symbols representing standard qrels (instead of full qrel files) #2391

Closed
lintool opened this issue Feb 23, 2024 · 7 comments · Fixed by #2470

@lintool
Member

lintool commented Feb 23, 2024

Currently, for trec_eval, we have to do something like:

target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25.txt

It would be great if we could do something like:

target/appassembler/bin/trec_eval -c -M 10 -m recip_rank msmarco-passage runs/run.msmarco-passage.bm25.txt

That is, take a symbol instead of a full file path.

Simple algorithm: take the symbol and find the qrels file whose name it is a prefix of, so that the user can specify, e.g., msmarco-passage instead of msmarco-passage.dev-subset. Behind the scenes, we download the file to ~/.cache.
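A minimal sketch of the prefix lookup described above, in Python. Everything here is illustrative: `KNOWN_QRELS`, `stem`, `find_prefix_matches`, and `cache_path` are hypothetical names, the file list is just the names mentioned in this thread, and the download itself is left out because the issue does not say where the files are hosted.

```python
import os
from pathlib import Path

# Hypothetical catalogue: only the qrels file names mentioned in this thread.
KNOWN_QRELS = [
    "qrels.msmarco-passage.dev-subset.txt",
    "qrels.msmarco-v2-passage.dev.txt",
    "qrels.msmarco-v2-passage.dev2.txt",
]


def stem(file_name: str) -> str:
    """Strip the 'qrels.' prefix and the '.txt' suffix from a qrels file name."""
    return file_name.removeprefix("qrels.").removesuffix(".txt")


def find_prefix_matches(symbol: str, known=KNOWN_QRELS) -> list[str]:
    """Return every known qrels file whose stem starts with the symbol."""
    return [name for name in known if stem(name).startswith(symbol)]


def cache_path(file_name: str) -> Path:
    """Path under ~/.cache where a downloaded copy would be kept (assumed layout)."""
    return Path(os.path.expanduser("~/.cache")) / file_name


if __name__ == "__main__":
    for match in find_prefix_matches("msmarco-passage"):
        print(match, "->", cache_path(match))
```

A symbol can match more than one file, so the function returns a list rather than a single name; what to do with several candidates is a separate policy decision.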

Something like this already works in Pyserini; we need to backport it to Anserini.

@xpbowler
Contributor

xpbowler commented Apr 5, 2024

working on this

@lintool
Member Author

lintool commented Apr 22, 2024

@xpbowler any progress here?

@xpbowler
Contributor

Sorry! I've been busy with finals these past 2 weeks. Last exam is tomorrow, so I'll get back on this after.

@DanielKohn1208
Contributor

@lintool What should happen in the case where we have two file names that would in theory have the same associated symbol?

Ex. Symbol msmarco-v2-passage could be associated with qrels.msmarco-v2-passage.dev.txt and qrels.msmarco-v2-passage.dev2.txt.

Would it be okay to simply require symbol names to be longer?

@lintool
Member Author

lintool commented Apr 26, 2024

> @lintool What should happen in the case where we have two file names that would in theory have the same associated symbol?
>
> Ex. Symbol msmarco-v2-passage could be associated with qrels.msmarco-v2-passage.dev.txt and qrels.msmarco-v2-passage.dev2.txt.
>
> Would it be okay to simply require symbol names to be longer?

In your example, msmarco-v2-passage would not be a valid symbol. The user would need to specify either dev or dev2.

As long as the file names are not identical (and they shouldn't be), it's fine. We specify which qrels we're using in the command line to trec_eval.
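One way to read the rule above, sketched in Python under stated assumptions: a symbol is valid only if it identifies exactly one file. `KNOWN_STEMS` and `resolve_symbol` are hypothetical names, and the two stems are the ones from the example. The sketch also assumes that an exact match takes priority, since the stem ending in dev is itself a prefix of the one ending in dev2.

```python
# Hypothetical stems (file names minus "qrels." and ".txt") from the example above.
KNOWN_STEMS = ["msmarco-v2-passage.dev", "msmarco-v2-passage.dev2"]


def resolve_symbol(symbol: str, known=KNOWN_STEMS) -> str:
    """Map a symbol to exactly one qrels file name, or raise ValueError."""
    # An exact match wins; otherwise the "dev" stem would collide with "dev2".
    if symbol in known:
        return f"qrels.{symbol}.txt"

    # Fall back to prefix matching, accepting only an unambiguous result.
    matches = [s for s in known if s.startswith(symbol)]
    if len(matches) == 1:
        return f"qrels.{matches[0]}.txt"
    if not matches:
        raise ValueError(f"unknown qrels symbol: {symbol!r}")
    raise ValueError(
        f"ambiguous qrels symbol {symbol!r}; candidates: {', '.join(matches)}"
    )


if __name__ == "__main__":
    print(resolve_symbol("msmarco-v2-passage.dev2"))
    try:
        resolve_symbol("msmarco-v2-passage")
    except ValueError as err:
        print(err)
```

Rejecting the ambiguous symbol with an error that lists the candidates keeps the choice of qrels explicit on the command line, which is the property the comment above asks for.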

@lintool
Member Author

lintool commented Apr 27, 2024

For the symbols, let's use the bindings here: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/eval/Qrels.java

So instead of

java -cp anserini-0.35.1-fatjar.jar trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt run.msmarco-v1-passage.dev.bm25.txt

We can just do:

java -cp anserini-0.35.1-fatjar.jar trec_eval -c -M 10 -m recip_rank msmarco-passage.dev-subset run.msmarco-v1-passage.dev.bm25.txt
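The substitution between the two commands above can be sketched in Python as a lookup in an explicit binding table. This is not the contents of Qrels.java: `QRELS_BINDINGS` and `rewrite_args` are hypothetical, the table holds only the one symbol and file pair that appears in this thread, and the local path on the right-hand side is an assumption about where the file ends up. The sketch relies on trec_eval taking the qrels file and the run file as its last two arguments.

```python
# Hypothetical binding table; only this one pair is taken from the commands above.
QRELS_BINDINGS = {
    "msmarco-passage.dev-subset":
        "tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt",
}


def rewrite_args(args: list[str], bindings=QRELS_BINDINGS) -> list[str]:
    """Replace a qrels symbol with its bound file path; leave other arguments alone."""
    if len(args) < 2:
        return list(args)
    rewritten = list(args)
    # The qrels argument is second to last. Values that are not known symbols
    # pass through unchanged, so full file paths keep working.
    rewritten[-2] = bindings.get(args[-2], args[-2])
    return rewritten


if __name__ == "__main__":
    argv = ["-c", "-M", "10", "-m", "recip_rank",
            "msmarco-passage.dev-subset", "run.msmarco-v1-passage.dev.bm25.txt"]
    print(" ".join(rewrite_args(argv)))
```

Using exact keys rather than prefixes means a symbol either names one file or is passed through as a path, so no ambiguity check is needed at this step.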

@lintool
Member Author

lintool commented Apr 27, 2024

hey @xpbowler just to prevent duplicate work, I think @DanielKohn1208 is on this!
