Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order #308
Comments
e.g. msmarco-doc
So we want to search with the ids sorted?
Oh I see, pyserini follows increasing order of query id.
Correct. If you take the pyserini and anserini outputs and sort both, you'll see that they're identical, so the issue must come from query ordering.
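A minimal sketch of that check, assuming each run is a list of text lines whose first whitespace-separated field is the query id. The function names and the toy runs below are hypothetical, not taken from either toolkit:

```python
from typing import Iterable, List


def normalized(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blank lines, and sort so line order no longer matters."""
    return sorted(line.strip() for line in lines if line.strip())


def same_up_to_order(run_a: Iterable[str], run_b: Iterable[str]) -> bool:
    """True if both runs contain exactly the same lines, ignoring order."""
    return normalized(run_a) == normalized(run_b)


def query_order(lines: Iterable[str]) -> List[str]:
    """Query ids in order of first appearance (first field of each line)."""
    qids = (line.split()[0] for line in lines if line.strip())
    # dict.fromkeys keeps insertion order and drops repeats.
    return list(dict.fromkeys(qids))


# Toy data (made up): same hits, different query iteration order.
pyserini_run = ["2 D7 1", "2 D3 2", "10 D1 1"]
anserini_run = ["10 D1 1", "2 D7 1", "2 D3 2"]

print(same_up_to_order(pyserini_run, anserini_run))
print(query_order(pyserini_run), query_order(anserini_run))
```

If the first check passes while the two query orders differ, the runs disagree only in the order in which queries were processed.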
But if we want the array of ids, do we have to load them manually from topics.msmarco-doc.dev.txt?
My suggestion is to just take the ids, stuff them in a Python array, and treat it as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1 That way, you'll never "lose" the file. Space is cheap?
And then loop over that global variable here? pyserini/pyserini/search/__main__.py, line 134 in 9c7c3b1
Yea, except this:
Thoughts?
I prefer the first one actually.
We're just fixing this for experiment replications, right? So it is straightforward. I had a try on msmarco-doc.
The only drawback is that the global variable takes up 40k columns, but that is fine.
Now that I think about it, I think introducing a query iterator is cleaner? If you just fix it here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1 It'll only be fixed for
I see.
I'll draft a PR.
BTW, please make this work for both MS MARCO {doc, passage} x {dev, eval}.
I was thinking about making the query_order keyed by the topics name, but seems pyserini/pyserini/search/_base.py, line 75 in 9c7c3b1
They're here in anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/TopicReader.java#L67 Part of the latest release... they just need to be exposed via the hook you linked to above.
Closed by #309
If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md
Compared against Anserini replications: e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md
We'll note tiny differences - e.g., for MS MARCO doc, baselines - pyserini:
Compared to anserini:
Previously, we tracked it down in issue #257
I'd like to fix it so we get identical results moving forward. My proposed fix is a bit janky, but it'll work: let's just store, in Python code, an array of integers corresponding to the ids of the queries in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.
Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.
Thoughts @MXueguang? I was thinking you could work on this?
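A rough sketch of what such a query iterator could look like. The class name, the argument names, and the shape of the topics dictionary (query id mapped to a dict with a title field) are assumptions for illustration, not the code that ended up in pyserini:

```python
from typing import Dict, Iterator, List, Optional, Tuple


class QueryIterator:
    """Hypothetical helper: iterate over topics in a fixed, reproducible order."""

    def __init__(self, topics: Dict[int, Dict[str, str]],
                 order: Optional[List[int]] = None):
        self.topics = topics
        # With no explicit order, fall back to increasing query id.
        self.order = list(order) if order is not None else sorted(topics)

    def __iter__(self) -> Iterator[Tuple[int, Dict[str, str]]]:
        for qid in self.order:
            # Ids listed in the order but missing from the topics are skipped.
            if qid in self.topics:
                yield qid, self.topics[qid]


# Toy topics (made up); file_order stands in for the order of the original queries file.
topics = {2: {"title": "b"}, 10: {"title": "a"}, 7: {"title": "c"}}
file_order = [10, 2, 7]

print([qid for qid, _ in QueryIterator(topics, file_order)])
print([qid for qid, _ in QueryIterator(topics)])
```

The search loop would then iterate over the iterator instead of the raw dictionary, so the ordering detail stays hidden in one place.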