Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order #308

lintool · 2021-01-09T19:03:00Z

If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md
Compared against Anserini replications: e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We'll note tiny differences, e.g., for the MS MARCO doc baselines. Pyserini:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked this down in issue #257.

I'd like to fix it so we get identical results moving forward. My proposed fix is a bit janky, but it'll work: let's just store, in Python code, an array of integers corresponding to the ids of the queries in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts @MXueguang? I was thinking you could work on this?


MXueguang · 2021-01-09T19:32:05Z

e.g. msmarco-doc

174249	does xpress bet charge to deposit money in your account
320792	how much is a cost to run disneyland
1090270	botulinum definition
1101279	do physicians pay for insurance from their salaries?
201376	here there be dragons comic
54544	blood diseases that are sexually transmitted

Do we want to search with the ids sorted?

MXueguang · 2021-01-09T19:57:04Z

Oh I see, pyserini iterates in increasing order of query id,
but we want it to follow the order of https://github.com/castorini/anserini/blob/master/src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt?

lintool · 2021-01-09T19:59:18Z

Correct. If you take the pyserini and anserini output, sort both, you'll see that they're identical... so the issue must come from query ordering....
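
As a side illustration (not part of the original thread): floating-point addition is not associative, so accumulating the same per-query scores in a different order can change the last few digits of an average. That is consistent with the 0.2770296928568709 vs. 0.2770296928568702 gap above. A minimal sketch with made-up values, not the actual MS MARCO reciprocal ranks:

```python
import math

# Floating-point addition is not associative: the grouping changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# math.fsum returns a correctly rounded sum, so it does not depend on order.
values = [0.1, 0.2, 0.3]
print(math.fsum(values) == math.fsum(reversed(values)))  # True
```

Fixing the iteration order, as proposed in this issue, makes both toolkits accumulate in the same sequence, which sidesteps the problem without changing the evaluation code.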

MXueguang · 2021-01-09T20:00:09Z

But if we want to get the array of ids, do we have to manually load them from topics.msmarco-doc.dev.txt?
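
For reference, extracting the ids in file order could look roughly like the sketch below. It assumes the tab-separated `id<TAB>text` layout shown in the sample lines earlier in this thread; `read_query_order` is a hypothetical helper name, not an existing Pyserini function.

```python
import io


def read_query_order(lines):
    """Return query ids in the order they appear in a tab-separated topics file.

    Hypothetical helper: assumes each non-empty line is "<id>\t<query text>".
    """
    order = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qid, _text = line.split("\t", 1)
        order.append(int(qid))
    return order


# In-memory stand-in for the topics file, using three queries quoted above.
sample = io.StringIO(
    "174249\tdoes xpress bet charge to deposit money in your account\n"
    "320792\thow much is a cost to run disneyland\n"
    "1090270\tbotulinum definition\n"
)
order = read_query_order(sample)
print(order)  # [174249, 320792, 1090270]
```

The resulting list is what would be pasted into the Python source as a constant, so the order survives even if the topics file is not available at run time.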

lintool · 2021-01-09T20:04:40Z

My suggestion is just to take the ids, stuff them in a Python array, and treat it as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

That way, you'll never "lose" the file. Space is cheap?

MXueguang · 2021-01-09T20:10:24Z

And then loop over that global variable here?

pyserini/pyserini/search/__main__.py

Line 134 in 9c7c3b1

for index, topic in enumerate(tqdm(sorted(topics.keys()))):

lintool · 2021-01-09T20:24:07Z

Yea, except this:

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts?

MXueguang · 2021-01-09T20:30:57Z

I prefer the first one actually.

My suggestion is just to take the ids, stuff them in a Python array, and treat it as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

We're just fixing this for experiment replications, right? So it is straightforward.

I gave it a try on msmarco-doc,
and now we get results identical to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

The only drawback is that the global variable takes up 40k columns, but that is fine.

lintool · 2021-01-09T20:33:39Z

I prefer the first one actually.

Now that I think about it, I think introducing a query iterator is cleaner? If you just fix here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

It'll only be fixed for pyserini.search - if we introduce query iterator, it'll be more future proof... i.e., for dense retrieval, hybrid, etc. and direct library usage...

MXueguang · 2021-01-09T21:06:13Z

I see.
So the query iterator takes cur_dict and order_array and yields (id, text) pairs.
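
A minimal sketch of that interface, for illustration only. The function name `iterate_queries`, the plain `{id: text}` dictionary, and the fallback to sorted ids are assumptions made here, not the code that was eventually merged.

```python
def iterate_queries(cur_dict, order_array=None):
    """Yield (id, text) pairs from cur_dict.

    If order_array is given, follow it (ids missing from cur_dict are skipped);
    otherwise fall back to increasing id order, as the existing loop does.
    """
    ids = order_array if order_array is not None else sorted(cur_dict.keys())
    for qid in ids:
        if qid in cur_dict:
            yield qid, cur_dict[qid]


# Three queries quoted earlier in this thread, with their original file order.
topics = {
    1090270: "botulinum definition",
    201376: "here there be dragons comic",
    54544: "blood diseases that are sexually transmitted",
}
file_order = [1090270, 201376, 54544]

in_file_order = [qid for qid, _ in iterate_queries(topics, file_order)]
in_sorted_order = [qid for qid, _ in iterate_queries(topics)]
print(in_file_order)    # [1090270, 201376, 54544]
print(in_sorted_order)  # [54544, 201376, 1090270]
```

Keeping the order optional means callers that do not care about exact replication are unaffected, while search, dense retrieval, and direct library usage can all share the same iteration logic.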

MXueguang · 2021-01-09T21:06:30Z

I'll draft a PR.

lintool · 2021-01-09T22:18:55Z

BTW, please make this work for both MS MARCO {doc, passage} x {dev, eval}.

MXueguang · 2021-01-09T22:46:36Z

I was thinking about making the query_order keyed by the topics name, but it seems the eval topics are not in here:

pyserini/pyserini/search/_base.py

Line 75 in 9c7c3b1

def get_topics(collection_name):

lintool · 2021-01-09T22:56:00Z

They're here in anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/TopicReader.java#L67

Part of the latest release... they just need to be exposed - via the hook you linked to above.

lintool · 2021-01-14T19:00:55Z

Closed by #309

lintool assigned MXueguang Jan 9, 2021

lintool closed this as completed Jan 14, 2021
