Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order #308

Closed
lintool opened this issue Jan 9, 2021 · 15 comments
Assignees

Comments

@lintool
Copy link
Member

lintool commented Jan 9, 2021

If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md
Compared against Anserini replications: e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We'll note tiny differences - e.g., for MS MARCO doc, baselines - pyserini:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked it down issue #257

I'd like to fix it so get identical results moving forward - my proposed fix is a bit janky, but it'll work: Let's just store, in Python code, an array of integers corresponding to ids of the queries in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts @MXueguang? I was thinking you could work on this?

@MXueguang
Copy link
Member

e.g. msmarco-doc

174249	does xpress bet charge to deposit money in your account
320792	how much is a cost to run disneyland
1090270	botulinum definition
1101279	do physicians pay for insurance from their salaries?
201376	here there be dragons comic
54544	blood diseases that are sexually transmitted

we want to search with id sorted?

@MXueguang
Copy link
Member

oh I see, the pyserini is following the increasing order of query id.
but we want it follow the order of https://github.com/castorini/anserini/blob/master/src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt?

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

Correct. If you take the pyserini and anserini output, sort both, you'll see that they're identical... so the issue must come from query ordering....

@MXueguang
Copy link
Member

but if we want to get the array of id, we have to manually load the id from topics.msmarco-doc.dev.txt?

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

My suggestion is to just to take the ids, stuff in a Python array, and treat as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

That way, you'll never "lose" the file. Space is cheap?

@MXueguang
Copy link
Member

and then for loop that global variable here?

for index, topic in enumerate(tqdm(sorted(topics.keys()))):

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

Yea, except this:

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts?

@MXueguang
Copy link
Member

I prefer the first one actually.

My suggestion is to just to take the ids, stuff in a Python array, and treat as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

we just fix that for experiment replications right? so it is straightforward.

I had a try on msmarco-doc
now we can get result identical to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

the only shortage is the global variable took 40k columns but that is fine.

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

I prefer the first one actually.

Now that I think about it, I think introducing a query iterator is cleaner? If you just fix here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

It'll only be fixed for pyserini.search - if we introduce query iterator, it'll be more future proof... i.e., for dense retrieval, hybrid, etc. and direct library usage...

@MXueguang
Copy link
Member

i see.
so query iterator takes cur_dict and order_array and yield (id, text) pairs

@MXueguang
Copy link
Member

ill draft pr

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

BTW, please make this work for both MS MARCO {doc, passage} x {dev, eval}.

@MXueguang
Copy link
Member

MXueguang commented Jan 9, 2021

I was thinking about making the query_order keyed by the topics name, but seems eval topics are not in here

def get_topics(collection_name):

@lintool
Copy link
Member Author

lintool commented Jan 9, 2021

They're here in anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/TopicReader.java#L67

Part of the latest release... they just need to be exposed - via the hook you linked to above.

@lintool
Copy link
Member Author

lintool commented Jan 14, 2021

Closed by #309

@lintool lintool closed this as completed Jan 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants