Fix DenseRetrievalExactSearch evaluation #154

Open · wants to merge 2 commits into main
Conversation

@NouamaneTazi (Contributor) commented on Aug 12, 2023

I noticed a problem in the way we handle queries that also exist in the retrieval corpus. By default we have ignore_identical_ids=True, which pops these duplicated queries from the results. This means some queries end up with top_k retrieved documents while others end up with only top_k-1.
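
To make the failure mode concrete, here is a minimal sketch with toy ids and scores (not the actual evaluation code):

    top_k = 3
    # Toy results as {query_id: {doc_id: score}}; q1 also exists as a corpus document.
    results = {
        "q1": {"q1": 0.99, "d1": 0.80, "d2": 0.70},
        "q2": {"d3": 0.90, "d4": 0.85, "d5": 0.60},
    }

    # ignore_identical_ids=True pops each query from its own result list.
    for qid in results:
        results[qid].pop(qid, None)

    print({qid: len(docs) for qid, docs in results.items()})
    # {'q1': 2, 'q2': 3} -> q1 is scored on top_k-1 documents, q2 on top_k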

Fixing this behaviour gives a noticeable change in scores. Here is the difference for "intfloat/e5-large" on ArguAna, evaluated with MTEB:

    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-large", device="cuda")
    evaluation = MTEB(tasks=["ArguAna"])
    evaluation.run(model, batch_size=512 * 2, corpus_chunk_size=10000, overwrite_results=True)

Scores before fix:

INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.27596, 'ndcg_at_3': 0.42701, 'ndcg_at_5': 0.48151, 'ndcg_at_10': 0.53452, 'ndcg_at_100': 0.57081, 'ndcg_at_1000': 0.57226, 'map_at_1': 0.27596, 'map_at_3': 0.38976, 'map_at_5': 0.41967, 'map_at_10': 0.44187, 'map_at_100': 0.4507, 'map_at_1000': 0.45077, 'recall_at_1': 0.27596, 'recall_at_3': 0.53485, 'recall_at_5': 0.66856, 'recall_at_10': 0.83073, 'recall_at_100': 0.98578, 'recall_at_1000': 0.99644, 'precision_at_1': 0.27596, 'precision_at_3': 0.17828, 'precision_at_5': 0.13371, 'precision_at_10': 0.08307, 'precision_at_100': 0.00986, 'precision_at_1000': 0.001, 'mrr_at_1': 0.28378, 'mrr_at_3': 0.39284, 'mrr_at_5': 0.42261, 'mrr_at_10': 0.44498, 'mrr_at_100': 0.45374, 'mrr_at_1000': 0.45381, 'evaluation_time': 127.59}

Scores after fix:

INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.41963, 'ndcg_at_3': 0.57859, 'ndcg_at_5': 0.62677, 'ndcg_at_10': 0.65648, 'ndcg_at_100': 0.67739, 'ndcg_at_1000': 0.67846, 'map_at_1': 0.41963, 'map_at_3': 0.53983, 'map_at_5': 0.56664, 'map_at_10': 0.57907, 'map_at_100': 0.58407, 'map_at_1000': 0.58413, 'recall_at_1': 0.41963, 'recall_at_3': 0.69061, 'recall_at_5': 0.80725, 'recall_at_10': 0.89829, 'recall_at_100': 0.98862, 'recall_at_1000': 0.99644, 'precision_at_1': 0.41963, 'precision_at_3': 0.2302, 'precision_at_5': 0.16145, 'precision_at_10': 0.08983, 'precision_at_100': 0.00989, 'precision_at_1000': 0.001, 'mrr_at_1': 0.41963, 'mrr_at_3': 0.53983, 'mrr_at_5': 0.56664, 'mrr_at_10': 0.57907, 'mrr_at_100': 0.58407, 'mrr_at_1000': 0.58413, 'evaluation_time': 112.69}

cc @thakur-nandan

@NouamaneTazi marked this pull request as ready for review on August 12, 2023, 17:29
@Muennighoff (Contributor) left a comment:

I'm not fully understanding yet, maybe you can help me out 😅🧐

@@ -45,6 +46,9 @@ def search(self,
         logger.info("Sorting Corpus by document length (Longest first)...")

         corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True)
+        if ignore_identical_ids:
+            # We remove the query from results if it exists in corpus
+            corpus_ids = [cid for cid in corpus_ids if cid not in query_ids]
Contributor:

Doesn't this make the task "easier" by removing all other queries as options for each query?

I.e. previously, given query1, the model could wrongly retrieve query2 (if it was also in the corpus). Now the model cannot retrieve any of the other queries, which makes the task easier, assuming the answer is never another query.
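
To illustrate the concern with toy ids (an illustrative sketch, not the PR's code): the new filter drops every query id from the candidate pool, not just the id identical to the current query.

    corpus_ids = ["q1", "d1", "q2", "d2"]  # q1 and q2 are also query ids
    query_ids = ["q1", "q2"]

    corpus_ids = [cid for cid in corpus_ids if cid not in query_ids]
    print(corpus_ids)  # ['d1', 'd2'] -> when scoring q1, q2 is no longer a candidate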

Member:

I think this option was for Quora: you want to find paraphrases of queries, but not the original query itself. But this original query will always be ranked first, as it is also part of the corpus.

Contributor Author:

Which is why we have the ignore_identical_ids option, I think. This PR only tries to fix the ignore_identical_ids=True case.

Comment on lines -73 to +77

-    cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k+1, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
+    cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted)
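
For context, a toy sketch of what the removed +1 oversampling did (illustrative tensors and shapes, not the actual call site):

    import torch

    top_k = 2
    cos_scores = torch.tensor([[0.9, 0.8, 0.7, 0.1]])  # one query, four documents

    # Old behaviour: fetch one extra candidate so that discarding the query's
    # own document still leaves top_k genuine results.
    values, idx = torch.topk(cos_scores, min(top_k + 1, cos_scores.shape[1]), dim=1, largest=True)
    print(idx)  # tensor([[0, 1, 2]])
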
Contributor:

You write that "some queries would have top_k retrieved documents, while others have top_k-1 retrieved documents", but didn't this +1 ensure that that does not happen, because we retrieve top_k+1 but then only allow top_k later on?

Contributor Author:

IIUC, the problem comes from this line:

    if len(result_heaps[query_id]) < top_k:

So we only keep the top_k (which sometimes includes the query among the retrieved docs).
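
For reference, a paraphrased sketch of the heap logic under discussion (reconstructed from the names above, not the exact source):

    import heapq

    def push_hit(result_heaps, query_id, corpus_id, score, top_k):
        if corpus_id != query_id:  # skip the document identical to the query
            if len(result_heaps[query_id]) < top_k:
                # Heap not yet full: keep the hit unconditionally.
                heapq.heappush(result_heaps[query_id], (score, corpus_id))
            else:
                # Heap full: replace the current minimum if this hit scores higher.
                heapq.heappushpop(result_heaps[query_id], (score, corpus_id))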

Contributor:

I see, I thought the if corpus_id != query_id: check would ensure that the query would never be added to result_heaps[query_id] 🧐

Contributor Author:

Hmm, then why do we get different results? 🧐

Contributor Author:

It's easy to check: we just have to assert that the number of results for each query is top_k. Can you check that please, @Muennighoff?
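
A check along these lines should do it (a sketch; the results mapping and top_k are assumed from context):

    for query_id, docs in results.items():
        assert len(docs) == top_k, f"{query_id}: expected {top_k} results, got {len(docs)}"
        assert query_id not in docs, f"{query_id} retrieved itself"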
