Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding multi-threads support with multiple params to SearchCollection #470

Merged
merged 5 commits into from
Nov 11, 2018

Conversation

Peilin-Yang
Copy link
Collaborator

In this PR I fundamentally change how SearchCollection works.

  1. With this PR one can provide multiple params for base models, for example -b 0.2 0.75 for BM25 or -mu 200 2000 for QL. One can also provide multiple params for reranking models, at the same time, for example -rm3.fbDocs 10 20 -rm3.fbTerms 50 100 for RM3. As a result, there are N1N2N3*... run files being generated where N1..Nn are the values of the params.
  2. Now the SearchCollection spawns new threads for all retrievals by default. This, together with the newly introduced -inmem option, make a bunch of retrieval less expensive since multiple retrievals need to load the index (potentially in the memory for better multi-threading) once.

… list of params and construct the Lucene searcher in a for loop and output to different files. This can reduce the effor of reading index from disk every time one'd like to run another set of params. The reranking is also supported in the similar way

import org.apache.lucene.search.similarities.Similarity;

public class AuxSimilarity {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we name this class TaggedSimilarity?

package io.anserini.search.similarity;

import org.apache.lucene.search.similarities.Similarity;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add top-level javadoc?

searcher.close();
final long durationMillis = TimeUnit.MILLISECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS);
LOG.info("Total " + numTopics + " topics searched in "
+ DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss"));
LOG.info("Total run time: " + DurationFormatUtils.formatDuration(durationMillis, "HH:mm:ss"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we print out number of topics anymore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because now it is print inside each retrieval thread.
Search for LOG.info("Run " + topics.size() in this PR

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of code change, but fairly straightforward, actually...


private SearcherThread(IndexSearcher searcher, SortedMap<K, Map<String, String>> topics, AuxSimilarity auxSimilarity,
String cascadeTag, RerankerCascade cascade, String outputPath, String runTag) throws IOException {
this.searcher = searcher;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a separate cascade tag? Each reranker has its own tag, right? So can't the RerankerCascade reconstruct the tag by joining each individual tag?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time we will probably have just 1 reranker and I think it is better to have just 1 tag.
Also, we have TieBreaker as the default last reranker and I am not sure if it is worthy to concatenate all tags.

@lintool
Copy link
Member

lintool commented Nov 7, 2018

High-level comments, for discussion:

  1. Instead of depending on toString, maybe use explicit tag method?
  2. Would it make sense to introduce a new abstraction called RankingModel that is a combination of the similarity and the rerankers?

@Peilin-Yang
Copy link
Collaborator Author

Instead of depending on toString, maybe use explicit tag method?
Will do
Would it make sense to introduce a new abstraction called RankingModel that is a combination of the similarity and the rerankers?
Yes, this makes sense. But let's make it in another PR?

@lintool
Copy link
Member

lintool commented Nov 11, 2018

Can you create a separate issue for above so we don't lose track of it?

Also, please make sure regressions pass before you merge?

@Peilin-Yang
Copy link
Collaborator Author

ALL Regression tests passed on tuna, going to merge

@Peilin-Yang Peilin-Yang merged commit 4d81578 into master Nov 11, 2018
@Peilin-Yang Peilin-Yang deleted the multi_paras_retrieval_test branch November 16, 2018 19:58
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants