Add option to SearchCollection to use hgf tokenizer #1988

Closed
lintool opened this issue Oct 1, 2022 · 2 comments

@lintool
Member

lintool commented Oct 1, 2022

Hey @ToluClassics - I'm looking at the latest hgf wp regressions for MS MARCO, and it seems to be using pre-tokenized queries, e.g., msmarco-doc-hgf-wp.yaml

We should probably add an option in SearchCollection to use hgf tokenization?

@ToluClassics
Member

It uses hgf tokenization for the queries as well.

Here's the path to the queries:

path: topics.msmarco-doc.dev.txt

In SearchCollection:

} else if (args.analyzeWithHuggingFaceTokenizer != null) {
  analyzer = new HuggingFaceTokenizerAnalyzer(args.analyzeWithHuggingFaceTokenizer);

@lintool
Member Author

lintool commented Oct 1, 2022

Ah, sorry, I misread the PR.

@lintool closed this as completed Oct 1, 2022