Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811

Merged

Conversation

psorianom
Copy link
Contributor

Hi all!
Regarding issue #783, I propose this simple solution. I realize that this parameter rises some questions, such as, why only determine the class for the tokenizer and not the model also ? how to deal with each model, should we be taking into account each model (query, passage) separately? How deep in configuration detail should we go during the instantiation of the DPRetriever? and so on.

Of course, it is open for improvements :)

Cheers,

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @psorianom,

Looking good to me! Ready to merge once the mypy issue in the CI is resolved and infer_tokenizer_classes is added to the docstring.

@psorianom
Copy link
Contributor Author

super! thanks for the quick review :)

@tholor tholor changed the title added parameter to infer DPR tokenizers class Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg Feb 12, 2021
@tholor tholor merged commit 8adf5b4 into deepset-ai:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants