Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811

psorianom · 2021-02-08T11:03:08Z

Hi all!
Regarding issue #783, I propose this simple solution. I realize that this parameter raises some questions: Why determine the class only for the tokenizer and not for the model as well? Should each model (query, passage) be handled separately? How deep into configuration detail should we go when instantiating the DPRetriever? And so on.
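To make the idea concrete, here is a minimal sketch of the switch, not the actual diff. It assumes the flag is a boolean named `infer_tokenizer_classes` (the name used in the review of this PR) and that "inferring" means passing no explicit class so the loader can choose one from the model name. The helper `pick_tokenizer_classes`, the two default constants, and the `None` convention are hypothetical and only there for illustration.

```python
from typing import Optional, Tuple

# Assumed defaults for illustration: the standard DPR tokenizer class names
# for the query encoder and the passage encoder.
DEFAULT_QUERY_TOKENIZER = "DPRQuestionEncoderTokenizer"
DEFAULT_PASSAGE_TOKENIZER = "DPRContextEncoderTokenizer"


def pick_tokenizer_classes(
    infer_tokenizer_classes: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (query_tokenizer_class, passage_tokenizer_class).

    Hypothetical helper: None means "no fixed class, let the tokenizer
    loader infer it from the model name or config".
    """
    if infer_tokenizer_classes:
        # Non-standard models (e.g. CamemBERT-based DPR) need their own tokenizer,
        # so no class is forced here.
        return None, None
    # Default behaviour: always use the DPR-specific tokenizers.
    return DEFAULT_QUERY_TOKENIZER, DEFAULT_PASSAGE_TOKENIZER


if __name__ == "__main__":
    print(pick_tokenizer_classes())
    print(pick_tokenizer_classes(infer_tokenizer_classes=True))
```

With the flag left at its default nothing changes for existing users; only models that ship a different tokenizer need to set it.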

Of course, it is open for improvements :)

Cheers,

tholor

Hey @psorianom,

Looking good to me! Ready to merge once the mypy issue in the CI is resolved and infer_tokenizer_classes is added to the docstring.

psorianom · 2021-02-09T10:40:58Z

super! thanks for the quick review :)

Pavel Soriano and others added 2 commits February 8, 2021 11:56

added parameter to infer DPR tokenizers class

d911105

Add latest docstring and tutorial changes

c938a08

tholor reviewed Feb 8, 2021

tholor and others added 2 commits February 12, 2021 13:23

Update docstring. fix mypy

42e7815

Add latest docstring and tutorial changes

2164482

tholor changed the title ~~added parameter to infer DPR tokenizers class~~ Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg Feb 12, 2021

Merge remote-tracking branch 'upstream/master' into psorianom/infer_d…

994b1c6

…pr_tokenizer_class

tholor merged commit 8adf5b4 into deepset-ai:master Feb 12, 2021

tholor mentioned this pull request Feb 23, 2021

Allow for a custom tokenizer_class while loading DPR models #783

Closed
