Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify language_modeling.py and tokenization.py #2703

Merged
merged 119 commits into from Jul 22, 2022

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Jun 22, 2022

Related Issue(s): #2445, closes #2704

Proposed changes:
This PR is a near-full rewrite of haystack/modeling/language_model.py and a heavy rewrite of haystack/modeling/tokenization.py. Although not strictly necessary for the multimodality support, this allows us to add new models with ease and without any code duplication. In addition it removes heavy and unnecessary use of **kwargs across many function calls and clarifies which parameters are expected by which function.

One core change is the removal of LanguageModel.load(). This factory method has been misused as a replacement for the __init__ function of model classes, making the whole initialization process quite confusing. To prevent similar issues, the factory method has been extracted from the class, named get_language_model(), and located at the bottom of the module.

Another core change is the removal of copy-pasted model wrappers like Bert, Roberta etc. These classes were mostly copies of each other and have been all replaced with only two classes called HFLanguageModel and HFLanguageModelWithPooler. DPR model classes have been merged into a single DPREncoder class, but could be further simplified.

In the tokenization part, all tokenizers have been replaced by AutoTokenizer (as suggested by @bogdankostic), which also simplified the codebase heavily. The Tokenizer.load() factory method has also been removed and replace with the get_tokenizer() factory method. Whether this method brings any value, now that the only tokenizer used is AutoTokenizer, is open for discussion.

Other classes in the haystack/modeling folder have been adapted to these changes. Also DPRetriever and EmbeddingRetriever have been slightly affected. Tests have also been corrected.

TODOs:

  • I plan to introduce better unit testing for tokenization, language model initialization and dpr.
  • I plan to try make DPREncoder a subclass of HFLanguageModel

Pre-flight checklist

@masci masci linked an issue Jun 22, 2022 that may be closed by this pull request
@masci masci requested a review from vblagoje June 22, 2022 13:18
@ZanSara ZanSara marked this pull request as draft July 14, 2022 13:51
@ZanSara ZanSara marked this pull request as ready for review July 14, 2022 13:51
ZanSara and others added 6 commits July 18, 2022 09:22
…2862)

* Add language unit tests

* Parametrize language unit tests

* Simplify model_type handling

* Remove model_type parameter from get_language_model

* Use the actual dpr encoders rather than tiny-bert in test_retriever.py

* Additional checks for pretrained_model_name_or_path parameter in get_language_model

* Fix get_language_model docs

* Small fix
@ZanSara ZanSara marked this pull request as draft July 21, 2022 13:07
@ZanSara ZanSara marked this pull request as ready for review July 21, 2022 13:07
@ZanSara ZanSara mentioned this pull request Jul 21, 2022
8 tasks
Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, CI green

@ZanSara ZanSara merged commit 4e45062 into master Jul 22, 2022
@ZanSara ZanSara deleted the simplify-language-modeling branch July 22, 2022 14:29
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* Simplification of language_model.py and tokenization.py to remove code duplication

Co-authored-by: vblagoje <dovlex@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify language_modeling.py and tokenization.py
2 participants