Simplify `language_modeling.py` and `tokenization.py` #2703

ZanSara · 2022-06-22T10:26:23Z

Related Issue(s): #2445, closes #2704

Proposed changes:
This PR is a near-full rewrite of `haystack/modeling/language_model.py` and a heavy rewrite of `haystack/modeling/tokenization.py`. Although not strictly necessary for multimodality support, the rewrite lets us add new models easily and without code duplication. It also removes the heavy, unnecessary use of `**kwargs` across many function calls and makes clear which parameters each function expects.
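To illustrate why threading `**kwargs` through call chains hides mistakes, here is a minimal sketch. The function names and the `max_seq_len` parameter are hypothetical and not taken from Haystack; the only point is that an explicit signature rejects a misspelled argument, while a `**kwargs` pass-through silently ignores it.

```python
def load_with_kwargs(name, **kwargs):
    # Unknown or misspelled keys are silently ignored here.
    max_seq_len = kwargs.get("max_seq_len", 256)
    return {"name": name, "max_seq_len": max_seq_len}


def load_explicit(name: str, max_seq_len: int = 256):
    # The signature documents exactly which parameters are accepted.
    return {"name": name, "max_seq_len": max_seq_len}


# Typo on purpose: "max_seq_length" instead of "max_seq_len".
silently_wrong = load_with_kwargs("some-model", max_seq_length=128)

try:
    load_explicit("some-model", max_seq_length=128)
    typo_detected = False
except TypeError:
    # Python rejects the unexpected keyword argument.
    typo_detected = True
```

With the pass-through version the caller gets the default value back without any warning; with the explicit version the same typo fails immediately.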

One core change is the removal of `LanguageModel.load()`. This factory method had been misused as a replacement for the `__init__` function of the model classes, which made the whole initialization process quite confusing. To prevent similar issues, the factory method has been extracted from the class, renamed `get_language_model()`, and moved to the bottom of the module.
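The pattern described above could look roughly like the sketch below. All class names, the `kind` parameter, and the registry are hypothetical stand-ins and not the signatures introduced by this PR: initialization lives entirely in `__init__`, and a module-level function only picks the class and validates its input.

```python
from typing import Dict, Type


class ToyLanguageModel:
    """Hypothetical model wrapper: all setup happens in __init__."""

    def __init__(self, pretrained_model_name_or_path: str):
        self.name = pretrained_model_name_or_path


class ToyLanguageModelWithPooler(ToyLanguageModel):
    """Hypothetical variant with one extra, explicitly named option."""

    def __init__(self, pretrained_model_name_or_path: str, pooler_dropout: float = 0.1):
        super().__init__(pretrained_model_name_or_path)
        self.pooler_dropout = pooler_dropout


# Hypothetical registry; the keys are illustrative only.
_MODEL_CLASSES: Dict[str, Type[ToyLanguageModel]] = {
    "plain": ToyLanguageModel,
    "with_pooler": ToyLanguageModelWithPooler,
}


def get_toy_language_model(pretrained_model_name_or_path: str, kind: str = "plain") -> ToyLanguageModel:
    """Module-level factory: choose a class, then defer to its __init__."""
    if not isinstance(pretrained_model_name_or_path, str) or not pretrained_model_name_or_path:
        raise ValueError("pretrained_model_name_or_path must be a non-empty string")
    try:
        model_class = _MODEL_CLASSES[kind]
    except KeyError:
        raise ValueError(f"Unknown kind '{kind}', expected one of {sorted(_MODEL_CLASSES)}") from None
    return model_class(pretrained_model_name_or_path)
```

Because the factory is no longer a method on the base class, subclasses cannot accidentally route their construction through it, and each class can still be instantiated directly.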

Another core change is the removal of copy-pasted model wrappers like `Bert`, `Roberta`, etc. These classes were mostly copies of each other and have all been replaced by just two classes, `HFLanguageModel` and `HFLanguageModelWithPooler`. The DPR model classes have been merged into a single `DPREncoder` class, which could be simplified further.
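The deduplication can be pictured as follows. This is a simplified sketch with hypothetical names, not the actual `HFLanguageModel` code: the per-architecture differences live in a small lookup table, so supporting another model type means adding a table entry rather than copying a class.

```python
from typing import Dict, List

# Hypothetical table of the inputs each model type accepts.
# DistilBERT, for example, takes no segment IDs (token_type_ids).
_FORWARD_INPUTS: Dict[str, List[str]] = {
    "bert": ["input_ids", "attention_mask", "token_type_ids"],
    "distilbert": ["input_ids", "attention_mask"],
}


class GenericLanguageModel:
    """One wrapper for several model types instead of one class per type."""

    def __init__(self, model_type: str):
        if model_type not in _FORWARD_INPUTS:
            raise ValueError(f"Unsupported model type: {model_type}")
        self.model_type = model_type

    def select_inputs(self, batch: Dict[str, list]) -> Dict[str, list]:
        """Keep only the inputs that this model type accepts."""
        accepted = _FORWARD_INPUTS[self.model_type]
        return {key: value for key, value in batch.items() if key in accepted}


batch = {"input_ids": [1, 2], "attention_mask": [1, 1], "token_type_ids": [0, 0]}
bert_inputs = GenericLanguageModel("bert").select_inputs(batch)
distilbert_inputs = GenericLanguageModel("distilbert").select_inputs(batch)
```

Here `bert_inputs` keeps all three entries, while `distilbert_inputs` drops `token_type_ids`; the class body is identical for both.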

In the tokenization part, all tokenizers have been replaced by `AutoTokenizer` (as suggested by @bogdankostic), which also simplified the codebase considerably. The `Tokenizer.load()` factory method has been removed as well and replaced with the `get_tokenizer()` factory method. Whether this method still adds any value, now that `AutoTokenizer` is the only tokenizer in use, is open for discussion.
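For context on that open question, a thin factory around `AutoTokenizer` might look like the sketch below. The function name and the `loader` argument are hypothetical (the latter exists only so the example can run without downloading a model), and the signature is an assumption, not the one added in this PR. `AutoTokenizer.from_pretrained` and its `use_fast` option are part of the `transformers` library.

```python
from typing import Any, Callable, Optional


def get_tokenizer_sketch(
    pretrained_model_name_or_path: str,
    use_fast: bool = True,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a tokenizer through a single entry point with explicit parameters."""
    if loader is None:
        # Imported lazily so the sketch can be read and tested
        # without transformers installed.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    return loader(pretrained_model_name_or_path, use_fast=use_fast)
```

Calling `get_tokenizer_sketch("bert-base-uncased")` would delegate straight to `AutoTokenizer.from_pretrained`. In this form the wrapper adds little beyond a stable place to put defaults and validation, which is the trade-off under discussion.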

Other classes in the `haystack/modeling` folder have been adapted to these changes. `DPRetriever` and `EmbeddingRetriever` are also slightly affected, and the tests have been corrected accordingly.

TODOs:

I plan to introduce better unit testing for tokenization, language model initialization and dpr.
~~I plan to try to make DPREncoder a subclass of HFLanguageModel~~

Pre-flight checklist

~~I have read the contributors guidelines~~
~~I have enabled actions on my fork~~
If this is a code change, I added tests or updated existing ones
If this is a code change, I updated the docstrings

…2862) * Add language unit tests * Parametrize language unit tests * Simplify model_type handling * Remove model_type parameter from get_language_model * Use the actual dpr encoders rather than tiny-bert in test_retriever.py * Additional checks for pretrained_model_name_or_path parameter in get_language_model * Fix get_language_model docs * Small fix

vblagoje

Approved, CI green

* Simplification of language_model.py and tokenization.py to remove code duplication Co-authored-by: vblagoje <dovlex@gmail.com>

ZanSara and others added 9 commits June 6, 2022 17:53

Simplification of language_model.py to remove code duplication

f800184

restructure language_model.py

91caa7f

Merge branch 'master' into image_retriever

0f0fb64

Working on removing Tokenizer

23d38ec

Removing Tokenizer

c61ed79

working on normalizing DPR implementation too

a7c9bc0

Fixing dpr issue in test

b6b4e1d

Fixing DPRetriever, Embedding Retriever and usage of new API in modeling

268cacd

Update Documentation & Code Style

39419f3

ZanSara added the topic:DPR, topic:speed, topic:modeling, and type:refactor labels Jun 22, 2022

ZanSara and others added 3 commits June 22, 2022 12:33

Remove mentions to data2vecvision

6d4857b

Minor fixes

a551c05

Update Documentation & Code Style

63ab0cb

masci linked an issue Jun 22, 2022 that may be closed by this pull request

Simplify language_modeling.py and tokenization.py #2704

Closed

masci requested a review from vblagoje June 22, 2022 13:18

ZanSara and others added 12 commits June 22, 2022 16:11

fixing mypy issues

34b9973

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

e01efd0

Update Documentation & Code Style

0253b14

typing tokenization better

d78d55a

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

263f55d

more fixes for mypy

7551850

Update Documentation & Code Style

e78fe2e

pylint

8ed07ff

more mypy

4226eea

more mypy

26d9eb0

remove merge tags

e4e9ba1

Update Documentation & Code Style

7fc2443

github-actions bot and others added 13 commits July 13, 2022 08:39

Update Documentation & Code Style

c909599

mypy & pylint

a1685bd

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

c1034a8

Update Documentation & Code Style

44c7726

mypy & pylint again

d5eb606

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

fbbea3b

Improve management of output_hidden_states

0bb1104

Update Documentation & Code Style

34121d7

mypy

c5a6dd0

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

f1cdba1

fix tests

8df63f7

remove excess params from trainer

2e9f12f

Update Documentation & Code Style

3a5b9ec

ZanSara marked this pull request as draft July 14, 2022 13:51

ZanSara marked this pull request as ready for review July 14, 2022 13:51

ZanSara and others added 6 commits July 18, 2022 09:22

Merge remote-tracking branch 'upstream/master' into simplify-language-modeling

8cf3969

simplifying tokenizer tests

e7ebad4

Merge branch 'simplify-language-modeling' of github.com:deepset-ai/haystack into simplify-language-modeling

8ec0cbf

fix tokenization tests

b7c3329

Update Documentation & Code Style

216ef43

ZanSara marked this pull request as draft July 21, 2022 13:07

ZanSara marked this pull request as ready for review July 21, 2022 13:07

ZanSara mentioned this pull request Jul 21, 2022

Add support for images #2418

Closed

8 tasks

Adjust model_type resolution to check config architectures (#2871)

0e7ec82

vblagoje approved these changes Jul 22, 2022

ZanSara merged commit 4e45062 into master Jul 22, 2022

ZanSara deleted the simplify-language-modeling branch July 22, 2022 14:29

Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022

Simplify language_modeling.py and tokenization.py (deepset-ai#2703)

7ee4861

* Simplification of language_model.py and tokenization.py to remove code duplication Co-authored-by: vblagoje <dovlex@gmail.com>

bogdankostic mentioned this pull request Jul 26, 2022

DPR training is broken #2885

Closed
