Adding hugging-face tokenizer #3270

arunppsg · 2023-03-10T15:05:39Z

Pull Request Template

Description

This pull request adds support for using Hugging Face tokenizers in DeepChem. To this end, I have added a class, HuggingFaceFeaturizer, which wraps a Hugging Face tokenizer through composition (a has-a relationship). Corresponding tests have also been added.
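To illustrate what a has-a relationship means here, the following is a minimal sketch, not the code from this PR. `WhitespaceTokenizer` and `TokenizerFeaturizer` are hypothetical names, and the toy tokenizer only stands in for a real Hugging Face tokenizer so that the example runs without `transformers` installed.

```python
from typing import Dict, List


class WhitespaceTokenizer:
    """Stand-in for a Hugging Face tokenizer (hypothetical, illustration only)."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab

    def __call__(self, text: str) -> Dict[str, List[int]]:
        # Unknown tokens map to id 0 in this toy vocabulary.
        ids = [self.vocab.get(tok, 0) for tok in text.split()]
        return {"input_ids": ids}


class TokenizerFeaturizer:
    """Holds a tokenizer (has-a) instead of inheriting from it (is-a)."""

    def __init__(self, tokenizer) -> None:
        self.tokenizer = tokenizer

    def featurize(self, datapoints: List[str]) -> List[Dict[str, List[int]]]:
        # Delegate each datapoint to the wrapped tokenizer.
        return [self.tokenizer(dp) for dp in datapoints]


featurizer = TokenizerFeaturizer(WhitespaceTokenizer({"C": 1, "O": 2}))
features = featurizer.featurize(["C O", "C N"])
```

Because the featurizer only calls the wrapped object, any tokenizer with a compatible call signature could be swapped in without changing the featurizer class.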

Type of change

Please check the option that is related to your PR.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
- In this case, we recommend discussing your modification on GitHub issues before creating the PR
Documentation (modification of documents)

Checklist

rbharath

A couple of minor comments

deepchem/feat/huggingface_featurizer.py

rbharath · 2023-03-14T04:59:59Z

deepchem/feat/tests/test_huggingface_featurizer.py

+
+
+def testHuggingFaceFeaturizer():
+    # NOTE: The test depends on the sanity of the pretrained tokenizer,


How fast is this test to run? If the download is fast, this is probably OK.

Can you explain more what the comment means? That the sanity of the pretrained tokenizer isn't guaranteed?

The download took ~5 seconds.

By sanity, I mean that the test might fail if the vocabulary is modified or deleted, which is possible since we depend on an external resource.

Can you update the comment to include the explanation you gave? "Sanity" is an ambiguous term, while your explanation here is clearer.

Updated the comment in 607baf5.
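As a sketch of how a test could document and guard this kind of external dependency, the example below carries the clearer comment and skips the test when the resource cannot be fetched. This is one possible pattern, not what the PR does: `load_or_skip`, `unreachable_loader`, and `TokenizerDownloadTest` are hypothetical names, and the failed download is simulated so the example runs offline. It also only covers an unavailable resource; a vocabulary that changed upstream would still show up as an ordinary assertion failure.

```python
import unittest


def load_or_skip(loader):
    """Run `loader`; skip the calling test if the external resource is unavailable."""
    try:
        return loader()
    except OSError as exc:
        # In this sketch, download failures are assumed to surface as OSError.
        raise unittest.SkipTest(f"pretrained tokenizer unavailable: {exc}")


def unreachable_loader():
    # Simulates a failed download of the pretrained vocabulary.
    raise ConnectionError("network unreachable")


class TokenizerDownloadTest(unittest.TestCase):

    def test_tokenizer_available(self):
        # NOTE: This test depends on an external pretrained tokenizer. If its
        # vocabulary is modified or deleted upstream, the test might fail.
        tokenizer = load_or_skip(unreachable_loader)
        self.assertIsNotNone(tokenizer)


result = unittest.TestResult()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TokenizerDownloadTest)
suite.run(result)
```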

rbharath

One more minor comment and should be good to go soon

rbharath · 2023-03-15T01:50:04Z

deepchem/feat/tests/test_huggingface_featurizer.py


rbharath

Almost good to go. One minor request for clarification

rbharath · 2023-03-15T22:42:21Z

deepchem/feat/huggingface_featurizer.py

+
+    def __init__(
+        self,
+        tokenizer: 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'


I am not familiar with this notation. The type of the tokenizer is set to a string? Is this something used in type checking to avoid a circular import?

transformers is an expensive module to import (it takes about 3-4 seconds on my machine).

In the above module, we need to import the transformers module only for type checking. Hence, I have enclosed the annotation in quotes so that it is hidden from the interpreter at runtime, thereby reducing the import time of the module. During type checking, transformers gets imported because the variable TYPE_CHECKING (used here) will be True.

I found this usage in the Python docs (ref) and I have also seen similar usage in other projects.
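The pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the PR's actual module: `TokenizerWrapper` is a hypothetical class name, and the annotation string is the one from the diff. Because `TYPE_CHECKING` is `False` at runtime, the guarded import never executes, so the snippet runs even where `transformers` is not installed.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only static type checkers treat TYPE_CHECKING as True, so this slow
    # import is skipped when the module is loaded at runtime.
    import transformers


class TokenizerWrapper:  # hypothetical name, for illustration only

    def __init__(
        self,
        tokenizer: 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'
    ) -> None:
        self.tokenizer = tokenizer


# The quoted annotation is stored as a plain string and is not evaluated at runtime.
annotation = TokenizerWrapper.__init__.__annotations__['tokenizer']
```

One consequence of this design is that anything resolving the annotation at runtime, such as `typing.get_type_hints`, would still need `transformers` to be importable.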

rbharath

LGTM. Feel free to merge in once CI looks good

rbharath reviewed Mar 14, 2023


arunppsg and others added 4 commits March 14, 2023 16:08

added huggingface tokenizer

ed5bd15

clean transformers import

0574c37

test for hugging face tokenizer

525b547

added docs

447b88d

arunppsg force-pushed the hf-tokenizer branch from 4bb2a58 to 447b88d Compare March 14, 2023 10:38

rbharath reviewed Mar 15, 2023


test docs

607baf5

rbharath reviewed Mar 15, 2023


rbharath approved these changes Mar 17, 2023


arunppsg merged commit ee8430c into deepchem:master Mar 17, 2023

arunppsg deleted the hf-tokenizer branch April 11, 2023 09:20
