Adding hugging-face tokenizer #3270
Conversation
A couple of minor comments
def testHuggingFaceFeaturizer():
    # NOTE: The test depends on the sanity of the pretrained tokenizer,
How fast is this test to run? If the download is fast, this is probably OK.
Can you explain more what the comment means? That the sanity of the pretrained tokenizer isn't guaranteed?
The download took ~5 seconds.
By sanity I mean that the vocabulary could be modified or deleted (which can happen, since we are depending on an external resource), in which case the test might fail.
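The concern above can be hedged inside the test itself. A minimal sketch (the checkpoint name and helper function are illustrative, not the PR's actual code): if the external resource is unavailable or has changed, the test bails out instead of failing.

```python
def load_pretrained_tokenizer(name: str):
    """Try to fetch a pretrained tokenizer from the hub; return None when
    the external resource is unavailable (no network, checkpoint deleted,
    vocabulary changed in an incompatible way, etc.)."""
    try:
        from transformers import AutoTokenizer  # heavy, optional dependency
        return AutoTokenizer.from_pretrained(name)
    except Exception:
        return None


def test_huggingface_featurizer():
    # "bert-base-uncased" is an illustrative checkpoint name.
    tokenizer = load_pretrained_tokenizer("bert-base-uncased")
    if tokenizer is None:
        # External resource unavailable: return early (or call
        # pytest.skip) rather than fail, so the suite is not at the
        # mercy of the hub's current contents.
        return
    ids = tokenizer("hello world")["input_ids"]
    assert len(ids) > 0
```

Pinning a specific `revision=` in `from_pretrained` is another way to guard against the vocabulary changing underneath the test.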
Can you update the comment to include the explanation you gave? "Sanity" is an ambiguous term, while your explanation here is clearer.
Updated the comment in 607baf5.
One more minor comment and should be good to go soon
Almost good to go. One minor request for clarification
def __init__(
    self,
    tokenizer: 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'
I am not familiar with this notation. The type of the tokenizer is set to a string? Is this something used in type checking to avoid a circular import?
transformers is an expensive module to import (it takes about 3-4 seconds on my machine). In the above module, we need to import transformers only for type-checking purposes. Hence, I have enclosed the annotation in quotes so that it is hidden from the interpreter at runtime, thereby reducing the module's import time. During type checking it does get imported, because the variable TYPE_CHECKING (used here) will be True.
I found this usage in the Python docs (ref) and I have also seen similar usage in other projects.
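The pattern described above can be sketched as follows. Class and method names are illustrative, not DeepChem's actual API; the point is the TYPE_CHECKING guard plus the quoted forward-reference annotation, and the has-a relationship where the featurizer delegates to the wrapped tokenizer.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only by static type checkers such as mypy. At runtime
    # TYPE_CHECKING is False, so the 3-4 s import cost is never paid.
    import transformers


class HuggingFaceFeaturizer:
    """Illustrative sketch of a featurizer that *has* a tokenizer."""

    def __init__(
        self,
        tokenizer: 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'
    ):
        # The quoted annotation is a forward reference: the interpreter
        # stores it as a plain string and never evaluates it, so the
        # transformers package is not needed at runtime.
        self.tokenizer = tokenizer

    def _featurize(self, datapoint: str):
        # Delegation via the has-a relationship: the heavy lifting is
        # done by the wrapped tokenizer.
        return self.tokenizer(datapoint)
```

Note that this module imports and runs even in an environment where transformers is not installed; only mypy ever resolves the string annotation.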
LGTM. Feel free to merge in once CI looks good
Pull Request Template
Description
I am adding support for using hugging-face tokenizers in DeepChem in this pull request. To this end, I have added a class HuggingFaceFeaturizer which allows the use of hugging-face tokenizers through a has-a (composition) relationship. Corresponding tests have also been added.
Type of change
Please check the option that is related to your PR.
Checklist
- Run yapf -i <modified file> and check no errors (yapf version must be 0.32.0)
- Run mypy -p deepchem and check no errors
- Run flake8 <modified file> --count and check no errors
- Run python -m doctest <modified file> and check no errors