Fix RobertaFeaturizer #3476 #3898
base: master
Conversation
Hey, it seems there are some failing tests (see here). Could you take a look at them?
Modified the unit tests in
The changes look good to me, and the failures are not related to this PR. I request @rbharath for a second pass.
I have a few questions on the changes below
deepchem/feat/roberta_tokenizer.py
Outdated
def __call__(self, *args, **kwargs) -> Dict[str, List[int]]:
    return super().__call__(*args, **kwargs)
def __call__(self,
             datapoints,
Can you add a type annotation here?
Please see the latest commit
deepchem/feat/roberta_tokenizer.py
Outdated
def __call__(self,
             datapoints,
             padding: bool = True,
             **kwargs) -> Dict[str, List[int]]:
Given we are not directly calling the superclass, let's add a brief docstring here
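A docstring along these lines might fit; this is only a sketch of the requested shape (the class name, parameter descriptions, and the `NotImplementedError` body are hypothetical stand-ins, not the PR's actual code):

```python
from typing import Dict, List


class RobertaFeaturizerSketch:
    """Hypothetical stand-in, used only to illustrate the docstring shape."""

    def __call__(self,
                 datapoints,
                 padding: bool = True,
                 **kwargs) -> Dict[str, List[int]]:
        """Tokenize a batch of SMILES strings.

        Parameters
        ----------
        datapoints: list of str
            SMILES strings to featurize.
        padding: bool, default True
            If True, pad every embedding to the length of the longest one.

        Returns
        -------
        The tokenizer output for the batch.
        """
        raise NotImplementedError("illustrative sketch only")
```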
Please see the latest commit
@@ -16,8 +16,7 @@ def test_smiles_call():
        add_special_tokens=True,
        truncation=True)
    for emb in [embedding, embedding_long]:
        assert 'input_ids' in emb.keys() and 'attention_mask' in emb.keys()
Why are we removing asserts here?
Because the `__call__` function in `RobertaFeaturizer` now returns the `input_ids` only, which are the actual embeddings of the molecules. The reason is that `_featurize_shard()` in `CSVLoader` is called later, and its `features` list needs only the embeddings, not the attention masks (see here). This is why I modified the `__call__` in `RobertaFeaturizer` so that it returns the `input_ids` only, unlike the `__call__` in the inherited HuggingFace `RobertaTokenizerFast`, which returns the whole dict (`input_ids` and `attention_mask`).

So the asserts are removed because `attention_mask` is not returned anymore, and the test now asserts that the lists of `input_ids` for all the molecules have the same length after specifying `padding=True`.
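The return-only-`input_ids` plus padding behavior described here can be sketched with a minimal toy example (plain Python, no HuggingFace dependency; `ToyTokenizer`, `ToyFeaturizer`, and the character-code "token ids" are hypothetical stand-ins for `RobertaTokenizerFast` and `RobertaFeaturizer`):

```python
from typing import Dict, List


class ToyTokenizer:
    """Stand-in for the superclass: returns the whole dict."""

    def __call__(self, datapoints: List[str],
                 **kwargs) -> Dict[str, List[List[int]]]:
        # Fake "token ids": one id per character of each SMILES string.
        ids = [[ord(c) for c in s] for s in datapoints]
        masks = [[1] * len(seq) for seq in ids]
        return {"input_ids": ids, "attention_mask": masks}


class ToyFeaturizer(ToyTokenizer):
    """Stand-in for the modified featurizer: input_ids only, padded."""

    def __call__(self, datapoints, padding: bool = True,
                 **kwargs) -> List[List[int]]:
        output = super().__call__(datapoints, **kwargs)
        ids = output["input_ids"]
        if padding:
            # Right-pad every sequence to the longest one's length.
            max_len = max(len(seq) for seq in ids)
            ids = [seq + [0] * (max_len - len(seq)) for seq in ids]
        return ids


features = ToyFeaturizer()(["CCO", "C"])
# With padding=True, every molecule's id list has the same length.
assert len(set(len(seq) for seq in features)) == 1
```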
Description

Fix #3476

- Convert `datapoints` to `List` if it is a `pd.Series`
- Add `padding` to pad the embeddings to the same length

Type of change

Please check the option that is related to your PR.
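The `pd.Series`-to-`List` conversion mentioned in the description can be sketched as a small helper (a hypothetical standalone function for illustration; the PR likely does this inline in `__call__`):

```python
import pandas as pd


def ensure_list(datapoints):
    # CSVLoader may hand the featurizer a pd.Series rather than a list,
    # so normalize to a plain list before tokenizing.
    if isinstance(datapoints, pd.Series):
        return datapoints.tolist()
    return datapoints


assert ensure_list(pd.Series(["CCO", "C"])) == ["CCO", "C"]
```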
Checklist

- Run `yapf -i <modified file>` and check no errors (yapf version must be 0.32.0)
- Run `mypy -p deepchem` and check no errors
- Run `flake8 <modified file> --count` and check no errors
- Run `python -m doctest <modified file>` and check no errors