Fix RobertaFeaturizer #3476 (#3898)

Open

quincylin1 wants to merge 5 commits into master
Conversation

@quincylin1 (Contributor) commented Mar 15, 2024

Description

Fix #3476

Convert datapoints to a List if it is a pd.Series, and add padding so that the embeddings for all datapoints have the same length.
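
For context, a minimal sketch of the change described above (the base class shown and the exact body are assumptions from this thread, not the committed code):

```python
import pandas as pd
from transformers import RobertaTokenizerFast


class RobertaFeaturizer(RobertaTokenizerFast):

    def __call__(self, datapoints, padding: bool = True, **kwargs):
        # A pd.Series (e.g. a SMILES column handed over by CSVLoader) is
        # converted to a plain list so the tokenizer sees a batch of strings.
        if isinstance(datapoints, pd.Series):
            datapoints = datapoints.tolist()
        # padding=True pads every sequence in the batch to the same length,
        # so the resulting embeddings can be stacked downstream.
        return super().__call__(datapoints, padding=padding, **kwargs)
```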

Type of change

Please check the option that is related to your PR.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • In this case, we recommend discussing your modification in a GitHub issue before creating the PR
  • Documentation (modifications to documents)

Checklist

  • My code follows the style guidelines of this project
    • Run yapf -i <modified file> and check no errors (yapf version must be 0.32.0)
    • Run mypy -p deepchem and check no errors
    • Run flake8 <modified file> --count and check no errors
    • Run python -m doctest <modified file> and check no errors
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New unit tests pass locally with my changes
  • I have checked my code and corrected any misspellings

@arunppsg (Contributor)

Hey, it seems there are some failing tests (see here). Could you take a look at them?

@quincylin1 (Contributor, Author)

> Hey, it seems there are some failing tests (see here). Could you take a look at them?

Modified the unit tests in test_roberta_tokenizer.py slightly. Please let me know if there's any issue :)

@arunppsg (Contributor)

The changes look good to me, and the failures are not related to this PR. I request @rbharath for a second pass.

@rbharath (Member) left a comment


I have a few questions on the changes below

-def __call__(self, *args, **kwargs) -> Dict[str, List[int]]:
-    return super().__call__(*args, **kwargs)
+def __call__(self,
+             datapoints,
@rbharath (Member)

Can you add a type annotation here?

@quincylin1 (Contributor, Author)

Please see the latest commit

def __call__(self,
             datapoints,
             padding: bool = True,
             **kwargs) -> Dict[str, List[int]]:
@rbharath (Member)

Given we are not directly calling the superclass, let's add a brief docstring here

@quincylin1 (Contributor, Author)

Please see the latest commit
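
For readers without the commit at hand, a docstring along these lines would address the request (illustrative wording, not the exact committed text; `Dict` and `List` come from `typing` as in the diff above):

```python
def __call__(self,
             datapoints,
             padding: bool = True,
             **kwargs) -> Dict[str, List[int]]:
    """Tokenize a batch of SMILES strings.

    Unlike the inherited RobertaTokenizerFast.__call__, this override
    also accepts a pd.Series and pads all sequences to a common length.

    Parameters
    ----------
    datapoints: list or pd.Series
        SMILES strings to tokenize.
    padding: bool, default True
        If True, pad every sequence in the batch to the same length.
    """
```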

@@ -16,8 +16,7 @@ def test_smiles_call():
                          add_special_tokens=True,
                          truncation=True)
     for emb in [embedding, embedding_long]:
-        assert 'input_ids' in emb.keys() and 'attention_mask' in emb.keys()
@rbharath (Member)

Why are we removing asserts here?

@quincylin1 (Contributor, Author)

Because the __call__ function in RobertaFeaturizer now returns only the input_ids, which are the actual embeddings of the molecules. The reason is that _featurize_shard() in CSVLoader is called later, and its features list needs only the embeddings, not the attention masks (see here). This is why I modified __call__ in RobertaFeaturizer so that it returns only the input_ids, unlike __call__ in the inherited Hugging Face RobertaTokenizerFast, which returns the whole dict (input_ids and attention_mask).

So the asserts are removed since attention_mask is no longer returned, and the test now asserts that the lists of input_ids for all the molecules have the same length after specifying padding=True.
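
To make the resulting behavior concrete, here is a hedged sketch of a test in that spirit (the import path, checkpoint name, and SMILES strings are illustrative, assuming the featurizer now returns only the padded input_ids as described above):

```python
from deepchem.feat import RobertaFeaturizer  # import path assumed


def test_smiles_call_padded():
    # Checkpoint name is illustrative; any RoBERTa SMILES tokenizer works.
    featurizer = RobertaFeaturizer.from_pretrained(
        "seyonec/PubChem10M_SMILES_BPE_450k")
    smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO"]
    input_ids = featurizer(smiles, padding=True)
    # With padding=True every molecule's input_ids share one length,
    # so they can be stacked into a single feature matrix.
    assert len({len(ids) for ids in input_ids}) == 1
```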

@quincylin1 (Contributor, Author)

@arunppsg @rbharath any chance to review this PR? Thanks!


Successfully merging this pull request may close these issues:

  • How to use RobertaFeaturizer in data loading