
[WIP] Smiles Tokenizer in dc.feat #2113

Merged: 68 commits from seyonechithrananda:chemberta-tutorial into deepchem:master on Sep 3, 2020

Conversation

seyonechithrananda (Member):

Added a new tokenizer for SMILES based on the RXNFP tokenizer. See #2076 for more details.

The tokenizer loads its vocab from the vocab.txt found in 'deepchem/feat/tests/data/'; is there a better place to put this?

Will also add the import guard shortly.

@rbharath @peastman

rbharath (Member) left a comment:

This looks good! I've made a number of small comments, mainly requests about documentation style and type annotations that should be relatively easy to fix.

deepchem/feat/smiles_tokenizer.py (outdated):
# export
SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

def get_default_tokenizer():
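
As a quick illustration of how this pattern splits a SMILES string (a minimal sketch: the regex is copied verbatim from the diff above; the example molecule is illustrative):

import re

SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

# Multi-character tokens such as Cl, Br, and bracketed atoms like [C@H]
# survive as single tokens; everything else splits character by character.
tokens = re.findall(SMI_REGEX_PATTERN, "CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']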
Member:

Could you add a docstring here?

Member Author:

done!

Member:

I might be missing it, but looks like get_default_tokenizer() doesn't have a docstring yet?

Member Author:

Whoops, I confused docstrings with PEP8 code styling lol

Member:

Looks like the docstring for get_default_tokenizer() still needs to be added here

deepchem/feat/smiles_tokenizer.py (outdated):
"""Converts an index (integer) in a token (string/unicode) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)

def convert_tokens_to_string(self, tokens):
Member:

Could you switch to NumPy doc style and add type annotations for these other methods as well?
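
For reference, the requested NumPy doc style with annotations might look roughly like this on convert_tokens_to_string (a sketch only; the body shown assumes the method simply space-joins tokens and is illustrative, not the PR's exact implementation):

from typing import List

def convert_tokens_to_string(self, tokens: List[str]) -> str:
  """Converts a sequence of SMILES tokens into a single string.

  Parameters
  ----------
  tokens: List[str]
    The tokens to join.

  Returns
  -------
  str
    The tokens joined by single spaces.
  """
  return " ".join(tokens).strip()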

Member:

Type annotations are still missing here and on some of the other methods.

deepchem/feat/smiles_tokenizer.py:
model.num_parameters()

tokenizer = SmilesTokenizer(vocab_path, max_len=model.config.max_position_embeddings)
print(tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1"))
Member:

Could you add some assertions to this test? It would be good to check that the tokenizer behaves correctly and to have assertions guarding that behavior.
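
For example, something along these lines (a hypothetical sketch; it reuses the tokenizer from the test snippet above, and the expected split assumes the regex-based tokenization shown earlier):

# Hypothetical assertion: check the exact token split for a short SMILES
# string instead of just printing the encoding.
assert tokenizer.tokenize("CC(=O)O") == ['C', 'C', '(', '=', 'O', ')', 'O']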

Member:

New assert looks good! Could you remove the print statement now that we have the assert in?

deepchem/feat/smiles_tokenizer.py:
r"""
Constructs a SmilesTokenizer.
Bulk of code is from https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp
Args:
Member:

Could you add a usage example here showing how to invoke the tokenizer?

Member Author:

Will do. I believe the test has an example use case we can port over here.

Member Author:

done

seyonechithrananda (Member Author):

@rbharath thanks for the comments! Will look over them in further detail tomorrow and address them.


class SmilesTokenizer(BertTokenizer):
r"""
Constructs a SmilesTokenizer.
Contributor:

This is the documentation for the class, not the constructor.

# mask_token="[MASK]",
**kwargs
):
"""Constructs a BertTokenizer.
Contributor:

That should be SmilesTokenizer.

Member Author:

fixed

peastman (Contributor):

A lot more documentation would be good. What is a SmilesTokenizer, what do you use it for, what algorithm does it use, how do you use it, etc. A lot of the current docstrings are also pretty uninformative. Like

"""Run basic SMILES tokenization"""

or

"""
Adds special tokens to the a sequence for sequence classification tasks.
A BERT sequence has the following format: [CLS] X [SEP]
"""

If you already understand what the function does, I assume that makes sense. If you don't already understand, it leaves you just as confused as before!
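
For context, the "[CLS] X [SEP]" layout the second docstring refers to is the standard single-sequence BERT input; roughly (token strings shown instead of ids, purely illustrative):

tokens = ['C', 'C', 'O']                         # tokenized SMILES "CCO"
single_sequence = ['[CLS]'] + tokens + ['[SEP]']
# ['[CLS]', 'C', 'C', 'O', '[SEP]']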

rbharath (Member) left a comment:

Looking good! I have a few more minor documentation-related comments.

.. [1] Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David;
Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural
Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3
Note
Member:

This should be "Notes" instead of "Note"

nissy-dev (Member) commented Sep 2, 2020:

@rbharath Thank you for your summary! I really appreciate it 🙇‍♂️

If I'm understanding @seyonechithrananda's point, the main reason to migrate the code into DeepChem is that rxnfp isn't actively being developed further, so it's unlikely to see future improvements. In this case, since we anticipate using tokenizers more heavily moving forward, I think it would make sense to add a DeepChem implementation based off rxnfp's code.

If there is a strong chance that we will need to customize the original rxnfp SmilesTokenizer in the future, I can agree.

It makes sense to me that we should defer to HuggingFace's API since they're the established leader in this area. But we should have a clear API for tokenizing large datasets. If we decide to follow HuggingFace's lead, we don't need to make tokenizers inherit from MolecularFeaturizer, but we should add a section to tokenizers.rst explaining how to tokenize a large dataset HuggingFace-style and load it into DeepChem.

What do you folks think would be a clean API for tokenization here?

In my view, SmilesTokenizer should inherit from both MolecularFeaturizer and BertTokenizer.

class SmilesTokenizer(BertTokenizer, MolecularFeaturizer):

  def __init__(self,
               vocab_file: str = '',
               **kwargs):
    ....

  def _featurize(self, mol: RDKitMol) -> np.ndarray:
    from rdkit import Chem
    smiles = Chem.MolToSmiles(mol)
    return np.array(self.encode(smiles))

  def .....

A TokenizingFeaturizer class would also be fine with me, but it would require implementing more code.

An alternative would be to do it through encapsulation, for example with a TokenizingFeaturizer class that takes a tokenizer as an argument and uses it to perform featurization. I can see advantages to both approaches.
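
Under the inheritance design sketched above, usage might look something like this (a hedged sketch; it assumes the merged class above, and the vocab path is illustrative):

from rdkit import Chem

# Assumes the combined SmilesTokenizer/MolecularFeaturizer class sketched
# above and an on-disk vocab file.
featurizer = SmilesTokenizer(vocab_file='vocab.txt')
mols = [Chem.MolFromSmiles(s) for s in ['CCO', 'c1ccccc1']]
features = featurizer.featurize(mols)  # each row: np.array(self.encode(smiles))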

rbharath (Member) commented Sep 2, 2020:

@nd-02110114 I really like this design! It's a lightweight addition that would let us use SmilesTokenizer in both HuggingFace's and DeepChem's pipelines. @seyonechithrananda What do you think of this API design? I believe it would just require adding the single extra _featurize method that @nd-02110114 has implemented above, plus a test case for invoking the _featurize() method. I'd be OK doing this in a follow-on PR if we agree on the design.

@seyonechithrananda It looks like yapf needs to be run and it looks like test_tokenize is failing in the test suite. Would you be able to fix these failures? No more code change requests on my end for this PR :)

seyonechithrananda (Member Author):

@rbharath I agree, I really like @nd-02110114's suggested design. I'll definitely make a follow-up PR with this design once I have some more time, probably around mid-September (after the arXiv release). That way, we can use more of the DeepChem pipeline (currently, the SmilesTokenizer works well with HuggingFace but not as well with DeepChem). I'll run yapf again on the unit test and tokenizer script and look at why it's failing. Should be good to merge after!

seyonechithrananda (Member Author):

Fixed the CI issue; it's with a specific method, but I believe we can remove it, as it corresponds to WordPiece tokenization, which we don't use.

I also applied yapf formatting to the tokenizer and unit test, ran the unit test, and the assertion passes. Fingers crossed it passes now 🤞

seyonechithrananda (Member Author):

@rbharath I'm still getting an error with yapf formatting over __init__.py, but I ran a quick check and it's already been formatted properly. Any thoughts on how to fix this? Otherwise, the CI is passing.
[Screenshot: failing yapf CI check, Sep 2, 2020]

peastman (Contributor) commented Sep 3, 2020:

Make sure you're using yapf 0.22. If you have a different version it will format things slightly differently and the test won't pass.

seyonechithrananda (Member Author):

@peastman Will check! thanks :)

seyonechithrananda (Member Author) commented Sep 3, 2020:

@peastman I just verified that my version of yapf is indeed 0.22, no changes were suggested over the files though. Any ideas why this is still failing?

Additionally, when I run bash devtools/run_yapf.sh, the script which is throwing an error in the CI, the formatting test also passes.

rbharath (Member) commented Sep 3, 2020:

@seyonechithrananda Can you try running this sequence of yapf commands from the command line? The sequence I usually use is just

yapf --version # to check version
yapf -i /path/to/file
git add /path/to/file
git commit -m "My message"
git push origin branch # Use your remote/branch here

I'm not sure what's happening here, but perhaps something in your system setup is a little different, so it might be good to try the simplest setup and see if that helps.

seyonechithrananda (Member Author):

> Can you try running this sequence of yapf commands from the command line? […]

@rbharath
version: 0.11.1 (I installed 0.22 using pip install yapf==0.22)
When I run yapf -i on smiles_tokenizer.py, __init__.py, and test_smiles_tokenizer.py, no changes are reported.

rbharath (Member) commented Sep 3, 2020:

Hmm, that looks like the wrong yapf version. Here's what I see when I run the version command:

(deepchem) bharath@Bharaths-MBP tests % yapf --version
yapf 0.22.0

seyonechithrananda (Member Author):

@rbharath Yeah, kinda weird because I did a fresh install of yapf before running again. Will try with a new environment. Did you use the pip command or clone yapf?

rbharath (Member) commented Sep 3, 2020:

I used pip, but it might be worth running which yapf to make sure you're using the pip-installed yapf in your environment.

seyonechithrananda (Member Author):

@rbharath Just fixed it with the fresh install! Adding changes now

seyonechithrananda (Member Author):

Let's hope it finally passes now haha, thanks so much for the help @rbharath

coveralls (Coverage Status):

Coverage decreased (-0.03%) to 77.843% when pulling 1314bd1 on seyonechithrananda:chemberta-tutorial into 3d257a0 on deepchem:master.

rbharath (Member) commented Sep 3, 2020:

Travis is now green, so I'm going to merge this in! Congrats on getting the new feature merged, @seyonechithrananda :)

In future PRs, we should do further work to integrate tokenizers more closely into DeepChem, as indicated in the discussion above.

rbharath merged commit 9a76353 into deepchem:master on Sep 3, 2020.