
[WIP] Smiles Tokenizer in dc.feat #2113

Merged: 68 commits from seyonechithrananda:chemberta-tutorial into deepchem:master on Sep 3, 2020

Conversation

seyonechithrananda (Member):

Added a new tokenizer for SMILES based on the RXNFP tokenizer. See #2076 for more details.

The tokenizer loads its vocab from the vocab.txt found in 'deepchem/feat/tests/data/'; is there a better place to put this?

Will also add the import guard shortly.

@rbharath @peastman

rbharath (Member) left a comment:

This looks good! I've made a number of small comments, mainly requests about documentation style and type annotations that should be relatively easy to fix.

deepchem/feat/smiles_tokenizer.py (outdated):
# export
SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

def get_default_tokenizer():
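
As a quick illustration of how this pattern splits a SMILES string (a minimal sketch: the regex is copied verbatim from the diff above; the example molecule is illustrative):

import re

SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

# Multi-character tokens such as Cl, Br, and bracketed atoms like [C@H]
# survive as single tokens; everything else splits character by character.
tokens = re.findall(SMI_REGEX_PATTERN, "CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']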
Member:

Could you add a docstring here?

Member Author:

done!

Member:

I might be missing it, but looks like get_default_tokenizer() doesn't have a docstring yet?

Member Author:

Whoops, I confused docstrings with PEP8 code styling lol

Member:

Looks like the docstring for get_default_tokenizer() still needs to be added here

deepchem/feat/smiles_tokenizer.py (outdated):
"""Converts an index (integer) in a token (string/unicode) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)

def convert_tokens_to_string(self, tokens):
Member:

Could you switch to NumPy doc style and add type annotations for these other methods as well?
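
For reference, the requested NumPy doc style with annotations might look roughly like this on convert_tokens_to_string (a sketch only; the body shown assumes the method simply space-joins tokens and is illustrative, not the PR's exact implementation):

from typing import List

def convert_tokens_to_string(self, tokens: List[str]) -> str:
  """Converts a sequence of SMILES tokens into a single string.

  Parameters
  ----------
  tokens: List[str]
    The tokens to join.

  Returns
  -------
  str
    The tokens joined by single spaces.
  """
  return " ".join(tokens).strip()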

Member:

Type annotations are still missing here and on some of the other methods.

deepchem/feat/smiles_tokenizer.py:
model.num_parameters()

tokenizer = SmilesTokenizer(vocab_path, max_len=model.config.max_position_embeddings)
print(tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1"))
Member:

Could you add some assertions to this test? It would be good to check that the tokenizer behaves correctly and to have assertions guarding that behavior.
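
For example, something along these lines (a hypothetical sketch; it reuses the tokenizer from the test snippet above, and the expected split assumes the regex-based tokenization shown earlier):

# Hypothetical assertion: check the exact token split for a short SMILES
# string instead of just printing the encoding.
assert tokenizer.tokenize("CC(=O)O") == ['C', 'C', '(', '=', 'O', ')', 'O']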

Member:

New assert looks good! Could you remove the print statement now that we have the assert in?

deepchem/feat/smiles_tokenizer.py:
r"""
Constructs a SmilesTokenizer.
Bulk of code is from https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp
Args:
Member:

Could you add a usage example here showing how to invoke the tokenizer?

Member Author:

Will do. I believe the test has an example use case we can port over here.

Member Author:

done

seyonechithrananda (Member Author):

@rbharath thanks for the comments! Will look over them in further detail tomorrow and address them.


class SmilesTokenizer(BertTokenizer):
r"""
Constructs a SmilesTokenizer.
Contributor:

This is the documentation for the class, not the constructor.

# mask_token="[MASK]",
**kwargs
):
"""Constructs a BertTokenizer.
Contributor:

That should be SmilesTokenizer.

Member Author:

fixed

peastman (Contributor):

A lot more documentation would be good. What is a SmilesTokenizer, what do you use it for, what algorithm does it use, how do you use it, etc. A lot of the current docstrings are also pretty uninformative. Like

"""Run basic SMILES tokenization"""

or

"""
Adds special tokens to the a sequence for sequence classification tasks.
A BERT sequence has the following format: [CLS] X [SEP]
"""

If you already understand what the function does, I assume that makes sense. If you don't already understand, it leaves you just as confused as before!
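
For context, the "[CLS] X [SEP]" layout the second docstring refers to is the standard single-sequence BERT input; roughly (token strings shown instead of ids, purely illustrative):

tokens = ['C', 'C', 'O']                         # tokenized SMILES "CCO"
single_sequence = ['[CLS]'] + tokens + ['[SEP]']
# ['[CLS]', 'C', 'C', 'O', '[SEP]']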

rbharath (Member) left a comment:

Looking good! I have a few more minor documentation-related comments.

.. [1] Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David;
Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural
Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3
Note
Member:

This should be "Notes" instead of "Note"

nissy-dev (Member) commented Sep 2, 2020:

@rbharath Thank you for your summary! I really appreciate it 🙇‍♂️

If I'm understanding @seyonechithrananda's point, the main reason to migrate the code into DeepChem is that rxnfp isn't actively being developed further, so it's unlikely to see future improvements. In this case, since we anticipate using tokenizers more heavily moving forward, I think it would make sense to add a DeepChem implementation based off rxnfp's code.

If there is a strong chance that we will need to customize the original rxnfp SmilesTokenizer in the future, I can agree.

It makes sense to me that we should defer to HuggingFace's API since they're the established leader in this area. But we should have a clear API for tokenizing large datasets. If we decide to follow HuggingFace's lead, we don't need to make tokenizers inherit from MolecularFeaturizer, but we should add a section to tokenizers.rst explaining how to tokenize a large dataset HuggingFace-style and load it into DeepChem.

What do you folks think would be a clean API for tokenization here?

In my view, SmilesTokenizer should inherit from both MolecularFeaturizer and BertTokenizer.

class SmilesTokenizer(BertTokenizer, MolecularFeaturizer):

  def __init__(self,
               vocab_file: str = '',
               **kwargs):
    ....

  def _featurize(self, mol: RDKitMol) -> np.ndarray:
    from rdkit import Chem
    smiles = Chem.MolToSmiles(mol)
    return np.array(self.encode(smiles))

  def .....

A TokenizingFeaturizer class would also be fine with me, but it would require implementing more code.

An alternative would be to do it through encapsulation, for example with a TokenizingFeaturizer class that takes a tokenizer as an argument and uses it to perform featurization. I can see advantages to both approaches.
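
Under the inheritance design sketched above, usage might look something like this (a hedged sketch; it assumes the merged class above, and the vocab path is illustrative):

from rdkit import Chem

# Assumes the combined SmilesTokenizer/MolecularFeaturizer class sketched
# above and an on-disk vocab file.
featurizer = SmilesTokenizer(vocab_file='vocab.txt')
mols = [Chem.MolFromSmiles(s) for s in ['CCO', 'c1ccccc1']]
features = featurizer.featurize(mols)  # each row: np.array(self.encode(smiles))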

rbharath (Member) commented Sep 2, 2020:

@nd-02110114 I really like this design! It's a lightweight addition that would let us use SmilesTokenizer in both HuggingFace's and DeepChem's pipelines. @seyonechithrananda What do you think of this API design? I believe it would just require adding the single extra _featurize method that @nd-02110114 has implemented above, plus a test case for invoking the _featurize() method. I'd be OK doing this in a follow-on PR if we agree on the design.

@seyonechithrananda It looks like yapf needs to be run and it looks like test_tokenize is failing in the test suite. Would you be able to fix these failures? No more code change requests on my end for this PR :)

seyonechithrananda (Member Author):

@rbharath I agree, I really like @nd-02110114's suggested design. I'll definitely make a follow-up PR with this design once I have some more time, probably around mid-September (after the arXiv release). That way, we can use more of the DeepChem pipeline (currently, the SmilesTokenizer works well with HuggingFace but not as well with DeepChem). I'll run yapf again on the unit test and tokenizer script and look at why it's failing. Should be good to merge after!

seyonechithrananda (Member Author):

Fixed the CI issue; it's with a specific method, but I believe we can remove it, as it corresponds to WordPiece tokenization, which we don't use.

I also applied yapf formatting to the tokenizer and unit test, ran the unit test, and the assertion passes. Fingers crossed it passes now 🤞

seyonechithrananda (Member Author):

@rbharath I'm still getting an error with yapf formatting over __init__.py, but I ran a quick check and it's already been formatted properly. Any thoughts on how to fix this? Otherwise, the CI is passing.
[Screenshot: failing yapf CI check, Sep 2, 2020]

peastman (Contributor) commented Sep 3, 2020:

Make sure you're using yapf 0.22. If you have a different version it will format things slightly differently and the test won't pass.

seyonechithrananda (Member Author):

@peastman Will check! thanks :)

seyonechithrananda (Member Author) commented Sep 3, 2020:

@peastman I just verified that my version of yapf is indeed 0.22, no changes were suggested over the files though. Any ideas why this is still failing?

Additionally, when I run bash devtools/run_yapf.sh, the script which is throwing an error in the CI, the formatting test also passes.

rbharath (Member) commented Sep 3, 2020:

@seyonechithrananda Can you try running this sequence of yapf commands from the command line? The sequence I usually use is just

yapf --version # to check version
yapf -i /path/to/file
git add /path/to/file
git commit -m "My message"
git push origin branch # Use your remote/branch here

I'm not sure what's happening here, but perhaps something in your system setup is a little different, so it might be good to try the simplest setup and see if that helps.

seyonechithrananda (Member Author):

> Can you try running this sequence of yapf commands from the command line? […]

@rbharath
version: 0.11.1 (I installed 0.22 using pip install yapf==0.22)
When I run yapf -i on smiles_tokenizer.py, __init__.py, and test_smiles_tokenizer.py, no changes are reported.

rbharath (Member) commented Sep 3, 2020:

Hmm, that looks like the wrong yapf version. Here's what I see when I run the version command:

(deepchem) bharath@Bharaths-MBP tests % yapf --version
yapf 0.22.0

seyonechithrananda (Member Author):

@rbharath Yeah, kinda weird because I did a fresh install of yapf before running again. Will try with a new environment. Did you use the pip command or clone yapf?

rbharath (Member) commented Sep 3, 2020:

I used pip, but it might be worth running which yapf to make sure you're using the pip-installed yapf in your environment.

seyonechithrananda (Member Author):

@rbharath Just fixed it with the fresh install! Adding changes now

seyonechithrananda (Member Author):

Let's hope it finally passes now haha, thanks so much for the help @rbharath

coveralls (Coverage Status):

Coverage decreased (-0.03%) to 77.843% when pulling 1314bd1 on seyonechithrananda:chemberta-tutorial into 3d257a0 on deepchem:master.

rbharath (Member) commented Sep 3, 2020:

Travis is now green, so I'm going to merge this in! Congrats on getting the new feature merged, @seyonechithrananda :)

In future PRs, we should do further work to integrate tokenizers more closely into DeepChem, as indicated in the discussion above.

rbharath merged commit 9a76353 into deepchem:master on Sep 3, 2020.