[WIP] Smiles Tokenizer in dc.feat #2113
Conversation
This looks good! I've made a number of small comments, mainly requests about documentation style and type annotations that should be relatively easy to fix.
# export
SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"


def get_default_tokenizer():
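For context, the regex above splits a SMILES string into chemically meaningful tokens: bracket atoms, two-letter halogens, bonds, branches, and ring-closure digits. A minimal sketch of how it tokenizes, assuming only the pattern from the diff (the helper name is hypothetical):

```python
import re

# Pattern copied from the diff above; splits SMILES into atom/bond/ring tokens.
SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"


def tokenize_smiles(smiles):
  """Split a SMILES string into tokens using the regex (hypothetical helper)."""
  return re.findall(SMI_REGEX_PATTERN, smiles)


print(tokenize_smiles("CC(=O)O"))  # acetic acid -> ['C', 'C', '(', '=', 'O', ')', 'O']
```

Note that the `Br?` and `Cl?` branches keep the two-letter halogens as single tokens while still matching bare B and C atoms.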
Could you add a docstring here?
done!
I might be missing it, but it looks like get_default_tokenizer() doesn't have a docstring yet?
Whoops, confused the docstring with PEP 8 code styling lol
Looks like the docstring for get_default_tokenizer() still needs to be added here.
deepchem/feat/smiles_tokenizer.py
Outdated
"""Converts an index (integer) in a token (string/unicode) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)


def convert_tokens_to_string(self, tokens):
Could you switch to Numpy doc style and add type annotations for these other methods as well?
Type annotation still missing here and on some of the other methods
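To make the request concrete, here is a sketch of the NumPy doc style plus type annotations the review asks for, applied to `convert_tokens_to_string`. The body is a simple stand-in, not the PR's actual implementation:

```python
from typing import List


def convert_tokens_to_string(tokens: List[str]) -> str:
  """Convert a list of tokens back into a single space-joined string.

  Parameters
  ----------
  tokens: List[str]
    The tokens to join, e.g. the output of the tokenizer.

  Returns
  -------
  str
    The tokens joined by single spaces.
  """
  return " ".join(tokens)
```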
model.num_parameters()


tokenizer = SmilesTokenizer(vocab_path, max_len=model.config.max_position_embeddings)
print(tokenizer.encode("CCC(CC)COC(=O)[C@H](C)N[P@](=O)(OC[C@H]1O[C@](C#N)([C@H](O)[C@@H]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1"))
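As a sketch of the assertion style the review asks for, here is the encode/round-trip check pattern with a toy vocab standing in for SmilesTokenizer (all names here are hypothetical; the real tokenizer loads its vocab from vocab.txt):

```python
# Toy vocab and helpers standing in for SmilesTokenizer.
vocab = {"C": 0, "O": 1, "(": 2, ")": 3, "=": 4}
inv_vocab = {i: t for t, i in vocab.items()}

tokens = ["C", "C", "(", "=", "O", ")", "O"]
ids = [vocab[t] for t in tokens]

# Assert exact ids and a lossless round trip instead of printing.
assert ids == [0, 0, 2, 4, 1, 3, 1]
assert [inv_vocab[i] for i in ids] == tokens
```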
Could you add some assertions to this test? It would be good to check that the tokenizer has correct behavior and to guard it with assertions.
New assert looks good! Could you remove the print statement now that we have the assert in?
deepchem/feat/smiles_tokenizer.py
Outdated
r"""
Constructs a SmilesTokenizer.
Bulk of code is from https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp
Args:
Could you add a usage example here showing how to invoke the tokenizer?
Will do, I believe the test has an example use-case we can port over here.
done
@rbharath thanks for the comments! Will look over them in further detail tomorrow and address them.
deepchem/feat/smiles_tokenizer.py
Outdated
class SmilesTokenizer(BertTokenizer):
  r"""
  Constructs a SmilesTokenizer.
This is the documentation for the class, not the constructor.
deepchem/feat/smiles_tokenizer.py
Outdated
# mask_token="[MASK]",
**kwargs
):
"""Constructs a BertTokenizer.
That should be SmilesTokenizer.
fixed
A lot more documentation would be good. What is a SmilesTokenizer, what do you use it for, what algorithm does it use, how do you use it, etc. A lot of the current docstrings are also pretty uninformative, like """Run basic SMILES tokenization""" or """Adds special tokens to the a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]""". If you already understand what the function does, I assume that makes sense. If you don't already understand, it leaves you just as confused as before!
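To illustrate the "[CLS] X [SEP]" format mentioned in the comment above, a minimal sketch (function name and docstring are hypothetical, not the PR's actual method):

```python
def add_special_tokens_single_sequence(tokens):
  """Wrap a token list in BERT-style markers.

  [CLS] marks the start of the sequence (its embedding is used for
  classification), and [SEP] marks the end of the sequence.
  """
  return ["[CLS]"] + tokens + ["[SEP]"]


print(add_special_tokens_single_sequence(["C", "C", "O"]))
# -> ['[CLS]', 'C', 'C', 'O', '[SEP]']
```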
Looking good! Have a few more minor documentation related comments
deepchem/feat/smiles_tokenizer.py
Outdated
.. [1] Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David;
   Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural
   Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3
Note
This should be "Notes" instead of "Note"
@rbharath Thank you for your summary! I really appreciate it 🙇‍♂️
If there is a high possibility of customizing the original rxnfp SmilesTokenizer in the future, I can agree.
In my opinion, SmilesTokenizer should inherit from both MolecularFeaturizer and BertTokenizer:

class SmilesTokenizer(BertTokenizer, MolecularFeaturizer):
  def __init__(
      self,
      vocab_file: str = '',
      **kwargs):
    ....

  def _featurize(self, mol: RDKitMol) -> np.ndarray:
    from rdkit import Chem
    smiles = Chem.MolToSmiles(mol)
    return np.array(self.encode(smiles))

  def .....

And the TokenizingFeaturizer class is also good for me, but we would need to implement more code.
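The proposed multiple-inheritance design can be sketched with stub base classes standing in for transformers' BertTokenizer and DeepChem's MolecularFeaturizer (everything here is illustrative, not the real APIs; the toy encoding replaces the vocab-based one):

```python
class BertTokenizerStub:
  def encode(self, text):
    # Toy encoding: one id per character (the real tokenizer is vocab-based).
    return [ord(c) for c in text]


class MolecularFeaturizerStub:
  def featurize(self, mols):
    # DeepChem-style featurizers loop over inputs and call _featurize on each.
    return [self._featurize(m) for m in mols]


class SmilesTokenizer(BertTokenizerStub, MolecularFeaturizerStub):
  def _featurize(self, smiles):
    # The real version would convert an RDKit mol to SMILES first.
    return self.encode(smiles)


tok = SmilesTokenizer()
print(tok.featurize(["CCO"]))  # -> [[67, 67, 79]]
```

The appeal of the design is that one class plugs into both pipelines: `featurize()` for DeepChem and `encode()` for Hugging Face.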
@nd-02110114 I really like this design! It's a lightweight addition that would let us use
@seyonechithrananda It looks like
@rbharath I agree, I really like @nd-02110114's suggested design. I'll definitely make a follow-up PR with this design once I have some more time, probably around mid-September (after the Arxiv release). That way, we can use more of the DeepChem pipeline (currently, the SmilesTokenizer works well with Hugging Face but not as well with DeepChem). Will run yapf again on the unit test + tokenizer script and look at why it's failing. Should be good to merge after!
Fixed the CI issue: it's with a specific method, but I believe we can remove it since it corresponds to WordPiece tokenization, which we don't use. I also added YAPF formatting to the tokenizer + unit test, ran the unit test, and the assertion passes. Fingers crossed it passes now 🤞
@rbharath I'm still getting an error with yapf formatting over the init.py, but I ran a quick check and it's already been formatted properly. Any thoughts on how to fix this? Otherwise, the CI is passing.
Make sure you're using yapf 0.22. If you have a different version, it will format things slightly differently and the test won't pass.
@peastman Will check! thanks :)
@peastman I just verified that my version of yapf is indeed 0.22, though no changes were suggested over the files. Any ideas why this is still failing? Additionally, when I run
@seyonechithrananda Can you try running this sequence of yapf commands from the command line? The sequence I usually use is just
I'm not sure what's happening here, but perhaps something in your system setup is a little different, so it might be good to try the simplest setup to see if that helps.
@rbharath
Hmm, that looks like the wrong yapf version. Here's what I see when I run the version command:
@rbharath Yeah, kinda weird because I did a fresh install of yapf before running again. Will try with a new environment. Did you use the pip command or clone yapf?
I used pip, but it might be worth running
@rbharath Just fixed it with the fresh install! Adding changes now
Let's hope it finally passes now haha, thanks so much for the help @rbharath
Travis is now green, so going to merge this in! Congrats on the new feature merged in, @seyonechithrananda :) We should do further work in future PRs to integrate tokenizers more closely into DeepChem, as indicated in the discussion above.
Added a new tokenizer for SMILES based on the RXNFP tokenizer. See #2076 for more details.
The tokenizer loads its vocab from the vocab.txt found in 'deepchem/feat/tests/data/'; is there a better place to put this?
Will also add the import guard shortly.
@rbharath @peastman