bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.

Right now, the notebooks all use a RoBERTa model (a variant of BERT) trained on masked-language modelling (MLM) of SMILES strings. Training ran for 5 epochs, until the loss converged to around 0.39. The trained model weights are available on HuggingFace. I hope this is of use to developers, students, and researchers exploring the use of transformers and the attention mechanism for chemistry!

You can load the tokenizer + model for MLM prediction tasks using the following code:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for MLM.
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Build a fill-mask pipeline for predicting masked tokens in SMILES strings.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
```
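For example, you can mask one token of a SMILES string and ask the model to complete it. A minimal sketch (the aspirin SMILES and the printed fields are illustrative, not taken from the repository's notebooks):

```python
# Mask the final token of aspirin's SMILES and predict candidates for it.
smiles = "CC(=O)Oc1ccccc1C(=O)" + tokenizer.mask_token
for prediction in fill_mask(smiles):
    print(prediction["token_str"], prediction["score"])
```

Each prediction is a dict containing the proposed token and the model's confidence score for it.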

Todo:

  • Finish writing notebook to train model
  • Finish notebook to load the pretrained model and run predictions on a single molecule -> test that the HuggingFace weights work
  • Train RoBERTa model until convergence
  • Upload weights to HuggingFace
  • Create tutorial using evaluation + fine-tuning notebook (see the sketch after this list)
  • Create documentation + writing, visualizations for notebook
  • Set up PR into DeepChem
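As a preview of the fine-tuning item above, here is a minimal sketch of what fine-tuning the pretrained encoder for a downstream property-prediction task could look like. This is not the repository's notebook code; the SMILES strings, labels, and `num_labels=2` are placeholder assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reuse the pretrained ChemBERTa encoder with a fresh classification head
# (e.g. for a hypothetical binary property such as toxicity).
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "seyonec/ChemBERTa-zinc-base-v1", num_labels=2
)

# Placeholder batch: two SMILES strings with made-up 0/1 labels.
smiles = ["CCO", "c1ccccc1"]
labels = torch.tensor([0, 1])

inputs = tokenizer(smiles, padding=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one backward pass of a standard fine-tuning loop
```

In a real tutorial this would be wrapped in an optimizer loop (or the HuggingFace Trainer) over a labelled SMILES dataset.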
