

Project summary

This project uses open-source reaction data from the USPTO (pre-extracted by Daniel Lowe) to train a neural network model to predict the outcomes of organic reactions. Reaction templates are used to enumerate potential products; a neural network then scores each candidate product and ranks the likely outcomes. By examining thousands of experimental outcomes, the model learns which modes of reactivity are likely to occur. The full details can be found at

The code relies on Keras for its machine learning components, using the Theano backend. RDKit is used for all chemistry-related parsing and processing. Please note that due to the unique reaction representation used, generating candidate outcomes requires the modified RDKit version available at In the modified version, atom-mapping numbers associated with reactant molecules are preserved after calling RunReactants. The code is set up to use MongoDB to store reaction examples, transform strings, and candidate sets. A mongodump containing all data used in the project can be found at The database/collection names are defined in utils/

Generating templates

Reaction templates are extracted from ca. 1M atom-mapped reaction SMILES strings using data/ They are designed to be overly general, covering a broad range of chemistry at the expense of specificity. The extracted templates are included in the mongodump, so they do not need to be re-extracted.

Generating candidates

A forward enumeration algorithm generates plausible candidates for each set of reactants using data/ with the help of the main/ class. Reagents, catalysts, and solvents (if present) are allowed to react in addition to the reactants. This makes the prediction task artificially hard, since the reaction database already contains information about which atoms react, but it is reasonable given that role labelling was performed with knowledge of the reaction outcome. Candidates are inserted into MongoDB automatically.
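To illustrate what it means for reagents, catalysts, and solvents to react alongside the reactants, the sketch below pools the first two fields of a reaction SMILES string (`reactants>agents>products`) into a single list of species. The function name `pool_reacting_species` is hypothetical and is not part of this repository. Template application with RDKit, which the actual enumeration performs, is left out.

```python
def pool_reacting_species(reaction_smiles):
    """Split a reaction SMILES into (reacting species, recorded products).

    Hypothetical helper for illustration: reactants and agents are merged
    so that every species present is treated as able to react.
    """
    # Keep only the SMILES itself; drop any annotation after whitespace.
    core = reaction_smiles.strip().split()[0]
    fields = core.split(">")
    if len(fields) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    reactants, agents, products = fields

    def fragments(field):
        # Species within a field are separated by '.'; skip empty fields.
        return [s for s in field.split(".") if s]

    return fragments(reactants) + fragments(agents), fragments(products)


species, products = pool_reacting_species("CC(=O)Cl.OCC>c1ccncc1>CC(=O)OCC")
print(species)   # reactants followed by agents
print(products)  # recorded outcome, kept separate from the pool
```

Keeping the recorded products separate makes it possible to check later which enumerated candidate matches the true outcome.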

Preprocessing candidates

To prepare the data for training, data/ is used to generate the necessary atom-level descriptors for reactant molecules, which will be used in the edit-based representation. Data is pickled in a compressed format to reduce storage size and avoid file-read limitations, but is expanded during training and testing into its full many-tensor representation.
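The compressed pickling step can be sketched with the standard library alone. This is a minimal illustration that assumes gzip compression; the compression scheme and file layout the repository actually uses are not stated here, and the helper names `save_compressed` and `load_compressed` are hypothetical.

```python
import gzip
import pickle


def save_compressed(obj, path):
    """Pickle obj into a gzip-compressed file (assumed scheme, for illustration)."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object previously written by save_compressed."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)
```

Anything loaded this way would still need to be expanded into the full tensor representation before training, as described above.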

Model training/testing

Models are trained and tested using main/ Many command-line options are available to set different architecture/training parameters, including which fold of a 5-fold CV is being run. A demo model using just 10 reactions is included in main/output/10rxn_demo1.
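As a rough illustration of selecting a cross-validation fold from the command line, the sketch below uses `argparse` and a simple interleaved split. The flag name `--fold` and the splitting scheme are assumptions made for this example and may differ from the options the training script actually defines. Using `%(default)s` in the help text keeps the documented default in sync with the real one.

```python
import argparse


def build_parser():
    """Hypothetical parser with a single fold-selection option."""
    parser = argparse.ArgumentParser(description="Train/test on one CV fold")
    parser.add_argument(
        "--fold", type=int, default=1, choices=range(1, 6),
        help="fold of the 5-fold CV to hold out (default: %(default)s)",
    )
    return parser


def split_fold(n_examples, fold, n_folds=5):
    """Return (train_indices, test_indices), holding out every n_folds-th example.

    Assumed scheme for illustration: example i belongs to fold (i % n_folds) + 1.
    """
    test = [i for i in range(n_examples) if i % n_folds == fold - 1]
    train = [i for i in range(n_examples) if i % n_folds != fold - 1]
    return train, test


args = build_parser().parse_args(["--fold", "2"])
train_idx, test_idx = split_fold(10, args.fold)
print(train_idx, test_idx)
```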

Trained model testing

An already-trained model can be loaded using scripts/ to make predictions on demand. You will be prompted to enter reactant SMILES strings; the results of the forward prediction are saved as a table of products, scores, and probabilities.
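The relationship between the scores and probabilities in the output table can be illustrated with a softmax over the candidate scores. This is an assumption made for the sketch; how the script actually derives probabilities is not stated here, and `rank_candidates` is a hypothetical name.

```python
import math


def rank_candidates(scored):
    """Turn (smiles, score) pairs into (smiles, score, probability) rows.

    Probabilities are a softmax over all candidate scores (assumed for
    illustration); rows are sorted from most to least likely.
    """
    if not scored:
        return []
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(score for _, score in scored)
    weights = [math.exp(score - max_score) for _, score in scored]
    total = sum(weights)
    rows = [(smi, score, w / total) for (smi, score), w in zip(scored, weights)]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


for smiles, score, prob in rank_candidates([("CCO", 1.0), ("CC=O", 3.0), ("CC(=O)O", 2.0)]):
    print(f"{smiles}\t{score:.2f}\t{prob:.3f}")
```

Because the softmax is monotonic in the score, sorting by score and sorting by probability give the same ranking.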
