Skip to content

Commit

Permalink
Merge pull request #246 from chemprop/reaction_solvent
Browse files Browse the repository at this point in the history
Add reaction in a solvent option
  • Loading branch information
hesther committed Feb 23, 2022
2 parents 6e62b5d + 9482002 commit ac6bc9f
Show file tree
Hide file tree
Showing 14 changed files with 675 additions and 65 deletions.
30 changes: 30 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ Please see [aicures.mit.edu](https://aicures.mit.edu) and the associated [data G
* [Atomic Features](#atom-level-features)
* [Spectra](#spectra)
* [Reaction](#reaction)
* [Reaction in a solvent / Reaction and a molecule](#reaction-in-a-solvent--reaction-and-a-molecule)
* [Pretraining](#pretraining)
* [Missing target values](#missing-target-values)
* [Weighted training by target and data](#weighted-training-by-target-and-data)
Expand Down Expand Up @@ -269,6 +270,35 @@ In absorption spectra, sometimes the phase of collection will create regions in
As an alternative to molecule SMILES, Chemprop can also process atom-mapped reaction SMILES (see [Daylight manual](https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html) for details on reaction SMILES), which consist of three parts denoting reactants, agents and products, separated by ">". Use the option `--reaction` to enable the input of reactions, which transforms the reactants and products of each reaction to the corresponding condensed graph of reaction and changes the initial atom and bond features to hold information from both the reactant and product (option `--reaction_mode reac_prod`), or from the reactant and the difference upon reaction (option `--reaction_mode reac_diff`, default) or from the product and the difference upon reaction (option `--reaction_mode prod_diff`). In reaction mode, Chemprop thus concatenates information to each atomic and bond feature vector, for example, with option `--reaction_mode reac_prod`, each atomic feature vector holds information on the state of the atom in the reactant (similar to default Chemprop), and concatenates information on the state of the atom in the product, so that the size of the D-MPNN increases slightly. Agents are discarded. Functions incompatible with a reaction as input (scaffold splitting and feature generation) are carried out on the reactants only. If the atom-mapped reaction SMILES contain mapped hydrogens, enable explicit hydrogens via `--explicit_h`. Example of an atom-mapped reaction SMILES denoting the reaction of methanol to formaldehyde without hydrogens: `[CH3:1][OH:2]>>[CH2:1]=[O:2]` and with hydrogens: `[C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6]`. The reactions do not need to be balanced and can thus contain unmapped parts, for example leaving groups, if necessary. With reaction modes `reac_prod`, `reac_diff` and `prod_diff`, the atom and bond features of unbalanced aroma are set to zero on the side of the reaction they are not specified. Alternatively, features can be set to the same values on the reactant and product side via the modes `reac_prod_balance`, `reac_diff_balance` and `prod_diff_balance`, which corresponds to a rough balancing of the reaction.
For further details and benchmarking, as well as a citable reference, please refer to the [article](https://doi.org/10.1021/acs.jcim.1c00975).

### Reaction in a solvent / Reaction and a molecule]

Chemprop can process a reaction in a solvent or a reaction and a molecule with the `--reaction_solvent` option. While this
option is originally built to model a reaction in a solvent, this option works for any reaction and a molecule where
the molecule can represent anything, i.e. a solvent, a reagent, etc.
This requires the input csv file to have two separate columns of SMILES: one column for atom-mapped reaction SMILES
and the other column for solvent/molecule SMILES. The reaction and solvent/molecule SMILES columns can be ordered in
any way (i.e. the first column can be either reaction SMILES or solvent SMILES and the second column can then be
solvent SMILES or reaction SMILES). However, the same column ordering as used in the training must be used for the prediction
(i.e. if the input csv file used for model training had reaction SMILES as the first column and solvent SMILES as the
second columns, the csv file used for prediction should also have the first column as reaction SMILES and second column
as the solvent SMILES). For the information on atom-mapped reaction SMILES, please refer to [Reaction](#reaction).

When using the `--reaction_solvent` option, `--number_of_molecules` must be set to 2. All options listed in the [Reaction](#reaction)
section such as different `--reaction_mode` and `--explicit_h` can be used for `--reaction_solvent`. Note that
`--explicit_h` option is only applicable to reaction SMILES. The `--adding_h` option can be used instead for
solvent/molecule if one wishes to add hydrogens to solvent/molecule SMILES. Chemprop allows differently sized MPNNs to be used for each
reaction and solvent/molecule encoding. Below are the input arguments for specifying the size and option of the two MPNNs:
* Reaction:
* `--bias` Whether to add bias to linear layers.
* `--hidden_size` Dimensionality of hidden layers.
* `--depth` Number of message passing steps.
* `--explicit_h` Whether H are explicitly specified in input and should be kept this way. Only applicable to reaction SMILES.
* Solvent / Molecule:
* `--bias_solvent` Whether to add bias to linear layers for solvent/molecule MPN.
* `--hidden_size_solvent` Dimensionality of hidden layers in solvent/molecule MPN.
* `--depth_solvent` Number of message passing steps for solvent/molecule.
* `--adding_h` Whether RDKit molecules will be constructed with adding the Hs to them. Applicable to any SMILES that is not reaction.

### Pretraining

Pretraining can be carried out using previously trained checkpoint files to set some or all of the initial values of a model for training. Additionally, some model parameters from the previous model can be frozen in place, so that they will not be updated during training.
Expand Down
28 changes: 26 additions & 2 deletions chemprop/args.py
Original file line number Diff line number Diff line change
Expand Up @@ -311,6 +311,12 @@ class TrainArgs(CommonArgs):
"""Dimensionality of hidden layers in MPN."""
depth: int = 3
"""Number of message passing steps."""
bias_solvent: bool = False
"""Whether to add bias to linear layers for solvent MPN if :code:`reaction_solvent` is True."""
hidden_size_solvent: int = 300
"""Dimensionality of hidden layers in solvent MPN if :code:`reaction_solvent` is True."""
depth_solvent: int = 3
"""Number of message passing steps for solvent if :code:`reaction_solvent` is True."""
mpn_shared: bool = False
"""Whether to use the same message passing neural network for all input molecules
Only relevant if :code:`number_of_molecules > 1`"""
Expand Down Expand Up @@ -369,13 +375,19 @@ class TrainArgs(CommonArgs):
:code:`reac_diff_balance`: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
:code:`prod_diff_balance`: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
"""
reaction_solvent: bool = False
"""
Whether to adjust the MPNN layer to take as input a reaction and a molecule, and to encode them with separate MPNNs.
"""
explicit_h: bool = False
"""
Whether H are explicitly specified in input (and should be kept this way).
Whether H are explicitly specified in input (and should be kept this way). This option is intended to be used
with the :code:`reaction` or :code:`reaction_solvent` options, and applies only to the reaction part.
"""
adding_h: bool = False
"""
Whether RDKit molecules will be constructed with adding the Hs to them.
Whether RDKit molecules will be constructed with adding the Hs to them. This option is intended to be used
with Chemprop's default molecule or multi-molecule encoders, or in :code:`reaction_solvent` mode where it applies to the solvent only.
"""

# Training arguments
Expand Down Expand Up @@ -511,6 +523,10 @@ def process_args(self) -> None:

global temp_save_dir # Prevents the temporary directory from being deleted upon function return

#Adapt the number of molecules for reaction_solvent mode
if self.reaction_solvent is True and self.number_of_molecules != 2:
raise ValueError('In reaction_solvent mode, --number_of_molecules 2 must be specified.')

# Process SMILES columns
self.smiles_columns = chemprop.data.utils.preprocess_smiles_columns(
path=self.data_path,
Expand All @@ -525,6 +541,14 @@ def process_args(self) -> None:
for key, value in config.items():
setattr(self, key, value)

#Check whether the number of input columns is two for the reaction_solvent mode
if self.reaction_solvent is True and len(self.smiles_columns) != 2:
raise ValueError(f'In reaction_solvent mode, exactly two smiles column must be provided (one for reactions, and one for molecules)')

#Validate reaction/reaction_solvent mode
if self.reaction is True and self.reaction_solvent is True:
raise ValueError('Only reaction or reaction_solvent mode can be used, not both.')

# Create temporary directory as save directory if not provided
if self.save_dir is None:
temp_save_dir = TemporaryDirectory()
Expand Down
37 changes: 20 additions & 17 deletions chemprop/data/data.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from .scaler import StandardScaler
from chemprop.features import get_features_generator
from chemprop.features import BatchMolGraph, MolGraph
from chemprop.features import is_explicit_h, is_reaction, is_adding_hs
from chemprop.features import is_explicit_h, is_reaction, is_adding_hs, is_mol
from chemprop.rdkit import make_mol

# Cache of graph featurizations
Expand Down Expand Up @@ -99,9 +99,10 @@ def __init__(self,
self.bond_features = bond_features
self.overwrite_default_atom_features = overwrite_default_atom_features
self.overwrite_default_bond_features = overwrite_default_bond_features
self.is_reaction = is_reaction()
self.is_explicit_h = is_explicit_h()
self.is_adding_hs = is_adding_hs()
self.is_mol_list = [is_mol(s) for s in smiles]
self.is_reaction_list = [is_reaction(x) for x in self.is_mol_list]
self.is_explicit_h_list = [is_explicit_h(x) for x in self.is_mol_list]
self.is_adding_hs_list = [is_adding_hs(x) for x in self.is_mol_list]

if data_weight is not None:
self.data_weight = data_weight
Expand All @@ -116,8 +117,8 @@ def __init__(self,

for fg in self.features_generator:
features_generator = get_features_generator(fg)
for m in self.mol:
if not self.is_reaction:
for m, reaction in zip(self.mol, self.is_reaction_list):
if not reaction:
if m is not None and m.GetNumHeavyAtoms() > 0:
self.features.extend(features_generator(m))
# for H2
Expand Down Expand Up @@ -156,10 +157,9 @@ def __init__(self,
self.atom_descriptors, self.atom_features, self.bond_features

@property
def mol(self) -> Union[List[Chem.Mol], List[Tuple[Chem.Mol, Chem.Mol]]]:
def mol(self) -> List[Union[Chem.Mol, Tuple[Chem.Mol, Chem.Mol]]]:
"""Gets the corresponding list of RDKit molecules for the corresponding SMILES list."""
mol = make_mols(self.smiles, self.is_reaction, self.is_explicit_h, self.is_adding_hs)

mol = make_mols(self.smiles, self.is_reaction_list, self.is_explicit_h_list, self.is_adding_hs_list)
if cache_mol():
for s, m in zip(self.smiles, mol):
SMILES_TO_MOL[s] = m
Expand Down Expand Up @@ -740,18 +740,21 @@ def __iter__(self) -> Iterator[MoleculeDataset]:
return super(MoleculeDataLoader, self).__iter__()


def make_mols(smiles: List[str], reaction: bool, keep_h: bool, add_h: bool):
def make_mols(smiles: List[str], reaction_list: List[bool], keep_h_list: List[bool], add_h_list: List[bool]):
"""
Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of smiles.
:param smiles: List of SMILES strings.
:param reaction: Boolean whether the SMILES strings are to be treated as a reaction.
:param keep_h: Boolean whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
:param add_h: Boolean whether to add hydrogens to the input smiles.
:param reaction_list: List of booleans whether the SMILES strings are to be treated as a reaction.
:param keep_h_list: List of booleans whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
:param add_h_list: List of booleasn whether to add hydrogens to the input smiles.
:return: List of RDKit molecules or list of tuple of molecules.
"""
if reaction:
mol = [SMILES_TO_MOL[s] if s in SMILES_TO_MOL else (make_mol(s.split(">")[0], keep_h, add_h), make_mol(s.split(">")[-1], keep_h, add_h)) for s in smiles]
else:
mol = [SMILES_TO_MOL[s] if s in SMILES_TO_MOL else make_mol(s, keep_h, add_h) for s in smiles]
mol = []
for s, reaction, keep_h, add_h in zip(smiles, reaction_list, keep_h_list, add_h_list):
if reaction:
mol.append(SMILES_TO_MOL[s] if s in SMILES_TO_MOL else (make_mol(s.split(">")[0], keep_h, add_h), make_mol(s.split(">")[-1], keep_h, add_h)))
else:
mol.append(SMILES_TO_MOL[s] if s in SMILES_TO_MOL else make_mol(s, keep_h, add_h))
return mol

14 changes: 11 additions & 3 deletions chemprop/data/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,7 @@
from .data import MoleculeDatapoint, MoleculeDataset, make_mols
from .scaffold import log_scaffold_stats, scaffold_split
from chemprop.args import PredictArgs, TrainArgs
from chemprop.features import load_features, load_valid_atom_or_bond_features

from chemprop.features import load_features, load_valid_atom_or_bond_features, is_mol

def preprocess_smiles_columns(path: str,
smiles_columns: Optional[Union[str, List[Optional[str]]]],
Expand Down Expand Up @@ -200,8 +199,17 @@ def get_invalid_smiles_from_list(smiles: List[List[str]], reaction: bool = False
"""
invalid_smiles = []

# If the first SMILES in the column is a molecule, the remaining SMILES in the same column should all be a molecule.
# Similarly, if the first SMILES in the column is a reaction, the remaining SMILES in the same column should all
# correspond to reaction. Therefore, get `is_mol_list` only using the first element in smiles.
is_mol_list = [is_mol(s) for s in smiles[0]]
is_reaction_list = [True if not x and reaction else False for x in is_mol_list]
is_explicit_h_list = [False for x in is_mol_list] # set this to False as it is not needed for invalid SMILES check
is_adding_hs_list = [False for x in is_mol_list] # set this to False as it is not needed for invalid SMILES check

for mol_smiles in smiles:
mols = make_mols(smiles=mol_smiles, reaction=reaction, keep_h=False, add_h=False)
mols = make_mols(smiles=mol_smiles, reaction_list=is_reaction_list, keep_h_list=is_explicit_h_list,
add_h_list=is_adding_hs_list)
if any(s == '' for s in mol_smiles) or \
any(m is None for m in mols) or \
any(m.GetNumHeavyAtoms() == 0 for m in mols if not isinstance(m, tuple)) or \
Expand Down
3 changes: 2 additions & 1 deletion chemprop/features/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
rdkit_2d_normalized_features_generator, register_features_generator
from .featurization import atom_features, bond_features, BatchMolGraph, get_atom_fdim, get_bond_fdim, mol2graph, \
MolGraph, onek_encoding_unk, set_extra_atom_fdim, set_extra_bond_fdim, set_reaction, set_explicit_h, \
set_adding_hs, is_reaction, is_explicit_h, is_adding_hs, reset_featurization_parameters
set_adding_hs, is_reaction, is_explicit_h, is_adding_hs, is_mol, reset_featurization_parameters
from .utils import load_features, save_features, load_valid_atom_or_bond_features

__all__ = [
Expand All @@ -26,6 +26,7 @@
'is_reaction',
'is_explicit_h',
'is_adding_hs',
'is_mol',
'mol2graph',
'MolGraph',
'onek_encoding_unk',
Expand Down

0 comments on commit ac6bc9f

Please sign in to comment.