Merge pull request #246 from chemprop/reaction_solvent

Add reaction in a solvent option
chemprop · Feb 23, 2022 · ac6bc9f · ac6bc9f
2 parents 6e62b5d + 9482002
commit ac6bc9f
Show file tree

Hide file tree

Showing 14 changed files with 675 additions and 65 deletions.
diff --git a/README.md b/README.md
@@ -38,6 +38,7 @@ Please see [aicures.mit.edu](https://aicures.mit.edu) and the associated [data G
     * [Atomic Features](#atom-level-features)
   * [Spectra](#spectra)
   * [Reaction](#reaction)
+  * [Reaction in a solvent / Reaction and a molecule](#reaction-in-a-solvent--reaction-and-a-molecule)
   * [Pretraining](#pretraining)
   * [Missing target values](#missing-target-values)
   * [Weighted training by target and data](#weighted-training-by-target-and-data)
@@ -269,6 +270,35 @@ In absorption spectra, sometimes the phase of collection will create regions in
 As an alternative to molecule SMILES, Chemprop can also process atom-mapped reaction SMILES (see [Daylight manual](https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html) for details on reaction SMILES), which consist of three parts denoting reactants, agents and products, separated by ">". Use the option `--reaction` to enable the input of reactions, which transforms the reactants and products of each reaction to the corresponding condensed graph of reaction and changes the initial atom and bond features to hold information from both the reactant and product (option `--reaction_mode reac_prod`), or from the reactant and the difference upon reaction (option `--reaction_mode reac_diff`, default) or from the product and the difference upon reaction (option `--reaction_mode prod_diff`). In reaction mode, Chemprop thus concatenates information to each atomic and bond feature vector, for example, with option `--reaction_mode reac_prod`, each atomic feature vector holds information on the state of the atom in the reactant (similar to default Chemprop), and concatenates information on the state of the atom in the product, so that the size of the D-MPNN increases slightly. Agents are discarded. Functions incompatible with a reaction as input (scaffold splitting and feature generation) are carried out on the reactants only. If the atom-mapped reaction SMILES contain mapped hydrogens, enable explicit hydrogens via `--explicit_h`. Example of an atom-mapped reaction SMILES denoting the reaction of methanol to formaldehyde without hydrogens: `[CH3:1][OH:2]>>[CH2:1]=[O:2]` and with hydrogens: `[C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6]`. The reactions do not need to be balanced and can thus contain unmapped parts, for example leaving groups, if necessary. With reaction modes `reac_prod`, `reac_diff` and `prod_diff`, the atom and bond features of unbalanced aroma are set to zero on the side of the reaction they are not specified. Alternatively, features can be set to the same values on the reactant and product side via the modes `reac_prod_balance`, `reac_diff_balance` and `prod_diff_balance`, which corresponds to a rough balancing of the reaction.
 For further details and benchmarking, as well as a citable reference, please refer to the [article](https://doi.org/10.1021/acs.jcim.1c00975).
 
+### Reaction in a solvent / Reaction and a molecule]
+
+Chemprop can process a reaction in a solvent or a reaction and a molecule with the `--reaction_solvent` option. While this
+option is originally built to model a reaction in a solvent, this option works for any reaction and a molecule where 
+the molecule can represent anything, i.e. a solvent, a reagent, etc.
+This requires the input csv file to have two separate columns of SMILES: one column for atom-mapped reaction SMILES 
+and the other column for solvent/molecule SMILES. The reaction and solvent/molecule SMILES columns can be ordered in 
+any way (i.e. the first column can be either reaction SMILES or solvent SMILES and the second column can then be 
+solvent SMILES or reaction SMILES). However, the same column ordering as used in the training must be used for the prediction
+(i.e. if the input csv file used for model training had reaction SMILES as the first column and solvent SMILES as the 
+second columns, the csv file used for prediction should also have the first column as reaction SMILES and second column 
+as the solvent SMILES). For the information on atom-mapped reaction SMILES, please refer to [Reaction](#reaction).
+
+When using the `--reaction_solvent` option, `--number_of_molecules` must be set to 2. All options listed in the [Reaction](#reaction) 
+section such as different `--reaction_mode` and `--explicit_h` can be used for `--reaction_solvent`. Note that 
+`--explicit_h` option is only applicable to reaction SMILES. The `--adding_h` option can be used instead for 
+solvent/molecule if one wishes to add hydrogens to solvent/molecule SMILES. Chemprop allows differently sized MPNNs to be used for each 
+reaction and solvent/molecule encoding. Below are the input arguments for specifying the size and option of the two MPNNs:
+* Reaction:
+  * `--bias` Whether to add bias to linear layers.
+  * `--hidden_size` Dimensionality of hidden layers.
+  * `--depth` Number of message passing steps.
+  * `--explicit_h` Whether H are explicitly specified in input and should be kept this way. Only applicable to reaction SMILES.
+* Solvent / Molecule:
+  * `--bias_solvent` Whether to add bias to linear layers for solvent/molecule MPN.
+  * `--hidden_size_solvent` Dimensionality of hidden layers in solvent/molecule MPN.
+  * `--depth_solvent` Number of message passing steps for solvent/molecule.
+  * `--adding_h` Whether RDKit molecules will be constructed with adding the Hs to them. Applicable to any SMILES that is not reaction.
+
 ### Pretraining
 
 Pretraining can be carried out using previously trained checkpoint files to set some or all of the initial values of a model for training. Additionally, some model parameters from the previous model can be frozen in place, so that they will not be updated during training.

diff --git a/chemprop/args.py b/chemprop/args.py
@@ -311,6 +311,12 @@ class TrainArgs(CommonArgs):
     """Dimensionality of hidden layers in MPN."""
     depth: int = 3
     """Number of message passing steps."""
+    bias_solvent: bool = False
+    """Whether to add bias to linear layers for solvent MPN if :code:`reaction_solvent` is True."""
+    hidden_size_solvent: int = 300
+    """Dimensionality of hidden layers in solvent MPN if :code:`reaction_solvent` is True."""
+    depth_solvent: int = 3
+    """Number of message passing steps for solvent if :code:`reaction_solvent` is True."""
     mpn_shared: bool = False
     """Whether to use the same message passing neural network for all input molecules
     Only relevant if :code:`number_of_molecules > 1`"""
@@ -369,13 +375,19 @@ class TrainArgs(CommonArgs):
     :code:`reac_diff_balance`: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions. 
     :code:`prod_diff_balance`: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions. 
     """
+    reaction_solvent: bool = False
+    """
+    Whether to adjust the MPNN layer to take as input a reaction and a molecule, and to encode them with separate MPNNs.
+    """
     explicit_h: bool = False
     """
-    Whether H are explicitly specified in input (and should be kept this way).
+    Whether H are explicitly specified in input (and should be kept this way). This option is intended to be used
+    with the :code:`reaction` or :code:`reaction_solvent` options, and applies only to the reaction part.
     """
     adding_h: bool = False
     """
-    Whether RDKit molecules will be constructed with adding the Hs to them.
+    Whether RDKit molecules will be constructed with adding the Hs to them. This option is intended to be used
+    with Chemprop's default molecule or multi-molecule encoders, or in :code:`reaction_solvent` mode where it applies to the solvent only.
     """
 
     # Training arguments
@@ -511,6 +523,10 @@ def process_args(self) -> None:
 
         global temp_save_dir  # Prevents the temporary directory from being deleted upon function return
 
+        #Adapt the number of molecules for reaction_solvent mode
+        if self.reaction_solvent is True and self.number_of_molecules != 2:
+            raise ValueError('In reaction_solvent mode, --number_of_molecules 2 must be specified.')
+
         # Process SMILES columns
         self.smiles_columns = chemprop.data.utils.preprocess_smiles_columns(
             path=self.data_path,
@@ -525,6 +541,14 @@ def process_args(self) -> None:
                 for key, value in config.items():
                     setattr(self, key, value)
 
+        #Check whether the number of input columns is two for the reaction_solvent mode
+        if self.reaction_solvent is True and len(self.smiles_columns) != 2:
+            raise ValueError(f'In reaction_solvent mode, exactly two smiles column must be provided (one for reactions, and one for molecules)')
+
+        #Validate reaction/reaction_solvent mode
+        if self.reaction is True and self.reaction_solvent is True:
+            raise ValueError('Only reaction or reaction_solvent mode can be used, not both.')
+
         # Create temporary directory as save directory if not provided
         if self.save_dir is None:
             temp_save_dir = TemporaryDirectory()

diff --git a/chemprop/data/data.py b/chemprop/data/data.py
@@ -10,7 +10,7 @@
 from .scaler import StandardScaler
 from chemprop.features import get_features_generator
 from chemprop.features import BatchMolGraph, MolGraph
-from chemprop.features import is_explicit_h, is_reaction, is_adding_hs
+from chemprop.features import is_explicit_h, is_reaction, is_adding_hs, is_mol
 from chemprop.rdkit import make_mol
 
 # Cache of graph featurizations
@@ -99,9 +99,10 @@ def __init__(self,
         self.bond_features = bond_features
         self.overwrite_default_atom_features = overwrite_default_atom_features
         self.overwrite_default_bond_features = overwrite_default_bond_features
-        self.is_reaction = is_reaction()
-        self.is_explicit_h = is_explicit_h()
-        self.is_adding_hs = is_adding_hs()
+        self.is_mol_list = [is_mol(s) for s in smiles]
+        self.is_reaction_list = [is_reaction(x) for x in self.is_mol_list]
+        self.is_explicit_h_list = [is_explicit_h(x) for x in self.is_mol_list]
+        self.is_adding_hs_list = [is_adding_hs(x) for x in self.is_mol_list]
 
         if data_weight is not None:
             self.data_weight = data_weight
@@ -116,8 +117,8 @@ def __init__(self,
 
             for fg in self.features_generator:
                 features_generator = get_features_generator(fg)
-                for m in self.mol:
-                    if not self.is_reaction:
+                for m, reaction in zip(self.mol, self.is_reaction_list):
+                    if not reaction:
                         if m is not None and m.GetNumHeavyAtoms() > 0:
                             self.features.extend(features_generator(m))
                         # for H2
@@ -156,10 +157,9 @@ def __init__(self,
             self.atom_descriptors, self.atom_features, self.bond_features
 
     @property
-    def mol(self) -> Union[List[Chem.Mol], List[Tuple[Chem.Mol, Chem.Mol]]]:
+    def mol(self) -> List[Union[Chem.Mol, Tuple[Chem.Mol, Chem.Mol]]]:
         """Gets the corresponding list of RDKit molecules for the corresponding SMILES list."""
-        mol = make_mols(self.smiles, self.is_reaction, self.is_explicit_h, self.is_adding_hs)
-
+        mol = make_mols(self.smiles, self.is_reaction_list, self.is_explicit_h_list, self.is_adding_hs_list)
         if cache_mol():
             for s, m in zip(self.smiles, mol):
                 SMILES_TO_MOL[s] = m
@@ -740,18 +740,21 @@ def __iter__(self) -> Iterator[MoleculeDataset]:
         return super(MoleculeDataLoader, self).__iter__()
 
 
-def make_mols(smiles: List[str], reaction: bool, keep_h: bool, add_h: bool):
+def make_mols(smiles: List[str], reaction_list: List[bool], keep_h_list: List[bool], add_h_list: List[bool]):
     """
     Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of smiles.
 
     :param smiles: List of SMILES strings.
-    :param reaction: Boolean whether the SMILES strings are to be treated as a reaction.
-    :param keep_h: Boolean whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
-    :param add_h: Boolean whether to add hydrogens to the input smiles.
+    :param reaction_list: List of booleans whether the SMILES strings are to be treated as a reaction.
+    :param keep_h_list: List of booleans whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
+    :param add_h_list: List of booleasn whether to add hydrogens to the input smiles.
     :return: List of RDKit molecules or list of tuple of molecules.
     """
-    if reaction:
-        mol = [SMILES_TO_MOL[s] if s in SMILES_TO_MOL else (make_mol(s.split(">")[0], keep_h, add_h), make_mol(s.split(">")[-1], keep_h, add_h)) for s in smiles]
-    else:
-        mol = [SMILES_TO_MOL[s] if s in SMILES_TO_MOL else make_mol(s, keep_h, add_h) for s in smiles]
+    mol = []
+    for s, reaction, keep_h, add_h in zip(smiles, reaction_list, keep_h_list, add_h_list):
+        if reaction:
+            mol.append(SMILES_TO_MOL[s] if s in SMILES_TO_MOL else (make_mol(s.split(">")[0], keep_h, add_h), make_mol(s.split(">")[-1], keep_h, add_h)))
+        else:
+            mol.append(SMILES_TO_MOL[s] if s in SMILES_TO_MOL else make_mol(s, keep_h, add_h))
     return mol
+
diff --git a/chemprop/data/utils.py b/chemprop/data/utils.py
@@ -13,8 +13,7 @@
 from .data import MoleculeDatapoint, MoleculeDataset, make_mols
 from .scaffold import log_scaffold_stats, scaffold_split
 from chemprop.args import PredictArgs, TrainArgs
-from chemprop.features import load_features, load_valid_atom_or_bond_features
-
+from chemprop.features import load_features, load_valid_atom_or_bond_features, is_mol
 
 def preprocess_smiles_columns(path: str,
                               smiles_columns: Optional[Union[str, List[Optional[str]]]],
@@ -200,8 +199,17 @@ def get_invalid_smiles_from_list(smiles: List[List[str]], reaction: bool = False
     """
     invalid_smiles = []
 
+    # If the first SMILES in the column is a molecule, the remaining SMILES in the same column should all be a molecule.
+    # Similarly, if the first SMILES in the column is a reaction, the remaining SMILES in the same column should all
+    # correspond to reaction. Therefore, get `is_mol_list` only using the first element in smiles.
+    is_mol_list = [is_mol(s) for s in smiles[0]]
+    is_reaction_list = [True if not x and reaction else False for x in is_mol_list]
+    is_explicit_h_list = [False for x in is_mol_list]  # set this to False as it is not needed for invalid SMILES check
+    is_adding_hs_list = [False for x in is_mol_list]  # set this to False as it is not needed for invalid SMILES check
+
     for mol_smiles in smiles:
-        mols = make_mols(smiles=mol_smiles, reaction=reaction, keep_h=False, add_h=False)
+        mols = make_mols(smiles=mol_smiles, reaction_list=is_reaction_list, keep_h_list=is_explicit_h_list,
+                         add_h_list=is_adding_hs_list)
         if any(s == '' for s in mol_smiles) or \
            any(m is None for m in mols) or \
            any(m.GetNumHeavyAtoms() == 0 for m in mols if not isinstance(m, tuple)) or \

diff --git a/chemprop/features/__init__.py b/chemprop/features/__init__.py
@@ -3,7 +3,7 @@
     rdkit_2d_normalized_features_generator, register_features_generator
 from .featurization import atom_features, bond_features, BatchMolGraph, get_atom_fdim, get_bond_fdim, mol2graph, \
     MolGraph, onek_encoding_unk, set_extra_atom_fdim, set_extra_bond_fdim, set_reaction, set_explicit_h, \
-    set_adding_hs, is_reaction, is_explicit_h, is_adding_hs, reset_featurization_parameters
+    set_adding_hs, is_reaction, is_explicit_h, is_adding_hs, is_mol, reset_featurization_parameters
 from .utils import load_features, save_features, load_valid_atom_or_bond_features
 
 __all__ = [
@@ -26,6 +26,7 @@
     'is_reaction',
     'is_explicit_h',
     'is_adding_hs',
+    'is_mol',
     'mol2graph',
     'MolGraph',
     'onek_encoding_unk',