Merge pull request #383 from shihchengli/fix_atom_map

Remove atom map numbers for scaffold splits
chemprop · Apr 29, 2023 · 06b63c0
2 parents 5efacfa + 65305ae

Showing 3 changed files with 6 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -188,7 +188,7 @@ Notes:
 Our code supports several methods of splitting data into train, validation, and test sets.
 
 * **Random.** By default, the data will be split randomly into train, validation, and test sets.
-* **Scaffold.** Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding `--split_type scaffold_balanced`.
+* **Scaffold.** Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding `--split_type scaffold_balanced`. Note that the atom-mapped numbers for atom-mapped SMILES will be removed before computing the Bemis-Murcko scaffold.
 * **k-Fold Cross-Validation.** A split type specified with `--split_type cv` intended for use when training with cross-validation. The data are split randomly into k groups of equal size, where k is the number of cross-validation folds specified with `--num_folds <k>`. Each group is used once as the test set and once as the validation set in training the k folds of the model. Alternatively, the option `--split_type cv-no-test` can be used to train without a test splits.
 * **Random With Repeated SMILES.** Some datasets have multiple entries with the same SMILES. To constrain splitting so the repeated SMILES are in the same split, use the argument `--split_type random_with_repeated_smiles`.
 * **Separate val/test.** If you have separate data files you would like to use as the validation or test set, you can specify them with `--separate_val_path <val_path>` and/or `--separate_test_path <test_path>`. If both are provided, then the data specified by `--data_path` is used entirely as the training data. If only one separate path is provided, the `--data_path` data is split between train data and either val or test data, whichever is not provided separately.

diff --git a/chemprop/rdkit.py b/chemprop/rdkit.py
@@ -24,5 +24,8 @@ def make_mol(s: str, keep_h: bool, add_h: bool, keep_atom_map: bool):
             if idx + 1 != map_num:
                 new_order = np.argsort(atom_map_numbers).tolist()
                 return Chem.rdmolops.RenumberAtoms(mol, new_order)
+    elif not keep_atom_map and mol is not None:
+        for atom in mol.GetAtoms():
+            atom.SetAtomMapNum(0)
 
     return mol
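
Review note: the new `elif` branch matters for scaffold splits because RDKit generally writes atom map numbers into the SMILES it generates, so two copies of the same molecule with different atom mappings can end up with different scaffold strings. The sketch below illustrates this with RDKit directly. It is not chemprop's code: `scaffold_smiles`, `MAPPED_A`, and `MAPPED_B` are hypothetical names introduced for illustration, and it assumes the scaffold is computed with `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_smiles(smiles: str, keep_atom_map: bool) -> str:
    """Return the Bemis-Murcko scaffold SMILES for `smiles`.

    Hypothetical helper for illustration only. When `keep_atom_map` is
    False, atom map numbers are reset to 0 before the scaffold is
    computed, mirroring the loop added to `make_mol` in this diff.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    if not keep_atom_map:
        for atom in mol.GetAtoms():
            # A map number of 0 means "no atom map" in RDKit.
            atom.SetAtomMapNum(0)

    return MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)


# The same molecule (toluene) written with two different atom mappings.
MAPPED_A = "[CH3:1][c:2]1[cH:3][cH:4][cH:5][cH:6][cH:7]1"
MAPPED_B = "[CH3:11][c:12]1[cH:13][cH:14][cH:15][cH:16][cH:17]1"

# With the map numbers cleared, both entries yield the same scaffold
# string, so a scaffold-based split places them in the same group.
same_scaffold = (
    scaffold_smiles(MAPPED_A, keep_atom_map=False)
    == scaffold_smiles(MAPPED_B, keep_atom_map=False)
)
```

Calling `scaffold_smiles(MAPPED_A, keep_atom_map=True)` instead keeps the map numbers in the scaffold string, which is the situation this change is meant to avoid for scaffold splits and is consistent with the updated expected value in the scaffold-split integration test.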
diff --git a/tests/test_integration.py b/tests/test_integration.py
@@ -817,7 +817,7 @@ def test_predict_spectra(self,
         (
                 'chemprop_scaffold_split',
                 'chemprop',
-                2.18239804,
+                2.11470476,
                 ['--reaction', '--data_path', os.path.join(TEST_DATA_DIR, 'reaction_regression.csv'),'--split_type', 'scaffold_balanced']
         ),
         (
@@ -1032,7 +1032,7 @@ def test_multimolecule_fingerprint_with_single_input(self,
         (
                 'chemprop_reaction_solvent_diff_mpn_size',
                 'chemprop',
-                2.730379557,
+                2.899513794,
                 ['--reaction_solvent', '--number_of_molecules', '2',
                  '--data_path', os.path.join(TEST_DATA_DIR, 'reaction_solvent_regression.csv'), '--hidden_size', '500',
                  '--hidden_size_solvent', '250']