Merge pull request #383 from shihchengli/fix_atom_map
Remove atom map numbers for scaffold splits
kevingreenman committed Apr 29, 2023
2 parents 5efacfa + 65305ae commit 06b63c0
Showing 3 changed files with 6 additions and 3 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -188,7 +188,7 @@ Notes:
Our code supports several methods of splitting data into train, validation, and test sets.

* **Random.** By default, the data will be split randomly into train, validation, and test sets.
- * **Scaffold.** Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding `--split_type scaffold_balanced`.
+ * **Scaffold.** Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding `--split_type scaffold_balanced`. Note that the atom-mapped numbers for atom-mapped SMILES will be removed before computing the Bemis-Murcko scaffold.
* **k-Fold Cross-Validation.** A split type specified with `--split_type cv` intended for use when training with cross-validation. The data are split randomly into k groups of equal size, where k is the number of cross-validation folds specified with `--num_folds <k>`. Each group is used once as the test set and once as the validation set in training the k folds of the model. Alternatively, the option `--split_type cv-no-test` can be used to train without a test split.
* **Random With Repeated SMILES.** Some datasets have multiple entries with the same SMILES. To constrain splitting so the repeated SMILES are in the same split, use the argument `--split_type random_with_repeated_smiles`.
* **Separate val/test.** If you have separate data files you would like to use as the validation or test set, you can specify them with `--separate_val_path <val_path>` and/or `--separate_test_path <test_path>`. If both are provided, then the data specified by `--data_path` is used entirely as the training data. If only one separate path is provided, the `--data_path` data is split between train data and either val or test data, whichever is not provided separately.
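
To see why the note on atom map numbers matters for a scaffold split, here is a minimal pure-Python sketch. It is not chemprop's implementation: `strip_atom_map_numbers` and `group_split` are hypothetical helpers, the regular expression is only a text-level stand-in for clearing map numbers on an RDKit molecule, the cleaned string itself is used as the grouping key in place of a real Bemis-Murcko scaffold, and the greedy largest-first assignment is a simplification.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Matches a bracket atom that ends in an atom map suffix, e.g. "[CH3:1]".
_ATOM_MAP = re.compile(r"(\[[^\]:]*):\d+\]")


def strip_atom_map_numbers(smiles: str) -> str:
    """Hypothetical text-level stand-in for clearing atom map numbers."""
    return _ATOM_MAP.sub(r"\1]", smiles)


def group_split(
    keys: Sequence[str],
    sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[int], List[int], List[int]]:
    """Greedily place whole groups of equal keys into train/val/test."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)

    train_cap = sizes[0] * len(keys)
    val_cap = sizes[1] * len(keys)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    # Largest groups first, so big groups go to train while there is room.
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(val) + len(members) <= val_cap:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


# Two strings for the same ring that differ only in their atom map numbers.
mapped_a = "[c:1]1[c:2][c:3][c:4][c:5][c:6]1"
mapped_b = "[c:7]1[c:8][c:9][c:10][c:11][c:12]1"

# After stripping, both entries share one key and stay in the same split.
keys = [strip_atom_map_numbers(s) for s in (mapped_a, mapped_b)]
train, val, test = group_split(keys)
```

Without the stripping step the two strings would be treated as different keys, so entries that share a scaffold could be assigned to different splits.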
3 changes: 3 additions & 0 deletions chemprop/rdkit.py
@@ -24,5 +24,8 @@ def make_mol(s: str, keep_h: bool, add_h: bool, keep_atom_map: bool):
             if idx + 1 != map_num:
                 new_order = np.argsort(atom_map_numbers).tolist()
                 return Chem.rdmolops.RenumberAtoms(mol, new_order)
+    elif not keep_atom_map and mol is not None:
+        for atom in mol.GetAtoms():
+            atom.SetAtomMapNum(0)

     return mol
4 changes: 2 additions & 2 deletions tests/test_integration.py
@@ -817,7 +817,7 @@ def test_predict_spectra(self,
(
'chemprop_scaffold_split',
'chemprop',
- 2.18239804,
+ 2.11470476,
['--reaction', '--data_path', os.path.join(TEST_DATA_DIR, 'reaction_regression.csv'),'--split_type', 'scaffold_balanced']
),
(
@@ -1032,7 +1032,7 @@ def test_multimolecule_fingerprint_with_single_input(self,
(
'chemprop_reaction_solvent_diff_mpn_size',
'chemprop',
- 2.730379557,
+ 2.899513794,
['--reaction_solvent', '--number_of_molecules', '2',
'--data_path', os.path.join(TEST_DATA_DIR, 'reaction_solvent_regression.csv'), '--hidden_size', '500',
'--hidden_size_solvent', '250']