Atomic/bond targets prediction (#280)

* Multitask constraint for atomic/bond properties prediction
* Uncertainty functions for atomic/bond properties prediction
* Use `--constraints_path` to do constraints
* Bugfix arguments in get_data
* Bugfix applying constraints in different loss_function
* Update README.md
* Delete the comments
* Bugfix repairing MPNN model
* Modify get_header function
* Bugfix adding is_atom_bond_targets in UncertaintyEvaluator
* Bugfix repairing original_scaling referenced before assignment
* Remove a redundant function in test_integration.py
* Remove unnecessary import and raise
* Fixtypo seperate
* Fix UserWarning in torch.nn.Softmax
  UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
* Fix FutureWarning in `_check_reg_targets` in `sklearn.metrics._regression.py`
  FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
  This is coming from function calls to the loss function in chemprop's metrics.py, or more specifically, in `_check_reg_targets` in `sklearn.metrics._regression.py`. The targets need to be sanitized to floats as they're read in before they get to this step in chemprop.
* Update README.md
* Check whether `--adding_h` is used
* Increase csv field limit
  For atomic/bond properties dataset, if .csv file is used, it might happen _csv.Error: field larger than field limit (131072). In order to solve this, the field limit is set to sys.maxsize.
* Add atomic and bond targets data for testing
* Bugfix constraints() in MoleculeDataset
* Allow atom-level targets with scaffold_balanced
* Add testing for atomic and bond targets
* Update README.md
  Co-authored-by: Charles McGill <44245643+cjmcgill@users.noreply.github.com>
* Let `--adding_h` be an optional choice
* Minor changes in args.py
* Avoid multiple copies of a task
* Update README.md
  Co-authored-by: Charles McGill <44245643+cjmcgill@users.noreply.github.com>
* Add get_mixed_task_names() in data/utils.py
* Update test file
* Add `is_atom_bond_targets` as a dataset property
* Test get_header function with PICKLE file
* Change the protocol of test .pkl file from 5 to 4
  `ValueError: unsupported pickle protocol: 5` would happen for panadas users with lower python version.
* FIxtypo atoml
  Co-authored-by: Charles McGill <44245643+cjmcgill@users.noreply.github.com>
* Upload a new test .pkl file generated by older pandas version
  `AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals'` might appear when using older pandas version to open .pkl file saved by newer pandas version.
* Fix number_of_atoms() and number_of_bonds() functions
* Raise NotImplementedError for unsupported extensions
* Add get_constraints() in data/utils.py
* Replace some statements with `is_atom_bond_targets`
* Remove separate device setting in MoleculeModel
* Add a relative output size multiplier
* Remove unnecessary attributes in MultiReadout
* Remove the behavior of using pickle file as input or output
* Remove the test of get_header() with .pkl file
* Add AtomBondScaler in data/scaler.py
* Use json.loads() to load data
* Let the bond hidden be order invariant
* Remove an unnecessary attribute in UncertaintyCalibrator
* Bugfix read is_atom_bond_targets from args
* Bigfix
* Shared FFN Weights
* Make ffn weight sharing optional
* Fix consistency when num_layers=1
* Update chemprop/models/ffn.py
  Co-authored-by: Shih-Cheng Li <scli@mit.edu>
* Update chemprop/models/ffn.py
  Co-authored-by: Shih-Cheng Li <scli@mit.edu>
* Add description to readme for the atom bond shared ffn argument
* Let the number of layers in weight FFN be controllable
* Add option of adding bond types to output of bond targets
* Bugfix set bond_types_batch as None
* Fixtypo and remove eval()
* Use bond descriptors as descriptor
* Bugfix missing brackets
* Bugfix for multitarget mve_weighting calibration (#291)
* Bugfix for multitarget mve_weighting calibration
* Remove troubleshooting prints
* Bond oder invariant
* Update the test score for mve_weighting
* Support mve_weighting calibration in atomic/bond level
* Support transfer learning for atomic/bond targets
* Backward compatibility for parameter names
* Bond order invariant in bond constraints model
* Fixing typo in load_frzn_model (#294)
* Fixbug for mve_weighting in atom level
* Bugfix mve_weighting with ence
* transfer missing variables to new shapes
* define num_tasks in every evaluator
* Update test results
* transpose masks
* transfer None to np.nan
* replace 1 with True in mask
* Replace missing values of None with null
* Update README.md
* Bugfix wrong break statement in get_mixed_task_names
* Support atom-mapped SMILES for atomic property modeling
* Refactor building FFN
* Replace AttrProxy with ModuleList
* Apply suggestions from code review
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Remove import of AttrProxy
* Save results as list instead of np.ndarray
* Apply suggestions from code review
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Import torch.nn.functional module in ffn.py
* raise error when using unrecognized ffn_type
* Add return type of build_ffn()
* Fix typo and remove old legacy
* Replace narrow with tensor indexing
* Make notation in the code closer to the paper
* Fixbug wrong constraints_batch in prediction
* Use camel case
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Replace ValueError with RuntimeError
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Change the notation of formula
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* use broadcasting instead of indexing and flattening
* Add `FFN` for unconstrained predictions
* Use dropout probability as input instead of nn.Dropout
* Apply suggestions from code review
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Make `ffn_base` be optional argument
* Load activation function by `get_activation_function()`
* Passing dict to load ffn params
* Apply suggestions from code review
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Define `FFNAtten` as a subclass of `FFN`
* Avoid defining functions in `forward()`
* Rename the `dropout_layer` to `dropout`
* Check smiles_columns by str
* Use torch.torch to define PyTorch tensors
* Rename the function names
* BigFix for smiles_columns as None
* Replace torch.float32 with torch.float
  Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
* Apply CodeQL suggestion
* Delete comments
* Replace `or` with explicit `if else`
* Simplify the encodings concatenation
* Black format moddel.py
* Apply CodeQL suggestion
* Support atom-mapped SMILES
* Bugfix for considering atom-mapped during using get_mixed_task_names()
* Bugfix sum of a list of lists
  `n_atoms` is a list of lists. Using `sum(n_atoms)` would raise `TypeError: unsupported operand type(s) for +: 'int' and 'list'` error. To avoid it, `np.array(n_atoms).sum()` is used instead.
* Remove redundant code
* Fix NumPy 1.24 compatibility issue
* Update README.md
* Fix typo and reorder two attributes
* Revising the parameter description for `raw_constraints` and `constraints`
* Refactor multitask_utils.py

---------

Co-authored-by: Charles McGill <44245643+cjmcgill@users.noreply.github.com>
Co-authored-by: Chas <charlesjmcgill@gmail.com>
Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>
chemprop · Feb 5, 2023 · 2677112 · 2677112
1 parent a952312
commit 2677112

Showing 32 changed files with 4,249 additions and 782 deletions.
diff --git a/README.md b/README.md
@@ -41,6 +41,7 @@ Please see [aicures.mit.edu](https://aicures.mit.edu) and the associated [data G
   * [Spectra](#spectra)
   * [Reaction](#reaction)
   * [Reaction in a solvent / Reaction and a molecule](#reaction-in-a-solvent--reaction-and-a-molecule)
+  * [Atomic and bond properties prediction](#atomic-and-bond-properties-prediction)
   * [Pretraining](#pretraining)
   * [Missing target values](#missing-target-values)
   * [Weighted training by target and data](#weighted-training-by-target-and-data)
@@ -262,11 +263,11 @@ Similar to the molecule-level features, the atom-level descriptors and features
 
 #### Bond-Level Features
 
-Bond-level features can be provided in the same format as the atom-level features, using the option `--bond_features_path /path/to/features`. The order of the features for each molecule must match the bond ordering in the RDKit molecule object. 
+Bond-level features can be provided in the same format as the atom-level features, using the option `--bond_descriptors_path /path/to/features`. The order of the features for each molecule must match the bond ordering in the RDKit molecule object.
 
-The bond-level features are concatenated with the bond feature vectors before the D-MPNN, such that they are used during message-passing. Alternatively, the user can overwrite the default bond features with the custom features using the option `--overwrite_default_bond_features`. 
+Users must choose how the bond descriptors are used. The command line option `--bond_descriptors feature` concatenates the bond-level features with the bond feature vectors before the D-MPNN, such that they are used during message-passing. For atomic/bond properties prediction, the command line option `--bond_descriptors descriptor` concatenates the new features to the embedded bond features after the D-MPNN with an additional linear layer. Alternatively, the user can overwrite the default bond features with the custom features using the option `--overwrite_default_bond_features`.
 
-Similar to molecule-, and atom-level features, the bond-level features are scaled by default. This can be disabled with the option `--no_bond_features_scaling`.
+Similar to molecule-level and atom-level features, the bond-level descriptors and features are scaled by default. This can be disabled with the option `--no_bond_descriptor_scaling`.
 
 ### Spectra
 
@@ -281,7 +282,7 @@ In absorption spectra, sometimes the phase of collection will create regions in
 As an alternative to molecule SMILES, Chemprop can also process atom-mapped reaction SMILES (see [Daylight manual](https://www.daylight.com/meetings/summerschool01/course/basics/smirks.html) for details on reaction SMILES), which consist of three parts denoting reactants, agents and products, separated by ">". Use the option `--reaction` to enable the input of reactions, which transforms the reactants and products of each reaction to the corresponding condensed graph of reaction and changes the initial atom and bond features to hold information from both the reactant and product (option `--reaction_mode reac_prod`), or from the reactant and the difference upon reaction (option `--reaction_mode reac_diff`, default) or from the product and the difference upon reaction (option `--reaction_mode prod_diff`). In reaction mode, Chemprop thus concatenates information to each atomic and bond feature vector, for example, with option `--reaction_mode reac_prod`, each atomic feature vector holds information on the state of the atom in the reactant (similar to default Chemprop), and concatenates information on the state of the atom in the product, so that the size of the D-MPNN increases slightly. Agents are discarded. Functions incompatible with a reaction as input (scaffold splitting and feature generation) are carried out on the reactants only. If the atom-mapped reaction SMILES contain mapped hydrogens, enable explicit hydrogens via `--explicit_h`. Example of an atom-mapped reaction SMILES denoting the reaction of methanol to formaldehyde without hydrogens: `[CH3:1][OH:2]>>[CH2:1]=[O:2]` and with hydrogens: `[C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6]`. The reactions do not need to be balanced and can thus contain unmapped parts, for example leaving groups, if necessary. 
 With reaction modes `reac_prod`, `reac_diff` and `prod_diff`, the atom and bond features of unbalanced atoms are set to zero on the side of the reaction where they are not specified. Alternatively, features can be set to the same values on the reactant and product side via the modes `reac_prod_balance`, `reac_diff_balance` and `prod_diff_balance`, which corresponds to a rough balancing of the reaction.
 For further details and benchmarking, as well as a citable reference, please refer to the [article](https://doi.org/10.1021/acs.jcim.1c00975).
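
As an illustration of the reaction SMILES layout described above, the following minimal sketch splits a reaction string into reactants, agents and products and compares the atom-map numbers on both sides. It uses plain string handling only and is not how Chemprop parses reactions internally; the helper names `split_reaction_smiles` and `atom_map_numbers` are hypothetical.

```python
import re


def split_reaction_smiles(rxn: str) -> tuple:
    """Split 'reactants>agents>products' into its three parts."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")
    return tuple(parts)


def atom_map_numbers(smiles: str) -> set:
    """Collect atom-map numbers, e.g. the 1 in [CH3:1]."""
    return {int(n) for n in re.findall(r":(\d+)\]", smiles)}


# Methanol to formaldehyde, without explicit hydrogens (example from the text).
reactants, agents, products = split_reaction_smiles("[CH3:1][OH:2]>>[CH2:1]=[O:2]")

# Map numbers that appear on only one side indicate unbalanced (unmapped) parts.
unbalanced = atom_map_numbers(reactants) ^ atom_map_numbers(products)
print(reactants, repr(agents), products, unbalanced)
```

An empty `unbalanced` set means every mapped atom appears on both sides; a non-empty set would correspond to the unbalanced atoms that the `*_balance` modes treat differently.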
 
-### Reaction in a solvent / Reaction and a molecule]
+### Reaction in a solvent / Reaction and a molecule
 
 Chemprop can process a reaction in a solvent or a reaction and a molecule with the `--reaction_solvent` option. While this
 option was originally built to model a reaction in a solvent, it works for any reaction and a molecule where 
@@ -310,6 +311,22 @@ reaction and solvent/molecule encoding. Below are the input arguments for specif
   * `--depth_solvent` Number of message passing steps for solvent/molecule.
   * `--adding_h` Whether RDKit molecules are constructed with hydrogens added. Applicable to any SMILES that is not a reaction.
 
+### Atomic and bond properties prediction
+
+Chemprop can train multitask constrained message passing neural networks for atomic/bond properties prediction, as described in this [paper](https://chemrxiv.org/articles/preprint/Regio-Selectivity_Prediction_with_a_Machine-Learned_Reaction_Representation_and_On-the-Fly_Quantum_Mechanical_Descriptors/12907316). The model can train on any number of atomic/bond properties simultaneously. In the original work, the total loss was a weighted sum of the individual losses, and the weights had to be specified for the regression task. In this repository, these weights are accounted for automatically by standardizing all training targets. To train a model, provide training data containing molecules (as SMILES strings) with known atomic/bond target values and pass the `--is_atom_bond_targets` flag. The input is a CSV file. For example:
+```
+                              smiles                                  hirshfeld_charges  ...                                 bond_length_matrix                                  bond_index_matrix
+0     CNC(=S)N/N=C/c1c(O)ccc2ccccc12  [-0.026644, -0.075508, 0.096217, -0.287798, -0...  ...  [[0.0, 1.4372890960937539, 2.4525543850909814,...  [[0.0, 0.9595, 0.0158, 0.0162, 0.0103, 0.0008,...
+1      O=C(NCCn1cccc1)c1cccc2ccccc12  [-0.292411, 0.170263, -0.085754, 0.002736, 0.0...  ...  [[0.0, 1.2158509801073485, 2.2520730233154076,...  [[0.0, 1.6334, 0.1799, 0.0086, 0.0068, 0.0002,...
+2  C=C(C)[C@H]1C[C@@H]2OO[C@H]1C=C2C  [-0.101749, 0.012339, -0.07947, -0.020027, -0....  ...  [[0.0, 1.3223632546838255, 2.468055985361353, ...  [[0.0, 1.9083, 0.0179, 0.016, 0.0236, 0.001, 0...
+3                     OCCCc1cc[nH]n1  [-0.268379, 0.027614, -0.050745, -0.045047, 0....  ...  [[0.0, 1.4018301850170725, 2.4667588956616737,...  [[0.0, 0.9446, 0.0311, 0.002, 0.005, 0.0007, 0...
+4      CC(=N)NCc1cccc(CNCc2ccncc2)c1  [-0.083162, 0.114954, -0.274544, -0.100369, 0....  ...  [[0.0, 1.5137126697008916, 2.4882198180715465,...  [[0.0, 1.0036, 0.0437, 0.0108, 0.0134, 0.0004,......
+```
+where atomic properties (e.g. hirshfeld_charges) must be 1D lists in the same order as the atoms in the SMILES string, and bond properties (e.g. bond_length_matrix) can be either 2D lists of shape (number_of_atoms × number_of_atoms) or 1D lists in the same order as the bonds in the SMILES string. The `--keeping_atom_map` option can be used if atom-mapped SMILES are provided. The `--adding_h` option can be used if hydrogens are included in the atom targets and bonds to hydrogens are included in the bond targets.
+The model supports multitask constraints on different atomic/bond properties via the argument `--constraints_path`, which takes a `.csv` file. The constraints must be in the same order as the SMILES strings in your data file. The `.csv` file must have a header row, and the constraints must be comma-separated with one line per molecule. By default, atom tasks share FFN weights and bond tasks share FFN weights, so that the FFN weights benefit from multitask training; the optional argument `--no_shared_atom_bond_ffn` makes the FFN weights of each task independent. The optional argument `--no_adding_bond_types` prevents the bond type of each bond, as determined by RDKit, from being added to the output of bond targets. The optional argument `--weights_ffn_num_layers` sets the number of layers in the FFN that determines the weights used to correct the constrained targets.
+
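
To make the input layout above concrete, here is a minimal sketch that reads a data file and a constraints file of this shape using only the standard library. The two molecules and all numeric values are made up for illustration, and the assumption that the constraints header names the constrained target (with one value per molecule) is mine, not confirmed by the README text. The commit message notes that cells are loaded with `json.loads()` and that the CSV field limit is raised; the helpers `raise_csv_field_limit`, `read_targets` and `read_constraints` are hypothetical names.

```python
import csv
import io
import json
import sys


def raise_csv_field_limit() -> int:
    """Raise the csv field limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:  # some platforms reject sys.maxsize
            limit //= 10


# Hypothetical data file: one 1D atomic target and one 2D bond target.
DATA_CSV = '''smiles,hirshfeld_charges,bond_length_matrix
CO,"[0.05, -0.25]","[[0.0, 1.43], [1.43, 0.0]]"
CCO,"[-0.08, 0.04, -0.26]","[[0.0, 1.52, 2.4], [1.52, 0.0, 1.43], [2.4, 1.43, 0.0]]"
'''

# Hypothetical constraints file: header row, one line per molecule,
# in the same order as the SMILES strings in the data file.
CONSTRAINTS_CSV = '''hirshfeld_charges
0.0
0.0
'''


def read_targets(text: str) -> list:
    """Return (smiles, {target_name: parsed_list}) for every row."""
    raise_csv_field_limit()
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        smiles = row.pop("smiles")
        records.append((smiles, {k: json.loads(v) for k, v in row.items()}))
    return records


def read_constraints(text: str) -> tuple:
    """Return the header and one list of floats per molecule."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [[float(v) for v in line] for line in reader if line]


records = read_targets(DATA_CSV)
header, constraints = read_constraints(CONSTRAINTS_CSV)

if len(constraints) != len(records):
    raise ValueError("constraints file needs one line per molecule")
for smiles, targets in records:
    n_atoms = len(targets["hirshfeld_charges"])
    # A 2D bond target has shape (number_of_atoms x number_of_atoms).
    if any(len(r) != n_atoms for r in targets["bond_length_matrix"]):
        raise ValueError(f"bond matrix shape mismatch for {smiles}")
```

Checking row counts and matrix shapes before training catches ordering mistakes early, since the constraints are matched to molecules purely by position.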
+Please note that the current framework is only available for models trained on multiple atomic and bond properties simultaneously. Training on both atomic/bond and molecular targets is not supported.
+
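
The options in this section can be combined on one command line. The sketch below only assembles and prints such a command; it does not run Chemprop. The file paths are hypothetical, the helper `build_train_command` is an illustrative name, and `--data_path`, `--dataset_type` and `--save_dir` are the general Chemprop training options assumed from the rest of the README.

```python
import shlex


def build_train_command(data_path, save_dir, constraints_path=None,
                        adding_h=False, keeping_atom_map=False,
                        shared_ffn=True):
    """Assemble a chemprop_train call for atomic/bond targets."""
    cmd = [
        "chemprop_train",
        "--data_path", data_path,
        "--dataset_type", "regression",
        "--save_dir", save_dir,
        "--is_atom_bond_targets",  # required for atomic/bond targets
    ]
    if constraints_path is not None:
        cmd += ["--constraints_path", constraints_path]
    if adding_h:
        cmd.append("--adding_h")  # targets include hydrogens
    if keeping_atom_map:
        cmd.append("--keeping_atom_map")  # input is atom-mapped SMILES
    if not shared_ffn:
        cmd.append("--no_shared_atom_bond_ffn")  # independent FFN per task
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_train_command(
    data_path="data/atom_bond_targets.csv",
    save_dir="checkpoints/atom_bond",
    constraints_path="data/atom_bond_constraints.csv",
    adding_h=True,
)
print(shlex.join(cmd))
```

Building the argument list in code keeps optional flags such as `--adding_h` explicit and avoids quoting mistakes when paths contain spaces.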
 ### Pretraining
 
 Pretraining can be carried out using previously trained checkpoint files to set some or all of the initial values of a model for training. Additionally, some model parameters from the previous model can be frozen in place, so that they will not be updated during training.