Output options for make_predictions #235

cjmcgill · 2022-01-19T03:21:41Z

A user noted that invalid SMILES can be passed to the make_predictions function, but the returned values exclude the invalid entries without any indication of which ones were invalid. The values saved to file, by contrast, have always included the entries with invalid SMILES, with prediction values of "Invalid SMILES". Following the refactoring of make_predictions in #200, it is easier than before to call this function directly, so it is now more obvious that the saved behavior and the function return behavior do not match.

I've added two new options to the make_predictions function.

return_invalid_smiles. This option indicates whether to include entries for invalid datapoints in the function return, with the value "Invalid SMILES" in place of predictions. This makes the return consistent with the saved-file behavior. The option defaults to True; the behavior before this option existed corresponds to False.
return_index_dict. This option indicates whether to return the predictions as a dictionary keyed by the index in the initial dataset or SMILES list, rather than as a list of lists of values. The option defaults to False. If return_invalid_smiles is set to False, the indices of invalid entries will be missing from the returned dictionary.
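
To make the interaction between the two options concrete, here is a minimal sketch of how the return value could be shaped. It is not the chemprop implementation: the helper name format_predictions and its signature are assumptions made for illustration, loosely modeled on the full_to_valid_indices mapping used in make_predictions.py.

```python
from typing import Dict, List, Union

Prediction = List[Union[float, str]]


def format_predictions(
    valid_preds: List[List[float]],
    full_to_valid_indices: Dict[int, int],
    num_full: int,
    num_tasks: int,
    return_invalid_smiles: bool = True,
    return_index_dict: bool = False,
) -> Union[List[Prediction], Dict[int, Prediction]]:
    """Hypothetical helper (not part of chemprop) that shapes predictions per the two options."""
    indexed: Dict[int, Prediction] = {}
    for full_index in range(num_full):
        valid_index = full_to_valid_indices.get(full_index)
        if valid_index is not None:
            indexed[full_index] = list(valid_preds[valid_index])
        elif return_invalid_smiles:
            # Placeholder row, one entry per task, mirroring the saved-file behavior.
            indexed[full_index] = ["Invalid SMILES"] * num_tasks
        # Otherwise the invalid entry is skipped entirely.
    if return_index_dict:
        return indexed
    # Keys were inserted in increasing order, so this preserves the input order.
    return [indexed[i] for i in sorted(indexed)]


# Three inputs with one task each; the middle input is invalid.
valid_preds = [[0.1], [0.3]]
full_to_valid = {0: 0, 2: 1}

as_list = format_predictions(valid_preds, full_to_valid, num_full=3, num_tasks=1)
as_dict = format_predictions(
    valid_preds, full_to_valid, num_full=3, num_tasks=1,
    return_invalid_smiles=False, return_index_dict=True,
)
```

With the defaults, the list keeps one row per input, so positions line up with the original SMILES list. With return_invalid_smiles=False and return_index_dict=True, the dictionary keys are what tell the caller which inputs were dropped.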

alongd

Thanks for the PR, I added some minor comments

alongd · 2022-01-22T12:44:25Z

chemprop/train/make_predictions.py

+        full_preds = []
+        for full_index in range(len(full_data)):
+            valid_index = full_to_valid_indices.get(full_index, None)
+            preds = avg_preds[valid_index] if valid_index is not None else ['Invalid SMILES'] * num_tasks


Is it possible to also return the invalid SMILES themselves, or is it not really beneficial?

I'm not sure that it's beneficial. But if people do want it, the right place is in the load_data function, where these are getting split up in the first place. I've added it there as one of the outputs (unused in the full workflow). Do you think that is a reasonable return structure?

@alongd I've actually rethought this, and I don't want to add another output to load_data, where it would break existing code for people. The right place would be a separate data/utils function. I've added new functions for that, which work from both files and lists.
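
As a rough illustration of the "separate utility" approach described above, the sketch below collects invalid entries from either a list or a CSV file without touching the data-loading return values. The function names, signatures, and the toy validity check are all assumptions for this example; the functions actually added in this PR are not shown in this thread. A real check would likely use RDKit, for example `Chem.MolFromSmiles(s) is not None`.

```python
import csv
import io
from typing import Callable, List, TextIO

Validator = Callable[[str], bool]


def invalid_smiles_from_list(smiles: List[List[str]], is_valid: Validator) -> List[List[str]]:
    """Hypothetical helper: return the rows in which any SMILES fails the check."""
    return [row for row in smiles if not all(is_valid(s) for s in row)]


def invalid_smiles_from_file(
    handle: TextIO,
    is_valid: Validator,
    num_smiles_columns: int = 1,
    header: bool = True,
) -> List[List[str]]:
    """Hypothetical helper: same check, reading the leading SMILES columns of a CSV."""
    reader = csv.reader(handle)
    if header:
        next(reader, None)  # Skip the header row if present.
    rows = [row[:num_smiles_columns] for row in reader if row]
    return invalid_smiles_from_list(rows, is_valid)


def toy_is_valid(s: str) -> bool:
    # Stand-in validator for the example only; it is not a real SMILES parser.
    return bool(s) and "?" not in s


from_list = invalid_smiles_from_list([["CCO"], ["C?C"], [""]], toy_is_valid)
from_file = invalid_smiles_from_file(
    io.StringIO("smiles,target\nCCO,1.0\nnot?valid,2.0\n"), toy_is_valid
)
```

Keeping this in a standalone function means existing callers of load_data see no change in its return structure, which is the compatibility concern raised in the comment above.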

alongd · 2022-01-22T12:47:11Z

chemprop/train/make_predictions.py

@@ -259,6 +272,8 @@ def make_predictions(args: PredictArgs, smiles: List[List[str]] = None,
                 loading data and a model and making predictions.
    :param smiles: List of list of SMILES to make predictions on.
    :param model_objects: Tuple of output of load_model function which can be called separately.
+    :param return_invalid_smiles: Whether to return None values for invalid SMILES, otherwise will skip them in returned predictions.


Would it make sense to make the docstring description of return_invalid_smiles identical in both predict_and_save() and make_predictions(), or is it different on purpose?

This was unintentional, I will make them the same.

alongd · 2022-01-22T12:48:22Z

chemprop/train/make_predictions.py

@@ -122,6 +123,7 @@ def predict_and_save(args: PredictArgs, train_args: TrainArgs, test_data: Molecu
    :param full_to_valid_indices: A dictionary dictionary mapping full to valid indices.
    :param models: A list or generator object of :class:`~chemprop.models.MoleculeModel`\ s.
    :param scalers: A list or generator object of :class:`~chemprop.features.scaler.StandardScaler` objects.
+    :param return_invalid_smiles: Whether to include invalid SMILES with a value None in the returned predictions.


I probably misunderstood, but it looks like the description says that a None value is returned, when in practice an 'Invalid SMILES' string is returned.

You are completely right here. This is a mistake in the docstring, and I will fix it.

alongd · 2022-01-22T21:25:52Z

Thanks for the modifications, @cjmcgill!
I agree with your view that invalid SMILES shouldn't be returned by default.
Feel free to squash the "Fix doc strings" commit.
I'm new to this repo, what are the guidelines re adding tests for new functions? Or is it complex to test these functions since an input is required?
Also, a very minor stylistic comment, perhaps put all PEP8 modifications made to make_predictions.py in a separate minor commit?

cjmcgill · 2022-01-22T22:02:58Z

@alongd I refactored the commits so the spacing changes were grouped in a separate commit.

Also, you are right about testing. Thus far, we have not had unit tests in this repo and have relied only on integration tests. I am adding unit tests in the in-progress #232, and tests for these functions will be included in it.

alongd

Looks good, thanks!

cjmcgill requested review from hesther, cbilodeau2, davidegraff, kevingreenman, mliu49 and yunsiechung January 19, 2022 03:21

alongd reviewed Jan 22, 2022

cjmcgill force-pushed the return_index_dict branch from 03aa8c5 to 8b96051 Compare January 22, 2022 20:47

cjmcgill added 2 commits January 22, 2022 16:41

Minor spacing corrections to make_predictions

19faab3

New get_invalid_smiles functions and output options for make_predictions

2dc8594

cjmcgill force-pushed the return_index_dict branch from 8b96051 to 2dc8594 Compare January 22, 2022 21:59

alongd approved these changes Jan 23, 2022

cjmcgill merged commit ec45082 into master Jan 23, 2022

cjmcgill deleted the return_index_dict branch January 23, 2022 16:41

kevingreenman mentioned this pull request Dec 12, 2022

[BUG]: Molecule fingerprinting with invalid SMILES in list #350

Closed

shihchengli mentioned this pull request Dec 13, 2022

Molecule fingerprinting with invalid SMILES in list #351

Merged

2 tasks
